46

I have large 3-column files (~10,000 lines) and I would like to remove lines when the contents of the third column of that line appear in the third column of another line. The files' sizes make sort a bit cumbersome, and I can't use something like the code below because the entire lines aren't identical; only the contents of column 3 are.

awk '!seen[$0]++' filename
αғsнιη
  • 41,407
Zach C
  • 469

2 Answers

51

Just change the key in your awk command from the whole line ($0) to the column you want to deduplicate on (in your case, the third column):

awk '!seen[$3]++' filename

This command tells awk which lines to print. The variable $3 holds the entire contents of column 3, and the square brackets are array access. For each line of filename, awk looks up the element of the array named seen that is keyed by that line's third column, prints the line if that element was not (!) previously set, and then increments the element. As a result, the first line for each distinct third-column value is kept and any later line with the same value is dropped.
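To see this on a small input, here is a minimal sketch. The sample lines and the variable name result are made up for illustration; only the awk command comes from the answer.

```shell
# Hypothetical sample data: three space-separated columns.
# Column 3 repeats: x appears on lines 1 and 3, y on lines 2 and 5.
result=$(printf '%s\n' 'a 1 x' 'b 2 y' 'c 3 x' 'd 4 z' 'e 5 y' |
  awk '!seen[$3]++')

# Only the first line for each distinct column-3 value survives,
# and the input order is preserved.
printf '%s\n' "$result"
# a 1 x
# b 2 y
# d 4 z
```

The lines 'c 3 x' and 'e 5 y' are dropped because x and y were already recorded in seen by the time awk reached them.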

The above works if the columns in your input file are delimited by spaces or tabs. If the delimiter is something else, you need to tell awk with its -F option. For example, if the columns are delimited by commas (,) and you want to remove lines based on the third column, use the following command:

awk -F',' '!seen[$3]++' filename
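A short sketch of why the -F option matters, using invented comma-separated data (the variable names csv, with_sep and without_sep are hypothetical):

```shell
# Hypothetical comma-separated sample; column 3 is x, y, x.
csv=$(printf '%s\n' 'a,1,x' 'b,2,y' 'c,3,x')

# With -F',' awk splits on commas, so $3 is the third column.
with_sep=$(printf '%s\n' "$csv" | awk -F',' '!seen[$3]++')

# Without -F, awk splits on whitespace: each line is a single field,
# $3 is empty on every line, and only the first line is printed.
without_sep=$(printf '%s\n' "$csv" | awk '!seen[$3]++')

printf 'with -F:\n%s\n' "$with_sep"
printf 'without -F:\n%s\n' "$without_sep"
```

With -F',' the lines a,1,x and b,2,y are kept; without it, only a,1,x is printed, which is probably not what you want.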
αғsнιη
  • 41,407
31

The sort command is already optimized to handle huge files, so you could very well use sort on your file:

sort -u -t' ' -k3,3 file
  • -u - print only one line for each distinct sort key (here, the third field).
  • -t - specify the field delimiter. In this example, I just use a space as the delimiter.
  • -k3,3 - sort on the 3rd field only.

You could refer to this answer, which suggests that GNU sort is in fact the better approach for sorting big files. In your case, I think you could get your end result without much delay even without --parallel.
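A minimal sketch of how -u interacts with the key, on invented data. It assumes single-space-separated columns and relies on sort's default blank separation rather than -t; LC_ALL=C is set only to make the ordering predictable, and the variable names are hypothetical.

```shell
# Hypothetical sample; column 3 in input order is y, x, x, z.
keys=$(printf '%s\n' 'b 2 y' 'a 1 x' 'c 3 x' 'd 4 z' |
  LC_ALL=C sort -u -k3,3 | awk '{print $3}')

# One line per distinct column-3 value, now ordered by that column
# rather than by position in the input.
printf '%s\n' "$keys"
# x
# y
# z

# -k1,1 restricts the key to field 1, so both lines share the key "a".
n_exact=$(printf 'a b\na c\n' | LC_ALL=C sort -u -k1,1 | awk 'END {print NR}')

# -k1 uses field 1 through the end of the line, so the keys differ.
n_open=$(printf 'a b\na c\n' | LC_ALL=C sort -u -k1 | awk 'END {print NR}')

printf '%s %s\n' "$n_exact" "$n_open"
# 1 2
```

Note that the output is reordered by the key, so this approach fits best when the original line order does not matter.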

Ramesh
  • 39,297
  • 3
    Was about to comment snarkily that -u would only remove duplicate lines, not duplicate keys... but I'm wrong. – Randoms Feb 15 '18 at 04:40
  • @Ramesh it does the job, but sorting changes the order of the lines, which I guess is not always what is expected. – Bharat Apr 26 '18 at 07:28
  • @Ramesh Why do we need -k3,3 instead of just -k3? – Kathmandude Apr 04 '20 at 13:07
  • @PhilCoulson - -k3 would sort using the string starting with the 3rd key rather than sort ONLY using the 3rd key. Try printf 'a b\na c\n' | sort -k1,1 -u vs printf 'a b\na c\n' | sort -k1 -u. – Ed Morton Sep 05 '21 at 11:51
  • @Bharat GNU sort has a -s option to handle that. – Ed Morton Sep 05 '21 at 11:52
  • @Ramesh there's no indication in the question that using -t' ' would be useful or even the right thing to do, it'd break if the input was tab-separated for example. – Ed Morton Sep 05 '21 at 11:53