
Suppose you have a file like this:

    NW_006521251.1  428 84134
    NW_006521251.1  511 84135
    NW_006521038.1  202 84155
    NW_006521038.1  1743 84153
    NW_006521038.1  1743 84154
    NW_006520495.1  198 84159
    NW_006520086.1  473 84178
    NW_006520086.1  511 84180

I want to keep the unique rows based on columns 1 and 2 together (i.e. not just column 2, since that number may repeat under a different label in column 1).

The desired output is the following (the second repeat of NW_006521038.1 1743 is removed):

    NW_006521251.1  428 84134
    NW_006521251.1  511 84135
    NW_006521038.1  202 84155
    NW_006521038.1  1743 84153
    NW_006520495.1  198 84159
    NW_006520086.1  473 84178
    NW_006520086.1  511 84180

Is there a way to do this with awk? Running uniq on the file doesn't work, since the repeated rows differ in the third column.

Age87

2 Answers


There is a "famous" awk idiom for exactly this. You want to do:

awk '!seen[$1,$2]++' file

That creates an associative array "seen" with the 2 columns as the key. Use the post-increment operator so that, for the first time you encounter that key, the value is zero. The use the negation operator for a "true" result the first time you see the key.
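
If the one-liner looks cryptic, here is a longer form that (as far as I can tell) does the same thing, with the implicit pieces spelled out:

    awk '{
        if (!seen[$1, $2]) {   # value is still 0 the first time this pair of columns appears
            print              # print the whole input line
        }
        seen[$1, $2]++         # remember that this pair has been seen
    }' file

The one-liner simply folds the test and the increment into a single expression and relies on awk's default action of printing the line when the condition is true.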

glenn jackman

If you don't mind that the output is sorted:

    sort -u -k1,2 file
  • -u - unique
  • -k1,2 - use fields 1 and 2 together as the key
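
As a quick sanity check, assuming the sample above is saved as file, both commands should leave seven rows. The awk version keeps the rows in their original order, while sort reorders them by the key, and which of the two duplicate NW_006521038.1 1743 rows survives sort -u is not guaranteed to be the first one from the input:

    awk '!seen[$1,$2]++' file | wc -l    # counts the rows the awk approach keeps; should print 7
    sort -u -k1,2 file | wc -l           # should also print 7; without wc -l the rows come out sorted by fields 1-2

If you want the second field compared numerically rather than lexically, you can split the key, e.g. sort -u -k1,1 -k2,2n file.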