Sort column by duplicates and keep the first occurrence

Question

I have a file that is as follows

1:A
2:B
3:A

I need the output to be:

1:A
2:B

As the third entry's second column contains A, like the first entry's does, the third entry should be removed. The comparison also needs to be case-sensitive.

It's an extremely large file so time saving would be nice.

I have tried this but it seems to only print the unique lines

sort -u -t':' -k3,3 file

With -k3,3 you're telling sort to sort by the 3rd field when the input has 2 fields. — Ed Morton, Sep 05 '21 at 02:20

Greenonline · Accepted Answer · 2021-09-05T05:22:00.373

Using `sort`

As Ed states in his comment, your sort command is sorting on the third field, when in fact you only have two fields (the : is the field separator). So to fix it, change the key from -k3,3 to -k2,2.

However, the original record order of the source file is then lost, because the records are sorted by their key value rather than by the line/record number:

$ sort -u -t':' -k2,2 test.txt 
1:A
2:B
6:C
5:a
4:b
$

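Which is probably not what you want. Nevertheless, this is easily fixed by piping the output through sort again, this time numerically on the first field (a plain sort compares text, so it would place record 10 before record 2 once the file has more than nine records):

$ sort -u -t':' -k2,2 test.txt | sort -n -t':' -k1,1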
1:A
2:B
4:b
5:a
6:C
$
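
If the first field is not an ascending record number, the second sort cannot restore the original order. A minimal sketch of one workaround, assuming GNU sort (which, with -u, outputs the first line of each run of equal keys; POSIX does not promise which line is kept): prefix each line with its line number, deduplicate on what is now the third field, sort numerically back into input order, then strip the prefix. The function name keep_first_by_field2 is made up for illustration.

```shell
# Hypothetical helper: keep the first line for each distinct value of field 2,
# preserving the input order even when field 1 is not a record number.
keep_first_by_field2() {
    awk '{ print NR ":" $0 }' "$1" |          # decorate: x:A becomes 1:x:A
        LC_ALL=C sort -s -u -t':' -k3,3 |     # dedupe on the original field 2 (byte-wise, so case-sensitive)
        LC_ALL=C sort -n -t':' -k1,1 |        # restore input order via the line number
        cut -d':' -f2-                        # undecorate: drop the line number again
}
```

Called as keep_first_by_field2 file, it writes the deduplicated records to standard output.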

Note: As you say that you have a large file, you may want to speed things up with the --parallel flag¹ (a GNU sort option):

sort --parallel=<n> -u -t':' -k2,2 test.txt | sort --parallel=<n> -n -t':' -k1,1

Where <n> is the number of cores that you have available.
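
As a rough sketch of how the pieces could be combined, assuming GNU coreutils: nproc reports the number of available cores, and LC_ALL=C forces byte-wise comparison, which is case-sensitive and usually faster than locale-aware collation. The second stage sorts numerically so that record 10 lands after record 9, which assumes the first field is an ascending record number. The function name dedupe_sorted is hypothetical.

```shell
# Hypothetical wrapper around the two-stage sort (GNU sort and nproc assumed).
dedupe_sorted() {
    threads=$(nproc)   # number of available cores
    LC_ALL=C sort --parallel="$threads" -u -t':' -k2,2 "$1" |
        LC_ALL=C sort --parallel="$threads" -n -t':' -k1,1
}
```

For example, dedupe_sorted test.txt prints the same five records as the two-stage command above.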

Using `awk`

Expanding upon your example file, if the original data is in a file called test.txt, like this:

1:A
2:B
3:A
4:b
5:a
6:C

and, again, treating the : as a field separator, then you could use awk².

For example this line:

awk 'BEGIN{FS=":"}{if (!seen[$2]++)print $0}' test.txt

Gives the following result:

$ awk 'BEGIN{FS=":"}{if (!seen[$2]++)print $0}' test.txt 
1:A
2:B
4:b
5:a
6:C
$

You can see how this works by printing the result of the test for each line:

$ awk 'BEGIN{FS=":"}{print !seen[$2]++}' test.txt 
1
1
0
1
1
1
$

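First, the field separator is specified with FS=":".
Second, seen[$2]++ returns the number of times this second-field value has already been seen (0 the first time) and then increments it, so the negation !seen[$2]++ is "true" only for a second-field value that hasn't yet been seen. Array keys are compared as exact strings, so A and a count as different values, which gives the case sensitivity you asked for.
Finally, the print $0 prints the whole record, i.e. the current line.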
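
To make the counter visible, here is a small illustrative sketch (the function name show_seen_counts is made up) that prints, for every line, how often its second field had been seen before and what the filter would do with the line:

```shell
# Hypothetical debugging helper: show the value of seen[$2] before the increment.
show_seen_counts() {
    awk -F':' '{
        count = seen[$2]++    # post-increment: count holds the old value (0 on first sight)
        printf "%s\tseen before: %d\t%s\n", $0, count, (count ? "skip" : "print")
    }' "$1"
}
```

Run against test.txt, only the 3:A line should report a non-zero count, since A is the only second-field value that repeats.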

Putting this into a shell script³ rather than an awk script gives:

#!/bin/sh
awk -F':' '
  (!seen[$2]++) {
    print $0
  }
' "$1"
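
A possible way to try the script out, sketched under the assumption that it is saved as keep_first.sh (a made-up name) in a scratch directory:

```shell
# Work in a scratch directory so nothing in the current directory is touched
workdir=$(mktemp -d)

# Sample input from above
printf '%s\n' 1:A 2:B 3:A 4:b 5:a 6:C > "$workdir/test.txt"

# The script from above, saved under a hypothetical name
cat > "$workdir/keep_first.sh" <<'EOF'
#!/bin/sh
awk -F':' '
  (!seen[$2]++) {
    print $0
  }
' "$1"
EOF

# Run it through sh; after chmod +x it could also be invoked directly
sh "$workdir/keep_first.sh" "$workdir/test.txt"
```

Because awk reads the file once and never sorts it, the original order is kept for free. The seen array holds one entry per distinct second-field value, so memory use grows with the number of distinct keys rather than with the size of the file.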

References:

¹ This answer to How to sort big files?

² This answer to Keeping unique rows based on information from 2 of three columns

³ This answer to Specify other flags in awk script header
