Using sort
As Ed states in his comment, your sort command is sorting on the third field, when in fact you only have two fields (the : is the field separator). So to fix it, replace 3 with 2 for the key.
However, the original record order of the source file is then lost, because the records are sorted by their key value rather than by line/record number:
$ sort -u -t':' -k2,2 test.txt
1:A
2:B
6:C
5:a
4:b
$
This is probably not what you want. However, it is easily fixed by piping the output through sort again:
$ sort -u -t':' -k2,2 test.txt | sort
1:A
2:B
4:b
5:a
6:C
$
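One caveat with the second sort: a plain sort compares the lines as text, so once the record numbers reach two digits, 10 would be placed before 2. A minimal sketch of a numeric variant, assuming the record number is always the first field (sample.txt and its contents are made up for illustration):

```shell
# Hypothetical sample with multi-digit record numbers
printf '%s\n' 2:B 10:A 11:A 9:C > sample.txt

# De-duplicate on field 2, then restore record order by sorting
# field 1 numerically (-n) instead of as text
result=$(sort -u -t':' -k2,2 sample.txt | sort -t':' -k1,1n)
printf '%s\n' "$result"
```

Which of the two A records survives is up to sort -u, so only the ordering is the point here.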
Note: As you say that you have a large file, you may want to speed things up with the --parallel flag of GNU sort [1]:
sort --parallel=<n> -u -t':' -k2,2 test.txt | sort --parallel=<n>
where <n> is the number of cores that you have available.
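If you don't want to hard-code the core count, one option is to ask nproc for it. This is only a sketch and assumes GNU coreutils (neither --parallel nor nproc is part of POSIX); the sample file is recreated here just to keep the snippet self-contained, and deduped.txt is an arbitrary output name:

```shell
# Recreate the sample file used in this answer
printf '%s\n' 1:A 2:B 3:A 4:b 5:a 6:C > test.txt

# nproc (GNU coreutils) prints the number of available processing units
n=$(nproc)

# Same pipeline as above, with the thread count filled in automatically
sort --parallel="$n" -u -t':' -k2,2 test.txt | sort --parallel="$n" > deduped.txt
cat deduped.txt
```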
Using awk
Expanding upon your example file, if the original data is in a file called test.txt, like this:
1:A
2:B
3:A
4:b
5:a
6:C
and, again, treating the : as a field separator, then you could use awk [2]. For example, this line:
awk 'BEGIN{FS=":"}{if (!seen[$2]++)print $0}' test.txt
gives the following result:
$ awk 'BEGIN{FS=":"}{if (!seen[$2]++)print $0}' test.txt
1:A
2:B
4:b
5:a
6:C
$
You can see how this works by printing the value of the condition for each record (1 means the record would be printed, 0 means it is a duplicate):
$ awk 'BEGIN{FS=":"}{print !seen[$2]++}' test.txt
1
1
0
1
1
1
$
- First, the field separator is specified with FS=":".
- Second, seen[$2]++ returns how many times the second field has been seen so far and then increments that count, so the negation operator ! gives a "true" result only for a second field entry that hasn't yet been seen.
- Finally, print $0 prints the whole record, i.e. the current line.
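To make the idiom less cryptic, the same logic can be spelled out step by step. This is just an equivalent long-hand sketch, not a better way to write it; the variable name count is my own choice:

```shell
# Recreate the sample file used in this answer
printf '%s\n' 1:A 2:B 3:A 4:b 5:a 6:C > test.txt

deduped=$(awk -F':' '{
    count = seen[$2]        # how often this key appeared before (0 if new)
    seen[$2] = count + 1    # remember that we have now seen it once more
    if (count == 0)         # first occurrence only
        print $0
}' test.txt)
printf '%s\n' "$deduped"
```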
Putting this into a shell script [3] rather than an awk script gives:
#!/bin/sh
awk -F':' '
    (!seen[$2]++) {
        print $0
    }
' "$1"
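If the key isn't always the second field, the same idea can be parameterised. A small sketch, where dedupe_by_field is a hypothetical helper name and the field number is handed to awk with -v:

```shell
# dedupe_by_field FIELD [FILE...]
# Keep the first record for each distinct value of colon-separated field FIELD.
# With no FILE arguments, awk reads standard input.
dedupe_by_field() {
    field=$1
    shift
    awk -F':' -v f="$field" '!seen[$f]++' "$@"
}

# Example: de-duplicate the sample data on its second field
printf '%s\n' 1:A 2:B 3:A 4:b 5:a 6:C | dedupe_by_field 2
```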
References:
[1] This answer to "How to sort big files?"
[2] This answer to "Keeping unique rows based on information from 2 of three columns"
[3] This answer to "Specify other flags in awk script header"
-k3,3 you're telling sort to sort by the 3rd field when the input has 2 fields. – Ed Morton Sep 05 '21 at 02:20