
I have a file that looks like this:

1:A
2:B
3:A

I need the output to be:

1:A
2:B

Because the third entry's second column contains A, just as the first entry's does, the third entry should be removed. The comparison also needs to be case-sensitive.

It's an extremely large file, so a fast solution would be nice.

I have tried this, but it seems to only print the unique lines:

sort -u -t':' -k3,3 file
Greenonline
jayboy

1 Answer


Using sort

As Ed states in his comment, your sort command is sorting on the third field, when in fact you only have two fields (the : is the field separator). To fix it, change the key from -k3,3 to -k2,2.

However, the original record order of the source file is then lost, because the records are sorted by their key value rather than by line/record number:

$ sort -u -t':' -k2,2 test.txt 
1:A
2:B
6:C
5:a
4:b
$

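This is probably not what you want. Nevertheless, it is easily fixed by piping the output through sort again (if the first field can have more than one digit, use sort -t':' -k1,1n for the second pass, because a plain sort compares the lines as text):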

$ sort -u -t':' -k2,2 test.txt | sort 
1:A
2:B
4:b
5:a
6:C
$
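One caveat with the second pass is that a plain sort compares whole lines as text, so multi-digit record numbers may not come out in numeric order. The following is a minimal sketch, assuming the first field is always a number; the sample data with records 10 and 11 is made up for illustration and is not from the question.

```shell
# Hypothetical sample with two-digit record numbers (not from the question).
tmp=$(mktemp)
printf '%s\n' 1:A 2:B 3:A 10:C 11:B > "$tmp"

# Pass 1: keep one line per distinct value of the second field.
# Pass 2: restore record order by comparing the first field numerically.
result=$(sort -u -t':' -k2,2 "$tmp" | sort -t':' -k1,1n)
printf '%s\n' "$result"

rm -f "$tmp"
```

The -k1,1n key restricts the comparison to the first field and treats it as a number, so record 10 sorts after record 2.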

Note: Since you say that you have a large file, you may want to consider GNU sort's --parallel flag [1] to speed things up:

sort --parallel=<n> -u -t':' -k2,2 test.txt | sort --parallel=<n>

Where <n> is the number of cores that you have available.
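As a rough sketch of how <n> might be filled in, the snippet below takes the core count from nproc. It assumes GNU coreutils (both nproc and --parallel are GNU-specific). Setting LC_ALL=C is an addition of mine, not part of the original command: it makes sort compare raw bytes, which keeps the comparison case-sensitive and is usually faster than locale-aware collation.

```shell
# Sample data matching the expanded test.txt used in this answer.
tmp=$(mktemp)
printf '%s\n' 1:A 2:B 3:A 4:b 5:a 6:C > "$tmp"

# Assumption: GNU coreutils, which provides nproc and sort --parallel.
n=$(nproc)

# LC_ALL=C forces byte-wise comparison (case-sensitive).
# The second pass orders the surviving lines numerically by record number.
result=$(LC_ALL=C sort --parallel="$n" -u -t':' -k2,2 "$tmp" \
    | LC_ALL=C sort --parallel="$n" -t':' -k1,1n)
printf '%s\n' "$result"

rm -f "$tmp"
```

Whether the extra threads help depends on the file size and the available memory, so it is worth timing both variants on the real data.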

Using awk

Expanding upon your example file, if the original data is in a file called test.txt, like this:

1:A
2:B
3:A
4:b
5:a
6:C

and, again, treating the : as a field separator, then you could use awk [2].

For example, this line:

awk 'BEGIN{FS=":"}{if (!seen[$2]++)print $0}' test.txt

Gives the following result:

$ awk 'BEGIN{FS=":"}{if (!seen[$2]++)print $0}' test.txt 
1:A
2:B
4:b
5:a
6:C
$

You can see how this works by printing the value of the test expression for each line:

$ awk 'BEGIN{FS=":"}{print !seen[$2]++}' test.txt 
1
1
0
1
1
1
$
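  • First, the field separator is specified with FS=":".
  • Second, seen[$2]++ returns the number of times the second field's value has been seen so far and then increments it, so it is 0 (false) only on the first occurrence. The negation operator ! therefore gives a "true" result for a second field entry that hasn't yet been seen.
  • Finally, the print $0 prints the whole record, i.e. the current line.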
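The three steps above can also be written out longhand. This is only an illustrative expansion of the same one-liner; the variable name count_before is my own and not part of the original command.

```shell
# Sample data matching test.txt above.
tmp=$(mktemp)
printf '%s\n' 1:A 2:B 3:A 4:b 5:a 6:C > "$tmp"

result=$(awk -F':' '
{
    count_before = seen[$2] + 0   # 0 the first time this key appears
    seen[$2]++                    # record that the key has now been seen
    if (count_before == 0)        # same test as !seen[$2]++
        print $0                  # first occurrence: keep the whole line
}' "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

Because awk array subscripts are compared as exact strings, A and a are different keys, which gives the case-sensitive behaviour the question asks for. The input is read once and the original line order is preserved, so no second sort is needed.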

Putting this into a shell script [3] rather than an awk script gives:

#!/bin/sh

awk -F':' ' (!seen[$2]++) { print $0 } ' "$1"
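A possible way to try the script is sketched below. The file name dedup_by_field2.sh is hypothetical, and the script is written to a temporary directory only so that the example is self-contained.

```shell
# Work in a scratch directory so nothing is left behind.
dir=$(mktemp -d)

# Save the script under a hypothetical name.
cat > "$dir/dedup_by_field2.sh" <<'EOF'
#!/bin/sh

awk -F':' ' (!seen[$2]++) { print $0 } ' "$1"
EOF

# Sample data matching test.txt above.
printf '%s\n' 1:A 2:B 3:A 4:b 5:a 6:C > "$dir/test.txt"

# Run the script with the input file as its first argument.
result=$(sh "$dir/dedup_by_field2.sh" "$dir/test.txt")
printf '%s\n' "$result"

rm -rf "$dir"
```

After a chmod +x, the script could instead be invoked directly as ./dedup_by_field2.sh test.txt.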

References:

[1] This answer to How to sort big files?

[2] This answer to Keeping unique rows based on information from 2 of three columns

[3] This answer to Specify other flags in awk script header

Greenonline