Using the first N characters when checking for duplicates

Question

I have a set of data in a file:

AAAPOL.0001  
AAAPOL.0002  
AAAPRO.0001  
AAAPRO.0002  
AAAPRO.0003  
AAAPRO.0004  
AAAXEL.0002  
AAAJOK.1111  
AAAJOK.2222

I only need the first occurrence of each entry, where two lines count as duplicates if their first 6 characters match. How can I check for duplicates/uniqueness using only the first 6 characters of each line?

The command should return this from the data above:

AAAPOL.0001   
AAAPRO.0001   
AAAXEL.0002   
AAAJOK.1111

I do not have access to the uniq -w option.

John1024 · Answer 1 · 2016-10-13T16:12:11.620

Using awk

In your examples, the first six characters are followed by a period. If that is always true, then:

$ awk -F. '!c[$1]++' File
AAAPOL.0001
AAAPRO.0001
AAAXEL.0002
AAAJOK.1111

This works by using . as a field separator and keeping track of the number of times that the first field has appeared already.

If that is not the case, then:

$ awk '!c[substr($0, 1, 6)]++' File
AAAPOL.0001
AAAPRO.0001
AAAXEL.0002
AAAJOK.1111

substr($0, 1, 6) is the first six characters of the line. Associative array c keeps track of the number of times that we have seen those first six characters. Thus, if c[substr($0, 1, 6)] is non-zero, we have already seen those characters and the line should not be printed. In awk, non-zero means true. So, we invert the test with !: this means that !c[substr($0, 1, 6)] is true if those six characters have not been seen before. The trailing ++ updates the count in c before we read the next line.
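
To make the steps above easier to follow, here is a longer spelling of the same idea, with the prefix length passed in as a parameter. This is only a sketch: the function name first_by_prefix and the array name seen are made up for illustration, and the sample data is copied from the question.

```shell
# Sample data from the question, held in a variable so the sketch is self-contained.
data='AAAPOL.0001
AAAPOL.0002
AAAPRO.0001
AAAPRO.0002
AAAPRO.0003
AAAPRO.0004
AAAXEL.0002
AAAJOK.1111
AAAJOK.2222'

# first_by_prefix N: print the first line seen for each distinct N-character prefix.
# (Hypothetical helper name; reads standard input.)
first_by_prefix() {
    awk -v n="$1" '
    {
        key = substr($0, 1, n)   # first n characters of the line
        if (seen[key] == 0)      # an unset array entry compares equal to 0
            print                # first time this prefix appears
        seen[key]++              # record that the prefix has been seen
    }'
}

result=$(printf '%s\n' "$data" | first_by_prefix 6)
printf '%s\n' "$result"
```

With the sample data this should print the same four lines as the one-liner. Passing a different number, such as first_by_prefix 3, compares only that many leading characters.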

Using uniq

For reference, those who, unlike the OP, have access to a version of uniq with the -w option can use:

$ uniq -w6 File
AAAPOL.0001
AAAPRO.0001
AAAXEL.0002
AAAJOK.1111
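
Note that uniq only compares adjacent lines, so uniq -w6 gives the result above because the sample file is already grouped by prefix. For those without -w, the adjacent-only behaviour can be roughly imitated in awk. The sketch below uses a hypothetical helper name (adjacent_uniq) and a made-up ungrouped input to contrast it with the array approach, which looks at the whole file.

```shell
# adjacent_uniq N: rough stand-in for "uniq -wN" (hypothetical helper name).
# Prints a line only when its first N characters differ from the previous line's.
adjacent_uniq() {
    awk -v n="$1" '
    {
        key = substr($0, 1, n)
        if (NR == 1 || key != prev)
            print
        prev = key
    }'
}

# Made-up input in which the two AAAPOL lines are not next to each other.
ungrouped='AAAPOL.0001
AAAPRO.0001
AAAPOL.0002'

adjacent=$(printf '%s\n' "$ungrouped" | adjacent_uniq 6)
global=$(printf '%s\n' "$ungrouped" | awk '!c[substr($0, 1, 6)]++')

printf 'adjacent-only:\n%s\n' "$adjacent"   # keeps all three lines
printf 'whole-file:\n%s\n' "$global"        # drops AAAPOL.0002
```

If the input is not guaranteed to be grouped, the array-based awk command is the safer choice.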

score 1 · Answer 2 · answered Oct 13 '16 at 08:50

If you don't mind the order of the lines changing, you can use sort -u with the sort key set to those first 6 characters:

sort -u -k 1,1.6

Or to the part before the .:

sort -t . -u -k 1,1
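
To illustrate the reordering, here is a small sketch run on the question's data. It prints only the 6-character prefixes, because which of the duplicate lines survives sort -u may depend on the sort implementation. LC_ALL=C is my addition, used only to pin the collation order.

```shell
# Sample data from the question.
data='AAAPOL.0001
AAAPOL.0002
AAAPRO.0001
AAAPRO.0002
AAAPRO.0003
AAAPRO.0004
AAAXEL.0002
AAAJOK.1111
AAAJOK.2222'

# Keep one line per 6-character key, then show just the keys.
# LC_ALL=C makes the ordering independent of the current locale.
sorted_keys=$(printf '%s\n' "$data" | LC_ALL=C sort -u -k 1,1.6 | cut -c 1-6)
printf '%s\n' "$sorted_keys"
```

There is one line per prefix, but AAAJOK now comes first instead of last, whereas the awk approach keeps the original order.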

Stéphane Chazelas
