
I have a file containing two columns and 10 million rows. The first column contains many repeated values, while column 2 holds a distinct value on each row. I want to remove the repeated rows, keeping only the first occurrence of each value in column 1, using awk. Note: the file is sorted on column 1. For example:

1.123 -4.0
2.234 -3.5
2.234 -3.1
2.234 -2.0
4.432 0.0
5.123 +0.2
8.654 +0.5
8.654 +0.8
8.654 +0.9
.
.
.
.

Expected output

1.123 -4.0
2.234 -3.5
4.432 0.0
5.123 +0.2
8.654 +0.5
.
.
.
.
Nilesh

2 Answers


A few ways:

  1. awk

    awk '!a[$1]++' file
    

    This is a very condensed way of writing this:

    awk '{if(! a[$1]){print; a[$1]++}}' file
    

    So, if the current first field ($1) is not in the a array, print the line and add the 1st field to a. Next time we see that field, it will be in the array and so will not be printed.

  2. Perl

    perl -ane '$k{$F[0]}++ or print' file
    

    or

    perl -ane 'print if !$k{$F[0]}++' file
    

    This is basically the same as the awk one. The -n causes perl to read the input file line by line and apply the script provided by -e to each line. The -a will automatically split each line on whitespace and save the resulting fields in the @F array. Finally, the first field is added to the %k hash and if it is not already there, the line is printed. The same thing could be written as

    perl -e 'while(<>){
                @F=split(/\s+/); 
                print unless defined($k{$F[0]}); 
                $k{$F[0]}++;
             }' file
    
  3. Coreutils

    rev file | uniq -f 1 | rev
    

    This method works by first reversing each line of file, so that a line like 12 345 becomes 543 21. We then use uniq -f 1 to skip the first field of the reversed line (the reversed second column, 543 here) and compare lines on what remains: the reversed first column. Using uniq here filters out consecutive duplicate lines, keeping only 1 of each. Lastly, a second rev restores the characters of each line to their original order.

  4. GNU sort (as suggested by @StéphaneChazelas)

    sort -buk1,1
    

    The -b flag ignores leading whitespace and the -u means print only the first of each run of lines with an equal key. The clever bit is the -k1,1. The -k flag sets the field to sort on. It takes the general format of -k POS1[,POS2] which means only look at fields POS1 through POS2 when sorting. So, -k1,1 means only look at the 1st field. Depending on your data, you might want to also add one of these options:

     -g, --general-numeric-sort
          compare according to general numerical value
     -n, --numeric-sort
          compare according to string numerical value
    
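As a quick sanity check, here is a sketch that runs all four approaches on the question's sample data (the temporary file and the echo separators are only for illustration); each command should print the same five deduplicated lines:

```shell
#!/bin/sh
# Build a small sample file like the one in the question.
f=$(mktemp)
cat > "$f" <<'EOF'
1.123 -4.0
2.234 -3.5
2.234 -3.1
2.234 -2.0
4.432 0.0
5.123 +0.2
8.654 +0.5
8.654 +0.8
8.654 +0.9
EOF

# 1. awk: print a line only the first time its column-1 value is seen.
awk '!a[$1]++' "$f"
echo ---
# 2. Perl equivalent: print unless the key was already in the hash.
perl -ane '$k{$F[0]}++ or print' "$f"
echo ---
# 3. rev/uniq/rev: skip the (reversed) last field, dedupe on the rest.
rev "$f" | uniq -f 1 | rev
echo ---
# 4. GNU sort: unique on key field 1 only; input is already sorted.
sort -buk1,1 "$f"

rm -f "$f"
```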
terdon
  • Hmm so that's what rev is used for... Thanks. – lgeorget Oct 08 '14 at 11:04
  • I try the awk solution above and get the message a[: Event not found. – Nilesh Oct 08 '14 at 11:06
  • also after this message when I press up key then the command is changed to awk '$1]++' filename – Nilesh Oct 08 '14 at 11:08
  • @Nilesh did you copy the exact command or did you use double quotes (") instead of single quotes (')? – terdon Oct 08 '14 at 11:10
  • I use exactly same command with single quotes. – Nilesh Oct 08 '14 at 11:15
  • @Nilesh what OS, what shell and what awk are you using? This works exactly as expected (and in fact is a classic awk one liner) on Linux/bash/GNU awk. – terdon Oct 08 '14 at 11:17
  • @terdon second command is working fine (awk '{if(! a[$1]){print; a[$1]++}}' filename). while first one is not (awk '!a[$1]++' filename). I am on SunOS (uname gave me this). how to know about which shell? – Nilesh Oct 08 '14 at 11:20
  • @Nilesh try echo $SHELL or, to be certain, ps -p $$ | tail -n 1 | awk '{print $NF}'. Also give the awk version. In future, please always mention your OS, there are major differences between the GNU tools and their various UNIX equivalents. – terdon Oct 08 '14 at 11:27
  • @Nilesh OK, I just tested and I will guess that you're using csh. If so, you need to escape the ! so use this one: awk '\!a[$1]++' file – terdon Oct 08 '14 at 11:28
  • Great answer. Perhaps -n needs to be added to sort invocation to force numeric sorting. Also, would sort -bukn1,1 be able to take advantage of pre-sorted input? If not , split the input file into chunks using split and then pass to sort -m I guess – iruvar Oct 08 '14 at 15:42
    @1_CR good point, thanks. I added the descriptions of the two relevant options. I don't know the details but some quick testing suggests that sort -u is significantly faster on presorted data, yes. – terdon Oct 08 '14 at 15:47

If the first column is always 5 characters long, you can simply use uniq:

uniq -w 5 file

If not, use awk:

awk '$1!=a{print $0; a=$1}' file

The first one should be faster on a huge file, since uniq only compares a fixed number of leading characters instead of splitting every line into fields.
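To see that the two commands agree on data like the question's (a sketch; the 5-character width passed to -w matches that sample, and -w is a GNU uniq option):

```shell
#!/bin/sh
# Sample with a fixed-width (5-character) first column.
f=$(mktemp)
cat > "$f" <<'EOF'
1.123 -4.0
2.234 -3.5
2.234 -3.1
8.654 +0.5
8.654 +0.8
EOF

# GNU uniq: treat only the first 5 characters of each line as the key.
uniq -w 5 "$f"
echo ---
# Portable awk: print a line only when column 1 differs from the previous one.
awk '$1!=a{print $0; a=$1}' "$f"

rm -f "$f"
```

Both print the three lines with distinct first columns; the awk version works for any column width, while uniq -w needs the key to be a fixed number of characters.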

chaos