
I have a file with hundreds of rows:

Chr01:19967945-19972643 HanXRQChr01g0004001 1   4698    4698    0.0 8676    100.000 locus_tag=HanXRQChr01g0004001 gn=HanXRQChr01g0004001 begin=19967815 end=19972682 len=4868 chr=HanXRQChr01 strand=-1 sp=Helianthus annuus def=Probable protein kinase superfamily protein
Chr01:23001231-23011701 HanXRQChr01g0004391 1   10470   10470   0.0 19335   100.000 locus_tag=HanXRQChr01g0004391 gn=HanXRQChr01g0004391 begin=22999643 end=23012645 len=13003 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Putative squalene cyclase; Squalene cyclase, C-terminal; Squalene cyclase, N-terminal
Chr01:23001231-23011701 HanXRQChr01g0004391 5938    6078    141 7.25e-55    220 95.035  locus_tag=HanXRQChr01g0004391 gn=HanXRQChr01g0004391 begin=22999643 end=23012645 len=13003 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Putative squalene cyclase; Squalene cyclase, C-terminal; Squalene cyclase, N-terminal
Chr01:38759426-38779934 HanXRQChr01g0005671 1   20472   20472   0.0 37805   100.000 locus_tag=HanXRQChr01g0005671 gn=SPI begin=38759245 end=38779898 len=20654 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Probable beige/BEACH domain ;WD domain, G-beta repeat protein
Chr01:38759426-38779934 HanXRQChr15g0474141 7163    7204    42  1.96e-08    67.6    95.238  locus_tag=HanXRQChr15g0474141 gn=IQD29 begin=37205639 end=37211555 len=5917 chr=HanXRQChr15 strand=-1 sp=Helianthus annuus def=Probable IQ-domain 29
Chr01:38759426-38779934 HanXRQChr15g0474141 7003    7043    41  7.05e-08    65.8    95.122  locus_tag=HanXRQChr15g0474141 gn=IQD29 begin=37205639 end=37211555 len=5917 chr=HanXRQChr15 strand=-1 sp=Helianthus annuus def=Probable IQ-domain 29

Some of these rows are unique based on the first column, like the first row (Chr01:19967945-19972643), while for others I have multiple rows with the same value in the first column, like Chr01:23001231-23011701.

For each value in the first column, I want to keep only the first row, because the first row contains the best values for the parameters in columns 6, 7, and 8.

My desired output would be

Chr01:19967945-19972643 HanXRQChr01g0004001 1   4698    4698    0.0 8676    100.000 locus_tag=HanXRQChr01g0004001 gn=HanXRQChr01g0004001 begin=19967815 end=19972682 len=4868 chr=HanXRQChr01 strand=-1 sp=Helianthus annuus def=Probable protein kinase superfamily protein
Chr01:23001231-23011701 HanXRQChr01g0004391 1   10470   10470   0.0 19335   100.000 locus_tag=HanXRQChr01g0004391 gn=HanXRQChr01g0004391 begin=22999643 end=23012645 len=13003 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Putative squalene cyclase; Squalene cyclase, C-terminal; Squalene cyclase, N-terminal
Chr01:38759426-38779934 HanXRQChr01g0005671 1   20472   20472   0.0 37805   100.000 locus_tag=HanXRQChr01g0005671 gn=SPI begin=38759245 end=38779898 len=20654 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Probable beige/BEACH domain ;WD domain, G-beta repeat protein
Chris Davies
Anna1364

2 Answers


You can use awk to keep track of the first fields you've already seen:

awk '!seen[$1]++' infile

This uses an associative array, seen, keyed by the first field ($1). Because ++ is the post-increment operator, seen[$1]++ returns the value of seen[$1] before incrementing it: when we encounter a new value, seen[$1]++ returns 0 and !seen[$1]++ is true; once we've seen the value already, seen[$1]++ returns something greater than 0 and !seen[$1]++ is false.

The default operation for when the condition is true is to print the whole line ({ print $0 }), which is what we want here, so we don't have to spell it out.
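To see it in action on a tiny, made-up sample (the file name and data here are just for illustration, not from the question):

```shell
# Illustrative sample: two lines share the key "b".
printf '%s\n' 'a foo' 'b foo' 'b bar' > sample.txt

# Keep only the first line for each value of the first field.
awk '!seen[$1]++' sample.txt
# Output:
# a foo
# b foo
```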

This does the same, in a more verbose but easier to understand way:

awk 'seen[$1] == 0 {
         ++seen[$1]
         print $0
     }' infile
  • Order of precedence: ++ happens first, but it's the post-increment operator, so it returns 0 the first time the key is seen; then the ! operator applies with lower precedence -- the first time, !n is true because n==0; all subsequent times, !n is false because n>0. Reference: https://www.gnu.org/software/gawk/manual/html_node/Precedence.html – glenn jackman Sep 20 '18 at 22:31
  • @glennjackman Is that additional explanation, or pointing out a mistake? It works as I want it to. – Benjamin W. Sep 20 '18 at 22:37
  • @glennjackman Oh, I get it. My explanation is not correct. – Benjamin W. Sep 20 '18 at 22:39
  • ...updated the explanation. – Benjamin W. Sep 20 '18 at 22:44
$ sort -u -s -k1,1 file
Chr01:19967945-19972643 HanXRQChr01g0004001 1   4698    4698    0.0 8676    100.000 locus_tag=HanXRQChr01g0004001 gn=HanXRQChr01g0004001 begin=19967815 end=19972682 len=4868 chr=HanXRQChr01 strand=-1 sp=Helianthus annuus def=Probable protein kinase superfamily protein
Chr01:23001231-23011701 HanXRQChr01g0004391 1   10470   10470   0.0 19335   100.000 locus_tag=HanXRQChr01g0004391 gn=HanXRQChr01g0004391 begin=22999643 end=23012645 len=13003 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Putative squalene cyclase; Squalene cyclase, C-terminal; Squalene cyclase, N-terminal
Chr01:38759426-38779934 HanXRQChr01g0005671 1   20472   20472   0.0 37805   100.000 locus_tag=HanXRQChr01g0005671 gn=SPI begin=38759245 end=38779898 len=20654 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Probable beige/BEACH domain ;WD domain, G-beta repeat protein

This sort command considers only the first whitespace-delimited field as the sorting key and returns the data sorted with duplicate keys removed (for each key, the first line found is kept). The -s tells sort to use a "stable" sorting algorithm, i.e. one that does not change the relative order of records with identical keys (I'm not 100% sure this is needed, but it seems reasonable to use it).
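One difference from the awk approach is worth noting: sort reorders the lines by key, while awk preserves the original input order. A small illustration (file name and data made up for the example):

```shell
# Keys deliberately out of order in the input.
printf '%s\n' 'b foo' 'a foo' 'b bar' > demo.txt

# Sort stably on field 1 only, keeping one line per key.
sort -u -s -k1,1 demo.txt
# Output (sorted by key; the first input line per key is kept):
# a foo
# b foo
```

In the question's data the first column is already in order, so this distinction shouldn't matter there.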

Kusalananda