
I've got a massive file that looks similar to this:

H2,3,5,9,ef,ty,i;
H2,7,5,6,rt,hg,j;
T2,5,5,0,207,3.7,00,...,2023:46:18:14:31,76;
T2,5,5,0,207,3.5,00,...,2023:46:18:14:31,76;
T2,5,5,0,119,3.5,00,...,2023:46:18:14:32,10;
T2,5,5,0,207,3.5,00,...,2023:46:18:14:32,15;
T2,5,5,0,186,3.4,00,...,2023:46:18:14:32,16;
T2,5,5,0,207,4.6,00,...,2023:46:18:14:32,30;
....

I need to get rid of lines that:

  1. Start with T2,5,5,0,207
  2. Have a repeating time mark in field 15

and leave all other lines untouched.

I tried the following in different combinations, but nothing I checked has worked so far:

awk -F ',' ' x!=$15 { if ($1 == T2 && $5 == 207) {x=$15; print$0} else print$0} ' test > test1

I would really appreciate any advice!! Thanks

αғsнιη
  • There's a few different things you might mean by that description. Make sure to provide concise, testable sample input and expected output that demonstrate your needs. – Ed Morton Feb 18 '23 at 12:07
  • If the lines you want to remove are sequential and the entire line is duplicate, you might want to use uniq instead. – user10489 Feb 18 '23 at 13:48
  • Should the removed records have T2,5,5,0,207 specifically as their first fields, or should any record with non-unique first five fields be deleted? Is the second condition dependent or independent of the first condition? I.e., should all records with duplicated timestamps be deleted, or only those that also have duplicated first five fields? – Kusalananda Feb 18 '23 at 13:53
  • Was 2023:46:18:14:31 or 76; the "repeating time mark in field 15"? You omitted a whole load of fields so it was impossible to count. (You could have referenced field 9 or maybe field 10 so that the description matched the data.) Next time you have a question please try to ensure you have example data that can be used for testing – Chris Davies Feb 18 '23 at 18:17

3 Answers


Try this:

$ awk -F, '!seen[$1,$2,$3,$4,$5,$8]++' file

Output

H2,3,5,9,ef,ty,i;
H2,7,5,6,rt,hg,j;
T2,5,5,0,207,3.7,00,...,2023:46:18:14:31,76;
T2,5,5,0,119,3.5,00,...,2023:46:18:14:32,10;
T2,5,5,0,186,3.4,00,...,2023:46:18:14:32,16;

Explanations

  • when a condition is true and no action block is given, awk's default action is to print the current line, which is why no explicit print is needed here
  • !seen[x]++ is a shorthand idiom for a uniq-style operation: it is true only the first time a given key is encountered, and false on every repeat
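To see the idiom in isolation, here is a minimal sketch on made-up one-field input (the letters are placeholders, not the question's data):

```shell
# !seen[$0]++ keeps only the first occurrence of each distinct line:
# seen[$0] is 0 (false) the first time, so !0 is true and the line prints;
# the post-increment makes every later occurrence evaluate to true, so !true skips it.
printf '%s\n' a b a c b | awk '!seen[$0]++'
# prints:
# a
# b
# c
```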

Portability

Works with:

  • gawk
  • mawk
  • busybox awk
  • nawk (the default FreeBSD awk)

And, per Ed Morton's comment, all other awk implementations as well.

Original snippet for the records:

awk -F, '
     ($1=="T2" && $2==5 && $3==5 && $4==0 && $5==207 && !seen[$8]++) ||
    !($1=="T2" && $2==5 && $3==5 && $4==0 && $5==207)
' file
  • It worked, excellent! Did not manage to decrypt the code yet but I will for sure. Many thanks – user531977 Feb 18 '23 at 12:55
  • Isn't the concatenation of field values into a single key potentially ambiguous? Compare printf '%s\n' 1,2,34,5 1,23,4,5 | awk -F, '!seen[$1$2$3$4]++' and printf '%s\n' 1,2,34,5 1,23,4,5 | awk -F, '!seen[$1,$2,$3,$4]++' for example – steeldriver Feb 18 '23 at 18:37
  • also, the question mentions lines starting with T2,5,5,0,207, (representing fields #1~#5), not all the other lines, which your answer wrongly treats the same way. – αғsнιη Feb 18 '23 at 18:44
  • Thanks @steeldriver, added , between fields. I added $8 to remove duplicate dates. – Gilles Quénot Feb 18 '23 at 18:57
  • @GillesQuénot TBH I'm not sure how portable the a[$1,$2,$3,$4] syntax is - it's supported explicitly in GAWK (providing pseudo multidimensional array capability) and seems to work in MAWK though – steeldriver Feb 18 '23 at 19:02
  • Added Portability section – Gilles Quénot Feb 18 '23 at 19:49
  • @steeldriver that pseudo-multi-dimensional array syntax a[$1,$2,$3,$4] is portable to all awks, it's the alternative arrays of arrays syntax a[$1][$2][$3][$4] that's non-portable and requires GNU awk (and possibly some other modern awks that share gawk's code and features now, e.g. mawk 2, but it's not part of POSIX). – Ed Morton Feb 19 '23 at 01:26
  • @EdMorton thanks that's good to know – steeldriver Feb 19 '23 at 16:00
  • $ awk -F, '!seen[$1,$2,$3,$4,$5,$8]++' did not work correctly since the file is more complicated than what I posted earlier and quite a few lines satisfy the mentioned conditions, so I have to make sure that at least $1 and $5 match the initial strings "T2" and 207 respectively. – user531977 Feb 20 '23 at 07:19
  • Feel free to adapt it – Gilles Quénot Feb 20 '23 at 07:31

Does this do what you want?

$ awk -F, '/^T2,5,5,0,207,/ && seen[$15]++{ next }1' infile
H2,3,5,9,ef,ty,i;
H2,7,5,6,rt,hg,j;
T2,5,5,0,207,3.7,00,...,2023:46:18:14:31,76;
T2,5,5,0,119,3.5,00,...,2023:46:18:14:32,10;
T2,5,5,0,207,3.5,00,...,2023:46:18:14:32,15;
T2,5,5,0,186,3.4,00,...,2023:46:18:14:32,16;
....

This prints the first line seen with T2,5,5,0,207 in its first five fields, whatever the timestamp in its field #15; it then skips subsequent lines that have both the same five starting fields and the same timestamp as previously seen. All other lines are printed unconditionally.
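The same pattern/skip structure can be sketched on made-up three-field records (the prefix and field number here are illustrative, not the question's):

```shell
# /^T2,/ && seen[$3]++ { next }  skips a line only when it matches the
# prefix AND its key field has been seen before; the trailing 1 is an
# always-true pattern whose default action prints every remaining line.
printf '%s\n' 'T2,x,1' 'T2,x,1' 'H2,x,1' 'T2,y,1' |
  awk -F, '/^T2,/ && seen[$3]++{ next }1'
# prints:
# T2,x,1
# H2,x,1
```

Note that the H2 line is printed without touching the seen array, so only matching lines consume (and duplicate-check) the key.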

αғsнιη

I realized that I need to get rid of duplicates by fields:

awk -F, '!(/^T2,5,5,0,207/ && A[$(NF-1)]++)' file
H2,3,5,9,ef,ty,i;
H2,7,5,6,rt,hg,j;
T2,5,5,0,207,3.7,00,...,2023:46:18:14:31,76;
T2,5,5,0,119,3.5,00,...,2023:46:18:14:32,10;
T2,5,5,0,207,3.5,00,...,2023:46:18:14:32,15;
T2,5,5,0,186,3.4,00,...,2023:46:18:14:32,16;

A[$(NF-1)] keys on the next-to-last field; in this file that is A[$9]
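A quick illustration of the $(NF-1) indexing used above, on a hypothetical four-field record:

```shell
# NF is the number of fields on the current line, so $(NF-1) is always
# the second-to-last field, regardless of how many fields each line has.
echo 'a,b,c,d' | awk -F, '{ print $(NF-1) }'
# prints: c
```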

nezabudka