
Okay, so I want to remove duplicate lines, but it's a bit more complicated than that...

I have a file named users.txt; an example of the file is:

 users:email@email.com
 users1:email@email.com

Now, due to a bug in my system, people were able to register with the same email as someone else, so I want to remove lines whose email appears more than once. Example of the issue:

 user:display:email@email.com
 user2:email@email.com
 user3:email@email.com
 user4:email@email.com

Notice how user, user2, user3, and user4 all have the same email. I want to remove user2, user3, and user4 but keep user (or vice versa; whichever line is picked up first) and remove any other lines containing the same email.

so if

 email@email.com is in 20 lines remove 19
 spam@spam.com is in 555 lines remove 554
 and so forth...
yolo

2 Answers


This calls for Awk. Since the field you want to check is the first field of each line, just reference $1.

awk -F: '! ($1 in seen) {print; seen[$1]}' users.txt

You can "golf" this to reduce it considerably:

awk -F: '!a[$1]++' users.txt

The longer form is more or less self-explanatory; you build an associative array using each email address as an index, without bothering to assign a value. Then you can just check if the email address has been "seen" before (i.e., if the associative array has a particular email address as an index already), and print the whole line if not.
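
For instance, run against a small sample file (contents made up, using the field layout from the comments below, with the email in the first field), the longer form keeps only the first line seen for each address:

```shell
# Illustrative sample; the file name matches the question's users.txt
cat > users.txt <<'EOF'
email@email.com:user
email@email.com:user2
email@email.com:user3
spam@spam.com:user4
EOF

awk -F: '! ($1 in seen) {print; seen[$1]}' users.txt
# Prints only the first line for each distinct first field:
#   email@email.com:user
#   spam@spam.com:user4
```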

The shorter form does more or less the same thing, but the terser code takes more explanation.

The postfix ++ operator acts on a variable after the expression is evaluated, so we'll come back to that later.

In Awk, 0 means false and non-zero means true. ! is for negation and reverses the truth value.

Appearing as it does outside of curly brackets, the expression is interpreted as a boolean expression, with an associated action (in curly brackets) to be performed if the expression is true. Since no action is explicitly stated, the default (implicit) action of printing the whole line is used, if the expression evaluates to true (non-zero).

Essentially, this retrieves the value in the associative array a indexed by the email address (first field), creating that value initialized to 0 if it isn't already present; interprets 0 as false and non-zero as true; inverts that truth value, printing the whole line if the result is "truthy"; and then increments the value stored in the associative array at that index.
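
A quick way to watch the idiom work (sample addresses are made up):

```shell
printf '%s\n' 'a@x.com:u1' 'a@x.com:u2' 'b@x.com:u3' |
  awk -F: '!a[$1]++'
# a[$1] is 0 (false) the first time an address appears, so !a[$1] is
# true and the line prints; the ++ then makes later occurrences false.
# Output:
#   a@x.com:u1
#   b@x.com:u3
```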

A common enough Awk idiom, actually, but I wouldn't fault you for using the longer more explicit version. :)

Wildcard
  • there are only 2 columns, I made a mistake on one line, sorry; also, I just tried what you said but it still outputs duplicate user emails :/

    example:

     12345@yahoo.com:7841329
     12345@yahoo.com:asa123456
     12345@yahoo.com:9830023
     12345@yahoo.com:Jdhftgddtdi
     12345@yahoo.com:a12345678
     12345@yahoo.com:892000
     12345@yahoo.com:28071976
     12345@yahoo.com:9716

    – yolo Mar 01 '17 at 23:58
  • notice the first column, 12345@yahoo.com, on those lines – yolo Mar 01 '17 at 23:59
  • @yolo, which column has the email address? Update your question; I can't tell what's going on in your comment. (Your original post shows the email in the last field of each line.) – Wildcard Mar 02 '17 at 00:16
  • emails are in field 1 and the user is in field 2 – yolo Mar 02 '17 at 00:16
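
(Editorially, one plausible reason the command appeared not to work on the sample above is the leading whitespace on some lines, which becomes part of $1 and makes " 12345@yahoo.com" differ from "12345@yahoo.com". A sketch that trims leading whitespace first; note that sub() on $0 also re-splits the fields:)

```shell
# Sample input with a stray leading space on the first line
printf '%s\n' ' 12345@yahoo.com:7841329' '12345@yahoo.com:9716' > users.txt

awk -F: '{sub(/^[[:space:]]+/, "")} !a[$1]++' users.txt
# Strips leading whitespace from each line, then keeps only the first
# line seen for each (now clean) first field:
#   12345@yahoo.com:7841329
```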
  1. Use GNU datamash to group input by the 2nd field, and keep only the first line of each grouping:

    datamash -t':' -g 2 rmdup 2 < users.txt
    
  2. As a comment from don_crissti notes, sort can do it, but while it returns the desired results, it may also reorder the output:

    sort -t':' -k 2,2 -u users.txt
    

The above code assumes users.txt is sorted by the 2nd field and then by the first.
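
A small illustration of the reordering caveat (addresses made up; sort -u keeps one line per key but emits them in key order, not input order):

```shell
printf '%s\n' 'zed:b@x.com' 'amy:a@x.com' 'bob:b@x.com' |
  sort -t':' -k 2,2 -u
# One line per email remains, but amy:a@x.com now sorts to the top,
# so the original line order is not preserved.
```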

agc
  • datamash will also "reorder" the output (because you have to sort first) unless your input is already sorted; sure, this is irrelevant as you can always restore the order regardless of the tool used to process the file – don_crissti Mar 02 '17 at 17:59
  • @don_crissti, If so, is that still the case with the rmdup version, and if so could you give an example input? I've used sed -n 's/:e/:ae/p' users.txt | sponge -a users.txt to add a little variety to the input, and then tried cat <(sed -n 's/:e/:o/p' foo) foo | sponge foo, and find that datamash keeps the order, but sort does not. – agc Mar 02 '17 at 18:12
  • I don't understand your comment. Doesn't datamash require sorted input? So, unless the input is already sorted, you will have to sort it, and as a result the lines in the output may be reordered. I don't know what kind of example I should post... Maybe rmdup just works without the need to sort, but in that case they should update the man page and specify that rmdup doesn't require sorted input. – don_crissti Mar 02 '17 at 18:21
  • I don't have access to datamash right now, so my comment above may not make sense, but as far as I can remember the man page was pretty clear that datamash expects the input to be sorted. If it works for you without the need to sort lines, it means some ops can and will work without having to sort the lines first. – don_crissti Mar 02 '17 at 18:28
  • @don_crissti, Thanks, for the clarifications. Agreed that man datamash is insufficient, the web page manual is better, but even that... having to look under -g to learn why printf 'a\t1\nb\t1\nc\t2\nd\t1\ne\t2\n' | datamash rmdup 2 only prints the a and c lines is just poor documentation. But the functions are still handy... – agc Mar 03 '17 at 01:44