
Sorry guys, I had to edit my example because I didn't express my query properly. Let's say I have the .txt file:

Happy sad
Happy sad
Happy sad
Sad happy
Happy sad
Happy sad
Mad sad
Mad happy
Mad happy

And I want to delete any line that is unique, leaving the file with:

Happy sad
Happy sad
Happy sad
Happy sad
Happy sad
Mad happy
Mad happy

I understand that sort is able to get rid of duplicates (sort file.txt | uniq), so is there any way we can do the opposite in bash using a command? Or would I just need to figure out a while loop for it? BTW, uniq -D file.txt > output.txt doesn't work.

Jerry

4 Answers


Using awk:

$ awk 'seen[$0]++; seen[$0] == 2' file
Happy sad
Happy sad
Happy sad
Happy sad
Happy sad
Mad happy
Mad happy

This uses the text of each line as the key into the associative array seen. The first pattern, seen[$0]++, prints any line that has been seen before: the post-incremented value is zero (false) the first time a line occurs and non-zero (true) on every later occurrence. The second pattern, seen[$0] == 2, prints the line one extra time on its second occurrence; without it, you would miss one occurrence of each duplicated line (the first, which the first pattern skipped).

This is related to awk '!seen[$0]++', which is sometimes used to remove duplicates without sorting (see e.g. How does awk '!a[$0]++' work?).
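For comparison, on the file above that idiom keeps just the first occurrence of each line:

$ awk '!seen[$0]++' file
Happy sad
Sad happy
Mad sad
Mad happy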


To only get one copy of the duplicated lines:

awk 'seen[$0]++ == 1' file

or,

sort file | uniq -d
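
On the example input, either command prints one copy of each duplicated line:

Happy sad
Mad happy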
Kusalananda

If the duplicates may not be contiguous and you need to preserve the input order, you can do it with awk in two passes: one to count the number of occurrences, and one to print the lines that were seen more than once in the first pass:

awk 'second_pass {if (c[$0] > 1) print; next}
     {c[$0]++}' file.txt second_pass=1 file.txt
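
An equivalent single-invocation sketch of the same two-pass idea, using the common FNR == NR test instead of the second_pass variable:

awk 'NR == FNR { c[$0]++; next } c[$0] > 1' file.txt file.txt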

From man uniq:

-D print all duplicate lines

You can achieve your goal like so:

uniq -D file.txt
Panki
    Note that -D is a non-standard GNU extension, though now supported by a few other implementations including ast-open's and FreeBSD's – Stéphane Chazelas Nov 05 '20 at 10:50
    Also note that it only reports duplicate lines that are contiguous. printf '%s\n' a b a | uniq -D will print nothing as the two a lines are not contiguous. It may very well be what the OP wants though. – Stéphane Chazelas Nov 05 '20 at 10:51
  • Note that some versions of uniq (e.g. on macOS) use a lowercase -d – jcaron Nov 08 '20 at 00:34
    This appears to work in the OP's example, but if you delete the last Happy sad (6th line) in his input, it gives you the wrong answer. (I am assuming duplicates should be detected even if they are not contiguous, though the OP did not actually say that, and his example doesn't sufficiently illustrate what he wants in that edge case.) –  Nov 09 '20 at 03:30
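
If non-contiguous duplicates do matter, a workaround (at the cost of the original line order) is to sort first so that the duplicates become adjacent:

sort file.txt | uniq -D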

This is probably a GNU-only invocation, since it uses long option names (uniq's short -u is standard, but --unique is not). You could get around that by using uniq -c and then filtering for ^ *1 etc. if you're running some other flavour; see the sketch below.

sort < in | uniq --unique | grep --invert-match --line-regexp --fixed-strings --file - in

The first two stages will output

Mad sad
Sad happy

and the next stage will remove the lines that exactly match those. I picked the longer options for clarity; I myself rarely use them. The short form would be sort < in | uniq -u | grep -v -x -F -f - in
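
A sketch of the more portable uniq -c workaround mentioned above, assuming the usual uniq -c output format of a space-padded count followed by a single space and the line:

sort < in | uniq -c | sed -n 's/^ *1 //p' | grep -v -x -F -f - in

Here sed prints only the lines whose count is exactly 1, with the count stripped off, and the final grep removes those lines from the original file as before.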