
I'm trying to use uniq to find all non-unique lines in a file. By non-unique, I mean any line that is identical to the line immediately before it. I thought that the "-D" option would do this:

-D     print all duplicate lines

But instead of printing only the extra copies, it prints every line of a group whenever there is more than one. I want to print only the second and subsequent copies of a line.
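For instance (input and desired output are my own illustration), if the file contains

a
a
a
b
c
c

I would want to get only

a
a
c

that is, every copy of a line except the first one in each run.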

How can I do this?

Michael

3 Answers


With the GNU or ast-open implementation of uniq:

uniq -D -u < input

(-D itself is non-standard), though note that it's the last duplicate that it removes, not the first (which makes a difference if you also use -i, -w or -f).

Portably, you could always use awk:

awk 'NR > 1 && $0 "" == previous ""; {previous = $0}' < input

(the concatenation with "" being to force a string comparison even if operands look like numbers)
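To see the difference (a made-up example; whether it shows up depends on the awk implementation's numeric-string handling, gawk being one where it does):

$ printf '%s\n' 10 010 010 | awk 'NR > 1 && $0 "" == previous ""; {previous = $0}'
010
$ printf '%s\n' 10 010 010 | awk 'NR > 1 && $0 == previous; {previous = $0}'
010
010

Without the "", 010 on the second line compares numerically equal to 10 and is wrongly reported as a duplicate of the previous line.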

To only compare the first 9 characters (note that -w is also a GNU extension and (currently) works with bytes, not characters, despite what its documentation says):

awk '{current = substr($0, 1, 9)}
     NR > 1 && current == previous
     {previous = current}' < input

(no need for "" concatenation in that case as substr() returns a string).

In a UTF-8 locale, on the output of

printf '%s\n' StéphaneChazelas StéphaneUNIX StéphaneUnix

it gives StéphaneUnix as expected, while uniq -w9 -D -u (with GNU uniq) gives StéphaneChazelas and StéphaneUNIX, as Stéphane is 8 characters but 9 bytes in UTF-8, whilst ast-open uniq gives StéphaneUNIX only (awk skips the first occurrence, uniq removes the last occurrence).
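Spelled out with the awk version (output as described above; assuming an awk such as gawk that counts characters rather than bytes in a UTF-8 locale):

$ printf '%s\n' StéphaneChazelas StéphaneUNIX StéphaneUnix |
    awk '{current = substr($0, 1, 9)}
         NR > 1 && current == previous
         {previous = current}'
StéphaneUnix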

With awk, you can also report all duplicate lines even when they're not adjacent with:

awk 'seen[$0]++' < input

(note that it stores all the unique lines in memory in a hash table though).
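For instance (my own sample input), only the second and later occurrences are printed, adjacent or not:

$ printf '%s\n' a b a c b b | awk 'seen[$0]++'
a
b
b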

Or to consider only the first 9 characters:

awk 'seen[substr($0, 1, 9)]++' < input

You want the lowercase -d option (which, unlike -D, is standard).

# printf "a\na\na\nb\nb\nc\n" | uniq -d
a
b
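Note that -d prints each duplicated line only once per group, so with three identical lines in a row you still get a single line back (my own example), rather than the "second and subsequent copies" asked for:

# printf "a\na\na\n" | uniq -d
a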
davolfman

A solution is to use the -c option of uniq, and then remove what you don't want:

e444$ (   echo a ; echo a ; echo b ; echo d ; echo d ; echo e )  | uniq -c
  2 a
  1 b
  2 d
  1 e 

a and d are duplicated, while b and e appear only once.

e444$ (   echo a ; echo a ; echo b ; echo d ; echo d ; echo e )  | uniq -c  \
              | sed -E '/^ *1 .$/d;s/^ *[0-9]+ //'

Explanation of the sed expression:

/^ *1 .$/d deletes the lines whose count is 1 (the unique lines; the . matches the single-character data of this example)

s/^ *[0-9]+ // removes the leading count
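With the same sample input, the whole pipeline should therefore print:

a
d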

EchoMike444