
I'm trying to use uniq to find all non-unique lines in a file. By non-unique, I mean any line that is identical to the line immediately before it. I thought that the "-D" option would do this:

-D     print all duplicate lines

But instead of printing only the extra copies, it prints every line of a group whenever there is more than one. I want to print only the second and subsequent copies of a line.
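For instance (input and desired output are my own illustration), if the file contains

a
a
a
b
c
c

I would want to get only

a
a
c

that is, every copy of a line except the first one in each run.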

How can I do this?

Michael

3 Answers


With the GNU or ast-open implementation of uniq:

uniq -D -u < input

(-D itself is non-standard), though note that it's the last duplicate that it removes, not the first (which makes a difference if you also use -i, -w or -f).

Portably, you could always use awk:

awk 'NR > 1 && $0 "" == previous ""; {previous = $0}' < input

(the concatenation with "" being to force a string comparison even if operands look like numbers)
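To see the difference (a made-up example; whether it shows up depends on the awk implementation's numeric-string handling, gawk being one where it does):

$ printf '%s\n' 10 010 010 | awk 'NR > 1 && $0 "" == previous ""; {previous = $0}'
010
$ printf '%s\n' 10 010 010 | awk 'NR > 1 && $0 == previous; {previous = $0}'
010
010

Without the "", 010 on the second line compares numerically equal to 10 and is wrongly reported as a duplicate of the previous line.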

To only compare the first 9 characters (note that -w is also a GNU extension and (currently) works with bytes, not characters, despite what its documentation says):

awk '{current = substr($0, 1, 9)}
     NR > 1 && current == previous
     {previous = current}' < input

(no need for "" concatenation in that case as substr() returns a string).

In a UTF-8 locale, on the output of

printf '%s\n' StéphaneChazelas StéphaneUNIX StéphaneUnix

it gives StéphaneUnix as expected, while uniq -w9 -D -u (with GNU uniq) gives StéphaneChazelas and StéphaneUNIX, as Stéphane is 8 characters but 9 bytes in UTF-8, whilst ast-open uniq gives StéphaneUNIX only (awk skips the first occurrence, uniq removes the last occurrence).
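Spelled out with the awk version (output as described above; assuming an awk such as gawk that counts characters rather than bytes in a UTF-8 locale):

$ printf '%s\n' StéphaneChazelas StéphaneUNIX StéphaneUnix |
    awk '{current = substr($0, 1, 9)}
         NR > 1 && current == previous
         {previous = current}'
StéphaneUnix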

With awk, you can also report all duplicate lines even when they're not adjacent with:

awk 'seen[$0]++' < input

(note that it stores all the unique lines in memory in a hash table though).
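For instance (my own sample input), only the second and later occurrences are printed, adjacent or not:

$ printf '%s\n' a b a c b b | awk 'seen[$0]++'
a
b
b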

Or to consider only the first 9 characters:

awk 'seen[substr($0, 1, 9)]++' < input

You want the lowercase -d option (which, unlike -D, is standard).

# printf "a\na\na\nb\nb\nc\n" | uniq -d
a
b
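Note that -d prints each duplicated line only once per group, so with three identical lines in a row you still get a single line back (my own example), rather than the "second and subsequent copies" asked for:

# printf "a\na\na\n" | uniq -d
a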
davolfman

A solution is to use the -c option of uniq, and then remove what you don't want:

e444$ (   echo a ; echo a ; echo b ; echo d ; echo d ; echo e )  | uniq -c
  2 a
  1 b
  2 d
  1 e 

a and d are duplicated, while b and e appear only once.

e444$ (   echo a ; echo a ; echo b ; echo d ; echo d ; echo e )  | uniq -c  \
              | sed -E '/^ *1 .$/d;s/^ *[0-9]+ //'

Explanation of the sed expression:

/^ *1 .$/d deletes the lines whose count is 1 (the unique lines; the . matches the single-character data of this example)

s/^ *[0-9]+ // removes the leading count
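With the same sample input, the whole pipeline should therefore print:

a
d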

EchoMike444