How is uniq not unique enough that there is also uniq --unique?

Question

Here are commands on a random file from pastebin:

wget -qO - http://pastebin.com/0cSPs9LR | wc -l
350
wget -qO - http://pastebin.com/0cSPs9LR | sort -u | wc -l
287
wget -qO - http://pastebin.com/0cSPs9LR | sort | uniq | wc -l
287
wget -qO - http://pastebin.com/0cSPs9LR | sort | uniq -u | wc -l
258

The man pages are not clear on what the -u flag is doing. Any advice?

Try sort | uniq -d | wc -l and you might spot the difference. :) — stoeff, Jun 18 '15 at 10:27

score 59 · Accepted Answer · answered Jun 18 '15 at 10:26

uniq with -u skips any lines that have duplicates. Thus:

$ printf "%s\n" 1 1 2 3 | uniq
1
2
3
$ printf "%s\n" 1 1 2 3 | uniq -u
2
3

By default, uniq prints each distinct line exactly once (assuming sorted input). With -u, it prints only the lines that are truly unique, that is, lines that appear exactly once.
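As a rough sanity check, the three modes fit together: on sorted input, the number of distinct lines (plain uniq) should equal the number of never-repeated lines (uniq -u) plus the number of repeated ones (uniq -d). The sketch below uses a made-up four-line input, not the pastebin file. If the same identity holds for the file in the question, 287 distinct and 258 unique lines would leave 29 lines for uniq -d.

```shell
# Hypothetical sample input: "1" appears twice, "2" and "3" once each.
input=$(printf '%s\n' 1 1 2 3)

# $(( ... )) strips the padding that some wc implementations add.
distinct=$(( $(printf '%s\n' "$input" | sort | uniq    | wc -l) ))
unique=$((   $(printf '%s\n' "$input" | sort | uniq -u | wc -l) ))
repeated=$(( $(printf '%s\n' "$input" | sort | uniq -d | wc -l) ))

echo "distinct=$distinct unique=$unique repeated=$repeated"
# distinct=3 unique=2 repeated=1
```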

answered Jun 18 '15 at 10:26

muru


That is, uniq could be called distinct, since it prints all distinct lines, whereas uniq -u prints all unique lines. – Steve Jessop Jun 18 '15 at 12:41
It's not truly unique with GNU uniq in some locales. – cuonglm Jun 18 '15 at 12:57
I must have read the accepted answer a number of times, but it didn't sink in. Your example and paragraph after it make it very clear (and going back and re-reading the accepted answer, I get that too) :) – Madivad Feb 06 '16 at 22:34

score 47 · Answer 2 · answered Jun 18 '15 at 11:47

Short version:

uniq, without -u, makes every line of the output unique.
uniq -u only prints every unique line from the input.

Slightly longer version:

uniq is for dealing with files that have lines duplicated, and only when those lines appear successively in the input. So, for its purposes, a unique line is one that is not duplicated immediately.

(uniq has a very limited short-term memory; it will never remember whether a line appeared earlier in the input, unless it was the immediately previous line -- this is why uniq is very often paired with sort.)

When it encounters a run of duplicate lines, uniq, without the -u arg, prints one copy of that line. (It makes every line of the output unique).

With the -u argument, it prints zero copies of that line -- runs of duplicates just get omitted from the output.
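A small sketch of the run-based behaviour described above, using a made-up unsorted input in which "a" occurs in two separate runs. The variable names are only for illustration.

```shell
# Hypothetical unsorted input: a, a, b, a (runs: "a a", "b", "a").

# Default: every run collapses to one copy, so "a" still appears twice.
collapsed=$(printf '%s\n' a a b a | uniq | paste -sd ' ' -)

# -u: the run "a a" is omitted entirely; the lone "b" and the final "a" remain.
singles=$(printf '%s\n' a a b a | uniq -u | paste -sd ' ' -)

echo "$collapsed"   # a b a
echo "$singles"     # b a
```

Neither result is free of duplicates, because uniq never compares the last "a" with the first run. Sorting first would put all three "a" lines in one run.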

answered Jun 18 '15 at 11:47

Ian Clelland


I really wish there was an option to not require sorting. But it would require keeping the whole file in memory (or doing lots of bookkeeping with hashes and offsets if the source is a normal file) – Random832 Jun 18 '15 at 12:15
@Random832: and it would require deciding which of the dupes to keep (first, last, something else, configurable), and that decision would affect the algorithm globally. Hassle. – Steve Jessop Jun 18 '15 at 12:44
@Random832: if it's just about the number of characters to type, you can use sort -u instead of sort | uniq. – oliver Jun 19 '15 at 12:53
@oliver I've occasionally wanted an ability to keep the first instance of any line without rearranging them, and written scripts to do so. – Random832 Jun 19 '15 at 13:00
@SteveJessop That's already a decision that needs to be made, isn't it? Can't two lines compare as equal without actually being identical, for example if a Unicode file is not normalised and two equal lines use different normalisation forms? – hvd Jun 19 '15 at 14:39
@hvd: if your version of uniq does normalisation and collation, yes. But even then it's only a local consideration -- you know where in the sorted output the line will appear, and just have to select which of several adjacent lines to keep. If the input isn't sorted then the decision affects the whole operation of uniqifying, for example if you're going to keep the last duplicate then you can't output anything until you've read the last line of the input... – Steve Jessop Jun 19 '15 at 15:00
@SteveJessop That's a very good point, thanks for the clarification. – hvd Jun 19 '15 at 15:05
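For the "keep the first instance without rearranging" case raised in these comments, one common workaround is awk rather than uniq. This is only a sketch on a made-up input. It holds every distinct line in memory, as Random832 anticipated, and it hard-codes the "keep first" policy that Steve Jessop mentions.

```shell
# Hypothetical unsorted input with non-adjacent duplicates.
# seen[$0]++ is 0 (false) the first time a line is met, so !seen[$0]++
# is true only for first occurrences, and awk prints just those.
first_seen=$(printf '%s\n' b a b c a | awk '!seen[$0]++' | paste -sd ' ' -)

echo "$first_seen"   # b a c
```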

score 18 · Answer 3 · answered Jun 18 '15 at 10:33

The POSIX spec for uniq describes it clearly:

-u
    Suppress the writing of lines that are repeated in the input.

The -u option tells uniq not to print repeated lines.

Most uniq implementations compare bytes, while GNU uniq uses the locale's collation order to detect duplicate lines. So it can produce wrong results in some locales, for example in the en_US.UTF-8 locale:

$ printf '%b\n' '\U2460' '\U2461' | uniq
①

and -u gives you no lines:

$ printf '%b\n' '\U2460' '\U2461' | uniq -u
<blank>

So you should set the locale to C to get byte comparison:

$ printf '%b\n' '\U2460' '\U2461' | LC_ALL=C uniq
①
②
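The same fix should apply to -u. The sketch below writes the two characters as raw UTF-8 bytes in octal (an assumption made here so the input does not depend on the shell's \U support), and the helper name circled is purely illustrative.

```shell
# ① (U+2460) and ② (U+2461) as UTF-8 byte sequences E2 91 A0 and E2 91 A1.
circled() { printf '\342\221\240\n\342\221\241\n'; }

# With byte comparison the two lines differ, so both count as unique.
count=$(( $(circled | LC_ALL=C uniq -u | wc -l) ))

echo "$count"   # 2
```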

edited Jun 18 '15 at 16:08

answered Jun 18 '15 at 10:33

cuonglm


Note that what is wrong here is not as much uniq (though apparently the intent of POSIX was that it should do byte comparison instead of strcoll() comparison as in sort -u) as those locales that erroneously have ① sorting the same as ②. At least GNU uniq is consistent with sort -u. – Stéphane Chazelas Jun 18 '15 at 16:27
@StéphaneChazelas - where in the spec is that made apparent? – mikeserv Jun 19 '15 at 05:33
About uniq required to do memcmp/strcmp as opposed to strcoll, that is not very apparent to me but that was to Geoff. About the GNU locales having ① sorting the same as ②, that's clearly a bug as there's no reason why they should sort the same. That's allowed by POSIX but there's some change coming. – Stéphane Chazelas Jun 20 '15 at 07:29

score 12 · Answer 4 · answered Jun 18 '15 at 23:02

normal:

echo "a b a b c c c" | tr ' ' '\n'
a
b
a
b
c
c
c

uniq: no two adjacent lines are the same

echo "a b a b c c c" | tr ' ' '\n' | uniq
a
b
a
b
c

sorted

echo "a b a b c c c" | tr ' ' '\n' | sort
a
a
b
b
c
c
c

sort -u : no two repeating lines

echo "a b a b c c c" | tr ' ' '\n' | sort -u
a
b
c

sort / uniq: all distinct

echo "a b a b c c c" | tr ' ' '\n' | sort | uniq
a
b
c

uniq -c: count the occurrences of each distinct line

echo "a b a b c c c" | tr ' ' '\n' | sort | uniq -c
2 a
2 b
3 c

only lines which are not repeated (not sorted first)

echo "a b a b c c c" | tr ' ' '\n' | uniq -u
a
b
a
b

only lines which are not repeated (after sorting)

echo "a b a b c c c Z" | tr ' ' '\n' | sort | uniq -u
Z

uniq -d : only print duplicate lines, one for each group

echo "a b a b c c c" | tr ' ' '\n' | uniq -d
c

uniq -dc: the same, with counts

echo "a b a b c c c" | tr ' ' '\n' | uniq -dc
3 c
