
I have a text file containing lines like this:

This is a thread  139737522087680
This is a thread  139737513694976
This is a thread  139737505302272
This is a thread  139737312270080
.
.
.
This is a thread  139737203164928
This is a thread  139737194772224
This is a thread  139737186379520

How can I be sure of the uniqueness of every line?

NOTE: The goal is to test the file, not to modify it if duplicate lines are present.

Rui F Ribeiro

8 Answers

25

Awk solution:

awk 'a[$0]++{print "dupes"; exit(1)}' file && echo "no dupes"
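
To drive a script off the exit status alone, a minimal sketch along these lines should work (has_only_unique_lines is just an illustrative name, and the print is dropped since only the status matters):

# Sketch: exits non-zero on the first duplicate line, zero otherwise.
has_only_unique_lines() {
    awk 'a[$0]++ { exit 1 }' "$1"
}

if has_only_unique_lines file; then
    echo "no dupes"
else
    echo "dupes"
fi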
Zombo
iruvar
  • +1 The accepted answer reads through the whole file twice, while this stops as soon as it encounters a duplicate line in one read. This will also work with piped input, while the other needs files it can re-read. – JoL Jul 06 '18 at 17:26
  • Couldn't you shove the echo into END? – Ignacio Vazquez-Abrams Jul 06 '18 at 22:32
  • @IgnacioVazquez-Abrams, it doesn't work (at least in GNU awk). Code in END is executed on the way out even when exit() is called – iruvar Jul 06 '18 at 22:35
  • @IgnacioVazquez-Abrams There's really no point in the echo. Doing && echo or || echo is a convention in answers to indicate that a command does the right thing with the exit status code. The important thing is the exit(1). Ideally, you'd use this like if has_only_unique_lines file; then ..., not if [[ $(has_only_unique_lines file) = "no dupes" ]]; then ..., that'd be silly. – JoL Jul 07 '18 at 01:10
  • Where other answers read the file twice to save memory, this will read the whole file into memory, if there are no dupes. – Kusalananda Jul 08 '18 at 17:29
  • @IgnacioVazquez-Abrams You can, if you set a flag when you find a dupe and switch what you output in END according to this. – Kusalananda Jul 08 '18 at 17:30
  • @Kusalananda While this will read the whole file into memory when there are no dupes, using sort will too, regardless of whether there are dupes or not, right? How is that saving memory? – JoL Jul 09 '18 at 00:50
  • @JoL sorting an arbitrary sized file is done in chunks. When all chunks are individually sorted, they are merge-sorted to create the result. – Kusalananda Jul 09 '18 at 05:27
  • @Kusalananda How does that relate to the reads, though? It still has to read all chunks into memory before outputting anything in case the last line should go first. I would guess that for piped input that's larger than available memory, sort might save it to a file and work with offset references, re-reading the file multiple times, but for anything smaller, I'd think it would opt for keeping it in memory. – JoL Jul 09 '18 at 15:30
  • @JoL GNU sort implements this: https://en.wikipedia.org/wiki/External_sorting – Kusalananda Jul 09 '18 at 15:35
24

Using sort/uniq:

sort input.txt | uniq

To check only for duplicate lines, use the -d option of uniq. This prints only the duplicated lines; if there are none, it prints nothing:

sort input.txt | uniq -d
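
If you want a pass/fail result rather than eyeballing the output, one possible sketch (not part of the original pipeline) is to test whether uniq -d printed anything:

# grep -q '^' succeeds only if uniq -d produced at least one line
if sort input.txt | uniq -d | grep -q '^'; then
    echo "file contains duplicate lines"
else
    echo "all lines are unique"
fi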
jesse_b
  • This is my goto. Not sure what the other, higher-voted answers offer that this one doesn't. – user1717828 Jul 06 '18 at 20:00
  • It's a good alternative for removing duplicates. – Soner from The Ottoman Empire Jul 06 '18 at 20:37
  • This doesn't do what he wants. He wants to know if there are duplicates, not remove them. – Barmar Jul 06 '18 at 22:08
  • @Barmar: While it does seem that way, the question is still unclear, as is the OP's comment attempting to clarify it. – jesse_b Jul 06 '18 at 22:09
  • There's a pending edit that adds more clarification. – Barmar Jul 06 '18 at 22:24
  • It says "The goal is to test the file, not to modify the file if duplicate lines are present." – Barmar Jul 06 '18 at 22:24
  • Clarification was added by a user other than OP. OP says he wants to ensure the file is unique. Ensuring it's unique means removing duplicates. – jesse_b Jul 06 '18 at 22:25
  • @Jesse_b - Yeah this question has turned into a mess, IMO. Now we have a thrash of answers which drives me nuts 8--). The OP didn't make the edit, someone else did, I just approved it but it seems the consensus of what the OP was asking is what's there. I should've just closed this as unclear to begin w/ and pushed it back on the OP. – slm Jul 06 '18 at 22:33
24
[ "$(wc -l < input)" -eq "$(sort -u input | wc -l)" ] && echo all unique
Kusalananda
Jeff Schaller
6

TLDR

The original question was unclear, and read as though the OP simply wanted a unique version of the contents of a file. That's shown below. In the since-updated form of the question, the OP now states that they simply want to know whether the contents of the file are unique or not.


Test whether the file's contents are unique or not

You can use sort on its own to check whether a file is unique or contains duplicates: the -C option makes sort check its input instead of sorting it, and -u extends that check to duplicate lines:

$ sort -uC input.txt && echo "unique" || echo "duplicates"

Example

Say I have these two files:

duplicate sample file

$ cat dup_input.txt
This is a thread  139737522087680
This is a thread  139737513694976
This is a thread  139737505302272
This is a thread  139737312270080
This is a thread  139737203164928
This is a thread  139737194772224
This is a thread  139737186379520

unique sample file

$  cat uniq_input.txt
A
B
C
D

Now when we analyze these files we can tell if they're unique or contain duplicates:

test duplicates file

$ sort -uC dup_input.txt && echo "unique" || echo "duplicates"
duplicates

test unique file

$ sort -uC uniq_input.txt && echo "unique" || echo "duplicates"
unique

Original question (unique contents of file)

Can be done with just sort:

$ sort -u input.txt
This is a thread  139737186379520
This is a thread  139737194772224
This is a thread  139737203164928
This is a thread  139737312270080
This is a thread  139737505302272
This is a thread  139737513694976
This is a thread  139737522087680
slm
3

I usually sort the file, then use uniq to count the number of duplicates, then I sort once more to see the duplicates at the bottom of the list.

I added one duplicate to the examples you provided:

$ sort thread.file | uniq -c | sort
      1 This is a thread  139737186379520
      1 This is a thread  139737194772224
      1 This is a thread  139737203164928
      1 This is a thread  139737312270080
      1 This is a thread  139737513694976
      1 This is a thread  139737522087680
      2 This is a thread  139737505302272

Since I haven't read the man page for uniq in a while, I took a quick look for any alternatives. The following eliminates the need for the second sort, if you just want to see duplicates:

$ sort thread.file | uniq -d
This is a thread  139737505302272
slm
2

If there are no duplicates, all lines are unique:

[ "$(sort file | uniq -d)" ] && echo "some line(s) is(are) repeated"

Description:
  • Sort the file lines to make repeated lines consecutive (sort).
  • Extract all consecutive lines that are equal (uniq -d).
  • If there is any output from the command above ([ ... ]), then (&&) print a message.
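
Written out as an explicit if/else, the same test (the messages are illustrative) would look like this:

if [ "$(sort file | uniq -d)" ]; then
    echo "some line(s) is(are) repeated"
else
    echo "all lines are unique"
fi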

2

This would not be complete without a Perl answer!

$ perl -ne 'print if ++$a{$_} == 2' yourfile

This will print each non-unique line once: so if it prints nothing, then the file has all unique lines.
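
If you only care about the exit status, a variant of the same idea (not part of the original one-liner) stops at the first repeated line:

# Exits 1 as soon as any line is seen a second time, 0 otherwise.
perl -ne 'exit 1 if $seen{$_}++' yourfile && echo "all lines unique"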

slm
1

Using cmp and sort in bash:

cmp -s <( sort file ) <( sort -u file ) && echo 'All lines are unique'

or

if cmp -s <( sort file ) <( sort -u file )
then
    echo 'All lines are unique'
else
    echo 'At least one line is duplicated'
fi

This would sort the file twice though, just like the accepted answer.

Kusalananda