
I don't want to sort my file, just filter out duplicate lines, maintaining the original ordering. Is there a way to use sort's unique function without its sort function (something like cat -u would give if it existed)? Just using uniq without sort does nothing worthwhile, because uniq only looks at adjacent lines, so a file has to be sorted first.
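
For instance, a minimal demonstration of that adjacency limitation (a sketch with made-up input, not the real file):

    printf 'a\nb\na\n' | uniq          # prints a, b, a - the duplicate survives
    printf 'a\nb\na\n' | sort | uniq   # prints a, b   - but the original order is gone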

Also, incidentally, what in hell is the difference between uniq and uniq --unique? Here are commands on a random file from pastebin:

    wget -qO - http://pastebin.com/0cSPs9LR | wc -l
    350
    wget -qO - http://pastebin.com/0cSPs9LR | sort -u | wc -l
    287
    wget -qO - http://pastebin.com/0cSPs9LR | sort | uniq | wc -l
    287
    wget -qO - http://pastebin.com/0cSPs9LR | sort | uniq -u | wc -l
    258

In summary:

  1. How do I filter duplicates greedily without sorting?
  2. How is uniq not unique enough that there is also uniq --unique?

p.s. This question looks like a duplicate of the following q's, but it isn't:

  • Don't use sort or uniq at all. And "How is uniq not unique enough that there is also uniq --unique?" really should be a separate question. – muru Jun 18 '15 at 10:06
  • The solution on the duplicate page has an in-bash solution using awk. Suits me.

    As for the separate question, I just posted it here:

    http://unix.stackexchange.com/questions/210528/how-is-uniq-not-unique-enough-that-there-is-also-uniq-unique

    – enfascination Jun 18 '15 at 10:21
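
For reference, the awk approach mentioned in the comment above is commonly written as the one-liner below; the exact form on the linked duplicate may differ, but the idea is the same hash trick used in the answer that follows:

    awk '!seen[$0]++' data.txt

It keeps the first occurrence of each line and preserves the original order.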

1 Answer


I'd use perl and a hash.

Something like:

    #!/usr/bin/perl

    use strict;
    use warnings;

    # count how many times each whole line has been seen
    my %seen;

    while ( <> ) {
        # print a line only on its first occurrence, preserving input order
        print unless $seen{$_}++;
    }

I think this'd one-liner-ify as:

    perl -ne 'print unless $seen{$_}++' data.txt

(Or cat data into it).
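
Because the one-liner keeps exactly one copy of each distinct line (just in the original order), it should report the same count as sort -u did in the question, i.e. 287, when run against the same paste (a sketch, assuming the pastebin URL is still reachable):

    wget -qO - http://pastebin.com/0cSPs9LR | perl -ne 'print unless $seen{$_}++' | wc -l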

This works for getting unique whole lines; you can also use split or a regular expression to compare just a subset of each line.

E.g.

    while ( <> ) {
        # split the line on ";" and de-duplicate on the 5th field (index 4)
        my @fields = split ( ";" );
        print unless $seen{ $fields[4] }++;
    }

This splits the line into fields on ";" and compares only the 5th field (the first is index zero in the array).
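
Since the answer mentions regular expressions as well, here is a minimal sketch of the same idea keyed on a regex capture rather than a split field; the pattern used here (the first run of digits on the line) is just a placeholder assumption, swap in whatever part of the line matters to you:

    while ( <> ) {
        # key uniqueness on the first regex capture; \d+ is a placeholder pattern
        if ( my ( $key ) = m/(\d+)/ ) {
            print unless $seen{$key}++;
        }
        else {
            # lines that don't match at all are passed through unchanged
            print;
        }
    }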

Sobrique