
I don't want to sort my file, just filter out duplicate lines, maintaining the original ordering. Is there a way to use sort's unique function without its sort function (something like cat -u would give if it existed)? Just using uniq without sort does nothing worthwhile, because uniq only looks at adjacent lines, so a file has to be sorted first.
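
For instance, a minimal demonstration of that adjacency limitation (a sketch with made-up input, not the real file):

    printf 'a\nb\na\n' | uniq          # prints a, b, a - the duplicate survives
    printf 'a\nb\na\n' | sort | uniq   # prints a, b   - but the original order is gone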

Also, incidentally, what in hell is the difference between uniq and uniq --unique? Here are commands on a random file from pastebin:

    wget -qO - http://pastebin.com/0cSPs9LR | wc -l
    350
    wget -qO - http://pastebin.com/0cSPs9LR | sort -u | wc -l
    287
    wget -qO - http://pastebin.com/0cSPs9LR | sort | uniq | wc -l
    287
    wget -qO - http://pastebin.com/0cSPs9LR | sort | uniq -u | wc -l
    258

In summary:

  1. How do I filter duplicates greedily without sorting?
  2. How is uniq not unique enough that there is also uniq --unique?

p.s. This question looks like a duplicate of the following q's, but it isn't:

  • Don't use sort or uniq at all. And "How is uniq not unique enough that there is also uniq --unique?" really should be a separate question. – muru Jun 18 '15 at 10:06
  • The solution on the duplicate page has an in-bash solution using awk. Suits me.

    As for the separate question, I just posted it here:

    http://unix.stackexchange.com/questions/210528/how-is-uniq-not-unique-enough-that-there-is-also-uniq-unique

    – enfascination Jun 18 '15 at 10:21
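
For reference, the awk approach mentioned in the comment above is commonly written as the one-liner below; the exact form on the linked duplicate may differ, but the idea is the same hash trick used in the answer that follows:

    awk '!seen[$0]++' data.txt

It keeps the first occurrence of each line and preserves the original order.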

1 Answer


I'd use perl and a hash.

Something like:

    #!/usr/bin/perl

    use strict;
    use warnings;

    # count how many times each whole line has been seen
    my %seen;

    while ( <> ) {
        # print a line only on its first occurrence, preserving input order
        print unless $seen{$_}++;
    }

I think this'd one-liner-ify as:

    perl -ne 'print unless $seen{$_}++' data.txt

(Or cat data into it).
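
Because the one-liner keeps exactly one copy of each distinct line (just in the original order), it should report the same count as sort -u did in the question, i.e. 287, when run against the same paste (a sketch, assuming the pastebin URL is still reachable):

    wget -qO - http://pastebin.com/0cSPs9LR | perl -ne 'print unless $seen{$_}++' | wc -l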

This works for getting unique whole lines; you can also use split or a regular expression to compare just a subset of each line.

E.g.

    while ( <> ) {
        # split the line on ";" and de-duplicate on the 5th field (index 4)
        my @fields = split ( ";" );
        print unless $seen{ $fields[4] }++;
    }

This splits the line into fields on ";" and compares only the 5th field (the first is index zero in the array).
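
Since the answer mentions regular expressions as well, here is a minimal sketch of the same idea keyed on a regex capture rather than a split field; the pattern used here (the first run of digits on the line) is just a placeholder assumption, swap in whatever part of the line matters to you:

    while ( <> ) {
        # key uniqueness on the first regex capture; \d+ is a placeholder pattern
        if ( my ( $key ) = m/(\d+)/ ) {
            print unless $seen{$key}++;
        }
        else {
            # lines that don't match at all are passed through unchanged
            print;
        }
    }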

Sobrique