$ cat data.txt 
aaaaaa
aaaaaa
cccccc
aaaaaa
aaaaaa
bbbbbb
$ cat data.txt | uniq
aaaaaa
cccccc
aaaaaa
bbbbbb
$ cat data.txt | sort | uniq
aaaaaa
bbbbbb
cccccc
$

The result I need is to display all the lines from the original file, removing all the duplicates (not just the consecutive ones) while maintaining the original order of the lines in the file.

In this example, the result I was actually looking for is

aaaaaa
cccccc
bbbbbb

How can I perform this generalized uniq operation?

Lazer

5 Answers

75
perl -ne 'print unless $seen{$_}++' data.txt

Or, if you must have a useless use of cat:

cat data.txt | perl -ne 'print unless $seen{$_}++'

Here's an awk translation, for systems that lack Perl:

awk '!seen[$0]++' data.txt
cat data.txt | awk '!seen[$0]++'
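
If the idiom looks terse, the same program can be written out with the mechanism spelled out in comments (a sketch that behaves identically; seen is just an arbitrary array name):

awk '
  # seen[$0] is 0 for a line we have not met yet; the post-increment
  # returns that old value, so !seen[$0]++ is true only on a line''s
  # first occurrence, and only then is the line printed.
  !seen[$0]++ { print }
' data.txt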
cjm
    A slightly shorter awk script is { if (!seen[$0]++) print } – camh Apr 24 '11 at 22:32
  • general info: I time-tested both perl and awk... awk was faster by 25+%... 450000 unique lines (doubled up) to make 900000 lines (4 bytes per line) – Peter.O Apr 25 '11 at 00:48
  • 1
    @fred, unless your file is truly huge, either version takes longer to type in than it does to run. – cjm Apr 25 '11 at 03:32
  • 14
    The awk version can be made even shorter by leaving out the if, print, parentheses, and braces: awk '!seen[$0]++' – Gordon Davisson Apr 25 '11 at 06:29
  • What is seen? I've searched the User's Guide and can't find anything. – Christoph Wurm Jan 17 '12 at 14:58
  • 2
    @Legate, it's the name of an array in which we're recording every line we've seen. You could change it to '!LarryWall[$0]++' for all awk cares, but "seen" helps people understand the program better. – cjm Jan 17 '12 at 19:14
  • @cjm: I think I get it - mostly: It's an ordinary user-defined array and every input line is added as an index, though without a corresponding value. But what does ++ do? – Christoph Wurm Jan 18 '12 at 09:55
  • I merged several lists of special characters from different websites into a single plain text file (https://dl.dropboxusercontent.com/u/39189022/SpecialCharacters.txt) and tried to get rid of duplicates to make a proper list, using different commands I could find, including this one. But the result was always unsatisfactory, including many duplicates and so on (https://dl.dropboxusercontent.com/u/39189022/SpecialCharacters-SortUnique.txt) This casts a doubt in my mind on the reliability of these commands which I use from time to time on some important data :) I wonder why this is so??? – Sadi May 17 '14 at 08:44
  • 1
    @Sadi, that really should have been asked as a question, not a comment. But some of the lines in that file end in a space, and some don't. These commands consider the entire line significant, including the whitespace at the end. – cjm May 17 '14 at 17:07
  • Ooops, I didn't notice the spaces at the end of some lines; thank you! I just didn't want to be warned of a duplicate question. – Sadi May 17 '14 at 20:07
  • @Sadi, if you link to the original question/answer and then explain why it doesn't solve your problem, it won't be closed as a duplicate. (It might get closed for some other reason, but not that.) – cjm May 18 '14 at 01:24
  • The awk trick is also well explained here. – Serge Stroobandt May 20 '14 at 22:43
  • awk one is VERY fast at time()! – Aquarius Power Jun 19 '14 at 01:13
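
As the comment thread above points out, these commands treat the whole line as significant, trailing whitespace included. If lines that differ only in trailing blanks should count as duplicates, one option is to strip the blanks first (a sketch; assumes a sed that supports POSIX character classes):

sed 's/[[:space:]]*$//' data.txt | awk '!seen[$0]++'
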
20

john (John the Ripper) has a tool called unique:

usr@srv % cat data.txt | unique out
usr@srv % cat out
aaaaaa
cccccc
bbbbbb

Achieving the same without additional tools in a single command line is a bit more complex:

usr@srv % cat data.txt | nl | sort -k 2 | uniq -f 1 | sort -n | sed 's/\s*[0-9]\+\s\+//'
aaaaaa
cccccc
bbbbbb

nl prints line numbers in front of the lines; if we sort/uniq on the text after them, we can remove the duplicates and then restore the original order of the lines. sed just deletes the line numbers afterwards ;)
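
To see what the extra sorts are doing, here is roughly how the intermediate data looks on the sample file (exact widths and separators depend on the nl implementation):

usr@srv % cat data.txt | nl | sort -k 2 | uniq -f 1
     1  aaaaaa
     6  bbbbbb
     3  cccccc
usr@srv % cat data.txt | nl | sort -k 2 | uniq -f 1 | sort -n
     1  aaaaaa
     3  cccccc
     6  bbbbbb

The line numbers attached by nl are what let the final sort -n restore the original order before sed strips them off.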

binfalse
  • Is there any combination of common Linux commands that could do the same? – Lazer Apr 24 '11 at 20:45
  • @Totor - see menkus' reply to a similar comment. @binfalse - your second solution does not work (maybe it works with this trivial sample, but it doesn't work with some real-life input). Please fix that, e.g. this should always work: nl -ba -nrz data.txt | sort -k2 -u | sort | cut -f2 – don_crissti Nov 07 '15 at 19:33
12

I prefer to use this:

cat -n data.txt | sort --key=2.1 -b -u | sort -n | cut -c8-

cat -n adds line numbers,

sort --key=2.1 -b -u sorts on the second field (i.e., on the text after the added line numbers), ignoring leading blanks and keeping only unique lines,

sort -n re-sorts in strict numeric order, and

cut -c8- keeps all characters from column 8 to the end of the line (i.e., omits the line numbers we added).

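Since cat -n separates the line number from the text with a tab, a field-based cut works as well and does not depend on the numbers occupying exactly seven columns (a sketch; relies on the usual cat -n format of a padded number followed by a tab):

cat -n data.txt | sort --key=2.1 -b -u | sort -n | cut -f2-
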
menkus
  • 7
    How to get only the unique results "without having to sort data"? – Jan Wikholm Jul 30 '13 at 06:40
  • 8
    'without having to sort the data' only appears in the title. The actual need is to: "display all the lines from the original file removing all the duplicates (not just the consecutive ones), while maintaining the original order of statements in the file." – menkus Aug 07 '13 at 18:00
  • The same thing, but using tabs as the delimiter for the line numbers: awk -v OFS='\t' '{ print NR, $0 }' data.txt | sort -u -b -k 2 | sort -n | cut -f 2- – Kusalananda Jun 29 '22 at 14:14
3

Perl has a module, List::MoreUtils, that includes a function called uniq. If you have your data loaded into an array in Perl, you can simply call the function like this to make it unique while still maintaining the original order.

use List::MoreUtils qw(uniq);
@output = uniq(@output);

You can read more about this module here: List::MoreUtils
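
The module can also be used straight from the shell without writing a script (a sketch; assumes List::MoreUtils is installed):

perl -MList::MoreUtils=uniq -e 'print uniq(<>)' data.txt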

slm
0

Using Raku (formerly known as Perl 6)

~$ raku -e '.put for lines.unique;'  file

OR (more awk-like syntax):

~$ raku -ne 'state %h; .put unless %h{$_}++ ;'  file

Sample Input:

aaaaaa
aaaaaa
cccccc
aaaaaa
aaaaaa
bbbbbb

Sample Output:

aaaaaa
cccccc
bbbbbb

https://docs.raku.org
https://raku.org

jubilatious1