
I need to deduplicate a large wordlist. I tried several commands and did some research in "Fastest `uniq` tool in linux" and "How to remove duplicate lines in a large multi-GB textfile?", where it is explained that the fastest way to deduplicate a wordlist seems to be with awk:

awk  --> O(n) ?
sort --> O(n log n) ?

However, I found that this seems to be not true. Here are my testing results:

time sort -u input.txt -o output.txt 
real    0m12.446s  
user    0m11.347s  
sys     0m0.906s


time awk '!x[$0]++' input.txt > output.txt
real    0m47.221s  
user    0m45.419s  
sys 0m1.260s

So using sort -u is 3.7 times faster. Why is this? And is there an even faster method of deduplication?

**Update**

As someone pointed out in the comments, it could be that my wordlist was already sorted to some extent. To exclude this possibility I generated two wordlists using random_number_wordlist_generator.py.

List1 = 7 MB  
List2 = 690 MB
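
For reference, a stand-in generator along these lines (assuming the script simply emits random fixed-width numeric strings, one per line; sizes are illustrative) produces comparable lists:

# emit 1,000,000 random 8-digit lines; unordered by construction
awk 'BEGIN { srand(); for (i = 0; i < 1000000; i++) printf "%08d\n", int(rand() * 100000000) }' > list1.txt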

**Results AWK:**  
***List1***  
real    0m1.643s  
user    0m1.565s  
sys     0m0.062s

***List2***  
real    2m6.918s  
user    2m4.499s  
sys     0m1.345s

**Results SORT:**  
***List1***  
real    0m0.724s  
user    0m0.666s  
sys     0m0.048s

***List2***  
real    1m27.254s  
user    1m25.013s  
sys     0m1.251s
karlpy
  • Could it be that your input data is already sorted? – iruvar Aug 27 '15 at 22:14
  • I will generate a random list with numbers and check just to make sure – karlpy Aug 27 '15 at 22:18
  • Just a remark: it would be good to compare two cases - small and very large files. – jimmij Aug 27 '15 at 22:26
  • Big O notation is about what happens when input length approaches infinity: it tells you how an algorithm scales with big input. Some algorithms work better on small input sizes. – ctrl-alt-delor Aug 27 '15 at 22:36
  • Incrementing the value in that array will be at best O(log n) (if implemented as a hashed b-tree), and you do that n times; that is O(n log n). Therefore both are at best O(n log n). – ctrl-alt-delor Aug 27 '15 at 22:47
  • I've run the tests with a small and a large random wordlist, just to check whether there is a linear relationship. The results clearly show that sort is still at least twice as fast. – karlpy Aug 27 '15 at 22:55
  • Karlpy, what order did you execute them in, awk first or sort? That may make a difference due to file caching. – iruvar Aug 27 '15 at 23:00
  • I executed awk first. Now, to exclude that possibility, I changed the filename and ran sort first, with the same results. So it can't be due to caching. – karlpy Aug 27 '15 at 23:08
    @karlpy: "I changed the filename ..."  If you mean that you renamed the file, that's not good enough.  Renaming a file just associates a new name with the old inode, which still points to the same old data blocks.  If they were cached, they're still cached.  ISTM that a much better technique would be to (1) make a copy of the file, and *then* (2) run one command on one file and (3) run the other command on the other file. – Scott - Слава Україні Aug 28 '15 at 00:32
  • The explanation of how the awk command can be O(n), in the question that you linked to, seems plausible, and was well-received by the community (it got lots of votes), but it's not backed up by any references, and I'm not sure that it's right.  Note that the bullet on awk is one part of a four-part answer.   The other three parts seem to be sound, and the whole thing is presented in a good, professional style, so maybe we shouldn't interpret the upvotes to mean that people agree with the awk bullet. – Scott - Слава Україні Aug 28 '15 at 00:40
  • @Scott I've made new lists and tried, still getting the same results though – karlpy Aug 28 '15 at 01:06

3 Answers


You are asking the wrong question, or at least asking it in the wrong place: this would be better asked on Stack Overflow, where people can give you answers based on the algorithms used inside awk and sort.

PS: also repeat the tests with nawk, mawk and gawk to give us some more details to "zone into" ;) and do the runs, say, 100 times each, reporting the min, max, avg and standard deviation.
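
A minimal harness for that could look like this (GNU time assumed; the command and iteration count are illustrative):

# run the command 20 times, logging wall-clock seconds to times.txt
for i in $(seq 20); do
    /usr/bin/time -f '%e' sort -u input.txt -o /dev/null 2>> times.txt
done
# summarize min, max, average and standard deviation
awk '{ s += $1; ss += $1 * $1; if (NR == 1 || $1 < min) min = $1; if ($1 > max) max = $1 }
     END { avg = s / NR; printf "min=%s max=%s avg=%.2f sd=%.2f\n", min, max, avg, sqrt(ss / NR - avg * avg) }' times.txt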

In any case, back to the question at hand: as CompSci 210 teaches, it is about the algorithms used. sort makes use of several, depending on the input size and memory constraints: when it runs out of memory it saves data to temporary files on disk, to be merge-sorted later. You'll have to look at the source code to see what the specific sort(1) on your specific OS does, but from experience it loads into memory as much as it can, quicksorts that, writes it out to disk, rinses and repeats, and at the end merge-sorts the small sorted files. So here you'll have O(n*log2(n)) for the parts, and then an approximately O(n*log(n)) merging operation.
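
You can watch that spill-to-disk behaviour directly with GNU sort by capping its buffer (flags are GNU-specific; the 10M cap is only there to force spilling):

# cap the in-memory buffer so sort must write sorted runs to the temp dir
sort -u -S 10M -T /tmp input.txt -o output.txt
# while the above runs, its temporary run files appear in /tmp (check from another shell)
ls /tmp/sort*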

awk: the x[$0]++ mechanism is "supposed" to use hashing. BUT the problem with hashing, a supposedly O(1) "lookup" operation, is collisions, and the handling of collisions. That can cause trouble when the data isn't nicely spread or doesn't fill the buckets evenly, and on large lists hashing can become a big memory problem if the handling of collisions isn't done right (you might need to tune the hashing algorithm for the data expected). Then you need to look at the performance of the actual hash functions: the O(1) might be closer to O(log(n)) for inserts (i.e. O(1) for the first lookup, and if the key does NOT exist you add it, which could be O(log(n))), so that n*O(1) becomes n*O(log(n)) => O(n*log(n)). Not to mention you are also doing things in an "interpreted" manner :)
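
For reference, the one-liner expanded into explicit form (behaviourally equivalent: one hash-table lookup, and possibly one insert, per input line):

awk '{
    if (!($0 in seen)) {   # hash lookup; "in" does not create the entry
        print              # first occurrence: emit the line
        seen[$0] = 1       # insert the line into the hash table
    }
}' input.txt > output.txt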

Hvisage

I'd look hard at how to use sort -u (possibly with further switches!) for the task before breaking out a serious scripting language (Python/Perl/Raku, very probably after sorting), and only after seeing an absolute need for utmost speed would I consider other alternatives.
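
For example (GNU sort; the locale trick and the sizes here are illustrative, not tuned for your machine):

# LC_ALL=C compares raw bytes instead of locale collation order, which is
# often a large speedup; -S, --parallel and -T tune buffer size, threads
# and temp-file location
LC_ALL=C sort -u -S 2G --parallel="$(nproc)" -T /tmp input.txt -o output.txt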

vonbrand

The speed difference is because sort is a compiled command, whereas awk is a programming language.

The sort command takes input and returns output, whereas awk first interprets the code (the program given on the command line) and only then starts processing with it. Simple as that.
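
One rough way to probe how much of the gap is interpretation overhead rather than the hash table (a sketch, not a proper benchmark):

# a no-op program measures awk's read-and-interpret cost with no hashing
time awk '{ }' input.txt > /dev/null
# compare with the dedup one-liner to see what the hash table adds on top
time awk '!x[$0]++' input.txt > /dev/null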

Zuhayer