I need to deduplicate a large wordlist. I tried several commands and did some research in "Fastest `uniq` tool in linux" and "How to remove duplicate lines in a large multi-GB textfile?", where they explain that the fastest way to deduplicate a wordlist seems to be using `awk`.
awk → O(n)?  sort → O(n log n)?
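(My understanding of the O(n) claim, not something stated in those answers: the `awk` idiom used in the tests below keeps an associative/hash array keyed on the whole line, so each line costs roughly one lookup and insert. A longhand sketch of the same logic:)

    # dedup.awk -- longhand version of the '!x[$0]++' one-liner used below
    # 'seen' is an associative (hash) array keyed on the whole input line.
    {
        if (!($0 in seen)) {   # one hash lookup per input line
            print              # first occurrence: emit the line
            seen[$0] = 1       # remember it so later duplicates are dropped
        }
    }

Run as `awk -f dedup.awk input.txt > output.txt`.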
However, I found that this seems to be not true. Here are my testing results:
    time sort -u input.txt -o output.txt

    real    0m12.446s
    user    0m11.347s
    sys     0m0.906s

    time awk '!x[$0]++' input.txt > output.txt

    real    0m47.221s
    user    0m45.419s
    sys     0m1.260s
So using `sort -u` is 3.7 times faster. Why is this? Is there an even faster method of deduplication?
**Update**
As someone pointed out in the comments, it could be that my wordlist was already sorted to some extent. To rule out that possibility, I generated two wordlists with random_number_wordlist_generator.py:

List1 = 7 MB
List2 = 690 MB
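(random_number_wordlist_generator.py is not reproduced here; as a rough, hypothetical stand-in, something like this emits an unsorted list of random numeric lines:)

    # Illustration only -- not the actual generator script used above.
    # Print n random integers, one per line, so the result is unsorted.
    awk -v n=1000000 'BEGIN { srand(); for (i = 0; i < n; i++) print int(rand() * 100000000) }' > list1.txt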
**Results AWK:**

*List1*

    real    0m1.643s
    user    0m1.565s
    sys     0m0.062s

*List2*

    real    2m6.918s
    user    2m4.499s
    sys     0m1.345s

**Results SORT:**

*List1*

    real    0m0.724s
    user    0m0.666s
    sys     0m0.048s

*List2*

    real    1m27.254s
    user    1m25.013s
    sys     0m1.251s
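To check how each command scales (roughly linear vs. n log n), a crude timing loop over growing slices of the same list can help. A sketch, assuming bash and that list2.txt is the 690 MB file (the slice sizes are arbitrary):

    # Crude scaling check (sketch): time both commands on increasing slices.
    for n in 1000000 2000000 4000000 8000000; do
        head -n "$n" list2.txt > slice.txt
        echo "== $n lines =="
        time sort -u slice.txt -o /dev/null
        time awk '!x[$0]++' slice.txt > /dev/null
    done

This doesn't change the timings above; it just makes the growth rate of each approach visible.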
[…] O(log n) (if a hashed b-tree), and you do that n times. That is O(n log n). Therefore both are at best O(n log n). – ctrl-alt-delor Aug 27 '15 at 22:47

The claim that the `awk` command can be O(n), in the question that you linked to, seems plausible, and was well-received by the community (it got lots of votes), but it's not backed up by any references, and I'm not sure that it's right. Note that the bullet on `awk` is one part of a four-part answer. The other three parts seem to be sound, and the whole thing is presented in a good, professional style, so maybe we shouldn't interpret the upvotes to mean that people agree with the `awk` bullet. – Scott - Слава Україні Aug 28 '15 at 00:40