I've searched around the internet and Stack Exchange for this. Even though there are lots of similar topics, I haven't found a solution yet.
So, I have a quite large list (approx. 20 GB) which contains around 5% duplicate lines. I want to filter this list so that only one copy of each duplicated line is kept. Example:
Input:
test123
Test123
test
test123
test 123
Output:
test123
Test123
test
test 123
Whether the list gets sorted or not doesn't matter.
I've tried sort -u -o output.txt, and also sort -us -o output.txt. Both work fine for smaller files, but when I try files of more than approx. 4 GB, the resulting file is suspiciously small and, instead of a .txt file, it has apparently become an "emacs-lisp-source-text".
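For reference, GNU sort spills to temporary files on inputs this large, so a complete invocation that names the input file explicitly and points the temporary files at a volume with enough free space might look like this (input.txt and /mnt/big_tmp are placeholders):

    # -u: unique, -o: write result to file, -T: directory for temp files
    # LC_ALL=C compares raw bytes, which is faster and sidesteps locale quirks
    LC_ALL=C sort -u -T /mnt/big_tmp -o output.txt input.txt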
I'd be very grateful if someone could help me out!
Comments:

… the awk one). – Marco Dec 10 '15 at 09:46

awk '!seen[$0]++' 8GiB_file > output works without problems here. No issues with the file size. The same goes for sort -u -o output 8GiB_file. Works here. – Marco Dec 10 '15 at 12:20
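For readers unfamiliar with the awk idiom above: seen is an associative array keyed by the entire line ($0). A commented sketch, with input.txt and output.txt as placeholder names:

    # seen[$0] is 0 (false) the first time a line occurs, so !seen[$0]++
    # is true and awk prints the line (the default action for a true
    # pattern). The post-increment then marks the line as seen, so every
    # later copy of that line evaluates to false and is skipped.
    awk '!seen[$0]++' input.txt > output.txt

One caveat: the array holds every unique line in memory, so on a 20 GB file with only ~5% duplicates this needs RAM on the order of the unique data.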
for i in {1..30000000}; do echo 'ᚹᛖᛥᚫ\nəsoʊsiˈeıʃn\n⠙⠳⠃⠞\ntest123\n⌷←⍳→⍴∆∇⊃‾⍎⍕⌈\nTest123\nSTARGΛ̊TE\ntest\nκόψη\ntest123\nსაერთაშორისო\ntest 123\nКонференцию\nพระปกเกศกองบู๊กู้ขึ้นใหม่\nአይታረስ\n'; done | awk '!seen[$0]++'
Furthermore, how do you find out it's a lisp source file? I don't believe that awk's output is lisp. Maybe some tool's heuristics fail on the content of the resulting file. – Marco Dec 10 '15 at 14:52
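As for the "emacs-lisp-source-text" label: file(1) bases its guess on the first part of a file only, so an odd-looking leading line can produce a misleading type even when the data itself is intact. A quick way to sanity-check the result (output.txt is a placeholder):

    file output.txt      # what the type heuristic says
    head -n 5 output.txt # inspect the actual first lines
    wc -l output.txt     # is the line count plausible?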
Use cat to pipe the InputFile to a while loop, read each line in the loop, and grep -F (or fgrep) the line against the desired OutputFile. If it's not already in the OutputFile, add it to the OutputFile with echo (see full answer below). – MikeD Mar 10 '17 at 22:22
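A minimal sketch of the loop MikeD describes, assuming POSIX shell and grep (InputFile and OutputFile are the names from the comment). It holds only one line in memory at a time, but since it rescans OutputFile for every input line it is quadratic, and would be very slow on a 20 GB input:

    : > OutputFile                          # start with an empty output file
    cat InputFile | while IFS= read -r line; do
        # -q: quiet, -x: match the whole line, -F: fixed string, not a regex
        grep -qxF -- "$line" OutputFile || printf '%s\n' "$line" >> OutputFile
    done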