
I've searched around the internet and Stack Exchange for this. Even though there are lots of similar topics, I haven't found a solution yet.

So, I have a quite large list (approx. 20 GB) which contains around 5% duplicate lines. I want to filter this list so that only one copy of each duplicated line is kept. Example:

Input:

test123
Test123
test
test123
test 123

Output:

test123
Test123
test
test 123

Whether the list gets sorted or not doesn't matter.

I've tried sort -u -o output.txt and also sort -us -o output.txt. They work fine for smaller files, but when I try them on files of more than approx. 4 GB, the resulting file is suspiciously small and, instead of a .txt file, it has apparently become an "emacs-lisp-source-text".
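
Spelled out in full, the commands were of roughly this shape (input.txt is only a placeholder for the actual file, which I haven't named here):

# placeholder file names, not the real ones
sort -u input.txt -o output.txt
sort -us input.txt -o output.txt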

I'd be very grateful if someone could help me out!

    Possible duplicates: How to remove duplicate lines inside a text file? and How to remove duplicate lines in a large multi-GB textfile?. Please check if the answers work for you (especially the awk one). – Marco Dec 10 '15 at 09:46
  • I've gone through those threads, but unfortunately the solutions don't work for me (awk seems to be even more vulnerable to errors involving large files). – user146854 Dec 10 '15 at 10:34
  • You should state in your question what exactly you have tried, which commands you ran and what the console output and return value of the command was. If possible provide an example file which demonstrates the issue. But that might not be feasible if the error only shows after a certain size is reached. And clarify what you mean by “more vulnerable to errors involving large files”. Small vs. large is relative. On an old laptop with 64MiB of memory 100MiB might be large, on a server with 512GiB of memory 100GiB might be small. – Marco Dec 10 '15 at 10:54
  • And awk '!seen[$0]++' 8GiB_file > output works without problems here. No issues with the file size. The same goes for sort -u -o output 8GiB_file. Works here. – Marco Dec 10 '15 at 12:20
  • I tried that exact command earlier. Like sort -u, it doesn't work properly and creates an "emacs-lisp-source-text". However, I think I might have found the source of the problem. All the large files I have tried contain "strange" characters (Arabic, Chinese, hex, ... you name it). Because this only happened with large files, I concluded that the size was likely the reason. Could it be that the sort and awk commands have difficulties with certain kinds of characters? And if so, do you know an alternative that doesn't? – user146854 Dec 10 '15 at 12:33
  • The following works here (8+ GiB of Unicode): for i in {1..30000000}; do echo 'ᚹᛖᛥᚫ\nəsoʊsiˈeıʃn\n⠙⠳⠃⠞\ntest123\n⌷←⍳→⍴∆∇⊃‾⍎⍕⌈\nTest123\nSTARGΛ̊TE\ntest\nκόψη\ntest123\nსაერთაშორისო\ntest 123\nКонференцию\nพระปกเกศกองบู๊กู้ขึ้นใหม่\nአይታረስ\n'; done | awk '!seen[$0]++' Furthermore, how do you find out it's a lisp source file? I don't believe that awk's output is lisp. Maybe some tool's heuristics fail on the content of the resulting file. – Marco Dec 10 '15 at 14:52
  • I added an answer (which I've tested) below: use cat to pipe the InputFile to a while loop, read each line in the loop, and grep -F (or fgrep) the line against the desired OutputFile. If it's not already in the OutputFile, add it to the OutputFile with echo (see full answer below). – MikeD Mar 10 '17 at 22:22

2 Answers


Tested with GNU sort from GNU coreutils 8.26: I had no problem sorting a 5 GiB file, so you could try installing that version.

Things to bear in mind though:

  • sort -u doesn't give you unique lines, but one line out of each set of lines that sort the same. Especially on GNU systems, and in your typical locale, there are many distinct strings that sort the same. If you want lines that are unique at the byte level, use LC_ALL=C sort -u.
  • sort uses temporary files for big inputs so it can sort in chunks without using up all the memory. If you do not have enough space in your temporary directory (usually /tmp unless you have set $TMPDIR), it will fail. Set $TMPDIR (see also the -T option of GNU sort) to a directory with enough free space. A combined sketch follows this list.
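
For instance, a rough sketch combining both points (input.txt, output.txt and /mnt/bigdisk/tmp are placeholder names, not taken from the question):

# Byte-level uniqueness, with sort's temporary files on a partition that has enough room
mkdir -p /mnt/bigdisk/tmp
LC_ALL=C sort -u -T /mnt/bigdisk/tmp -o output.txt input.txt

# Equivalent, using the environment variable instead of -T
LC_ALL=C TMPDIR=/mnt/bigdisk/tmp sort -u -o output.txt input.txt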
Zombo
printf "">OutputFile
cat InputFile | 
while IFS= read -r line; do 
  if [ ! -z "$line" ]; then
    if ! grep -Fxqe "$line" OutputFile; then
      echo "$line">>OutputFile;
    fi
  fi
done

Explanation

Create a new, empty OutputFile
printf "" > OutputFile

Pipe the InputFile to a while loop
cat InputFile |

Read each line
while IFS= read -r line; do

Skip blank lines (only non-empty lines are processed further)
if [ ! -z "$line" ]; then

Check whether the line is already in OutputFile
If grep finds no match (it exits with a non-zero status), the line is not yet in the OutputFile (i.e., it's unique so far)
if ! grep -Fxqe "$line" OutputFile; then

Put the line in OutputFile
echo "$line">>OutputFile;

MikeD