
I have two files:

A.txt - about 90GB
B.txt - about 80GB

I want to combine the two files and remove duplicate lines.

How do I do this?

If commands other than awk are better for this job, please let me know.

MikeD

3 Answers

2

You probably cannot use awk hashes for this, as that would mean storing all the unique lines in memory. That would only work if the output file were significantly smaller than the available memory on the system.

If the input files are already sorted, you could do:

sort -mu A.txt B.txt > C.txt

You may need to change the locale to one that has the same sorting order as was used to sort the files.
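For example, if the files happened to be sorted with the C (byte-order) locale, the merge could look like this (a sketch; adjust LC_ALL to whatever locale was actually used to sort them):

LC_ALL=C sort -mu A.txt B.txt > C.txt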

That doesn't need to store more than one line of each file in memory at a time.

If they were not sorted, remove the -m, set $TMPDIR to a directory on a (preferably fast) filesystem with 170GB of free space, and be prepared to wait a bit.
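For example, a sketch of the full (non-merge) sort, assuming /mnt/fast/tmp is a directory with enough free space (the path is only illustrative):

TMPDIR=/mnt/fast/tmp sort -u A.txt B.txt > C.txt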

The result however will be sorted, which will speed up the merging of another file later on if need be.

sort will use temporary files, so it can work even on a system with little memory. But the more memory you have, the better. With GNU sort, see also the --compress-program and --buffer-size options, which can help you tune for better performance. If the sort order doesn't matter to you, fix the locale to C (with LC_ALL=C sort ...), as that would be the most efficient.
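For example, with GNU sort, something along these lines (the buffer size is only illustrative, and zstd is an assumption; any compressor installed on the system that reads stdin and writes stdout would do):

LC_ALL=C sort -u --buffer-size=8G --compress-program=zstd A.txt B.txt > C.txt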

1
printf "">MergeFile
cat A.txt B.txt | 
while IFS= read -r line; do 
  if [ ! -z "$line" ]; then
    if ! grep -Fxqe "$line" MergeFile; then
      echo "$line">>MergeFile;
    fi
  fi
done

Explanation

Create a new, empty MergeFile with:
printf "">MergeFile # or touch MergeFile, if the file does not already exist

Pipe the two files into a while loop:
cat A.txt B.txt |

Read each line:
while IFS= read -r line; do

Skip blank lines:
if [ ! -z "$line" ]; then
(if you want to keep the first blank line, add it back in an else clause)

If grep finds no match, it's the first time the line goes into MergeFile (i.e., it's unique):
if ! grep -Fxqe "$line" MergeFile; then

Add it to the MergeFile:
echo "$line">>MergeFile;

MikeD
  • Let's be serious. We're talking of 170GB of data and you're suggesting to run up to 5 commands (the grep one reading a file that also ends up being several GB large) for each of the lines in the files. See you in the year 12421 for the results. See also Why is using a shell loop to process text considered bad practice? – Stéphane Chazelas Mar 10 '17 at 22:16
  • Good on you for quoting your variables though! And it's true that approach doesn't have memory exhaustion issues. There are several other issues like the missing IFS=, --, the fact that fgrep will look for substrings (you'd want if ! grep -Fxqe "$line" MergFile) – Stéphane Chazelas Mar 10 '17 at 22:27
  • Thanks, I'll keep this in mind for large files and I'll implement your suggestions for this one. I appreciate your feedback. – MikeD Mar 10 '17 at 22:27
  • Yes, I was trying to avoid the memory issues, but like you indicated, it would be extremely time consuming with these large files. – MikeD Mar 10 '17 at 22:45
0

Try this command:

cat A.txt B.txt | awk '!seen[$0]++' > C.txt

It may take a while with files this large...