
I have two files:

A.txt - about 90GB
B.txt - about 80GB

I want to combine the two files and remove duplicate lines.

How do I do this?

If commands other than awk are better for this job, please let me know.

MikeD

3 Answers

2

You probably cannot use awk hashes for this, as that would mean storing all the unique lines in memory. That would only work if the output file were significantly smaller than the available memory on the system.

If the input files are already sorted, you could do:

sort -mu A.txt B.txt > C.txt

You may need to change the locale to one that has the same sorting order as was used to sort the files.
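For example, if the files happened to be sorted with the C (byte-order) locale, the merge could look like this (a sketch; adjust LC_ALL to whatever locale was actually used to sort them):

LC_ALL=C sort -mu A.txt B.txt > C.txt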

That doesn't need to store more than one line of each file in memory at a time.

If they were not sorted, remove the -m, set $TMPDIR to a directory on a (preferably fast) filesystem with 170GB of free space, and be prepared to wait a bit.
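For example, a sketch of the full (non-merge) sort, assuming /mnt/fast/tmp is a directory with enough free space (the path is only illustrative):

TMPDIR=/mnt/fast/tmp sort -u A.txt B.txt > C.txt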

The result however will be sorted, which will speed up the merging of another file later on if need be.

sort will use temporary files, so it can work even on a system with little memory. But the more memory you have, the better. With GNU sort, see also the --compress-program and --buffer-size options, which can help you tune for better performance. If the sort order doesn't matter to you, fix the locale to C (with LC_ALL=C sort ...), as that would be the most efficient.
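For example, with GNU sort, something along these lines (the buffer size is only illustrative, and zstd is an assumption; any compressor installed on the system that reads stdin and writes stdout would do):

LC_ALL=C sort -u --buffer-size=8G --compress-program=zstd A.txt B.txt > C.txt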

1
printf "">MergeFile
cat A.txt B.txt | 
while IFS= read -r line; do 
  if [ ! -z "$line" ]; then
    if ! grep -Fxqe "$line" MergeFile; then
      echo "$line">>MergeFile;
    fi
  fi
done

Explanation

Create a new, empty MergeFile with:
printf "">MergeFile # or touch MergeFile, if the file does not already exist

Pipe the two files into a while loop:
cat A.txt B.txt |

Read each line:
while IFS= read -r line; do

Skip blank lines:
if [ ! -z "$line" ]; then
(if you want to keep the first blank line, add it back in an else clause)

If grep finds no match, it's the first time the line goes into MergeFile (i.e., it's unique):
if ! grep -Fxqe "$line" MergeFile; then

Add it to the MergeFile:
echo "$line">>MergeFile;

MikeD
  • Let's be serious. We're talking of 170GB of data and you're suggesting to run up to 5 commands (the grep one reading a file that also ends up being several GB large) for each of the lines in the files. See you in the year 12421 for the results. See also Why is using a shell loop to process text considered bad practice? – Stéphane Chazelas Mar 10 '17 at 22:16
  • Good on you for quoting your variables though! And it's true that approach doesn't have memory exhaustion issues. There are several other issues like the missing IFS=, --, the fact that fgrep will look for substrings (you'd want if ! grep -Fxqe "$line" MergFile) – Stéphane Chazelas Mar 10 '17 at 22:27
  • Thanks, I'll keep this in mind for large files and I'll implement your suggestions for this one. I appreciate your feedback. – MikeD Mar 10 '17 at 22:27
  • Yes, I was trying to avoid the memory issues, but like you indicated, it would be extremely time consuming with these large files. – MikeD Mar 10 '17 at 22:45
0

Try this command:

cat A.txt B.txt | awk '!seen[$0]++' > C.txt

It may take a while with files this large...