I have two files:
A.txt - about 90GB
B.txt - about 80GB
I want to combine the two files and remove duplicated lines.
How do I do this?
If commands other than awk are better for this job, please let me know.
You probably cannot use awk hashes, as that would mean storing all the unique lines in memory. So it could only be used if the output file is significantly smaller than the available memory on the system.
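For reference, a minimal sketch of that awk-hash approach (the same idea as the last answer below), which keeps every unique line seen so far in an associative array:
awk '!seen[$0]++' A.txt B.txt > C.txt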
If the input files are already sorted, you could do:
sort -mu A.txt B.txt > C.txt
You may need to change the locale to one that has the same sorting order as the one used to sort the files.
This approach never needs to hold more than one line from each file in memory at a time.
If they were not sorted, remove the -m, set $TMPDIR to a directory on a (preferably fast) filesystem with at least 170GB of free space, and be prepared to wait a while.
The result, however, will be sorted, which will speed up merging in another file later on if need be.
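For the unsorted case, the complete invocation could look like this (the temporary directory path is only a placeholder; point it at any filesystem with enough free space):
TMPDIR=/path/to/big/tmp sort -u A.txt B.txt > C.txt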
sort will use temporary files, so it can work even on a system with little memory, though the more memory you have the better. With GNU sort, see also the --compress-program and --buffer-size options, which can help you tune for better performance. If the sort order doesn't matter to you, fix the locale to C (with LC_ALL=C sort ...), as that is the most efficient.
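Putting those options together, here is a sketch for GNU sort (the 8G buffer size and the choice of lzop as the compressor are illustrative assumptions; adjust them to your hardware and to whatever compressor you have installed):
LC_ALL=C sort -u --buffer-size=8G --compress-program=lzop A.txt B.txt > C.txt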
printf "">MergeFile cat A.txt B.txt | while IFS= read -r line; do if [ ! -z "$line" ]; then if ! grep -Fxqe "$line" MergFile; then echo "$line">>MergeFile; fi fi done
Explanation
Create a new MergeFile with:
printf "">MergeFile
# or optionally: touch MergeFile
Pipe the two files to a while loop:
cat A.txt B.txt |
Read each line:
while IFS= read -r line; do
Handle blank lines:
if [ ! -z "$line" ]; then
(if you want to keep a single blank line, add it back in an else clause)
An empty grep result means this is the first time the line is going into MergeFile (i.e., it's unique):
if ! grep -Fxqe "$line" MergeFile; then
Add it to the MergeFile:
echo "$line">>MergeFile;
That runs one grep invocation (each one reading a file that also ends up being several GB large) for each of the lines in the input files. See you in the year 12421 for the results. See also Why is using a shell loop to process text considered bad practice?
– Stéphane Chazelas
Mar 10 '17 at 22:16
There are other issues too: the missing IFS=, the missing --, and the fact that fgrep will look for substrings (you'd want if ! grep -Fxqe "$line" MergeFile).
– Stéphane Chazelas
Mar 10 '17 at 22:27
Try this command:
cat A.txt B.txt | awk '!seen[$0]++' > C.txt
It may take a while with files this large...