Your sort solution may be a bit faster if you sort the files separately, then use comm to find the non-common lines:
sort a.txt -o a.txt
sort b.txt -o b.txt
comm -3 a.txt b.txt | sed 's/^\t//'
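For illustration (a minimal sketch with hypothetical sample data), comm -3 prints lines unique to the first file in column 1 and lines unique to the second file in column 2, the latter indented by a tab, which is what the sed removes:
printf '%s\n' apple banana cherry > a.txt    # hypothetical sample data, already sorted
printf '%s\n' banana cherry date  > b.txt
comm -3 a.txt b.txt                          # prints "apple" and a tab-indented "date"
comm -3 a.txt b.txt | sed 's/^\t//'          # prints "apple" and "date"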
Alternatively, if one of your data files is not too large, you can read it all into an associative array and then compare the other file line by line, e.g. with awk:
awk '
ARGIND==1 { item[$0] = 1; next }
ARGIND==2 { if(!item[$0])print; else item[$0] = 2 }
END { for(i in item)if(item[i]==1)print i }
' a.txt b.txt
In the above, ARGIND is the index of the current file argument (a gawk extension). The first line saves the lines of file 1 in the array item. The next line checks whether the current line from file 2 is in this array: if not, it is printed; otherwise we note that the item was seen in both files. Finally, we print the items that were not seen in both files.
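If your awk is not GNU awk, a sketch of the same idea using the portable FNR==NR idiom instead of ARGIND (assuming exactly two input files):
awk '
FNR==NR { item[$0] = 1; next }                        # first file: remember each line
        { if(!item[$0]) print; else item[$0] = 2 }    # second file: print lines not in the first, mark common ones
END     { for(i in item) if(item[i]==1) print i }     # lines seen only in the first file
' a.txt b.txt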
If one of your files is much smaller than the other, it is best to put it first in the args so the item array stays small:
if [ $(wc -l <a.txt) -lt $(wc -l <b.txt) ]
then args="a.txt b.txt"
else args="b.txt a.txt"
fi
awk '
ARGIND==1 { item[$0] = 1; next }
ARGIND==2 { if(!item[$0])print; else item[$0] = 2 }
END { for(i in item)if(item[i]==1)print i }
' $args
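Note that $args is unquoted and relies on word splitting, so this assumes filenames without spaces. If that assumption does not hold, one possible variant (same logic, sketched with a hypothetical compare.awk holding the script above) uses a bash array:
if [ "$(wc -l <a.txt)" -lt "$(wc -l <b.txt)" ]
then files=(a.txt b.txt)
else files=(b.txt a.txt)
fi
awk -f compare.awk "${files[@]}"    # compare.awk is a hypothetical file containing the ARGIND script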