Symmetric Difference Pipe?

Question

I have a command, a, that produces a list of newline-terminated strings, and a file, b.txt, that contains a list of newline-terminated strings. I need a command that calculates the symmetric difference of the output of a and the contents of b.txt. Ideally this command should operate in a pipeline, as a is potentially very slow.

Venn diagram if you like those (Credits to Wikipedia):

[Image: Venn diagram of the symmetric difference operation]

For those more example oriented:

a outputs

apple
car

b.txt

banana
car
dog

Then the result should be

apple
banana
dog

Okay, I read more manpages, and I've figured out a solution if I can redirect a to a file: sort {a,b}.txt | uniq -u. a is unfortunately slow (~one second per line, >100 lines), so if anyone has a pipe-based solution, I'll accept that. — Nathan Ringo, Jan 01 '16 at 07:28
See also Linux tools to treat files as sets and perform set operations on them — Gilles 'SO- stop being evil', Jan 01 '16 at 23:28

score 4 · Answer 1 · answered Jan 01 '16 at 11:57

You can use process substitution to treat the output of a command as a file.

comm -3 <(a | sort) <(sort b.txt)
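Process substitution is available in bash, zsh and ksh, but not in plain POSIX sh. As a rough sketch of the same idea without it, the sorted output of a can be fed to comm on standard input, which comm accepts as -. The function a below is only a stand-in for the slow command, and the temporary directory and the b.sorted file name are made up for the demo. comm -3 indents lines that are unique to its second input with a tab, so tr strips it here, on the assumption that the data itself contains no tabs.

```shell
# Hypothetical stand-in for the slow command "a" from the question.
a() { printf '%s\n' apple car; }

# Scratch directory and sample b.txt (both assumptions for this demo).
workdir=$(mktemp -d)
printf '%s\n' banana car dog > "$workdir/b.txt"

# comm needs sorted input, so sort b.txt once up front.
sort "$workdir/b.txt" > "$workdir/b.sorted"

# Feed the sorted output of "a" to comm on stdin ("-").
# -3 suppresses the column of lines common to both inputs;
# tr removes the tab that marks lines unique to the second input.
result=$(a | sort | comm -3 - "$workdir/b.sorted" | tr -d '\t')

printf '%s\n' "$result"
rm -rf "$workdir"
```

Either way, sort has to read all of a's output before comm can print anything, so this removes the temporary file for a but not the wait.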

Barmar

score 2 · Accepted Answer · answered Jan 01 '16 at 08:09

Your sort solution may be a bit faster if you sort the files separately, then use comm to find the non-common lines:

sort a.txt -o a.txt
sort b.txt -o b.txt
comm -3 a.txt b.txt | sed 's/^\t//'

Alternatively, if one of your data files is not too large, you can read it all into an associative array and then compare the other file against it line by line. For example, with awk (ARGIND requires GNU awk):

awk '
ARGIND==1 { item[$0] = 1; next }
ARGIND==2 { if(!item[$0])print; else item[$0] = 2 }
END   { for(i in item)if(item[i]==1)print i }
' a.txt b.txt

In the above, ARGIND (a GNU awk extension) counts the file arguments. The first rule saves the lines of file 1 in the array item. The next rule checks whether the current line from file 2 is in this array. If not, the line is printed; otherwise we note that this item was seen in both files. Finally, we print the items from file 1 that were never seen in file 2.
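For an awk without ARGIND, the same logic can be sketched with the common NR == FNR idiom, reading b.txt as the first file and the output of a from standard input (-). This is only an illustration under a few assumptions: the function a is a stand-in for the slow command, the temporary directory is made up for the demo, and b.txt is not empty (with an empty first file, NR == FNR would also be true for stdin). The trailing sort is there only to make the output order predictable, since awk's for-in order is unspecified.

```shell
# Hypothetical stand-in for the slow command "a" from the question.
a() { printf '%s\n' apple car; }

# Scratch directory and sample b.txt (both assumptions for this demo).
workdir=$(mktemp -d)
printf '%s\n' banana car dog > "$workdir/b.txt"

result=$(a | awk '
# First file (b.txt): remember every line.
NR == FNR     { item[$0] = 1; next }
# Line from stdin that b.txt does not contain: print it right away.
!($0 in item) { print; next }
# Line present in both inputs: mark it so END skips it.
              { item[$0] = 2 }
# Whatever is still marked 1 appeared only in b.txt.
END           { for (i in item) if (item[i] == 1) print i }
' "$workdir/b.txt" - | sort)

printf '%s\n' "$result"
rm -rf "$workdir"
```

Without the final sort, lines that occur only in a can be emitted while a is still running (subject to awk's output buffering); the lines that occur only in b.txt can only be known once a has finished.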

If one of your files is much smaller than the other, it is best to put it first in the args so the item array stays small:

if [ $(wc -l <a.txt) -lt $(wc -l <b.txt) ]
then args="a.txt b.txt"
else args="b.txt a.txt"
fi
awk '
ARGIND==1 { item[$0] = 1; next }
ARGIND==2 { if(!item[$0])print; else item[$0] = 2 }
END   { for(i in item)if(item[i]==1)print i }
' $args

You can always replace one of the filenames with - and awk will then read stdin for that file, e.g. cat a.txt | awk '...' b.txt -. However, it is only when you get to the end of both files that you can show which items in b.txt did not appear in a.txt. — meuh, Jan 01 '16 at 10:00
How could it possibly be faster? a only writes a line per second, so processing time is not the bottleneck. — mikeserv, Jan 01 '16 at 10:37

jimmij · Answer 3 · 2016-01-01T09:59:10.890

A good tool for seeing differences is diff; you just need to play a little with its non-trivial options to format the output properly:

diff --unchanged-group-format= --new-group-format="%>" a b.txt

If a is not a file but a pipe, use - instead:

echo 'apple
car' | diff --unchanged-group-format= --new-group-format='%>' - b.txt

Output:

apple
banana
dog

Or, if you don't care where in the file a line appeared:

echo 'apple
car' | sort | diff --unchanged-group-format= --new-group-format='%>' - <(sort b.txt)
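One thing to watch when scripting this: diff exits with status 1 whenever its inputs differ, which aborts a script running under set -e. A small sketch of the sorted variant, assuming GNU diff (the group-format options are GNU extensions), with the function a standing in for the real command and a made-up temporary directory:

```shell
# Hypothetical stand-in for the slow command "a" from the question.
a() { printf '%s\n' apple car; }

# Scratch directory holding a sorted copy of the sample b.txt (demo assumption).
workdir=$(mktemp -d)
printf '%s\n' banana car dog | sort > "$workdir/b.sorted"

# diff returns 1 when the inputs differ, which is the normal case here,
# so "|| true" keeps that status from being treated as a failure.
result=$(a | sort | diff --unchanged-group-format= --new-group-format='%>' - "$workdir/b.sorted" || true)

printf '%s\n' "$result"
rm -rf "$workdir"
```

Note that || true also hides a real failure (exit status 2), so a stricter script might inspect the status instead.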
