Does anyone know of any linux tool specifically designed to treat files as sets and perform set operations on them? Like difference, intersection, etc?
13 Answers
Assuming elements are strings of characters other than NUL and newline (beware that newline is valid in file names though), you can represent a set as a text file with one element per line and use some of the standard Unix utilities.
Set Membership
$ grep -Fxc 'element' set # outputs 1 if element is in set
# outputs >1 if set is a multi-set
# outputs 0 if element is not in set
$ grep -Fxq 'element' set # returns 0 (true) if element is in set
# returns 1 (false) if element is not in set
$ awk '$0 == "element" { s=1; exit }; END { exit !s }' set
returns 0 if element is in set, 1 otherwise.
$ awk -v e='element' '$0 == e { s=1; exit } END { exit !s }'
Set Intersection
$ comm -12 <(sort set1) <(sort set2) # outputs intersect of set1 and set2
$ grep -xF -f set1 set2
$ sort set1 set2 | uniq -d
$ join -t <(sort A) <(sort B)
$ awk '!done { a[$0]; next }; $0 in a' set1 done=1 set2
Set Equality
$ cmp -s <(sort set1) <(sort set2) # returns 0 if set1 is equal to set2
# returns 1 if set1 != set2
$ cmp -s <(sort -u set1) <(sort -u set2)
collapses multi-sets into sets and does the same as previous
$ awk '{ if (!($0 in a)) c++; a[$0] }; END{ exit !(c==NR/2) }' set1 set2
returns 0 if set1 == set2
returns 1 if set1 != set2
$ awk '{ a[$0] }; END{ exit !(length(a)==NR/2) }' set1 set2
same as previous, requires >= gnu awk 3.1.5
Set Cardinality
$ wc -l < set # outputs number of elements in set
$ awk 'END { print NR }' set
$ sed '$=' set
Subset Test
$ comm -23 <(sort -u subset) <(sort -u set) | grep -q '^'
# returns true iff subset is not a subset of set (has elements not in set)
$ awk '!done { a[$0]; next }; { if !($0 in a) exit 1 }' set done=1 subset
returns 0 if subset is a subset of set
returns 1 if subset is not a subset of set
Set Union
$ cat set1 set2 # outputs union of set1 and set2
# assumes they are disjoint
$ awk 1 set1 set2 # ditto
$ cat set1 set2 ... setn # union over n sets
$ sort -u set1 set2 # same, but doesn't assume they are disjoint
$ sort set1 set2 | uniq
$ awk '!a[$0]++' set1 set2 # ditto without sorting
Set Complement
$ comm -23 <(sort set1) <(sort set2)
# outputs elements in set1 that are not in set2
$ grep -vxF -f set2 set1 # ditto
$ sort set2 set2 set1 | uniq -u # ditto
$ awk '!done { a[$0]; next }; !($0 in a)' set2 done=1 set1
Set Symmetric Difference
$ comm -3 <(sort set1) <(sort set2) | tr -d '\t' # assumes not tab in sets
# outputs elements that are in set1 or in set2 but not both
$ sort set1 set2 | uniq -u
$ cat <(grep -vxF -f set1 set2) <(grep -vxF -f set2 set1)
$ grep -vxF -f set1 set2; grep -vxF -f set2 set1
$ awk '!done { a[$0]; next }; $0 in a { delete a[$0]; next }; 1;
END { for (b in a) print b }' set1 done=1 set2
Power Set
All possible subsets of a set displayed space separated, one per line:
$ p() { [ "$#" -eq 0 ] && echo || (shift; p "$@") |
while read r; do printf '%s %s\n%s\n' "$1" "$r" "$r"; done; }
$ p $(cat set)
(assumes elements don't contain SPC, TAB (assuming the default value of $IFS
), backslash, wildcard characters).
Set Cartesian Product
$ while IFS= read -r a; do while IFS= read -r b; do echo "$a, $b"; done < set1; done < set2
$ awk '!done { a[$0]; next }; { for (i in a) print i, $0 }' set1 done=1 set2
Disjoint Set Test
$ comm -12 <(sort set1) <(sort set2) # does not output anything if disjoint
$ awk '++seen[$0] == 2 { exit 1 }' set1 set2 # returns 0 if disjoint
# returns 1 if not
Empty Set Test
$ wc -l < set # outputs 0 if the set is empty
# outputs >0 if the set is not empty
$ grep -q '^' set # returns true (0 exit status) unless set is empty
$ awk '{ exit 1 }' set # returns true (0 exit status) if set is empty
Minimum
$ sort set | head -n 1 # outputs the minimum (lexically) element in the set
$ awk 'NR == 1 { min = $0 }; $0 < min { min = $0 }; END { print min }'
ditto, but does numeric comparison when elements are numerical
Maximum
$ sort test | tail -n 1 # outputs the maximum element in the set
$ sort -r test | head -n 1
$ awk 'NR == 1 || $0 > max { max = $0 }; END { print max }'
ditto, but does numeric comparison when elements are numerical
All available at http://www.catonmat.net/blog/set-operations-in-unix-shell-simplified/

- 544,893

- 1,406
Sort of. You need to deal with sorting yourself, but comm
can be used to do that, treating each line as a set member: -12
for intersection, -13
for difference. (And -23
gives you flipped difference, that is, set2 - set1
instead of set1 - set2
.) Union is sort -u
in this setup.
-
2Indeed, comm seems to do most of the stuff. Although the arguments are very unintuitive. Thanks! – nilton Apr 25 '11 at 22:09
-
Very useful answer, but I believe the description for the difference is flipped around. To see what is in the first file but not the second file
(set1 - set2)
you usecomm -23
to supress the second and third columns, leaving only the first column. Likewise, to see the difference(set2 - set1)
you usecomm -13
– Tony Feb 04 '20 at 23:34
The tiny console tool “setop” is now availible in Debian Stretch and in Ubuntu since 16.10. You can get it via
sudo apt install setop
Here are some examples. The sets to be operated on are given as different input files:
setop input # is equal to "sort input --unique"
setop file1 file2 --union # option --union is default and can be omitted
setop file1 file2 file3 --intersection # more than two inputs are allowed
setop file1 - --symmetric-difference # ndash stands for standard input
setop file1 -d file2 # all elements contained in 1 but not 2
Boolean queries return only EXIT_SUCCESS
in case of true, and EXIT_FAILURE
as well as a message otherwise. This way, setop can be used in the shell.
setop inputfile --contains "value" # is element value contained in input?
setop A.txt B.txt --equal C.txt # union of A and B equal to C?
setop bigfile --subset smallfile # analogous --superset
setop -i file1 file2 --is-empty # intersection of 1 and 2 empty (disjoint)?
It is also possible to precisly describe how the input streams shall be parsed, actually by regular expressions:
setop input.txt --input-separator "[[:space:]-]"
means that a whitespace (i. e.\v
\t
\n
\r
\f
or space) or a minus sign is interpreted as a separator between elements (default is new line, i. e. every line of the input file is one element)setop input.txt --input-element "[A-Za-z]+"
means that elements are only words consisting of latin characters, all other characters are considered to be separators between elements
Furthermore, you can
--count
all elements of the output set,--trim
all input elements (i. e. erase all unwanted preceding and succeeding characters like space, comma etc.),- consider empty elements as valid via
--include-empty
, --ignore-case
,- set the
--output-separator
between elements of the output stream (default is\n
), - and so on.
See man setop
or github.com/phisigma/setop for more information.

- 101
I don't know of a specific tool but you can use Python, and its set class and operators, to write a little script to do that.
For exampe:
Python> s1 = set(os.listdir("/bin"))
Python> s2 = set(os.listdir("/usr/bin"))
Python> s1 & s2
set(['awk',
'basename',
'chroot', ...
-
1
-
-
So I'm not sure but the title of the question is about how to perform Set operations on files. Maybe one set element per line of the file (this is one interpretation tackled in code by @Gilles 'SO- stop being evil' )? But a quick reading of your code would suggest you're performing Set operations on directories, so... . – jubilatious1 Jun 02 '23 at 21:40
-
@jubilatious1 The
listdir
returns a list of files in that directory. So this example is operating on sets of file names. – Keith Jun 04 '23 at 22:09 -
-
@jubilatious1 not what I inferred from the original question. Even so, it's just an example. I figure the OP should be able to adapt the idea to his/her needs. – Keith Jun 08 '23 at 03:06
If you see a file as a set of lines, and the files are sorted, there's comm
.
If you see a file as a (multi)set of lines, and the lines aren't sorted, grep
can do difference and intersection (it achieves set difference and intersection, but doesn't respect count for multisets). Union is just cat
.
grep -xF -f small large >intersection
grep -vxF -f small large >difference
cat small large >union

- 829,060
With zsh
arrays (zsh
arrays can contain any arbitrary sequence of bytes, even 0).
set membership
if ((${array[(Ie)$element]})); then
echo '$element is in $array'
fi
(using the I
array subscript flag, to get the index of the last occurrence of $element
in the array (or 0 if not found). Remove e
(for e
xact) for $element
to be taken as a pattern)
if ((n = ${(M)#array:#$element})); then
echo "\$element is found $n times in \$array'
fi
${array:#pattern}
being a variation on ksh's ${var#pattern}
that removes the elements that match the pattern as opposed to just remove the leading part that matches the pattern. The (M)
(for matched) reverses the meaning and removes all but the matched elements (use $~element
for it to be taken as a pattern).
set intersection
common=("${(@)set1:*set2}")
${set1:*set2}
does the array intersection, but the "${(@)...}"
syntax is needed to preserve empty elements.
set equality
[[ ${(j: :)${(q)array1}} = ${(j: :)${(q)array2}} ]]
Tests whether the arrays are identical (and in the same order). The q
parameter expansion flag quotes the elements (to avoid problems with things like a=(1 "2 3")
vs b=("1 2" 3)
), and (j: :)
joins them with space before doing a string comparison.
To check that they have the same elements, irrespective of order, use the o
flag to order them. See also the u
flag (unique) to remove duplicates.
[[ ${(j: :)${(qo)array1}} = ${(j: :)${(qo)array2}} ]]
set cardinality
n=$#array
subset test
if ((${#array1:*array2} == ${#array2})); then
echo '$array2 is included in $array1'
fi
(assuming arrays with unique values)
union
union=("$array1[@]" "$array2[@]")
(see typeset -U
above or the u
parameter expansion flag to take care of duplicates). Again if the empty string is not one of the possible values, you can simplify to:
union=($array1 $array2)
complement
complement=("${(@)array1:|array2}")
for the elements of $array1
that are not in $array2
.
minimum/maximum (lexical comparison)
min=${${(o)array}[1]} max=${${(o)array}[-1]}
minimum/maximum (decimal integer comparison)
min=${${(no)array}[1]} max=${${(no)array}[-1]}

- 544,893
I've written a command tool in Go for this question. The source code is here on GitHub.
You can use homebrew
to install it on macOS:
brew install suzaku/homebrew-rose/rose
Usage is simply:
rose and file1 file2
for intersection.rose or file1 file2
for union.rose sub file1 file2
for subtraction.

- 22,803

- 121
- 3
Using Raku (formerly known as Perl_6)
Raku has two choices of object-types when it comes to set-related operations. They are:
Set
object type (immutable),SetHash
object type (mutable).
Unicode symbols can be used--e.g. infix ∩
for intersection--or the ASCII equivalent, e.g. infix (&)
for intersection. Note for the set results below, the first two lines are identical; in the third (and final) line only the set operator changes.
Below, taking the (famous, 72-word) "We hold these truths to be self-evident..." sentence from the US Declaration of Independence, and performing set-comparisons to the (equally famous, 38-word) Preamble of the US Constitution:
Intersection: infix (&)
, infix ∩
~$ raku -e 'my Set $a .= new(.comb(/<alpha>+/)>>.lc) given "declaration.txt".IO.words;
my Set $b .= new(.comb(/<alpha>+/)>>.lc) given "preamble.txt".IO.words;
.sort>>.keys.put given $a (&) $b; #returns 10 words'
and form in liberty of people secure the to we
Union: infix (|)
, infix ∪
~$ raku -e 'my Set $a .= new(.comb(/<alpha>+/)>>.lc) given "declaration.txt".IO.words;
my Set $b .= new(.comb(/<alpha>+/)>>.lc) given "preamble.txt".IO.words;
.sort>>.keys.put given $a (|) $b; #returns 100 words'
a abolish all alter america among and any are as be becomes blessings by certain common consent constitution created creator defence deriving destructive do domestic effect endowed ends equal establish evident for form foundation from general governed government governments happiness hold in institute instituted insure is it its just justice laying liberty life likely men more most new of on or ordain order organizing our ourselves people perfect posterity powers principles promote provide pursuit right rights safety secure seem self shall states such that the their them these they this to tranquility truths unalienable union united we welfare whenever with
Symmetric set difference: $a (^) $b
, or use infix ⊖
~$ raku -e 'my Set $a .= new(.comb(/<alpha>+/)>>.lc) given "declaration.txt".IO.words;
my Set $b .= new(.comb(/<alpha>+/)>>.lc) given "preamble.txt".IO.words;
.sort>>.keys.put given $a (^) $b; #returns 90 words'
a abolish all alter america among any are as be becomes blessings by certain common consent constitution created creator defence deriving destructive do domestic effect endowed ends equal establish evident for foundation from general governed government governments happiness hold institute instituted insure is it its just justice laying life likely men more most new on or ordain order organizing our ourselves perfect posterity powers principles promote provide pursuit right rights safety seem self shall states such that their them these they this tranquility truths unalienable union united welfare whenever with
Asymmetric set difference: $a (-) $b
, or use infix ∖
~$ raku -e 'my Set $a .= new(.comb(/<alpha>+/)>>.lc) given "declaration.txt".IO.words;
my Set $b .= new(.comb(/<alpha>+/)>>.lc) given "preamble.txt".IO.words;
.sort>>.keys.put given $a (-) $b; #returns 62 words'
abolish all alter among any are as be becomes by certain consent created creator deriving destructive effect endowed ends equal evident foundation from governed government governments happiness hold institute instituted is it its just laying life likely men most new on or organizing powers principles pursuit right rights safety seem self shall such that their them these they truths unalienable whenever with
Asymmetric set difference: $b (-) $a
, or use infix ∖
~$ raku -e 'my Set $a .= new(.comb(/<alpha>+/)>>.lc) given "declaration.txt".IO.words;
my Set $b .= new(.comb(/<alpha>+/)>>.lc) given "preamble.txt".IO.words;
.sort>>.keys.put given $a (-) $b; #returns 28 words'
a america blessings common constitution defence do domestic establish for general insure justice more ordain order our ourselves perfect posterity promote provide states this tranquility union united welfare
Set operators that return a Bool
ean
(As above, Unicode symbols can be used, but many have easy-to-type ASCII variants):
- infix
(elem)
, infix∈
- infix
∉
- infix
(cont)
, infix∋
- infix
∌
- infix
(<=)
, infix⊆
- infix
⊈
- infix
(<)
, infix⊂
- infix
⊄
- infix
(>=)
, infix⊇
- infix
⊉
- infix
(>)
, infix⊃
- infix
⊅
- infix
(==)
, infix≡
- infix
≢
https://www.archives.gov/founding-docs/declaration-transcript
https://www.archives.gov/founding-docs/constitution
https://docs.raku.org/language/setbagmix#Sets,_bags,_and_mixes
https://raku.org

- 3,195
- 8
- 17
I wrote a little tool to do this which has been quite useful to me in various places. The UI is unpolished and I'm not sure about the performance characteristics for very large files (since it reads the whole list into memory) but "it works for me". The program is at https://github.com/nibrahim/lines. It's in Python. You can get it using pip install lines
.
It currently supports union, intersection, difference and symmetric difference of two files. Each line of the input file is treated as an element of a set.
It also has two extra operations. One of squeeze out blank lines in a file and the second (which has been very useful to me) is to look through the file and divide it into sets of similar strings. I needed this to look for files in a list that didn't match the general pattern.
I'd welcome feedback.

- 198
The filesystem treats filenames (whole filenames, including paths) as unique.
Operations?
You may copy the files in a/ and b/ to the empty directory c/, to get a new, union set.
With file-tests like -e name
and loops or find, you may check for files existing in two or more directories, to get the intersection, or the difference.

- 10,482
-
2I meant treating the contents of files as the elements of a set (let's say, one element per line), and the files themselves as sets. – nilton Apr 25 '11 at 22:04
Best answer here: Setdown (a dedicated tool)
I wrote a program called setdown that performs Set operations from the cli.
It can perform set operations by writing a definition similar to what you would write in a Makefile:
someUnion: "file-1.txt" \/ "file-2.txt"
someIntersection: "file-1.txt" /\ "file-2.txt"
someDifference: someUnion - someIntersection
Its pretty cool and you should check it out. I personally don't recommend using ad-hoc commands that were not built for the job to perform set operations.It won't work well when you really need to do many set operations or if you have any set operations that depend on each other. Not only that but setdown lets you write set operations that depend on other set operations!
At any rate, I think that it's pretty cool and you should totally check it out.

- 101
Sample pattern for multiple files (intersection in this case):
eval `perl -le 'print "cat ",join(" | grep -xF -f- ", @ARGV)' t*`
Expands to:
cat t1 | grep -xF -f- t2 | grep -xF -f- t3
Test files:
seq 0 20 | tee t1; seq 0 2 20 | tee t2; seq 0 3 20 | tee t3
Output:
0
6
12
18

- 141
sort set1 set2 | uniq -d
doesn't work for multi-sets. Consider usingsort <(sort -u set1) <(sort -u set2) | uniq -d
. – neo Aug 28 '16 at 13:15