94

Does anyone know of any linux tool specifically designed to treat files as sets and perform set operations on them? Like difference, intersection, etc?

nilton
  • 1,043

13 Answers13

126

Assuming elements are strings of characters other than NUL and newline (beware that newline is valid in file names though), you can represent a set as a text file with one element per line and use some of the standard Unix utilities.

Set Membership

$ grep -Fxc 'element' set   # outputs 1 if element is in set
                            # outputs >1 if set is a multi-set
                            # outputs 0 if element is not in set

$ grep -Fxq 'element' set # returns 0 (true) if element is in set # returns 1 (false) if element is not in set

$ awk '$0 == "element" { s=1; exit }; END { exit !s }' set

returns 0 if element is in set, 1 otherwise.

$ awk -v e='element' '$0 == e { s=1; exit } END { exit !s }'

Set Intersection

$ comm -12 <(sort set1) <(sort set2)  # outputs intersect of set1 and set2

$ grep -xF -f set1 set2

$ sort set1 set2 | uniq -d

$ join -t <(sort A) <(sort B)

$ awk '!done { a[$0]; next }; $0 in a' set1 done=1 set2

Set Equality

$ cmp -s <(sort set1) <(sort set2) # returns 0 if set1 is equal to set2
                                   # returns 1 if set1 != set2

$ cmp -s <(sort -u set1) <(sort -u set2)

collapses multi-sets into sets and does the same as previous

$ awk '{ if (!($0 in a)) c++; a[$0] }; END{ exit !(c==NR/2) }' set1 set2

returns 0 if set1 == set2

returns 1 if set1 != set2

$ awk '{ a[$0] }; END{ exit !(length(a)==NR/2) }' set1 set2

same as previous, requires >= gnu awk 3.1.5

Set Cardinality

$ wc -l < set     # outputs number of elements in set

$ awk 'END { print NR }' set

$ sed '$=' set

Subset Test

$ comm -23 <(sort -u subset) <(sort -u set) | grep -q '^'
# returns true iff subset is not a subset of set (has elements not in set)

$ awk '!done { a[$0]; next }; { if !($0 in a) exit 1 }' set done=1 subset

returns 0 if subset is a subset of set

returns 1 if subset is not a subset of set

Set Union

$ cat set1 set2     # outputs union of set1 and set2
                    # assumes they are disjoint

$ awk 1 set1 set2 # ditto

$ cat set1 set2 ... setn # union over n sets

$ sort -u set1 set2 # same, but doesn't assume they are disjoint

$ sort set1 set2 | uniq

$ awk '!a[$0]++' set1 set2 # ditto without sorting

Set Complement

$ comm -23 <(sort set1) <(sort set2)
# outputs elements in set1 that are not in set2

$ grep -vxF -f set2 set1 # ditto

$ sort set2 set2 set1 | uniq -u # ditto

$ awk '!done { a[$0]; next }; !($0 in a)' set2 done=1 set1

Set Symmetric Difference

$ comm -3 <(sort set1) <(sort set2) | tr -d '\t'  # assumes not tab in sets
# outputs elements that are in set1 or in set2 but not both

$ sort set1 set2 | uniq -u

$ cat <(grep -vxF -f set1 set2) <(grep -vxF -f set2 set1)

$ grep -vxF -f set1 set2; grep -vxF -f set2 set1

$ awk '!done { a[$0]; next }; $0 in a { delete a[$0]; next }; 1; END { for (b in a) print b }' set1 done=1 set2

Power Set

All possible subsets of a set displayed space separated, one per line:

$ p() { [ "$#" -eq 0 ] && echo || (shift; p "$@") |
        while read r; do printf '%s %s\n%s\n' "$1" "$r" "$r"; done; }
$ p $(cat set)

(assumes elements don't contain SPC, TAB (assuming the default value of $IFS), backslash, wildcard characters).

Set Cartesian Product

$ while IFS= read -r a; do while IFS= read -r b; do echo "$a, $b"; done < set1; done < set2

$ awk '!done { a[$0]; next }; { for (i in a) print i, $0 }' set1 done=1 set2

Disjoint Set Test

$ comm -12 <(sort set1) <(sort set2)  # does not output anything if disjoint

$ awk '++seen[$0] == 2 { exit 1 }' set1 set2 # returns 0 if disjoint # returns 1 if not

Empty Set Test

$ wc -l < set            # outputs 0  if the set is empty
                         # outputs >0 if the set is not empty

$ grep -q '^' set # returns true (0 exit status) unless set is empty

$ awk '{ exit 1 }' set # returns true (0 exit status) if set is empty

Minimum

$ sort set | head -n 1   # outputs the minimum (lexically) element in the set

$ awk 'NR == 1 { min = $0 }; $0 < min { min = $0 }; END { print min }'

ditto, but does numeric comparison when elements are numerical

Maximum

$ sort test | tail -n 1    # outputs the maximum element in the set

$ sort -r test | head -n 1

$ awk 'NR == 1 || $0 > max { max = $0 }; END { print max }'

ditto, but does numeric comparison when elements are numerical

All available at http://www.catonmat.net/blog/set-operations-in-unix-shell-simplified/

llhuii
  • 1,406
  • 1
    I think the Python version is much simpler and more intuitive. ;-) – Keith Apr 22 '11 at 23:37
  • I think this is the most complete answer. Unfortunately which commands to run or which arguments (comm -12, -23, -13) in each case is not always intuitive as "intersection" or "difference". Maybe'll create a wrapper around them, since I'm always using these things. – nilton Apr 25 '11 at 22:08
  • I ran [pol@localhost inst]$ grep -xc and INSTALL-BINARY 0 [pol@localhost inst]$ but I don't understand what it means. The word "and" should occur many times in the file. What am I doing wrong? – Vérace Feb 03 '15 at 11:06
  • 1
    Set intersection: sort set1 set2 | uniq -d doesn't work for multi-sets. Consider using sort <(sort -u set1) <(sort -u set2) | uniq -d. – neo Aug 28 '16 at 13:15
15

Sort of. You need to deal with sorting yourself, but comm can be used to do that, treating each line as a set member: -12 for intersection, -13 for difference. (And -23 gives you flipped difference, that is, set2 - set1 instead of set1 - set2.) Union is sort -u in this setup.

t7ko
  • 515
geekosaur
  • 32,047
  • 2
    Indeed, comm seems to do most of the stuff. Although the arguments are very unintuitive. Thanks! – nilton Apr 25 '11 at 22:09
  • Very useful answer, but I believe the description for the difference is flipped around. To see what is in the first file but not the second file (set1 - set2) you use comm -23 to supress the second and third columns, leaving only the first column. Likewise, to see the difference (set2 - set1) you use comm -13 – Tony Feb 04 '20 at 23:34
10

The tiny console tool “setop” is now availible in Debian Stretch and in Ubuntu since 16.10. You can get it via sudo apt install setop

Here are some examples. The sets to be operated on are given as different input files: setop input # is equal to "sort input --unique" setop file1 file2 --union # option --union is default and can be omitted setop file1 file2 file3 --intersection # more than two inputs are allowed setop file1 - --symmetric-difference # ndash stands for standard input setop file1 -d file2 # all elements contained in 1 but not 2

Boolean queries return only EXIT_SUCCESS in case of true, and EXIT_FAILURE as well as a message otherwise. This way, setop can be used in the shell. setop inputfile --contains "value" # is element value contained in input? setop A.txt B.txt --equal C.txt # union of A and B equal to C? setop bigfile --subset smallfile # analogous --superset setop -i file1 file2 --is-empty # intersection of 1 and 2 empty (disjoint)?

It is also possible to precisly describe how the input streams shall be parsed, actually by regular expressions:

  • setop input.txt --input-separator "[[:space:]-]" means that a whitespace (i. e. \v \t \n \r \f or space) or a minus sign is interpreted as a separator between elements (default is new line, i. e. every line of the input file is one element)
  • setop input.txt --input-element "[A-Za-z]+" means that elements are only words consisting of latin characters, all other characters are considered to be separators between elements

Furthermore, you can

  • --count all elements of the output set,
  • --trim all input elements (i. e. erase all unwanted preceding and succeeding characters like space, comma etc.),
  • consider empty elements as valid via --include-empty,
  • --ignore-case,
  • set the --output-separator between elements of the output stream (default is \n),
  • and so on.

See man setop or github.com/phisigma/setop for more information.

Frank
  • 101
8

I don't know of a specific tool but you can use Python, and its set class and operators, to write a little script to do that.

For exampe:

Python> s1 = set(os.listdir("/bin"))
Python> s2 = set(os.listdir("/usr/bin"))
Python> s1 & s2

set(['awk',
     'basename',
     'chroot', ...
dr_
  • 29,602
Keith
  • 7,914
  • 1
  • 28
  • 29
  • 1
    Yes, nice answer. Why use awk if python is available? – guettli Nov 17 '15 at 09:23
  • You forgot: Python> import os – James Bowery Oct 23 '16 at 19:48
  • So I'm not sure but the title of the question is about how to perform Set operations on files. Maybe one set element per line of the file (this is one interpretation tackled in code by @Gilles 'SO- stop being evil' )? But a quick reading of your code would suggest you're performing Set operations on directories, so... . – jubilatious1 Jun 02 '23 at 21:40
  • @jubilatious1 The listdir returns a list of files in that directory. So this example is operating on sets of file names. – Keith Jun 04 '23 at 22:09
  • @Keith see a comment from the OP here – jubilatious1 Jun 05 '23 at 08:53
  • @jubilatious1 not what I inferred from the original question. Even so, it's just an example. I figure the OP should be able to adapt the idea to his/her needs. – Keith Jun 08 '23 at 03:06
3

If you see a file as a set of lines, and the files are sorted, there's comm.

If you see a file as a (multi)set of lines, and the lines aren't sorted, grep can do difference and intersection (it achieves set difference and intersection, but doesn't respect count for multisets). Union is just cat.

grep -xF -f small large >intersection
grep -vxF -f small large >difference
cat small large >union
3

I've made a Python utility that can do line-wise union, intersection, difference and product of multiple files. It's called SetOp, you can find it on PyPI (here). Syntax looks like this:

$ setop -i file1 file2 file3  # intersection
$ setop -d file1 file2 file3  # difference
Tigr
  • 31
2

With zsh arrays (zsh arrays can contain any arbitrary sequence of bytes, even 0).

set membership

if ((${array[(Ie)$element]})); then
  echo '$element is in $array'
fi

(using the I array subscript flag, to get the index of the last occurrence of $element in the array (or 0 if not found). Remove e (for exact) for $element to be taken as a pattern)

if ((n = ${(M)#array:#$element})); then
  echo "\$element is found $n times in \$array'
fi

${array:#pattern} being a variation on ksh's ${var#pattern} that removes the elements that match the pattern as opposed to just remove the leading part that matches the pattern. The (M) (for matched) reverses the meaning and removes all but the matched elements (use $~element for it to be taken as a pattern).

set intersection

common=("${(@)set1:*set2}")

${set1:*set2} does the array intersection, but the "${(@)...}" syntax is needed to preserve empty elements.

set equality

[[ ${(j: :)${(q)array1}} = ${(j: :)${(q)array2}} ]]

Tests whether the arrays are identical (and in the same order). The q parameter expansion flag quotes the elements (to avoid problems with things like a=(1 "2 3") vs b=("1 2" 3)), and (j: :) joins them with space before doing a string comparison.

To check that they have the same elements, irrespective of order, use the o flag to order them. See also the u flag (unique) to remove duplicates.

[[ ${(j: :)${(qo)array1}} = ${(j: :)${(qo)array2}} ]]

set cardinality

n=$#array

subset test

if ((${#array1:*array2} == ${#array2})); then
  echo '$array2 is included in $array1'
fi

(assuming arrays with unique values)

union

union=("$array1[@]" "$array2[@]")

(see typeset -U above or the u parameter expansion flag to take care of duplicates). Again if the empty string is not one of the possible values, you can simplify to:

union=($array1 $array2)

complement

complement=("${(@)array1:|array2}")

for the elements of $array1 that are not in $array2.

minimum/maximum (lexical comparison)

min=${${(o)array}[1]} max=${${(o)array}[-1]}

minimum/maximum (decimal integer comparison)

min=${${(no)array}[1]} max=${${(no)array}[-1]}
2

I've written a command tool in Go for this question. The source code is here on GitHub.

You can use homebrew to install it on macOS:

brew install suzaku/homebrew-rose/rose

Usage is simply:

  • rose and file1 file2 for intersection.
  • rose or file1 file2 for union.
  • rose sub file1 file2 for subtraction.
AdminBee
  • 22,803
Satoru.Logic
  • 121
  • 3
2

Using Raku (formerly known as Perl_6)

Raku has two choices of object-types when it comes to set-related operations. They are:

  1. Set object type (immutable),
  2. SetHash object type (mutable).

Unicode symbols can be used--e.g. infix for intersection--or the ASCII equivalent, e.g. infix (&) for intersection. Note for the set results below, the first two lines are identical; in the third (and final) line only the set operator changes.

Below, taking the (famous, 72-word) "We hold these truths to be self-evident..." sentence from the US Declaration of Independence, and performing set-comparisons to the (equally famous, 38-word) Preamble of the US Constitution:

Intersection: infix (&), infix

~$ raku -e 'my Set $a .= new(.comb(/<alpha>+/)>>.lc) given "declaration.txt".IO.words; 
            my Set $b .= new(.comb(/<alpha>+/)>>.lc) given "preamble.txt".IO.words; 
           .sort>>.keys.put given $a (&) $b; #returns 10 words'
and form in liberty of people secure the to we

Union: infix (|), infix

~$ raku -e 'my Set $a .= new(.comb(/<alpha>+/)>>.lc) given "declaration.txt".IO.words; 
            my Set $b .= new(.comb(/<alpha>+/)>>.lc) given "preamble.txt".IO.words; 
           .sort>>.keys.put given $a (|) $b; #returns 100 words'
a abolish all alter america among and any are as be becomes blessings by certain common consent constitution created creator defence deriving destructive do domestic effect endowed ends equal establish evident for form foundation from general governed government governments happiness hold in institute instituted insure is it its just justice laying liberty life likely men more most new of on or ordain order organizing our ourselves people perfect posterity powers principles promote provide pursuit right rights safety secure seem self shall states such that the their them these they this to tranquility truths unalienable union united we welfare whenever with

Symmetric set difference: $a (^) $b, or use infix

~$ raku -e 'my Set $a .= new(.comb(/<alpha>+/)>>.lc) given "declaration.txt".IO.words; 
            my Set $b .= new(.comb(/<alpha>+/)>>.lc) given "preamble.txt".IO.words; 
           .sort>>.keys.put given $a (^) $b; #returns 90 words'
a abolish all alter america among any are as be becomes blessings by certain common consent constitution created creator defence deriving destructive do domestic effect endowed ends equal establish evident for foundation from general governed government governments happiness hold institute instituted insure is it its just justice laying life likely men more most new on or ordain order organizing our ourselves perfect posterity powers principles promote provide pursuit right rights safety seem self shall states such that their them these they this tranquility truths unalienable union united welfare whenever with

Asymmetric set difference: $a (-) $b, or use infix

~$ raku -e 'my Set $a .= new(.comb(/<alpha>+/)>>.lc) given "declaration.txt".IO.words; 
            my Set $b .= new(.comb(/<alpha>+/)>>.lc) given "preamble.txt".IO.words; 
           .sort>>.keys.put given $a (-) $b; #returns 62 words'
abolish all alter among any are as be becomes by certain consent created creator deriving destructive effect endowed ends equal evident foundation from governed government governments happiness hold institute instituted is it its just laying life likely men most new on or organizing powers principles pursuit right rights safety seem self shall such that their them these they truths unalienable whenever with

Asymmetric set difference: $b (-) $a, or use infix

~$ raku -e 'my Set $a .= new(.comb(/<alpha>+/)>>.lc) given "declaration.txt".IO.words; 
            my Set $b .= new(.comb(/<alpha>+/)>>.lc) given "preamble.txt".IO.words; 
           .sort>>.keys.put given $a (-) $b; #returns 28 words'
a america blessings common constitution defence do domestic establish for general insure justice more ordain order our ourselves perfect posterity promote provide states this tranquility union united welfare

Set operators that return a Boolean

(As above, Unicode symbols can be used, but many have easy-to-type ASCII variants):

  1. infix (elem), infix
  2. infix
  3. infix (cont), infix
  4. infix
  5. infix (<=), infix
  6. infix
  7. infix (<), infix
  8. infix
  9. infix (>=), infix
  10. infix
  11. infix (>), infix
  12. infix
  13. infix (==), infix
  14. infix

https://www.archives.gov/founding-docs/declaration-transcript
https://www.archives.gov/founding-docs/constitution
https://docs.raku.org/language/setbagmix#Sets,_bags,_and_mixes
https://raku.org

jubilatious1
  • 3,195
  • 8
  • 17
1

I wrote a little tool to do this which has been quite useful to me in various places. The UI is unpolished and I'm not sure about the performance characteristics for very large files (since it reads the whole list into memory) but "it works for me". The program is at https://github.com/nibrahim/lines. It's in Python. You can get it using pip install lines.

It currently supports union, intersection, difference and symmetric difference of two files. Each line of the input file is treated as an element of a set.

It also has two extra operations. One of squeeze out blank lines in a file and the second (which has been very useful to me) is to look through the file and divide it into sets of similar strings. I needed this to look for files in a list that didn't match the general pattern.

I'd welcome feedback.

0

The filesystem treats filenames (whole filenames, including paths) as unique.

Operations?

You may copy the files in a/ and b/ to the empty directory c/, to get a new, union set.

With file-tests like -e name and loops or find, you may check for files existing in two or more directories, to get the intersection, or the difference.

user unknown
  • 10,482
  • 2
    I meant treating the contents of files as the elements of a set (let's say, one element per line), and the files themselves as sets. – nilton Apr 25 '11 at 22:04
0

Best answer here: Setdown (a dedicated tool)

I wrote a program called setdown that performs Set operations from the cli.

It can perform set operations by writing a definition similar to what you would write in a Makefile:

someUnion: "file-1.txt" \/ "file-2.txt"
someIntersection: "file-1.txt" /\ "file-2.txt"
someDifference: someUnion - someIntersection

Its pretty cool and you should check it out. I personally don't recommend using ad-hoc commands that were not built for the job to perform set operations.It won't work well when you really need to do many set operations or if you have any set operations that depend on each other. Not only that but setdown lets you write set operations that depend on other set operations!

At any rate, I think that it's pretty cool and you should totally check it out.

0

Sample pattern for multiple files (intersection in this case):

eval `perl -le 'print "cat ",join(" | grep -xF -f- ", @ARGV)' t*`

Expands to:

cat t1 | grep -xF -f- t2 | grep -xF -f- t3

Test files:

seq 0 20 | tee t1; seq 0 2 20 | tee t2; seq 0 3 20 | tee t3

Output:

0
6
12
18
bsb
  • 141