How to subtract two list (fast)?

Question

What is a fast way to subtract two lists¹. The lists may be small, maybe a direct way in shell works. Or the lists may be long, maybe external tools are the faster way.

Assume you have two lists:

list1=( 1 2 3 4 5 6 7 8 9 10 11 12 )
list2=( 1 2 3   5   7 8 9    11 12 )

How to remove all elements of list2 from list1 to obtain a resulting list (listr) equivalent to:

listr=( 4 6 10 )

The lists could be inside files also, as the should if the list is big (it may use too much memory).

To make this question brief, I am placing all algoritms in a community answer.

Please read the multiple tests done in there.

Multisets

The original question was meant to find the missing elements of a complete list (list1) in list2, without repetitions.

However, if the lists are:

list1=( a a b b b c     d d   )
list2=(     b b   c c c d d e )

And the definition of multiset subtraction is as in this page, the expected result is:

listr= ( a a b )

Only algorithms 1 and 3 work correctly.
Neither algorithms 2 or 4 can do this.
Algorithm 5 (comm) could match this definition by doing comm -23.
Algorithm 6 (zsh) fails. I do not know how to make it work.
Algorithm 7 (comm). As said above, using -23 works.

I have not analyzed all the algorithms for the Set symmetric difference definition which should yield:

listr=( a a b c c e )

But comm -3 list1.txt list2.txt | tr -d ' \t' works.

¹ Yes, I know that it is a bad idea to process a text file (list of lines) in a shell, it is slow by design.
But there are cases where it seems that it can not be avoided.
I (we) am (are) open to suggestions.

If the lists are sorted and on file, just join them (with the -v option). — Kusalananda, May 13 '18 at 18:26
What about duplicates? What should be the output with 1 1 1 and 1 1? What about 1 2 and 2 1? — Eric Duminil, May 14 '18 at 07:39
@EricDuminil Interesting question. However, the answer will depend on the definition of subtraction when applied to multisets. Please read the edited question. — , May 14 '18 at 17:48
@don_crissti The answer you point to defines the method required here as a complement, yes. However, there is no evaluation of speed in any sense of the possible solutions given there. I beg to differ. — , May 14 '18 at 18:21

Stephen Kitt · Accepted Answer · 2018-05-14T09:07:46.643

14

You can use comm to remove anything that’s common to both lists:

listr=($(comm -3 <(printf "%s\n" "${list1[@]}" | sort) <(printf "%s\n" "${list2[@]}" | sort) | sort -n))

This sorts both lists in the order comm expects, compares them, outputs only items which are unique to either list, and sorts them again in numeric order.

If both lists are sorted lexicographically (as per LC_COLLATE), you can avoid sorting again:

listr=($(comm --nocheck-order -3 <(printf "%s\n" "${list1[@]}") <(printf "%s\n" "${list2[@]}")))

This also works great if the values you need to compare are stored in files.

edited May 14 '18 at 09:07

answered May 13 '18 at 18:37

Stephen Kitt

434,908

Good Idea. However, With all the sorting involved, It takes well over 5 seconds. – May 13 '18 at 19:18
@isaac how about without sorting? – Stephen Kitt May 13 '18 at 19:22
@StephenKitt Quite better, but still around 2 seconds. Please take a look at the (community) answer I just posted with the measured times. Please feel free to edit/add/adapt the answer. – May 13 '18 at 19:50
To clarify, the inputs need to be sorted lexically (with strcoll), not numerically for the comm --nocheck-order to work. See for instance comm --nocheck-order -3 <(printf '%s\n' 3 20) <(printf '%s\n' 1 20) – Stéphane Chazelas May 14 '18 at 09:03
In general, this will fall on face with arbitrary input; lines/elements with whitespace or metacharacters that can be considered globs. – llua May 14 '18 at 16:40

score 6 · Answer 2 · answered May 13 '18 at 18:38

6

#!/bin/zsh
list1=( 1 2 3 4 5 6 7 8 9 10 11 12 )
list2=( 1 2 3   5   7 8 9    11 12 )
listr=("${(@)list1:|list2}")
typeset -p listr

answered May 13 '18 at 18:38

llua

6,900

score 4 · Answer 3 · 2018-05-14T18:33:32.620

Abstract:

For long lists, if lists are already sorted, comm (alg7) is the fastest.
The zsh solution is (by far) the fastest if no file is read, that is, the lists are given "in memory". However, that is not a fair comparison with all the other solutions that have to read values from files. I changed the original code (which avoided the time to read files from testing) to a code that also include the time to read files.

This is a Community Answer, it is meant to only report the time of each answer.

Please DO edit and add your solution/option to compare all.

List of algorithms

alg1: naive looped solution.
alg2: using external sort and uniq -u
alg3: processing an string in bash.
alg4: join -v on sorted lists (Thanks @Kusalananda)
alg5: comm (Thanks @Stephen Kitt)
alg6: zsh (Thanks @Llua)
alg7: comm but on already sorted files (thanks @Stephen Kitt)

Note from zsh manual:

${name:|arrayname}
If arrayname is the name (N.B., not contents) of an array variable, then any elements contained in arrayname are removed from the substitution of name.

Testing

As there are several ways to solve this, we need a general framework to test (in fairness) the answers.

Some guidelines (change them if you find them unfair):

Measure enough repetitions to have a reasonably precision.
Measure inside the shell used (avoid loading/unloading of the shell).
Avoid output overhead by either not printing or redirecting to /dev/null.

Testing code:

#!/bin/bash
alg1(){
         arr=( "${list1[@]}" )
         for i in "${list2[@]}"; do
             for j in "${!arr[@]}"; do
         if [[ "$i" == "${arr[j]}" ]]; then
             unset arr["$j"]
             break
         fi
             done
     done
     printf '%s ' "${arr[@]}"; echo
}

alg2(){
         arr=($({ printf '%s\n' "${list1[@]}" "${list2[@]}"; } | sort | uniq -u))
         printf '%s ' "${arr[@]}"; echo
}

alg3(){
         a=" $(printf '%s ' ${list1[@]})" # Watch the space at the start!!.
         for i in "${list2[@]}"; do
         a=${a/ "$i" / };
     done
     printf '%s\n' "$a"
}

alg4(){  join -v 1 list1.txt list2.txt ; }

alg5(){  #listr=$(
                    comm -3 <(printf "%s\n" "${list1[@]}" | sort) \
                            <(printf "%s\n" "${list2[@]}" | sort) |
                sort -n
     #)
      }

alg6(){ zsh -c '  alg6(){
                           list1=( $(cat list1.txt) )
                           list2=( $(cat list2.txt) )
                           listr=("${(@)list1:|list2}")
                           typeset -p listr
                        }
                  TIMEFMT="%E %U %S"
                  time ( for ((k=0;k<'"$1"';k++)); do alg6; done; )
                '
      }
#: <<-\_comment_
alg7(){ comm -23 list1.txt list2.txt; }

try(){ for ((k=0;k<$1;k++)); do "$2"; done; }

#list1=( 1 2 3 4 5 6 7 8 9 10 11 12 )
#list2=( 1 2 3   5   7 8 9    11 12 )

#list1=( a a b b b c     d d   )
#list2=(     b b   c c c d d e )

size=1000000
list1=( "0" $(seq 1 "$size") )
list2=( "${list1[@]}" ); unset "list2[123]" "list2[234]" "list2[345]"

printf '%s\n' "${list1[@]}" | sort >list1.txt
printf '%s\n' "${list2[@]}" | sort >list2.txt

repeats=${1:-10}; shift
out=${1:-no}    ; shift
(($#==0)) && set -- alg{1..7}

TIMEFORMAT='%R %U %S'
for   i
do    printf '%s ' "$i"
      if [[ $out == no ]]; then
      [[ $i != alg6 ]] &&
          time try "$repeats" "$i" >/dev/null ||
          time alg6 "$repeats" >/dev/null
      else
      [[ $i != alg6 ]] &&
          time try "$repeats" "$i"            ||
          time alg6 "$repeats"
      fi
done

Results:

Short list (as presented in the code):

$ ./script
alg1 2.056 0.806 1.237
alg2 3.478 3.126 1.756
alg3 0.999 0.728 0.304
alg4 1.186 0.780 0.434
alg5 5.234 1.926 1.722
alg6 2.71s 1.64s 1.26s
     2.758 1.637 1.265
alg7 1.156 0.799 0.422

The times for alg6 are as reported by zsh and after as reported by bash.
Also, the execution time of zsh is really smaller (0.050) if the reading of files is removed from the function to outside.

Longer List

Testing a list of only 500 values (10 repeats) reveals that alg1 is very inefficient. Removing it from further testing:

alg1 4.149 3.471 0.657
alg2 0.145 0.055 0.063
alg3 0.219 0.180 0.009
alg4 0.039 0.015 0.003
alg5 0.149 0.018 0.027
alg6 0.06s 0.02s 0.02s
     0.087 0.030 0.018
alg7 0.033 0.008 0.008

Testing 5k values (10 repeats) reveals that alg3 is also inefficient:

alg2 0.590 0.526 0.187
alg3 12.957 12.888 0.044
alg4 0.098 0.047 0.008
alg5 0.654 0.028 0.036
alg6 0.16s 0.12s 0.04s
     0.211 0.118 0.044
alg7 0.038 0.022 0.014

Testing 50k values (10 repeats):

alg2 6.487 5.838 1.611
alg4 0.488 0.469 0.019
alg5 5.073 0.250 0.056
alg6 1.42s 1.20s 0.21s
     1.467 1.206 0.216
alg7 0.271 0.247 0.014

For 500k (10 repeats)

alg4 5.471 5.269 0.156
alg6 15.14s 13.33s 1.91s
     15.215 13.335 1.926
alg7 2.833 2.655 0.138

For 1M (10 repeats)

alg4 11.127 10.804 0.251
alg7 5.772 5.525 0.230

You said in the question that some lists may be larger than what's reasonable to put into an array. This would disqualify a number of solutions, for example the zsh solution. — Kusalananda, May 13 '18 at 19:52
Yes, as a general question, it has several points of view. For short lists, zsh (and all the examples presented so far) do work. Feel free to generate a list of 50k lines (seq 50000), test again and add results to this answer. — , May 13 '18 at 19:57
The question was explicitly tagged zsh in additional to other shells. How would that answer be cheating? — llua, May 13 '18 at 20:07
@llua I mean that maybe zsh is doing some internal optimization and avoiding/bypassing the real output step (I do not know if that happens), thus: if it is not cheating. — , May 13 '18 at 20:18
alg7 still has an extra sort, which shouldn’t be needed if the files are sorted already (which they are). Removing the sort should produce times comparable to join. — Stephen Kitt, May 13 '18 at 20:19
@StephenKitt Yes, the final sort is numeric (as I believe You intended the solution to work) removed, re-tested, edited (ASAP). — , May 13 '18 at 20:21
I intended the solution to work as I wrote it ;-). (The first variant has a numeric sort to restore the order given in the question, which is lost by the sort used to prepare the data for comm; the second doesn’t need the numeric sort because the order given in the question is preserved throughout.) — Stephen Kitt, May 13 '18 at 20:24
@StephenKitt Yes :-) ... But then, in your first solution, the partial sort(s) may be numeric (sort -n) and the last re-sorting could be completelly avoided/removed (could it not?). — , May 13 '18 at 20:28
No, comm expects files to be sorted according to LC_COLLATE, so sort -n can’t be used on its inputs (see its documentation). — Stephen Kitt, May 13 '18 at 20:38