7

If I have n files in a directory, for example:

a
b
c

How do I get pairwise combinations of these files (non-directional) to pass to a function?

The expected output is

a-b
a-c
b-c

so that it can be passed to a function like

fn -file1 a -file2 b
fn -file1 a -file2 c
...

This is what I am trying out now.

for i in *.txt
 do
  for j in *.txt
   do
    if [ "$i" != "$j" ]
     then
      echo "Pairs $i and $j"
     fi
   done
 done

Output

Pairs a.txt and b.txt
Pairs a.txt and c.txt
Pairs b.txt and a.txt
Pairs b.txt and c.txt
Pairs c.txt and a.txt
Pairs c.txt and b.txt

I still have duplicates (a-b is same as b-a) and I am thinking perhaps there is a better way to do this.

5 Answers

11

Put the file names in an array and run through it manually with two loops.

You get each pairing only once if j > i, where i and j are the indexes used in the outer and the inner loop, respectively. Starting the inner loop at i + 1 guarantees that.

$ touch a b c d
$ f=(*)
$ for ((i = 0; i < ${#f[@]}; i++)); do 
      for ((j = i + 1; j < ${#f[@]}; j++)); do 
          echo "${f[i]} - ${f[j]}"; 
      done;
  done 
a - b
a - c
a - d
b - c
b - d
c - d
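
To tie this back to the question, the same index loop can call the command directly instead of echoing. This is only a sketch under assumptions: fn below is a hypothetical stub standing in for the asker's real command, and each_pair is a helper name made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the question's fn; replace with the real command.
fn() { printf 'fn %s %s %s %s\n' "$@"; }

# Call fn once for every unordered pair of the arguments.
each_pair() {
  local -a f=("$@")   # copy the arguments into an indexed array
  local i j
  for ((i = 0; i < ${#f[@]}; i++)); do
    # Starting at i + 1 skips self-pairs and the mirrored (b, a) ordering.
    for ((j = i + 1; j < ${#f[@]}; j++)); do
      fn -file1 "${f[i]}" -file2 "${f[j]}"
    done
  done
}
```

It could be invoked as each_pair *.txt. Setting shopt -s nullglob first is probably wise, so that an unmatched pattern expands to nothing rather than to the literal string *.txt.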
ilkkachu
  • 1
    Note that it is better to use printf rather than echo: https://unix.stackexchange.com/questions/65803/why-is-printf-better-than-echo – cryptarch Dec 23 '18 at 20:06
  • 1
    @cryptarch, to be in line with the question, the content of the loop should be a call to fn, instead of echo or printf. echo works fine as an example here, though. – ilkkachu Dec 23 '18 at 21:24
  • Sure, it's not broken, you already got my +1 ;) – cryptarch Dec 23 '18 at 21:45
5

You're very close in your script, but you want to remove duplicates; i.e., a-b is considered a duplicate of b-a.

We can use an inequality to handle this: only display the pair if the first filename sorts before the second. This ensures that exactly one ordering of each pair is printed, and it also skips pairing a file with itself.

for i in *.txt
do
  for j in *.txt
  do
    if [ "$i" \< "$j" ]
    then
     echo "Pairs $i and $j"
    fi
  done
done

This gives the output

Pairs a.txt and b.txt
Pairs a.txt and c.txt
Pairs b.txt and c.txt

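This isn't the most efficient approach: it tests all n^2 ordered pairs and throws away more than half of them, when only n(n-1)/2 pairs are needed. It may still be good enough for your needs.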
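
If the redundant comparisons matter, one possible alternative (a different technique from the string comparison above) is to shift the positional parameters, so that the inner loop only ever sees names that come after the current one. This is a minimal sketch; pairs is a hypothetical helper name, and the tab-separated output is just one choice of format.

```sh
# Print each unordered pair of the arguments once, tab-separated.
# The function body is a subshell "( ... )", so shift and the loop
# variables do not leak into the caller.
pairs() (
  for i in "$@"; do     # the word list is expanded once, before any shift
    shift               # drop $i; "$@" now holds only the later names
    for j in "$@"; do
      printf '%s\t%s\n' "$i" "$j"
    done
  done
)
```

Called as pairs *.txt, this visits only the n(n-1)/2 pairs and needs no arrays, so it should also run in a plain POSIX sh. Replacing the printf with something like fn -file1 "$i" -file2 "$j" would run the command directly.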

  • This will take more than twice as long as https://unix.stackexchange.com/a/490657/305714 because you are checking each pair twice rather than restricting the loop to avoid redundancy – cryptarch Dec 23 '18 at 20:08
  • Yes, but without knowing the cost of fn it's hard to know if this overhead is significant or not. Taking 0.2s instead of 0.1s doesn't mean anything if every call to fn takes 1 second. Sometimes the naive algorithms are just fine ;-) In this case I just fixed the original code, rather than providing a more optimised alternative, because I considered it a better "teaching" solution. – Stephen Harris Dec 23 '18 at 20:16
1

Using a join trick (only suitable for filenames that contain no whitespace):

Sample list of files:

$ ls *.json | head -4
1.json
2.json
comp.json
conf.json

$ join -j9999 -o1.1,2.1 <(ls *.json | head -4) <(ls *.json | head -4) | awk '$1 < $2'
1.json 2.json
1.json comp.json
1.json conf.json
2.json comp.json
2.json conf.json
comp.json conf.json

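  • The -j option names the field to join on. No input line has a field 9999, so (with GNU join, at least) every line gets the same empty join key, every line of the first input matches every line of the second, and the result is the Cartesian product of the two lists.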
0
for i in *.txt ; do
  for j in *.txt ; do
    if [ "$i" '<' "$j" ] ; then
      echo "Pairs $i and $j"
    fi
  done
done
Ole Tange
0

You could use perl's Algorithm::Combinatorics module to avoid having to devise the algorithm yourself.

perl -MAlgorithm::Combinatorics=combinations -e '
  if ((@files = <*.txt>) >= 2) {
    for (combinations(\@files, 2)) {
      system "cmd", "-file1", $_->[0], "-file2", $_->[1];
    }
  } else {
    die "Not enough txt files in the current working directory\n";
  }'

See perldoc Algorithm::Combinatorics for details and other things that module can do.