13

This question is strongly related to this and this question. I have a file in which each line is a path to a file. I want to pair each line with every other line (but not with itself). For my purposes the pair A B is the same as the pair B A, so only one of the two should be produced.

Example

files.dat looks like this in shorthand notation, where each letter stands for a file path (absolute or relative):

a
b
c
d
e

Then my result should look something like this:

a b
a c
a d
a e
b c
b d
b e
c d
c e
d e

Preferably I would like to solve this in bash. Unlike in the other questions, my file list is rather small (about 200 lines), so neither loops nor memory use are a problem.

Jeff Schaller
Enno
  • Does it have to be in bash proper, or just something available via the bash commandline? Other utilities are better positioned to process text. – Jeff Schaller Mar 17 '19 at 14:15
  • @JeffSchaller Something accessible via the bash commandline. I was a bit unclear, sorry – Enno Mar 17 '19 at 14:17
  • This is almost becoming a Code Golf :P – Richard de Wit Mar 18 '19 at 11:22
  • As a general rule, as long as you need to do something non-trivial, use your favourite scripting language over BASH. It will be less fragile (for example, against special characters or spaces), and much easier to expand whenever you need it (if you need three, or filter some of them away). Python or Perl should be installed in almost any Linux box, so they are good choices (unless you are working on embedded systems, like Busybox). – Davidmh Mar 18 '19 at 12:26

7 Answers

9

Use this command:

awk '{ name[$1]++ }
    END { PROCINFO["sorted_in"] = "@ind_str_asc"
        for (v1 in name) for (v2 in name) if (v1 < v2) print v1, v2 }
        ' files.dat

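PROCINFO["sorted_in"] is a gawk extension. If your awk doesn’t support it, leave out the PROCINFO["sorted_in"] = "@ind_str_asc" line and pipe the output into sort if you want the output sorted. Each pair is still printed only once, because the v1 < v2 test does not depend on the traversal order.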

(This does not require the input to be sorted. Because it keys on $1, it does assume that no path contains whitespace.)
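
For readers who want to check the logic outside awk, here is a minimal Python sketch of the same idea: collect the distinct lines, then keep only the pairs whose first element sorts before the second. The function name `ordered_pairs` is hypothetical, and unlike the awk command it keys on the whole line rather than on the first whitespace-separated field.

```python
def ordered_pairs(lines):
    """Return each unordered pair of distinct lines once, as (smaller, larger)."""
    # Like the awk array indices, a set keeps one entry per distinct line.
    names = sorted(set(lines))
    # Keep (v1, v2) only when v1 sorts strictly before v2: this drops
    # self-pairs and the mirrored (v2, v1) duplicate.
    return [(v1, v2) for v1 in names for v2 in names if v1 < v2]


if __name__ == "__main__":
    # Unsorted input with a duplicate, to show that neither matters.
    for v1, v2 in ordered_pairs(["c", "a", "b", "a"]):
        print(v1, v2)
```

Sorting the set up front plays the role of the `sorted_in` setting; without it the pairs would still be unique, only in an unspecified order.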

9

If you have ruby installed:

$ ruby -0777 -F'\n' -lane '$F.combination(2) { |c| puts c.join(" ")}' ip.txt
a b
a c
a d
a e
b c
b d
b e
c d
c e
d e
  • -0777 slurp entire file (should be okay as it is mentioned in OP that file size is small)
  • -F'\n' split based on newline, so each line will be an element in $F array
  • $F.combination(2) generate combinations 2 elements at a time
  • { |c| puts c.join(" ")} print as required
  • if input file can contain duplicates, use $F.uniq.combination(2)


For 3 elements at a time:

$ ruby -0777 -F'\n' -lane '$F.combination(3) { |c| puts c.join(" ")}' ip.txt
a b c
a b d
a b e
a c d
a c e
a d e
b c d
b c e
b d e
c d e
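
The same `combination(k)` idea can be sketched in Python, assuming that input order should be kept and duplicates dropped (the `uniq` variant mentioned in the list above). `line_combinations` is a hypothetical helper name; `itertools.combinations` from the standard library does the actual work.

```python
from itertools import combinations


def line_combinations(lines, k=2):
    """Yield space-joined combinations of k distinct lines, in input order."""
    # dict.fromkeys drops duplicates while preserving first-seen order,
    # roughly what $F.uniq does in the Ruby one-liner.
    unique = list(dict.fromkeys(lines))
    for combo in combinations(unique, k):
        yield " ".join(combo)


if __name__ == "__main__":
    for row in line_combinations(["a", "b", "c", "d", "e"], k=3):
        print(row)
```

Changing `k` is the only adjustment needed to move between pairs, triples, and larger groups.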


With perl (not generic: the nested loops are hard-coded for pairs)

$ perl -0777 -F'\n' -lane 'for $i (0..$#F) {
                             for $j ($i+1..$#F) { 
                               print "$F[$i] $F[$j]\n" } }' ip.txt
a b
a c
a d
a e
b c
b d
b e
c d
c e
d e


With awk

$ awk '{ a[NR]=$0 }
       END{ for(i=1;i<=NR;i++)
              for(j=i+1;j<=NR;j++)
                print a[i], a[j] }' ip.txt 
a b
a c
a d
a e
b c
b d
b e
c d
c e
d e
Sundeep
8
$ join -j 2 -o 1.1,2.1 file file | awk '!seen[$1,$2]++ && !seen[$2,$1]++'
a b
a c
a d
a e
b c
b d
b e
c d
c e
d e

This assumes that no line in the input file contains any whitespace. It also assumes that the file is sorted.

The join command creates the full cross product of the lines in the file. It does this by joining the file with itself on a non-existing field. The non-standard -j 2 may be replaced by -1 2 -2 2 (but not by -j2 unless you use GNU join).

The awk command reads the output of join and prints only those pairs that have not yet been seen, in either order.
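
A rough Python sketch of the same two stages, for illustration only: build the full cross product, then let a `seen` set reject self-pairs and mirrored pairs. The name `cross_then_filter` is made up here, and the filter is written to have the same effect as the awk expression rather than to mirror it character for character. Because it works on whole lines, it does not share the no-whitespace restriction of the pipeline.

```python
from itertools import product


def cross_then_filter(lines):
    """Return each unordered pair of distinct lines once, in input order."""
    seen = set()
    result = []
    # product(lines, repeat=2) is the full cross product that join produces.
    for left, right in product(lines, repeat=2):
        # Skip self-pairs and any pair already emitted in either order.
        if left == right or (left, right) in seen or (right, left) in seen:
            continue
        seen.add((left, right))
        result.append((left, right))
    return result


if __name__ == "__main__":
    for left, right in cross_then_filter(["a", "b", "c"]):
        print(left, right)
```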

Kusalananda
8

A Python solution. The lines of the input file are fed to itertools.combinations from the standard library, which generates 2-tuples that are formatted and printed to standard output.

python3 -c 'from itertools import combinations
with open("file") as f:
    lines = (line.rstrip() for line in f)
    lines = ("{} {}".format(x, y) for x, y in combinations(lines, 2))
    print(*lines, sep="\n")
'
iruvar
5

Here's one in pure shell.

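# Stop once fewer than two arguments remain.
test $# -gt 1 || exit
a=$1
shift
# Pair the first argument with each remaining one.
for f in "$@"
do
  printf '%s %s\n' "$a" "$f"
done
# Re-run the script on the remaining arguments.
exec /bin/sh "$0" "$@"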

Example:

~ (137) $ sh test.sh $(cat file.dat)
a b
a c
a d
a e
b c
b d
b e
c d
c e
d e
~ (138) $ 
EdC
  • Command substitution strips trailing newlines, so you're better off with something like <file.dat xargs test.sh than test.sh $(cat file.dat) – iruvar Mar 17 '19 at 20:33
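
The script's recursion (print the first argument paired with every remaining one, then re-run on the rest) can be sketched in Python as follows. This is an illustration with a hypothetical function name, `pairs_by_peeling`, written as a loop rather than via `exec`. Passing the lines as a list also sidesteps shell word splitting on paths that contain spaces.

```python
def pairs_by_peeling(items):
    """Pair the head of the list with each later item, then repeat on the tail."""
    rest = list(items)
    out = []
    # Mirrors `test $# -gt 1 || exit`: stop when fewer than two items remain.
    while len(rest) > 1:
        # Mirrors `a=$1; shift`: take the head, keep the tail.
        head, rest = rest[0], rest[1:]
        out.extend(f"{head} {other}" for other in rest)
    return out


if __name__ == "__main__":
    for line in pairs_by_peeling(["a", "b", "c"]):
        print(line)
```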
1

Using Perl we can do it as shown:

$ perl -lne '
     push @A, $_}{
     while ( @A ) {
        my $e = shift @A;
        print "$e $_" for @A;
     }
' input.txt
0

As Ole Tange pointed out in a comment on my answer to a related question, GNU parallel can do this, and it can apply other sorts of restrictions on combinations as well. See more examples here.

% parallel --plus echo {choose_k} :::: files.dat  :::: files.dat
a b
a c
a d
a e
b c
b d
b e
c d
c e
d e