I am trying to obtain the pairwise combinations of the strings available for each group of data.

The input file contains two columns: column 1 holds gene names and column 2 holds the names of various stressors.

        gene1   FishKairomones
        gene1   Microcystin
        gene1   Calcium
        gene2   Cadmium
        gene2   Microcystis
        gene2   FishKairomones
        gene2   Phosphorous
        gene3   FishKairomones
        gene3   Microcystin
        gene3   Phosphorous
        gene3   Cadmium

From the table, gene1 responds to three stressors: FishKairomones, Microcystin and Calcium.

I would like to obtain a pairwise table like this:

    gene1   FishKairomones  gene1   Microcystin
    gene1   FishKairomones  gene1   Calcium
    gene1   Microcystin     gene1   Calcium
    gene2   Cadmium         gene2   Microcystis
    gene2   Cadmium         gene2   FishKairomones
    gene2   Cadmium         gene2   Phosphorous
    gene2   Microcystis     gene2   FishKairomones
    gene2   Microcystis     gene2   Phosphorous
    gene2   FishKairomones  gene2   Phosphorous

As you can see, gene1 FishKairomones is linked to gene1 Microcystin, gene1 FishKairomones is also linked to gene1 Calcium, and gene1 Microcystin is linked to gene1 Calcium. I would like to do the same for every gene.

The number of stressors per gene varies: a gene can have three stressors, four, or more.

I tried the code here: Command line tool to "cat" pairwise expansion of all rows in a file

This creates all pairwise combinations of the entire file, which is not what I want.

biobudhan

1 Answer

AWK solution (will work even for unordered input lines):

awk '{ a[$1]=($1 in a? a[$1]",":"")$2 }   # grouping `stressors` by `gene` names
     END { 
         for (k in a) {                   # for each `gene`
             len=split(a[k], b, ",");     # split `stressors` string into array b
             for (i=1;i<len;i++)          # construct pairwise combinations
                 for (j=i+1;j<=len;j++)   # between `stressors` 
                     print k,b[i],k,b[j] 
         } 
     }' file

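The output (the order in which genes appear may differ, because awk does not specify the traversal order of `for (k in a)`):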

gene1 FishKairomones gene1 Microcystin
gene1 FishKairomones gene1 Calcium
gene1 Microcystin gene1 Calcium
gene2 Cadmium gene2 Microcystis
gene2 Cadmium gene2 FishKairomones
gene2 Cadmium gene2 Phosphorous
gene2 Microcystis gene2 FishKairomones
gene2 Microcystis gene2 Phosphorous
gene2 FishKairomones gene2 Phosphorous
gene3 FishKairomones gene3 Microcystin
gene3 FishKairomones gene3 Phosphorous
gene3 FishKairomones gene3 Cadmium
gene3 Microcystin gene3 Phosphorous
gene3 Microcystin gene3 Cadmium
gene3 Phosphorous gene3 Cadmium
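
If the genes need to come out in the order they first appear in the input, a variant along these lines may help. It is only a sketch: the function name `pairwise_by_gene` is made up for illustration, the tab-separated output is an assumption about the desired format, and lines with fewer than two fields are skipped. Each stressor is stored under a (gene, index) key rather than in a comma-joined string, and the first-seen order of genes is kept in a separate array so that the END block does not rely on `for (k in a)`.

```shell
# Hypothetical helper: reads "gene stressor" lines from the given files
# (or from stdin when no file is given) and prints per-gene pairs.
pairwise_by_gene() {
    awk -v OFS='\t' '
        NF < 2 { next }                           # skip blank or incomplete lines
        {
            if (!($1 in n)) order[++genes] = $1   # remember first-seen gene order
            s[$1, ++n[$1]] = $2                   # store stressor under (gene, index)
        }
        END {
            for (g = 1; g <= genes; g++) {        # genes in input order
                k = order[g]
                for (i = 1; i < n[k]; i++)        # every unordered pair i < j
                    for (j = i + 1; j <= n[k]; j++)
                        print k, s[k, i], k, s[k, j]
            }
        }
    ' "$@"
}
```

It can be called as `pairwise_by_gene file`, or at the end of a pipeline. A gene with n stressors produces n*(n-1)/2 lines, so the sample input (3, 4 and 4 stressors) should give 3 + 6 + 6 = 15 lines, which is a quick way to sanity-check the result with `wc -l`.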