
I have a tab-separated file containing rows with different numbers of columns, like this:

Bin_37:_Pelotomaculum_sp._DTU098    GH3 GH57    GH15    GH18    GT2 GT4 GT28
Bin_45_1:_Thiopseudomonas_denitrificans GH3 GH57    GT2 GT9 CBM48
...

My question is: how can I generate another file (tsv) containing a column-by-column comparison of the rows, with the data aligned and missing values filled in with NA? For example, like this:

Bin_37:_Pelotomaculum_sp._DTU098 GH3 GH57 GH15 GH18 GT2 GT4 GT28 NA NA
Bin_45_1:_Thiopseudomonas_denitrificans GH3 GH57 NA NA GT2 NA NA GT9 CBM48
...
Roland
  • [edit] your question to clarify the steps you take to compare the columns across the rows to generate the output as it's not obvious and because of that none of the solutions posted so far will produce the expected output. – Ed Morton Feb 03 '20 at 18:46

3 Answers


Maybe not the most efficient for very large files, but here is a working version.

Input file file1:

Bin_37:_Pelotomaculum_sp._DTU098        GH3     GH57    GH15    GH18    GT2     GT4     GT28
Bin_45_1:_Thiopseudomonas_denitrificans GH3     GH57    GT2     GT9     CBM48
Bin_99:_to_make_sure_no_columns_is_ok

Script (/bin/bash is required: the $'\t' quoting used below is not POSIX sh):

#!/bin/bash
F="file1";
COLS=$(cat "${F}"|sed 's/^[^\t]*//g;s/\t/\n/g'|sort|uniq|xargs);
# list of all available unique columns in SORTED order
echo "All avaiulable columns: [${COLS}]";
echo
# reading from the file line by line
cat "${F}"|while read L; do
  # assign to A the first column (cut splits on tabs by default)
  A=$(echo "${L}"|cut -f1);
  # if A is not empty
  [ -n "${A}" ] && 
  {
    # take one by one all possible column values
    for C in ${COLS}; do
      # if the taken line has such column, add it to A,
      # otherwise add to A NA
      echo "${L} "|grep "\s${C}\s" >/dev/null && 
        A="$A"$'\t'"${C}" || 
        A="$A"$'\tNA'; 
    done;
    # print result line
    echo "${A}";
  };
done

Output:

All available columns: [CBM48 GH15 GH18 GH3 GH57 GT2 GT28 GT4 GT9]

Bin_37:_Pelotomaculum_sp._DTU098        NA      GH15    GH18    GH3     GH57    GT2     GT28    GT4     NA
Bin_45_1:_Thiopseudomonas_denitrificans CBM48   NA      NA      GH3     GH57    GT2     NA      NA      GT9
Bin_99:_to_make_sure_no_columns_is_ok   NA      NA      NA      NA      NA      NA      NA      NA      NA

The same (without the list of available columns at the beginning) as a one-liner:

F="file1"; COLS=$(cat "${F}"|sed 's/^[^\t]*//g;s/\t/\n/g'|sort|uniq|xargs); cat "${F}"|while read L; do A=$(echo "${L}"|cut -d' ' -f1); [ -n "${A}" ] && { for C in ${COLS}; do echo "${L} "|grep "\s${C}\s" >/dev/null && A="$A"$'\t'"${C}" || A="$A"$'\tNA'; done; echo "${A}"; }; done

UPDATE: an optimized, more efficient version based on the suggestions in the comments (/bin/bash required):

F="file1"; IFS=$'\n'; COLS=($(sed 's/^[^\t]*//g;s/\t/\n/g' "${F}"|sort -u)); while read -r L; do A="${L%%$'\t'*}"; [ -n "${A}" ] && for C in ${COLS[@]}; do [[ "${L}"$'\t' == *$'\t'"${C}"$'\t'* ]] && A="$A"$'\t'"${C}" || A="$A"$'\tNA'; done && echo "${A}"; done <${F}; IFS=' '
Yurko
  • See why-is-using-a-shell-loop-to-process-text-considered-bad-practice. There are several other issues but that's the main one. – Ed Morton Feb 03 '20 at 18:36
  • @EdMorton yes, I said from the start that it is not the most efficient... OK, now we can optimize and confuse people by using A="${L%%\s*}" instead of A=$(echo "${L}"|cut -f1), also replace echo "${L} "|grep "\s${C}\s" >/dev/null with [ "${L} " == "*\s${C}\s*" ]... What else would you suggest? – Yurko Feb 03 '20 at 19:04
  • Get rid of the UUOCs, add IFS= and -r to the read line, change sort|uniq to sort -u, probably change COLS from a scalar to an array (I'd have to think about that more than I'm willing to in order to be sure), but at the end of the day it's pointless since it'll still end up far more code, far slower, far less portable, far harder to maintain, and far more likely to still contain bugs than the awk equivalent. – Ed Morton Feb 03 '20 at 19:56
  • Thank you! Nice script! It worked! I like it because it lists the available unique data. – Roland Feb 04 '20 at 08:11

Provided your input file is properly tab-separated, you can use this GNU awk script:

awk 'BEGIN{RS="[\t\n]"} !NF{$1="NA"} {printf "%s%s", $0, RT}' file

The record separator RS is set to match either a tab or a newline, so each field of the input becomes its own record.

If NF is zero, meaning there is no word between two consecutive tabs, the record is replaced with the string NA.

The script prints each resulting record followed by the record terminator RT (the matched \t or \n).
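
For instance, on a line with an empty field between two consecutive tabs (a made-up one-line input; note that a regex RS and the RT variable are GNU awk extensions):

$ printf 'a\t\tb\n' | awk 'BEGIN{RS="[\t\n]"} !NF{$1="NA"} {printf "%s%s", $0, RT}'
a       NA      b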

oliv

Like all of the other answers so far, this won't produce the expected output you provided from the input you provided, but if your input actually did contain empty tab-separated fields, then this would fill them with NAs:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR == FNR {                            # first pass over the file:
    gsub(/\t+$/,"")                    # strip trailing tabs
    maxNF = (NF>maxNF ? NF : maxNF)    # track the largest field count
    next
}
{                                      # second pass: pad every row to maxNF fields,
    for (i=1; i<=maxNF; i++) {         # printing NA for each empty field
        printf "%s%s", ($i == "" ? "NA" : $i), (i < maxNF ? OFS : ORS)
    }
}

$ awk -f tst.awk file file
Bin_37:_Pelotomaculum_sp._DTU098        GH3     GH57    GH15    GH18    GT2     GT4     GT28
Bin_45_1:_Thiopseudomonas_denitrificans GH3     GH57    GT2     GT9     CBM48   NA      NA
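
To see the empty-field handling in action, here is a hypothetical input (the file name file2 and its values are made up) with an empty field in the middle of a row:

$ printf 'x\tGH3\t\tGT2\ny\tGH57\n' > file2
$ awk -f tst.awk file2 file2
x       GH3     NA      GT2
y       GH57    NA      NA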
Ed Morton
  • What about GT9 and CBM48 not being in the first result line? Shouldn't we get NA there twice instead? I like the approach, but it looks like the logic here is a bit different. – Yurko Feb 03 '20 at 19:31
  • Right, I posted a solution for my interpretation of what the OP is trying to do, and in my interpretation the values of the populated fields are irrelevant; the only thing that matters is populating empty fields with "NA". It looks like we all have different interpretations of the OP's requirements and none of them will produce the output they want from the input they posted, so idk for sure what the OP is really trying to do. – Ed Morton Feb 03 '20 at 19:58
  • I would like to say thanks for this suggestion too. It is a little bit tricky for me to understand the script, but thank you for the effort to lend a hand and help me out. – Roland Feb 04 '20 at 08:14
  • You're welcome. To be clear - even if it produces output you're happy with, the shell script you accepted as the answer is completely the wrong approach for whatever it is you're trying to do, see why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some of the issues. – Ed Morton Feb 04 '20 at 13:39
  • Thank you for the advice. Yes, maybe if I have bigger files I should use your suggested way. I have to dive more deeply into the world of scripts and shells :). – Roland Feb 06 '20 at 07:26
  • My advice is about much more than just how much slower a shell loop will be to run. The guys who invented shell to create/destroy files/processes and sequence calls to tools also invented awk as the tool for shell to call to manipulate text. Just sayin'... – Ed Morton Feb 06 '20 at 14:58