Extract columns from text file matching header substrings

Question

I would like to extract tab-delimited columns from a text file "columns.txt" whose header (first line) entries contain substrings listed in another text file, "strings.txt".
"columns.txt" looks like this:

A   B   C   D   E   F   rs243_A   rs546_G   rs987_T   rs025_C   ...
A   B   C   D   E   F   0         0         0         1         ...
A   B   C   D   E   F   1         1         2         2         ...
A   B   C   D   E   F   0         1         2         0         ...
... ... ... ... ... ... ...       ...       ...       ...       ...

"strings.txt" looks like this:

rs243
rs987
...

The output text file should copy columns 1-6 from "columns.txt" and then add all extracted columns (tab-delimited) specified in "strings.txt". The output file "output.txt" should look like this:

A   B   C   D   E   F   rs243   rs987   ...
A   B   C   D   E   F   0       0       ...
A   B   C   D   E   F   1       2       ...
A   B   C   D   E   F   0       2       ...
... ... ... ... ... ... ...     ...     ...

The code I am using prints columns 1-6 to "output.txt" as I want, but does not add the extracted columns:

awk -F '\t' -f /data/p_00614/ABCD/scripts/extract.awk /data/strings.txt /data/columns.txt > /data/output.txt

with "extract.awk":

BEGIN { OFS = FS }

FNR == NR {
    sub("_.*", "", $1)
    columns[$1] = 1
    next
}

FNR == 1 {
    for (i = 1; i <= NF; ++i)
        if (i <= 6 || $i in columns)
            keep[i] = 1
}

{
    nf = split($0, fields, FS)
    $0 = ""
    j = 0

    for (i = 1; i <= nf; ++i)
        if (i in keep)
            $(++j) = fields[i]

    print
}

I think that

sub("_.*", "", $1)

does not work. "_.*" probably does not cut every substring starting from _ but just exact matches. Any suggestions on how to fix this? Thank you!

Does the array contain what you expect? Try printing it { for (i in keep) print keep[i] }. Also: maybe bundle your questions - they are all on the same problem. — FelixJN, May 28 '19 at 08:42
No it doesn't. It contains columns 1-6, but not the additional columns. — mitch, May 28 '19 at 08:50
Your titles/headers don't seem to match strings.txt as they have _X in them. Without these letters the script works. Also consider using cut as described before and add cut -f 1,6. — FelixJN, May 28 '19 at 08:59
But sub("_.*", "", $1) should remove all _X so that strings should match!? — mitch, May 28 '19 at 09:09
OK. sub("_.*", "", $1) does not remove _X so that strings do not match and nothing is printed to the output. I will ask how to fix the sub command. — mitch, May 28 '19 at 09:24
Please [edit] your question and show us actual data we can use to test our answers. You talk about columns 1-6 but only show us 4 columns. We need to see an accurate and full example of a few lines of your input data and the output you would expect from that input data. By the way, based on the files you're showing, I think you might be interested in our sister site, [bioinformatics.se]. — terdon, May 28 '19 at 09:55

score 0 · Accepted Answer · answered May 28 '19 at 11:37

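This is a bug in the code that I provided in an earlier answer to one of your questions (now corrected). The regular expression itself is fine: sub("_.*", "", s) deletes everything from the first underscore to the end of s. The problem is where it is applied. The _.* suffix should be removed not from the strings read from strings.txt (which contain no underscore, so the call does nothing), but from the header fields read from columns.txt.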
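As a quick check of that claim, here is a minimal sketch that runs the same sub() call on one header field and on one strings.txt entry. The two sample values come from the question; the variable names are only for illustration.

```shell
# sub() edits its third argument in place and returns the number of
# substitutions it made (0 or 1).
header_result=$(printf 'rs243_A\n' | awk '{ n = sub("_.*", "", $1); print n, $1 }')
string_result=$(printf 'rs243\n'   | awk '{ n = sub("_.*", "", $1); print n, $1 }')

echo "$header_result"   # header field from columns.txt: one substitution, suffix removed
echo "$string_result"   # entry from strings.txt: no underscore, so nothing changes
```

The first line reports one substitution and the second reports none, which is why stripping the suffix while reading strings.txt had no effect on the matching.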

Corrected script:

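# Write output with the same delimiter as the input (tab, via -F '\t').
BEGIN { OFS = FS }

# First file (strings.txt): remember each name to extract.
FNR == NR {
    columns[$1] = 1
    next
}

# Header line of the second file (columns.txt): strip the "_X" suffix from
# each header field, then mark columns 1-6 and every listed name as kept.
FNR == 1 {
    for (i = 1; i <= NF; ++i) {
        sub("_.*", "", $i)
        if (i <= 6 || $i in columns)
            keep[i] = 1
    }
}

# Every line of columns.txt (header included): rebuild the record from
# the kept fields only.
{
    nf = split($0, fields, FS)
    $0 = ""
    j = 0

    for (i = 1; i <= nf; ++i)
        if (i in keep)
            $(++j) = fields[i]

    print
}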

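Note the very slight changes to the FNR == NR and FNR == 1 blocks from what you have in the question: the sub() call has moved out of the FNR == NR block and into the header loop, where it is applied to each $i. Because sub() modifies the header fields themselves, the output header shows the shortened names (rs243 rather than rs243_A), as in your expected output. It is applied to the first six header fields as well, so any of those containing an underscore would also be shortened.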
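To check the corrected script end to end, here is a small sketch that runs it on a three-line sample modeled on the data in the question. The scratch directory and the sample values are assumptions made for illustration; the script body is the same as the one shown above.

```shell
# Work in a scratch directory so no existing file is overwritten.
workdir=$(mktemp -d)

# Tab-separated sample input modeled on the question (illustrative values).
printf 'A\tB\tC\tD\tE\tF\trs243_A\trs546_G\trs987_T\trs025_C\n' >  "$workdir/columns.txt"
printf 'A\tB\tC\tD\tE\tF\t0\t0\t0\t1\n'                         >> "$workdir/columns.txt"
printf 'A\tB\tC\tD\tE\tF\t1\t1\t2\t2\n'                         >> "$workdir/columns.txt"
printf 'rs243\nrs987\n' > "$workdir/strings.txt"

# The corrected script, written to a file.
cat > "$workdir/extract.awk" <<'EOF'
BEGIN { OFS = FS }

FNR == NR {
    columns[$1] = 1
    next
}

FNR == 1 {
    for (i = 1; i <= NF; ++i) {
        sub("_.*", "", $i)
        if (i <= 6 || $i in columns)
            keep[i] = 1
    }
}

{
    nf = split($0, fields, FS)
    $0 = ""
    j = 0

    for (i = 1; i <= nf; ++i)
        if (i in keep)
            $(++j) = fields[i]

    print
}
EOF

# Same invocation as in the question: strings.txt first, then columns.txt.
output=$(awk -F '\t' -f "$workdir/extract.awk" "$workdir/strings.txt" "$workdir/columns.txt")
printf '%s\n' "$output"

rm -r "$workdir"
```

With this sample, the output keeps columns 1-6 followed by the rs243 and rs987 columns. On the real data the invocation stays as in the question, with the output redirected to output.txt.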
