
I would like to extract tab-delimited columns from a text file ("columns.txt") in which the header (first line) matches certain strings listed in another text file ("strings.txt").

"columns.txt" looks like this:

rs2438689   rs54666437   rs9877702046   rs025436779...
0           0            0              1
1           1            2              2 
0           1            2              0 
...         ...          ...            ...

"strings.txt" looks like this:

rs2438689
rs9877702046   
...

The output text file "output.txt" should look like this (tab-delimited):

rs2438689   rs9877702046...
0           0              
1           2               
0           2               
...         ...    

Any suggestions on how to do this with awk? Thank you!

mitch
    What have you done so far, where are you stuck? – Panki May 27 '19 at 14:40
  • I have a script that matches a single string (e.g. "rs2438689") and removes all columns containing it: awk -F '\t' -v OFS='\t' 'FNR==1{for(i=1;i<=NF;++i)if($i!~/rs2438689/)k[i]=1}{n=split($0,f,FS);$0=j="";for(i=1;i<=n;++i)if(i in k)$(++j)=f[i]}1' Problems: (1) how to read in multiple strings from another text file, and (2) how to keep the columns matching those strings and delete the rest? Thank you! – mitch May 27 '19 at 15:01
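For reference, the single-string approach in the comment above can be extended to read the wanted names from "strings.txt" first, using awk's two-file NR==FNR idiom. This is only a sketch; the sample-input lines reproduce the files from the question for illustration:

```shell
# Sample inputs reproducing the question's files (for illustration only)
printf 'rs2438689\nrs9877702046\n' > strings.txt
printf 'rs2438689\trs54666437\trs9877702046\trs025436779\n' >  columns.txt
printf '0\t0\t0\t1\n1\t1\t2\t2\n0\t1\t2\t0\n'               >> columns.txt

awk -F'\t' -v OFS='\t' '
    NR==FNR { want[$1]=1; next }                 # strings.txt: remember wanted header names
    FNR==1  { for (i=1; i<=NF; i++)              # header line of columns.txt:
                  if ($i in want) keep[++n]=i }  #   record the indices of wanted columns
    { line = ""
      for (j=1; j<=n; j++) line = line (j>1 ? OFS : "") $(keep[j])
      print line
    }
' strings.txt columns.txt > output.txt
```

Because the membership test is done on the header line only, the data rows are handled by column index, so the order of names in "strings.txt" does not matter (the output keeps the column order of "columns.txt").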

5 Answers


Instead of Awk, how about making a comma-separated list of column names from strings.txt, and passing that to csvtool's namedcol operation:

$ csvtool -t TAB -u TAB namedcol "$(paste -sd, < strings.txt)" columns.txt
rs2438689   rs9877702046
0   0
1   2
0   2
... ...

or similarly with csvcut/csvformat from the Python-based csvkit:

$ csvcut -tc "$(paste -sd, < strings.txt)" columns.txt | csvformat -T
rs2438689   rs9877702046
0   0
1   2
0   2
... ...
steeldriver

With perl

$ perl -F'\t' -lane 'if(!$#ARGV){ $h{$_}=1 }
                     else{ @i = grep { $h{$F[$_]} == 1 } 0..$#F if !$c++;
                           print join "\t", @F[@i]}' strings.txt columns.txt
rs2438689   rs9877702046
0   0
1   2
0   2
  • if(!$#ARGV){ $h{$_}=1 } for first input file, create a hash with line content as key
  • @i = grep { $h{$F[$_]} == 1 } 0..$#F if !$c++ for first line of second file, create an index list of all matching column names from the hash
  • print join "\t", @F[@i] print the matching columns
Sundeep
  • 1. What would be the solution if the headers in "columns.txt" looked like this: rs2438689_G rs54666437_A rs9877702046_C rs025436779_A ... so that "strings.txt" only matches the substrings rs2438689, rs9877702046, ...? 2. How can I tell perl to copy the first 6 columns unchanged and do the extraction from column 7 to the final column? 3. How can I tell perl to print the columns to a new text file "columns_new.txt"? Thank you! – mitch May 28 '19 at 06:57
  • for the 3rd question, just add > columns_new.txt to the command; that is a shell feature and will work with other commands too. For the first two, I get the feeling that your nature of work might require more such adjustments, so I'd advise learning awk/perl from the basics. I have a repo https://github.com/learnbyexample/Command-line-text-processing dedicated to such tools which might help you – Sundeep May 28 '19 at 08:11
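For the first two follow-up questions, here is a hedged awk sketch of one possible approach (awk rather than perl, but the same idea as the answer above). The "_G"/"_A" suffixes and the 6 leading ID columns are assumptions taken from the comment, not from the original files: it strips everything from the first underscore before matching, copies columns 1–6 unconditionally, and tests only columns 7 onward.

```shell
# Hypothetical inputs for the follow-up questions
# (suffixed headers and 6 leading ID columns are assumed)
printf 'rs2438689\nrs9877702046\n' > strings.txt
{ printf 'c1\tc2\tc3\tc4\tc5\tc6\trs2438689_G\trs54666437_A\trs9877702046_C\n'
  printf 'a\tb\tc\td\te\tf\t0\t0\t1\n'
} > columns.txt

awk -F'\t' -v OFS='\t' '
    NR==FNR { want[$1]=1; next }             # wanted name prefixes from strings.txt
    FNR==1  { for (i=7; i<=NF; i++) {        # only test columns 7..NF
                  p = $i; sub(/_.*/, "", p)  # strip "_G", "_A", ... suffix
                  if (p in want) keep[i]=1
              }
            }
    { line = $1
      for (j=2; j<=NF; j++)
          if (j<=6 || j in keep) line = line OFS $j
      print line
    }
' strings.txt columns.txt > columns_new.txt
```

The `> columns_new.txt` redirection at the end covers the third question, as noted in the comment above.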