
I would like to extract tab-delimited columns from a text file ("columns.txt") in which the header (first line) matches certain strings listed in another text file ("strings.txt").

"columns.txt" looks like this:

rs2438689   rs54666437   rs9877702046   rs025436779...
0           0            0              1
1           1            2              2 
0           1            2              0 
...         ...          ...            ...

"strings.txt" looks like this:

rs2438689
rs9877702046   
...

The output text file "output.txt" should look like this (tab-delimited):

rs2438689   rs9877702046...
0           0              
1           2               
0           2               
...         ...    

Any suggestions on how to do this with awk? Thank you!

mitch
    What have you done so far, where are you stuck? – Panki May 27 '19 at 14:40
  • I have a script that matches a single string (e.g. "rs2438689") and removes all columns containing it: awk -F '\t' -v OFS='\t' 'FNR==1{for(i=1;i<=NF;++i)if($i!~/rs2438689/)k[i]=1}{n=split($0,f,FS);$0=j="";for(i=1;i<=n;++i)if(i in k)$(++j)=f[i]}1' Problems: (1) how to read in multiple strings from another text file, and (2) how to keep the columns matching those strings and delete the rest? Thank you! – mitch May 27 '19 at 15:01
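For reference, the single-string approach in the comment above can be extended to read the wanted names from "strings.txt" first, using awk's two-file NR==FNR idiom. This is only a sketch; the sample-input lines reproduce the files from the question for illustration:

```shell
# Sample inputs reproducing the question's files (for illustration only)
printf 'rs2438689\nrs9877702046\n' > strings.txt
printf 'rs2438689\trs54666437\trs9877702046\trs025436779\n' >  columns.txt
printf '0\t0\t0\t1\n1\t1\t2\t2\n0\t1\t2\t0\n'               >> columns.txt

awk -F'\t' -v OFS='\t' '
    NR==FNR { want[$1]=1; next }                 # strings.txt: remember wanted header names
    FNR==1  { for (i=1; i<=NF; i++)              # header line of columns.txt:
                  if ($i in want) keep[++n]=i }  #   record the indices of wanted columns
    { line = ""
      for (j=1; j<=n; j++) line = line (j>1 ? OFS : "") $(keep[j])
      print line
    }
' strings.txt columns.txt > output.txt
```

Because the membership test is done on the header line only, the data rows are handled by column index, so the order of names in "strings.txt" does not matter (the output keeps the column order of "columns.txt").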

5 Answers


Instead of Awk, how about making a comma-separated list of column names from strings.txt, and passing that to csvtool's namedcol operation:

$ csvtool -t TAB -u TAB namedcol "$(paste -sd, < strings.txt)" columns.txt
rs2438689   rs9877702046
0   0
1   2
0   2
... ...

or similarly with csvcut/csvformat from the Python-based csvkit:

$ csvcut -tc "$(paste -sd, < strings.txt)" columns.txt | csvformat -T
rs2438689   rs9877702046
0   0
1   2
0   2
... ...
steeldriver

With perl

$ perl -F'\t' -lane 'if(!$#ARGV){ $h{$_}=1 }
                     else{ @i = grep { $h{$F[$_]} == 1 } 0..$#F if !$c++;
                           print join "\t", @F[@i]}' strings.txt columns.txt
rs2438689   rs9877702046
0   0
1   2
0   2
  • if(!$#ARGV){ $h{$_}=1 } for first input file, create a hash with line content as key
  • @i = grep { $h{$F[$_]} == 1 } 0..$#F if !$c++ for first line of second file, create an index list of all matching column names from the hash
  • print join "\t", @F[@i] print the matching columns
Sundeep
  • 1. What would be the solution if the headers in "columns.txt" looked like this: rs2438689_G rs54666437_A rs9877702046_C rs025436779_A ... so that "strings.txt" only matches the substrings rs2438689, rs9877702046, ...? 2. How can I tell perl to copy the first 6 columns unchanged and do the extraction from column 7 to the final column? 3. How can I tell perl to print the columns to a new text file "columns_new.txt"? Thank you! – mitch May 28 '19 at 06:57
  • for the 3rd question, just add > columns_new.txt to the command; that is a shell feature and will work with other commands too. For the first two, I get the feeling that your nature of work might require more such adjustments, so I'd advise learning awk/perl from the basics. I have a repo https://github.com/learnbyexample/Command-line-text-processing dedicated to such tools which might help you – Sundeep May 28 '19 at 08:11
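For the first two follow-up questions, here is a hedged awk sketch of one possible approach (awk rather than perl, but the same idea as the answer above). The "_G"/"_A" suffixes and the 6 leading ID columns are assumptions taken from the comment, not from the original files: it strips everything from the first underscore before matching, copies columns 1–6 unconditionally, and tests only columns 7 onward.

```shell
# Hypothetical inputs for the follow-up questions
# (suffixed headers and 6 leading ID columns are assumed)
printf 'rs2438689\nrs9877702046\n' > strings.txt
{ printf 'c1\tc2\tc3\tc4\tc5\tc6\trs2438689_G\trs54666437_A\trs9877702046_C\n'
  printf 'a\tb\tc\td\te\tf\t0\t0\t1\n'
} > columns.txt

awk -F'\t' -v OFS='\t' '
    NR==FNR { want[$1]=1; next }             # wanted name prefixes from strings.txt
    FNR==1  { for (i=7; i<=NF; i++) {        # only test columns 7..NF
                  p = $i; sub(/_.*/, "", p)  # strip "_G", "_A", ... suffix
                  if (p in want) keep[i]=1
              }
            }
    { line = $1
      for (j=2; j<=NF; j++)
          if (j<=6 || j in keep) line = line OFS $j
      print line
    }
' strings.txt columns.txt > columns_new.txt
```

The `> columns_new.txt` redirection at the end covers the third question, as noted in the comment above.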