I would like to extract tab-delimited columns from a text file "columns.txt" whose header (first line) matches substrings listed in another text file "strings.txt".
"columns.txt" looks like this:
A B C D E F rs243_A rs546_G rs987_T rs025_C ...
A B C D E F 0 0 0 1 ...
A B C D E F 1 1 2 2 ...
A B C D E F 0 1 2 0 ...
... ... ... ... ... ... ... ... ... ... ...
"strings.txt" looks like this:
rs243
rs987
...
The output text file should copy columns 1-6 from "columns.txt" and then add all extracted columns (tab-delimited) specified in "strings.txt". The output file "output.txt" should look like this:
A B C D E F rs243 rs987 ...
A B C D E F 0 0 ...
A B C D E F 1 2 ...
A B C D E F 0 2 ...
... ... ... ... ... ... ... ... ...
The code I am using prints columns 1-6 to "output.txt" as I want, but does not add the extracted columns:
awk -F '\t' -f /data/p_00614/ABCD/scripts/extract.awk /data/strings.txt /data/columns.txt > /data/output.txt
with "extract.awk":
BEGIN { OFS = FS }
FNR == NR {
    sub("_.*", "", $1)
    columns[$1] = 1
    next
}
FNR == 1 {
    for (i = 1; i <= NF; ++i)
        if (i <= 6 || $i in columns)
            keep[i] = 1
}
{
    nf = split($0, fields, FS)
    $0 = ""
    j = 0
    for (i = 1; i <= nf; ++i)
        if (i in keep)
            $(++j) = fields[i]
    print
}
I think that sub("_.*", "", $1) does not work. "_.*" probably does not cut every substring starting from _ but just exact matches. Any suggestions on how to fix this? Thank you!
{ for (i in keep) print keep[i] }. Also: maybe bundle your questions - they are all on the same problem. – FelixJN May 28 '19 at 08:42
_X in them. Without these letters the script works. Also consider using cut as described before and add cut -f 1,6. – FelixJN May 28 '19 at 08:59
sub("_.*", "", $1) should remove all _X so that strings should match!? – mitch May 28 '19 at 09:09
sub("_.*", "", $1) does not remove _X so that strings do not match and nothing is printed to the output. I will ask how to fix the sub command. – mitch May 28 '19 at 09:24
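One likely explanation, going only by the script as posted: the sub() call itself is fine, but it sits in the FNR == NR block, so it runs on the lines of "strings.txt", which contain no underscore. The header fields of "columns.txt" keep their _A, _G, ... suffixes, so the test `$i in columns` never matches beyond column 6. A minimal sketch of one possible fix is below: strip the suffix from a copy of each header field before the lookup. The temporary directory, the sample data (copied from the question), and the names `wanted`, `name`, `n`, and `line` are assumptions made for this illustration, not part of the original script.

```shell
#!/bin/sh
# Hypothetical demo setup: small sample inputs in a temporary directory.
tmp=$(mktemp -d)
{
    printf 'A\tB\tC\tD\tE\tF\trs243_A\trs546_G\trs987_T\trs025_C\n'
    printf 'A\tB\tC\tD\tE\tF\t0\t0\t0\t1\n'
    printf 'A\tB\tC\tD\tE\tF\t1\t1\t2\t2\n'
    printf 'A\tB\tC\tD\tE\tF\t0\t1\t2\t0\n'
} > "$tmp/columns.txt"
printf 'rs243\nrs987\n' > "$tmp/strings.txt"

awk -F '\t' '
BEGIN { OFS = FS }
# First file (strings.txt): remember the wanted IDs as they are.
FNR == NR { wanted[$1] = 1; next }
# Header of columns.txt: strip the _X suffix from a copy of each name,
# then decide whether the column is kept.
FNR == 1 {
    for (i = 1; i <= NF; ++i) {
        name = $i
        sub(/_.*/, "", name)
        if (i <= 6 || (name in wanted)) {
            keep[++n] = i          # kept column numbers, in input order
            if (i > 6) $i = name   # print the header without its suffix
        }
    }
}
# Every line of columns.txt, header included: print only the kept columns.
{
    line = ""
    for (j = 1; j <= n; ++j)
        line = line (j > 1 ? OFS : "") $(keep[j])
    print line
}
' "$tmp/strings.txt" "$tmp/columns.txt" > "$tmp/output.txt"

cat "$tmp/output.txt"
```

Storing the kept column numbers in an indexed array (rather than iterating with `for (i in keep)`) keeps the output columns in their original order, since awk does not guarantee the traversal order of `for ... in`. The sketch assumes "strings.txt" is non-empty, as the FNR == NR idiom otherwise treats the second file as the first.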