Delete tab-delimited columns matching sub-string in first line

Question

I would like to delete all tab-delimited columns from a text file in which the header (first line) contains the string "_HET". The input text file looks like this:

rs36810213_HET   rs2438689   rs70927523570_HET   rs54666437   ...
1                0           2                   0
0                1           0                   1
2                0           1                   1
...              ...         ...                 ...

The output text file should look like this:

rs2438689   rs54666437   ...
0           0
1           1
0           1
...         ...

The code I am using does not remove anything:

#!/bin/bash

path="/data/folder"

awk -v OFS='\t' '

NR==1{
    for (i=1;i<=NF;i++)
        if ($i=="_HET") {
            n=i-1
            m=NF-(i==NF)
        }
    }

{
    for(i=1;i<=NF;i+=1+(i==n))
        printf "%s%s",$i,i==m?ORS:OFS
}

' $path/input.txt >> $path/output.txt

Any suggestions on how to fix this code? Thank you!

Kusalananda · Accepted Answer · 2019-05-27T13:55:52.467

awk -F '\t' -f script.awk file

where script.awk is

BEGIN { OFS = FS }

FNR == 1 {
    for (i = 1; i <= NF; ++i)
        if ($i !~ /_HET/)
            keep[i] = 1
}

{
    nf = split($0, fields, FS)
    $0 = ""
    j = 0

    for (i = 1; i <= nf; ++i)
        if (i in keep)
            $(++j) = fields[i]

    print
}

This first parses the headers on the first line and records, in the keep associative array, the indices of the columns we want to keep.

Then, for each line, it re-creates the current record (the line) from only the fields we want to keep, and prints it.

It does this by (re-)splitting the line on the current field separator into the array fields, then emptying the record (with $0 = "", which resets NF to 0), and finally assigning back, in order, only those elements of fields whose indices are keys in the keep array.
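
The effect of $0 = "" on NF can be checked in isolation. The following is a minimal sketch, assuming a made-up three-field input line (a, b, c) and keeping fields 1 and 3 by hand instead of through the keep array:

```shell
# Made-up input: three tab-separated fields. Fields 1 and 3 are kept by hand.
out=$(printf 'a\tb\tc\n' | awk -F '\t' '
BEGIN { OFS = FS }
{
    nf = split($0, fields, FS)   # copy the fields before clearing the record
    $0 = ""                      # clears the record; NF becomes 0
    print "NF after reset: " NF
    $1 = fields[1]               # assigning a field rebuilds $0 using OFS
    $2 = fields[3]
    print "NF after rebuild: " NF
    print
}')
printf '%s\n' "$out"
```

This should print NF as 0 after the reset and 2 after the rebuild, followed by the line a<TAB>c.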

Some people like one-liners:

awk -F '\t' -v OFS='\t' 'FNR==1{for(i=1;i<=NF;++i)if($i!~/_HET/)k[i]=1}{n=split($0,f,FS);$0=j="";for(i=1;i<=n;++i)if(i in k)$(++j)=f[i]}1' file

I didn't quite follow your code completely, but $i=="_HET" compares the i-th field to the string _HET. This test fails unless the value of the field is exactly _HET (which none of your header fields is). To test whether a field contains _HET, use a regular-expression match such as $i ~ /_HET/ instead.
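
The difference between the two tests can be seen on two header fields taken from the question. This is a minimal sketch; the variable names exact, regex and found are chosen here for illustration, and index() is included only as an alternative literal substring test that the answer itself does not use:

```shell
# Two header fields from the question; the demo prints 1 for true, 0 for false.
out=$(printf 'rs36810213_HET\trs2438689\n' | awk -F '\t' '
{
    for (i = 1; i <= NF; ++i) {
        exact = ($i == "_HET")           # whole-field string comparison
        regex = ($i ~ /_HET/)            # regular-expression match anywhere in the field
        found = (index($i, "_HET") > 0)  # literal substring search
        printf "%s exact=%d regex=%d index=%d\n", $i, exact, regex, found
    }
}')
printf '%s\n' "$out"
```

The exact comparison is 0 for both fields, while the regular-expression and index() tests are 1 only for rs36810213_HET.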

A totally different approach:

cut -f "$( awk -F '\t' -v OFS="," '{for(i=1;i<=NF;++i)if($i!~/_HET/)k[i]=1;$0="";for(i in k)$(++j)=i;print;exit}' file )" file

This uses the awk program

BEGIN { OFS = "," }

{
    for (i = 1; i <= NF; ++i)
        if ($i !~ /_HET/)
            keep[i] = 1

    $0 = ""

    for (i in keep)
        $(++j) = i

    print
    exit
}

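not to output the contents of the wanted columns, but to output their column numbers as a comma-delimited string. This string is then used by cut to cut out the columns from the data. The order in which for (i in keep) visits the indices is unspecified, but that does not matter here, because cut always writes the selected fields in the order they appear in the input.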
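
To see the intermediate column list and the final result separately, the two steps can be run one after the other. This is a sketch under stated assumptions: the sample data is a shortened copy of the input from the question written to a temporary file, and the list is built by string concatenation in ascending order (a variation on the program above, not the same code) so that the printed list is predictable:

```shell
# Sample data modelled on the question, written to a temporary file.
tmp=$(mktemp)
printf 'rs36810213_HET\trs2438689\trs70927523570_HET\trs54666437\n1\t0\t2\t0\n0\t1\t0\t1\n' > "$tmp"

# Build the list of wanted column numbers from the header line only.
# The numbers are appended in ascending order, so the list is predictable
# (cut itself does not need the list to be sorted).
list=$(awk -F '\t' '
{
    for (i = 1; i <= NF; ++i)
        if ($i !~ /_HET/)
            s = s (s == "" ? "" : ",") i
    print s
    exit
}' "$tmp")
printf 'columns: %s\n' "$list"

# Let cut extract those columns; tab is its default delimiter.
result=$(cut -f "$list" "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

For this sample the list is 2,4, and cut prints only the rs2438689 and rs54666437 columns.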

Thank you. The output is space-delimited. How can I have it tab-delimited? — mitch, May 27 '19 at 13:47
@mitch I fixed this in a recent revision to the answer already. OFS must be set to the same thing as FS, and you set FS to a tab using -F '\t' as shown. — Kusalananda, May 27 '19 at 13:50
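
The space-delimited output mentioned in the comments can be reproduced on a made-up two-field line. A minimal sketch, in which the assignment $1 = $1 is used only to force awk to rebuild the record:

```shell
# Made-up two-field line. Rebuilding the record joins the fields with OFS.
default_ofs=$(printf 'a\tb\n' | awk -F '\t' '{ $1 = $1; print }')
tab_ofs=$(printf 'a\tb\n' | awk -F '\t' -v OFS='\t' '{ $1 = $1; print }')
printf '%s\n%s\n' "$default_ofs" "$tab_ofs"
```

With the default OFS (a single space) the first line comes out as "a b"; with OFS set to a tab the second line keeps its tab.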

score 0 · Answer 2 · answered May 27 '19 at 16:01

You can do this with Perl as shown:

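$ perl -F'/\t/' -pale '$"="\t";   # join interpolated array slices with tabs
    # on the header line, remember the indices of the columns without "_HET"
    $. == 1 and @A = grep { $F[$_] !~ /_HET/ } 0 .. $#F;
    # replace the current line with only the kept fields
    $_ = "@F[@A]";
' input.tsv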

Rakesh Sharma
