Delete columns that sum to zero

Question

I have a numeric table, that is all cells have numbers. It is a tab-delimited file, with non-numeric headers and row names. I need to delete all columns that add up zero. I would like to keep the first column (row names), and the headers of the rest of the columns that were not removed.

Input

Output

Analogous problem but with rows: delete lines that sum to zero

awk solution would be awesome; I want to avoid loading huge files in R.

give us a sample input/output example as well as what you've tried so far... — Sundeep, Aug 31 '16 at 10:27

score 2 · Answer 1 · edited Aug 31 '16 at 13:18

to remove column

c.awk :

 { for(i=1;i<=NF;i++) { line[NR][i]=$i ; col[i]+=$i ;} }
END {
 for ( l=1 ; l<=NR ; l++ )
  {
    printf line[l][1]   "\t" ;
    for (c=2;c<=NF;c++) if (col[c]) printf line[l][c]  "\t" ;
    printf "\n" ;
  }
}

where

{ for(i=1;i<=NF;i++) { line[NR][i]=$i ; col[i]+=$i ;} } store all line (including col names).
END clause print all column if count != 0 .
beware that all data are kept in memory.

test:

awk -f c.awk a
a   b       c
e   1       2
f   3       4
g   5       6

for line solution ...

try

 awk 'NR==1 {print } NR>1 { s=0 ; for(i=1;i<=NF;i++) s+=$i ; if (s) print ;}'

where

NR==1 {print } print header
NR>1 { s=0 ; for(i=1;i<=NF;i++) s+=$i ; if (s) print ;} test for 0 and print if not
you may start with i=2 if first column is row name.
beware of floating point numbers, they may not sum up to 0.

note this will output lines, not delete from original file.

Stéphane Chazelas · Answer 2 · 2016-08-31T14:35:40.620

It may be easier with perl if you want to keep the spacing:

perl -lne '
   $i = 0;
   for (/\S+\s*/g) {
      $cell[$.][$i] = $_;
      $sum[$i++] += $_
   }
   END{
     @keep=(0, grep {$sum[$_]} (1..$#sum));
     print((@{$cell[$_]})[@keep]) for (1..$.)
   }'

That loads the whole file in memory. To avoid that, you'd need two passes in the file.

That could be done with a combination of awk and sed:

awk '
  NR>1{for (i=2; i<=NF; i++) sum[i]+=$i; if (NF>n) n = NF}
  END {
    for (;n>1;n--)
      if (!sum[n])
        print "s/[^[:blank:]]\\{1,\\}[[:blank:]]*//" n
  }' < file | sed -f - file

awk generating the sed script to remove the columns whose sum is 0. The s/[^[:blank:]]\{1,\}[[:blank:]]*//3 sed commands would remove the columns while preserving the spacing of the other columns, but would be quite expensive, you may want to do that stripping in perl if performance is an issue.

For rows, it's a lot easier:

perl -MList::Util=sum -lane 'print if $. == 1 or sum @F'

score 0 · Answer 3 · answered Aug 31 '16 at 11:54

Since those values are always integers you could do something like:

cut $(awk 'NR>1{for(i=2;i<=NF;i++) s[i]+=$i}END{printf("%s", "-f 1");
for (i=2;i<=NF;i++) {if (s[i]) printf(",%s", i)}}' infile) infile

this reads the file twice: awk gets the column numbers where the sum isn't zero; these are then used with cut to print only the desired columns.

Delete columns that sum to zero

3 Answers3