3

I have a numeric table, that is all cells have numbers. It is a tab-delimited file, with non-numeric headers and row names. I need to delete all columns that add up zero. I would like to keep the first column (row names), and the headers of the rest of the columns that were not removed.

Input

a  b  c  d
e  1  2  0
f  3  4  0
g  5  6  0

Output

a  b  c
e  1  2
f  3  4
g  5  6

Analogous problem but with rows: delete lines that sum to zero

awk solution would be awesome; I want to avoid loading huge files in R.

fibar
  • 41

3 Answers3

2

to remove column

c.awk :

 { for(i=1;i<=NF;i++) { line[NR][i]=$i ; col[i]+=$i ;} }
END {
 for ( l=1 ; l<=NR ; l++ )
  {
    printf line[l][1]   "\t" ;
    for (c=2;c<=NF;c++) if (col[c]) printf line[l][c]  "\t" ;
    printf "\n" ;
  }
}

where

  • { for(i=1;i<=NF;i++) { line[NR][i]=$i ; col[i]+=$i ;} } store all line (including col names).
  • END clause print all column if count != 0 .
  • beware that all data are kept in memory.

test:

awk -f c.awk a
a   b       c
e   1       2
f   3       4
g   5       6

for line solution ...

try

 awk 'NR==1 {print } NR>1 { s=0 ; for(i=1;i<=NF;i++) s+=$i ; if (s) print ;}'

where

  • NR==1 {print } print header
  • NR>1 { s=0 ; for(i=1;i<=NF;i++) s+=$i ; if (s) print ;} test for 0 and print if not
  • you may start with i=2 if first column is row name.
  • beware of floating point numbers, they may not sum up to 0.

note this will output lines, not delete from original file.

Jeff Schaller
  • 67,283
  • 35
  • 116
  • 255
Archemar
  • 31,554
1

It may be easier with perl if you want to keep the spacing:

perl -lne '
   $i = 0;
   for (/\S+\s*/g) {
      $cell[$.][$i] = $_;
      $sum[$i++] += $_
   }
   END{
     @keep=(0, grep {$sum[$_]} (1..$#sum));
     print((@{$cell[$_]})[@keep]) for (1..$.)
   }'

That loads the whole file in memory. To avoid that, you'd need two passes in the file.

That could be done with a combination of awk and sed:

awk '
  NR>1{for (i=2; i<=NF; i++) sum[i]+=$i; if (NF>n) n = NF}
  END {
    for (;n>1;n--)
      if (!sum[n])
        print "s/[^[:blank:]]\\{1,\\}[[:blank:]]*//" n
  }' < file | sed -f - file

awk generating the sed script to remove the columns whose sum is 0. The s/[^[:blank:]]\{1,\}[[:blank:]]*//3 sed commands would remove the columns while preserving the spacing of the other columns, but would be quite expensive, you may want to do that stripping in perl if performance is an issue.

For rows, it's a lot easier:

perl -MList::Util=sum -lane 'print if $. == 1 or sum @F'
0

Since those values are always integers you could do something like:

cut $(awk 'NR>1{for(i=2;i<=NF;i++) s[i]+=$i}END{printf("%s", "-f 1");
for (i=2;i<=NF;i++) {if (s[i]) printf(",%s", i)}}' infile) infile

this reads the file twice: awk gets the column numbers where the sum isn't zero; these are then used with cut to print only the desired columns.

don_crissti
  • 82,805