1

I have 3000 files (1.out, 2.out, 3.out...., each with a single column something like this:

0.446477
0.439331
0.444394
0.425003
0.428981
0.419547
0.432834
0.417874
........

I need to calculate average and standard deviation for each row across 3000 files. I could calculate the average using:

awk '{a[FNR]+=$1;b[FNR]++;}END{for(i=1;i<=FNR;i++)print a[i]/b[i];}' *.out

But I am stuck with the calculation of standard deviation.

Rui F Ribeiro
  • 56,709
  • 26
  • 150
  • 232
ashu
  • 23
  • Where are you stuck? You don't know the formula for standard deviation? – muru Aug 07 '17 at 04:42
  • @muru The formula is sqrt (1/n*(sum(Xi-mean)^2)). I don't know how to use this in scripting. Thanks ! – ashu Aug 07 '17 at 04:50
  • You can use sqrt in awk: https://unix.stackexchange.com/a/336613/70524 – muru Aug 07 '17 at 04:56
  • do you want to print average and standard deviation for each file separately? – RomanPerekhrest Aug 07 '17 at 05:36
  • @RomanPerekhrest : I need a single average i.e. First row from each file to be averaged and written as single line in out file. Followed by second line from each file..like wise. And each row will have a standard deviation value for it's average. Thanks ! – ashu Aug 07 '17 at 05:47
  • @ashu, still unclear, your "avarage" logic is unclear, should it join all files and take mean of all columns of each row? – RomanPerekhrest Aug 07 '17 at 05:50
  • @RomanPerekhrest : For example:

    Input files

    1.out

    2 4 5 6

    2.out

    1 2 3 5

    3.out

    4 5 6 7

    Output file should have:

    (2+1+4)/3 std_dev1 (4+2+5)/3 std_dev2 (5+3+6)/3 std_dev3 (6+5+7)/3 std_dev4

    – ashu Aug 07 '17 at 05:56
  • @RomanPerekhrest : Yes, I need mean of all columns for each row by joining all the individual files. Thanks a lot ! – ashu Aug 07 '17 at 06:00

2 Answers2

2

Complex solution using paste command and datamash tool:

  1. Merging all files into one file for further processing:

    paste [0-9]*.out | datamash transpose > data
    

    -- datamash transpose- will transpose rows into columns

  2. Getting total number of fields:

    nf=`awk '{print NF; exit}' data`
    
  3. Calculating mean and standard deviation:

    for ((i=1; i<$nf; i++)); do datamash mean $i pstdev $i < data; done 
    

The output would look like (1st column - mean value, 2nd column - st. deviation value):

0.596477    0.11180339887499
0.589331    0.11180339887499
0.594394    0.11180339887499
0.575003    0.11180339887499
0.578981    0.11180339887499
0.569547    0.11180339887499
0.582834    0.11180339887499
.......
1

Standard deviation can be calculated in one pass without modifying your script much.

awk '{a[FNR]+=$1; b[FNR]++; c[FNR]+=$1*$1 } 
     END{
        for(i=1;i<=FNR;i++)
          print a[i]/b[i], sqrt((c[i]-a[i]*a[i]/b[i])/(b[i]-1)) ;
     }
' *.out

This is the naive implementation, another implementation, the "online method" on the same wiki page :

awk '{
  x=$1
  n[FNR] += 1
  delta = x - mean[FNR]
  mean[FNR] += delta/n[FNR]
  delta2 = x - mean[FNR]
  M2[FNR] += delta * delta2
}
END{
for(i=1;i<=FNR;i++)
        if(n[i]<2)
                print mean[i], 0
        else
                print mean[i], sqrt(M2[i]/(n[i]-1))
}' *.out
Emmanuel
  • 4,187