mean and standard deviation from multiple files

Question

I have 3000 files (1.out, 2.out, 3.out...., each with a single column something like this:

I need to calculate average and standard deviation for each row across 3000 files. I could calculate the average using:

awk '{a[FNR]+=$1;b[FNR]++;}END{for(i=1;i<=FNR;i++)print a[i]/b[i];}' *.out

But I am stuck with the calculation of standard deviation.

Where are you stuck? You don't know the formula for standard deviation? — muru, Aug 07 '17 at 04:42
@muru The formula is sqrt (1/n*(sum(Xi-mean)^2)). I don't know how to use this in scripting. Thanks ! — ashu, Aug 07 '17 at 04:50
You can use sqrt in awk: https://unix.stackexchange.com/a/336613/70524 — muru, Aug 07 '17 at 04:56
do you want to print average and standard deviation for each file separately? — RomanPerekhrest, Aug 07 '17 at 05:36
@RomanPerekhrest : I need a single average i.e. First row from each file to be averaged and written as single line in out file. Followed by second line from each file..like wise. And each row will have a standard deviation value for it's average. Thanks ! — ashu, Aug 07 '17 at 05:47
@ashu, still unclear, your "avarage" logic is unclear, should it join all files and take mean of all columns of each row? — RomanPerekhrest, Aug 07 '17 at 05:50
@RomanPerekhrest : For example:
Input files

1.out

2 4 5 6

2.out

1 2 3 5

3.out

4 5 6 7

Output file should have:

(2+1+4)/3 std_dev1 (4+2+5)/3 std_dev2 (5+3+6)/3 std_dev3 (6+5+7)/3 std_dev4 — ashu, Aug 07 '17 at 05:56
@RomanPerekhrest : Yes, I need mean of all columns for each row by joining all the individual files. Thanks a lot ! — ashu, Aug 07 '17 at 06:00

score 2 · Answer 1 · answered Aug 07 '17 at 07:56

Complex solution using paste command and datamash tool:

Merging all files into one file for further processing:
```
paste [0-9]*.out | datamash transpose > data
```
-- datamash transpose- will transpose rows into columns
Getting total number of fields:
```
nf=`awk '{print NF; exit}' data`
```

Calculating mean and standard deviation:

for ((i=1; i<$nf; i++)); do datamash mean $i pstdev $i < data; done

The output would look like (1st column - mean value, 2nd column - st. deviation value):

0.596477    0.11180339887499
0.589331    0.11180339887499
0.594394    0.11180339887499
0.575003    0.11180339887499
0.578981    0.11180339887499
0.569547    0.11180339887499
0.582834    0.11180339887499
.......

Thanks a lot ! Though looks like datamash is not installed on the Redhat work station I am working. I need to talk to my sys admin for this. — ashu, Aug 07 '17 at 08:41
@ashu, welcome, datamash is indeed a good tool for such cases — RomanPerekhrest, Aug 07 '17 at 08:56

Emmanuel · Answer 2 · 2017-08-08T04:06:32.633

Standard deviation can be calculated in one pass without modifying your script much.

awk '{a[FNR]+=$1; b[FNR]++; c[FNR]+=$1*$1 } 
     END{
        for(i=1;i<=FNR;i++)
          print a[i]/b[i], sqrt((c[i]-a[i]*a[i]/b[i])/(b[i]-1)) ;
     }
' *.out

This is the naive implementation, another implementation, the "online method" on the same wiki page :

awk '{
  x=$1
  n[FNR] += 1
  delta = x - mean[FNR]
  mean[FNR] += delta/n[FNR]
  delta2 = x - mean[FNR]
  M2[FNR] += delta * delta2
}
END{
for(i=1;i<=FNR;i++)
        if(n[i]<2)
                print mean[i], 0
        else
                print mean[i], sqrt(M2[i]/(n[i]-1))
}' *.out

mean and standard deviation from multiple files

2 Answers2