
I have data structured like this:

X   43808504    G   1   ^]. <
X   43808505    C   3   .   4
X   43808506    T   8   .   ?
X   43808507    G   5   .   C

I want to get the max (8), min (1), and mean (4.25) from column 4 and write that to a file.

I've been wrestling with sorting the data and then cutting pieces away, but that seems really inefficient.

Thanks for any help

jesse_b

2 Answers


Using awk:

awk 'NR == 1 { min = $4; max = $4 }
{
    sum += $4
    if ($4 > max) {
        max = $4
    }
    if ($4 < min) {
        min = $4
    }
} END {
    print max
    print min
    print sum / NR
}' input

First, on line 1, we set both min and max to the value of the 4th column. Then, for every line, we check whether the value in column 4 is greater than the current max or less than the current min; if so, we update max or min to that value, respectively.

We also keep a running sum of all values in column 4. In the END block this sum is divided by NR, the total number of rows read, to get the mean.

At the end we print the max, min, and mean.
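If the input might contain blank lines, or you want the result written straight to a file as the question asks, a minimal sketch could look like the following. The output file name stats.txt, the counter n, and the labeled one-line output format are assumptions made for illustration; they are not part of the command above.

```shell
# Recreate the sample data from the question (whitespace-separated columns).
cat > input <<'EOF'
X   43808504    G   1   ^]. <
X   43808505    C   3   .   4
X   43808506    T   8   .   ?
X   43808507    G   5   .   C
EOF

# Same single-pass logic as above, with two changes:
#   - blank lines are skipped and counted out of the mean (n instead of NR)
#   - the result is redirected to stats.txt (hypothetical file name)
awk 'NF == 0 { next }                      # ignore empty lines
n++ == 0 { min = max = $4 + 0 }            # first data line seeds min and max
{
    v = $4 + 0                             # force a numeric value
    sum += v
    if (v > max) max = v
    if (v < min) min = v
}
END {
    # Avoid dividing by zero when the input has no data lines.
    if (n > 0) printf "max=%s min=%s mean=%s\n", max, min, sum / n
}' input > stats.txt

cat stats.txt
```

Counting data lines in n rather than relying on NR keeps the mean correct when the file has empty lines; for input like the sample, both approaches give the same numbers.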

jesse_b

With Miller:

$ mlr --nidx --repifs stats1 -a 'min,max,mean' -f 4 data
1 8 4.250000

You can redirect the output to a file in the usual way, by appending > file to the command.

With GNU datamash:

$ datamash -W min 4 max 4 mean 4 < data
1   8   4.25
steeldriver