1

I have

Sample_A   100
Sample_A   200
Sample_B   300
Sample_B   100

And I want to print the average of the values in row 2 for each key in row 1

Sample_A   150
Sample_B   200

I can print the sums of the values in row 2 for each key in row 1 using the excellent answer to another question: Sum First Column on basis of Second Column

The command is:

awk 'NR { k = $1; cnt[k] += $2 } END { print; for (k in cnt) print k,cnt[k]}' File.txt

And this produces

Sample_A  300
Sample_B  400

But in order to calculate the average, I need a way to save the number of occurrences of the key, something like

awk 'NR { k = $1; cnt[k] += $2; count(k)=$2} END { print; for (k in cnt) print k,cnt[k]/count(k)}' File.txt

But my count(k) code is kind of a shot in the dark and doesn't work.

αғsнιη
  • 41,407
  • You're thinking along the right lines, except awk arrays are accessed using square brackets rather than parentheses and you need to increment count rather than assign $2 to it i.e. count[k]++ or count[k] += 1 – steeldriver Aug 30 '18 at 17:46

2 Answers2

4

With awk you could do:

awk '{seen[$1]+=$2; count[$1]++} END{for (x in seen)print x, seen[x]/count[x]}' infile
Sample_A 150
Sample_B 200

Or using GNU datamash:

datamash -t' ' --sort --group 1 mean 2 <infile
Sample_A 150
Sample_B 200
αғsнιη
  • 41,407
1
$ awk '{ sum[$1] += $2; count[$1] += 1 } END { for ( key in count ) { print key, sum[key] / count[key] } }' input
Sample_A 150
Sample_B 200

To derive a mean, you need two things: a count of the number of entities being averaged, and the sum of those values. We use an array, count for the former, and an array sum for the latter. The keys in each array are the assigned to the first column in your data file.

We then use an END clause once the data are being collected to look at each array to get the sums and counts, divide one by the other, and display the results.

The awk script, reformatted to not all be on one line, looks like this:

{ 
  sum[$1] += $2 
  count[$1] += 1
} 
END { 
  for (key in count) { 
    print key, sum[key] / count[key] 
  } 
}
DopeGhoti
  • 76,081