4

I want to get the exact number when I try to find the average of a column of values.

For example, this is the column of input values:

1426044
1425486
1439480
1423677
1383676
1360088
1390745
1435123
1422970
1394461
1325896
1251248
1206005
1217057
1168298
1153022
1199310
1250162
1247917
1206836

When I use the following command:

... | awk '{ sum+=$1} END { print sum/NR}'

I get the following output: 1.31638e+06. However, I want the exact number, which is 1316375.05 or even better, in this format 1,316,375.05

How can I do this with command line tools only?

EDIT 1

I found the following one-liner awk command which will get me the max, min and mean:

awk 'NR == 1 { max=$1; min=$1; sum=0 } { if ($1>max) max=$1; if ($1<min) min=$1; sum+=$1;} END {printf "Min: %d\tMax: %d\tAverage: %.2f\n", min, max, sum/NR}'

Why is it that NR must be initialized as 1? When I delete NR == 1, I get the wrong result.

EDIT 2

I found the following awk script from Is there a way to get the min, max, median, and average of a list of numbers in a single command?. It will get the sum, count, mean, median, max, and min values of a single column of numeric data, all in one go. It reads from stdin, and prints tab-separated columns of the output on a single line. I tweaked it a bit. I noticed that it does not need NR == 1 unlike the awk command above (in my first edit). Can someone please explain why? I think it has to do with the fact that the numeric data has been sorted and placed into an array.

#!/bin/sh

sort -n | awk '

  $1 ~ /^(\-)?[0-9]*(\.[0-9]*)?$/ {
    a[c++] = $1;
    sum += $1;
  }
  END {
    ave = sum / c;
    if( (c % 2) == 1 ) {
      median = a[ int(c/2) ];
    } else {
      median = ( a[c/2] + a[c/2-1] ) / 2;
    }

    {printf "Sum: %d\tCount: %d\tAverage: %.2f\tMedian: %d\tMin: %d\tMax: %d\n", sum, c, ave, median, a[0], a[c-1]}
  }
'
rplee
  • 377

2 Answers2

8
... | awk '{ sum+=$1} END { print sum/NR}'

By default, (GNU) awk prints numbers with up to 6 significant digits (plus the exponent part). This comes from the default value of the OFMT variable. It doesn't say that in the docs, but this only applies to non-integer valued numbers.

You could change OFMT to affect all print statements, or rather, just use printf here, so it also works if the average happens to be an integer. Something like %.3f would print the numbers with three digits after the decimal point.

...| awk '{ sum+=$1} END { printf "%.3f\n", sum/NR }'

See the docs for the meaning of the f and g, and the precision modifier (.prec in the second link):

awk 'NR == 1 { max=$1; min=$1; sum=0 } ...'

This doesn't initialize NR. Instead, it checks if NR is equal to one, i.e. we're on the first line. (== is comparison, = is assignment.) If so, initializes max, min and sum. Without that, max and min would start as zeroes. You could never have a negative maximum value, or a positive minimum value.

ilkkachu
  • 138,973
  • Note that it (or any POSIX awk implementation) only does that for non-integer numbers. – Stéphane Chazelas May 31 '20 at 18:57
  • @StéphaneChazelas, figures, since otherwise even 1234567 wouldn't print accurately. Too bad the GNU docs don't seem to mention that. Thanks. – ilkkachu May 31 '20 at 19:05
  • 1
    It's covered for CONVFMT though. See info gawk 'Strings And Numbers'. As a special case, if a number is an integer, then the result of converting it to a string is always an integer, no matter what the value of 'CONVFMT' may be. It also says Pre-POSIX 'awk' Used 'OFMT' for String Conversion – Stéphane Chazelas May 31 '20 at 19:09
  • 1
    Also note that there's some variation between awk implementations as to what is considered integer. See echo 1e10 1e19 1e30 | awk '{print $1+0, $2+0, $3+0}' in different awk implementations for instance. – Stéphane Chazelas May 31 '20 at 19:15
  • @ikkachu Thanks for this explanation :) I still don't understand why when I use the awk command in Edit 1 without NR == 1, my output is : 'Min: 16214 Max: 16214 Average: 810.70' Do you mind walking me through what happens when NR == 1 is omitted? And also why NR == 1 is unnecessary in the script in Edit 2? – rplee Jun 01 '20 at 17:36
  • @rplee, if you remove the whole NR == 1 { min=$1 ... } block, then the min and max are implicitly initialized to zero, e.g. this shows Min: 0, even though the smallest number going in is 3: seq 3 5 | awk ' { if ($1>max) max=$1; if ($1<min) min=$1; sum+=$1;} END {printf "Min: %d\tMax: %d\tAverage: %.2f\n", min, max, sum/NR}'. If you just remove the condition NR == 1, but keep the following block { min=$1...} , then min, max and sum are initialized again and again on every line, and the output will show the last number as both the min and the max. – ilkkachu Jun 01 '20 at 17:56
  • @rplee, that one-liner is short enough that you can walk it through line-by-line for some small input. The longer script use sort to to get the input numbers in order, so it doesn't need to keep track of the minimum and maximum values, it just looks at the first and last ones. – ilkkachu Jun 01 '20 at 17:59
  • @iikachu Ah ok, I understand now. Thanks again. – rplee Jun 02 '20 at 20:57
  • @iikachu Is the sum=0 even necessary? I ran the data through the awk command without the sum=0 and it still gave me the same result. – rplee Jun 04 '20 at 17:39
  • @rplee, not strictly, but it's good style. The variables are autoinitialized as empty, and that acts like zero if you do arithmetic on it. But e.g. print sum would then just print an empty field, and not a zero. (printf %f forces it to be interpreted as a number) – ilkkachu Jun 04 '20 at 17:56
  • @ilkkachu I see, that makes sense – rplee Jun 04 '20 at 20:04
6

If using GNU awk, try this. Adds the commas by making use of the ' modifier.

$ awk '{sum+=$1}END{printf "%'\''.2f\n",sum/NR}' filename
1,316,375.05
$

If you've got jq, try this.

$ jq -s min,max,add/length filename
1153022
1439480
1316375.05
$

From gnu.org : gawk Format Modifiers

A single quote or apostrophe character is a POSIX extension to ISO C. It indicates that the integer part of a floating-point value, or the entire part of an integer decimal value, should have a thousands-separator character in it. This only works in locales that support such characters. For example:

steve
  • 21,892
  • Hmm... when I try the awk command I get the following : awk: run time error: improper conversion(number 1) in printf("%'.2f ") FILENAME="-" FNR=20 NR=20 – rplee May 31 '20 at 20:54
  • I am using mawk 1.3.3 – rplee May 31 '20 at 21:14
  • @rplee - mawk 1.3.3 is well over 10 years old. The ' format modifier got introduced to mawk later. See https://invisible-island.net/mawk/CHANGES.html#index-t20160927 – steve May 31 '20 at 22:21
  • Ah ok I installed the latest version of mawk (mawk 1.3.4 from 1/20/2020) and it worked – rplee Jun 05 '20 at 06:08