
From time to time I find myself writing awk scripts that compute simple statistics, for example a histogram, the average of a value, the variance, or the standard deviation.

Doing that again and again with helper arrays/variables and for-loops in the END clause etc. feels a little bit tedious and error-prone.
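To illustrate the kind of boilerplate I mean, here is a minimal sketch in plain POSIX awk. The function name `stats` is made up for this example, and it assumes the values are numeric and sit in the first column. It computes the population variance (dividing by n) from a running sum and sum of squares, which is simple but can lose precision when the values are large and close together.

```shell
# Hypothetical helper: mean, variance and standard deviation of column 1.
# Assumption: one numeric value per line; population variance (divide by n).
stats() {
  awk '
    { n++; sum += $1; sumsq += $1 * $1 }   # running count, sum, sum of squares
    END {
      if (n == 0) exit 1                   # no input, nothing to report
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = 0                 # guard against rounding below zero
      printf "mean=%.2f var=%.2f sd=%.2f\n", mean, var, sqrt(var)
    }
  ' "$@"
}

printf '%s\n' 2 4 4 4 5 5 7 9 | stats
# prints: mean=5.00 var=4.00 sd=2.00
```

Three helper variables and an END block for something that conceptually is just "the standard deviation of column 1".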

In DTrace there is a quite awesome syntax for such tasks, called aggregations. It is similar to the concept/API of Accumulators in the Boost C++ libraries.

Thus my question: are there awk variants which provide similar concepts/syntax that allow for convenient and iterative computation of such statistics?

An imaginary example of such syntax:

$ someawk '{ @time[$1] = avg($2) }' measurements.log
prog1    150
prog2    200
....

(where the 1st column contains the program name, the 2nd the runtime of one measurement, measurements.log contains multiple measurements for each program and the aggregate function avg computes the average)
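For comparison, the same per-program average in plain awk might look roughly like the sketch below. The function name `avg_by_key` and the sample values are made up for illustration; only the two-column format comes from the description above. The output is piped through sort because awk does not specify the order of `for (k in array)`.

```shell
# Hypothetical helper: average of column 2, grouped by column 1.
avg_by_key() {
  awk '
    NF >= 2 { sum[$1] += $2; cnt[$1]++ }   # helper arrays keyed by program name
    END {
      for (k in sum)                        # iteration order is unspecified
        printf "%s\t%g\n", k, sum[k] / cnt[k]
    }
  ' "$@" | sort                             # sort for a stable order
}

# Made-up measurements in the "name runtime" format described above.
printf '%s\n' 'prog1 100' 'prog2 250' 'prog1 200' 'prog2 150' | avg_by_key
# prints:
# prog1	150
# prog2	200
```

With a file, the call would be `avg_by_key measurements.log`. Two arrays and a loop in END replace what the imagined syntax expresses in a single assignment.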

maxschlepzig
  • Maybe Perl might suit you better - it might be even worth the bigger resources it needs. – peterph Dec 25 '12 at 21:31
  • You may be interested in Num, which uses AWK to calculate descriptive statistics such as count, mean, variance, etc. See http://www.numcommand.com – joelparkerhenderson Nov 09 '15 at 20:19

1 Answer


Awk is designed for simple text processing. If you want more than that, there's a point where you need to ditch awk and use a more capable language.

Perl is the natural progression. It has most of the features of awk with a similar syntax, and is installed by default on most non-embedded Unix systems. I'm not aware of any Perl library for the kind of statistical analysis you describe, but there are many libraries out there.

For statistical analysis, the language of choice is R. It's weaker than awk on text processing, so unless your data is already in a format that R understands, you'll need to massage it first, possibly by piping awk into R. See "Is there a way to get the min, max, median, and average of a list of numbers in a single command?" for an example of using R that's similar to your example.