
I have a file which looks like this:

30901 foo1 bar1 89
30902 foo2 bar2 51
30903 foo3 bar3 101
30903 foox bary 12
30903 fooz bara 23
30903 foob barc 62
30904 fooe barf 59
...
...

I want my output to look like this:

30901 89
30902 51
30903 49.5
... 

49.5 being the average of the fourth column over all four duplicate entries with id 30903 in column 1: (101 + 12 + 23 + 62) / 4 = 49.5.

All I could do was get the count of duplicate entries based on the first column (id), using

cat <logfile> | awk '{print $1}' | uniq -c
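
For the sample above this prints something like the following (count padding varies by uniq implementation):

      1 30901
      1 30902
      4 30903
      1 30904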


2 Answers


try

 awk '{ c[$1]++; t[$1] += $4 }
    END { for (cc in c) if (c[cc] > 1) printf "%s : %.1f\n", cc, t[cc]/c[cc] }' file.log

which gives

30903 : 49.5

to get your output for every id, remove the if (c[cc] > 1) part and drop the " : " from the printf format (see the variant after the explanation below).

the script basically counts occurrences with c[$1]++ and accumulates the fourth field with t[$1]+=$4.

  • c[x] is for count(x), where x is $1, the first field
  • t[x] is for total(x), the running sum of the 4th field for that id

at the end, we loop over all ids, computing and printing each average.
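
putting that together, a minimal variant matching your exact output might be (awk's for (cc in c) visits ids in unspecified order, hence the sort; use %g instead of %.1f if you prefer 89 over 89.0):

 awk '{ c[$1]++; t[$1] += $4 }                    # count ids, sum field 4
    END { for (cc in c) printf "%s %.1f\n", cc, t[cc]/c[cc] }' file.log | sort -n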

Archemar

With Miller (http://johnkerl.org/miller/doc) you can run

mlr --nidx stats1 -a mean -f 4 -g 1 inputFile

and get

30901 89.000000
30902 51.000000
30903 49.500000
30904 59.000000

  • --nidx to treat the input as headerless, with fields referenced by position;
  • -f 4 to set the value-field name on which to compute statistics (the fourth field);
  • -g 1 to set the group-by field name (the first field).
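
If the trailing zeros are unwanted, Miller also accepts an --ofmt option to set the floating-point output format (available at least in Miller 5.x; check your version), for example:

mlr --nidx --ofmt '%.1lf' stats1 -a mean -f 4 -g 1 inputFile

which should print 49.500000 as 49.5 and 89.000000 as 89.0.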
aborruso
  • Thanks for the alternate solution. I cannot confirm it because I am using Fedora OS and I couldn't find Miller for fedora. – Mr_S Feb 01 '19 at 07:41
  • @Mr_S here you have the generic linux binary (mlr.linux.x86_64 ) https://github.com/johnkerl/miller/releases/tag/5.4.0 – aborruso Feb 01 '19 at 08:21