
I have a file which looks like this:

30901 foo1 bar1 89
30902 foo2 bar2 51
30903 foo3 bar3 101
30903 foox bary 12
30903 fooz bara 23
30903 foob barc 62
30904 fooe barf 59
...
...

I want my output to look like this:

30901 89
30902 51
30903 49.5
... 

49.5 being the average of the fourth column over all four duplicate entries with id 30903 in column 1: (101 + 12 + 23 + 62) / 4 = 49.5.

All I could do was get the count of duplicate entries based on the first column (id), using

cat <logfile> | awk '{print $1}' | uniq -c
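
For the sample above this prints something like the following (count padding varies by uniq implementation):

      1 30901
      1 30902
      4 30903
      1 30904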


2 Answers


try

 awk '{ c[$1]++; t[$1] += $4 }
    END { for (cc in c) if (c[cc] > 1) printf "%s : %.1f\n", cc, t[cc]/c[cc] }' file.log

which gives

30903 : 49.5

to get your output for every id, remove the if (c[cc] > 1) part and drop the " : " from the printf format (see the variant after the explanation below).

the script basically counts occurrences with c[$1]++ and accumulates the fourth field with t[$1]+=$4.

  • c[x] is for count(x), where x is $1, the first field
  • t[x] is for total(x), the running sum of the 4th field for that id

at the end, we loop over all ids, computing and printing each average.
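
putting that together, a minimal variant matching your exact output might be (awk's for (cc in c) visits ids in unspecified order, hence the sort; use %g instead of %.1f if you prefer 89 over 89.0):

 awk '{ c[$1]++; t[$1] += $4 }                    # count ids, sum field 4
    END { for (cc in c) printf "%s %.1f\n", cc, t[cc]/c[cc] }' file.log | sort -n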

Archemar

With Miller (http://johnkerl.org/miller/doc) you can run

mlr --nidx stats1 -a mean -f 4 -g 1 inputFile

and get

30901 89.000000
30902 51.000000
30903 49.500000
30904 59.000000

  • --nidx to treat the input as headerless, with fields referenced by position;
  • -f 4 to set the value-field name on which to compute statistics (the fourth field);
  • -g 1 to set the group-by field name (the first field).
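
If the trailing zeros are unwanted, Miller also accepts an --ofmt option to set the floating-point output format (available at least in Miller 5.x; check your version), for example:

mlr --nidx --ofmt '%.1lf' stats1 -a mean -f 4 -g 1 inputFile

which should print 49.500000 as 49.5 and 89.000000 as 89.0.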
aborruso
  • Thanks for the alternate solution. I cannot confirm it because I am using Fedora OS and I couldn't find Miller for fedora. – Mr_S Feb 01 '19 at 07:41
  • @Mr_S here you have the generic linux binary (mlr.linux.x86_64 ) https://github.com/johnkerl/miller/releases/tag/5.4.0 – aborruso Feb 01 '19 at 08:21