2

I have a data file that comprises a site, a type and a count by site/type. In the full file there are several sites and at least four different types per site. What I am trying to do is summarise the top two types for each site and just print those out. The example data file I have looks like this;-

Site|Type|Count
site1|A|1 
site1|B|25 
site1|B|66 
site1|D|22 
site1|B|22 
site1|A|45 
site1|A|55 
site1|C|50 
site1|C|55 
site1|A|6 
site1|A|30 
site2|B|21 
site2|D|362 
site2|A|36 
site2|C|2 
site2|A|4 
site2|A|69 
site3|B|36 
site3|C|62 
site3|D|541 
site3|C|55 
site3|A|52 
site3|A|63 
site3|A|52
site4|B|52 
site4|B|55 
site4|D|52 
site4|C|25 
site4|B|55 
site4|A|55 

The output should look like this:

site1|A|137 
site1|B|113 
site2|A|109 
site2|D|362 
site3|A|167 
site3|D|541 
site4|A|55 
site4|B|162

Currently my AWK program is like this:

BEGIN {
  FS="|"
}
{
  site=$1
    type=$2
    nums=$3+0
key=site","type

++recs[key]

tot[key]+=$3 

} END { for (i in recs) {print i "," tot[i]} }

Is there somebody who can help me to create an array by site/type for the totals in order to find and print the top two values by site and type please?

JackH
  • 21
  • You have several answers now, please see https://unix.stackexchange.com/help/someone-answers for what to do next. – Ed Morton Nov 24 '21 at 13:07

1 Answers1

4

With GNU awk for arrays of arrays and sorted_in:

$ cat tst.awk
BEGIN { FS=OFS="|" }
NR > 1 {
    tot[$1][$2] += $3
}
END {
    PROCINFO["sorted_in"] = "@val_str_asc"
    for ( site in tot ) {
        cnt = 0
        PROCINFO["sorted_in"] = "@val_num_desc"
        for ( type in tot[site] ) {
            if ( ++cnt < 3 ) {
                print site, type, tot[site][type]
            }
        }
    }
}

$ awk -f tst.awk file
site1|A|137
site1|B|113
site2|D|362
site2|A|109
site3|D|541
site3|A|167
site4|B|162
site4|A|55
Ed Morton
  • 31,617