I've already found "How to print incremental count of occurrences of unique values in column 1", which is similar to my question, but the answer isn't sufficient for my purposes.

First let me just illustrate what I want to do:

# Example input
apple   abc jkl
apple   xyz jkl
apple   abc xyz
apple   qrs xyz
apple   abc jkl
banana  abc lmno
banana  lmnop   xyz
banana  lmnopq  jkl
banana  abc jkl
banana  lmnop   pqrs
banana  abcdefg tuv
cucumber    abc lmno
cucumber    abc jkl
cucumber    abc xyz
cucumber    abcd    jkl
cucumber    abc jkl
cucumber    abc lmno

# Desired output
apple   3   2
banana  4   5
cucumber    2   3

So, for each separate value of field 1, print that field, and a count of the unique associated values for field 2, and then for field 3.

The input is sorted by the first field, but sorting by other fields is disallowed (and would do no good as the 2nd and 3rd fields both need to be handled).

I'd much rather accomplish this in awk; it is probably far easier in perl and I'm interested in learning how to do that as well, but I'm dealing with an awk script and I'd rather not rewrite the whole thing.

I came up with one method which works, but is quite lengthy and seems very hacky to me. I'll post that as an answer (when I get back to the office) but would love to see any actually good approaches. (I don't think mine is "good".)

Wildcard
  • In Postgres this would be, roughly, select col1, count(distinct col2), count(distinct col3) from mytable group by col1; – Wildcard Oct 11 '17 at 05:15

6 Answers

With awk:

awk 'function p(){print l,c,d; delete a; delete b; c=d=0} 
  NR!=1&&l!=$1{p()} ++a[$2]==1{c++} ++b[$3]==1{d++} {l=$1} END{p()}' file

Explanation:

  • function p(): defines a function called p(), which prints the values and deletes the used variables and arrays.
  • NR!=1&&l!=$1 if it's not the first line and the variable l does not equal the first field $1, then run the p() function.
  • ++a[$2]==1{c++} if incrementing the element of array a indexed by $2 yields 1, that value is being seen for the first time, so the c variable is incremented. The ++ before the element returns the new value, so the increment happens before the comparison with 1.
  • ++b[$3]==1{d++} the same as above, but with the 3rd field and the d variable.
  • {l=$1} sets l to the first field (for the comparison in the next iteration).
  • END{p()} after the last line is processed, awk has to print the values for the last block
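For readability, the same program can also be kept in a script file (a sketch with identical logic; the trimmed sample input here is made up for illustration):

```shell
cat > counts.awk <<'EOF'
# Flush the current group: print its key and the two unique counts, then reset.
function p() {
    print l, c, d
    delete a; delete b
    c = d = 0
}
NR != 1 && l != $1 { p() }   # group key changed: flush the previous group
++a[$2] == 1 { c++ }         # first occurrence of this $2 within the group
++b[$3] == 1 { d++ }         # first occurrence of this $3 within the group
{ l = $1 }                   # remember the group key for the next line
END { p() }                  # flush the final group
EOF

printf '%s\n' 'apple a x' 'apple b x' 'apple a y' \
              'kiwi c x'  'kiwi c y'  'kiwi c z' > sample.txt
awk -f counts.awk sample.txt
# apple 2 2
# kiwi 1 3
```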

With your given input the output is:

apple 3 2
banana 4 5
cucumber 2 3
chaos
  • Wow! We came up with almost the same thing independently :) A few very nice changes I incorporated from your code into mine: the printcounts function can include the resetvars code; the prepended ++ to check if ( ++seenthis [...] == 1) is much cleaner than calling that array element twice; the assignment of currentf1 can be at the end—makes it much cleaner. I'll update my answer with my modified code. – Wildcard Nov 17 '15 at 19:31

I like whitespace and descriptive variable names. What else is there to say? It's been a while since I've written a lot of awk; I even forgot about the -f on the shebang. However, as I did this I really felt like I was in the zen of it. Haiku code.

I like this solution because there's a minimum of coded logic. Only two for loops, iterating over array indexes. No 3-part stepping for loops, no if statements, no explicit value comparisons. All of those things are statistically correlated with software defects (bugs). Interestingly, there are no explicit assignments, and only one math operation, the increment on the count. I think this all indicates maximum usage of the language features.

I feel like something might be missing, but I haven't yet been able to find any holes in it.

Please comment. Opinions and constructive criticism requested. I'd like to hear about the performance considerations of this script.

#!/usr/bin/awk -f

function count(seen, unique_count) {
    for (ij in seen) {
        split(ij, fields, SUBSEP)
        ++unique_count[fields[1]]
    }
}

{
    seen2[$1,$2]
    seen3[$1,$3]
}

END {
    count(seen2, count2)
    count(seen3, count3)
    for (i in count3) {
        print i, count2[i], count3[i]
    }
}

Annotation

I guess one unique feature of this script is that the seen2 and seen3 arrays contain no data, only indexes. This is because we're only counting unique values, therefore, the only thing that matters is that the values have been seen, we don't care how many times they occur.
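That property is easy to check in isolation: merely referencing an element creates it, so a subsequent `for (k in ...)` loop sees it (a minimal demonstration, separate from the script):

```shell
awk 'BEGIN {
    seen["x", "y"]        # reference only, no assignment
    n = 0
    for (k in seen) n++
    print n               # the element exists, so this prints 1
}'
```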

#!/usr/bin/awk -f

The count function takes an array, seen, indexed by pairs of field values from the input records: either fields 1 and 2, or fields 1 and 3. It fills the array passed as unique_count (awk passes arrays by reference), indexed by the first field, with the counts of unique values for the column held in the second index position:

function count(seen, unique_count) {

The count function iterates over the indexes of the array seen:

    for (ij in seen) {

Splits the index into the two original values, field 1 and either field 2 or field 3:

        split(ij, fields, SUBSEP)

Increment the count for the element indexed by field 1:

        ++unique_count[fields[1]]
    }
}
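The compound-subscript mechanics can be seen in a tiny standalone example: awk joins `[i, j]` subscripts with the control character held in `SUBSEP`, and `split` on `SUBSEP` recovers the original parts:

```shell
awk 'BEGIN {
    seen["apple", "abc"]              # stored under "apple" SUBSEP "abc"
    for (ij in seen) {
        split(ij, fields, SUBSEP)     # recover the two original values
        print fields[1], fields[2]
    }
}'
# apple abc
```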

On every input line, we create an empty array element, if it doesn't already exist, indexed by the first field together with either the second or third field. A separate array (seen2 and seen3) is kept for each field number being counted. There will be only one array element for each unique combination of field 1 and a value in the given column (2 or 3):

{
    seen2[$1,$2]
    seen3[$1,$3]
}

At the end of data, count the number of unique fields seen for each column:

END {

Pass the arrays accumulated from the input to the count function, and receive the count2 or count3 populated with the unique field counts.

    count(seen2, count2)
    count(seen3, count3)

Step through either of the count2 or count3 arrays (it doesn't matter which, since both are indexed by every value of field one), and print field one followed by the counts of unique values found in each column:

    for (i in count3) {
        print i, count2[i], count3[i]
    }
}

One-Liner Version

awk 'function c(s,u){for(x in s){split(x,f,SUBSEP); ++u[f[1]];}}
 {a[$1,$2]; b[$1,$3];} END {c(a,d); c(b,e); for(i in d){print i,d[i],e[i];}}'
RobertL
  • Very interesting! Does calling col2[$1,$2] with no assignment make that element exist? Is that POSIX compatible? This looks like an interestingly different approach which would correctly handle nonsorted input as well as the example input I gave...would you mind adding an explanation of the more unusual aspects here? – Wildcard Nov 17 '15 at 19:35
  • I'm not sure if it's POSIX, but this script produces the same output when run by gawk, mawk, and FreeBSD awk. I was just making a few edits for readability and adding comments. Soon to come. – RobertL Nov 17 '15 at 19:39
  • @Wildcard Yes, seen2[$1,$2] (edited example) does create the element. Yes, this does handle unsorted input correctly. I'm not sure what's unusual about it, all the answers look unusual to me. :-) I'll add some description/annotation. – RobertL Nov 17 '15 at 20:46
  • Ingenious, actually. :) I like it very much. I've already got my code finalized and checked in, but if I run into a similar situation again (not unlikely) I'll incorporate parts of this. I didn't know about SUBSEP (although using FS would work as well for many cases). Haiku code indeed. For my use case, I have other disrelated columns I am summarizing for each value of $1 (doing other things than count unique). So I need to print each line of output as I go, not all at the end. (Not a criticism at all, just explaining parts of my design choices :) – Wildcard Nov 17 '15 at 21:40
perl -lane 'undef $h{ $F[0] }[ $_ - 1 ]{ $F[$_] } for 1,2
            }{
            for my $k (sort keys %h) {
                print join " ", $k, map { scalar keys %$_ } @{ $h{$k} }
            }' < input

Basically, you create a hash like this:

  'apple' => [
               {
                 'abc' => undef,
                 'xyz' => undef,
                 'qrs' => undef
               },
               {
                 'jkl' => undef,
                 'xyz' => undef
               }
             ],
  'banana' => [
                {
                  'abcdefg' => undef,
                  'lmnop' => undef,
                  'lmnopq' => undef,
                  'abc' => undef
                },
                {
                  'lmno' => undef,
                  'pqrs' => undef,
                  'tuv' => undef,
                  'jkl' => undef,
                  'xyz' => undef
                }
              ],
  'cucumber' => [
                  {
                    'abcd' => undef,
                    'abc' => undef
                  },
                  {
                    'lmno' => undef,
                    'jkl' => undef,
                    'xyz' => undef
                  }
                ]

And then just count the keys for each inner hash.

choroba
  • I'm going to have to come back to this (and Sobrique's answer) once I've actually learned perl basics...it's a bit inscrutable to me right now, but from your hash diagram it looks like it's just the approach I was expecting to be possible with perl. Thanks! :) – Wildcard Nov 17 '15 at 19:34

I would probably tackle it something like this:

#!/usr/bin/env perl
use strict;
use warnings;

my %count_of_col2;
my %count_of_col3;

#iterate data
while (<DATA>) {
    #split on whitespace
    my ( $key, $col2, $col3 ) = split;
    #update counts
    $count_of_col2{$key}{$col2}++;
    $count_of_col3{$key}{$col3}++;
}

foreach my $key ( sort keys %count_of_col2 ) {
    print join( "\t",
        $key,
         #keys gives us all the elements - we use scalar to count them. 
        scalar keys %{ $count_of_col2{$key} },
        scalar keys %{ $count_of_col3{$key} } ),
        "\n";
}

__DATA__
apple   abc jkl
apple   xyz jkl
apple   abc xyz
apple   qrs xyz
apple   abc jkl
banana  abc lmno
banana  lmnop   xyz
banana  lmnopq  jkl
banana  abc jkl
banana  lmnop   pqrs
banana  abcdefg tuv
cucumber    abc lmno
cucumber    abc jkl
cucumber    abc xyz
cucumber    abcd    jkl
cucumber    abc jkl
cucumber    abc lmno

Note: this only supports two counted columns. You could handle an arbitrary number if you wanted, but then you'd probably want an outer array.

Sobrique

As promised, here is the approach I did work out before writing this question. It works and maybe it is a good approach, but it seemed overly convoluted for this apparently simple task. Now it seems it's actually not so bad. :)

function printcounts() {
  printf "%s", currentf1
  for (i = 2; i <= 3; i++ ) {
    printf "%s", FS countuniq [ i ]
  }
  printf "\n"
}

function resetvars() {
  delete already_seen_value
  for ( i = 2; i <= 3; i++ ) {
    countuniq [ i ] = 0
  }
}

$1 != currentf1 {

  if ( NR != 1 ) {
    printcounts()
  }
  currentf1 = $1
  resetvars()
}

{
  for ( i = 2; i <= 3; i++ ) {
    if ( ! already_seen_value [ i ":" $i ] ) {
      already_seen_value [ i ":" $i ] ++
      countuniq [ i ] ++
    }
  }
}
END {
  printcounts()
}

With modifications based on chaos's answer:

function printcounts() {
  printf "%s", currentf1
  for (i = 2; i <= 3; i++ ) {
    printf "%s", FS countuniq [ i ] + 0
  }
  printf "\n"
  # Reset vars
  delete seenthis
  delete countuniq
}

NR != 1 && currentf1 != $1 {
  printcounts()
}

{
  for ( i = 2; i <= 3; i++ ) {
    if ( ++ seenthis [ i ":" $i ] == 1 ) {
      countuniq [ i ] ++
    }
  }
  currentf1 = $1
}

END {
  printcounts()
}

(The + 0 in the printcounts function is to ensure a number is always printed, as the actual use case involves a comma field separator and ignoring empty fields, so a zero count is actually possible.)
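The effect of the + 0 is easy to see in isolation: an unset awk array element prints as the empty string, while forcing numeric context prints 0:

```shell
awk 'BEGIN {
    printf "[%s] ",  countuniq[2]       # unset element: empty string
    printf "[%s]\n", countuniq[3] + 0   # numeric context: 0
}'
# [] [0]
```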

Wildcard

Using Raku (formerly known as Perl_6)

raku -e 'my @a=lines; my $cols=@a[0].words.elems; my %x; \
loop (my $i=1; $i < $cols; $i++) {my %h.=append(.[0] => .[$i]) for @a>>.words; \
%x.=append( %h>>.unique>>.elems) }; .put for %x.sort;' file.txt

Sample Input:

apple   abc jkl
apple   xyz jkl
apple   abc xyz
apple   qrs xyz
apple   abc jkl
banana  abc lmno
banana  lmnop   xyz
banana  lmnopq  jkl
banana  abc jkl
banana  lmnop   pqrs
banana  abcdefg tuv
cucumber    abc lmno
cucumber    abc jkl
cucumber    abc xyz
cucumber    abcd    jkl
cucumber    abc jkl
cucumber    abc lmno

Sample Output:

apple   3 2
banana  4 5
cucumber    2 3

Above is a solution coded in Raku, a member of the Perl-family of programming languages. Briefly, lines are read in and stored in array @a. The first line is used to get the number of columns, $cols. An accumulator hash %x is declared.

For each position of @a (containing a line), @a is broken into words (i.e. columns). A classic loop is then used to loop over each "leftmost-column-as-key / next-column-rightward-as-value" word-pair, append-ing into the %h hash (declared with my inside the block). Of note here is Raku's => 'fat-arrow' hash constructor, a simple idiom that gets the job done. In the last statement of the loop block, an accumulator hash %x is append-ed, and during this process duplicates are removed with unique, and the value is converted to the count of elems.

(Note, the >>. notation indicates a "hyper-methodop" operator, which "...will call a method on all elements of a List out of order and return the list of return values in order." ).

The above code satisfies the OP's requirements. But in certain situations accumulating two hashes might be overkill; the OP might simply want column-wise totals. The code below accomplishes that task, dispensing with the %x accumulator hash:

~$ raku -e 'my @a=lines; my $cols=@a[0].words.elems; \
loop (my $i=1; $i < $cols; $i++) {my %h.=append(.[0] => .[$i]) for @a>>.words; \
%h>>.unique>>.elems.say};' file.txt

Sample Output:

{apple => 3, banana => 4, cucumber => 2} #Column 2
{apple => 2, banana => 5, cucumber => 3} #Column 3

https://docs.raku.org/syntax/loop
https://raku.org

jubilatious1