I like whitespace and descriptive variable names. What else is there to say? It's been a while since I've written much awk; I had even forgotten about the -f on the shebang. However, as I worked on this I really felt I was in the zen of it. Haiku code.
I like this solution because there's a minimum of coded logic: only two for loops, both iterating over array indexes. No 3-part stepping for loops, no if statements, no explicit value comparisons. All of those constructs are statistically correlated with software defects (bugs). Interestingly, there are no explicit assignments either, and only one math operation, the increment on the count. I think this all indicates maximum use of the language's features.
I feel like something might be missing, but I haven't yet been able to find any holes in it.
Please comment. Opinions and constructive criticism requested. I'd like to hear about the performance considerations of this script.
#!/usr/bin/awk -f

function count(seen, unique_count) {
    for (ij in seen) {
        split(ij, fields, SUBSEP)
        ++unique_count[fields[1]]
    }
}

{
    seen2[$1,$2]
    seen3[$1,$3]
}

END {
    count(seen2, count2)
    count(seen3, count3)
    for (i in count3) {
        print i, count2[i], count3[i]
    }
}
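For anyone who wants to try it, here's a sample run. The input data and the unique_counts.awk file name are made up for illustration; I'm assuming the script above has been saved under that name and made executable:

```shell
# Field 1 is the group key; fields 2 and 3 hold the values whose
# distinct occurrences we want to count per group.
printf '%s\n' 'a x 1' 'a x 2' 'a y 1' 'b x 1' |
./unique_counts.awk | sort
# a 2 2
# b 1 1
```

The pipe through sort is there because awk's for-in loop visits array indexes in an unspecified order.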
Annotation
I guess one unique feature of this script is that the seen2 and seen3 arrays contain no data, only indexes. This is because we're only counting unique values; the only thing that matters is that a value has been seen, not how many times it occurs.
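To illustrate the point with a standalone sketch (not part of the script): in awk, merely referencing an array element creates it with an empty value, while the in test checks membership without creating anything.

```shell
awk 'BEGIN {
    seen["a", "x"]                 # bare reference creates the element
    if (("a", "x") in seen)
        print "created by reference"
    if (!(("b", "y") in seen))     # testing with "in" does not create
        print "not created by the in test alone"
}'
# created by reference
# not created by the in test alone
```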
#!/usr/bin/awk -f
The count function takes an array, seen, indexed by two field values encountered in input records (either fields 1 and 2, or fields 1 and 3), and populates an array, internally called unique_count, indexed by the first field and containing the counts of unique values accumulated under the second index:
function count(seen, unique_count) {
The count function iterates over the indexes of the array seen:
    for (ij in seen) {
Split the index back into the two original values, field 1 and either field 2 or field 3:
        split(ij, fields, SUBSEP)
Increment the count for the element indexed by field 1:
        ++unique_count[fields[1]]
    }
}
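As a standalone illustration of what the split on SUBSEP recovers (the names here are made up): the comma in a subscript list joins the values with SUBSEP, and split reverses that.

```shell
awk 'BEGIN {
    pair["alpha", "beta"]               # subscripts joined with SUBSEP
    for (ij in pair) {
        n = split(ij, fields, SUBSEP)   # recover the original values
        print n, fields[1], fields[2]
    }
}'
# 2 alpha beta
```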
On every input line, we create an empty array element (if it doesn't already exist) indexed by the first field together with either the second or third field. A separate array (seen2 and seen3) is kept for each field number being counted. There will be only one array element for each unique value in the given column (2 or 3):
{
    seen2[$1,$2]
    seen3[$1,$3]
}
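A quick standalone demonstration that duplicate keys collapse into a single element, which is what makes these counts counts of unique values:

```shell
printf '%s\n' 'a x' 'a x' 'a y' |
awk '{ seen[$1, $2] }                  # duplicate (a, x) stored once
     END {
         n = 0
         for (ij in seen) n++
         print n " unique pairs"
     }'
# 2 unique pairs
```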
At the end of the data, count the number of unique values seen in each column:
END {
Pass the arrays accumulated from the input to the count function, and receive count2 or count3 populated with the unique value counts:
    count(seen2, count2)
    count(seen3, count3)
Step through either the count2 or count3 array (it doesn't matter which, since both are indexed by the first field of every input line), and print field one followed by the counts of unique values found in columns 2 and 3 for the lines sharing that field:
    for (i in count3) {
        print i, count2[i], count3[i]
    }
}
One-Liner Version
awk 'function c(s,u){for(x in s){split(x,f,SUBSEP); ++u[f[1]];}}
{a[$1,$2]; b[$1,$3];} END {c(a,d); c(b,e); for(i in d){print i,d[i],e[i];}}'
select col1, count(distinct col2), count(distinct col3) from mytable group by col1;
– Wildcard Oct 11 '17 at 05:15