
I am collecting data from multiple files and dumping into a single summary file.

cat *files* | sort -u > final.rpt
a 1
a 5
a 6
b 2
b 3 
b 0
c 1
c 7

How can I uniquify based on the string in the first column and sum all the numbers in the second column corresponding to that string? For the example above, the expected output is:

a 12
b 5
c 8
Marius_Couet

5 Answers


Something like this can do the work:

awk '{a[$1]+=$2} END {for (i in a) print i, a[i]}' <input file>

The idea is to use associative arrays, accumulating the second-column value under each first-column key.
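As a quick check, here is the one-liner run against the sample data from the question. The trailing `sort` is only there to make the output order deterministic, since awk's `for (i in a)` iterates over array keys in an unspecified order:

```shell
# Feed the sample data to the awk one-liner; sum column 2 per column-1 key.
printf 'a 1\na 5\na 6\nb 2\nb 3\nb 0\nc 1\nc 7\n' |
awk '{a[$1]+=$2} END {for (i in a) print i, a[i]}' |
sort
# prints:
# a 12
# b 5
# c 8
```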

Romeo Ninov
  • If the input file is sorted, then an associative array is not really needed. But unless the OP's input file is absolutely gigantic and memory is low, the optimization is hardly worth the effort. – legolegs Aug 23 '23 at 07:55
  • @legolegs, with associative arrays the method is more straightforward (I know :) ), and the speed is acceptable. On a really huge file, sort may take much longer (if it does not hit memory constraints). – Romeo Ninov Aug 23 '23 at 08:05

Using datamash:

$ datamash -sW sum 2 -g 1 <file

This computes the sum of the second field, grouped by the first field.


Using Miller:

$ mlr --nidx stats1 -a sum -f 2 -g 1 file

Assuming that the unsorted file has two fields only as in:

a 1
a 5
a 6
b 2
b 3
b 0
c 1
c 7
a 1

The following command first removes duplicate records, then computes the sum:

$ mlr --nidx uniq -f 1,2 then \
stats1 -a sum -f 2 -g 1 file

For completeness, here is a solution with O(1) memory usage. It requires the input to be grouped by the first column, which final.rpt already is since it comes out of sort -u:

awk 'NR>1 && $1!=prev {print prev, sum;sum=0} {sum+=$2;prev=$1} END {print prev, sum}' < final.rpt
a 12
b 5
c 8

Explanation: for each new record where $1 changes, we print the result and reset the sum. We also print the result at the end of the file, but not at the beginning (NR>1). The code is somewhat more cumbersome than the version with an associative array.
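A minimal sketch of the streaming approach, piping the (already grouped) sample data in directly instead of reading final.rpt:

```shell
# The input must be grouped by key (e.g. the output of sort -u); otherwise
# a key that reappears later would be printed a second time with a fresh sum.
printf 'a 1\na 5\na 6\nb 2\nb 3\nb 0\nc 1\nc 7\n' |
awk 'NR>1 && $1!=prev {print prev, sum; sum=0} {sum+=$2; prev=$1} END {print prev, sum}'
# prints:
# a 12
# b 5
# c 8
```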

legolegs
    If you tweak it to awk '$1!=prev{if (NR>1) print prev, sum; sum=0; prev=$1} {sum+=$2} END{if (NR>1) print prev, sum}' then it won't need to assign a value to prev for every input line, just when necessary, and it won't print a single blank char if the input is empty. So it'll be slightly more efficient and slightly more robust. – Ed Morton Aug 23 '23 at 16:44

Using Raku (formerly known as Perl_6)

~$ raku -ne 'BEGIN my %h; 
             given .split(/\s/, 2, :skip-empty) -> ($k,$v) { %h{$k} += $v }; 
             END .say for %h.sort;'   file

Above is an answer coded in Raku, a member of the Perl-family of programming languages. You could consider preparing your input data by adapting the Raku answer here (sorting on a single column without uniquifying).

Code is called at the command line with the -ne (linewise, non-autoprinting) flags. A hash %h is declared in a BEGIN block. In the body of the linewise loop, each line is split on whitespace into 2 pieces. The :skip-empty parameter drops empty fields, so stray whitespace does not end up as a blank key or value.

The two elements are assigned to the temporary variables $k and $v, which are referred to inside the block. Within the block, each $k key is looked up in the %h hash; the key is added to the hash if not found, and the second-column value $v is added to that key's value with +=, so values accumulate. Here the += operator solves a lot of problems, as it is really shorthand for %h{$k} = (%h{$k} // 0) + $v.

Finally after all lines are read, the result is output in an END block.

Sample Input:

a 1
a 5
a 6
b 2
b 3
b 0
c 1
c 7

Sample Output (using say):

a => 12
b => 5
c => 8

To get tab-separated columns in the output, use put instead of say:

Sample Output (using put):

a   12
b   5
c   8

https://raku.org

jubilatious1

Would ... | sort | uniq -c >final.rpt work (instead of using ... | sort -u >final.rpt) to collect the data?
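Not on its own: uniq -c only counts identical lines, it does not sum the second column, so a post-processing step would still be needed. A sketch of what that could look like (the multiply-and-sum awk stage is an illustration added here, not part of the comment above):

```shell
# uniq -c prefixes each distinct "key value" line with its repeat count,
# so summing count*value per key recovers the totals.
printf 'a 1\na 1\na 5\nb 2\n' | sort | uniq -c |
awk '{a[$2] += $1 * $3} END {for (k in a) print k, a[k]}' | sort
# prints:
# a 7
# b 2
```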

U. Windl