
I am collecting data from multiple files and dumping into a single summary file.

cat *files* | sort -u > final.rpt
a 1
a 5
a 6
b 2
b 3 
b 0
c 1
c 7

How can I uniquify based on the string in the first column and sum all the numbers in the second column corresponding to that string? For the example above, the expected output is:

a 12
b 5
c 8
Marius_Couet

5 Answers


Something like this can do the work:

awk '{a[$1]+=$2} END {for (i in a) print i, a[i]}' <input file>

The idea is to use associative arrays, accumulating the second-column value under each first-column key.
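As a quick check, here is the one-liner run against the sample data from the question. The trailing `sort` is only there to make the output order deterministic, since awk's `for (i in a)` iterates over array keys in an unspecified order:

```shell
# Feed the sample data to the awk one-liner; sum column 2 per column-1 key.
printf 'a 1\na 5\na 6\nb 2\nb 3\nb 0\nc 1\nc 7\n' |
awk '{a[$1]+=$2} END {for (i in a) print i, a[i]}' |
sort
# prints:
# a 12
# b 5
# c 8
```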

Romeo Ninov
  • If the input file is sorted, then an associative array is not really needed. But unless the OP's input file is absolutely gigantic and memory is low, the optimization is hardly worth the effort. – legolegs Aug 23 '23 at 07:55
  • @legolegs, with associative arrays the method is more straightforward (I know :) ), and the speed is acceptable. On a really huge file, sort may take much longer (if it does not hit memory constraints). – Romeo Ninov Aug 23 '23 at 08:05

Using datamash:

$ datamash -sW sum 2 -g 1 <file

This computes the sum of the second field, grouped by the first field.


Using Miller:

$ mlr --nidx stats1 -a sum -f 2 -g 1 file

Assuming that the unsorted file has two fields only as in:

a 1
a 5
a 6
b 2
b 3
b 0
c 1
c 7
a 1

The following command first removes duplicate records, then computes the sum:

$ mlr --nidx uniq -f 1,2 then \
stats1 -a sum -f 2 -g 1 file

For completeness, here is a solution with O(1) memory usage. It requires the input to be grouped by the first column, which final.rpt already is since it comes out of sort -u:

awk 'NR>1 && $1!=prev {print prev, sum;sum=0} {sum+=$2;prev=$1} END {print prev, sum}' < final.rpt
a 12
b 5
c 8

Explanation: for each new record where $1 changes, we print the result and reset the sum. We also print the result at the end of the file, but not at the beginning (NR>1). The code is somewhat more cumbersome than the version with an associative array.
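A minimal sketch of the streaming approach, piping the (already grouped) sample data in directly instead of reading final.rpt:

```shell
# The input must be grouped by key (e.g. the output of sort -u); otherwise
# a key that reappears later would be printed a second time with a fresh sum.
printf 'a 1\na 5\na 6\nb 2\nb 3\nb 0\nc 1\nc 7\n' |
awk 'NR>1 && $1!=prev {print prev, sum; sum=0} {sum+=$2; prev=$1} END {print prev, sum}'
# prints:
# a 12
# b 5
# c 8
```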

legolegs
    If you tweak it to awk '$1!=prev{if (NR>1) print prev, sum; sum=0; prev=$1} {sum+=$2} END{if (NR>1) print prev, sum}' then it won't need to assign a value to prev for every input line, just when necessary, and it won't print a single blank char if the input is empty. So it'll be slightly more efficient and slightly more robust. – Ed Morton Aug 23 '23 at 16:44

Using Raku (formerly known as Perl_6)

~$ raku -ne 'BEGIN my %h; 
             given .split(/\s/, 2, :skip-empty) -> ($k,$v) { %h{$k} += $v }; 
             END .say for %h.sort;'   file

Above is an answer coded in Raku, a member of the Perl-family of programming languages. You could consider preparing your input data by adapting the Raku answer here (sorting on a single column without uniquifying).

Code is called at the command line with the -ne (linewise, non-autoprinting) flags. A hash %h is declared in a BEGIN block. In the body of the linewise loop, each line is split on whitespace into 2 pieces. The :skip-empty parameter drops empty fields, so stray whitespace does not end up as a blank key or value.

The two elements are assigned to the temporary variables $k and $v, which are referred to inside the block. Within the block, each $k key is looked up in the %h hash; the key is added to the hash if not found, and the second-column value $v is added to that key's value with +=, so values accumulate. Here the += operator solves a lot of problems, as it is really shorthand for %h{$k} = (%h{$k} // 0) + $v.

Finally after all lines are read, the result is output in an END block.

Sample Input:

a 1
a 5
a 6
b 2
b 3
b 0
c 1
c 7

Sample Output (using say):

a => 12
b => 5
c => 8

To get tab-separated columns in the output, use put instead of say:

Sample Output (using put):

a   12
b   5
c   8

https://raku.org

jubilatious1

Would ... | sort | uniq -c >final.rpt work (instead of using ... | sort -u >final.rpt) to collect the data?
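Not on its own: uniq -c only counts identical lines, it does not sum the second column, so a post-processing step would still be needed. A sketch of what that could look like (the multiply-and-sum awk stage is an illustration added here, not part of the comment above):

```shell
# uniq -c prefixes each distinct "key value" line with its repeat count,
# so summing count*value per key recovers the totals.
printf 'a 1\na 1\na 5\nb 2\n' | sort | uniq -c |
awk '{a[$2] += $1 * $3} END {for (k in a) print k, a[k]}' | sort
# prints:
# a 7
# b 2
```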

U. Windl