
file1:

HOGBRM443983         -2522.00  19800826
HOGBRM445985         -2389.00  19801101
HOUSAM1891409        -1153.00  19811228
HOUSAM2004289        -650.00   19860101
HOUSAM2005991        -843.00   19860109
HOCANM388722         -1546.00  19860116
HOUSAM2007297        -1882.00  19860125
HOUSAM2007389        -1074.00  19860128
HOITAM801038516      -691.00   19860128

Columns 2 and 3 contain the values and the birth dates (year, month, day) of the ids in column 1, respectively. I want to check how many ids exist within each birth year, and what the average of the values (from the second column) is for each year. For example, in file1 there are 2, 1 and 6 ids in the years 1980, 1981 and 1986 respectively, so the output should be:

output:

1980 2 -2455.5
1981 1 -1153.00
1986 6 -1114.33

in which the first column shows the year of birth, the second column shows the number of ids within each year, and the third column shows the average value of the ids born in that year.

Considering that the real data is indeed huge, any suggestions would be appreciated.

zara

4 Answers


With GNU datamash:

cut -c1-35 infile | datamash -W -g 3 count 3 mean 2

Note that you need to process your data first (I used cut as it was the obvious choice with your input sample, but any tool will do) so as to remove the month and day from the birth date:

HOGBRM443983         -2522.00  1980
HOGBRM445985         -2389.00  1980
HOUSAM1891409        -1153.00  1981
HOUSAM2004289        -650.00   1986
......

and only then pipe it to datamash.
This also assumes the 3rd column is sorted by year (if it's not sorted, use datamash -s -W -g ...).
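
If the fields are not at fixed character positions, awk works just as well for that preprocessing step (a sketch, assuming whitespace-separated columns as in your sample):

awk '{print $1, $2, substr($3,1,4)}' infile | datamash -W -g 3 count 3 mean 2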

don_crissti

The Awk answer:

awk '{y=substr($3,1,4); c[y]++; s[y]+=$2} END {for (y in c) {print y, c[y], (s[y]/c[y])}}' file.txt
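
The same logic broken out with comments, equivalent to the one-liner above:

awk '{
    y = substr($3, 1, 4)    # birth year: first four characters of column 3
    c[y]++                  # count of ids per year
    s[y] += $2              # running sum of values per year
} END {
    for (y in c)
        print y, c[y], s[y] / c[y]    # year, count, average
}' file.txt
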
Wildcard

Consider using a real database.

Using a Postgres sandbox set up in a Vagrant VM, I did this using the following steps:

CREATE TABLE MyData (id text, val float, bday date);
INSERT INTO MyData VALUES
('HOGBRM443983',-2522.00,'1980-08-26'),
('HOGBRM445985',-2389.00,'1980-11-01'),
('HOUSAM1891409',-1153.00,'1981-12-28'),
('HOUSAM2004289',-650.00,'1986-01-01'),
('HOUSAM2005991',-843.00,'1986-01-09'),
('HOCANM388722',-1546.00,'1986-01-16'),
('HOUSAM2007297',-1882.00,'1986-01-25'),
('HOUSAM2007389',-1074.00,'1986-01-28'),
('HOITAM801038516',-691.00,'1986-01-28')
;
SELECT
  extract(year FROM bday) AS yr,
  count(id) AS count,
  avg(val) AS average
FROM mydata GROUP BY yr;

Output is:

  yr  | count |      average      
------+-------+-------------------
 1981 |     1 |             -1153
 1980 |     2 |           -2455.5
 1986 |     6 | -1114.33333333333
(3 rows)

You could probably handle this with text processing, but you mention that the data is HUGE, and an actual database is designed for this sort of computation. (And the blog post I linked to has all the steps to get a Postgres sandbox set up.)
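
For loading the real data, row-by-row INSERT statements won't scale; the usual approach is a bulk load with COPY after converting the file to CSV. A sketch, assuming the input file is named file1.txt and the table above already exists:

# convert the whitespace-separated input to CSV with an ISO date
awk '{printf "%s,%s,%s-%s-%s\n", $1, $2, substr($3,1,4), substr($3,5,2), substr($3,7,2)}' file1.txt > mydata.csv
# bulk-load client-side into the mydata table
psql -c "\copy mydata FROM 'mydata.csv' WITH (FORMAT csv)"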

Wildcard

Miller was created to solve problems like these:

$ cat hogbrm.txt | \
  mlr --nidx --repifs put '$3=sub(string($3),"(....).*", "\1")' \
  then stats1 -a count,mean -f 2 -g 3
1980 2 -2455.500000
1981 1 -1153.000000
1986 6 -1114.333333

Context:

  • Use --nidx since there is no header, just positionally indexed columns
  • Use --repifs since multiple spaces separate columns
  • Use sub to drop the last four digits of the date (column 3)
  • Use stats1 to compute the count and mean of column 2 grouped by column 3
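
If you want the two-decimal formatting shown in the desired output, Miller's --ofmt option controls how floats are printed (note it applies to all floating-point output):

$ mlr --nidx --repifs --ofmt '%.2lf' \
    put '$3=sub(string($3),"(....).*", "\1")' \
    then stats1 -a count,mean -f 2 -g 3 hogbrm.txt

which should print the means as -2455.50, -1153.00 and -1114.33.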
John Kerl