Can I use GNU parallel to group command arguments by a column value?

Question

Consider the data from the GNU parallel manual's example for --group-by:

cat > table.csv <<"EOF"
UserID, Consumption
123,    1
123,    2
12-3,   1
221,    3
221,    1
2/21,   5
EOF

Is there a way to group records by one column and write all the values from another column in the group as command-line arguments?

This command doesn't group but otherwise gives me the output structure I want.

cat table.csv | parallel --colsep , --header : -kN1 echo UserID {1}  Consumption {2}

UserID 123 Consumption 1
UserID 123 Consumption 2
UserID 12-3 Consumption 1
UserID 221 Consumption 3
UserID 221 Consumption 1
UserID 2/21 Consumption 5

What command would give me output like this?

UserID 123 Consumption 1 2
UserID 12-3 Consumption 1
UserID 221 Consumption 3 1
UserID 2/21 Consumption 5

I also want to limit the number of "Consumption" values.

Say there were more than 4 in one of the groups.

cat > table.csv <<"EOF"
UserID, Consumption
123,    1
123,    2
123,    3
123,    4
123,    5
123,    6
123,    7
12-3,   1
221,    3
221,    1
2/21,   5
EOF

I want the command line to contain no more than 4 "Consumption" values.

UserID 123 Consumption 1 2 3 4
UserID 123 Consumption 5 6 7
UserID 12-3 Consumption 1
UserID 221 Consumption 3 1
UserID 2/21 Consumption 5

The manual shows how to use --group-by to select the correct groups.

cat table.csv | \
parallel --pipe --colsep , --header : --group-by UserID -kN1 wc

The four lines of wc output show that parallel ran four jobs, one per group. The first group, for example, has 3 lines, 6 words, and 40 characters.

      3       6      40
      2       4      30
      3       6      40
      2       4      30

To make the group input clearer I swap wc for cat.

cat table.csv | \
parallel --pipe --colsep , --header : --group-by UserID -kN1 cat

The cat output shows that parallel passes the original input lines to the job and copies the header line as the first line of each group.

UserID, Consumption
123,    1
123,    2
UserID, Consumption
12-3,   1
UserID, Consumption
221,    3
221,    1
UserID, Consumption
2/21,   5

The problem is that --group-by works together with --pipe, so parallel passes each group to the job on standard input instead of as command-line arguments. I don't see a way around that.

Do I need to change the way I pass the arguments to GNU parallel? Do I need to use another tool to create the correct format before using GNU parallel to execute?

I'm using GNU parallel version 20231122.

Just FYI, you can end a line on | there is no need for | \, the | alone is fine as the last character of the line. Any control operator can be used as the last character of a line. — terdon, Dec 14 '23 at 09:47
Thanks, @terdon. The GNU parallel manual uses that syntax and I copied it. — Iain Samuel McLean Elder, Dec 14 '23 at 12:37
Ah! Really? That might explain why I see it so often. Thanks @OleTange, I was wondering why you would have done that and I suspected you knew something I didn't. — terdon, Dec 15 '23 at 09:15

Ole Tange · Accepted Answer · 2023-12-15T09:58:42.090

In Bash you can do:

doit() { parallel --header : --colsep , -n4 echo UserID {1} Consumption {2} {4} {6} {8}; }
export -f doit
cat table.csv | parallel --pipe --colsep , --header : --group-by UserID -kN1 doit

I do not see how you can do it in a single parallel instance. What you want is to mix --pipe and normal mode, and GNU Parallel can't really do that.
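The question also asks whether another tool could create the right format before parallel runs. A minimal sketch of that idea, assuming awk is available and that rows with the same UserID are adjacent in the input (as in the sample table). The function name group_consumption is hypothetical, and this is only one possible approach, not something from the GNU parallel manual:

```shell
# Hypothetical helper (not part of GNU parallel): read the CSV on stdin and
# print one line per group with at most $1 Consumption values (default 4).
# Assumes rows that share a UserID are adjacent in the input.
group_consumption() {
  awk -F',' -v max="${1:-4}" '
    NR == 1 { next }                       # skip the header line
    {
      id = $1; val = $2
      gsub(/^[ \t]+|[ \t]+$/, "", id)      # trim padding around both fields
      gsub(/^[ \t]+|[ \t]+$/, "", val)
      # Start a new output line when the key changes or the line is full.
      # The "" concatenation forces a string comparison of the keys.
      if (!started || (id "") != (key "") || n == max) {
        if (started) print line
        started = 1; key = id; n = 0
        line = "UserID " key " Consumption"
      }
      line = line " " val
      n++
    }
    END { if (started) print line }        # flush the last group
  '
}

# Demo with the second table from the question.
group_consumption 4 <<"EOF"
UserID, Consumption
123,    1
123,    2
123,    3
123,    4
123,    5
123,    6
123,    7
12-3,   1
221,    3
221,    1
2/21,   5
EOF
```

The demo should print the five lines the question asks for. Each output line could then be handed to a command as separate arguments, for example with xargs -L 1, which runs the command once per input line.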

edited Dec 15 '23 at 09:58

answered Dec 15 '23 at 09:13

Thanks @OleTange. Is -n the number of lines and {m} the m'th argument across those lines? I'm not sure why -n4 and Consumption goes up to {8}. – Iain Samuel McLean Elder Dec 15 '23 at 10:22
Yeah, I can see why that is confusing. The manual is clearly wrong (written before --colsep). -n m reads m records. These records are then split using --colsep, which results in a number of replacement strings that each have a number. In our case we read 4 records (i.e. 4 lines) and split these on "," thus giving us 8 replacement strings. – Ole Tange Dec 15 '23 at 10:42
