Consider the data from the GNU parallel manual's example for --group-by:

cat > table.csv <<"EOF"
UserID, Consumption
123,    1
123,    2
12-3,   1
221,    3
221,    1
2/21,   5
EOF

Is there a way to group records by one column and write all the values from another column in the group as command-line arguments?

This command doesn't group but otherwise gives me the output structure I want.

cat table.csv | parallel --colsep , --header : -kN1 echo UserID {1}  Consumption {2}
UserID 123 Consumption 1
UserID 123 Consumption 2
UserID 12-3 Consumption 1
UserID 221 Consumption 3
UserID 221 Consumption 1
UserID 2/21 Consumption 5

What command would give me output like this?

UserID 123 Consumption 1 2
UserID 12-3 Consumption 1
UserID 221 Consumption 3 1
UserID 2/21 Consumption 5

I also want to limit the number of "Consumption" values.

Say there were more than 4 in one of the groups.

cat > table.csv <<"EOF"
UserID, Consumption
123,    1
123,    2
123,    3
123,    4
123,    5
123,    6
123,    7
12-3,   1
221,    3
221,    1
2/21,   5
EOF

I want the command line to contain no more than 4 "Consumption" values.

UserID 123 Consumption 1 2 3 4
UserID 123 Consumption 5 6 7
UserID 12-3 Consumption 1
UserID 221 Consumption 3 1
UserID 2/21 Consumption 5

The manual shows how to use --group-by to select the correct groups.

cat table.csv | \
parallel --pipe --colsep , --header : --group-by UserID -kN1 wc

The four lines of wc output show that it operates on four groups. The first group, for example, has 3 lines, 6 words, and 40 characters.

      3       6      40
      2       4      30
      3       6      40
      2       4      30

To make the group input clearer I swap wc for cat.

cat table.csv | \
parallel --pipe --colsep , --header : --group-by UserID -kN1 cat

The cat output shows that parallel passes the original input lines to the job and copies the header line as the first line of each group.

UserID, Consumption
123,    1
123,    2
UserID, Consumption
12-3,   1
UserID, Consumption
221,    3
221,    1
UserID, Consumption
2/21,   5

The problem is that --group-by requires --pipe, so parallel passes each group to the job on standard input instead of as command-line arguments. I don't see a way around that.

Do I need to change the way I pass the arguments to GNU parallel? Do I need to use another tool to create the correct format before using GNU parallel to execute?

I'm using GNU parallel version 20231122.

1 Answer

In Bash you can do:

doit() { parallel --header : --colsep , -n4 echo UserID {1} Consumption {2} {4} {6} {8}; }
export -f doit
cat table.csv | parallel --pipe --colsep , --header : --group-by UserID -kN1 doit

I do not see how you can do it in a single parallel instance. What you want is to mix --pipe and normal mode, and GNU Parallel can't really do that.
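As a possible alternative to the nested call (not part of the answer as posted), the grouping and the 4-value limit could be done before parallel sees the data. The sketch below uses awk; the function name `group_chunks` and its `max` argument are made up for illustration. It assumes rows of the same UserID are adjacent, as in table.csv, and that the first line is a header.

```bash
#!/usr/bin/env bash
# group_chunks MAX: read "UserID, Consumption" rows on stdin and print
# one line per chunk: the UserID followed by at most MAX values.
# (group_chunks and MAX are hypothetical names, not GNU parallel features.)
group_chunks() {
  awk -F' *, *' -v max="$1" '
    NR == 1 { next }                       # skip the header line
    NF < 2  { next }                       # ignore blank or malformed lines
    ($1 "") != key || n == max {           # new UserID, or current chunk is full
      if (n > 0) print line                # flush the previous chunk
      key = $1 ""; n = 0; line = $1
    }
    { line = line " " $2; n++ }            # append this value to the chunk
    END { if (n > 0) print line }          # flush the last chunk
  '
}

group_chunks 4 <<'EOF'
UserID, Consumption
123,    1
123,    2
123,    3
123,    4
123,    5
123,    6
123,    7
12-3,   1
221,    3
221,    1
2/21,   5
EOF
```

For the second table in the question this prints `123 1 2 3 4`, `123 5 6 7`, `12-3 1`, `221 3 1` and `2/21 5`, one per line. Each line could then be handed to parallel in normal mode, splitting on the space with --colsep; how parallel fills the replacement strings for lines with fewer columns is not verified here.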

Ole Tange
  • Thanks @OleTange. Is -n the number of lines and {m} the m'th argument across those lines? I'm not sure why -n4 and Consumption goes up to {8}. – Iain Samuel McLean Elder Dec 15 '23 at 10:22
  • Yeah, I can see why that is confusing. The manual is clearly wrong (written before --colsep). -n m reads m records. These records are then split using --colsep, which results in a number of replacement strings, each of which has a number. In our case we read 4 records (i.e. 4 lines) and split them on the comma, thus giving us 8 replacement strings. – Ole Tange Dec 15 '23 at 10:42
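
The numbering described in the comment above can be mimicked in plain Bash, without calling parallel at all. This is only an illustration of the counting, not of how parallel is implemented; the function name `number_fields` is made up for the example.

```bash
#!/usr/bin/env bash
# number_fields N SEP: read up to N records from stdin, split each one on
# the single-character separator SEP, and number the fields consecutively
# across all records, the way the comment describes for -n with --colsep.
# (number_fields is a hypothetical helper, not part of GNU parallel.)
number_fields() {
  local n=$1 sep=$2 line field
  local i=0 count=0
  local -a fields
  while (( count < n )) && IFS= read -r line; do
    IFS=$sep read -ra fields <<< "$line"   # split one record into fields
    for field in "${fields[@]}"; do
      i=$((i + 1))
      printf '{%d}=%s\n' "$i" "$field"
    done
    count=$((count + 1))
  done
}

# Four records of two columns each give eight numbered fields.
printf '%s\n' '123,1' '123,2' '123,3' '123,4' | number_fields 4 ,
```

The UserID lands in the odd positions and the Consumption values in {2}, {4}, {6} and {8}, which matches the replacement strings used in `doit`.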