
I have a two-column file of the format:

A,val1
A,val2
A,val3
B,val1
B,val2
B,val3

What I need is to transform this so that, for each unique value in the first column, the second-column values are laid out horizontally on a single line, thus:

A,val1,val2,val3
B,val1,val2,val3

I am not sure whether the best way is with bash or awk - perhaps a combination of both? Could anyone please point me in the right direction?

Jonty
  • Are there always three rows for each unique value in the first column? Or is this number variable? – sjy Sep 15 '20 at 09:23
  • Please don't post images of text. Instead, copy/paste the text into your question and use the formatting tools to format it as code. Please [edit] your question and replace the images with text we can actually copy and use to test our answers. – terdon Sep 15 '20 at 09:24
  • Is it ensured that the lines are grouped by the first field, i.e. first all lines starting with A, then the lines starting with B and so on? Do you want exactly the same order of the vals within the output line as they were within the input file, or is additional sorting required? – AdminBee Sep 15 '20 at 10:27

4 Answers


There are various ways of solving this problem with shell scripts, but I prefer to reach for a tool that is not quite standard yet: Miller. You can install it with apt install miller on Ubuntu/Debian. I find that Miller's verbs are a more natural tool for thinking about this kind of problem than bash or awk.

If the data specified in the question is stored in INPUT_FILE:

A,val1
A,val2
A,val3
B,val1
B,val2
B,val3

Then Miller's nest verb can be used first to pack multiple records (rows) into a single record holding multiple values in field 2, and then to expand field 2 into multiple fields:

mlr --ocsv --headerless-csv-output \
  nest --implode --values --across-records -f 2 then \
  nest --explode --values --across-fields -f 2 INPUT_FILE

This produces the output you want:

A,val1,val2,val3
B,val1,val2,val3
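To see what the first nest step does on its own, you can run just the implode; assuming Miller's default DKVP output (where the unlabelled CSV fields get the positional keys 1 and 2, and the semicolon is nest's default --nested-fs), it should print something like:

mlr nest --implode --values --across-records -f 2 INPUT_FILE
1=A,2=val1;val2;val3
1=B,2=val1;val2;val3

The second nest then splits field 2 back out into fields 2_1, 2_2 and 2_3, which the headerless CSV writer prints as plain comma-separated lines.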

There's probably an even simpler way to do this in Miller, but that was the first solution I found.

sjy
  • Hi - many thanks for that answer, which looks great; the only problem is that I am not authorised to install packages on the server, so Miller is out unfortunately. I had thought the best way to solve the issue would be a shell script with two loops - one over the unique values of the first field, then iterating through the second column and using some kind of output formatting to force everything onto the same line - but I'm getting stuck! – Jonty Sep 15 '20 at 10:10
  • 1
    @Jonty usually the advice is against using shell loops for text processing for performance reasons. – AdminBee Sep 15 '20 at 10:25
  • 1
    @Jonty even if you can't install packages, you might still be able to download the mlr.linux.x86_64 binary (~4 MB) from Miller's releases page, set the execute bit with chmod +x mlr.linux.x86_64, and invoke it from the current directory, e.g. ./mlr.linux.x86_64 --help. – sjy Sep 15 '20 at 10:29
  • Hi sjy - that was a great suggestion and I took your advice and just ran it as a standalone executable - did the job a treat - many thanks :-) – Jonty Sep 15 '20 at 11:56

Awk alone:

$ awk -F, 'BEGIN{OFS=FS} {a[$1] = a[$1] == "" ? $2 : a[$1] FS $2} END {for(i in a) print i,a[i]}' file
A,val1,val2,val3
B,val1,val2,val3

Note that the output order is not guaranteed - that's easy to fix with GNU awk, but harder with other implementations. The input data need not be sorted.
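For example, a minimal sketch of the GNU awk fix (PROCINFO["sorted_in"] is a gawk extension) that makes the END loop traverse the array in sorted key order:

$ gawk -F, 'BEGIN{OFS=FS}
    {a[$1] = a[$1] == "" ? $2 : a[$1] FS $2}
    END{PROCINFO["sorted_in"] = "@ind_str_asc"; for(i in a) print i,a[i]}' file
A,val1,val2,val3
B,val1,val2,val3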

Otherwise, with GNU datamash

datamash -t, groupby 1 collapse 2 < file

(if the input isn't sorted, add -s) or with Miller

mlr --nidx --fs ',' nest --implode --values --across-records --nested-fs ',' -f 2 file

or more compactly with more recent versions of Miller

mlr --nidx --fs ',' nest --ivar ',' -f 2 file
steeldriver
  • Thank you very much - I love this awk approach - very speedy and pure! – Jonty Sep 15 '20 at 13:18
  • 1
    @Jonty you're welcome - be aware that it slurps the whole file into memory, so it's likely not the best approach for large files – steeldriver Sep 15 '20 at 13:52
  • 1
    A minor tweak just to remove one small piece of duplication: change a[$1] = a[$1] == "" ? $2 : a[$1] FS $2 to a[$1] = (a[$1] == "" ? "" : a[$1] FS) $2 so you don't have to specify $2 twice. – Ed Morton Sep 15 '20 at 23:43

To guarantee the output order, use the following awk code. Here we maintain an associative array (hash), seen[...], indexed by an increasing counter, which records each new key ($1) in the order it is first encountered.

$ awk -F "," '
    !($1 in a) { seen[++n] = $1 }  # record each new key in first-seen order
    { a[$1] = a[$1] FS $2 }        # append this value, with a leading FS
    END {
      for (i=1; i<=n; i++)
        print seen[i] a[seen[i]]
    }
  ' file
A,val1,val2,val3
B,val1,val2,val3

Using any awk in any shell on every Unix box, preserving the output line order, and storing only one $1-keyed block in memory at a time (this relies on the input being grouped by the first field, as in the sample):

$ awk '
    BEGIN { FS=OFS="," }
    $1!=p { printf "%s%s", rec, sep; rec=p=$1; sep=ORS }  # new key: flush the previous record
    { rec = rec OFS $2 }                                  # append this value to the current record
    END { print rec }                                     # flush the final record
' file
A,val1,val2,val3
B,val1,val2,val3
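If the input might not already be grouped by the first field, a sort can be prepended (a sketch; note that sorting changes the relative order of the groups):

$ sort -t, -k1,1 file | awk 'BEGIN{FS=OFS=","} $1!=p{printf "%s%s", rec, sep; rec=p=$1; sep=ORS} {rec = rec OFS $2} END{print rec}'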
Ed Morton