0

Sorry I am very new to Linux and I am uncertain the capability of bash for what I want to achieve.

I want to merge column-2 values if column-1 and column-3 values are identical. In this case, if there are same error descriptions and same merchant, I want to merge RefNo fields by comma.

Error Desc|RefNo|Merchant
===================================
Category code invalid|03077|merchanta
Category code invalid|09877|merchanta
Invalid ID|12345|merchanta
Invalid ID|07323|merchantc
Invalid ID|03523|merchantc
No valid reason|78653|merchantb

Expected outcome:

Error Desc|RefNo|Merchant
===================================
Category code invalid|03077,09877|merchanta
Invalid ID|12345|merchanta
Invalid ID|07323,03523|merchantc
No valid reason|78653|merchantb

Similar posts I've found but it's removing duplicate and I don't want to remove and merge the column-2 instead. Keeping unique rows based on information from 2 of three columns.

404
  • 1
  • (1) Your problem statement is jumbled. You say you want to merge column-3 values if column-1 and column-2 values are identical, but then you show an example where you want to merge column-2 values if column-1 and column-3 values are identical.  (2) While it might be possible to do this purely in bash, that would be a terrible idea. This is a text-processing problem, that is best solved with text-processing tools, and “the capability of bash” is largely irrelevant. – G-Man Says 'Reinstate Monica' Mar 23 '20 at 08:16

2 Answers2

0

With GNU datamash you could do:

datamash -t'|' groupby 1,3 collapse 2 < <(tail -n+3 file)

Output:

Category code invalid|merchanta|03077,09877
Invalid ID|merchanta|12345
Invalid ID|merchantc|07323,03523
No valid reason|merchnatb|78653

This groups on the first and third field and collapses the values of the second field. The tail -n+3 is used to skip the two header lines.

You could use awk to swap the second and third column of the output and head to add the header lines:

{
  head -n2 file
  datamash -t'|' groupby 1,3 collapse 2 < <(tail -n+3 file) | 
    awk 'BEGIN{OFS=FS="|"}{print $1,$3,$2}'
}

Output:

Error Desc|RefNo|Merchant
===================================
Category code invalid|03077,09877|merchanta
Invalid ID|12345|merchanta
Invalid ID|07323,03523|merchantc
No valid reason|78653|merchnatb
Freddy
  • 25,565
  • Thanks much Freddy.It looks magnificent . Unfortunately GNU datamsh is not available to use in my environment. – 404 Mar 23 '20 at 06:07
0

I’m probably overlooking something — it’s probably possible to make this shorter — but this works:

awk '
    BEGIN   {   FS="|"; OFS="|" }
    NR <= 2
    NR > 2  {
                seen_desc[$1]++
                seen_merc[$3]++
                if (ref[$1,$3] == "")
                        ref[$1,$3] = $2
                else
                        ref[$1,$3] = ref[$1,$3] "," $2
            }
    END     {
                for (desc in seen_desc) {
                        for (merc in seen_merc) {
                                if (ref[desc,merc] != "") {
                                        print desc, ref[desc,merc], merc
                                }
                        }
                }
            }'