How to convert a CSV with multiple sub-row items (from a column) to new columns in Awk?

Question

I have a CSV file with the general format shown in the picture below.

In that CSV there are multiple rows that belong to a certain column (desc) and I would like to to extract those items and add them to new columns called name, size, weight, glass, respectively. I have highlighted (in red) those sub-row items for the entries.

Original Structure:

Expected Structure:

The original CSV:

page,item,id,desc,price
1,2,F1,Alpha,111.11
1,,,380 x 2280 mm (size),
1,,,Weight: 33.0 kg,
1,,,Glass: AAA,
1,,,XXX,
1,3,F2,Beta,222.22
1,,,880 x 2280 mm (size),
1,,,Weight: 68.4 kg,
1,,,Glass: BBB,
1,,,YYY,
1,4,F3,Gamma,333.33
1,,,1980 x 1580 mm (size),
1,,,Weight: 78.2 kg,
1,,,Glass: CCC,
1,,,XXX,
1,,,YYY,
1,,,ZZZ,

Expected resulting CSV:

page,item,id,name,size,weight,glass,price
1,2,F1,Alpha,380 x 2280,33.0,AAA,111.11
1,3,F2,Beta,880 x 2280,68.4,BBB,222.22
1,4,F3,Gamma,1980 x 1580,78.2,CCC,333.33

where name would take the place of the first row in desc.

UPDATE:

Under certain conditions, some Awk solutions may work with the above, but fail when adding a 4th item. To fully test, consider adding this to the above:

1,7,F4,Delta,111.11
1,,,11 x 22 mm (size),
1,,,Weight: 33.0 kg,
1,,,Glass: DDD,
1,,,Random-1,

So 3 important points:

The number of sub-rows in the desc column may vary.
Any sub-rows after Glass:... should be ignored.
There may be items that doesn't have any sub-rows in the desc column, they should also be ignored.

Q: How can I remap those sub-rows into new columns, using Awk?
(Or are there more suitable tools for doing this in bash?)

Possibly related (but not very helpful) Questions:

Hi @EdMorton, no, but since the actual data is OCR extracted from a PDF file, there certainly could be garbage. I've tried to do some rudimentary pre-filtering using: awk -F'[,"]' '($2~/[0-9]+/ && $3~/F[0-9]+/) || (!$2 && !$3)', because there may be rows that look like: sometimes,shit,happens,with,data. — not2qubit, Apr 29 '21 at 16:31
-F'[,"]' is pretty suspect. Actually so are the other conditions. If you post a new question we can help you with that filtering. — Ed Morton, Apr 29 '21 at 16:38

αғsнιη · Accepted Answer · 2021-04-29T17:43:16.933

awk 'BEGIN{ FS=OFS=","; print "page,item,id,name,size,weight,glass,price" }
    $2!=""{ price=$5; data=$1 FS $2 FS $3 FS $4; desc=""; c=0; next }
          { gsub(/ ?(mm \(size\)|Weight:|kg|Glass:) ?/, "") }
    ++c<=3{ desc=(desc==""?"":desc OFS) $4; next }
    data  { print data, desc, price; data="" }
' infile

including explanation:

awk 'BEGIN{ FS=OFS=","; print "page,item,id,name,size,weight,glass,price" }
     #this block will be executed only once before reading any line, and does: 
            #set FS (Field Separator), OFS (Output Field Separator) to a comma character
            #print the "header" line  ....
$2!=&quot;&quot;{ price=$5; data=$1 FS $2 FS $3 FS $4; desc=&quot;&quot;; c=0; next }
#this blocks will be executed only when column#2 value was not empty, and does:
        #backup column#5 into &quot;price&quot; variable
        #also backup columns#1~4 into &quot;data&quot; variable
        #reset the &quot;desc&quot; variable and also counter variable &quot;c&quot;
        #then read next line and skip processing the rest of the code 

      { gsub(/ ?(mm \(size\)|Weight:|kg|Glass:) ?/, &quot;&quot;) }
        #this block runs for every line and replace strings above with empty string

++c&lt;=3{ desc=(desc==&quot;&quot;?&quot;&quot;:desc OFS) $4; next }
#this block runs at most 3reps and 
       #joining the descriptions in column#4 of every line
       #and read the next line until counter var &quot;c&quot; has value &lt;=3 

 data { print data, desc, price; data=&quot;&quot; }
 #if &quot;data&quot; variable has containing any data, then
       #print the data, desc, price and empty &quot;data&quot; variable 

' infile

How to convert a CSV with multiple sub-row items (from a column) to new columns in Awk?

1 Answers1

Linked