0

I have a CSV file with the general format shown in the picture below.

In that CSV there are multiple rows that belong to a certain column (desc) and I would like to to extract those items and add them to new columns called name, size, weight, glass, respectively. I have highlighted (in red) those sub-row items for the entries.

Original Structure:

enter image description here

Expected Structure:

enter image description here


The original CSV:

page,item,id,desc,price
1,2,F1,Alpha,111.11
1,,,380 x 2280 mm (size),
1,,,Weight: 33.0 kg,
1,,,Glass: AAA,
1,,,XXX,
1,3,F2,Beta,222.22
1,,,880 x 2280 mm (size),
1,,,Weight: 68.4 kg,
1,,,Glass: BBB,
1,,,YYY,
1,4,F3,Gamma,333.33
1,,,1980 x 1580 mm (size),
1,,,Weight: 78.2 kg,
1,,,Glass: CCC,
1,,,XXX,
1,,,YYY,
1,,,ZZZ,

Expected resulting CSV:

page,item,id,name,size,weight,glass,price
1,2,F1,Alpha,380 x 2280,33.0,AAA,111.11
1,3,F2,Beta,880 x 2280,68.4,BBB,222.22
1,4,F3,Gamma,1980 x 1580,78.2,CCC,333.33

where name would take the place of the first row in desc.


UPDATE:

Under certain conditions, some Awk solutions may work with the above, but fail when adding a 4th item. To fully test, consider adding this to the above:

1,7,F4,Delta,111.11
1,,,11 x 22 mm (size),
1,,,Weight: 33.0 kg,
1,,,Glass: DDD,
1,,,Random-1,

So 3 important points:

  • The number of sub-rows in the desc column may vary.
  • Any sub-rows after Glass:... should be ignored.
  • There may be items that doesn't have any sub-rows in the desc column, they should also be ignored.

Q: How can I remap those sub-rows into new columns, using Awk?
(Or are there more suitable tools for doing this in bash?)

Possibly related (but not very helpful) Questions:

not2qubit
  • 1,666
  • Can the item value ever be the number 0? – Ed Morton Apr 29 '21 at 13:58
  • Hi @EdMorton, no, but since the actual data is OCR extracted from a PDF file, there certainly could be garbage. I've tried to do some rudimentary pre-filtering using: awk -F'[,"]' '($2~/[0-9]+/ && $3~/F[0-9]+/) || (!$2 && !$3)', because there may be rows that look like: sometimes,shit,happens,with,data. – not2qubit Apr 29 '21 at 16:31
  • -F'[,"]' is pretty suspect. Actually so are the other conditions. If you post a new question we can help you with that filtering. – Ed Morton Apr 29 '21 at 16:38

1 Answers1

2
awk 'BEGIN{ FS=OFS=","; print "page,item,id,name,size,weight,glass,price" }
    $2!=""{ price=$5; data=$1 FS $2 FS $3 FS $4; desc=""; c=0; next }
          { gsub(/ ?(mm \(size\)|Weight:|kg|Glass:) ?/, "") }
    ++c<=3{ desc=(desc==""?"":desc OFS) $4; next }
    data  { print data, desc, price; data="" }
' infile

including explanation:

awk 'BEGIN{ FS=OFS=","; print "page,item,id,name,size,weight,glass,price" }
     #this block will be executed only once before reading any line, and does: 
            #set FS (Field Separator), OFS (Output Field Separator) to a comma character
            #print the "header" line  ....
$2!=&quot;&quot;{ price=$5; data=$1 FS $2 FS $3 FS $4; desc=&quot;&quot;; c=0; next }
#this blocks will be executed only when column#2 value was not empty, and does:
        #backup column#5 into &quot;price&quot; variable
        #also backup columns#1~4 into &quot;data&quot; variable
        #reset the &quot;desc&quot; variable and also counter variable &quot;c&quot;
        #then read next line and skip processing the rest of the code 

      { gsub(/ ?(mm \(size\)|Weight:|kg|Glass:) ?/, &quot;&quot;) }
        #this block runs for every line and replace strings above with empty string

++c&lt;=3{ desc=(desc==&quot;&quot;?&quot;&quot;:desc OFS) $4; next }
#this block runs at most 3reps and 
       #joining the descriptions in column#4 of every line
       #and read the next line until counter var &quot;c&quot; has value &lt;=3 

 data { print data, desc, price; data=&quot;&quot; }
 #if &quot;data&quot; variable has containing any data, then
       #print the data, desc, price and empty &quot;data&quot; variable 

' infile

αғsнιη
  • 41,407