
I'm using the bash shell. I want to parse a CSV file that follows the standard CSV format. From this thread -- https://stackoverflow.com/questions/4286469/how-to-parse-a-csv-file-in-bash -- I got this:

#!/bin/bash

file_path=$1
echo "$1"

while IFS=, read -r ID name address zipcode
do
    echo "I got:$ID|$name|$address|$zipcode"
done < "$file_path"

However, in a CSV file some cells may themselves contain a comma, in which case the cell is wrapped in quotes. So the line below doesn't parse properly:

1,1871,"222 W. Merchandise Mart Plaza, Suite 1212",60605

Is there a way to modify the above script (or write a new one) so that the CSV file is parsed accurately?

Dave

3 Answers


csvkit is what you need: it has robust grep, cut, join, etc., for CSV files. I will use just csvformat.

In this kind of situation I like to:

(1) convert the input data to a "good" separator (e.g. "," to "§"), removing unnecessary quotes:

csvformat -D § input | ...

(2) process the data with the "good" separator:

... | awk 'BEGIN{FS=OFS="§"} ...' | ...

(3) convert back to CSV ("§" to ","), adding quotes where necessary:

... | csvformat -d § > final
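If csvkit isn't available, the same round-trip idea can be sketched with Python's csv module (a minimal illustration, not the csvkit implementation; it assumes "§" never occurs in the data, and the upper-casing of field 3 is just a placeholder transformation):

```python
import csv, io

SEP = "\u00a7"  # "§" -- assumed absent from the data
raw = '1,1871,"222 W. Merchandise Mart Plaza, Suite 1212",60605\r\n'

# (1) CSV -> §-separated lines; the csv parser strips the quotes
rows = list(csv.reader(io.StringIO(raw)))
flat = [SEP.join(row) for row in rows]

# (2) process with the "good" separator (here: upper-case field 3)
processed = [line.split(SEP) for line in flat]
for fields in processed:
    fields[2] = fields[2].upper()

# (3) back to CSV; the writer re-quotes fields containing commas
out = io.StringIO()
csv.writer(out).writerows(processed)
print(out.getvalue(), end="")
# -> 1,1871,"222 W. MERCHANDISE MART PLAZA, SUITE 1212",60605
```

The key point is the same as with csvformat: once the commas-inside-fields problem is delegated to a real CSV parser, the middle step can use any naive line-oriented tool.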

JJoao

Input file a.csv:

1,1111,"111 W. Merchandise Mart's Plaza, Suite 1111",10101
2,2222,"222 Ben's St, Suite 222",20202

One-liner to parse the input file using Python 3:

$ cat a.csv | python3 -c $'import csv,sys;reader=csv.reader(sys.stdin)\nfor row in reader: print(row)'

Output:

['1', '1111', "111 W. Merchandise Mart's Plaza, Suite 1111", '10101']
['2', '2222', "222 Ben's St, Suite 222", '20202']

More complicated one-liner, with output:

$ cat a.csv | python3 -c $'import csv,sys;reader=csv.reader(sys.stdin)\na=0\nb=0\nfor row in reader:\n\ta+=1\n\tprint("Row",a)\n\tfor col in row:\n\t\tb+=1\n\t\tprint("\tColumn",b,":",col)'
Row 1
        Column 1 : 1
        Column 2 : 1111
        Column 3 : 111 W. Merchandise Mart's Plaza, Suite 1111
        Column 4 : 10101
Row 2
        Column 5 : 2
        Column 6 : 2222
        Column 7 : 222 Ben's St, Suite 222
        Column 8 : 20202

Readable version of the second one-liner:

import csv, sys

reader = csv.reader(sys.stdin)
a = 0
b = 0
for row in reader:
    a += 1
    print("Row", a)
    for col in row:
        b += 1
        print("\tColumn", b, ":", col)

GNU awk (gawk):

$ cat a.csv | awk -v FPAT='[^,]*|"[^"]*"' '{for (i=1; i<=NF; i++) {print ">"$i"<"}; print ""}'
>1<
>1111<
>"111 W. Merchandise Mart's Plaza, Suite 1111"<
>10101<

>2<
>2222<
>"222 Ben's St, Suite 222"<
>20202<
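Note that the FPAT output above keeps the surrounding quotes on quoted fields, whereas csv.reader strips them. For comparison, a sketch printing the same >field< markers from Python (data inlined here for illustration):

```python
import csv, io

data = '''1,1111,"111 W. Merchandise Mart's Plaza, Suite 1111",10101
2,2222,"222 Ben's St, Suite 222",20202
'''

# csv.reader handles the quoted commas and removes the quotes
for row in csv.reader(io.StringIO(data)):
    for field in row:
        print(">" + field + "<")
    print()
```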
Yurko

I have an awk script somewhere that deals with anything I ever saw Excel output: a variable number of fields; commas, newlines and double-quotes in fields; quoted fields even where not needed; a column header line; trimming white space; CRLF conversion. I'll need to look for it -- I think it converted between several formats and ran around 300 lines (including its own man page). It also does some stats on columns to help you specify DDL.

The main issue is deciding what format you want the output in. I settled on "|" bar separation for fields, and "$" for newlines, to avoid propagating the same issues into the output. It would be easy to use (say) ASCII 0x1F (Unit Separator) for columns, and 0x1D (Group Separator) for in-field newlines.
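That conversion can be sketched in Python (a hypothetical helper illustrating the idea, not the 300-line awk script; it uses the 0x1F/0x1D choice mentioned above and delegates the CSV parsing to the csv module):

```python
import csv, io

US, GS = "\x1f", "\x1d"  # ASCII Unit / Group Separators

def csv_to_us(text):
    """Re-emit CSV rows with 0x1F between fields and 0x1D replacing
    embedded newlines, so neither collides with real data."""
    out_lines = []
    for row in csv.reader(io.StringIO(text)):
        out_lines.append(US.join(f.replace("\n", GS) for f in row))
    return "\n".join(out_lines)

# A quoted field containing both a comma and an embedded newline
sample = '1,"222 W. Merchandise Mart Plaza,\nSuite 1212",60605\n'
print(csv_to_us(sample))
```

After this pass, every record is exactly one line and every field boundary is unambiguous, so ordinary line-oriented tools can process the stream safely.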

I can see the references in the previous thread that mention various tools and libraries. I wrote this when I was stuck on a locked-down critical system -- no downloads. I wasn't even allowed to use C in case I broke their vendor's product, but they conceded that I could write scripts -- "you can't do any damage with them."

Paul_Pedant