Parsing CSV that has multiple line values with carriage returns

Question

I have a CSV that I'm trying to remove the double quote values from using the following sed one-liner. Details about this can be found in my original post to figure that out

sed -i.bak 's/^"//g;s/","/,/g;s/"$//g' $1

Although this now works in most cases I have an issue with one of my exports that has multiple lines and apparently is broken up with carriage returns. Here is a proof sample of the data

$ cat -v Quote.csv

"Id","Blob","Employee","Etc"^M
"0Q01N000001MxPbSAK","Job to happen late day/ evening", "Employee 1", "more stuff"^M
"0Q01N000001N4klSAC","Daytime work during normal businesses hours ^M
some details ^M
some more details ^M
conclusion","Employee 2", "more stuff"%

When I attempt this on the full file I get the following error:

CSV error: record 2 (line: 4, byte: 101): found record with 2 fields, but the previous record has 4 fields

I believe this is because the alignment of columns and rows is distorted even though it "appears" fine in excel.

Any ideas on how to properly parse this so I can get around this issue. I need the double quotes removed so that when I import the CSV into an ArangoDB the values are typed properly.

I found another forum with essentially the same issue and one proposed solution was this.

sed 's/$/~/' Quote.csv |tr '\n' ' '   |sed 's/~ "KEY-/\n"KEY-/g'

I believe if I could reverse engineer it to work with my ID field, then perhaps it could work. I also noticed that I have <br> characters and I'm not sure if they would need to be tr'd out as well (seems like that would then mess up the data from having the line breaks it is expected to have)

Better analyse those lines that throw errors instead of believing. Their cookbook here says that multiline entries are fine in CSVs. https://docs.arangodb.com/3.0/Cookbook/Administration/ImportingData.html — , Dec 07 '18 at 20:13
What is it that throws the error? Have you counted the fields and are sure that the error is due to the carriage returns? The double quotes can't be removed if fields span multiple lines, that would break the CVS format (this is perfectly legal in CSV files). — Kusalananda, Dec 07 '18 at 20:16
The script is currently not attempting to import to arrangodb the error is specific to the sed call. The sed call is there because when I get to the import I need arango to maintain the Type and not assume they are all strings — Xtremefaith, Dec 07 '18 at 22:50
perl's Text::CSV module is the best I've seen for handling newlines in CSV data. — glenn jackman, Dec 10 '18 at 15:37

score 2 · Answer 1 · answered Dec 31 '18 at 11:51

the real problem of your output is that it has \r\n both for record endings and carriage returns inside a cells.

Instead a right CSV is in this way

If you correct the process that create the output to have something like above, and use in example Miller http://johnkerl.org/miller/doc and sed

tail -n +2 input.csv | \ mlr --implicit-csv-header --rs '\r\n' --icsvlite --ocsv put '$2=gsub($2,"\n"," ")' then label Id,Blob,Employee,Etc | \ sed 's/"//g'

You will have in output

Id,Blob,Employee,Etc 0Q01N000001MxPbSAK,Job to happen late day/ evening,Employee 1,more stuff 0Q01N000001N4klSAC,Daytime work during ... details conclusion,Employee 2, more stuff

Parsing CSV that has multiple line values with carriage returns

1 Answers1