2

I'm using the following gawk script to read values from the first column of the csv file file.csv.
I use gawk since I don't want any embedded commas to be ignored.

col=`gawk ' 
BEGIN {
FPAT="([^,]+)|(\"[^\"]+\")"
}
{print $1 }' file.csv`

For example, file.csv is:

col1,col2
"a,a","a,a1"
,"b1"
"c","c1"

The problem is that since the second row of the first column is empty, when it reads the values from the first column it takes the value of the second column as the value of the second row.

echo $col

returns

a,a
b1
c

but I would like it to acknowledge the empty string as follows:

a,a

c

How could I achieve such behaviour?

Thank you!


UPDATE:
I noticed that if the empty string/space is in the last row, this method ignores it.

col=`gawk ' 
BEGIN {
FPAT="([^,]*)|(\"[^\"]*\")+"
}
{print $1 }' file.csv`

For example, if the file.csv is the following:

col1,col2
"a,a","a,a1" 
"b","b1" 
,"c1"  

The result would be

col1
a,a
b 

instad of

col1
a,a
b

What can I do to fix this issue?

lilek3
  • 77

2 Answers2

3

Change the +s (1 or more repetitions) to *s (0 or more repetitions) in your FPAT so it allows for empty fields:

$ awk '
BEGIN { FPAT="([^,]*)|(\"[^\"]*\")+" }
{ print $1 }
' file.csv
col1
"a,a"

"c"

I also added a final + so it'll allow for escaped (doubled) quotes in your quoted fields, e.g. "foo""bar".

See https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk for more info on parsing CSVs with awk even if the fields include newlines.


As mentioned in the comments, that will work in gawk 5.1.0 and later but you may have trouble using the above with gawk 4.1.4 due to 2 bugs related to FPAT processing:

  1. https://lists.gnu.org/archive/html/bug-gawk/2017-04/msg00000.html
  2. https://lists.gnu.org/archive/html/bug-gawk/2019-11/msg00000.html

If so, you can work around the bugs by either:

  1. Just referencing NF at the start of the script, e.g. by adding { NF } as the first line, should be all you need but if that doesn't work then
  2. By clearing then re-assigning FPAT at the start of the script, e.g. by adding { oFPAT=FPAT; FPAT=""; FPAT=oFPAT } as the first line.
Ed Morton
  • 31,617
  • 1
    Seems to work in GNU Awk 5.1.0 but not in GNU Awk 4.1.4 apparently... – steeldriver Jul 28 '21 at 23:43
  • @steeldriver gawk 4.1.4 had 2 bug2 related to FPAT, I've updated my answer to include workarounds for them. – Ed Morton Jul 28 '21 at 23:57
  • 1
    Wow gnarly - thanks. I couldn't get the NF trick to work but can confirm that the oFPAT=FPAT;FPAT="";FPAT=oFPAT one did – steeldriver Jul 28 '21 at 23:57
  • Yeah, I wasn't sure which bug you were hitting, nor whether referencing NF worked around both of them - glad I had a solution handy! – Ed Morton Jul 28 '21 at 23:58
  • Thank you Ed. I don't know if I should create a new question for this but perhaps you could help me with this. Would you happen to know how to place the values from the first column, including that empty string, into an array? Something as such: arr=("a,a" "" "c") I tried IFS=$'\n' read -rd '' -a arr <<<"$col" but it seems to ignore the empty string. – lilek3 Jul 29 '21 at 00:22
  • You're welcome. You should ask a new question as chameleon questions are strongly discouraged and make sure when talking about variables and arrays you say "shell array" or "awk array" so it's clear which you mean. It's also extremely unlikely that using a shell array for whatever you have in mind instead of just doing all the processing in awk is the best approach so make sure your question is clear on what it is you're trying to do (not just how you're trying to do it). – Ed Morton Jul 29 '21 at 00:35
  • I noticed that if the empty string/space is in the last row, this method ignores it. For example, if the file.csv is the following: col1,col2 "a,a","a,a1" "b","b1" ,"c1" The result would be col1 "a,a" "b" instad of col1 "a,a" "b" What can I do to fix this issue? – lilek3 Jul 29 '21 at 18:57
  • No, the method I posted does not ignore anything. – Ed Morton Jul 29 '21 at 21:20
0
awk -F ',"' 'NR>1{gsub(/\"/,"",$1);print $1}' filename

output

a,a

c