I'm using the following gawk script to read values from the first column of the CSV file file.csv.
I use gawk with FPAT so that commas embedded in quoted fields are not treated as field separators.

col=`gawk ' 
BEGIN {
FPAT="([^,]*)|(\"[^\"]*\")+"
}
{print $1 }' file.csv`

However, I noticed that if the value in the last row is an empty string or a space, this method drops it.

For example, if the file.csv is the following:

col1,col2
"a,a","a,a1" 
"b","b1" 
,"c1"  

The result would be

col1
a,a
b 

instead of

col1
a,a
b

What can I do to fix this issue?

Thank you!

Related post: Reading empty string from CSV file in BASH

lilek3
  • 77
  • The command substitution (backticks) strips trailing newlines. What is your end goal here? there may be better ways than trying to grab a multiline string into a shell variable. – steeldriver Jul 29 '21 at 20:53
  • I'm processing the values later on, so it's important for me to know all the values in every row, even if a value is an empty string. I want to read a column from a CSV file into a shell variable, which I then turn into a shell array such as arr=("a,a" "b" "") – lilek3 Jul 29 '21 at 21:01
  • 2
    If your version of bash supports process substitution, I suggest you skip the scalar variable and read the lines straight into the array arr=(); while IFS= read -r line; do arr+=("$line"); done < <(gawk ...) like we discussed in Putting string with newline separated words into array – steeldriver Jul 29 '21 at 22:25
  • It's not without reason CSV has never been embraced on *nix (to my knowledge). It is a terrible format for text-processing. Are these static files? Could you convert it to another format first? Any option to get what ever generates these files to use another format? Even TSV would likely be a lot better. I.e. https://unix.stackexchange.com/q/359832/140633 – ibuprofen Jul 30 '21 at 00:06
  • It's extremely unlikely that using awk to print the fields from your CSV into a shell variable which you then read to populate a shell array is the best starting point for whatever it is you're trying to do. In fact, there's an excellent chance that whatever it is should just be all done in the one call to awk. Ask a new question about that if you'd like help, and be sure to include what it is you're trying to do as opposed to just how you're trying to do it. – Ed Morton Jul 30 '21 at 15:00
  • @ibuprofen you're right that quotes wouldn't be removed, but the last line would be a null string as shown, not "c1", as the OP is printing the first field of each row of the CSV and the first field in that last row IS a null string. All the awk and CSV stuff is irrelevant to this question anyway though; that's just a carryover from the OP's previous question. – Ed Morton Jul 30 '21 at 16:48
  • @EdMorton Well, I ran the gawk on the sample to be sure. And at least here it produces: col1, "a,a", "b", "c1" – I'll check again – ibuprofen Jul 30 '21 at 16:55
  • @ibuprofen If you're running gawk 4.1.4 then you're probably hitting one of the FPAT bugs in that version. See my answer to the previous question for info on those bugs and workarounds. – Ed Morton Jul 30 '21 at 16:58
  • 1
    @EdMorton Indeed. This fails https://termbin.com/mxkx - but this is OK https://termbin.com/senr , I'll have to look more at that later. Thanks for the heads up. – ibuprofen Jul 30 '21 at 17:05

1 Answer

As mentioned in the comments under your previous question, this has nothing to do with CSVs or your awk script, it's all about how you're saving the output of a command.

$ printf 'a\nb\n\n'
a
b

$ col=$(printf 'a\nb\n\n')
$ printf '%s' "$col"
a
b$

$ col=$(printf 'a\nb\n\n'; printf x)
$ printf '%s' "$col"
a
b

x$
$ col="${col%x}"
$ printf '%s' "$col"
a
b

$

Note that with the above you're getting the whole output of the command saved in the variable, including the final newline that command substitution would have stripped off. If you want to remove a final newline too then do a subsequent:

$ col="${col%$'\n'}"
$ echo "$col"
a
b

$ printf '%s' "$col"
a
b
$

The reason to remove the x and the \n in 2 steps, rather than with a single col="${col%$'\n'x}", is that the single-step version would fail if the command had produced no output, or output that didn't end in a \n, because then \nx wouldn't exist in col:

Right:

$ col=$(printf 'a'; printf x)
$ col="${col%x}"
$ col="${col%$'\n'}"
$ printf '%s' "$col"
a$

Wrong:

$ col=$(printf 'a'; printf x)
$ col="${col%$'\n'x}"
$ printf '%s' "$col"
ax$
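
If the sentinel trick is needed in several places, it could be wrapped in a small function. The following is only a minimal sketch: the names capture_output and captured are hypothetical, and it assumes the command's output never contains NUL bytes (which shell variables cannot hold anyway). Only the x is removed; stripping one final \n remains a separate, optional step as described above.

```shell
#!/bin/sh
# capture_output CMD [ARGS...]
# Hypothetical helper: stores CMD's stdout in the global variable
# "captured", keeping every trailing newline.
capture_output() {
    # The sentinel "x" becomes the last character of the output, so
    # command substitution finds no trailing newlines to strip.
    captured=$("$@"; printf x)
    # Remove only the sentinel; the command's own newlines stay.
    captured=${captured%x}
}

capture_output printf 'a\nb\n\n'
col=$captured
# Brackets make the preserved trailing newlines visible.
printf '[%s]\n' "$col"
```

With this input the closing bracket is printed on a line of its own after an empty line, which shows that both trailing newlines survived. Note that the exit status seen by the caller is that of printf x, not of the wrapped command.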

To learn more about the issue take a look at "Command Substitution" in:

  1. The POSIX standard's Shell Execution Environment section where it says:

The shell shall expand the command substitution by executing command in a subshell environment (see Shell Execution Environment) and replacing the command substitution (the text of command plus the enclosing "$()" or backquotes) with the standard output of the command, removing sequences of one or more <newline> characters at the end of the substitution.

  2. https://mywiki.wooledge.org/CommandSubstitution where it discusses the issue further and provides the workaround I used above.
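
For the end goal mentioned in the comments (a shell array such as arr=("a,a" "b" "")), the approach suggested there avoids the scalar variable entirely, so no trailing newlines get stripped in the first place. Below is a rough sketch of that idea, assuming bash with process substitution. The first_column function is a hypothetical stand-in that just prints what the gawk command is expected to output for the sample file; in real use it would be replaced by the gawk call.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the gawk command: prints the first column
# of the sample file, whose last value is an empty string.
first_column() {
    printf '%s\n' 'col1' 'a,a' 'b' ''
}

arr=()
# IFS= keeps leading/trailing whitespace, -r keeps backslashes literal.
while IFS= read -r line; do
    arr+=("$line")
done < <(first_column)

printf 'count=%d\n' "${#arr[@]}"
# Angle brackets make the empty last element visible.
printf '<%s>\n' "${arr[@]}"
```

This prints count=4 followed by <col1>, <a,a>, <b> and <>, each on its own line, so the empty last value is kept as its own array element.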
Ed Morton
  • 31,617
  • @lilek3 if this answered your question then see https://unix.stackexchange.com/questions/tagged/awk for what to do next. If it didn't please let me know in what way it's lacking. – Ed Morton Aug 01 '21 at 18:02