I have an input file on a Linux machine that contains multiple lines like these:

123, 'John, Nesh', 731, 'ABC, DEV, 23', 6, 400 'Text'
123, 'John, Brown', 140, 'ABC, DEV, 23', 6, 500 'Some other, Text'
123, 'John, Amazing', 1, 'ABC, DEV, 23', 8, 700 'Another, example, Text'

and so on. I want to remove every comma (,) found within a single-quoted field. Expected output:

123, 'John Nesh', 731, 'ABC DEV 23', 6, 400 'Text'
123, 'John Brown', 140, 'ABC DEV 23', 6, 500 'Some other Text'
123, 'John Amazing', 1, 'ABC DEV 23', 8, 700 'Another example Text'
John

2 Answers

bash 5.2 has a new loadable builtin, dsv, for parsing "delimiter-separated" values:

$ echo $BASH_VERSION
5.2.0(2)-release
$ cat input.csv
'123','ABC, DEV 23','345','534.202','NAME'
$ enable dsv
$ dsv -S -p -a fields "$(head -1 input.csv)"
$ declare -p fields
declare -a fields=([0]="'123'" [1]="'ABC, DEV 23'" [2]="'345'" [3]="'534.202'" [4]="'NAME'")
$ fields=( "${fields[@]//,/}" )     # remove commas from all elements
$ (IFS=,; echo "${fields[*]}")
'123','ABC DEV 23','345','534.202','NAME'

The help text for the dsv command:

dsv: dsv [-a ARRAYNAME] [-d DELIMS] [-Sgp] string

Read delimiter-separated fields from STRING.

Parse STRING, a line of delimiter-separated values, into individual fields, and store them into the indexed array ARRAYNAME starting at index 0. The parsing understands and skips over double-quoted strings. If ARRAYNAME is not supplied, "DSV" is the default array name. If the delimiter is a comma, the default, this parses comma-separated values as specified in RFC 4180.

The -d option specifies the delimiter. The delimiter is the first character of the DELIMS argument. Specifying a DELIMS argument that contains more than one character is not supported and will produce unexpected results. The -S option enables shell-like quoting: double-quoted strings can contain backslashes preceding special characters, and the backslash will be removed; and single-quoted strings are processed as the shell would process them. The -g option enables a greedy split: sequences of the delimiter are skipped at the beginning and end of STRING, and consecutive instances of the delimiter in STRING do not generate empty fields. If the -p option is supplied, dsv leaves quote characters as part of the generated field; otherwise they are removed.

The return value is 0 unless an invalid option is supplied or the ARRAYNAME argument is invalid or readonly.

glenn jackman
  • Wow. When did this happen? Also interestingly, the manual states it parses according to https://www.rfc-editor.org/rfc/rfc4180 but that page seems to require double quotes, not single. – terdon Oct 26 '22 at 14:51
  • 5.2 was released... recently. The -S option handles single quotes, which is non-standard CSV. – glenn jackman Oct 26 '22 at 15:03

With perl:

perl -pe "s{'.*?'}{\$& =~ s/,//gr}ge" < your-file

It assumes quoted strings never span several lines, and that there are no escaped ' characters within the '...' quoted strings (though it would still work if they were escaped as '', as is common in CSV files).
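
If perl is not at hand, roughly the same idea can be sketched with awk instead (a different tool, not what the answer above uses). Splitting each line on the ' character means the even-numbered pieces are the ones that sat inside quotes, so commas are removed only there. It relies on the same assumptions: quotes are balanced on every line and quoted strings never span lines. The function name strip_quoted_commas is made up for illustration.

```shell
# Hypothetical helper: reads lines on stdin, writes the result to stdout.
# Assumes every line has balanced single quotes.
strip_quoted_commas() {
    awk -F"'" -v OFS="'" '{
        for (i = 2; i <= NF; i += 2)   # even-numbered pieces were inside quotes
            gsub(/,/, "", $i)          # drop commas only in those pieces
        print
    }'
}
```

It would be used as a filter, for example strip_quoted_commas < your-file.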