1

I have a txt file:

 ,"Hi, I am Tom",,"16"
"I3","Hi, I am Jerry","Mouse","17"

in which the file is UTF-8. What I'm trying to do is to replace the commas with | except for the ones in quotes "". So the new resulting txt file would be:

|"Hi, I am Tom"||"16"
"I3"|"Hi, I am Jerry"|"Mouse"|"17"

I do not have much knowledge on sed or awk script but i heard it can be used to accomplish this. Could anyone show me how it's done?

Maxxx
  • 821
  • 1
    Do you really have spaces around some of your field-separating commas? If so then it's not a valid CSV per any of the accepted "standards" (e.g. as per RFC 4180 or as can be generated by MS-Excel). If you don't then please fix your posted sample input. See https://stackoverflow.com/q/45420535/1745001 for more info on parsing CSVs with awk. – Ed Morton Mar 07 '20 at 17:24
  • Also - your quoted fields seem to be just free text so can any of your quoted fields contain a | and, if so, what do you want done with that? Can they contain newlines? Can they contain escaped double quotes, e.g. "foo""bar" or "foo\"bar"? – Ed Morton Mar 07 '20 at 17:29
  • 4
    Does this answer your question? change delimiter in a csv file – αғsнιη Mar 07 '20 at 18:50
  • @EdMorton my bad, yes there isn't any spaces. I'll change that in a sec.And no, the quoted fields wouldn't contain a | – Maxxx Mar 08 '20 at 00:35

7 Answers7

4

If you fix the spaces around the commas as Ed commented, then

$ cat text.csv
,"Hi, I am Tom",,"16"
"I3","Hi, I am Jerry","Mouse","17"

$ csvformat -D '|' text.csv
|Hi, I am Tom||16
I3|Hi, I am Jerry|Mouse|17

csvformat is part of csvkit: https://csvkit.readthedocs.io/en/1.0.2/scripts/csvformat.html


other languages you may already have installed come with CSV modules, such as ruby:

$ ruby -rcsv -e 'CSV.foreach(ARGV.shift) {|row| puts CSV.generate_line(row, col_sep: "|")}' text.csv
|Hi, I am Tom||16
I3|Hi, I am Jerry|Mouse|17
glenn jackman
  • 85,964
  • You can ignore the spaces following the delimiter with csvformat's -S option. – Freddy Mar 07 '20 at 19:10
  • 1
    apt-get install csvkit wants to install 34 python packages on my machine -- but the cvs format spec (rfc4180) if 5 pages long, including the 3 or 4 pages of boilerplate ;-) –  Mar 07 '20 at 19:11
2

That's a common CSV problem, see whats-the-most-robust-way-to-efficiently-parse-csv-using-awk.

Just for input such as you showed us and using GNU awk for FPAT after getting rid of the blanks around your field-separator commas and setting RS to \r\n since your input file has DOS line endings:

$ cat -v file
,"Hi, I am Tom",,"16"^M
"I3","Hi, I am Jerry","Mouse","17"^M

$ awk -v RS='\r\n' -v FPAT='[^,]*|"[^"]+"' -v OFS='|' '{NF; $1=$1} 1' file
|"Hi, I am Tom"||"16"
"I3"|"Hi, I am Jerry"|"Mouse"|"17"

The NF; is to work around this current bug in gawk 5.0.1: https://lists.gnu.org/archive/html/bug-gawk/2019-11/msg00003.html

Of course now you need to ask yourself - what should I do if a quoted field contains|s, escaped double quotes ("" or \") or newlines?

Ed Morton
  • 31,617
  • thanks and no the quoted fields wouldn't contain | at all. They are just addresses enclosed in quotes apart from the example I've shown. Could you explain on the "DOS line endings"? – Maxxx Mar 08 '20 at 00:41
  • I'm getting something like: ,"Hi,|I|am|Tom",,"16"|"I3","Hi,|I|am|Jerry","Mouse","17" as the output. Does the ^M in the first command mean anything? – Maxxx Mar 08 '20 at 01:06
  • See https://stackoverflow.com/q/45772525/1745001 for info on DOS line endings. It sounds like you aren't running GNU awk as I mentioned is required for the answer I posted. – Ed Morton Mar 08 '20 at 15:20
2

Expanding on @RudiC's idea:

awk -v RS='"' -v ORS= '{if(NR % 2){gsub(",","|"); print} else print RS $0 RS}' file

| "Hi, I am Tom"||"16" "I3"| "Hi, I am Jerry"|"Mouse"|"17"

This should work when the " is quoted by doubling it, as in """Hi, I am Tom"", said the DOG" (the way it's done in standard CSV), not when it's backslash-escaped. Except for the alternating unquoted and quoted text, this does not care about the format of the file; it doesn't have to be valid CSV.

With GNU gawk (gawk) this could be simplified to

gawk -v RS='"' -v ORS= 'NR % 2 {gsub(",","|")} {print $0 RT}' file

The same thing with perl:

perl -pe 'BEGIN{$/=q/"/} s/,/|/ if $. % 2' file
1

With perl, assuming the quoted parts don't contain newlines nor escaped quotes:

perl -pe 's{(".*?")|,}{$1//"|"}ge' < your-file
0

One approach would be to change the commas within double quotes to some char unused in your text, change all the other commas to the target char, then change the tokens back to commas:

$ awk -F'"' '
        {for (i=2; i<=NF; i+=2) gsub (/,/, "\001", $i)
         gsub (/,/, "|")
         gsub (/\001/, ",")
        }
1
' OFS='"' file
 | "Hi, I am Tom"||"16"
"I3"| "Hi, I am Jerry"|"Mouse"|"17"

You seem to have removed spaces as well in your sample output?

RudiC
  • 8,969
  • 1
    For that approach, rather than picking a string you hope won't be in the record, it's best to use the value of RS as the temporary replacement string since you KNOW it cannot be present in the record. – Ed Morton Mar 07 '20 at 17:22
  • Yes, or \r whose use in not that prominent in *nix... – RudiC Mar 07 '20 at 18:10
  • The default value of awk's SUBSEP variable is another good choice. – glenn jackman Mar 07 '20 at 18:54
  • 2
    @glennjackman no, both \r and SUBSEP can actually occur in a record. Only RS cannot occur in a record. –  Mar 07 '20 at 18:56
  • Case in point the OPs sample input which does contain \rs :-). – Ed Morton Mar 07 '20 at 18:59
  • @RudiC Note that quoted fields are always even numbered. So you don't need this elaborate setup , do the transformation "," ----> "|" only in the odd numbered fields. – Rakesh Sharma Mar 09 '20 at 17:51
0

We may approach the pbm in a simplified manner of building from the ground up the regexes for how the fields look:

Q=\"                    # the quote char 
NQ="[^${Q}]"      # a non-quote char 
nqF="[^${Q},]*"   # a non quoted field
qF="${Q}${NQ}*${Q}"   # a quoted fieldwith no nested quotes 
qFwq= "\(\(${qF}\)\{2\}\)*" # a quoted field with nested quotes 

Posix sed:

$ sed -e " 
     H;s/.*//;x;     # to place a marker at the beginning of pattern space 
     :a
        s/\(\n\)\(${qFwq}\),/\2|\1/
        s/\(\n\)\(${qF}\),/\2|\1/
        s/\(\n\)\(${nqF}\),/\2|\1/
     ta
     s/\n//
" file.csv

Using Perl:

 $ perl -F\" -lane 'local $|;
      $|-- or tr/,/|/ for @F;
      print join q["], @F;
  ' file.csv

Assumes that csv records do not spill over to the next line.

0

Yet another method to deal with this is: using Perl we read in a record (newline separated) and split it on comma.

Not just any comma but a "special" comma, when we stand there and see an even number of double quotes upto the end of record.

Then set the array element separator to pipe "|" :

$ perl \
  -F',(?=(?:(?:[^"]*"){2})*[^"]*$)' \
  -pale '$"=q[|]; $_ = "@F"' file.csv