34

I have a input file delimited with commas (,). There are some fields enclosed in double quotes that are having a comma in them. Here is the sample row

123,"ABC, DEV 23",345,534.202,NAME

I need to remove all the comma's occuring within inside the double quotes and the double quotes as well. So the above line should get parsed into as shown below

123,ABC DEV 23,345,534.202,NAME

I tried the following using sed but not giving expected results.

sed -e 's/\(".*\),\(".*\)/\1 \2/g'

Any quick tricks with sed, awk or any other unix utility please?

jubilatious1
  • 3,195
  • 8
  • 17
mtk
  • 27,530
  • 35
  • 94
  • 130
  • I'm not sure what you're trying to do, but the utility "csvtool" is far better for parsing csv than generic tools like sed or awk. It's in just about every distro of linux. – figtrap Nov 11 '15 at 21:42

12 Answers12

42

If the quotes are balanced, you will want to remove commas between every other quote, this can be expressed in awk like this:

awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "", $i) } 1' infile

Output:

123,ABC DEV 23,345,534.202,NAME

Explanation

The -F" instructs awk to do field-splitting on double-quote characters, this means that every other field will be the inter-quote text. The for-loop runs gsub, short for globally substitute, on every other field, replacing comma (",") with nothing (""). The 1 at the end invokes the default code-block: { print $0 }.

Thor
  • 17,182
  • 1
    Please can you elaborate on gsub and explain in short, how this one liner works?? please. – mtk Sep 20 '12 at 11:46
  • Thank you! This scripts works really well, but could you explain the lonely 1 at the end of the script? -- } 1' -- – CocoaEv Jun 18 '14 at 23:59
  • @CocoaEv: It executes { print $0 }. I added that to the explanation as well. – Thor Jun 22 '14 at 13:26
  • 2
    this approach has a problem: sometimes the csv has rows that span several lines, such as: prefix,"something,otherthing[newline]something , else[newline]3rdline,and,things",suffix (ie: several lines, and nested "," anywhere within a multi-line double-quoting : the whole "...." part should be rejoined and inside , should be replaced/removed ... ) : your script won't see pairs of double quotes in that case, and it's not really easy to solve (need to "rejoin" lines that are in an "open" (ie, odd-numbered) double quote...+take extra care if there is also an escaped \" inside the string) – Olivier Dulac Sep 22 '16 at 17:42
  • @OlivierDulac: Yes, that is worth keeping in mind. It is worth noting that many CSV parsers do not support embedded newlines. The most robust parser I have tested is the one from here Lorance. – Thor Sep 23 '16 at 00:47
  • 1
    Loved this solution but I tweaked it given I often like to keep the commas but still want to delimit. Instead, I switched the commas outside the quotes to pipes, converting the csv to a psv file: awk -F'"' -v OFS='"' '{ for (I=1; i<=NF; i+=2) gsub(",", "|", $i) } 1' infile – Danton Noriega Oct 09 '18 at 17:25
10

There is a good response, using sed simply one time with a loop:

echo '123,"ABC, DEV 23",345,534,"some more, comma-separated, words",202,NAME'|
  sed ':a;s/^\(\([^"]*,\?\|"[^",]*",\?\)*"[^",]*\),/\1 /;ta'
123,"ABC  DEV 23",345,534,"some more  comma-separated  words",202,NAME

Explanation:

  • :a; is a label for furter branch
  • s/^\(\([^"]*,\?\|"[^",]*",\?\)*"[^",]*\),/\1 / could contain 3 enclosed parts
    • first the 2nd: [^"]*,\?\|"[^",]*",\? match for a string containing no double quote, maybe followed by a coma or a string enclosed by two double quote, without coma and maybe followed by a coma.
    • than the first RE part is composed by as many repetition of previously described part 2, followed by 1 double quote and some caracteres, but no double-quote, nor comas.
    • The first RE part as to be followed by a coma.
    • Nota, the rest of the line don't need to be touched
  • ta will loop to :a if previous s/ command did some change.

Once loop done, you could even add s/ */ /g:

echo '123,"ABC, DEV 23",345,534,"some more, comma-separated, words",202,NAME'|
    sed ':a;s/^\(\([^"]*,\?\|"[^",]*",\?\)*"[^",]*\),/\1 /;ta;s/  */ /g'

will suppress double spaces:

123,"ABC DEV 23",345,534,"some more comma-separated words",202,NAME
  • Works also with nested quotes. Awesome, thanks! – tricasse May 27 '15 at 17:47
  • Tried it on the command itself. :-) echo ':a;s/^\(\([^"]*,\?\|"[^",]*",\?\)*"[^",]*\),/\1 /;ta' | sed ':a;s/^\(\([^"]*,\?\|"[^",]*",\?\)*"[^",]*\),/\1 /;ta' becomes

    :a;s/^\(\([^"]* \?\|"[^" ]*",\?\)*"[^",]*\),/\1 /;ta

    – Anne van Rossum Aug 27 '20 at 08:55
  • echo '1,2,"abc,de,f",10' | sed ':a;s/"\([^",]*\),\(.*\)"/"\1 \2"/;ta' – avocado Sep 14 '21 at 07:30
5

A general solution that can also handle several commas between balanced quotes needs a nested substitution. I implement a solution in perl, which process every line of a given input and only substitute commas in every other pair of quotes:

perl -pe 's/ "  (.+?  [^\\])  "               # find all non escaped 
                                              # quoting pairs
                                              # in a non-greedy way

           / ($ret = $1) =~ (s#,##g);         # remove all commas within quotes
             $ret                             # substitute the substitution :)
           /gex'

or in short

perl -pe 's/"(.+?[^\\])"/($ret = $1) =~ (s#,##g); $ret/ge'

You can either pipe the text you want to process to the command or specify the textfile to be processed as last command line argument.

user1146332
  • 2,234
  • 1
    The [^\\] is going to have the undesired effect of matching the last character inside the quotes and removing it (non \ character), i.e., you should not consume that character. Try (?<!\\) instead. – tojrobinson Sep 20 '12 at 09:13
  • Thanks for your objection, I have corrected that. Nevertheless I think we don't need look behind assertion here, or do we!? – user1146332 Sep 20 '12 at 09:26
  • 1
    Including the non \ in your capture group produces an equivalent result. +1 – tojrobinson Sep 20 '12 at 09:31
  • 1
    +1. after trying a few things with sed, I checked sed's docs and confirmed that it can't apply a replace to just the matching portion of a line...so gave up and tried perl. Ended up with a very similar approach but this version uses [^"]* to make the match non-greedy (i.e. matches everything from one " to the next "): perl -pe 's/"([^"]+)"/($match = $1) =~ (s:,::g);$match;/ge;'. It does not acknowledge the outlandish idea that a quote might be escaped with a backslash :-) – cas Sep 20 '12 at 10:21
  • Thanks for your comment. Would be interesting if either the [^"]* approach or the explicit non-greedy approach consumes less cpu time. – user1146332 Sep 20 '12 at 10:53
  • @CraigSanders: with GNU sed you sort of can with the execute flag and sub-shelling, e.g.: sed -r 's/(.*)"([^"]+)"(.*)/(echo -n "\1"; echo -n "\2" | tr -d ','; echo -n "\3")/e' – Thor Sep 20 '12 at 11:04
  • @Thor: nice. I started playing with /e but couldn't get it to replace multiple commas inside quotes or cope with multiple quote-delimited sections on a line. also, perl is easier to read and understand when sed scripts start getting too complicated. – cas Sep 20 '12 at 11:28
  • @CraigSanders: Agreed, that soon becomes too complicated. I did find a simpler two tier sed alternative which I added to my answer. – Thor Sep 20 '12 at 12:04
  • That solution doesn't work in cases like echo '"", "asd"' | perl -pe 's/"(.+?[^\\])"/($ret = $1) =~ (s#,##g); $ret/ge' – Stéphane Chazelas Nov 22 '12 at 10:44
3

Your second quotes are misplaced:

sed -e 's/\(".*\),\(.*"\)/\1 \2/g'

In addition, using regular expressions tend to match the longest possible part of the text, meaning this will not work if you have more than one quoted field in the string.

A way that handles multiple quoted fields in sed

sed -e 's/\(\"[^",]\+\),\([^",]*\)/\1 \2/g' -e 's/\"//g'

This is also a way to solve this, however, with input that may contain more than one comma per quoted field the first expression in the sed would have to be repeated as many times as the maximum comma content in a single field, or until it does not change the output at all.

Running sed with the more than one expression should be more efficient than several sed processes running and a "tr" all running with open pipes.

However, this may have undesired consequences if the input is not properly formatted. i.e. nested quotes, unterminated quotes.

Using the running example:

echo '123,"ABC, DEV 23",345,534,"some more, comma-separated, words",202,NAME' \
| sed -e 's/\(\"[^",]\+\),\([^",]*\)/\1 \2/g' \
-e 's/\(\"[^",]\+\),\([^",]*\)/\1 \2/g' -e 's/\"//g'

Output:

123,ABC  DEV 23,345,534,some more  comma-separated  words,202,NAME
Didi Kohen
  • 1,841
  • You can make it more general with conditional branching and more readable with ERE, e.g. with GNU sed: sed -r ':r; s/("[^",]+),([^",]*)/\1 \2/g; tr; s/"//g'. – Thor Feb 25 '13 at 12:56
3

I would use a language with a proper CSV parser. For example:

ruby -r csv -ne '
  CSV.parse($_) do |row|
    newrow = CSV::Row.new [], []
    row.each {|field| newrow << field.delete(",")}
    puts newrow.to_csv
  end
' < input_file
glenn jackman
  • 85,964
  • while I liked this solution initially, it turned out to be incredible slow for big files ... – KIC Oct 30 '16 at 19:37
3

Consider csvkit:

< file csvformat -D ^ | tr -d , | tr ^ ,

Explanation

  • -D ^ set the delimiter to a character not present in the file, such as carat
  • tr -d , now that commas are no longer the delimiting character, delete commas
  • tr ^ , convert carat back to comma
  • Although I would have used csvformat to convert back to commas again at the end. This would sort out quoting correctly too. – Kusalananda Nov 03 '21 at 07:03
  • 1
    You'd use csvformat -d '^' to convert the file back to using commas as delimiters. However, what you're doing seems to work, but only because all non-delimiting commas are removed. – Kusalananda Nov 04 '21 at 21:22
2

In perl - you can use Text::CSV to parse this, and do it trivially:

#!/usr/bin/env perl
use strict;
use warnings;

use Text::CSV; 

my $csv = Text::CSV -> new();

while ( my $row = $csv -> getline ( \*STDIN ) ) {
    #remove commas in each field in the row
    $_ =~ s/,//g for @$row;
    #print it - use print and join, rather than csv output because quotes. 
    print join ( ",", @$row ),"\n";
}

You can print with Text::CSV but it tends to preserve quotes if you do. (Although, I'd suggest - rather than stripping quotes for your output, you could just parse using Text::CSV in the first place).

choroba
  • 47,233
Sobrique
  • 4,424
1

I created a function to loop thru every character in the string.
If the character is a quotation then the check (b_in_qt) is marked true.
While b_in_qt is true, all commas are replaced with a space.
b_in_qt is set to false when the next comma is found.

FUNCTION f_replace_c (str_in  VARCHAR2) RETURN VARCHAR2 IS
str_out     varchar2(1000)  := null;
str_chr     varchar2(1)     := null;
b_in_qt     boolean         := false;

BEGIN
    FOR x IN 1..length(str_in) LOOP
      str_chr := substr(str_in,x,1);
      IF str_chr = '"' THEN
        if b_in_qt then
            b_in_qt := false;
        else
            b_in_qt := true;
        end if;
      END IF;
      IF b_in_qt THEN
        if str_chr = ',' then
            str_chr := ' ';
        end if;
      END IF;
    str_out := str_out || str_chr;
    END LOOP;
RETURN str_out;
END;

str_in := f_replace_c ("blue","cat,dog,horse","",yellow,"green")

RESULTS
  "blue","cat dog horse","",yellow,"green"
1

GoCSV makes quick and obvious work of this:

sample.csv:

Col1,Col2,Col3,Col4
01299,"ABC,DEV 23",1,0
26528,"ABC,DEV 03",0,0
38080,"DEF,STG 35",0,0
38081,"GHI,STG 19",0,0

Use its replace command to substitute all , with only in Col2:

gocsv replace -c 'Col2' -regex ',' -repl ' ' sample.csv 

yields:

Col1,Col2,Col3,Col4
01299,ABC DEV 23,1,0
26528,ABC DEV 03,0,0
38080,DEF STG 35,0,0
38081,GHI STG 19,0,0
Zach Young
  • 220
  • 2
  • 5
  • So GoCSV removes internal quotes by default? Not sure I understand the changing of , comma to . dot as the OP seems to prefer no replacement (i.e. deletion of commas internal to paired doublequotes). – jubilatious1 Oct 26 '22 at 19:27
  • 2
    @jubilatious1, thanks for pointing out the discrepancy in my sample, I've updated it to match OP's ask. As for the double quotes, GoCSV (which use's go's csv pkg) will strip unnecessary quote chars. When Col2 had commas as part of the values, we needed the double quotes to protect them from being interpreted as field delimiters; without the commas, we just don't need the double quotes and they're simply not written to the output. – Zach Young Oct 26 '22 at 20:44
  • Thank you for your prompt reply! It's great that you now have the internal commas deleted. However I believe the OP wanted "...to remove ...the double quotes as well." My question whether or not there's a switch that can turn on/off GoCSV's internal quotation setting. You seem to have restored internal quotes in your output, but I don't see a setting to remove them (which is what the OP wanted). Thanks again for your prompt attention. – jubilatious1 Oct 26 '22 at 22:02
  • 2
    @jubilatious1, agh!, geez, copy-paste error. I've replaced it with the actual output from GoCSV, which does the right thing. And, no, there's no switch; go's csv package is quite clear about providing a very minimal API which just does the right thing by the RFC-4180 spec for reading/writing CSV files. Better than my sloppy examples, the doc very clearly outlines the quoting behavior. – Zach Young Oct 26 '22 at 22:10
1

Using the CSV-aware utility Miller (mlr) to remove all commas from the second field of the header-less CSV data in the file called file:

$ cat file
123,"ABC, DEV 23",345,534.202,NAME
$ mlr --csv -N put '$2 = gsub($2,",","")' file
123,ABC DEV 23,345,534.202,NAME

If you want to retain the original quotes:

$ mlr --csv -N --quote-original put '$2 = gsub($2,",","")' file
123,"ABC DEV 23",345,534.202,NAME

If you have named fields (i.e. a header line), then use the name of the field instead of the 2, and drop the -N option:

mlr --csv --quote-original put '$departement = gsub($departement,",","")' file

Would you want to process all fields, then do something like

$ mlr --csv -N --quote-original put 'for (k,v in $*) { is_string(v) { $[k] = gsub(v,",","") } }' file
123,"ABC DEV 23",345,534.202,NAME
Kusalananda
  • 333,661
1

Using Raku (formerly known as Perl_6)

~$ raku -pe 's:g/ \" ~ \" (<-[",]>+) ** 2..* % "," /{$0.join}/;'  file

Raku is a programming language in the Perl-family. Raku's regex notation has been significantly reworked over the approximately fifteen years of Perl6/Raku development, and many people find it easier to learn.

Above we see the familiar -pe autoprinting linewise flags, as in Perl. We also see the familiar s/// substution notation. One point of interest: substitution modifiers are denoted with a : colon and go at the head of the notation , so s:g/// stands for "substitute-global".

Within the recognition half of the s/// notation, we see two important improvements within the Raku Regex "dialect":

  1. First of all, nested structures can be denoted with Raku's new ~ tilde (nested) notation, such that \" ~ \" [.+] means "one-or-more characters surrounded by " double-quotes".

  2. Second of all, repeated structures can be denoted with Raku's new % modified quantifier for repeated structures. The notation [.+] ** 2..* % "," means "one-or-more characters separated by , commas with this pattern repeated ** 2..* two or more times. [If there happens to be a trailing separator (e.g. comma), use a %% instead of % in the syntax].

So, knowing Perl or PCRE syntax, at becomes much easier to read the regex notation above. To enhance this point, the syntax <-[",]> simply denotes a custom character class that does not contain either , comma or " doublequote.

Sample Input:

123,"ABC, DEV 23",345,534.202,NAME

Sample Output:

123,ABC DEV 23,345,534.202,NAME

Example Captures:

~$ raku -ne 'say $<> if m:g/ \" ~ \" (<-[",]>+)** 2..* % "," /;'  file
(「"ABC, DEV 23"」
 0 => 「ABC」
 0 => 「 DEV 23」)

Note: the replacement half uses {$0.join} to concatenate captured values. If you want a space inserted where the comma has been removed, simply use $0 in the replacement. Finally, for anything complex, consider using a CSV-parser like Raku's Text::CSV module. URLs below.

https://docs.raku.org/language/regexes#Tilde_for_nesting_structures
https://docs.raku.org/language/regexes#Modified_quantifier:_%,_%%
https://raku.land/github:Tux/Text::CSV
https://github.com/Tux/CSV
https://raku.org

jubilatious1
  • 3,195
  • 8
  • 17
0

Using Python

''.join([item if index % 2 == 0 else re.sub(',', '', item) for index, item in enumerate(row.split('"')) ])
LoMaPh
  • 505