
I have a comma-separated file which has numeric and string columns. The string columns are quoted and can contain commas between the quotes. How do I split the fields correctly with FS=","?

Sample record:

"prabhat,kumar",19,2000,"bangalore,India"

In awk, the fields should be:

$1 = "prabhat,kumar"
$2 = 19
$3 = 2000
$4 = "bangalore,India"

Setting FS="," splits inside the quoted fields, which is the problem.
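To illustrate the mis-split, here is a small Python sketch (not part of the original question) comparing a naive comma split with a quote-aware CSV parse:

```python
import csv
import io

line = '"prabhat,kumar",19,2000,"bangalore,India"'

naive = line.split(",")                       # splits inside the quoted fields
parsed = next(csv.reader(io.StringIO(line)))  # respects the quotes

print(naive)   # 6 pieces: ['"prabhat', 'kumar"', '19', '2000', '"bangalore', 'India"']
print(parsed)  # 4 fields: ['prabhat,kumar', '19', '2000', 'bangalore,India']
```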

Input is:

"prabhat,kumar",19,2000,"bangalore,India","ABC,DEF","GHI",123,"KLM","NOP,QRS"
"prabhat,kumar",19,2000,"bangalore,India","ABC,DEF","GHI",123,"KLM","NOP,QRS"

Output should be:

"prabhat,kumar"|19|2000|"bangalore,India"|"ABC,DEF"|"GHI"|123|"KLM"|"NOP,QRS"
"prabhat,kumar"|19|2000|"bangalore,India"|"ABC,DEF"|"GHI"|123|"KLM"|"NOP,QRS"

Code I am trying:

awk -F"," '{for(i=1;i<=NF;i++){if(i%NF==0){ORS="\n"} {if($i ~ /^\"/ || $i ~ /\"$/) {a=a OFS $i;j++;{if(j%2==0){sub(/^\,/,X,a); print a;j=0;a=""}}} else {print $i}}} {ORS="|"}}' ORS="|" OFS=, p.txt
    And that is why you should use a proper library that parses CSV (and XML and HTML etc.) rather than trying to write your own. Take a look at CSVfix – HBruijn Aug 11 '14 at 12:04

6 Answers


First of all, you should use a proper CSV parser. For example, in Perl, you can use Text::CSV:

  1. Install cpanm (if you use Perl, you'll thank me later)

    $ sudo apt-get install cpanminus
    

    If you're not on a Debian-based system, you should be able to install it with your distribution's package manager.

  2. Install the Text::CSV module

    $ sudo cpanm Text::CSV
    
  3. Parse your file

    $ perl -MText::CSV -le '
        $csv = Text::CSV->new({binary=>1});
        while ($row = $csv->getline(\*STDIN)) {
            print "1:$row->[0], 2:$row->[1], 3:$row->[2], 4:$row->[3]";
        }' < file.csv
    1:prabhat,kumar, 2:19, 3:2000, 4:bangalore,India
    

    As you can see above, the first field is $row->[0], the second is $row->[1], and so on.
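If Perl isn't available, the same quote-aware parse can be done with Python's standard csv module (a sketch, not part of the original answer; re-quoting the non-numeric fields is my assumption about the desired pipe-separated output):

```python
import csv
import io

data = '"prabhat,kumar",19,2000,"bangalore,India","ABC,DEF","GHI",123,"KLM","NOP,QRS"\n'

for row in csv.reader(io.StringIO(data)):
    # re-quote the text fields, leave the purely numeric ones bare
    print("|".join(f if f.isdigit() else '"%s"' % f for f in row))
    # "prabhat,kumar"|19|2000|"bangalore,India"|"ABC,DEF"|"GHI"|123|"KLM"|"NOP,QRS"
```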


That was the correct way. A simpler but dirtier hack is to replace any quoted commas with another character, then use awk normally and finally switch them back to commas again. I am using ### here, but you can use anything you are sure will never appear in one of your fields.

$ sed -r 's/("[^",]+),([^",]+")/\1###\2/g' file.csv | 
    awk -F, '{print $1,$3}' | sed 's/###/,/g'
"prabhat,kumar" 2000
terdon
  • Your script has a problem if the quoted field doesn't contain a comma. echo '"prabhat kumar",19,2000,"bangalore,India"' | sed -r 's/("[^"]+),([^"]+")/\1###\2/g' prints "prabhat kumar",19###2000,"bangalore,India" – doneal24 Aug 11 '14 at 16:56
  • First field in the above should be "prabhat kumar" - sorry – doneal24 Aug 11 '14 at 17:03
  • I cut the wrong line: echo '"prabhat kumar",19,2000,"bangalore,India"'| sed -r 's/("[^"]+),([^"]+")/\1###\2/g' prints "prabhat kumar",19###2000,"bangalore,India" – doneal24 Aug 11 '14 at 17:04
  • @DougO'Neal yeah, I see the problem, thanks. I don't have time to fix it right now but anyway, as I said, that's just a dirty hack and a proper parser should be used instead. – terdon Aug 11 '14 at 17:07
  • OP asked for awk – Andrew Aug 15 '14 at 19:37
  • @andrew, first of all it is standard practice on this site to offer answers using different tools, especially when the OP is using the wrong tool for the job. Remember that our objective is to provide answers that help future users as well as the OP. In any case, the OP did not actually ask for awk, they just happen to have tried awk first. – terdon Aug 20 '14 at 22:50

If you have GNU awk:

$ awk -v FPAT='[^,]*|"[^"]*"' '{ gsub("^\"|\"$","",$1); gsub("^\"|\"$","",$4); print $1 $4 }' file.csv
prabhat,kumarbangalore,India

The output format is a little ugly as I've only printed $1 and $4 next to each other - I'm sure you can alter it to your taste.

If you need to retain the double quotes around the fields, remove both gsub() calls.

Explanation:

Normally, awk separates the fields in a record by the content of the FS (field separator) variable, which defaults to any whitespace (tab, space and newline). The separator tells awk where one field ends and the next begins. In a CSV file, fields are separated by commas (passed to awk as -v FS=,), but in data like yours this is too simple: a comma can also appear inside a quoted field, so splitting on every comma breaks.

Alternatively, GNU awk's FPAT (field pattern) variable defines what a field looks like. Instead of telling awk where a field ends, you write a pattern that matches a whole field. For the complex CSV of your example, this is [^,]*|"[^"]*".

This breaks down as follows:

  • Characters that are not commas ([^,]) as many times as possible (*). Everything between two commas is a field.
  • Or (|)
  • A single double-quote (") followed by no double-quotes ([^"]) as many times as possible (*) followed by a single double-quote ("). Everything within double-quotes (including a comma) counts as one field.
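FPAT is essentially just a regular expression that matches a whole field, so the same idea can be sketched with Python's re module (an illustration, not part of the original answer; using + instead of * means empty fields are skipped):

```python
import re

line = '"prabhat,kumar",19,2000,"bangalore,India"'

# a field is either a quoted string or a run of non-comma characters
fields = re.findall(r'"[^"]*"|[^,]+', line)
print(fields)  # ['"prabhat,kumar"', '19', '2000', '"bangalore,India"']
```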
garethTheRed

Ruby is handy for CSV parsing:

ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|")' file
prabhat,kumar|19|2000|bangalore,India|ABC,DEF|GHI|123|KLM|NOP,QRS
prabhat,kumar|19|2000|bangalore,India|ABC,DEF|GHI|123|KLM|NOP,QRS

Note that there are no quotes in the output. This is because none of the fields contains the new field separator. If you need quotes, you can forcibly quote all fields (even the integers):

ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|",:force_quotes=>true)' file
"prabhat,kumar"|"19"|"2000"|"bangalore,India"|"ABC,DEF"|"GHI"|"123"|"KLM"|"NOP,QRS"
"prabhat,kumar"|"19"|"2000"|"bangalore,India"|"ABC,DEF"|"GHI"|"123"|"KLM"|"NOP,QRS"
glenn jackman

You can use perl instead:

$ echo '"prabhat,kumar",19,2000,"bangalore,India"' |
  perl -F',(?![[:alpha:]])' -anle 'print "$F[0] $F[3]"'
"prabhat,kumar" "bangalore,India"
cuonglm
    Clever! However, it will fail for cases where there is a non-letter after the comma, for example prabhat, (kumar), or a letter after a non-quoted comma: "prabhat,kumar",A 19, – terdon Aug 11 '14 at 12:53
  • @terdon: Yes, of course, that's by design. Your answer using Text::CSV is better for this job. – cuonglm Aug 11 '14 at 14:06

This worked for me:

$ echo '"prabhat,kumar",19,2000,"bangalore,India"' |
  awk -F, '{print $1,$2,$3,$4,$5,$6}' |
    awk -F\" '{print $2,$3,$4}' | awk -F' ' '{print $1","$2,$3,$4,$5","$6}'
Andrew
    please add an explanation to your answer – Nidal Aug 11 '14 at 13:54
    This will only work if you know the format before hand. What if it is the 3d field that's quoted and not the 1st? – terdon Aug 11 '14 at 14:08
  • OP gave a specific format, with no need or expectation for the format to change: "Last,First",Age,Year,"City,Country". Bangalore is a city in Karnataka, so state is undefined in the data. The first and fourth fields are text, so expect quotes. The second and third fields are integers and would/should never be quoted. Understand the data and avoid over-complicating the problem. – Andrew Aug 15 '14 at 19:34

Probably a qsv (quote-separated values) would be more your speed?

sed 's/.*/,&,/;:t
s/,"/"/;s/",/"/;s/,\([^"]*\),/"\1/;tt
' <<\DATA
"prabhat,kumar",19,2000,"bangalore,India"
DATA

OUTPUT

"prabhat,kumar"19"2000"bangalore,India"
mikeserv