0

I wish to keep only certain columns of a CSV file, based on the structure of the header line.

Description of data:

  1. In the header line (fields of country), several fields are empty
  2. The number of columns per country vary. There can be 3 columns or 10 columns per country.
  3. The number of columns in the header line, which starting from the position where the fields are not empty is dynamic. It can have 2 columns or 100 columns.

The objective is to keep the first field of each country if countries exists on first line. How can I do this using awk please? The example is like this:

  • input: file.csv
    ,,,fr,fr,fr,ch,ch,ch
    num,nom,date reg,match flag,date1,date2,match flag,date1,date2
    0001,AA,2020-05-15,reg1,2019-02-03,2019-02-05,reg2,2019-05-06,2019-06-10
    0002,AAA,2020-05-20,,,,reg3,2020-05-06,2020-06-10
    
  • Desired output: file1.csv
    ,,,fr,ch
    num,nom,date reg,match flag_fr,match flag_ch
    0001,AA,2020-05-15,reg1_fr,reg2_ch
    0002,AAA,2020-05-20,,reg3_ch
    

Thank four your help.

terdon
  • 242,166
lindo
  • 3
  • Can any of the header fields be empty after the country columns start? or are all the potentially empty header fields at the start? – steeldriver Jul 08 '20 at 12:39
  • No, the headers fields can only be empty before the country columns. There must be at least one non-empty header fields. Thanks – lindo Jul 08 '20 at 13:13
  • Can your data fields contain commas? This, for example, is three fields in a CVS file, but with naïve parsing might be four: abc,"def,ghi",jkl – Chris Davies Jul 09 '20 at 08:23

1 Answers1

0

You need to iterate over the fields of the first record, conditionally adding their indices to a list; then iterate over the fields of each record and print the fields whose indices are in that list.

Ex.

$ awk -F, '
  BEGIN {OFS=FS}
  NR==1 {
    for(i=1;i<=NF;i++) if($i=="" || !seen[$i]++){inds[i]=1; lasti = i}
  }
  {
    for(i=1;i<=NF;i++) if(inds[i]) printf "%s%s", $i, i==lasti ? ORS : OFS
  }
' file.csv
,,,fr,ch
num,nom,date reg,match flag,match flag
0001,AA,2020-05-15,reg1,reg2
0002,AAA,2020-05-20,,reg3
steeldriver
  • 81,074
  • Thank you for your help. This gives exactly what it takes. Otherwise I have a another problem in the file: I have line feed characters that I can't delete with awk. The only way was to use sed (probably Windows line feed). Data example is the same like this: ,,,fr,fr,fr,ch,ch,ch num,nom,date reg,match flag,date1,date2,match flag,date1,date2 0001,AA,2020-05-15,reg1,2019-02-03,2019-02-05,reg2,2019-05-06,2019-06-10 0002,AAA,2020-05-20,,,,reg3,2020-05-06,2020-06-10 ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,, Do you have a solution? Thank you. – lindo Jul 08 '20 at 18:02
  • @lindo you can try setting the awk record separator to account for the Windows line endings (awk -vRS='\r\n' ...). Otherwise, use one of the methods described here What is ^M and how do I get rid of it? – steeldriver Jul 08 '20 at 18:05
  • Hello @steeldriver, Thank you for your response. I'm use: awk -F, ' BEGIN {OFS=FS} NR==1 { for(i=1;i<=NF;i++) if($i=="" || !seen[$i]++){inds[i]=1; lasti = i} } { for(i=1;i<=NF;i++) if(inds[i]) printf "%s%s", $i, i==lasti ? ORS : OFS }'< $file.csv | awk 'BEGIN {OFS=",";FS=","}!/^,+$/{print}'>file1.csv The result is correct. How we can add the code directly in the first awk instead of doing a pipe? . Thanks – lindo Jul 13 '20 at 10:28
  • @lindo you should be able to apply your !/^,+$/ pattern directly to the main {...} action – steeldriver Jul 13 '20 at 12:56
  • Ok, I put it directly in th awk code and the result is correctly output: awk -F, ' BEGIN {OFS=FS} NR==3 { for(i=1;i<=NF;i++) if($i=="" || !seen[$i]++){inds[i]=1; lasti = i} } !/^,+$/{ for(i=1;i<=NF;i++) if(inds[i]) printf "%s%s", $i, i==lasti ? ORS : OFS }' <file.csv `. Thank you – lindo Jul 13 '20 at 14:00
  • Hello @steeldriver, how to transpose the columns to obtain the following result(output), using the same code? a-Input: ,,,fr,ch num,nom,date reg,match flag,match flag 0001,AA,2020-05-15,reg1,reg2 0002,AAA,2020-05-20,,reg3 b-output:num,nom,date reg,match flag,country 0001,AA ,2020-05-15,reg1_fr,fr 0002,AAA,2020-05-20, _fr,fr 0001,AA ,2020-05-15,reg2_ch,ch 0002,AAA,2020-05-20,reg3_ch,ch` Thanks. – lindo Jul 14 '20 at 20:56