Keeping columns of csv file based on condition of columns of header line

Question

I wish to keep only certain columns of a CSV file, based on the structure of the header line.

Description of data:

In the header line (fields of country), several fields are empty.
The number of columns per country vary. There can be 3 columns or 10 columns per country.
The number of columns in the header line, which starting from the position where the fields are not empty is dynamic. It can have 2 columns or 100 columns.

The objective is to keep the first field of each country if countries exists on first line. How can I do this using awk please? The example is like this:

input: file.csv

,,,fr,fr,fr,ch,ch,ch
num,nom,date reg,match flag,date1,date2,match flag,date1,date2
0001,AA,2020-05-15,reg1,2019-02-03,2019-02-05,reg2,2019-05-06,2019-06-10
0002,AAA,2020-05-20,,,,reg3,2020-05-06,2020-06-10

Desired output: file1.csv

,,,fr,ch
num,nom,date reg,match flag_fr,match flag_ch
0001,AA,2020-05-15,reg1_fr,reg2_ch
0002,AAA,2020-05-20,,reg3_ch

Thank four your help.

Can any of the header fields be empty after the country columns start? or are all the potentially empty header fields at the start? — steeldriver, Jul 08 '20 at 12:39
No, the headers fields can only be empty before the country columns. There must be at least one non-empty header fields. Thanks — lindo, Jul 08 '20 at 13:13
Can your data fields contain commas? This, for example, is three fields in a CVS file, but with naïve parsing might be four: abc,"def,ghi",jkl — Chris Davies, Jul 09 '20 at 08:23

score 0 · Answer 1 · answered Jul 08 '20 at 13:16

0

You need to iterate over the fields of the first record, conditionally adding their indices to a list; then iterate over the fields of each record and print the fields whose indices are in that list.

Ex.

$ awk -F, '
  BEGIN {OFS=FS}
  NR==1 {
    for(i=1;i<=NF;i++) if($i=="" || !seen[$i]++){inds[i]=1; lasti = i}
  }
  {
    for(i=1;i<=NF;i++) if(inds[i]) printf "%s%s", $i, i==lasti ? ORS : OFS
  }
' file.csv
,,,fr,ch
num,nom,date reg,match flag,match flag
0001,AA,2020-05-15,reg1,reg2
0002,AAA,2020-05-20,,reg3

answered Jul 08 '20 at 13:16

steeldriver

81,074

Thank you for your help. This gives exactly what it takes. Otherwise I have a another problem in the file: I have line feed characters that I can't delete with awk. The only way was to use sed (probably Windows line feed). Data example is the same like this: ,,,fr,fr,fr,ch,ch,ch num,nom,date reg,match flag,date1,date2,match flag,date1,date2 0001,AA,2020-05-15,reg1,2019-02-03,2019-02-05,reg2,2019-05-06,2019-06-10 0002,AAA,2020-05-20,,,,reg3,2020-05-06,2020-06-10 ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,, Do you have a solution? Thank you. – lindo Jul 08 '20 at 18:02
@lindo you can try setting the awk record separator to account for the Windows line endings (awk -vRS='\r\n' ...). Otherwise, use one of the methods described here What is ^M and how do I get rid of it? – steeldriver Jul 08 '20 at 18:05
Hello @steeldriver, Thank you for your response. I'm use: awk -F, ' BEGIN {OFS=FS} NR==1 { for(i=1;i<=NF;i++) if($i=="" || !seen[$i]++){inds[i]=1; lasti = i} } { for(i=1;i<=NF;i++) if(inds[i]) printf "%s%s", $i, i==lasti ? ORS : OFS }'< $file.csv | awk 'BEGIN {OFS=",";FS=","}!/^,+$/{print}'>file1.csv The result is correct. How we can add the code directly in the first awk instead of doing a pipe? . Thanks – lindo Jul 13 '20 at 10:28
@lindo you should be able to apply your !/^,+$/ pattern directly to the main {...} action – steeldriver Jul 13 '20 at 12:56
Ok, I put it directly in th awk code and the result is correctly output: awk -F, ' BEGIN {OFS=FS} NR==3 { for(i=1;i<=NF;i++) if($i=="" || !seen[$i]++){inds[i]=1; lasti = i} } !/^,+$/{ for(i=1;i<=NF;i++) if(inds[i]) printf "%s%s", $i, i==lasti ? ORS : OFS }' <file.csv `. Thank you – lindo Jul 13 '20 at 14:00
Hello @steeldriver, how to transpose the columns to obtain the following result(output), using the same code? a-Input: ,,,fr,ch num,nom,date reg,match flag,match flag 0001,AA,2020-05-15,reg1,reg2 0002,AAA,2020-05-20,,reg3 b-output:num,nom,date reg,match flag,country 0001,AA ,2020-05-15,reg1_fr,fr 0002,AAA,2020-05-20, _fr,fr 0001,AA ,2020-05-15,reg2_ch,ch 0002,AAA,2020-05-20,reg3_ch,ch` Thanks. – lindo Jul 14 '20 at 20:56

Keeping columns of csv file based on condition of columns of header line

1 Answers1

Linked