I have a large number of CSV files in a specific directory. All have at least 41 columns with matching headers, but can be as wide as 200 columns. I need to grab just the first 40 columns of each and append them to create a single CSV with headers. I am relatively new and was trying to follow this example, How do I keep the first 200 lines of all the csv files in a directory using bash?, in combination with that one, Merging contents of multiple .csv files into single .csv file. I am trying to limit it to a one-liner if possible and am thinking I need a combination of the cut and cat commands. I unsuccessfully tried to run something like this:

$ for file in *.csv; do cut -d ',' -f1-40 "$file" > "$file"; done

then

cat *csv > combined.csv

without any luck.

Any advice is greatly appreciated. Thank you.

2 Answers

Instead of trying to overwrite each file and concatenate later, have cut process all the files and write the result into combined.csv directly.

You'll want to make sure combined.csv itself is not included in the list, or you could end up with an infinite loop filling up your filesystem.

(rm -f combined.csv && set ./*.csv && cut -d, -f1-40 "$@" > combined.csv)

Or (assuming GNU xargs or compatible):

(
  rm -f combined.csv &&
    set ./*.csv &&
    printf '%s\0' "$@" |
      xargs -r0 cut -d, -f1-40 > combined.csv
)

Use that variant in case the list of csv files is so big that you get an "argument list too long" error.

You'd need a loop if you wanted to remove the header from all but the first file, but even then, you'd rather redirect the output of the loop than edit each file in place and concatenate later.

(
  rm -f combined.csv && set ./*.csv &&
  {
    cut -d, -f1-40 < "$1"                     # first file: keep its header line
    shift
    for file do
      tail -n +2 < "$file" | cut -d, -f1-40   # remaining files: skip the header
    done
  } > combined.csv
)

In any case, note that using tail and cut like that assumes CSV cells don't contain , or newline characters. To be able to deal with CSVs with arbitrary contents, you'd want to use proper CSV manipulation utilities such as mlr or csvtool, or proper programming languages such as perl or python and their csv modules.
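
For instance, here is a minimal sketch of the Python approach (the 40-column limit, the *.csv glob and the combined.csv name come from the question; everything else is an assumption). The csv module parses quoting, so cells containing commas or newlines survive intact:

#!/usr/bin/env python3
import csv
import glob

OUTPUT = 'combined.csv'

with open(OUTPUT, 'w', newline='') as out:
    writer = csv.writer(out)
    header_written = False
    for path in sorted(glob.glob('*.csv')):
        if path == OUTPUT:                  # don't read our own output
            continue
        with open(path, newline='') as f:
            reader = csv.reader(f)
            header = next(reader, None)     # first row of this file
            if header is None:              # skip empty files
                continue
            if not header_written:          # keep the header only once
                writer.writerow(header[:40])
                header_written = True
            for row in reader:
                writer.writerow(row[:40])   # first 40 columns only

Unlike the tail/cut pipeline above, this needs no special casing for quoted fields, and the header handling is explicit.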

  • Thank you @KamilMaciorowski. Would headers be repeated? – Annabanana Oct 04 '21 at 22:11
  • @Annabanana cut has no concept of a header; it processes lines. All lines will be processed, including those you call headers. …; do tail -n +2 "$file" | cut -d ',' -f1-40; done … will skip the first line from every file, including the first one. – Kamil Maciorowski Oct 04 '21 at 22:15
  • So I would have to also do something like tail -n+2 to get rid of the headers? – Annabanana Oct 04 '21 at 22:22
  • @Annabanana If my previous comment is not enough, consider asking another question (after searching; maybe it's already answered). Don't let this thread become a chameleon, even in the comments. – Kamil Maciorowski Oct 04 '21 at 22:26
  • Fair point. I edited my original question to make it more clear that I want to retain headers (without repeating them.) Thank you. – Annabanana Oct 05 '21 at 01:17

If your system/constraints allow it, consider a CSV-dedicated tool. I like (and now maintain a fork of) GoCSV.

Its select subcommand has a syntax and function very much like cut:

for file in *.csv; do
  gocsv select -c 1-40 "$file" > "processed_$file"
done

and then you can "stack" all the cut-down, processed CSVs together:

gocsv stack processed_*.csv > combined.csv

Because GoCSV is aware of the CSV format and headers, you should get the result you're looking for with only a few lines of shell.

stack also has a --filenames option, which adds the filename to a special grouping column, so you can trace any row back to its original file (and pretty-print with viewmd):

gocsv stack --filenames processed_*.csv | gocsv viewmd
foo  File
1    processed_file1.csv
2    processed_file1.csv
3    processed_file2.csv
4    processed_file2.csv

Zach Young