I have a large number of CSV files in a specific directory. All have at least 41 columns with matching headers, but can be as wide as 200 columns. I need to grab just the first 40 columns of each and append them to create a single CSV with headers. I am relatively new and was trying to follow this example, How do I keep the first 200 lines of all the csv files in a directory using bash?, in combination with that one, Merging contents of multiple .csv files into single .csv file. I am trying to limit it to a one-liner if possible and am thinking I need a combination of the cut and cat commands. I unsuccessfully tried to run something like this:

$ for file in *.csv; do cut -d ',' -f1-40 "$file" > "$file"; done

then

cat *csv > combined.csv

without any luck.

Any advice is greatly appreciated. Thank you.

2 Answers

Instead of trying to overwrite each file and concatenate later, have cut process all the files and write the result into combined.csv directly.

You'll want to make sure combined.csv itself is not included in the list, or you could end up with an infinite loop filling up your filesystem.

(rm -f combined.csv && set ./*.csv && cut -d, -f1-40 "$@" > combined.csv)

Or (assuming GNU xargs or compatible):

(
  rm -f combined.csv &&
    set ./*.csv &&
    printf '%s\0' "$@" |
      xargs -r0 cut -d, -f1-40 > combined.csv
)

Use that variant in case the list of csv files is so big that you get an "argument list too long" error.

You'd need a loop if you wanted to remove the header from all but the first file, but even then, you'd rather redirect the output of the loop than edit each file in place and concatenate later.

(
  rm -f combined.csv && set ./*.csv &&
  {
    cut -d, -f1-40 < "$1"                     # first file: keep its header line
    shift
    for file do
      tail -n +2 < "$file" | cut -d, -f1-40   # remaining files: skip the header
    done
  } > combined.csv
)

In any case, note that using tail and cut like that assumes CSV cells don't contain , or newline characters. To be able to deal with CSVs with arbitrary contents, you'd want to use proper CSV manipulation utilities such as mlr or csvtool, or proper programming languages such as perl or python and their csv modules.
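
For instance, here is a minimal sketch of the Python approach (the 40-column limit, the *.csv glob and the combined.csv name come from the question; everything else is an assumption). The csv module parses quoting, so cells containing commas or newlines survive intact:

#!/usr/bin/env python3
import csv
import glob

OUTPUT = 'combined.csv'

with open(OUTPUT, 'w', newline='') as out:
    writer = csv.writer(out)
    header_written = False
    for path in sorted(glob.glob('*.csv')):
        if path == OUTPUT:                  # don't read our own output
            continue
        with open(path, newline='') as f:
            reader = csv.reader(f)
            header = next(reader, None)     # first row of this file
            if header is None:              # skip empty files
                continue
            if not header_written:          # keep the header only once
                writer.writerow(header[:40])
                header_written = True
            for row in reader:
                writer.writerow(row[:40])   # first 40 columns only

Unlike the tail/cut pipeline above, this needs no special casing for quoted fields, and the header handling is explicit.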

  • Thank you @KamilMaciorowski. Would headers be repeated? – Annabanana Oct 04 '21 at 22:11
  • @Annabanana cut has no concept of a header; it processes lines. All lines will be processed, including those you call headers. …; do tail -n +2 "$file" | cut -d ',' -f1-40; done … will skip the first line from every file, including the first one. – Kamil Maciorowski Oct 04 '21 at 22:15
  • So I would have to also do something like tail -n+2 to get rid of the headers? – Annabanana Oct 04 '21 at 22:22
  • @Annabanana If my previous comment is not enough, consider asking another question (after searching; maybe it's already answered). Don't let this thread become a chameleon, even in the comments. – Kamil Maciorowski Oct 04 '21 at 22:26
  • Fair point. I edited my original question to make it more clear that I want to retain headers (without repeating them.) Thank you. – Annabanana Oct 05 '21 at 01:17

If your system/constraints allow it, consider a CSV-dedicated tool. I like (and now maintain a fork of) GoCSV.

Its select subcommand has a syntax and function very much like cut:

for file in *.csv; do
  gocsv select -c 1-40 "$file" > "processed_$file"
done

and then you can "stack" all the cut-down, processed CSVs together:

gocsv stack processed_*.csv > combined.csv

Because GoCSV is aware of the CSV format and headers, you should get the result you're looking for with only a few lines of shell.

stack also has a --filenames option, which adds the filename to a special grouping column, so you can trace any row back to its original file (and pretty-print with viewmd):

gocsv stack --filenames processed_*.csv | gocsv viewmd
foo  File
1    processed_file1.csv
2    processed_file1.csv
3    processed_file2.csv
4    processed_file2.csv

Zach Young