How to concatenate a variable number of csv, removing their header rows?

Question

I have a directory with several hundred csv files whose filenames start with two digits {01..84}. Several hundred >> 84, so obviously some filenames start with the same prefix. I wish to concatenate the files whose filenames start with the same prefix. Here's what I've got:

#!/bin/bash
for i in {01..84}; do
        #declare array to store files with same prefix
        declare -a files=()
        echo "Processing $i"
        for j in `ls $i*.csv`; do
                #add files with same prefix to array
                files=("${files[@]}" "$j")
        done    
        #cat first file including header with the rest of the files without the headers 
        cat < ${files[@]:0:1} <(tail -n+2 ${files[@]:1}) > "$i".csv
done

So far so good ... only, it grinds to a halt at $i=22 halfway through (repeatable error), and pollutes the output files with blank lines and headers like "==> 19XXX.csv <==" (without quotes).

What should I change in the code to just get a nice clean csv file for each prefix without the script crashing?
Are there any precompiled bash utilities that I can call to do any of this quicker and easier?

The script fails if there is no (or possibly only one) input file for a prefix. — jofel, Jan 12 '15 at 11:08
Also, the tail command works differently than you expected. If there are multiple files, "==> ..." headers are included in its output by default. Use the -q option to suppress them. — jofel, Jan 12 '15 at 11:15
@Escher including an answer in your question is not how this site works. Accept one of the answers (after suggesting changes if necessary), ask Stephane to provide his comment as an answer and accept that, or provide your own code as an answer and accept that. — Anthon, Feb 11 '15 at 06:47

score 3 · Answer 1 · answered Jan 12 '15 at 11:16

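#!/bin/bash
for i in {01..84}; do
    # Force base 10 so that 08 and 09 are not parsed as octal, then zero-pad
    x=$(printf '%02d' "$((10#$i))")
    # Set the positional parameters to the files with this prefix
    set -- "$x"?*.csv
    if [ -f "$1" ]; then
        # The first file is copied whole, header included
        cp "$1" "$x.csv"
        shift
        if [ -f "$1" ]; then
            # Append the remaining files without their header lines
            tail -q -n +2 "$@" >> "$x.csv"
        fi
    fi
done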

For each prefix, it sets the list of files with that prefix as the arguments so that you can use $1 to access the first, etc.

If $1 is a file (to catch the case where there are no files with the given prefix), copy that file to prefix.csv. Then check whether there was more than one file with that prefix by shifting off the first file and checking that the next one is also a file. If so, skip the header line of each remaining file via the tail command and append the result to prefix.csv.
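For anyone unfamiliar with set -- and shift, here is a minimal sketch of the pattern in isolation. The scratch directory and the file names 01a.csv and 01b.csv are made up for illustration.

```shell
#!/bin/bash
# Scratch directory with two made-up files sharing the prefix 01
d=$(mktemp -d)
printf 'id,val\n1,a\n' > "$d/01a.csv"
printf 'id,val\n2,b\n' > "$d/01b.csv"

# The glob results become the positional parameters $1, $2, ...
set -- "$d"/01?*.csv
total=$#
first=${1##*/}            # base name of the first file

# shift discards $1, so "$@" now holds only the remaining files
shift
remaining=$#
second=${1##*/}

echo "total=$total first=$first remaining=$remaining second=$second"
rm -r "$d"
```

After the shift, "$@" is exactly the list of files whose headers still need stripping.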

The -q option tells tail to suppress the header line that tail itself adds when more than one file is passed on the argument list; this is where your ==> 19XXX.csv <== lines were coming from.
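The difference is easy to see on two small files. This is a sketch assuming a tail that supports -q (GNU coreutils does); the file names and contents are invented.

```shell
#!/bin/bash
# Scratch directory with two made-up files
d=$(mktemp -d)
printf 'id,val\n1,a\n' > "$d/19a.csv"
printf 'id,val\n2,b\n' > "$d/19b.csv"

# Without -q, tail prints a "==> name <==" banner before each file
# when it is given more than one file
with_banner=$(tail -n +2 "$d/19a.csv" "$d/19b.csv")

# With -q the banners are suppressed and only the data rows remain
quiet=$(tail -q -n +2 "$d/19a.csv" "$d/19b.csv")

printf '%s\n' "$quiet"
rm -r "$d"
```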

Adding -q is probably the only change your solution needs, but I find it overly complicated: it requires bash to buffer the output of the tail command, etc., which may be why the script stops (crashes?) prematurely.

EDIT: added the printf step because in bash versions before 4.0 {01..84} expands to 1 2 3 ... without leading zeroes.
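Since the padding behaviour of {01..84} depends on the bash version, another option is to count with plain integers and pad with printf. This is only a sketch of that idea, not part of the script above; the variable name prefixes is my own.

```shell
#!/bin/bash
# Build the prefixes 01..84 without relying on brace-expansion padding
prefixes=""
n=1
while [ "$n" -le 84 ]; do
    # n is a plain decimal integer, so %02d pads it without octal surprises
    prefixes="$prefixes $(printf '%02d' "$n")"
    n=$((n + 1))
done

for p in $prefixes; do
    echo "would process files matching $p?*.csv"
done
```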

You're right about the -q; thanks. $> time yoursolution.sh = real 2.167s and $> time mine.sh = real 1.400s. I was surprised because your code doesn't have a scripted O(n^2) loop. — Escher, Jan 12 '15 at 11:35
Make sure to run such tests at least 3 times and to ignore the first couple of measurements; buffer cache is very influential on real times in such cases. — wurtel, Jan 12 '15 at 11:45

score 1 · Answer 2 · alexises · 2015-01-14T19:12:58.420

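#!/bin/bash
# Brace expansion is a bash feature, hence bash rather than sh
for i in {01..84}
do
  # Concatenate every file with this prefix; each file's header row is kept
  cat $i*.csv > $i.csv-concat
  # Remove the source files, then give the result its final name
  rm $i*.csv
  mv $i.csv-concat $i.csv
done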

Don't forget cat: it is a concatenation tool. tail can also do the job and remove the headers:

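#!/bin/bash
# pushd and popd are bash builtins; [workdir] is a placeholder for the data directory
pushd [workdir]
for i in {01..84}
do
  # tail -n +2 drops the first line of every file, so no header row is left at all
  echo $i*.csv | xargs -n 1 tail -n +2 > $i.csv-concat
  rm $i*.csv
  mv $i.csv-concat $i.csv
done
popd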

edited Jan 14 '15 at 19:12

answered Jan 12 '15 at 16:30


Doesn't do the required job; the middle of the output file will have header rows from the constituent files. – Escher Jan 13 '15 at 06:40
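One way to keep the cat/tail idea while addressing this is to emit the header once and then the data rows of every file. This is a rough sketch with made-up files, assuming all files share the same header and that tail supports -q.

```shell
#!/bin/bash
# Scratch directory with two made-up files sharing the prefix 01
d=$(mktemp -d)
printf 'id,val\n1,a\n' > "$d/01a.csv"
printf 'id,val\n2,b\n' > "$d/01b.csv"

set -- "$d"/01?*.csv
# Header of the first file once, then every file minus its first line
{ head -n 1 "$1"; tail -q -n +2 "$@"; } > "$d/01.csv"

result=$(cat "$d/01.csv")
printf '%s\n' "$result"
rm -r "$d"
```

The pattern 01?*.csv needs at least one character after the prefix, so the output file 01.csv is never picked up as an input.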

score 0 · Accepted Answer · answered Feb 12 '15 at 23:51

Working solution, based on wurtel's answer, for anyone who just came here to copy and paste:

#!/bin/bash
# Let a pattern with no matches expand to nothing instead of itself
shopt -s nullglob
for i in {01..84}; do
    # Collect the files with this prefix; ?* keeps the output file $i.csv out of the list
    files=("$i"?*.csv)
    # Skip prefixes that have no input files
    [ ${#files[@]} -eq 0 ] && continue
    echo "Processing $i"
    # cat first file including header with the rest of the files without the headers
    if [ ${#files[@]} -gt 1 ]; then
        cat "${files[0]}" <(tail -q -n +2 "${files[@]:1}") > "$i.csv"
    else
        cat "${files[0]}" > "$i.csv"
    fi
done

Stéphane Chazelas' way using awk. Much cleaner.

#!/bin/bash
for i in {01..84}; do
        echo "processing $i"
        awk 'NR==FNR||FNR>1' $i?*.csv >> "$i".csv
done
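
For readers wondering how the awk condition works: NR counts lines across all input files, while FNR restarts at 1 for each file. NR==FNR is therefore true only while the first file is being read (header included), and FNR>1 skips the first line of every later file. Here is a small sketch with made-up files:

```shell
#!/bin/bash
# Scratch directory with three made-up files sharing the prefix 01
d=$(mktemp -d)
printf 'id,val\n1,a\n' > "$d/01a.csv"
printf 'id,val\n2,b\n' > "$d/01b.csv"
printf 'id,val\n3,c\n' > "$d/01c.csv"

# Print every line of the first file, and all but line 1 of the others
merged=$(awk 'NR==FNR || FNR>1' "$d"/01?*.csv)

printf '%s\n' "$merged"
rm -r "$d"
```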
