I want to write a script that merges the contents of several .csv files into one .csv file, i.e. appends the columns of all the other files to the columns of the first file. I tried doing this with a "for" loop but could not get it to work.

Does anyone know how to do this in Linux?

rmb

9 Answers

65

The simplest approach would be the following command:

cat *csv > combined.csv

This file would contain the contents of all your csv files just in the way you mentioned.

zsha
    Wouldn't this copy the rows in a common file instead of the columns? – fschmitt Jul 04 '16 at 13:53
  • @fschmitt Copying the rows is appending to the columns, right? – Kusalananda Jul 04 '16 at 13:55
  • @Kusalananda no, it's just concatenating the files. – cas Jul 04 '16 at 14:00
  • @cas Matter of interpretation. If I have a row and append values to it, they go on the same row. If I have a column and append values to it, they go in the same column. If I want to append the columns to columns, do they go as new columns, or are the appended to the end of the existing ones? – Kusalananda Jul 04 '16 at 14:19
  • In this particular case, what the OP wants is: for each individual line of all input files, append the fields to make one long line with all the columns, and then repeat the process for all subsequent lines. This is hard to describe unambiguously in plain English - the algorithm in my perl script (or the more memory-efficient algorithm described after it) explains it better and is easier to follow. – cas Jul 04 '16 at 14:34
  • This answer will duplicate the headers. Use head -n 1 file1.csv > combined.out && tail -n +2 -q *.csv >> combined.out, where file1.csv is any of the files you want merged. This will merge all the CSVs into one like this answer, but with only one set of headers at the top. It assumes that all CSVs share headers. The output is called combined.out to prevent the statements from conflicting. – hLk Oct 12 '19 at 01:00
  • Even if you want to concatenate two csv files by rows and not columns, this is also incorrect if both files have headers. This will just copy the two files together and duplicate the headers. – Dave Nov 23 '22 at 18:48
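The single-header variant suggested in the comments can be sketched as follows; the file names here are hypothetical samples:

```shell
# Create two hypothetical sample files in a scratch directory
cd "$(mktemp -d)"
printf 'id,name\n1,alice\n' > file1.csv
printf 'id,name\n2,bob\n'   > file2.csv

# Take the header from one file, then append every file's body
# (line 2 onward); tail's -q suppresses the "==> file <==" headers
head -n 1 file1.csv > combined.out
tail -n +2 -q *.csv >> combined.out

cat combined.out
# id,name
# 1,alice
# 2,bob
```

Because the output is named combined.out rather than something ending in csv, rerunning the command will not feed the previous result back into itself.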
31
awk '(NR == 1) || (FNR > 1)' *.csv > 1000Plus5years_companies_data.csv
Paulo Tomé
    This would be a useful answer if you described what the code is doing and what expectations you have on the input data. – Kusalananda Dec 26 '19 at 07:31
  • I found this useful as well, so I'll expand...

    Both NR and FNR represent the number of the line being processed (1-based index). FNR is the current line within each File, while NR is the current total Number of Records across all files.

    So (NR == 1) includes the first line of the first file (the header), while (FNR > 1) skips the first line of each subsequent file.

    – Nick Feb 06 '20 at 19:00
  • That doesn't work. It just concatenates one csv below the other, not columns side by side. – Paulo Sergio Schlogl Dec 18 '20 at 15:24
  • If on macOS awk is different, do a brew install gawk then do the above with gawk .... Maybe this was your issue @PauloSergioSchlogl? – Josh Hibschman Mar 15 '23 at 16:12
  • Paulo's issue is that he expected a different result. He wanted columns added as opposed to rows added. – trebor Apr 13 '23 at 13:23
  • This works well for me since I have to skip 2 lines (a # comment and a CSV header) and can use awk '(NR <= 2) || (FNR > 2)' *.csv > merged.csv to achieve that. csvstack did not handle the comment (granted, it's non-standard). – Lucas May 22 '23 at 13:52
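Putting the comments above together, a minimal demonstration of the idiom, using two hypothetical sample files:

```shell
# Two hypothetical sample files sharing a header line
cd "$(mktemp -d)"
printf 'id,name\n1,alice\n' > a.csv
printf 'id,name\n2,bob\n'   > b.csv

# NR == 1 keeps only the very first line seen (the global header);
# FNR > 1 keeps the non-header lines of every file
awk '(NR == 1) || (FNR > 1)' a.csv b.csv > merged.csv

cat merged.csv
# id,name
# 1,alice
# 2,bob
```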
20

Use paste:

paste -d ',' file1.csv file2.csv ... fileN.csv
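For example, with two small hypothetical files, paste joins each pair of lines column-wise:

```shell
# Two hypothetical sample files with the same number of lines
cd "$(mktemp -d)"
printf 'a,b\n1,2\n' > left.csv
printf 'c,d\n3,4\n' > right.csv

# Join line N of every file into one output line, separated by commas
paste -d ',' left.csv right.csv > wide.csv

cat wide.csv
# a,b,c,d
# 1,2,3,4
```

All files should have the same number of lines for the output to line up.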
Chris
17

Use csvstack from csvkit:

csvstack *.csv > out.csv
Dan F
  • This is the only solution that deals with the non-obvious cases, even though it requires a dedicated tool. Thanks! Exactly what I was looking for. – Tomáš M. Aug 31 '21 at 16:57
  • csvstack works great combined with {00..99} bash globbing to merge multi part SPARK CSV files csvstack part-{00..99}-abcdef.csv > out.csv – M.Viking Sep 17 '21 at 13:05
  • This does what I was looking for: combine all the rows of multiple csv files which have the same header (which only is output once). That is vs the clarified original post which wants columns merged. – nealmcb Mar 01 '22 at 01:23
  • Note that to get this I had to pip3 install csvkit. I tested this command based on what Dan F shared, and it only takes rows and appends them, not columns. This did not work for me. Maybe the command needs to be adjusted? – Dave Nov 16 '22 at 16:17
4

Here's a perl script that reads in each line of each file specified on the command line and appends it to elements in the array (@csv). When there's no more input, it prints out each element of @csv.

The .csv files will be appended in the order that they are listed on the command line.

WARNING: This script assumes that all input files have the same number of lines. Output will likely be unusable if any file has a different number of lines from any of the others.

#!/usr/bin/perl

use strict;

my @csv=();

foreach (@ARGV) {
  my $linenum=0;

  open(F,"<",$_) or die "couldn't open $_ for read: $!\n";

  while (<F>) {
    chomp;
    $csv[$linenum++] .= "," . $_;
  };

  close(F);
};

foreach (@csv) {
  s/^,//;   # strip leading comma from line
  print $_,"\n";
};

Given the following input files:

==> 1.csv <==
1,2,3,4
1,2,3,4
1,2,3,4
1,2,3,4

==> 2.csv <==
5,6,7,8
5,6,7,8
5,6,7,8
5,6,7,8

==> 3.csv <==
9,10,11,12
9,10,11,12
9,10,11,12
9,10,11,12

it will produce the following output:

$ ./mergecsv.pl *.csv
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12

OK, now that you've read this far it's time to admit that this doesn't do anything that paste -d, *.csv doesn't also do. So why bother with perl? paste is quite inflexible. If your data is exactly right for what paste does, you're good - it's perfect for the job and very fast. If not, it's completely useless to you.

There are any number of ways a perl script like this could be improved (e.g. handling files of different lengths by counting the number of fields for each file and adding the correct number of empty fields to @csv for each of the files which are missing lines, or at least detecting different lengths and exiting with an error), but this is a reasonable starting point if more sophisticated merging is required.

BTW, this uses a really simple algorithm and stores the entire contents of all input files in memory (in @csv) at once. For files up to a few MB each on a modern system, that's not unreasonable. If, however, you are processing HUGE .csv files, a better algorithm would be to:

  • open all the input files and, while there's still input to read:
    • read a line from each file
    • append the lines (in @ARGV order)
    • print the appended line
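As a rough sketch of that streaming algorithm in shell, for exactly two hypothetical input files (a real implementation would loop over N file descriptors):

```shell
cd "$(mktemp -d)"
printf '1,2\n1,2\n' > 1.csv
printf '3,4\n3,4\n' > 2.csv

# Open each input file on its own file descriptor
exec 3< 1.csv 4< 2.csv

# While both files still have input: read one line from each,
# append them in order, and print the combined line immediately,
# so only one line per file is ever held in memory
while IFS= read -r a <&3 && IFS= read -r b <&4; do
  printf '%s,%s\n' "$a" "$b"
done > merged.csv

# Close the descriptors
exec 3<&- 4<&-

cat merged.csv
# 1,2,3,4
# 1,2,3,4
```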
cas
0

The command below takes all *.csv files, performs a unique sort to remove duplicates (headers), and creates a new file containing all the requisite data:

cat users/*.csv | sort -u > users.csv

Prerequisites

  1. Unique data
  2. Every .csv file must end with a line break
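A quick illustration with hypothetical sample files; note that sort -u also reorders lines and removes any duplicate data rows, not just repeated headers:

```shell
cd "$(mktemp -d)"
mkdir users
printf 'id,name\n1,alice\n' > users/a.csv
printf 'id,name\n2,bob\n'   > users/b.csv

# The repeated header survives only once; line order becomes sorted
cat users/*.csv | sort -u > users.csv
```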
AdminBee
  • Welcome to the site. I'm afraid your solution is not what the OP wants. Your approach concatenates the files line-wise, whereas the OP wants to concatenate them colums-wise (i.e. place the entire file 2 "to the right" of file 1, and so on) as indicated by the accepted answer. Also, your approach would remove not only headers, but all duplicates, including those which may be intentional. – AdminBee Aug 11 '20 at 11:40
  • Yes, I've mentioned as a prerequisite that it must be unique data only. – Rohit Salecha Aug 11 '20 at 11:41
-1

Assuming all the CSVs have a header and you only want it to appear once, here's a solution. Make a bash script:

#! /usr/bin/env bash
regex_pattern=$1
output_file=$2
first_file=`ls $regex_pattern | awk '{if(NR == 1) print $0}'`
# get header from first line of file
read -r first_line < $first_file
# print header 
echo $first_line > $output_file
# cat all files. Use grep to remove the headers. append to output file
cat $regex_pattern | grep -v $first_line >> $output_file 
# delete old csvs
rm $regex_pattern

Then run bash csv_combiner.sh "path/to/dir/data.*.csv" combine_data.csv.

  • Note that since you rely on the shell to do filename globbing on a pattern (not regular expression, by the way) provided by the user, and since you consequently avoid quoting variable expansions, your script will not work for filenames containing spaces, or files in subdirectories whose names contain spaces. Given the pattern /dir/my * files/*.csv, you effectively delete all files in the current directory. It's also easier to skip the first line of a file using tail -n +2 or sed 1d. – Kusalananda Aug 03 '22 at 19:49
  • Note also that it's unclear how to interpret this particular question. You seem to interpret is as "adding rows", while the actual question says "append columns". – Kusalananda Aug 03 '22 at 19:50
  • Sigh. (1) As Kusalananda says, what you call regex_pattern is *not* a regex.  Please learn the vocabulary.  (2) Please learn when to quote.  (You should always quote shell variables unless you have a good reason not to, and you're sure you know what you're doing.)  (3) Please use $(…) instead of backticks.  (4) It's better just to write your output to stdout, so the user can send it to a pipe.  Don't make the user specify an output file on the command line.  (5) Do a search for "UUOC".  … (Cont'd) – G-Man Says 'Reinstate Monica' Aug 04 '22 at 01:40
  • (Cont’d) …  (6) Your code is pushing every line of data through grep.  This is potentially using the CPU excessively (unnecessarily) and, in some corner cases, may give wrong results.  For the problem that you’re solving (which may or may not be what the question is actually asking), counting lines makes *much* more sense.  (7) This may be difficult, but try to figure out when to include -- in your commands. … (Cont’d) – G-Man Says 'Reinstate Monica' Aug 04 '22 at 01:40
  • (Cont’d) … (8) Give your scripts simple names; do not use extensions like .sh.  (9) If you’re putting a shebang in your script, you don’t need to explicitly invoke the shell on the command line.  (10) I don’t see anything in your script that’s bash-specific; you might as well just say /bin/sh. – G-Man Says 'Reinstate Monica' Aug 04 '22 at 01:40
-2

If all the CSV files contain the same columns, this simple command can collate a CSV summary from multiple CSV files:

cat *csv | awk '!a[$0]++' > summary.csv
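A quick demonstration on hypothetical sample files (the input names are passed to awk explicitly, which also sidesteps the output file being matched by the *csv glob on a rerun):

```shell
cd "$(mktemp -d)"
printf 'id,name\n1,alice\n' > a.csv
printf 'id,name\n2,bob\n'   > b.csv

# a[$0]++ counts how often each exact line has been seen; the pattern
# is true only on the first occurrence, so duplicate lines (like the
# repeated header) are dropped while the original order is kept
awk '!a[$0]++' a.csv b.csv > summary.csv

cat summary.csv
# id,name
# 1,alice
# 2,bob
```

Unlike sort -u, this keeps the first occurrence of each line in its original position.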
-2

I wrote this simple file-merge script with which you can merge CSV files into a single CSV file (row-wise, not column-wise, though).

Once the repository is cloned or file-merge-script.sh is copied, you can execute the command below to merge csv files:

/bin/bash file-merge-script.sh -d ~/myfiles -o 'result.csv' -f 1 -s 2

In the above command, -f 1 gets all content (from line 1) of the first matching csv file, and -s 2 gets the contents of the rest of the matching files from line 2 onward.

Get file-merge-script.sh from https://github.com/prabushitha/file-merge-script