46

I have multiple files with the same header and different vectors below it. I need to concatenate all of them, but I want only the header of the first file to be kept; I don't want the other headers, since they are all the same.

For example, file1.txt:

<header>INFO=<ID=DP,Number=1,Type=Integer>
<header>INFO=<ID=DP4,Number=4,Type=Integer>
A
B 
C

file2.txt

<header>INFO=<ID=DP,Number=1,Type=Integer>
<header>INFO=<ID=DP4,Number=4,Type=Integer>
D
E 
F

I need the output to be:

<header>INFO=<ID=DP,Number=1,Type=Integer>
<header>INFO=<ID=DP4,Number=4,Type=Integer>
A
B
C
D
E 
F

I could write a script in R, but I need it in shell.

Jana
  • 729

8 Answers

68

Another solution, similar to the "cat + grep" approach in another answer, using head and GNU tail (or compatible):

  1. Write the header of the first file into the output:

    head -n 2 file1.txt > all.txt
    

    -- head -n 2 prints the first 2 lines of the file.

  2. Add the content of all the files:

    tail -n +3 -q file*.txt >> all.txt
    

    -- -n +3 makes tail print lines from the 3rd one to the end; GNU tail accepts more than one filename as an argument (a common extension over the standard), and -q (also a GNU extension, supported on FreeBSD and NetBSD as well) tells it not to print a header with each file name (see the man page); >> appends to the file rather than overwriting it as > would.

And of course you can put both commands on one line:

head -n 2 file1.txt > all.txt; tail -n +3 -q file*.txt >> all.txt

or put && between them instead of ; so that the second command only runs if the first one succeeds.

Note that shell glob expansions are sorted lexically by default. That means that while file1.txt through file9.txt sort numerically, file10.txt sorts between file1.txt and file2.txt (or possibly even before file1.txt, depending on the locale). If you use zsh, file*.txt(n) gives a numerical sort.
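In shells without such a glob qualifier you can sort the names numerically yourself before handing them to tail; a minimal sketch, assuming GNU sort (for -V, version sort) and file names without spaces or newlines:

head -n 2 file1.txt > all.txt
# sort -V orders file2.txt before file10.txt; xargs passes the sorted list to tail
printf '%s\n' file*.txt | sort -V | xargs tail -n +3 -q >> all.txt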

xealits
  • 2,153
  • 6
    I suggest simplifying it further to: (head -2 file1.txt; tail -n +3 -q file*.txt) > all.txt or (head -2 file1.txt && tail -n +3 -q file*.txt) > all.txt – HongboZhu Jun 13 '17 at 12:39
28

If you know how to do it in R, then by all means do it in R. With classical unix tools, this is most naturally done in awk.

awk '
    FNR==1 && NR!=1 { while (/^<header>/) getline; }
    1 {print}
' file*.txt >all.txt

The first line of the awk script matches the first line of a file (FNR==1) except if it's also the first line across all files (NR==1). When these conditions are met, the expression while (/^<header>/) getline; is executed, which causes awk to keep reading another line (skipping the current one) as long as the current one matches the regexp ^<header>. The second line of the awk script prints everything except for the lines that were previously skipped.

  • Thanks Gilles. Each of my files is gigabytes in size. R won't be efficient for this. That's why I asked. – Jana Jan 08 '13 at 17:54
  • @Jana Are there lines that look like headers but aren't at the top of the file? If not, the fastest way is to use grep (like in sputnik's answer). – Gilles 'SO- stop being evil' Jan 08 '13 at 18:39
  • No, the header lines are the same in all files, and they appear only at the top of each file. Yeah, grep was faster. Thanks to both of you – Jana Jan 08 '13 at 21:54
  • 1
    @Jana By the way, if all your files have the same number of header lines, here's another way (which I expect to be even faster): head -n 10 file1.txt >output.txt && tail -q -n +11 file*.txt >>output.txt (if you have 10 header lines). Also, if your files have numbers in their names, beware that file9.txt is sorted between file89.txt and file90.txt. If your files have numbers like file001.txt, …, files009.txt, files010.txt, …, then files*.txt will list them in the right order. – Gilles 'SO- stop being evil' Jan 08 '13 at 22:01
  • 1
    A better solution (from http://stackoverflow.com/a/16890695/310441) that doesn't require regex matching: awk 'FNR==1 && NR!=1{next;}{print}' *.csv – Owen Mar 16 '17 at 15:28
  • @Owen The additional code in my answer is necessary because there's a multiline header. – Gilles 'SO- stop being evil' Mar 16 '17 at 15:39
  • @Gilles'SO-stopbeingevil', your solution worked well! But I am a little greedy and would like to find a way to make the code more "general-purpose". For now, I have to first figure out what the header line is. Is there a way to save this step? – user3768495 Sep 29 '21 at 22:00
  • @user3768495 I don't understand the question. What defines a header line depends on the format of the data. – Gilles 'SO- stop being evil' Sep 30 '21 at 12:18
  • @Gilles'SO-stopbeingevil', you are exactly right! And what I meant to say is, is it possible to have a script that works every time, no matter what the actual header line is in the data file? – user3768495 Sep 30 '21 at 16:17
  • @user3768495 How is it supposed to recognize what is a header line? In this question, a header line starts with <header>. Without specifying this, there's no way to know what is the last header line and what is the first data line. – Gilles 'SO- stop being evil' Sep 30 '21 at 18:40
5

Try doing this:

$ cat file1.txt; grep -v "^<header" file2.txt
<header>INFO=<ID=DP,Number=1,Type=Integer>
<header>INFO=<ID=DP4,Number=4,Type=Integer>
A
B 
C
D
E 
F

NOTE

  • the -v flag inverts the match (print lines that do not match)
  • ^ in a regex means the beginning of the line
  • if you have a bunch of files, you can do:

array=( file*.txt )
{ cat "${array[@]:0:1}"; grep -vh "^<header" "${array[@]:1}"; } > new_file.txt

It's an array-slicing technique: ${array[@]:0:1} expands to the first file and ${array[@]:1} to all the remaining ones; the -h flag keeps grep from prefixing each output line with a file name when it is given several files.
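In case the slicing syntax is unfamiliar, a quick illustration (the array and file names are just placeholders):

files=( file1.txt file2.txt file3.txt )
echo "${files[@]:0:1}"   # first element only:  file1.txt
echo "${files[@]:1}"     # all remaining ones:  file2.txt file3.txt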

1

The tail command (on GNU, at least) has an option to skip a given number of initial lines. To print from the second line onward, i.e. skip a one-line header, do: tail -n+2 myfile

So, to keep the two-line header of the first file but not the second, in Bash:

cat file1.txt <(tail -n+3 file2.txt) > combined.txt

Or, for many files:

head -n2 file1.txt > combined.txt
for fname in file*.txt
do
    tail -n+3 "$fname" >> combined.txt
done

If a certain string is known to be present in all header lines but never in the rest of the input files, grep -v is a simpler approach, as sputnik showed.

etal
  • 111
1

Shorter (not necessarily faster) with sed:

sed -e '3,${/^<header>/d' -e '}' file*.txt > all.txt

This deletes all lines beginning with <header> starting from line 3, so the first header is preserved and the other headers are removed. If the header has a different number of lines, adjust the command accordingly (e.g. for a 6-line header use 7 instead of 3).
If the number of lines in the header is unknown, you could try this:

sed '1{
: again
n
/^<header>/b again
}
/^<header>/d
' file*.txt > all.txt
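If your sed is GNU sed, its -s (--separate) option makes line numbers restart for each input file, which gives another short variant for a header of known length; a sketch assuming the two-line header from the question:

head -n 2 file1.txt > all.txt
sed -s '1,2d' file*.txt >> all.txt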
don_crissti
  • 82,805
1

array=( *.txt ); head -1 "${array[0]}" > all.txt; tail -n +2 -q "${array[@]}" >> all.txt

Assuming you have a folder of .txt files that share the same one-line header, this combines them all into all.txt with just one header. The first command (the commands are separated by semicolons) gathers all the text files into an array, the second writes the header of the first file into all.txt, and the last concatenates all the gathered files without their headers (by starting from line 2 of each) and appends the result to all.txt. For the two-line header in the question, use head -2 and tail -n +3 instead.

Eric
  • 21
0

Here's a lazy script to help with this. Not totally robust, but good enough.

concat_with_header() {
  # Suffix to pattern match for concatenation (e.g. '.csv')
  local suffix="${1}"
  # Name of the output file
  local output="${2:-combined.out}"
  # Number of lines to use for the header
  local header_length="${3:-1}"
  # The first matching file supplies the header
  local first_file
  first_file=$(ls -b *"$suffix" | head -n 1)
  # Write the header, then append every matching file without its header
  head -n "$header_length" "$first_file" > "$output"
  tail -n +"$((header_length + 1))" -q *"$suffix" >> "$output"
}
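A hypothetical invocation for the question's files (a two-line header, .txt suffix), writing to combined.out:

concat_with_header .txt combined.out 2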
fny
  • 383
  • the last two lines seem to have errors: "header" variable is not used anywhere, the head command options appear to be switched (you want the first filename - not the $header_length first few files but the first $header_length lines of the $header_file - not necessarily the first line. "header_file" variable is not defined anywhere, quoting variables (last line of the function) is strongly recommended. – sborsky Aug 23 '21 at 18:20
0

If all your files have the same number of header lines (2 in your case), another very simple solution for lazy people like me is this:

head -2 file1.txt > all.txt
awk 'FNR>2{print}' file*.txt >> all.txt

Very similar to using head + tail as in the first answer:

  1. First you write the header
  2. Then you append the contents of all the files without their headers (FNR>2 skips the first two lines of each file, including the first one, whose header was already written)
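If you prefer a single command, the same idea fits into one awk invocation (a sketch, assuming every file has exactly two header lines):

awk 'NR<=2 || FNR>2' file*.txt > all.txt

NR<=2 keeps the header of the first file only (the first two lines of the whole input), and FNR>2 keeps the data lines of every file.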