1

I want to concatenate multiple files from different directories.

Directory 1: Chr1 containing (in this example) four files:

ABC.1 
DEF.1
GHI.1 
JKL.1 

Directory 2: Chr2

ABC.2  
DEF.2 
GHI.2
JKL.2 

There are 22 directories. Each file has 20 columns with a header. The headers are the same for all files.

I want to concatenate all into one file (one global output file for the concatenation of all files from all directories).

I tried this, but it does not work:

cat */Chr{1..22}/*.{1..22} > */final_file

It fails with "No such file or directory" because not every pattern matches: the Chr22 directory, for example, contains no files ending in .1 through .21.

Do you have any ideas? Thank you in advance.

Chris Davies

3 Answers

2

Just use the zsh shell instead:

cat -- */Chr<1-22>/*.<1-22>(n) > final_file

In zsh, <x-y> is a glob operator that matches on ranges of decimal integer numbers, and the n glob qualifier toggles the numericglobsort option which causes glob expansions to be sorted numerically.

From within another shell, you can do:

zsh -c 'cat -- */Chr<1-22>/*.<1-22>(n) > final_file'

To skip the header on all but the first file, and assuming the GNU or busybox implementation of tail (the most common on systems using Linux as their kernel), you can do:

(){
  cat < $1; shift; (($#)) && tail -qn +2 -- "$@"
} */Chr<1-22>/*.<1-22>(n) > final_file

1

The problem with your approach is that the wildcards and brace expressions on a command line are not expanded in a "synchronized" way; each occurrence is expanded anew and independently of the others. Chr{1..22} and *.{1..22} therefore produce every combination of directory number and file suffix, not just the matching pairs. Hence, you will need nested shell loops.
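To make the independent expansion concrete, here is a minimal bash sketch. The scratch directory and the file names ABC.1 and ABC.2 are made up for the demonstration, and only two directories are used instead of 22.

```shell
#!/bin/bash
# Scratch layout (hypothetical names): Chr1/ABC.1 and Chr2/ABC.2
demo=$(mktemp -d)
mkdir "$demo/Chr1" "$demo/Chr2"
touch "$demo/Chr1/ABC.1" "$demo/Chr2/ABC.2"
cd "$demo" || exit 1

# Brace expansion runs first and builds every combination:
#   Chr1/*.1  Chr1/*.2  Chr2/*.1  Chr2/*.2
# Each resulting pattern is then globbed on its own. With bash's default
# settings, a pattern that matches nothing is left as literal text.
expanded=(Chr{1..2}/*.{1..2})
printf '%s\n' "${expanded[@]}"
```

With bash's default options this prints Chr1/ABC.1, Chr1/*.2, Chr2/*.1 and Chr2/ABC.2. The two unmatched patterns are passed to cat as literal names, which is where the "No such file or directory" messages come from.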

You could try the following shell script. Notice that it uses bash features (your question didn't say which shell you are using).

#!/bin/bash

hdr=0   # initialize variable to keep track of whether the header is already printed

# loop over directories
for d in Chr*
do
    # extract trailing number from dir name by stripping the 'Chr' prefix
    n="${d#Chr}"

    # loop over all files
    for f in "$d/"*".$n"
    do
       if (( hdr == 0 )) # if header wasn't printed yet, output entire file
       then
           cat "$f" > final_file
           hdr=1
       else              # otherwise, output file content starting with line 2
           tail -n +2 "$f" >> final_file
       fi
    done
done

You can name the script concatenate.sh, make it executable, and run it from the directory where all your Chr{1..22} subdirectories are located. The final_file will be created in that directory, too.

Note that I couldn't test it extensively, but it shouldn't destroy anything (apart from overwriting an existing final_file).

AdminBee
0

If you want to capture all the files in all your Chr* subdirectories, you could use this:

cat Chr*/* >final_file

If you need to restrict the set of files in each subdirectory to match the suffix of that directory's name (so in Chr1 we only consider files matching *.1) you will need a loop

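shopt -s nullglob    # This is bash-specific
for i in {1..22}
do
    cat Chr$i/*.$i
done >final_file

The optional shopt -s nullglob tells the shell that a wildcard pattern with no matches is to be removed rather than left as a literal pattern. Be aware that if a directory then has no matching files at all, cat is left without arguments and reads standard input instead.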
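A small sketch of the difference, using a scratch directory and a made-up file name (neither comes from the question). The guard at the end is one possible way, assumed here rather than taken from the loop above, to avoid calling cat with an empty file list.

```shell
#!/bin/bash
# Scratch directory with a single, made-up file: Chr1/ABC.1
demo=$(mktemp -d)
mkdir "$demo/Chr1"
touch "$demo/Chr1/ABC.1"
cd "$demo" || exit 1

shopt -u nullglob            # bash default: an unmatched pattern stays literal
literal=(Chr1/*.2)
echo "nullglob off: ${#literal[@]} word(s): ${literal[*]}"

shopt -s nullglob            # an unmatched pattern now expands to nothing
removed=(Chr1/*.2)
echo "nullglob on: ${#removed[@]} word(s)"

# Collect the matches first and only call cat when there is something to read.
files=(Chr1/*.1)
if (( ${#files[@]} > 0 )); then
    cat -- "${files[@]}"
fi
```

The first echo reports one word, the literal Chr1/*.2, and the second reports zero words.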

As an alternative, since there seems to be some feeling that you want all but the first header line omitted from your concatenated files, this expanded loop can handle it

first=yes
for i in {1..22}
do
    for f in Chr$i/*.$i
    do
        [[ -n "$first" ]] && head -n1 "$f" && first=    # print the header once
        tail -n +2 "$f"                                 # then each file without its header line
    done
done >final_file
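As a hedged alternative to the shell loop, awk can do the header test itself. FNR restarts at 1 for each input file while NR keeps counting across all of them, so FNR == 1 && NR > 1 identifies the first line of every file after the first. The function name concat_skip_headers and the sample data are hypothetical, and the sketch assumes every file starts with exactly one header line.

```shell
#!/bin/bash
# Hypothetical helper: print the first file whole, then every later
# file without its first line. Assumes one header line per file.
concat_skip_headers() {
    # FNR is the line number within the current file, NR across all files.
    awk 'FNR == 1 && NR > 1 { next } { print }' "$@"
}

# Made-up sample data in a scratch directory.
demo=$(mktemp -d)
mkdir "$demo/Chr1" "$demo/Chr2"
printf 'id\tvalue\na\t1\n' > "$demo/Chr1/ABC.1"
printf 'id\tvalue\nb\t2\n' > "$demo/Chr2/ABC.2"

concat_skip_headers "$demo/Chr1/ABC.1" "$demo/Chr2/ABC.2" > "$demo/final_file"
cat "$demo/final_file"
```

For the real layout, the file names could be collected into an array inside the {1..22} loop and passed to the function in a single call, which keeps the numeric directory order.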

Or, if your header line is the first line of the first file and can be removed wherever it is encountered afterwards, you can strip it using a construct like this:

for i in {1..22}
do
    cat Chr$i/*.$i
done |
    awk '$0 != header { print } header == "" { header = $0 }' >final_file
Chris Davies
  • The problem is that the first line of every file contains a header that is to appear only once in the overall result file, so you will need to include some test on that ... – AdminBee Apr 29 '20 at 13:56
  • It doesn't say that anywhere in the question. The OP's example doesn't attempt to handle that either – Chris Davies Apr 29 '20 at 14:10
  • Well, I don't see any other reason why the OP should mention the (globally identical) header in the first place if it isn't to be treated specially ... – AdminBee Apr 29 '20 at 14:12
  • I've seen a lot of questions where information is supplied simply because the asker doesn't know if it's relevant or not. – Chris Davies Apr 29 '20 at 14:13