I've got 10k+ files totaling over 20GB that I need to concatenate into one file.
Is there a faster way than
cat input_file* >> out
?
The preferred way would be a bash command, but Python is acceptable too if it is not considerably slower.
Nope, cat is surely the best way to do this. Why use python when there is a program already written in C for this purpose? However, you might want to consider using xargs in case the command line length exceeds ARG_MAX and you need more than one cat. Using GNU tools, this is equivalent to what you already have:
find . -maxdepth 1 -type f -name 'input_file*' -print0 |
sort -z |
xargs -0 cat -- >>out
Note that the output of find is piped through sort. Without this, the files would be listed in an arbitrary order (defined by the file system, which could be the file creation order).
– scai
Mar 05 '14 at 13:43
The sort is needed so that the files are concatenated in the same order as the bash glob. Otherwise I don't see any cases where xargs or cat would not behave as expected.
– Graeme
Mar 05 '14 at 14:05
How does this solve the ARG_MAX problem? I can see only one call of cat in the displayed code, so it does not address that problem at all.
– Marc van Leeuwen
Mar 06 '14 at 08:46
xargs will call as many cat invocations as necessary to avoid an E2BIG error from execve(2).
– Stéphane Chazelas
Mar 06 '14 at 09:41
I cannot find that xargs behaviour documented anywhere on my (Ubuntu Linux) system.
– Marc van Leeuwen
Mar 06 '14 at 10:51
The man page for GNU xargs is pretty bad and misses a couple of major points of xargs operation.
– Graeme
Mar 06 '14 at 11:12
Allocating the space for the output file first may improve the overall speed as the system won't have to update the allocation for every write.
For instance, if on Linux:
size=$({ find . -maxdepth 1 -type f -name 'input_file*' -printf '%s+'; echo 0;} | bc)
fallocate -l "$size" out &&
find . -maxdepth 1 -type f -name 'input_file*' -print0 |
sort -z | xargs -r0 cat 1<> out
Another benefit is that if there's not enough free space, the copy will not be attempted.
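For comparison, here is a rough Python version of the same preallocate-then-append idea. This is a sketch, not part of the original answer: it assumes Python 3.3+ on Linux for os.posix_fallocate, and takes the input_file* pattern and the out name from the question.
import glob, os, shutil

files = sorted(glob.glob('input_file*'))
total = sum(os.path.getsize(f) for f in files)

# Open without O_TRUNC so the preallocated space is kept (like 1<> in the shell).
fd = os.open('out', os.O_RDWR | os.O_CREAT, 0o644)
os.posix_fallocate(fd, 0, total)          # fails up front if there is not enough space

with os.fdopen(fd, 'wb') as out:
    for name in files:
        with open(name, 'rb') as src:
            shutil.copyfileobj(src, out)  # buffered copy of each file in turn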
If on btrfs, you could cp --reflink=always the first file (which implies no data copy and would therefore be almost instantaneous), and append the rest. If there are 10000 files, that probably won't make much difference though unless the first file is very big.
There's an API to generalise that to ref-copy all the files (the BTRFS_IOC_CLONE_RANGE ioctl), but I could not find any utility exposing that API, so you'd have to do it in C (or python or other languages provided they can call arbitrary ioctls).
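For illustration only, here is a hedged Python sketch of that approach. The ioctl request number and struct layout below are taken from the Linux btrfs headers, so double-check them on your system; also note that clone ranges must normally be block-aligned, so as a concatenation this only works cleanly when every file except the last has a size that is a multiple of the filesystem block size.
import fcntl, glob, os, struct

# _IOW(0x94, 13, struct btrfs_ioctl_clone_range_args), from the btrfs headers
BTRFS_IOC_CLONE_RANGE = 0x4020940D

files = sorted(glob.glob('input_file*'))
out = os.open('out', os.O_RDWR | os.O_CREAT, 0o644)
offset = 0
for name in files:
    src = os.open(name, os.O_RDONLY)
    length = os.fstat(src).st_size
    # struct btrfs_ioctl_clone_range_args { __s64 src_fd; __u64 src_offset, src_length, dest_offset; }
    args = struct.pack('qQQQ', src, 0, length, offset)
    fcntl.ioctl(out, BTRFS_IOC_CLONE_RANGE, args)  # ref-copy: no file data is moved
    offset += length
    os.close(src)
os.close(out)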
If the source files are sparse or have large sequences of NUL characters, you could make a sparse output file (saving time and disk space) with (on GNU systems):
find . -maxdepth 1 -type f -name 'input_file*' -print0 |
sort -z | xargs -r0 cat | cp --sparse=always /dev/stdin out
You might want to drop the && in case fallocate fails to pre-allocate, e.g. because the filesystem does not support it (currently only btrfs, ext4, ocfs2, and xfs support fallocate); since there is little harm done if pre-allocation fails, I guess it's safer to use fallocate -l "$size" out; find . ...
– umläute
Mar 05 '14 at 15:51
Note that it's neither > nor >>, but 1<> as I said, to write into the file.
– Stéphane Chazelas
Mar 05 '14 at 16:41
I have never seen 1<> before; could you please post a link to a reference / explanation?
– grebneke
Mar 05 '14 at 20:21
<> is the standard Bourne/POSIX read+write redirection operator. See your shell manual or the POSIX spec for details. The default fd is 0 for the <> operator (<> is short for 0<>, like < is short for 0< and > is short for 1>), so you need the 1 to explicitly redirect stdout. Here, it's not so much that we need read+write (O_RDWR), but that we don't want O_TRUNC (as in >) which would deallocate what we've just allocated.
– Stéphane Chazelas
Mar 05 '14 at 20:30
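To make the O_TRUNC point concrete, here is a small Python sketch of the same open(2) flags. This is not from the thread; it assumes Linux and Python 3.3+ for os.posix_fallocate, and the file name out is illustrative.
import os

fd = os.open('out', os.O_RDWR | os.O_CREAT, 0o644)   # like 1<> out: no O_TRUNC
os.posix_fallocate(fd, 0, 1 << 20)                   # reserve 1 MiB up front
print(os.fstat(fd).st_size)                          # 1048576: allocation kept
os.close(fd)

fd = os.open('out', os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)  # like > out
print(os.fstat(fd).st_size)                          # 0: truncated, allocation gone
os.close(fd)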
<> was in the Bourne shell from the start (1979) but initially not documented.
– Stéphane Chazelas
Mar 05 '14 at 20:40
Can you seek() in bash (?), and what are common real-world usages of <>, except for skipping O_TRUNC? man bash is really terse on the subject, and it's hard to usefully google "bash <>".
– grebneke
Mar 05 '14 at 20:48
You can seek with dd or via reading.
– Stéphane Chazelas
Mar 05 '14 at 21:15
I doubt fallocate will negate the overhead of the extra find, even though it will be faster the second time round. btrfs certainly opens up some interesting possibilities though.
– Graeme
Mar 06 '14 at 01:28
find does not sort files the same as a shell glob. – Graeme Mar 05 '14 at 13:18
out is located on another disk. – Mar 05 '14 at 22:41