49

There are 5 huge files (file1, file2, ... file5), about 10G each, and extremely low free space left on the disk, and I need to concatenate all these files into one. There is no need to keep the original files, only the final one.

The usual concatenation would be to run cat in sequence for files file2 .. file5:

cat file2 >> file1 ; rm file2

Unfortunately this way requires at least 10G of free space, which I don't have. Is there a way to concatenate the files without actually copying them, but instead telling the filesystem somehow that file1 doesn't end at the original file1's end and continues at the start of file2?

P.S. The filesystem is ext4, if that matters.

slm
  • 369,824
rush
  • 27,403
  • 2
    I'd be interested to see a solution, but I suspect it's not possible without messing with the filesystem directly. – Kevin Jun 23 '13 at 15:51
  • 1
    Why do you need to have a single physical file that is so large? I'm asking because maybe you can avoid concatenating—which, as current answers show, is pretty bothersome. – liori Jun 23 '13 at 17:05
  • @liori these files are chunks of a single disk image and I have no idea how to mount it without concatenation. – rush Jun 23 '13 at 19:47
  • 6
    @rush: then this answer might help: http://serverfault.com/a/487692/16081 – liori Jun 23 '13 at 20:02
  • @liori: thanks. that looks even better than the dances with dd. – rush Jun 23 '13 at 20:28
  • 1
    An alternative to device-mapper, less efficient but easier to implement, which results in a partitionable device and can be used from a remote machine, is the "multi" mode of nbd-server. – Stéphane Chazelas Jun 23 '13 at 20:33
  • 1
    They always call me stupid when I say that I think this would be cool. – n611x007 Jul 12 '13 at 13:54

4 Answers

21

AFAIK it is (unfortunately) not possible to truncate a file from the beginning (this may be true for the standard tools; for the syscall level see here). But by adding some complexity you can use normal truncation (together with sparse files): you can write to the end of the target file without having written all the data in between.

Let's assume at first that both files are exactly 5 GiB (5120 MiB) in size and that you want to move 1 MiB at a time. You execute a loop which consists of

  1. copying one block from the end of the source file to the end of the target file (increasing the consumed disk space)
  2. truncating the source file by one block (freeing disk space)

    for ((i=5119; i>=0; i--)); do
      # copy the last remaining 1 MiB block of the source to its final
      # position after the target's original 5120 MiB of data
      dd if=sourcefile of=targetfile bs=1M skip="$i" seek="$((5120+i))" count=1 conv=notrunc
      # truncate the source to i MiB, freeing the block that was just copied
      dd if=/dev/zero of=sourcefile bs=1M count=0 seek="$i"
    done
    

But give it a try with smaller test files first, please...
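For example, a throw-away test along these lines (a sketch only; the file names, the 8 MiB size and the temporary directory are arbitrary, and iflag=fullblock assumes GNU dd):

    cd "$(mktemp -d)"
    dd if=/dev/urandom of=targetfile bs=1M count=8 iflag=fullblock
    dd if=/dev/urandom of=sourcefile bs=1M count=8 iflag=fullblock
    cat targetfile sourcefile | md5sum      # reference checksum of the concatenation
    for ((i=7; i>=0; i--)); do
      dd if=sourcefile of=targetfile bs=1M skip="$i" seek="$((8+i))" count=1 conv=notrunc
      dd if=/dev/zero of=sourcefile bs=1M count=0 seek="$i"
    done
    md5sum targetfile                       # should print the same checksum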

Probably the files are neither the same size nor multiples of the block size. In that case the calculation of the offsets becomes more complicated; dd's skip_bytes and seek_bytes flags should be used then.
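As a rough, untested sketch of such a byte-accurate variant (it assumes GNU dd for the skip_bytes, seek_bytes and fullblock flags, plus GNU coreutils stat and truncate; sourcefile, targetfile and the 1 MiB chunk size are placeholders, and coreutils truncate is used for the shrinking step instead of the second dd):

    targetsize=$(stat -c %s targetfile)     # bytes already in the target
    sourcesize=$(stat -c %s sourcefile)
    chunk=$((1024*1024))
    offset=$sourcesize
    while [ "$offset" -gt 0 ]; do
      n=$chunk
      [ "$offset" -lt "$n" ] && n=$offset
      offset=$((offset - n))
      # copy the last not-yet-moved chunk of the source to its final byte offset in the target
      dd if=sourcefile of=targetfile bs="$n" count=1 conv=notrunc \
        iflag=skip_bytes,fullblock oflag=seek_bytes \
        skip="$offset" seek="$((targetsize + offset))"
      # cut the copied chunk off the end of the source, freeing its disk blocks
      truncate -s "$offset" sourcefile
    done
    rm sourcefile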

If this is the way you want to go but need help for the details then ask again.

Warning

Depending on the dd block size the resulting file will be a fragmentation nightmare.

Hauke Laging
  • 90,279
  • Looks like this is the most acceptable way to concatenate files. Thanks for advice. – rush Jun 23 '13 at 19:53
  • 4
    If there is no sparse file support, then you could block-wise reverse the second file in place and then just remove the last block and add it to the first file – ratchet freak Jun 24 '13 at 00:04
  • 1
    I haven't tried this myself (although I'm about to), but http://seann.herdejurgen.com/resume/samag.com/html/v09/i08/a9_l1.htm is a Perl script that claims to implement this algorithm. – zwol Nov 26 '13 at 17:58
18

Instead of catting the files together into one file, maybe simulate a single file with a named pipe, if your program can't handle multiple files.

mkfifo /tmp/file
cat file* >/tmp/file &
blahblah /tmp/file
rm /tmp/file

As Hauke suggests, losetup/dmsetup can also work. A quick experiment: I created file1 .. file4 and, with a bit of effort, did:

for i in file*; do losetup -f ~/"$i"; done

numchunks=4
startsector=0
for i in `seq 1 $numchunks`; do
        # size of chunk $i in 512-byte sectors; losetup -f above attached file$i to /dev/loop$((i-1))
        sizeinsectors=$((`ls -l file$i | awk '{print $5}'`/512))
        echo "$startsector $sizeinsectors linear /dev/loop$((i-1)) 0"
        startsector=$(($startsector+$sizeinsectors))
done | dmsetup create joined

Then /dev/mapper/joined (which also shows up as /dev/dm-0) is a virtual block device with your concatenated file as its contents.

I haven't tested this well.

Another edit: each file's size has to be evenly divisible by 512 or you'll lose some data. If it is, then you're good. I see he also noted that in the comments below.
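For completeness, a hypothetical way to use and then undo the experiment, assuming the joined image directly contains a filesystem (no partition table) and that the loop devices were allocated as /dev/loop0 through /dev/loop3 as above:

mount -o ro /dev/mapper/joined /mnt     # same device that shows up as /dev/dm-0
# ... read the image ...
umount /mnt
dmsetup remove joined
for i in 0 1 2 3; do losetup -d /dev/loop$i; done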

Rob Bos
  • 4,380
  • It's a great idea for reading the file once; unfortunately there is no way to seek backward/forward over a fifo, is there? – rush Jun 23 '13 at 19:50
  • 7
    @rush The superior alternative may be to put a loop device on each file and combine them via dmsetup to a virtual block device (which allows normal seek operations but neither append nor truncate). If the size of the first file is not a multiple of 512 then you should copy the incomplete last sector and the first bytes from the second file (in sum 512) to a third file. The loop device for the second file would need --offset then. – Hauke Laging Jun 23 '13 at 20:30
  • Elegant solutions. +1 also to Hauke Laging, who suggests a way to work around the problem if the first file(s)' size is not a multiple of 512 – Olivier Dulac Jun 24 '13 at 18:09
10

You'll have to write something that copies data in chunks that are at most as large as the amount of free space you have. It should work like this (a rough sketch with stock tools follows the list):

  • Read a block of data from file2 (using pread(), which reads from a given offset, or by seeking to the correct location before the read).
  • Append the block to file1.
  • Use fcntl(F_FREESP) to deallocate the space from file2.
  • Repeat
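That means writing a small program; as a rough stand-in, the same idea can be sketched with stock Linux tools, using GNU dd for the reads and the util-linux fallocate utility (the FALLOC_FL_PUNCH_HOLE interface mentioned in the comments below, available since Linux 2.6.38) in place of fcntl(F_FREESP). The file names and the chunk size are placeholders:

chunk=$((64*1024*1024))     # must fit in the free space you still have
size=$(stat -c %s file2)
offset=0
while [ "$offset" -lt "$size" ]; do
  n=$chunk
  [ $((size - offset)) -lt "$n" ] && n=$((size - offset))
  # read one chunk of file2 and append it to file1
  dd if=file2 bs="$n" count=1 iflag=skip_bytes,fullblock skip="$offset" >> file1
  # punch a hole over the chunk just copied, returning its blocks to the filesystem
  fallocate --punch-hole --offset "$offset" --length "$n" file2
  offset=$((offset + n))
done
rm file2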
Celada
  • 44,132
  • 1
    I know... but I couldn't think of any way that didn't involve writing code, and I figured writing what I wrote was better than writing nothing. I didn't think of your clever trick of starting from the end! – Celada Jun 23 '13 at 16:10
  • Yours, too, wouldn't work without starting from the end, would it? – Hauke Laging Jun 23 '13 at 16:16
  • No, it works from the beginning because of fcntl(F_FREESP) which frees the space associated with a given byte range of the file (it makes it sparse). – Celada Jun 23 '13 at 16:19
  • That's pretty cool. But seems to be a very new feature. It is not mentioned in my fcntl man page (2012-04-15). – Hauke Laging Jun 23 '13 at 16:31
  • 4
    @HaukeLaging F_FREESP is the Solaris one. On Linux (since 2.6.38), it's the FALLOC_FL_PUNCH_HOLE flag of the fallocate syscall. Newer versions of the fallocate utility from util-linux have an interface to that. – Stéphane Chazelas Jun 23 '13 at 20:12
0

I know it's more of a workaround than what you asked for, but it would take care of your problem (and with little fragmentation or head-scratching):

#step 1
mount /path/to/... /the/new/fs #mount a new filesystem (from NFS? or an external usb disk?)

and then

#step 2:
cat file* > /the/new/fs/fullfile

or, if you think compression would help:

#step 2 (alternate):
cat file* | gzip -c - > /the/new/fs/fullfile.gz

Then (and ONLY then), finally

#step 3:
rm file*
mv /the/new/fs/fullfile  .   #or fullfile.gz if you compressed it
  • Unfortunately an external usb disk requires physical access and nfs requires additional hardware, and I have neither. Anyway, thanks. =) – rush Jun 24 '13 at 17:58
  • I thought it would be that way... Rob Bos's answer is then what seems your best option (without risking losing data by truncating-while-copying, and without hitting FS limitations as well) – Olivier Dulac Jun 24 '13 at 18:12