49

There are 5 huge files (file1, file2, ... file5), about 10G each, and extremely low free space left on the disk, and I need to concatenate all these files into one. There is no need to keep the original files, only the final one.

The usual concatenation would be to run cat in sequence for files file2 .. file5:

cat file2 >> file1 ; rm file2

Unfortunately this way requires at least 10G of free space, which I don't have. Is there a way to concatenate the files without actually copying them, but instead telling the filesystem somehow that file1 doesn't end at the original file1's end and continues at the start of file2?

P.S. The filesystem is ext4, if that matters.

slm
  • 369,824
rush
  • 27,403
  • 2
    I'd be interested to see a solution, but I suspect it's not possible without messing with the filesystem directly. – Kevin Jun 23 '13 at 15:51
  • 1
    Why do you need to have a single physical file that is so large? I'm asking because maybe you can avoid concatenating—which, as current answers show, is pretty bothersome. – liori Jun 23 '13 at 17:05
  • @liori these files are chunks of a single disk image and I have no idea how to mount it without concatenation. – rush Jun 23 '13 at 19:47
  • 6
    @rush: then this answer might help: http://serverfault.com/a/487692/16081 – liori Jun 23 '13 at 20:02
  • @liori: thanks. that looks even better than the dances with dd. – rush Jun 23 '13 at 20:28
  • 1
    An alternative to device-mapper, less efficient but easier to implement, which results in a partitionable device and can be used from a remote machine, is the "multi" mode of nbd-server. – Stéphane Chazelas Jun 23 '13 at 20:33
  • 1
    They always call me stupid when I say that I think this would be cool. – n611x007 Jul 12 '13 at 13:54

4 Answers

21

AFAIK it is (unfortunately) not possible to truncate a file from the beginning (this may be true for the standard tools; for the syscall level see here). But by adding some complexity you can use normal truncation (together with sparse files): you can write to the end of the target file without having written all the data in between.

Let's assume at first that both files are exactly 5 GiB (5120 MiB) in size and that you want to move 1 MiB at a time. You execute a loop which consists of

  1. copying one block from the end of the source file to the end of the target file (increasing the consumed disk space)
  2. truncating the source file by one block (freeing disk space)

    for ((i=5119; i>=0; i--)); do
      # copy the last remaining 1 MiB block of the source to its final
      # position after the target's original 5120 MiB of data
      dd if=sourcefile of=targetfile bs=1M skip="$i" seek="$((5120+i))" count=1 conv=notrunc
      # truncate the source to i MiB, freeing the block that was just copied
      dd if=/dev/zero of=sourcefile bs=1M count=0 seek="$i"
    done
    

But give it a try with smaller test files first, please...
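For example, a throw-away test along these lines (a sketch only; the file names, the 8 MiB size and the temporary directory are arbitrary, and iflag=fullblock assumes GNU dd):

    cd "$(mktemp -d)"
    dd if=/dev/urandom of=targetfile bs=1M count=8 iflag=fullblock
    dd if=/dev/urandom of=sourcefile bs=1M count=8 iflag=fullblock
    cat targetfile sourcefile | md5sum      # reference checksum of the concatenation
    for ((i=7; i>=0; i--)); do
      dd if=sourcefile of=targetfile bs=1M skip="$i" seek="$((8+i))" count=1 conv=notrunc
      dd if=/dev/zero of=sourcefile bs=1M count=0 seek="$i"
    done
    md5sum targetfile                       # should print the same checksum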

Probably the files are neither the same size nor multiples of the block size. In that case the calculation of the offsets becomes more complicated; dd's skip_bytes and seek_bytes flags should be used then.
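As a rough, untested sketch of such a byte-accurate variant (it assumes GNU dd for the skip_bytes, seek_bytes and fullblock flags, plus GNU coreutils stat and truncate; sourcefile, targetfile and the 1 MiB chunk size are placeholders, and coreutils truncate is used for the shrinking step instead of the second dd):

    targetsize=$(stat -c %s targetfile)     # bytes already in the target
    sourcesize=$(stat -c %s sourcefile)
    chunk=$((1024*1024))
    offset=$sourcesize
    while [ "$offset" -gt 0 ]; do
      n=$chunk
      [ "$offset" -lt "$n" ] && n=$offset
      offset=$((offset - n))
      # copy the last not-yet-moved chunk of the source to its final byte offset in the target
      dd if=sourcefile of=targetfile bs="$n" count=1 conv=notrunc \
        iflag=skip_bytes,fullblock oflag=seek_bytes \
        skip="$offset" seek="$((targetsize + offset))"
      # cut the copied chunk off the end of the source, freeing its disk blocks
      truncate -s "$offset" sourcefile
    done
    rm sourcefile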

If this is the way you want to go but need help for the details then ask again.

Warning

Depending on the dd block size the resulting file will be a fragmentation nightmare.

Hauke Laging
  • 90,279
  • Looks like this is the most acceptable way to concatenate files. Thanks for advice. – rush Jun 23 '13 at 19:53
  • 4
    If there is no sparse file support, then you could block-wise reverse the second file in place and then just remove the last block and add it to the first file – ratchet freak Jun 24 '13 at 00:04
  • 1
    I haven't tried this myself (although I'm about to), but http://seann.herdejurgen.com/resume/samag.com/html/v09/i08/a9_l1.htm is a Perl script that claims to implement this algorithm. – zwol Nov 26 '13 at 17:58
18

Instead of catting the files together into one file, maybe simulate a single file with a named pipe, if your program can't handle multiple files.

mkfifo /tmp/file
cat file* >/tmp/file &
blahblah /tmp/file
rm /tmp/file

As Hauke suggests, losetup/dmsetup can also work. A quick experiment: I created file1 .. file4 and, with a bit of effort, did:

for i in file*; do losetup -f ~/"$i"; done

numchunks=4
startsector=0
for i in `seq 1 $numchunks`; do
        # size of chunk $i in 512-byte sectors; losetup -f above attached file$i to /dev/loop$((i-1))
        sizeinsectors=$((`ls -l file$i | awk '{print $5}'`/512))
        echo "$startsector $sizeinsectors linear /dev/loop$((i-1)) 0"
        startsector=$(($startsector+$sizeinsectors))
done | dmsetup create joined

Then /dev/mapper/joined (which also shows up as /dev/dm-0) is a virtual block device with your concatenated file as its contents.

I haven't tested this well.

Another edit: each file's size has to be evenly divisible by 512 or you'll lose some data. If it is, then you're good. I see he also noted that in the comments below.
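For completeness, a hypothetical way to use and then undo the experiment, assuming the joined image directly contains a filesystem (no partition table) and that the loop devices were allocated as /dev/loop0 through /dev/loop3 as above:

mount -o ro /dev/mapper/joined /mnt     # same device that shows up as /dev/dm-0
# ... read the image ...
umount /mnt
dmsetup remove joined
for i in 0 1 2 3; do losetup -d /dev/loop$i; done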

Rob Bos
  • 4,380
  • It's a great idea for reading the file once; unfortunately there is no way to seek backward/forward over a fifo, is there? – rush Jun 23 '13 at 19:50
  • 7
    @rush The superior alternative may be to put a loop device on each file and combine them via dmsetup to a virtual block device (which allows normal seek operations but neither append nor truncate). If the size of the first file is not a multiple of 512 then you should copy the incomplete last sector and the first bytes from the second file (in sum 512) to a third file. The loop device for the second file would need --offset then. – Hauke Laging Jun 23 '13 at 20:30
  • Elegant solutions. +1 also to Hauke Laging, who suggests a way to work around the problem if the first file(s)' size is not a multiple of 512 – Olivier Dulac Jun 24 '13 at 18:09
10

You'll have to write something that copies data in chunks that are at most as large as the amount of free space you have. It should work like this (a rough sketch with stock tools follows the list):

  • Read a block of data from file2 (using pread(), which reads from a given offset, or by seeking to the correct location before the read).
  • Append the block to file1.
  • Use fcntl(F_FREESP) to deallocate the space from file2.
  • Repeat
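That means writing a small program; as a rough stand-in, the same idea can be sketched with stock Linux tools, using GNU dd for the reads and the util-linux fallocate utility (the FALLOC_FL_PUNCH_HOLE interface mentioned in the comments below, available since Linux 2.6.38) in place of fcntl(F_FREESP). The file names and the chunk size are placeholders:

chunk=$((64*1024*1024))     # must fit in the free space you still have
size=$(stat -c %s file2)
offset=0
while [ "$offset" -lt "$size" ]; do
  n=$chunk
  [ $((size - offset)) -lt "$n" ] && n=$((size - offset))
  # read one chunk of file2 and append it to file1
  dd if=file2 bs="$n" count=1 iflag=skip_bytes,fullblock skip="$offset" >> file1
  # punch a hole over the chunk just copied, returning its blocks to the filesystem
  fallocate --punch-hole --offset "$offset" --length "$n" file2
  offset=$((offset + n))
done
rm file2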
Celada
  • 44,132
  • 1
    I know... but I couldn't think of any way that didn't involve writing code, and I figured writing what I wrote was better than writing nothing. I didn't think of your clever trick of starting from the end! – Celada Jun 23 '13 at 16:10
  • Yours, too, wouldn't work without starting from the end, would it? – Hauke Laging Jun 23 '13 at 16:16
  • No, it works from the beginning because of fcntl(F_FREESP) which frees the space associated with a given byte range of the file (it makes it sparse). – Celada Jun 23 '13 at 16:19
  • That's pretty cool. But seems to be a very new feature. It is not mentioned in my fcntl man page (2012-04-15). – Hauke Laging Jun 23 '13 at 16:31
  • 4
    @HaukeLaging F_FREESP is the Solaris one. On Linux (since 2.6.38), it's the FALLOC_FL_PUNCH_HOLE flag of the fallocate syscall. Newer versions of the fallocate utility from util-linux have an interface to that. – Stéphane Chazelas Jun 23 '13 at 20:12
0

I know it's more of a workaround than what you asked for, but it would take care of your problem (and with little fragmentation or head-scratching):

#step 1
mount /path/to/... /the/new/fs #mount a new filesystem (from NFS? or an external usb disk?)

and then

#step 2:
cat file* > /the/new/fs/fullfile

or, if you think compression would help:

#step 2 (alternate):
cat file* | gzip -c - > /the/new/fs/fullfile.gz

Then (and ONLY then), finally

#step 3:
rm file*
mv /the/new/fs/fullfile  .   #or fullfile.gz if you compressed it
  • Unfortunately an external usb disk requires physical access and nfs requires additional hardware, and I have neither. Anyway, thanks. =) – rush Jun 24 '13 at 17:58
  • I thought it would be that way... Rob Bos's answer is then what seems your best option (without risking losing data by truncating-while-copying, and without hitting FS limitations as well) – Olivier Dulac Jun 24 '13 at 18:12