I have huge files {0..9}.bin which I want to concatenate into out.bin. I don't need the original files afterwards, so I was wondering whether this is possible by only modifying the file system index, without copying the file contents (see Append huge files to each other without copying them for efficient copy solutions).

On modern file systems (e.g. btrfs), cp --reflink=always exists. FIFOs exist at the file system level (at least btrfs send also tracks FIFOs), so they should have information about the actual data blocks used. Therefore, cp --reflink=always should be able to determine the extent numbers on disk and re-use them.
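
For regular files this does work; a minimal check, assuming the files sit on a btrfs mount (the clone file name is just illustrative):

cp --reflink=always 0.bin 0-clone.bin   # clones extents; no data is copied
filefrag -v 0.bin 0-clone.bin           # on btrfs, both should list the same (shared) extents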

So I was wondering: is it possible to use mkfifo in combination with cp --reflink=always?

Update: Currently, this does not work:

# create nine ~1 GB test files
for i in {1..9}; do dd if=/dev/urandom of="in$i.bin" bs=5M count=200; done
mkfifo fifo
cat in* >fifo &
cp --reflink=always fifo out.bin

results in cp: failed to clone 'out.bin' from 'fifo': Invalid argument

Probably it never will, since FIFOs have no information about the storage origin but instead are just dumb pipes.
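
For reference, the concatenation itself appears to be possible at the extent level without a FIFO: the FICLONERANGE ioctl (exposed by xfs_io's reflink command) clones a source file's extents into a chosen offset of the destination. A minimal sketch, assuming btrfs or XFS with reflink support and input sizes that are multiples of the filesystem block size (the ioctl rejects unaligned ranges, except for a tail ending at the source's EOF):

: > out.bin                        # start with an empty destination on the same filesystem
off=0
for f in {0..9}.bin; do
    len=$(stat -c %s "$f")
    # clone $f's extents into out.bin at byte offset $off (no data copied)
    xfs_io -c "reflink $f 0 $off $len" out.bin
    off=$((off + len))
done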

  • Is your intention to save space or to save on write operations? – Kusalananda May 20 '20 at 08:53
  • @Kusalananda: Save on write operations (speed). – darkdragon May 20 '20 at 10:26
  • Why do you think a fifo helps you achieve your goal? A fifo (named pipe) is a tool that enables pushing data into and out of a RAM buffer abstracted by a special file. It is not a file and does not map blocks.... Maybe you're on to something, I just can't work it out. – Pedro May 20 '20 at 10:26
  • @Pedro: Yes, I am new to fifos, but inspecting this code https://unix.stackexchange.com/a/80439/234981 leads me to the conclusion that cat *.bin > myfifo.bin does not actually copy the contents into RAM but instead references them somehow. – darkdragon May 20 '20 at 10:28
  • I believe that would block until you read off the fifo, so the data goes nowhere until you cat myfifo.bin somewhere. cp --reflink=always refers to copy-on-write block operations, meaning blocks are not duplicated until they are written to, although this applies to files stored on the file system, not to a fifo (which isn't actually a file with data blocks). – Pedro May 20 '20 at 11:00
  • I suggest you try cat *.bin > out.bin and check whether its performance and space consumption are acceptable. If they are, you're solving a problem that isn't a problem. I completely understand the logic of your question, and conceptually the concatenation will duplicate all data blocks unnecessarily, but I don't think there's a mechanism that avoids this. Have a look at this: https://unix.stackexchange.com/questions/118244/fastest-way-to-concatenate-files – Pedro May 20 '20 at 11:04
  • @Pedro: I read this thread and was actually inspired to ask my question after reading it! The use case is the high-performance cloud storage provider minio. Currently (in fs mode), chunks are saved as they arrive, for S3 compliance. After the last chunk has been received, all files are concatenated and copied to the appropriate location. It handles TBs of data every day! – darkdragon May 20 '20 at 13:08
  • It may not be worth the effort and/or beyond your capabilities but an elegant solution would be to use FUSE for this. You would have to write a FUSE FS which shows just one file (the combined one, as immutable at best) and maps the reading position into the respective source file. I didn't check. Maybe someone has already done that. ;-) – Hauke Laging May 20 '20 at 16:35
  • This would be a cool feature, but what should happen if each individual file doesn't fully fill its last block? – rickhg12hs May 20 '20 at 18:14
  • @HaukeLaging: Easier than FUSE is losetup/dmsetup (a sketch follows these comments). But doing this for thousands of files is not a good idea, especially since read speed is crucial. – darkdragon May 20 '20 at 22:01
  • Yup, I understand your situation. Depending on your purpose for the large file, losetup might enable you to map a contiguous space consisting of those files? It's a guess. – Pedro May 21 '20 at 15:40
  • https://unix.stackexchange.com/questions/17084/mounting-multiple-img-files-as-single-loop-device – Pedro May 21 '20 at 15:42
  • @Pedro: This is similar to what I linked in my comment. As stated, this is not possible for the use-case I am describing. Thanks for your suggestions anyways! – darkdragon May 21 '20 at 23:42
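
For completeness, a minimal sketch of the losetup/dmsetup approach mentioned in the comments above, assuming root privileges and input sizes that are multiples of 512 bytes; note that the result is a single read-only block device (not a regular file), which may not fit every use case:

table=""
start=0
for f in {0..9}.bin; do
    dev=$(losetup -f --show -r "$f")   # attach $f to a read-only loop device
    sectors=$(( $(stat -c %s "$f") / 512 ))
    table+="$start $sectors linear $dev 0"$'\n'
    start=$((start + sectors))
done
printf '%s' "$table" | dmsetup create joined   # creates /dev/mapper/joined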

0 Answers