Does splitting a file in more files necessarily mean that some/all of the overall content will not be where it was?

Question

I guess that, given a file of a certain size, not all of its bytes will be contiguous on disk (or will they? Judging from the existence of the phrase "defragmenting a disk", I assume they will not). But at least from the application's point of view, they are. I mean, I can use head -c [-]n + tail -c [-]n to extract a portion of a file, thinking of it as a contiguous sequence of bytes.

So say that a file is 10 bytes long and contains all equal bytes, e.g.

$ cat someFile
AAAAAAAAAA

is it possible to split it into two files, say someFile.part1 and someFile.part2, such that

$ cat someFile.part1
AAAAA
$ cat someFile.part2
AAAAA

and that actually no byte has been moved anywhere, in the sense that those 10 bytes are still precisely where they were before?

After all, the name someFile must one way or another map to the physical position (on the disk or on some kind of virtual memory that the OS (or the kernel?) makes me deal with) where the content actually starts. In a way, I imagine someFile not being much different from a pointer and a length, say 0xabc/10, whereas the target files would be 0xabc/5 and 0xac1/5. Maybe I'm being too much influenced by my C++ experience and really nonexistent file system experience :D

I'm not interested in doing it per se. I'm curious about how programs like lxsplit work and where their strengths are. Mostly out of curiosity, but maybe, why not, to play at writing one myself.

Does this answer your question? split/reference big file by offset reference — Marcus Müller, Apr 04 '23 at 18:31
what you ask has little to nothing to do with lxsplit (which I've not heard about till today): what any normal split program does is just copy bytes from the original file to the split files — Marcus Müller, Apr 04 '23 at 18:32
@MarcusMüller the reason I mentioned lxsplit is that it is an example of a utility for splitting files. Since there are several, I guess some of them are faster and some are slower, ... And I was wondering (but not asking) whether the topic I've asked about is relevant to making one such tool faster than it would be otherwise. — Enlico, Apr 04 '23 at 18:46
well, then see the answer to the question I linked to above. — Marcus Müller, Apr 04 '23 at 19:01
I don't think this is a duplicate of the link from MarcusMüller. This question seems to be asking about the capability of Linux file systems, whereas the linked question has answers discussing using loopback devices to create a view rather than manipulating the file system. It's a subtle difference but I think an important one. — Philip Couling, Apr 04 '23 at 19:45
@PhilipCouling please read the second half of my answer there… I address specifically this. — Marcus Müller, Apr 04 '23 at 20:09
@MarcusMüller Possibly a good lesson in writing long answers: always give an initial overview paragraph - it helps people navigate the information more easily. But no I still think the two questions are significantly different. People searching for solutions on how to achieve things are after significantly different information to those searching for understanding of the space. The other question is interesting in this context, but not a direct duplicate. — Philip Couling, Apr 04 '23 at 20:20
@PhilipCouling good advice, though I'll have to say that there's two big bold headings for the two things, one of which I specifically addressed. — Marcus Müller, Apr 04 '23 at 20:22
Are you using a filesystem, like ZFS, that can deduplicate stored data? — Kusalananda, Apr 08 '23 at 09:52

Philip Couling · Answer 1 · 2023-04-05T07:27:04.480

Under normal circumstances, splitting a file will always result in file content being copied. As far as I know, Linux offers no mechanism to split files any other way.
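
To make "content being copied" concrete, here is a minimal sketch of what a copying split does, written in Python rather than taken from any real tool. The helper name split_by_copy is hypothetical, the file names are the ones from the question, and the whole file is read into memory, which a real utility would avoid by copying in chunks.

```python
import os
import tempfile

def split_by_copy(path, offset):
    """Copy bytes [0, offset) and [offset, end) of `path` into two new files."""
    part1, part2 = path + ".part1", path + ".part2"
    with open(path, "rb") as src:
        with open(part1, "wb") as out:
            out.write(src.read(offset))  # first `offset` bytes, written to new storage
        with open(part2, "wb") as out:
            out.write(src.read())        # the remainder, also written to new storage
    return part1, part2

# Demo in a throwaway directory, using the question's 10-byte example.
workdir = tempfile.mkdtemp()
some_file = os.path.join(workdir, "someFile")
with open(some_file, "wb") as f:
    f.write(b"A" * 10)

part1, part2 = split_by_copy(some_file, 5)
```

The original file is left untouched, so after the split the same 10 bytes exist twice: once in someFile and once spread over the two parts.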

That's almost the whole story, but not quite...

Linux uses a file system driver to manage the files on disk, and the filesystem written to disk can sometimes be manipulated by other programs, typically while the filesystem is NOT mounted by Linux. One example is debugfs for ext2/ext3/ext4.

But there are hard limits on what a program might do if it manipulates the bytes of the filesystem itself...

Background

In most (all?) file systems, file and directory names are stored separately from the file data. Metadata such as file ownership and timestamps may be stored in yet another place, commonly known as an "inode".

This allows attaching two file names to the same file (AKA hard links).
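
As a small illustration of two names sharing one inode, the following Python sketch creates a hard link in a temporary directory and compares inode numbers. The name someFile.alias is made up for the demo, and it assumes the temporary directory lives on a file system that supports hard links.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "someFile")
alias = os.path.join(workdir, "someFile.alias")  # hypothetical second name

with open(original, "wb") as f:
    f.write(b"A" * 10)

os.link(original, alias)  # add a second directory entry for the same inode

same_inode = os.stat(original).st_ino == os.stat(alias).st_ino
link_count = os.stat(original).st_nlink  # number of names attached to the inode
```

Note that both names refer to the whole file: a hard link always names an entire inode, never a byte range inside it.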

File data is stored in blocks. The default block size for ext4 is 4KiB (4096 bytes). Normally, saving just one byte will require a whole block, but then more bytes can be saved to the file "for free" until the block is full.
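
The arithmetic behind this can be sketched in a few lines of Python. This is a simplified model that assumes every file occupies whole blocks of the 4096-byte size quoted above; it ignores features such as inline data or tail packing, and the function names are only illustrative.

```python
BLOCK_SIZE = 4096  # ext4 default quoted above; other file systems may differ

def blocks_needed(size_in_bytes, block_size=BLOCK_SIZE):
    """Number of whole blocks required to hold size_in_bytes of data."""
    return -(-size_in_bytes // block_size)  # ceiling division

def slack_bytes(size_in_bytes, block_size=BLOCK_SIZE):
    """Bytes still free in the last allocated block."""
    return blocks_needed(size_in_bytes, block_size) * block_size - size_in_bytes
```

Under this model the 10-byte file from the question occupies one block and leaves 4086 bytes of that block unused, which is the space that later writes can fill "for free".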

When the first block is full and more bytes need to be written, the file system must find a new (available) block. Then:

It marks the new block unavailable so that no other file tries to use it
and records where this block is inside the file

Different file systems have different techniques for these two things, but the requirement is always the same. File system drivers do their best to ensure that new blocks follow on immediately from previous ones in the file but this cannot be guaranteed. Fragmentation happens when the next block has already been allocated to a different file.

When might files quietly share some blocks?

Some file systems and some block-level storage systems have the ability to perform de-duplication. If the exact content written to a block matches the exact content of another block, then it is possible that only one block is kept. Sometimes this means the duplicates are never written; sometimes it means they are removed retrospectively.
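
The idea can be illustrated with a toy Python sketch that hashes fixed-size blocks and counts how many distinct ones there are. This is only a model of the principle, not how any particular file system or storage layer implements de-duplication; the block size and the use of SHA-256 are assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for the illustration

def count_blocks(data, block_size=BLOCK_SIZE):
    """Return (total_blocks, unique_blocks) for a bytes object."""
    digests = [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]
    return len(digests), len(set(digests))

# Four blocks that all contain the same byte collapse to one unique block.
total, unique = count_blocks(b"A" * (4 * BLOCK_SIZE))
```

A de-duplicating store would need to keep only the unique blocks, which is why a file full of identical bytes can end up sharing storage without anyone having asked for it.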

As Marcus Müller points out some file systems can perform duplicate block removal.

It is also a feature of LVM, where it happens at the block level, outside the knowledge of the filesystem.

Hard limits

Hypothetically, you could manipulate a file system by creating a new inode, adding a filename for that inode, and then reallocating blocks from one inode to another.

Again Linux does NOT have a mechanism for this, but there's nothing to stop you playing with the actual bytes on your hard drive to make this happen if you know how.

What's not [usually] possible is splitting a file anywhere you like without moving bytes. Block sizes are non-negotiable. You can't have half used blocks in the middle of a file for most file systems.

This means that in your example a new block will (almost) certainly be written, because the split point, 5 bytes into a 10-byte file, is not a multiple of 4096 or any other common block size.
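
The alignment condition is easy to state in code. The sketch below is a hypothetical Python helper that assumes the 4096-byte block size mentioned earlier in this answer; it only checks the necessary condition and says nothing about whether a given file system offers a way to act on it.

```python
BLOCK_SIZE = 4096  # assumed block size; check your file system's actual value

def split_is_block_aligned(offset, block_size=BLOCK_SIZE):
    """True if a split at `offset` falls exactly on a block boundary."""
    return offset % block_size == 0

question_split = split_is_block_aligned(5)        # the question's example
aligned_split = split_is_block_aligned(2 * 4096)  # a split after two full blocks
```

Only when this check passes could whole blocks even in principle be reassigned to a second file without rewriting their contents.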

However, different file systems have different capabilities. BTRFS, for example, supports tail packing.

To do what OP wants, wouldn't directory entries need offsets to (i) the end of one file, and (ii) the beginning of another file, plus something like hard links that can point to any random block in another file? — RonJohn, Apr 04 '23 at 20:23
And I can imagine the chaos involved when the original file gets edited... — RonJohn, Apr 04 '23 at 20:23
well the thing is that this answer really doesn't address the split at all; you can't hard or symlink to "the second half of a file"! That's where my slight negativity towards this answer comes from. Please don't take any of this personally! — Marcus Müller, Apr 04 '23 at 20:25
@MarcusMüller I really haven't suggested you can. I've in fact started by saying you can't. — Philip Couling, Apr 04 '23 at 20:27
@PhilipCouling yes, indeed, that's a bit like saying "you have a nice question, but let me answer a different one", isn't it :) Again, really, please don't take it personally! — Marcus Müller, Apr 04 '23 at 20:39
@MarcusMüller Absolutely not, no. The OP has not said "how do I do this"; they've said "does this thing I know imply that thing I believe?". That calls for a) confirmation / rejection of their belief, and b) an explanation of why. Those two things are exactly what I have written in this answer. No, I reject the idea that I've answered a question other than the one which was asked. — Philip Couling, Apr 04 '23 at 20:46
@PhilipCouling and that's very admirable! I think just in this one case you kind of missed the target. But, really, that's only for me to comment/vote on – the asker will accept your answer if they think it hits what they wanted to know. I assume they'll read both our answers and be much smarter, either way, and I'm really not here for the imaginary internet glory points :) — Marcus Müller, Apr 04 '23 at 20:54

score 1 · Answer 2 · answered Apr 08 '23 at 09:49

Generally no - this can't be done. Filesystems are (except for very specialized counter-examples) designed for block-oriented storage systems (disks, SSD, NVMEs ...) and a file consists of blocks of data and metadata. The metadata includes permissions, ACLs and such, but especially a file length and an ordered list of blocks for the file. There are some more advanced organizations of files, but all of these can be resolved to a length and an ordered list of data blocks.

In the example shown it is possible to convert someFile to someFile.part1 by truncation. Truncation involves changing the file length metadata and de-allocating any blocks no longer needed on the block list. This can be done without any copy. Creating someFile.part2 is the difficulty. In the example, characters 6 through 10 ("AAAAA") do not lie at the beginning of a block (block sizes are powers of 2, typically 512 B or 4096 B), so they cannot be the first part of a file unless they are copied to the beginning of a block.
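
A rough Python sketch of that asymmetry, using the question's 10-byte example in a temporary directory: the second half has to be copied out before the original is truncated and renamed to become the first half. The ordering and the rename are choices made for this illustration, not a description of how any existing split tool works.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
some_file = os.path.join(workdir, "someFile")
part1 = some_file + ".part1"
part2 = some_file + ".part2"

with open(some_file, "wb") as f:
    f.write(b"A" * 10)

# The tail must be copied: it cannot become the start of a file in place.
with open(some_file, "rb") as src, open(part2, "wb") as out:
    src.seek(5)
    out.write(src.read())

# The head needs no data copy: shorten the file, then give it its new name.
os.truncate(some_file, 5)
os.rename(some_file, part1)
```

Only the bytes after the split point are rewritten here; the first five bytes stay in the blocks they already occupied.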

Even if we wished to split a file's data at a block boundary, the kernel filesystem code does not expose this sort of feature to userspace, so it would require kernel filesystem modifications.
