Due to a lack of storage space, can a file be split by reference/symbolically (i.e., split virtually), so that the parts point to specific offsets in the big file?
Not directly, no. That's not how files work in the POSIX way of thinking: they're independent, atomic units of data.
Two options:
Loopback devices
This is a runtime solution: nothing is stored on disk, so the mapping needs to be set up manually (and set up again after a reboot). That might be an advantage or a disadvantage!
You can set up a loopback device quite easily; if you're in a freedesktop-compatible desktop session (i.e., you're logged in to your machine graphically and are running GNOME, Xfce, KDE, …), udisks is your friend:
blksize=$((2**20))
udisksctl loop-setup -s $blksize -f /your/large/file
udisksctl loop-setup -s $blksize -o $blksize -f /your/large/file
- The first command gives you a /dev/loopN which starts at byte 0 and goes on for 2²⁰ bytes (i.e., one mebibyte).
- The second command gives you a /dev/loopN+1 which starts at byte 2²⁰ and goes on for 2²⁰ bytes (again one mebibyte). (Note that we start counting at 0, so this begins exactly after the first chunk.)
You can then use these two loopback devices, e.g. /dev/loop0 and /dev/loop1. They literally describe a "view" into your file: if you change something in these block devices, you change it in your large file.
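To sanity-check what such a "view" contains, the same byte range can be read straight from the large file without any loop device. This is only a minimal sketch: read_range is a hypothetical helper name of mine, and it assumes a head that supports -c (GNU and BSD do).

```shell
# Write bytes [offset, offset+length) of file $1 to stdout.
# read_range is a made-up helper for illustration, not part of any tool.
read_range() {
    local file=$1 offset=$2 length=$3
    # tail -c +K starts output at byte K (counted from 1), hence offset+1
    tail -c "+$((offset + 1))" "$file" | head -c "$length"
}
```

For example, `read_range /your/large/file 1048576 1048576` should produce the same bytes you'd read from the second loop device above.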
If you're not logging in graphically, the exact same can be achieved, but you need root privileges:
blksize=$((2**20))
sudo losetup --sizelimit $blksize -f /your/large/file
sudo losetup --sizelimit $blksize -o $blksize -f /your/large/file
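If you need more than two chunks, the offsets follow the same pattern. Here's a rough sketch that only prints the losetup calls (it doesn't run them); chunk_ranges and print_losetup_commands are hypothetical helper names, and it assumes GNU stat for getting the file size.

```shell
# Print "offset length" pairs covering total_size bytes in chunks of
# chunk_size bytes; the last chunk may be shorter.
# (hypothetical helper, for illustration only)
chunk_ranges() {
    local total_size=$1 chunk_size=$2 offset=0 length
    [ "$chunk_size" -gt 0 ] || return 1   # avoid an endless loop
    while [ "$offset" -lt "$total_size" ]; do
        length=$chunk_size
        if [ $((offset + length)) -gt "$total_size" ]; then
            length=$((total_size - offset))
        fi
        echo "$offset $length"
        offset=$((offset + length))
    done
}

# Dry run: print one losetup call per chunk, execute nothing.
print_losetup_commands() {
    local file=$1 chunk_size=$2 total_size
    total_size=$(stat -c %s "$file")      # GNU stat: size in bytes
    chunk_ranges "$total_size" "$chunk_size" |
    while read -r offset length; do
        echo "sudo losetup --sizelimit $length -o $offset -f $file"
    done
}
```

Calling `print_losetup_commands /your/large/file 1048576` lists the commands, so you can look them over before running any of them.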
Reflinking the storage blocks
This is an on-disk solution. It will also make a "logical" copy, i.e., if you change something in your small files, it will not be reflected in the large file.
You must use a file system that supports reflink (to the best of my knowledge, these are XFS, Btrfs and some network file systems). (File system blocks through which a split goes would need to be duplicated, but for most file systems, we're talking less than 4 kB here.)
In that case, your file system can copy files without the copy using any space of its own, as long as the copy or original aren't changed (and even if they are, only the affected parts are duplicated).
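If you're not sure whether a given directory lives on such a file system, one quick probe is to ask GNU cp for a reflink-only copy, which fails instead of silently falling back to a normal copy. A minimal sketch, assuming GNU coreutils; supports_reflink is a hypothetical helper name, and the 64 KiB probe size is an arbitrary choice of mine.

```shell
# Return 0 if a reflink copy succeeds inside directory $1, 1 otherwise.
# (hypothetical helper; needs write access to the directory)
supports_reflink() {
    local dir=$1 src dst rc=1
    src=$(mktemp "$dir/reflink_probe.XXXXXX") || return 1
    dst="$src.copy"
    # small probe file with real (non-sparse) data
    head -c 65536 /dev/zero > "$src"
    # --reflink=always makes GNU cp fail rather than do a normal copy
    if cp --reflink=always "$src" "$dst" 2>/dev/null; then
        rc=0
    fi
    rm -f "$src" "$dst"
    return "$rc"
}
```

Something like `supports_reflink /some/dir && echo "reflinks work here"` then tells you whether the rest of this section applies.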
So, on such a file system, we have two options:
- make the split into the first and the second half "normally", and then ask a utility (duperemove) to compare the three files and deduplicate.
- make the copy of the two halves of your file in a manner that hints the file system to directly avoid using twice the space.
Since the first option temporarily needs twice the space, let's do the second right away. I wrote a small program to do that split for you (full source code) (Attention: Found a bug, which I'm not going to fix. This is just an example). The relevant excerpt is this:
// this is file mysplit.c , a C99 program
// SPDX-License-Identifier: Linux-man-pages-copyleft
// (heavily based on the copy_file_range(2) man page example)
// compile with `cc -o mysplit mysplit.c`
// […]
int main(int argc, char* argv[])
{
  […]
  ret = copy_file_range(fd_in      /*input file descriptor*/,
                        &in_offset /*address of input offset*/,
                        fd_out     /*output file descriptor*/,
                        NULL       /*address of output offset*/,
                        len        /*amount of bytes to copy*/,
                        0          /*flags (reserved, must be ==0)*/);
  […]
}
Let's try that out. We'll have to:
- get and compile my program
- make an XFS file system first, as that's one of the file systems that support reflinks
- mount it
- put a large file filled with random data inside
- check the free space on that file system
- add splits of the file (using my program)
- check the free space again.
The script below does just that.
# replace "apt" with "dnf" if you're on some kind of redhat/fedora
# replace "apt install" with "pacman" followed by a randomly guessed amount of options involving the letters {y, s, S, u} if you're on arch or manjaro
apt install curl gcc xfsprogs
# Download the C program from above
curl https://gist.githubusercontent.com/marcusmueller/d1e0235f9a484cb44626e35460a5c0ac/raw/6295f9f6371b916b87a5d0a5a6edad65f9ea8627/mysplit.c > mysplit.c
# Compile
cc -o mysplit mysplit.c
# Demo: (doesn't need root privileges if you've got udisks)
# - make a file system in a file system image,
# - mount that image,
# - create a 300 MB file in there,
# - split it,
# - show there's still nearly the same amount of free space
# Make file system in a file system image
fallocate -l 1G filesystemimage # 1 GB in size
# this is a bit confusing, but to make the root of the file system
# world-writable, we do:
# (bash sets UID but not GID, so we ask id for both)
echo "thislinejustforbackwardscompatibility/samefornextline
1337 42
d--777 $(id -u) $(id -g)" > /tmp/protofile
mkfs.xfs -p /tmp/protofile -L testfs filesystemimage
# Mount that image
loopdev=$(LANG=C udisksctl loop-setup -f filesystemimage | sed 's/.* as \(.*\)\.$/\1/')
# on most systems, that new device is automatically mounted.
sleep 3
udisksctl mount -b "${loopdev}"
# create a 300 MB file in there,
target="/run/media/$(whoami)/testfs"
rndfile="${target}/largefile"
dd if=/dev/urandom "of=${rndfile}" bs=1M count=300
# Check free space
echo "free space with large file"
df -h "${target}"
# split it,
# (copy the first 100 MB)
./mysplit "${rndfile}" "${target}/split_1" 0 $((2**20 * 100))
# (copy the next 120 MB, just showing off that splits don't need to be of uniform size)
./mysplit "${rndfile}" "${target}/split_2" "$((2**20 * 100))" "$((2**20 * 120))"
# show there's still nearly the same amount of free space
# Check free space
echo "free space with large file + splits"
df -h "${target}"
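Free space staying the same only shows that no extra blocks were used; it doesn't show that the splits hold the right bytes. A minimal sketch for checking that, assuming GNU cmp and GNU stat; verify_split is a hypothetical helper name, and its argument order (big file, split, offset, length) is my own choice.

```shell
# Return 0 if split file $2 equals bytes [offset, offset+length) of file $1.
# (hypothetical helper, for illustration only)
verify_split() {
    local big=$1 split=$2 offset=$3 length=$4
    # the split must have exactly the expected size
    [ "$(stat -c %s "$split")" -eq "$length" ] || return 1
    # GNU cmp: skip $offset bytes of the first file, none of the second,
    # and compare at most $length bytes; -s suppresses output
    cmp -s -i "$offset:0" -n "$length" "$big" "$split"
}
```

With the variables from the script, `verify_split "${rndfile}" "${target}/split_1" 0 $((2**20 * 100))` should succeed if the first split is intact.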
map_file_range methodology in the VFS, it should work just the same with btrfs) when you copy_file_range. Added a demo to the end of my post to show that! – Marcus Müller Mar 14 '23 at 09:07