I have deduplicated my Btrfs filesystem with bedup, so now all duplicate files (above a certain size) are "reflink" copies.
Is there any way to see, given a filename, what other files are the same reflinks?
The whole point of having a copy-on-write (CoW) filesystem like btrfs is that the contents of multiple versions of a file can be shared efficiently. You can think of a file as a collection of ranges with contents, where each range's content may or may not be shared by other files, or by other versions of the same file. The implementation is more like a tree of extents, where extents may be shared.
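To make that concrete, here is a small illustration (the mount point and file names are hypothetical). A reflink copy made with cp --reflink=always shares its extents with the original, and a recent btrfs-progs can report that sharing with btrfs filesystem du:

    # Run on any btrfs mount point (paths here are made up).
    cd /mnt/btrfs
    dd if=/dev/urandom of=original bs=1M count=64   # 64 MiB of data
    cp --reflink=always original clone              # clone the extents instead of copying the data

    # `btrfs filesystem du` distinguishes exclusive from shared data;
    # the output below is illustrative, not verbatim:
    btrfs filesystem du original clone
    #      Total   Exclusive  Set shared  Filename
    #   64.00MiB       0.00B    64.00MiB  original
    #   64.00MiB       0.00B    64.00MiB  clone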
The same mechanism that is used when writing a change to a file (and thereby producing a new version of it) is used to do the deduplication. The implementation is described at https://github.com/g2p/bedup :
Deduplication is implemented using a Btrfs feature that allows for cloning data from one file to the other. The cloned ranges become shared on disk, saving space.
The kernel-side implementation is (for example) at http://lxr.free-electrons.com/source/fs/btrfs/ioctl.c#L2843; the comment makes it clear that it is not about "reflinking" a whole file, but about ranges:
    2843 /**
    2844  * btrfs_clone() - clone a range from inode file to another
    2845  *
    2846  * @src: Inode to clone from
    2847  * @inode: Inode to clone to
    2848  * @off: Offset within source to start clone from
    2849  * @olen: Original length, passed by user, of range to clone
    2850  * @olen_aligned: Block-aligned value of olen, extent_same uses
    2851  *                identical values here
    2852  * @destoff: Offset within @inode to start clone
    2853  */
So it is not the file that is reflinked, it's the ranges that are shared. A new file could also be constructed by sharing ranges with multiple files, or by sharing ranges across volumes, or (not sure whether this is currently supported) even by having the same range appear multiple times in the same file ;)
Therefore, no high-level tool exists to find files whose entire contents are shared, since "whole-file reflink" is a derived concept. Of course it would be possible to write support for it, but as far as I know that has not been done...
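The range-based nature of the mechanism is easy to observe from user space, though. The sketch below assumes xfs_io from xfsprogs is installed; its reflink command issues the generic clone-range ioctl, which btrfs implements. It clones only part of a file and then inspects the result with filefrag:

    cd /mnt/btrfs                                   # hypothetical btrfs mount
    dd if=/dev/urandom of=big bs=1M count=16

    # Clone only the first 4 MiB of `big` into offset 0 of a new file `partial`
    # (arguments are: source file, source offset, destination offset, length):
    xfs_io -f -c "reflink big 0 0 4194304" partial

    # The cloned extents should now carry the "shared" flag in both files:
    filefrag -v big partial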
I have just released a program called fienode, which computes a SHA1 hash of the physical extents of a file. Identical CoW copies have the same hash.
In principle, you can run this across all files on the filesystem and then look for identical hashes.
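A rough sketch of such a scan could look like the following; it assumes that fienode FILE prints a 40-character SHA1 digest on standard output, so adjust the grouping width if the actual output format differs:

    # Hash the physical extents of every file and group identical hashes.
    find /mnt/btrfs -xdev -type f -print0 |
    while IFS= read -r -d '' f; do
        printf '%s  %s\n' "$(fienode "$f")" "$f"
    done |
    sort | uniq -w40 --all-repeated=separate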
There is also a more detailed answer here, explaining why this is necessary.
Note, however, that BTRFS is at liberty to change the physical extents. I have observed a large reflinked file change its physical extents without provocation, making the fienode output different even though the majority of the physical extents were still shared.
You could run filefrag -v on all the files and find common ranges. – Stéphane Chazelas May 16 '14 at 16:32
Run sync first, especially if the flags show unknown_loc or delalloc – Stéphane Chazelas Dec 18 '16 at 21:48
I copied with cp --reflink=always and the flag is still inline, but the files are (I have to assume cp works, but I can't prove it) reflinked. Am I missing something else? – Hilikus Dec 18 '16 at 23:15
cp --reflink=always doesn't always create a reflink. If the file is small enough that its data can be stored with the metadata (inline) then --reflink=always doesn't do any reflinking – Hilikus Dec 18 '16 at 23:31
When comparing filefrag -v output, compare filefrag -v | grep -ve inline -e unknown_loc -e delalloc instead, as those can't be reflinked. You'll notice another clue in that the "physical offset" is always 0 for those (even though of course they're not all located at the start of the block device) – Stéphane Chazelas Dec 19 '16 at 15:53
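The approach from the last comment could be sketched like this for two candidate files; the awk field numbers assume the extent-table layout printed by e2fsprogs' filefrag -v, so treat it as illustrative rather than a robust tool:

    # Print the physical extent table of a file, dropping extents that
    # cannot be reflinked (inline, unknown_loc, delalloc).
    physical_extents() {
        filefrag -v "$1" |
        grep -ve inline -e unknown_loc -e delalloc |
        awk '/^ *[0-9]+:/ { print $4, $5, $6 }'   # physical start, physical end, length
    }

    diff <(physical_extents fileA) <(physical_extents fileB) &&
        echo "fileA and fileB appear to share all their extents"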