5

I have a backup disk which contains hundreds of backups of the same machine from different dates. The backups were made with rsync and hardlinks, i.e. if a file doesn't change, the backup script just makes a hardlink to the file in an older backup. So if a file never changes, there is essentially one copy of it on the backup disk, but say 100 hardlinks to it, one in each directory representing the backup of a given date (say back_1, back_2, ... back_n). If I want to thin the backups out, I delete a subset of them but not all. Suppose I want to delete back_5, back_6, ... back_10 (just as an example; in my real scenario there are many more). Then I try to parallelize it via:

echo back_5 back_6 back_10 | xargs -n 1 -P 0 rm -rf

This takes multiple hours. So is there any faster way to do this?
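
(For context, snapshots of this kind are typically produced with rsync's --link-dest option; the command below is an illustrative sketch with made-up paths, not the actual backup script.)

# Unchanged files in the new snapshot back_5 become hardlinks to the copies
# already stored in back_4 instead of being transferred again.
rsync -a --link-dest=../back_4 /source/ /backupdisk/back_5/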

student
  • Doesn't the -P 0 parallelisation only apply to threads? I wonder what the bottleneck is here… perhaps it's the read/write speed of the drive? If so, then perhaps parallelising it is having a negative effect. – Sparhawk May 21 '16 at 06:31
  • I guess the problem is the hardlinks. But I am not sure. The hard drive is a WD Red. It's been running for 24 hours now and has deleted (according to df -h) just 300GB. But I am not sure if df -h or du -hs report correct numbers because of the hardlinks. – student May 21 '16 at 07:02
  • Yes, I'm not sure either. I guess you could always test it. Write out a 100 MB file from /dev/urandom, which should take (0.100/300) × 24 × 60 × 60 ≈ 30 seconds to delete at that rate. Then test it with additional hardlinks still present. Then test it parallelised (a sketch of this test follows these comments). – Sparhawk May 21 '16 at 08:12
  • I am afraid parallelization won't help due to how unlink() is supposed to work. Speeding up deletion depends on many circumstances. https://unix.stackexchange.com/q/37329/13746 offers a number of good candidates to try. – xebeche Jun 14 '20 at 09:59
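
A sketch of the timing test Sparhawk suggests above; the file names and the 100 MB size are illustrative, and it should be run on the backup drive to get comparable numbers:

dd if=/dev/urandom of=testfile bs=1M count=100    # 100 MB file, no extra links
time rm testfile                                  # baseline deletion time
dd if=/dev/urandom of=testfile bs=1M count=100
ln testfile link1; ln testfile link2              # same data, extra hardlinks
time rm testfile                                  # data survives via link1/link2
rm link1 link2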

4 Answers

1

I can't see how your use of xargs in this way is anything but slow. My manpage says -P is the number of parallel processes and -n is the number of arguments per invocation. With GNU xargs, -P0 means "run as many processes as possible", so rather than being ignored it spawns a crowd of concurrent rm processes all competing for the same disk. And -n1 ensures you get one exec(2) of rm per filename, which is about the slowest possible way to do it.

I doubt parallelizing this work will buy you much. I would think just

$ echo filenames ... | xargs rm -rf 

would suffice. You could experiment with values like -P4 if you like. By not limiting the number of command-line arguments, you minimize the invocations of /bin/rm and let it proceed serially through your disk cache.
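
If you do experiment with -P, something like the following sketch is needed (assuming GNU xargs, with the example names from the question; -n 2 and -P 4 are arbitrary). Without -n, xargs would pack all six names into one rm and -P would have no effect:

echo back_5 back_6 back_7 back_8 back_9 back_10 | xargs -n 2 -P 4 rm -rf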

0

The df is reporting a small number because you're mostly deleting directories, which are relatively small. Also, depending on the filesystem, changes to directories and changes to the number of links to a file are journaled and/or synced to the disk immediately, since they're critical for fault recovery, and thus slower.

That's actually a testament to the efficiency of your linking!
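
As a side note on the numbers: du counts each inode only once per invocation, so a quick way to see how much the backups really share (a sketch, assuming GNU du and the example directory names from the question):

du -sh back_5                 # full size of one backup on its own
du -sh back_6                 # full size of another on its own
du -shc back_5 back_6         # run together: shared inodes are counted once,
                              # so back_6's line and the total shrink accordingly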

0

In my experience, the best way to speed up rsync+hardlink-based backups is to decrease the number of files you have.

A large number of small files slows down rsync a lot.

If you can organize your data so that your mostly-small-file, mostly read-only directories get tarred up, you should see a significant speed-up in your backup script. (With tools like archivemount, you can then access those archives without extracting them.)

Parallelizing the backup script probably won't help and might even slow it down (predictable, sequential disk access tends to be faster).
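
A minimal sketch of that idea, assuming a mostly read-only directory named photos/ and that archivemount (a FUSE tool) is installed; the names and mount point are illustrative:

tar -cf photos.tar photos/              # bundle many small files into one archive
mkdir -p ~/mnt/photos
archivemount photos.tar ~/mnt/photos    # browse the archive without extracting it
ls ~/mnt/photos/photos                  # files are readable in place
fusermount -u ~/mnt/photos              # unmount when finished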

Petr Skocik
  • The script is essentially this one: http://codereview.stackexchange.com/questions/29429/push-backup-script (including the improvements made by rolfl). Perhaps you can make your suggestion more explicit there. – student May 26 '16 at 08:45
0

This is also an experience-based response rather than one backed up by hard data.

I find that, when deleting many files in similar trees with lots of cross-links, it seems faster to delete isolated subtrees in parallel. Let me try to explain with a diagram:

topdir1
    |-a1
    |-b1
    |-c1

topdir2
    |-a2
    |-b2
    |-c2

topdir3
    |-a3
    |-b3
    |-c3

Rather than deleting topdir1, topdir2, topdir3 in parallel, my impression is that it's faster to delete a1, b1, c1 in parallel and then move on to a2, b2, c2, and so on. (My theory for this is that the multiple parallel unlinking of the "same" files causes contention for the inode link reference count, but I stress that I haven't checked this with hard data.)

for topdir in *
do
    echo "Removing $topdir..."
    # remove this backup's immediate subdirectories in parallel
    for sub in "$topdir"/*; do rm -rf "$sub" & done
    wait                 # let them all finish before moving on
    rm -rf "$topdir"     # then remove the (now nearly empty) top directory
done
Chris Davies
  • However the tree doesn't have depth 1; each sub again has multiple subdirectories, which also have multiple subdirectories, and so on. Perhaps one could implement your idea using find? – student May 26 '16 at 08:47
  • @student that's why I've used rm -rf – Chris Davies May 26 '16 at 08:58