
When I use rsync to copy files matching a pattern à la https://unix.stackexchange.com/a/2503/288916, it works, but it is excruciatingly slow: find locates the matching files much faster (more than 10 times faster). Is this normal? Can anything be done about this?

It seems like the better strategy is to just use find and copy only the matching results one by one (with scp or rsync).
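One direct way to realise that "find, then copy" strategy is a single find command (a sketch; note that cp -t is a GNU coreutils extension, and this variant flattens the copied files into ~/Output rather than preserving the subdirectory layout):

```shell
# Copy every matching file into ~/Output in one pass.
# Caveat: the destination tree is flattened (cp -t is a GNU extension).
find ~/LaTeX -type f -name '*.pdf' -exec cp -t ~/Output/ {} +
```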


Example command (see the linked question for an explanation of its purpose):

rsync -am --include='*.pdf' --include='*/' --exclude='*' ~/LaTeX/ ~/Output/
Kvothe
  • Roughly how many directories and files to scan? Are ~/LaTeX or ~/Output on a non-native filesystem such as FAT or NTFS, or on any kind of network filesystem? – Chris Davies May 11 '23 at 18:09
  • As a comparison point, using --include='*.*' as my local equivalent to your first include, a tree of 52 directories and 168 files takes 0.05s to scan (rsync --dry-run --verbose ...). One with 1858 directories and 308 files still takes only 0.08s. I'm not convinced it's directly an rsync issue, hence the questions about filesystems, etc. – Chris Davies May 11 '23 at 18:13
  • I was applying it to a directory tree containing 376569 files. find takes roughly 20 seconds while rsync takes roughly 2 minutes. Not familiar with file systems but stat -f -c %T . returns gpfs (see https://unix.stackexchange.com/a/21870/288916). It is also interesting that find seems to be doing some buffering so that the second same search or even just roughly similar is instantaneous. – Kvothe May 12 '23 at 12:07
  • Let me get some performance points for a filesystem with ~ 300K files... – Chris Davies May 12 '23 at 12:08
  • I am no longer sure I am doing a fair comparison. I have to redo my test. Perhaps rsync was just slower because I ran it first in both my tests. See https://serverfault.com/a/626492/601435. Edit: Well rsync is still slow even after a similar rsync command or after the find command. But the difference could perhaps still be in find using data stored in ram better. – Kvothe May 12 '23 at 12:11

1 Answer


When you compare find and rsync bear in mind that you're not really comparing like for like:

  • find is only scanning the source filetree,
  • rsync is not only scanning the source filetree but also doing a metadata comparison to the corresponding destination file (size, datetime, permissions, ownership) to see whether it needs to copy the source across to the target.

I'm not familiar with GPFS but it seems to be a clustering filesystem, which means that your file accesses are also potentially network bound. However, after the first run-through of the filesystem your Linux-based system will have done its best to cache the file tree so that subsequent accesses are memory based. I get a 30x speed increase on a tree of 140000 files just from this one optimisation.
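You can see this caching effect for yourself with something like the following (a sketch; dropping the page cache needs root, and the path is assumed to match your tree):

```shell
# Force a cold cache, then time the same scan twice.
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
time find ~/LaTeX -type f -name '*.pdf' >/dev/null   # cold-cache scan
time find ~/LaTeX -type f -name '*.pdf' >/dev/null   # warm-cache scan, typically far faster
```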

For comparison, you could consider the following code, which is closer to the rsync implementation shown in your question. I have found it to be about 200x slower than the corresponding raw find, even before the mkdir and cp are considered (prefix them with a : no-op to prevent their action).

cd ~/LaTeX &&
    find . -type f -name '*.pdf' |
        while IFS= read -r src
        do
            dst=~/Output/"$src"
            # Copy only if the destination is missing or differs in
            # size/mtime, mimicking rsync's quick check
            if [ ! -f "$dst" ] || [ "$(stat -c '%s-%Y' -- "$src")" != "$(stat -c '%s-%Y' -- "$dst")" ]
            then
                mkdir -p -- "${dst%/*}"    # create the destination directory
                cp -p -- "$src" "$dst"     # copy, preserving mode and timestamps
            fi
        done

It's definitely not as clever as rsync but I believe it's a fair starting point.

Finally, if you want to drive the file selection into rsync with find you can do that too:

cd ~/LaTeX &&
    find . -type f -name '*.pdf' -print0 |
        rsync -av --files-from - --from0 ./ ~/Output/

If you don't have find … -print0, replace it with -print and remove --from0 too. You may then hit issues with edge-case file names (those containing newlines or other unusual characters), but for most file names it will continue to work correctly.

Chris Davies
  • Thanks for all the effort! As for the difference, in the specific test case I had there were only 1 or 2 matches. I would expect that the fact that rsync does an extra comparison on a match would therefore change very little performance-wise. (If implemented well, it should not be looking at any metadata for files it already knows it is not going to act on.) I found using find and then rsync on the specific files that I found to be much faster than rsync alone (and this still does all the relevant comparisons that rsync should be doing, since it is used at the end). – Kvothe May 12 '23 at 13:18
  • And indeed find definitely seems to be using the cached information much better. – Kvothe May 12 '23 at 13:20
  • It is possible to feed files from find into rsync - answer updated for that too, @Kvothe – Chris Davies May 12 '23 at 13:28