When you compare find and rsync, bear in mind that you're not really comparing like for like: find is only scanning the source file tree, whereas rsync is not only scanning the source file tree but also comparing each file's metadata (size, modification time, permissions, ownership) against the corresponding destination file to see whether it needs to copy the source across to the target.
I'm not familiar with GPFS but it seems to be a clustering filesystem, which means that your file accesses are also potentially network bound. However, after the first run-through of the filesystem your Linux-based system will have done its best to cache the file tree so that subsequent accesses are memory based. I get a 30x speed increase on a tree of 140000 files just from this one optimisation.
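A rough way to see the caching effect for yourself is to time the same scan twice in a row. This is only a sketch for bash: `TREE` is a hypothetical variable I've introduced for the directory to scan, and the ratio you see will depend on your filesystem and on how much of the tree is already cached.

```shell
# TREE is a hypothetical variable naming the directory to scan (defaults to ".")
tree=${TREE:-.}

# First pass: directory metadata may have to come from disk or the network
time find "$tree" -type f -name '*.pdf' >/dev/null

# Second pass: on Linux this is typically answered from the kernel's in-memory caches
time find "$tree" -type f -name '*.pdf' >/dev/null
```

If the first pass is already fast, the tree was probably cached by an earlier command, so the two timings will look alike.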
For comparison you could consider the following code, which is closer to the rsync implementation shown in your question. I have found it's about 200x slower than the corresponding raw find, even before the mkdir and cp are considered (prefix them with the : no-op command to prevent their action).
cd ~/LaTeX &&
    find . -type f -name '*.pdf' |
    while IFS= read -r src
    do
        dst=~/Output/"$src"
        # Copy only if the destination is missing or differs in size or mtime (GNU stat syntax)
        if [ ! -f "$dst" ] || [ "$(stat -c '%s-%Y' -- "$src")" != "$(stat -c '%s-%Y' -- "$dst")" ]
        then
            mkdir -p -- "${dst%/*}"
            cp -p -- "$src" "$dst"
        fi
    done
It's definitely not as clever as rsync but I believe it's a fair starting point.
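To illustrate the `:` no-op trick, here is a self-contained sketch that runs the same size-and-mtime comparison against a scratch tree and only counts what it would have copied. The scratch directories, file names and the `would_copy` counter are all made up for the demonstration; `stat -c` assumes GNU coreutils.

```shell
# Scratch directories standing in for ~/LaTeX and ~/Output (hypothetical test data)
work=$(mktemp -d)
mkdir -p "$work/src/a" "$work/dst/a"
echo one > "$work/src/a/same.pdf"
cp -p "$work/src/a/same.pdf" "$work/dst/a/same.pdf"   # same size and mtime
echo two > "$work/src/a/new.pdf"                       # missing from the destination

would_copy=0
while IFS= read -r src
do
    [ -n "$src" ] || continue            # skip the empty line produced when find matches nothing
    rel=${src#"$work/src/"}              # path relative to the source root
    dst=$work/dst/$rel
    if [ ! -f "$dst" ] || [ "$(stat -c '%s-%Y' -- "$src")" != "$(stat -c '%s-%Y' -- "$dst")" ]
    then
        : mkdir -p -- "${dst%/*}"        # ":" ignores its arguments, so nothing is created
        : cp -p -- "$src" "$dst"         # likewise, nothing is copied
        would_copy=$((would_copy + 1))
    fi
done <<EOF
$(find "$work/src" -type f -name '*.pdf')
EOF

echo "would copy $would_copy file(s)"
```

Wrapping the loop in `time` then measures only the scan and the metadata comparison, which is the part that is comparable to what rsync has to do before it transfers anything.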
Finally, if you want to drive the file selection into rsync with find you can do that too:
cd ~/LaTeX &&
find . -type f -name '*.pdf' -print0 |
rsync -av --files-from - --from0 ./ ~/Output/
If you don't have find … -print0, replace it with -print and remove --from0 too. You may hit issues with edge-case file names (those containing newlines and other less common characters) but for most file names it will continue to work correctly.
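The newline edge case can be shown without involving rsync at all. The sketch below uses a scratch directory and invented file names, and assumes a find that supports `-print0`; it counts how many records a downstream reader would see in each mode.

```shell
# Scratch directory with one ordinary name and one containing a newline (hypothetical test data)
work=$(mktemp -d)
touch "$work/plain.pdf"
touch "$work/$(printf 'odd\nname.pdf')"

# Newline-separated output: the odd name is split across two lines
lines=$(find "$work" -type f -name '*.pdf' -print | wc -l)

# NUL-separated output: exactly one terminator per file
nuls=$(find "$work" -type f -name '*.pdf' -print0 | tr -cd '\0' | wc -c)

echo "records seen with -print: $lines, with -print0: $nuls"
```

With `-print`, a reader that splits on newlines sees three records for two files, and two of them are fragments that do not name any real file, which is the failure mode `--from0` avoids.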
~/LaTeX or ~/Output on a non-native filesystem such as FAT or NTFS, or on any kind of network filesystem? – Chris Davies May 11 '23 at 18:09

--include='*.*' as my local equivalent to your first include, a tree of 52 directories and 168 files takes 0.05s to scan (rsync --dry-run --verbose ...). One with 1858 directories and 308 files still takes only 0.08s. I'm not convinced it's directly an rsync issue, hence the questions about filesystems, etc. – Chris Davies May 11 '23 at 18:13

find takes roughly 20 seconds while rsync takes roughly 2 minutes. Not familiar with file systems but stat -f -c %T . returns gpfs (see https://unix.stackexchange.com/a/21870/288916). It is also interesting that find seems to be doing some buffering so that the second same search or even just roughly similar is instantaneous. – Kvothe May 12 '23 at 12:07

find using data stored in ram better. – Kvothe May 12 '23 at 12:11