Compare massive directories with progress report

Question

I just rsync-ed 2,000,000 files (3TB) from one RAID to another.

I want to make sure my data is intact.

rsync -c takes a really long time.

diff doesn't show me what it's doing.

Is there an alternative that's (a) faster, and (b) will show me progress while it's comparing?

(I'm on Mac, and brew search diff gives me apgdiff colordiff diffstat diffutils fmdiff libxdiff open-vcdiff podiff rfcdiff vbindiff bsdiff diffpdf diffuse dwdiff kdiff3 ndiff perceptualdiff rdiff-backup tkdiff wdiff ... would one of these do the job?)

Duplicate http://superuser.com/questions/708001/alternative-to-diff-with-progress-for-massive-directory-compare — spuder, Jan 28 '14 at 04:44
I'm also confused as to why rsync copied the data at around 150MB/s, yet diff compares at only 60MB/s ... ? — d0g, Jan 28 '14 at 06:19
The copy using rsync is faster because rsync by default does not use checksums to compare files; it looks at size and date info. When you use rsync -c, every file needs to have its checksum calculated, which is a burdensome task, hence why it's not the default. — slm, Jan 28 '14 at 07:37
Yes, but diff doesn't copy ... it just reads both files; while rsync, to copy, must read each byte, then write it. This was an rsync from scratch, so it was copying every file. — d0g, Jan 28 '14 at 08:06

Lesmana · Answer 1 · 2022-12-06T20:04:54.693

Here is diff with progress report based on file count:

diff -rqs dir1 dir2 | pv -l -s filecount > logfile

You will need pv (pipe viewer): http://www.ivarch.com/programs/pv.shtml

To get file count use the following command:

find dir1 -type f | wc -l

The above commands report progress based on file count. This works best if there are many smallish files. If your files are huge, it is less useful: you still see progress between files, which is better than nothing, but you cannot see how far along diff is within the current file.
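The file count can be fed to pv directly, so the two commands run as one. A minimal sketch, assuming pv is installed; the function name diff_with_progress and the fallback to plain diff when pv is missing are illustrative additions, not part of the commands above:

```shell
# diff_with_progress DIR1 DIR2
# Hypothetical helper: runs the diff | pv pipeline with the file count filled in.
diff_with_progress() {
    # Count regular files in DIR1; $(( )) strips the padding BSD wc adds.
    count=$(( $(find "$1" -type f | wc -l) ))
    if command -v pv >/dev/null 2>&1; then
        # diff -rqs prints one line per compared file, so pv counts lines
        # against the expected total.
        diff -rqs "$1" "$2" | pv -l -s "$count"
    else
        # Assumption: without pv, fall back to diff alone (no progress bar).
        diff -rqs "$1" "$2"
    fi
}

# Usage: diff_with_progress dir1 dir2 > logfile
```

The count comes from dir1 only, so files that exist only in dir2 would push the line count slightly past the estimate.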

Sadly I do not know of an easy way to report progress based on bytes compared.
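One rough workaround, offered as a sketch rather than a drop-in replacement: walk the tree yourself, compare each pair with cmp instead of diff, and add up the sizes of the files finished so far. Progress is then measured in bytes, although it still only updates after each file. The function name compare_with_byte_progress and the "DIFFER:" output format are made up for this example, and it assumes file names contain no newlines:

```shell
# compare_with_byte_progress SRC DST
# Hypothetical helper: compares each regular file under SRC with the same
# relative path under DST using cmp, and reports progress in bytes on stderr.
# Assumption: file names contain no newlines.
compare_with_byte_progress() {
    src=${1%/}
    dst=${2%/}
    # First pass: total size in bytes, so progress can be shown as a percentage.
    total=$(find "$src" -type f | while IFS= read -r f; do
                wc -c < "$f"
            done | awk '{ sum += $1 } END { printf "%.0f\n", sum }')
    done_bytes=0
    # Second pass: compare, then add the size of the file just finished.
    find "$src" -type f | while IFS= read -r f; do
        rel=${f#"$src"/}
        # cmp -s exits non-zero if the files differ or DST's copy is missing.
        cmp -s "$f" "$dst/$rel" || printf 'DIFFER: %s\n' "$rel"
        done_bytes=$(( done_bytes + $(wc -c < "$f") ))
        if [ "$total" -gt 0 ]; then
            printf '\r%s of %s bytes (%s%%)' "$done_bytes" "$total" \
                "$(( done_bytes * 100 / total ))" >&2
        fi
    done
    printf '\n' >&2
}

# Usage: compare_with_byte_progress dir1 dir2 > logfile
```

Results go to stdout and the progress line to stderr, so redirecting stdout to a logfile keeps the two apart, as with the pv variant.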

Explanation of diff options:

diff -r: compare not just the files in the given directories but also the files in their subdirectories, recursively.
diff -q: if files differ, print only the file names instead of the actual differences.
diff -s: also print the names of files that do not differ. This is important for the progress information from pv.

And pv options:

pv -l: measure progress by line count (instead of byte count).
pv -s count: treat count as the expected total, so pv can estimate the time to completion.

The redirect to logfile is for pretty output. Otherwise output from diff will mix with status line from pv.

You can omit -s count. Then you get no estimate of the time to finish, only the progress so far.

To filter the logfile for files that are different:

grep -v "^Files .* identical$" logfile

This variation will print files that are different in real time while also logging everything in logfile:

diff -rqs dir1 dir2 | pv -l -s filecount | 
    tee logfile | grep -v "^Files .* identical$"

Alternatively you can log only files that are different:

diff -rqs dir1 dir2 | pv -l -s filecount | 
    grep -v "^Files .* identical$" > logfile

If you can find your peace just comparing the metadata (and not the actual content of the files) then you can use rsync. This will be considerably faster.

For more details:

This should be the answer as it actually shows the progress. You may consider adding --speed-large-files for completeness. — midnite, Mar 03 '22 at 18:42

D McKeon · Answer 2 · 2014-01-28T07:41:32.163

edit for correction & option clarity - I forgot '--brief'

diff -rs --brief "$dir1" "$dir2" 

-r, --recursive              recursively compare any subdirectories found
-s, --report-identical-files report when two files are the same
-q, --brief                  report only when files differ
--speed-large-files      assume large files and many scattered small changes

and add other options to taste, depending on what you are comparing:

-i, --ignore-case            ignore case differences in file contents
-b, --ignore-space-change    ignore changes in the amount of white space
-B, --ignore-blank-lines     ignore changes whose lines are all blank
--strip-trailing-cr      strip trailing carriage return on input
--ignore-file-name-case  ignore case when comparing file names

diff -rs will read every byte of the original and copy, and report files that are the same.

The diff output format is defined by POSIX, so it is pretty portable. You may want to add something like:

| tee diff-out.1 | grep -v -Ee 'Files .* and .* are identical'

You could use checksums or hashes, but then you have to keep them sync'd with the file trees, so you would be back to reading every byte of every file anyway.

EDIT - too long to be a comment, in response to:

files over 10GB are not verifying

You may want to try this diff option: --speed-large-files

It is possible that the diff you are using is not coping well with very large files (bigger than system memory, for instance), and is thus reporting differences between files that are actually the same.

I had thought there was a -h option or a 'bdiff' that did better on large files, but I cannot find one in Fedora. I believe that the --speed-large-files option is a successor to a '-h' "half-hearted compare" option.

A different approach would be to repeat the rsync command you used, with '-vin' (verbose, itemize changes, dry run). This would report any differences that rsync finds - and there should not be any.
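A sketch of that re-run, assuming the original copy was made with rsync -a (the options actually used are not given in the question) and that rsync is installed. The function name rsync_recheck is illustrative:

```shell
# rsync_recheck SRC DST [extra rsync options...]
# Hypothetical helper: repeats the copy as a dry run and lists what rsync
# would still transfer. Nothing is written to DST because of -n.
rsync_recheck() {
    src=${1%/}
    dst=${2%/}
    shift 2
    # -a archive (assumed original option), -v verbose, -i itemize changes,
    # -n dry run. Trailing slashes make rsync compare directory contents.
    rsync -a -vin "$@" "$src/" "$dst/"
}

# Usage: rsync_recheck "$dir1" "$dir2"       # quick check: size and mtime
#        rsync_recheck "$dir1" "$dir2" -c    # full check: reads every byte
```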

To move some files, you're looking at a script something like:

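# cmp -s is silent and exits 0 only when the two files are byte-identical;
# test its exit status directly rather than wrapping it in [ ].
if cmp -s "$dir1/$path" "$dir2/$path" ; then
    target="$dir2/verified/$path"
    # Create the parent directory of the target (dirname, not basename).
    mkdir -p "$(dirname "$target")"
    mv "$dir2/$path" "$target"
fi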

but I don't recommend doing that. The underlying question is "how can I be sure that rsync copied a file hierarchy correctly?" and if you can demonstrate to yourself that rsync is working well, with diff or some other tool, then you can just rely on rsync, rather than working around it.
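For completeness, here is a sketch of how a loop could supply $path to that fragment. It assumes file names without newlines, the function name move_verified is hypothetical, and the caveat above still applies:

```shell
# move_verified DIR1 DIR2
# Hypothetical helper: moves every file in DIR2 that is byte-identical to its
# counterpart in DIR1 into DIR2/verified/, preserving the relative path.
move_verified() {
    dir1=${1%/}
    dir2=${2%/}
    # List files relative to DIR2, skipping the verified/ tree itself.
    ( cd "$dir2" && find . -path ./verified -prune -o -type f -print ) |
    while IFS= read -r path; do
        path=${path#./}
        # cmp -s exits 0 only when both files exist and are byte-identical.
        if cmp -s "$dir1/$path" "$dir2/$path"; then
            target="$dir2/verified/$path"
            mkdir -p "$(dirname "$target")"
            mv "$dir2/$path" "$target"
        fi
    done
}

# Usage: move_verified "$dir1" "$dir2"
```

Whatever remains in DIR2 outside verified/ afterwards either differs from the original or has no counterpart in DIR1.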

rsync -vin will compare based on whatever other options you give it. I thought it defaulted to checksum, but you are right, -c or --checksum is required for that.

The diff utility is really intended for files of lines of text, but it should report 'identical' under -s for binary files.

The --brief should suppress any file content output - my apologies for overlooking it earlier - it was semi-buried in an ugly script.

Is there a way to get it to mv every file it finds to a "verified" folder at the root of the drive, preserving the full path? E.g., if /disk1/a/b/c/file1 is identical to /disk2/a/b/c/file1, then move it to /disk1/verified/a/b/c/file1. Then I could end up with only the badly copied files. (So far LOTS of files over 10GB are not verifying, which is scary.) — d0g, Jan 28 '14 at 06:03
If I run rsync -vin -- does that do a byte-by-byte or checksum comparison? I thought rsync only compared size/date unless you add -c. And from what I've read the speed large files seems to only make a difference with non-binary files ... or am I wrong? — d0g, Jan 28 '14 at 07:12
diff gives me results in the form of "Files __ and ___ differ" ... and I'm running that through sed -e "s/Files /cp -afv /" -e "s/ and / /" -e "s/ differ$//" to try and generate a script for re-copying the bad files. But diff's output is unquoted, so it doesn't work. Can I get it to give me quoted paths? — d0g, Jan 28 '14 at 17:51

score 0 · Answer 3 · answered Jan 28 '14 at 04:48

I would look at using some sort of hash application to check data integrity. I know that many duplicate file finding utilities use hashes to identify duplicate/non-duplicates. Seems to me that this is an investigation that might be worthwhile.
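One way to try this, as a sketch: build a checksum manifest for each tree and diff the two manifests. It assumes sha256sum (GNU coreutils) or shasum (shipped with macOS) is available; hash_tree and compare_trees_by_hash are illustrative names, not existing tools.

```shell
# hash_tree DIR
# Prints "<sha256>  ./relative/path" for every regular file under DIR,
# sorted by path so that two manifests line up.
hash_tree() {
    ( cd "$1" || exit 1
      if command -v sha256sum >/dev/null 2>&1; then
          find . -type f -exec sha256sum {} +
      else
          find . -type f -exec shasum -a 256 {} +
      fi ) | sort -k 2
}

# compare_trees_by_hash DIR1 DIR2
# Exits 0 when the manifests match; otherwise prints the differing lines.
compare_trees_by_hash() {
    m1=$(mktemp) || return 2
    m2=$(mktemp) || return 2
    hash_tree "$1" > "$m1"
    hash_tree "$2" > "$m2"
    diff "$m1" "$m2"
    status=$?
    rm -f "$m1" "$m2"
    return "$status"
}

# Usage: compare_trees_by_hash /Volumes/raid1/data /Volumes/raid2/data
# (both paths are placeholders)
```

Each tree is read once and independently, so the two hash_tree runs could also be started in parallel, one per RAID, and the manifests compared afterwards.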

O T Coder

score 0 · Answer 4 · answered Jan 28 '14 at 08:11

You can use rdiff-backup for that. Install it on both servers and it will do smart comparisons of checksums and sync what is not yet there.

Timo
