109

I want to make a fresh new copy of a large number of files from one local drive to another.

I've read that rsync does a checksum comparison of files when sending them to a remote machine over a network.

  1. Will rsync make the comparison when copying the files between two local drives?

  2. If it does do a verification - is it a safe bet? Or is it better to do a byte by byte comparison?

Frez
  • 1,193

7 Answers

113

rsync always uses checksums to verify that a file was transferred correctly. If the destination file already exists, rsync may skip updating the file if the modification time and size match the source file, but if rsync decides that data need to be transferred, checksums are always used on the data transferred between the sending and receiving rsync processes. This verifies that the data received are the same as the data sent with high probability, without the heavy overhead of a byte-level comparison over the network.

Once the file data are received, rsync writes the data to the file and trusts that if the kernel indicates a successful write, the data were written without corruption to disk. rsync does not reread the data and compare against the known checksum as an additional check.
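Because rsync does not re-read what it wrote, a separate byte-by-byte check can be run after the copy. The following is a minimal sketch, not anything rsync provides: the function name verify_bytes is hypothetical, and it assumes a cmp utility is installed. The re-read may be served from the operating system's cache rather than from the drive itself.

```shell
# Hypothetical helper (not part of rsync): re-read both files and
# compare them byte by byte after a copy has finished.
verify_bytes() {
    local src=$1 dst=$2
    # cmp -s is silent and exits 0 only when the two files are identical.
    if cmp -s "$src" "$dst"; then
        printf 'OK: %s\n' "$dst"
    else
        printf 'MISMATCH or unreadable: %s\n' "$dst" >&2
        return 1
    fi
}

# Example use after a copy (paths are placeholders):
#   rsync bigfile /mnt/usb/bigfile && sync && verify_bytes bigfile /mnt/usb/bigfile
```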

As for the verification itself, for protocol 30 and beyond (first supported in 3.0.0), rsync uses MD5. For older protocols, the checksum used is MD4.

While long considered obsolete for secure cryptographic hashes, MD5 and MD4 remain adequate for checking file corruption.

Source: the man page and eyeballing the rsync source code to verify.

Kyle Jones
  • 15,015
  • 11
    I hate to burst everyone’s bubble but rsync only does check sum verification if the -c flag is added! –  Jan 21 '13 at 21:32
  • 40
    @clint No, the answer is correct. From the man page's explanation of the -c flag: "Note that rsync always verifies that each transferred file was correctly reconstructed on the receiving side by checking a whole-file checksum that is generated as the file is transferred, but that automatic after-the-transfer verification has nothing to do with this option's before-the-transfer "Does this file need to be updated?" check." – Michael Mrozek Jan 21 '13 at 21:41
  • @Kyle Jones - so that means when rsync exits, the file is completely received? This is different from scp which just sends the data and doesn't worry about if it is fully received before exiting, right? – David Doria Mar 18 '14 at 14:07
  • 15
    This answer does not make it clear if it actually verifies the file after a copy. If the checksum is computed as the file is being received, then it is not a post-copy checksum and you cannot be sure that the file is written correctly. You would then need to perform an additional comparison. – Andre Miller Mar 24 '15 at 21:26
  • 3
    @AndreMiller Thanks for the comment. I've updated the answer to address that issue. – Kyle Jones Mar 24 '15 at 22:02
  • 13
    Down-voting because I don't like the fact that this answer is detailed, well written, and technically correct, and at the same time so far off topic that it misleads readers. The problem is that the answer goes into great detail on what happens during transfer while the questioner specifically states that he cares about local copies and not network transfers. I'm pretty sure Kyle Jones didn't want to mislead anyone but this answer (IMHO) does. – ndemou Jun 29 '16 at 19:38
  • 3
    @ndemou "Always uses checksums" means for local copies as well as remote copies. rsync's behavior isn't much different when it is copying locally; it still uses two processes and uses the same protocol to communicate and transfer files between them. The processes just happen to be running on the same machine. – Kyle Jones Jun 29 '16 at 23:25
  • 3
    Hi Kyle, I believe the question asks: "1) when copying locally does rsync add checks to verify that the copy is identical to the original and 2) how trustworthy are these checks?". You note that rsync spawns 2 processes: one reading the original and one writing to the copy and you also note that the transfer of bytes between these processes is checksummed. But this is marginally (if at all) helpful in guaranteeing the correctness of the copy because the probability of disk-related issues corrupting the data is far higher than the probability of CPU/RAM issues. – ndemou Jun 30 '16 at 08:36
  • 3
    @ndemou The answer covers the lack of verification of the disk data after it is written. The OP accepted the answer. – Kyle Jones Jul 01 '16 at 04:30
  • 7
    Kyle, I don't believe your answer is wrong. I already noted it's "detailed well written and technically correct" but it requires the reader to be unnecessarily focused and careful. Why cover the lack of verification of the disk data, which is what is being asked about, halfway through your answer, after 117 words that repeatedly describe another, irrelevant verification process? Anyway thanks for your time and interest in this discussion. I sincerely appreciate it. – ndemou Jul 01 '16 at 08:43
  • 3
    Almost four years later, but I somewhat agree with @ndemou. I think it would be best to preface the whole thing with a sentence that indicates "checksum always for transfer, not for disk write". I read the whole answer, and didn't realize until the comments that I misread the second paragraph. – cheshirekow Dec 15 '19 at 21:07
  • 1
    If the receiving end does not re-read the data, how can the checksum be verified? On the sender side, the checksum is calculated, and then [file + its checksum] is sent. The receiver (either another hard disk or a machine across the network) gets [file + its checksum] and writes the file to disk. The receiver has to calculate the checksum of the received file and then verify its own calculation against the received checksum, doesn't it? – midnite Mar 10 '22 at 18:14
  • I thought the default check was file size, date, name, etc. unless specified otherwise, that is, with '-c', the checksum check. – AndrewC Nov 24 '22 at 19:10
  • @AndrewC Those checks determine whether the remote file should be updated. Checksums are used to verify transferred file contents even if you don’t specify -c. – Kyle Jones Nov 25 '22 at 00:35
61

rsync does not do the post-copy verification for local file copies. You can verify that it does not by using rsync to copy a large file to a slow (i.e. USB) drive, and then copying the same file with cp, i.e.:

time rsync bigfile /mnt/usb/bigfile

time cp bigfile /mnt/usb/bigfile

Both commands take about the same amount of time, therefore rsync cannot possibly be doing the checksum—since that would involve re-reading the destination file off the slow disk.

The man page is unfortunately misleading about this. I also verified this with strace: after the copy is complete, rsync issues no read() calls on the destination file, so it cannot be checksumming it. One more way to verify it is with something like iotop: you see rsync reading and writing simultaneously (copying from source to destination), and then it exits. If it were verifying integrity, there would be a read-only phase.

jasonwryan
  • 73,126
Felix
  • 619
  • 5
  • 2
  • 18
    There is no post-copy verification for any copies, local or remote. You run rsync -c again if you want to force it to check. – psusi May 06 '13 at 23:50
  • 1
    "The man page is unfortunately misleading about this. I also verified this with strace" Did you strace the remote, running rsync process or the local one? There are two... one runs on the destination, even when you use ssh. – user129070 May 06 '13 at 19:20
  • 3
    The verification is done on the incoming stream as it goes. It's not necessary to read it back from the disk if the filesystem has confirmed it's been written. – OrangeDog Jul 11 '18 at 15:51
  • 1
    rsync man page states "Note that rsync always verifies that each transferred file was correctly reconstructed on the receiving side by checking a whole-file checksum that is generated as the file is transferred." – openCivilisation Feb 22 '20 at 12:54
  • @openCivilisation, it looks correct, but so what? I myself got confused with that "receiving side" wording, just wrote an answer here for future reference. – Martian2020 Feb 09 '22 at 04:09
  • I am sure the check is file size and dates by default unless rsync -c is specified to do the checksum check. Not sure if using a rsync server is any different. – AndrewC Nov 24 '22 at 19:14
22

rsync makes a checksum comparison before copying (in some cases), to avoid copying what's already there. The point of the checksum comparison is not to verify that the copy was successful. That's the job of the underlying infrastructure: the filesystem drivers, the disk drivers, the network drivers, etc. Individual applications such as rsync don't need to bother with this madness. All rsync needs to do (and does!) is to check the return values of system calls to make sure there was no error.

  • 3
    This seems to contradict the accepted answer... – Jules Jan 13 '16 at 06:45
  • 3
    @djule5 In what way? The accepted answer seems to mostly be about how rsync checks transferred files, but the question, and my answer, are about local copies. – Gilles 'SO- stop being evil' Jan 13 '16 at 10:16
  • 3
    Ok, well in that context I agree it makes more sense. So "The point of the checksum comparison is not to verify that the copy was successful" is true only for local copies; and "checksums are always used on the data transferred between the sending and receiving rsync processes" is true only for transferred copies. I find the accepted answer misleading in regard to the question and believe your answer should be the accepted one (just my 2 cents). – Jules Jan 13 '16 at 18:23
  • I still feel this answer is slightly misleading. For example, it says that the network drivers in particular verify if the copy was successful - but if you were saying that checksum comparison does not verify if the copy was successful for local only, network drivers would not come into play. – Ken Aug 07 '17 at 19:56
  • 1
    @Ken I don't understand the point you're trying to make. I suspect you misread something. The network drivers come into play only if there's a network copy. Rsync itself does a checksum comparison before doing any copy, in order to decide whether to copy. Rsync doesn't do any checksum comparison after copying (because it would be pointless: it knows what it's just copied). – Gilles 'SO- stop being evil' Aug 07 '17 at 20:04
  • From what I understand about rsync, that's not true. It only checksums before the copy if you use the checksum option -c, but it always checksums after a network copy to verify it was transferred properly. – Ken Aug 07 '17 at 20:07
  • @Gilles the transfer method is identical for local copies, the data stream just doesn't go through a network interface between the two processes. – OrangeDog Jul 11 '18 at 15:53
17

Quick and dirty answers, directly to the questions.

Q: Will rsync make the comparison when copying the files between two local drives?

A: It will do a comparison to figure out what to copy.

Q: If it does do a verification - is it a safe bet? Or is it better to do a byte by byte comparison?

A: As safe as the mathematics behind the MD5 checksum of a file. You can run a simple experiment to learn and trust the tool.

Long answer: I guess you wanted rsync to do a file comparison (bit by bit or by checksum) after copying the files. If you are one of the few who value data integrity, you might find the following useful:

rsync -avh [source] [destination] && rsync -avhc [source] [destination] 

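The command above copies the files on the first run and, if that completes without error, immediately runs rsync a second time with -c, which compares source and destination by a hash of each entire file instead of by size and modification time, and re-copies any file whose hash differs.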
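A variation on the one-liner, sketched as a shell function: here the second pass adds -n (dry run) and -i (itemize changes), so a mismatch is reported rather than silently re-copied. The function name copy_then_verify and its exit codes are hypothetical; the rsync options themselves are standard. The second pass may still be served from cached data unless caches are dropped first.

```shell
# Hypothetical wrapper: copy with rsync, then re-compare by checksum
# and report any differences instead of fixing them.
copy_then_verify() {
    local src=$1 dst=$2 changes
    # Pass 1: ordinary archive copy (size/mtime decide what gets sent).
    rsync -a "$src" "$dst" || return 1
    # Pass 2: -c compares whole-file checksums on both sides;
    # -n makes it a dry run and -i lists each item that differs.
    changes=$(rsync -anci "$src" "$dst") || return 1
    if [ -n "$changes" ]; then
        printf 'Differences found after copy:\n%s\n' "$changes" >&2
        return 2
    fi
    return 0
}

# Example (placeholders as in the one-liner):
#   copy_then_verify [source] [destination]
```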

Bonlenfum
  • 163
M.N.
  • 352
  • The last paragraph is not clear English (sorry no offence intended). Are you saying that those options will cause ‘rsync’ to run twice? Or that the user should run it a second time? – Ed Randall Jul 27 '22 at 07:38
  • 1
    Yes, that one-liner runs rsync twice; the second run happens only if the first one succeeds. – M.N. Sep 20 '22 at 04:27
  • rsync doesn’t always use a checksum the documentation also mentions checks using just file dates and size. – AndrewC Nov 24 '22 at 19:17
  • @AndrewC, that's why the second part of that one liner is forcing rsync to check source and destination using checksum of the entire file instead of relying on mtime/size – M.N. Apr 04 '23 at 02:05
  • 1
    Why not run only the second part ( rsync -avhc [source] [destination] )? What is the advantage of running both parts? – vasili111 Mar 04 '24 at 00:36
8

Using rsync to verify the integrity of a duplicate

To guarantee that this test physically re-reads the files from the drive media, I suggest powering-down both drives and restarting them before running this test. This will clear their internal volatile caches.

If not also restarting Linux, you should at least drop the caches (*) with:

sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

Then to re-read both trees and compare their checksums:

rsync --dry-run --checksum --itemize-changes --archive SRC DEST

Modern rsync checksum uses MD5, which is 128 bits. The likelihood of this failing to detect an error in an individual file is astronomically low (some discussion here), but not impossible.

1

This answer is meant to clear up a confusion I had myself, which is (IMO) difficult to resolve from the comments on the accepted answer. The rsync man page states:

Note that rsync always verifies that each transferred file was correctly reconstructed on the receiving side by checking a whole-file checksum that is generated as the file is transferred

It is probably correct, but it does not mean that the file data actually written are checked after the copy. First, with local drives both sides are the same PC (so rsync might skip the check when both sides are the same machine); second, the checksum "is generated as the file is transferred" (emphasis mine), not as the file is written to the media.

P.S. I found this Q&A while investigating why files I had copied with rsync ended up with mismatched checksums. I knew beforehand that the PC I copied on was glitchy, but from reading the man page I had thought rsync would make sure the files were copied properly.
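One way to detect mismatches like these after the fact is to build a checksum manifest from the source tree and check the destination against it. This is only a sketch: the function name verify_tree is hypothetical, it assumes md5sum, find, and mktemp are available, and it does not notice extra files that exist only in the destination.

```shell
# Hypothetical helper: compare every regular file under SRC against the
# same relative path under DST by MD5 digest.
verify_tree() {
    local src=$1 dst=$2 manifest out status
    manifest=$(mktemp) || return 1
    # Build "<digest>  ./relative/path" lines from the source tree.
    (cd "$src" && find . -type f -exec md5sum {} +) > "$manifest" \
        || { rm -f "$manifest"; return 1; }
    if [ ! -s "$manifest" ]; then
        rm -f "$manifest"
        return 1    # nothing to verify
    fi
    # Re-read the same paths under the destination and check each digest.
    if out=$(cd "$dst" && md5sum -c "$manifest" 2>&1); then
        status=0
    else
        status=$?
    fi
    rm -f "$manifest"
    # Print only the lines that are not plain successes.
    printf '%s\n' "$out" | grep -v ': OK$' >&2 || true
    return "$status"
}
```

Calling verify_tree SRC DEST returns 0 only when every file listed in the source manifest has a matching digest in the destination.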

Martian2020
  • 1,167
0

Instead of using rsync you can use cp -rp to recursively copy a directory, followed by diff -r. GNU diff accepts two directory trees, which are both read in full when compared.

Rationale (thanks @they): both cp and diff are usually already installed and quite suitable for a one-time action like OP's "I want to make a fresh new copy of a large number of files from one local drive to another."
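The two steps can be wrapped together as a sketch. The function name copy_and_diff is hypothetical, and the refusal to overwrite an existing destination is an added assumption: with cp -r, an existing destination directory would receive the source as a subdirectory, and the trees would no longer line up for diff.

```shell
# Hypothetical helper: make a fresh copy with cp, then compare the trees.
copy_and_diff() {
    local src=$1 dst=$2
    # Require a fresh destination so SRC and DST have the same layout.
    if [ -e "$dst" ]; then
        printf 'refusing to overwrite existing %s\n' "$dst" >&2
        return 2
    fi
    cp -rp "$src" "$dst" || return 1
    # diff -r reads both trees in full and exits 0 only if they match.
    diff -r "$src" "$dst"
}
```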

  • What is the reasoning behind using cp in place of rsync? – Kusalananda Oct 23 '21 at 11:34
  • 1
    It is already installed and quite suitable for a one-time action like OP's "I want to make a fresh new copy of a large number of files from one local drive to another." – Victor Klos Oct 23 '21 at 15:00
  • +1 for answering the requirement – Chris Davies Oct 24 '21 at 09:15
  • 1
    Note that by changing from rsync to cp, you are depending on the capabilities of the local cp utility to support the non-standard option -r, and you assume that it will do exactly what rsync would do (on some systems, the -r option would cause symbolic links to be followed, not copied as-is). You also remove the ability to copy hard links and sparse files (again, depending on the cp implementation), which would be done portably using rsync with its -H and --sparse options. – Kusalananda Oct 24 '21 at 09:55
  • Agreed that it is different. Point of this answer is: in many cases you don't need rsync but you can just get the job done with tools that are already installed. In case of hard links and sparse files, cp has options for those too, but those go beyond the requirement of the question which is to just copy a bunch of files from one local drive to another. BTW I have to disagree with the recursive flag being non-standard in the copy command. With diff that is different, which is why the answer states GNU diff. – Victor Klos Oct 24 '21 at 10:39
  • are the differences these days mainly between BSD vs GNU? – AndrewC Nov 24 '22 at 19:19