Check data integrity after copying thousands of files

Question

I copied thousands of files into an exFAT MicroSD card.

The number of files and bytes is identical, but how do I know whether the data is corrupt or not?

It would be good if the JackPal Android Terminal also supported the command.

See also Does rsync verify files copied between two local drives? (2012). — Peter Mortensen, Nov 05 '21 at 21:21

xenoid · Answer 1 · 2019-01-19T21:06:11.307

19

Using MD5 sums is a good way, but the canonical way to use it is:

cd to the directory of the source files and issue:
```
md5sum * >/path/to/the/checksumfile.md5
```

If you have directories with many levels, you can use shopt -s globstar and replace * by **/*.

Notice that the file specs in the MD5 file are exactly as provided in the command line (relative paths unless your pattern starts with a /).

cd to the directory of the copied files and issue:
```
md5sum -c /path/to/the/checksumfile.md5
```

With -c, md5sum reads the file specs in the provided MD5 file, computes the MD5 of these files, and compares them to the values from the MD5 file (which is why the file specs are usually better left relative, so you can re-use the MD5 file on files in various directories).

Using md5sum this way immediately tells you about MD5 differences, and also about missing files.
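A minimal end-to-end sketch of the procedure above, run against throwaway directories created with mktemp. The names src, dst and checksumfile.md5 are placeholders, and the --quiet option assumes GNU md5sum. It uses find so that subdirectories are covered without globstar, then damages one copy and deletes another to show what md5sum -c reports.

```shell
#!/bin/bash
# Throwaway source and destination trees (hypothetical names).
work=$(mktemp -d)
src="$work/src"; dst="$work/dst"; manifest="$work/checksumfile.md5"
mkdir -p "$src/sub"
printf 'alpha\n' > "$src/a.txt"
printf 'beta\n'  > "$src/sub/b.txt"
printf 'gamma\n' > "$src/c.txt"
cp -r "$src" "$dst"

# Step 1: in the source directory, record checksums with relative paths.
( cd "$src" && find . -type f -exec md5sum {} + > "$manifest" )

# Simulate a bad copy: one file altered, one file missing.
printf 'BETA\n' > "$dst/sub/b.txt"
rm "$dst/c.txt"

# Step 2: in the destination directory, verify against the manifest.
# --quiet (GNU md5sum) hides the OK lines; the exit status is non-zero
# if any file fails or cannot be read.
report=$( cd "$dst" && md5sum -c --quiet "$manifest" 2>&1 )
status=$?
printf '%s\n' "$report"
echo "md5sum -c exit status: $status"
```

In a script, the exit status is usually the more convenient thing to test; the printed report is what tells you which files to copy again.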

edited Jan 19 '19 at 21:06

answered Jan 19 '19 at 20:43

xenoid

8,888

7

If the number of files goes into the thousands, letting the shell do the wildcard expansion may cause trouble. Using find -exec is safer. – peterph Jan 19 '19 at 23:37
1

@peterph Either that or a loop such as for or while. Shell built-ins are immune to the Argument list too long error. – Sergiy Kolodyazhnyy Jan 20 '19 at 03:32
Good remarks. OTOH xargs --show-limits says the max command line is around 2MB (2086956 bytes). ls ./**/*.mp3 on my 21GB music collection (3600 files, kept as Genre/Artist/Album/Title.mp3, so rather long paths) is 290K bytes. So this is not too likely to happen. – xenoid Jan 20 '19 at 09:15
I recommend that the answer switch from md5sum to b2sum. It is well known that MD5 has been cryptographically broken (in collision resistance) for a long time. It was still used because of its speed. But these days BLAKE2 is faster than MD5 and secure (it is derived from the SHA-3 finalist BLAKE). A remark can be added to warn that older GNU/Linux systems may not have b2sum. – Alex Vong Jun 06 '19 at 15:47
@AlexVong Can't find b2sum for my 16.04. – xenoid Jun 06 '19 at 20:00
@xenoid You can have a look at this answer: https://askubuntu.com/questions/965525/what-packages-in-ubuntu-repositories-provide-blake2-sums-on-16-04 – Alex Vong Jun 07 '19 at 20:53
@xenoid So basically it isn't available for xenial. I guess you will have to use shasum -a 512256 * in the meantime. – Alex Vong Jun 07 '19 at 21:02
@AlexVong No, because it is twice as slow as md5sum and we don't need crypto here. MD5 is good enough to catch copy errors. – xenoid Jun 07 '19 at 23:10
@xenoid IMO, twice as slow is fine for shell scripts. Anyway, I think you should at least warn about not using MD5 for anything security-related. – Alex Vong Jun 09 '19 at 00:27
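The two suggestions from the comments above can be sketched together: let find hand the file names to the checksum tool in batches, and keep the tool name in a variable so md5sum can be swapped for sha256sum or b2sum where those exist. The variable name sum_cmd and the directory layout are illustrative assumptions, and --quiet assumes a GNU coreutils checksum tool.

```shell
#!/bin/bash
# Hypothetical throwaway tree with a couple of thousand files.
work=$(mktemp -d)
mkdir -p "$work/src"
for i in $(seq 1 2000); do
    printf 'file %s\n' "$i" > "$work/src/file_$i.dat"
done
cp -r "$work/src" "$work/dst"

# Checksum tool. md5sum is enough to catch copy errors; sha256sum or
# b2sum can be substituted if the system provides them.
sum_cmd=md5sum

# find passes names to the tool in batches ("{} +"), so the shell never
# has to expand one huge wildcard into a single command line.
( cd "$work/src" && find . -type f -exec "$sum_cmd" {} + > "$work/sums.txt" )

listed=$(wc -l < "$work/sums.txt")
( cd "$work/dst" && "$sum_cmd" -c --quiet "$work/sums.txt" )
verify_status=$?
echo "checksummed $listed files, verify exit status $verify_status"
```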

sourcejedi · Accepted Answer · 2019-01-20T13:42:12.870

14

Unmount, eject, and remount the device. Then use

```
diff -r source destination
```

In case you used rsync to do the copy, rsync -n -c might be very convenient, and it is nearly as good as diff. It doesn't do a bit-for-bit comparison though; it uses an MD5 checksum.

There are some similar answers with other details at: Verifying a large directory after copy from one hard drive to another
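As a small illustration of how the diff -r result can be read from a script, the sketch below builds two throwaway trees (the directory and file names are made up for the example) and looks at the exit status before and after one copied file is damaged.

```shell
#!/bin/bash
# Hypothetical source tree and an exact copy of it.
work=$(mktemp -d)
mkdir -p "$work/source/d"
printf 'one\n' > "$work/source/a"
printf 'two\n' > "$work/source/d/b"
cp -r "$work/source" "$work/destination"

# Identical trees: diff prints nothing and exits with status 0.
diff -r "$work/source" "$work/destination"
same_status=$?

# Change one byte in the copy; diff -r now reports it and exits with 1.
printf 'twO\n' > "$work/destination/d/b"
diff -r "$work/source" "$work/destination" > "$work/diff.out"
changed_status=$?

echo "identical: $same_status, after damage: $changed_status"
```

For large binary files the exit status and the list of differing names are normally all that is needed, not the textual differences themselves.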

edited Jan 20 '19 at 13:42

answered Jan 19 '19 at 20:38

sourcejedi

50,249

I like that. Unfortunately, Android Terminal lacks diff. Good nevertheless. – neverMind9 Jan 19 '19 at 20:48
3

@neverMind9 Install Termux for diff, rsync, etc. on Android. – user1133275 Jan 19 '19 at 22:26
@user1133275 Actually, the similar alternatives cmp and comm are included in JackPal's terminal. – neverMind9 Jan 20 '19 at 01:19
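Where only cmp is available, a recursive comparison can be approximated with a loop. This is a rough sketch under some assumptions: find and mktemp exist on the system, file names contain no newlines, and the paths and the bad.txt list name are invented for the demonstration.

```shell
#!/bin/sh
# Hypothetical source tree, a copy, and one deliberately damaged file.
work=$(mktemp -d)
mkdir -p "$work/src/d"
printf 'one\n' > "$work/src/a"
printf 'two\n' > "$work/src/d/b"
cp -r "$work/src" "$work/dst"
printf 'tw0\n' > "$work/dst/d/b"

# Walk the source tree and compare each file byte for byte with its copy.
# cmp -s prints nothing and only sets the exit status.
bad_list="$work/bad.txt"
: > "$bad_list"
( cd "$work/src" && find . -type f ) | while IFS= read -r f; do
    cmp -s "$work/src/$f" "$work/dst/$f" || printf '%s\n' "$f" >> "$bad_list"
done

# Every path listed here differs from, or is missing in, the copy.
cat "$bad_list"
```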
1

I tend to prefer diff -rq ... to suppress the diff output for any text files that may be present. I want to know if there are differences, not necessarily what the differences are. – smitelli Jan 20 '19 at 03:45
1

@neverMind9 I think toybox these days provides diff. – Alex Vong Jun 06 '19 at 15:20
rsync can support many checksums. My version v3.2.3 on Arch Linux was compiled with xxh128 xxh3 xxh64 (xxhash) md5 md4. – Yaroslav Nikitenko Feb 17 '21 at 10:39
I assumed offhand that diff would be faster than the md5sum solution as there are no calculations involved, but was surprised. Both were CPU bound on my system, but diff is single-threaded and two md5sums (or your hashing algo of choice) can be run in their own threads. Maybe it's because I have fast disks and slow(er) CPUs, but if your use case is time-critical (e.g. I'm migrating production VMs off hours), try it both ways first to see the perf in your env. I got about 3.5min for MD5 and 4.6min for a diff on a 60GB volume. Considering I have 3TB to verify... it'll add up. – s.co.tt Mar 25 '21 at 05:03
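The idea of hashing both sides at the same time can be sketched as two background jobs followed by a comparison of the resulting lists. The helper name hash_tree and the file names are assumptions for the example; whether this beats diff depends on the disks and CPUs involved, as the comment above notes.

```shell
#!/bin/bash
# Hypothetical source tree and copy.
work=$(mktemp -d)
mkdir -p "$work/src/d"
printf 'one\n' > "$work/src/a"
printf 'two\n' > "$work/src/d/b"
cp -r "$work/src" "$work/dst"

# Hash one tree and write a path-sorted list, so the two lists line up.
hash_tree() {
    ( cd "$1" && find . -type f -exec md5sum {} + | sort -k 2 ) > "$2"
}

# Run both passes as background jobs so each can use its own CPU core.
hash_tree "$work/src" "$work/src.md5" &
hash_tree "$work/dst" "$work/dst.md5" &
wait

# The lists are identical only if every checksum and relative path match.
if cmp -s "$work/src.md5" "$work/dst.md5"; then
    result=match
else
    result=differ
fi
echo "$result"
```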

score 8 · Answer 3 · answered Jan 20 '19 at 01:04

```
rsync -rc original-dir/ copied-dir/
```

-c causes rsync to compare files by MD5 checksum (without it, it normally uses only the timestamp and size for quicker comparisons).

This will also cause rsync to copy whatever it sees as different or missing in the destination. To avoid that, you can also use -n and -i. The former ensures that rsync doesn't make any changes and only compares, and the latter causes it to display the differences that it sees.

For example, I have the following dirs:

```
$ find dir1/ dir2/
dir1/
dir1/d
dir1/d/a
dir1/d/b
dir1/c
dir2/
dir2/d
dir2/d/a
dir2/d/b
```

And this:

```
$ rsync -rcni dir1/ dir2/
>f+++++++++ c
>fc.T...... d/b
```

Tells me, by way of all those +s, that file c does not exist in dir2, and file d/b does, but is different (indicated by the c in the first column). The T says that its time would be updated (had we not used -n).

The format of -i's output is described in the manpage for rsync. You can man rsync and get to the part that explains that output by typing /--itemize-changes$ (and hitting Enter).

score 5 · Answer 4 · answered Jan 20 '19 at 06:18

Along with the other fine answers above, I would like to also recommend considering hashdeep, from http://md5deep.sourceforge.net/. It has a good sized userbase in the scientific community, where they frequently have to do this type of thing with terabytes of data scattered across thousands of directories.

neverMind9 · Answer 5 · 2019-01-29T13:59:59.683

It is possible to generate hashsums for the individual files, write them into one text file, and then generate the MD5 hash of that text file. For the text file itself you can use any hash function you like, because the hash list is too small for a slower hash function such as sha512sum to make a noticeable performance difference.
I use cksum due to its universal availability (sum and crc32 are not included in JackPal's Android Terminal) and maximum speed. It is not a cryptographic, secure algorithm like sha512sum, but any hash function is good enough to verify data integrity in an offline environment. However, if you want all file hashes to have the same length (i.e. 32), use md5sum, the fastest universally supported secure hash algorithm (although it is older, it is much faster than any sha algorithm and will do its job).

Run these commands on both source and destination:

```
# Keep only the pure hashsum string in hash.list.txt, to hide the different file paths.
cksum /path/to/folder/* | tee -a hash.files.txt | cut -f 1 -d " " >> hash.list.txt
md5sum hash.list.txt
```

…or with a single command:

```
cksum /path/to/folder/* | tee -a hash.files.txt | cut -f 1 -d " " | tee -a hash.list.txt | sort | md5sum
```

The hashsum list files (hash.list.txt and hash.files.txt in my example) can have any names you like. Two files are generated so that damaged files can be identified: the first contains the file names as well, and the second is used for the comparison.

sort is used because sh and bash implement alphabetical sorting slightly differently; sort compensates for that.
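A minimal sketch of the same idea wrapped in a function, so that it can be run once per side and the two results compared. It differs from the commands above in ways that are assumptions of this sketch, not the author's method: it changes into the directory first so the recorded paths are relative, recurses with find, sorts the per-file list by path, overwrites the list file instead of appending to it, and hashes the whole list including the names. The function name tree_digest and the file names are hypothetical.

```shell
#!/bin/bash
# tree_digest DIR LISTFILE: write "crc size path" lines to LISTFILE,
# then print a single MD5 digest that summarises the whole list.
tree_digest() {
    ( cd "$1" && find . -type f -exec cksum {} + | sort -k 3 ) > "$2"
    md5sum < "$2" | cut -d ' ' -f 1
}

# Hypothetical source tree and copy.
work=$(mktemp -d)
mkdir -p "$work/src/d"
printf 'one\n' > "$work/src/a"
printf 'two\n' > "$work/src/d/b"
cp -r "$work/src" "$work/dst"

src_digest=$(tree_digest "$work/src" "$work/src.files.txt")
dst_digest=$(tree_digest "$work/dst" "$work/dst.files.txt")
echo "source:      $src_digest"
echo "destination: $dst_digest"

# Damage one copied file; the destination digest no longer matches.
printf 'changed\n' > "$work/dst/d/b"
damaged_digest=$(tree_digest "$work/dst" "$work/dst.files.txt")
echo "damaged:     $damaged_digest"
```

When the two digests differ, comparing the two list files line by line shows which entries changed.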

At the end of the md5sum manual it is stated explicitly: "Do not use the MD5 algorithm for security related purposes. Instead, use an SHA-2 algorithm, implemented in the programs sha224sum(1), sha256sum(1), sha384sum(1), sha512sum(1), or the BLAKE2 algorithm, implemented in b2sum(1)". So this is not "secure". – Yaroslav Nikitenko, Feb 17 '21 at 11:03
