I copied thousands of files into an exFAT MicroSD card.
The number of files and bytes is identical, but how do I know whether the data is corrupt or not?
It would be good if the JackPal Android Terminal also supports the command.
Using MD5 sums is a good way; the canonical way to use them is:

`cd` to the directory of the source files and issue:

md5sum * >/path/to/the/checksumfile.md5

If you have directories with many levels, you can use `shopt -s globstar` and replace `*` by `**/*`. Notice that the file specs in the MD5 file are exactly as provided on the command line (relative paths, unless your pattern starts with a `/`).
`cd` to the directory of the copied files and issue:

md5sum -c /path/to/the/checksumfile.md5

With `-c`, `md5sum` reads the file specs in the provided MD5 file, computes the MD5 of each of these files, and compares it to the value from the MD5 file (which is why the file specs are usually better left relative, so you can re-use the MD5 file on files in various directories).

Using MD5 sums this way immediately tells you about MD5 differences, and also about missing files.
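A minimal end-to-end sketch of the two steps above (the /tmp paths and sample files are illustrative, adjust to your setup):

```shell
set -e
# Stand-in source and destination directories with a couple of sample files.
mkdir -p /tmp/src /tmp/dst
echo hello > /tmp/src/a.txt
echo world > /tmp/src/b.txt
cp /tmp/src/a.txt /tmp/src/b.txt /tmp/dst/

# Step 1: in the source directory, record checksums with relative paths.
# The checksum file lives outside the tree so it doesn't hash itself.
cd /tmp/src
md5sum -- * > /tmp/checksums.md5

# Step 2: in the copy, verify against the recorded sums.
# Prints "a.txt: OK" per file; exits non-zero on any mismatch or missing file.
cd /tmp/dst
md5sum -c /tmp/checksums.md5
```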
`find -exec` is safer.
– peterph
Jan 19 '19 at 23:37
Or `for` / `while` loops. Shell built-ins are immune to the "Argument list too long" error.
– Sergiy Kolodyazhnyy
Jan 20 '19 at 03:32
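The `find -exec` variant suggested above can be sketched as follows (the /tmp tree is a made-up example); `find` batches the arguments itself, so it never hits the argument-list limit regardless of file count:

```shell
set -e
# Hypothetical deeply nested source tree.
mkdir -p /tmp/deepsrc/a/b
echo x > /tmp/deepsrc/a/b/f1
echo y > /tmp/deepsrc/f2

cd /tmp/deepsrc
# "-exec md5sum -- {} +" passes files to md5sum in safely sized batches.
# Paths come out relative (./a/b/f1 etc.), so the list is reusable elsewhere.
find . -type f -exec md5sum -- {} + > /tmp/deep.md5

# Verify (here against the originals themselves, as a smoke test).
md5sum -c /tmp/deep.md5
```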
`xargs --show-limits` says the max command line is around 2MB (2086956 bytes). `ls ./*/*/*/*.mp3` on my 21GB music collection (3600 files, kept as Genre/Artist/Album/Title.mp3, so rather long paths) expands to a 290K-byte command line. So this is not too likely to happen.
– xenoid
Jan 20 '19 at 09:15
Consider switching from `md5sum` to `b2sum`. It is well known that MD5 has been cryptographically broken (in collision resistance) for a long time. It was still used because of its speed, but these days BLAKE2 is faster than MD5 and secure (it is derived from the SHA-3 finalist BLAKE). A remark could be added to warn that older GNU/Linux systems may not have `b2sum`.
– Alex Vong
Jun 06 '19 at 15:47
On those systems, `shasum -a 512256 *` can be used in the meantime.
– Alex Vong
Jun 07 '19 at 21:02
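`b2sum` (in GNU coreutils since 8.26) is a drop-in replacement for `md5sum` in the workflow above, including the `-c` verification step. A minimal round trip, with made-up /tmp paths:

```shell
set -e
# Illustrative source and copy.
mkdir -p /tmp/b2src /tmp/b2dst
echo data > /tmp/b2src/file.txt
cp /tmp/b2src/file.txt /tmp/b2dst/

# Record BLAKE2 sums in the source...
cd /tmp/b2src
b2sum -- * > /tmp/blake2.sums

# ...and verify the copy exactly as with md5sum -c.
cd /tmp/b2dst
b2sum -c /tmp/blake2.sums   # prints "file.txt: OK" on success
```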
Unmount, eject, and remount the device. Then use:

diff -r source destination

In case you used `rsync` to do the copy, `rsync -n -c` might be very convenient, and it is nearly as good as `diff`. It doesn't do a bit-for-bit comparison though; it uses an MD5 checksum.
There are some similar answers with other details at: Verifying a large directory after copy from one hard drive to another
JackPal's terminal does not include `diff`. Good nevertheless.
– neverMind9
Jan 19 '19 at 20:48
`cmp` and `comm` are included in JackPal's terminal.
– neverMind9
Jan 20 '19 at 01:19
Consider `diff -rq ...` to suppress the diff output for any text files that may be present. I want to know if there are differences, not necessarily what the differences are.
– smitelli
Jan 20 '19 at 03:45
`rsync` can support many checksums. My version, v3.2.3 on Arch Linux, was compiled with xxh128, xxh3, xxh64 (xxhash), md5, and md4.
– Yaroslav Nikitenko
Feb 17 '21 at 10:39
I assumed `diff` would be faster than the `md5sum` solution, as there are no calculations involved, but I was surprised. Both were CPU-bound on my system, but `diff` is single-threaded, while two `md5sum`s (or your hashing algo of choice) can run in their own threads. Maybe it's because I have fast disks and slow(er) CPUs, but if your use case is time-critical (e.g. I'm migrating production VMs off hours), try it both ways first to see the perf in your env. I got about 3.5 min for MD5 and 4.6 min for a diff on a 60 GB volume. Considering I have 3 TB to verify... it'll add up.
– s.co.tt
Mar 25 '21 at 05:03
rsync -rc original-dir/ copied-dir/

`-c` causes rsync to compare files by MD5 checksum (without it, it normally uses only the timestamp and size for quicker comparisons).

This will also cause rsync to copy whatever it sees as different or missing from the destination. To avoid that, you can also use `-n` and `-i`: the former ensures that rsync doesn't make any changes and only compares, and the latter causes it to display the differences that it sees.
For example, I have the following dirs:
$ find dir1/ dir2/
dir1/
dir1/d
dir1/d/a
dir1/d/b
dir1/c
dir2/
dir2/d
dir2/d/a
dir2/d/b
And this:
$ rsync -rcni dir1/ dir2/
>f+++++++++ c
>fc.T...... d/b
Tells me, by way of all those `+`s, that file `c` does not exist in `dir2`, and that file `d/b` does, but is different (indicated by the `c` in the first column). The `T` says that its time would be updated (had we not used `-n`).
The format of `-i`'s output is described in the manpage for rsync. You can run `man rsync` and jump to the part that explains that output by typing `/--itemize-changes$` (and hitting Enter).
Along with the other fine answers above, I would like to also recommend considering hashdeep, from http://md5deep.sourceforge.net/. It has a good sized userbase in the scientific community, where they frequently have to do this type of thing with terabytes of data scattered across thousands of directories.
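A minimal hashdeep record-and-audit sketch, under the assumption that the hashdeep package is installed (the /tmp directory names are illustrative):

```shell
set -e
# Skip gracefully if hashdeep isn't available on this system.
command -v hashdeep >/dev/null || { echo "hashdeep not installed"; exit 0; }

# Stand-in source tree and its copy.
mkdir -p /tmp/hd-src /tmp/hd-dst
echo payload > /tmp/hd-src/f
cp /tmp/hd-src/f /tmp/hd-dst/f

# Record MD5 hashes recursively (-r), with bare relative paths (-l),
# so the list can be reused against the copied tree.
cd /tmp/hd-src && hashdeep -c md5 -r -l . > /tmp/known.txt

# Audit the copy (-a) against the known-hashes file (-k);
# exits non-zero if anything is changed, missing, or extra.
cd /tmp/hd-dst && hashdeep -c md5 -r -l -a -k /tmp/known.txt . && echo "audit passed"
```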
It is possible to generate hashsums for individual files and output them into one text file, of which an MD5 hash can then be generated. For that text file, you can use any hash function you like, because the hash list is not large enough to cause any noticeable performance difference even with stronger hash functions such as `sha512sum`.

I use `cksum` due to its universal availability (`sum` and `crc32` are not included in JackPal's Android Terminal) and maximum speed. It is not a cryptographically secure algorithm like `sha512sum`, but any hash function is good enough to verify data integrity in an offline environment. However, if you wish all file hashes to have the same length (i.e. 32 hex characters), use `md5sum`, the fastest universally supported secure hash algorithm (although it is older, it is much faster than any SHA algorithm and will do its job).
cksum /path/to/folder/* | tee -a hash.files.txt | cut -f 1 -d " " >> hash.list.txt  # extracts only the hashsum string, hiding the differing file paths

md5sum hash.list.txt

…or with a single command:

cksum /path/to/folder/* | tee -a hash.files.txt | cut -f 1 -d " " | tee -a hash.list.txt | sort | md5sum
The names of the hashsum list files (hash.list.txt and hash.files.txt in my example) can be anything you specify. Two files are generated so that damaged files can be identified: the first file contains the file names as well; the second is for comparison.
`sort` is used because `sh` and `bash` implement alphabetical sorting slightly differently; `sort` compensates for that.
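Putting the pieces together, a sketch (directory names are illustrative) that runs the same pipeline on the source and the copy and then compares the two hash lists; identical lists mean the data survived the copy:

```shell
set -e
# Stand-in source directory and its copy.
mkdir -p /tmp/ck-src /tmp/ck-dst
printf 'abc\n' > /tmp/ck-src/f
cp /tmp/ck-src/f /tmp/ck-dst/f

# Same cksum | cut | sort pipeline on both sides, hashes only (no paths),
# so the lists are comparable even though the directories differ.
cksum /tmp/ck-src/* | cut -f 1 -d " " | sort > /tmp/src.list
cksum /tmp/ck-dst/* | cut -f 1 -d " " | sort > /tmp/dst.list

# cmp exits 0 only when the two hash lists are byte-identical.
cmp /tmp/src.list /tmp/dst.list && echo "copy verified"
```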
In the `md5sum` manual it's stated explicitly: "Do not use the MD5 algorithm for security related purposes. Instead, use an SHA-2 algorithm, implemented in the programs sha224sum(1), sha256sum(1), sha384sum(1), sha512sum(1), or the BLAKE2 algorithm, implemented in b2sum(1)." So this is not "secure".
– Yaroslav Nikitenko
Feb 17 '21 at 11:03