31

Under the assumption that disk I/O and free RAM are the bottleneck (while CPU time is not the limitation), does a tool exist that can calculate multiple message digests at once?

I am particularly interested in calculating the MD5 and SHA-256 digests of large files (gigabytes in size), preferably in parallel. I have tried openssl dgst -sha256 -md5, but it only calculates the hash using one algorithm.

Pseudo-code for the expected behavior:

for each block:
    for each algorithm:
        hash_state[algorithm].update(block)
for each algorithm:
    print algorithm, hash_state[algorithm].final_hash()
Lekensteyn
  • 20,830
  • You can just start one instance in the background, then both hashes run in parallel: for i in file1 file2 …; do sha256 "$i"& md5sum "$i"; done – Marco Oct 23 '14 at 10:16
  • 2
    @Marco The problem with that approach is that one command may be faster than the other, resulting in a disk cache that gets emptied and refilled later with the same data. – Lekensteyn Oct 23 '14 at 12:08
  • 1
    If you're worried about the disk cache, you can read in the file just once: for i in file1 file2 …; do tee < "$i" >(sha256sum) | md5sum ; done Then you have to add additional code to mark the file name, because it is sent as standard input to md5sum and sha256sum. – Marco Oct 23 '14 at 12:34

7 Answers

31

Check out pee ("tee standard input to pipes") from moreutils. This is basically equivalent to Marco's tee command, but a little simpler to type.

$ echo foo | pee md5sum sha256sum
d3b07384d113edec49eaa6238ad5ff00  -
b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c  -
$ pee md5sum sha256sum <foo.iso
f109ffd6612e36e0fc1597eda65e9cf0  -
469a38cb785f8d47a0f85f968feff0be1d6f9398e353496ff7aa9055725bc63e  -
  • Nice command! I already have this very useful package installed, didn't know of this funny-named utility. – Lekensteyn Oct 23 '14 at 15:38
  • 1
    pee has the best interface; a time comparison with other tools can be found in this post, which also demonstrates a multi-threaded Python tool. – Lekensteyn Oct 24 '14 at 11:08
  • Unfortunately, moreutils conflicts with GNU parallel on my Debian system… though, it's good to know there's such a tool. – liori Oct 24 '14 at 11:37
  • @Lekensteyn: I get a conflict on the package level (i.e. aptitude doesn't let me have both of the packages at the same time). – liori Oct 24 '14 at 16:27
  • 1
    @liori Too bad that Debian implemented it that way; it might be worth filing a bug on this. On Arch Linux there is a moreutils-parallel package to avoid the conflict. – Lekensteyn Oct 24 '14 at 16:34
10

You can use a for loop to loop over the individual files and then use tee combined with process substitution (works in Bash and Zsh among others) to pipe to different checksummers.

Example:

for file in *.mkv; do
  tee < "$file" >(sha256sum) | md5sum
done

You can also use more than two checksummers:

for file in *.mkv; do
  tee < "$file" >(sha256sum) >(sha384sum) | md5sum
done

This has the disadvantage that the checksummers don't know the file name, because it is passed as standard input. If that's not acceptable, you have to emit the file names manually. Complete example:

for file in *.mkv; do
  echo "$file"
  tee < "$file" >(sha256sum) >(sha384sum) | md5sum
  echo
done > hashfilelist
Marco
  • 33,548
  • 2
    To make the output compatible with the *sum family of tools, this sed expression could be used instead: sed "s;-\$;${file//;/\\;};" (this replaces the trailing - with the filename, while ensuring that the filename gets properly escaped). – Lekensteyn Oct 23 '14 at 15:37
  • 1
    AFAICS, it only works in zsh. In ksh93 and bash, the output of sha256sum goes to md5sum. You'll want: { tee < "$file" >(sha256sum >&3) | md5sum; } 3>&1. See http://unix.stackexchange.com/q/153896/22565 for the reverse problem. – Stéphane Chazelas Oct 24 '14 at 14:32
  • another snippet for bash, to create a list of prefixed hashes, aka multihash: input_path=/dev/stdin; echo asdf | { cat "$input_path" | pv -r -a | tee >(md5sum | sed -E 's/^([0-9a-f]+)\s.*$/md5:\1/' >&3) | tee >(sha1sum | sed -E 's/^([0-9a-f]+)\s.*$/sha1:\1/' >&3) | tee >(sha256sum | sed -E 's/^([0-9a-f]+)\s.*$/sha256:\1/' >&3) | tiger-hash - | sed -E 's/^([0-9a-f]+)\s.*$/tiger:\1/' >&3; } 3>&1 – milahu Oct 25 '23 at 06:43
7

It's a pity that the openssl utility doesn't accept multiple digest commands; I guess performing the same command on multiple files is a more common use pattern. FWIW, the version of the openssl utility on my system (Mepis 11) only has commands for sha and sha1, not any of the other sha variants. But I do have a program called sha256sum, as well as md5sum.

Here's a simple Python program, dual_hash.py, that does what you want. A block size of 64 KiB appears to be optimal for my machine (Intel Pentium 4 2.00GHz with 2G of RAM); YMMV. For small files, its speed is roughly the same as running md5sum and sha256sum in succession, but for larger files it is significantly faster. E.g., on a 1967063040-byte file (a disk image of an SD card full of mp3 files), md5sum + sha256sum takes around 1m44.9s, while dual_hash.py takes 1m0.312s.

dual_hash.py

#! /usr/bin/env python

''' Calculate MD5 and SHA-256 digests of a file simultaneously

    Written by PM 2Ring 2014.10.23
'''

import sys
import hashlib

def digests(fname, blocksize):
    md5 = hashlib.md5()
    sha = hashlib.sha256()
    with open(fname, 'rb') as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            md5.update(block)
            sha.update(block)

    print("md5: %s" % md5.hexdigest())
    print("sha256: %s" % sha.hexdigest())

def main(*argv):
    blocksize = 1<<16 # 64kB
    if len(argv) < 2:
        print("No filename given!\n")
        print("Calculate md5 and sha-256 message digests of a file.")
        print("Usage:\npython %s filename [blocksize]\n" % sys.argv[0])
        print("Default blocksize=%d" % blocksize)
        return 1

    fname = argv[1]

    if len(argv) > 2:
        blocksize = int(argv[2])

    print("Calculating MD5 and SHA-256 digests of %r using a blocksize of %d" % (fname, blocksize))
    digests(fname, blocksize)

if __name__ == '__main__':
    sys.exit(main(*sys.argv))

I suppose a C/C++ version of this program would be a little faster, but not much, since most of the work is being done by the hashlib module, which is written in C (or C++). And as you noted above, the bottleneck for large files is IO speed.

Lekensteyn
  • 20,830
PM 2Ring
  • 6,633
  • For a file of 2.3G, this version had comparable speed to md5sum and sha256sum combined (4.7s+14.2s vs 18.7s for this Python script, file in cache; 33.6s for the cold run). 64KiB vs 1MiB did not change the situation. With code commented, 5.1s was spent on md5 (n=3), 14.6s on sha1 (n=3). Tested on an i5-460M with 8GB RAM. I guess that this could further be improved by using more threads. – Lekensteyn Oct 23 '14 at 12:22
  • C or C++ will probably not matter that much, as much of the runtime is spent in the OpenSSL module anyway (used by hashlib). More threads do improve speed; see this post about a multi-threaded Python script. – Lekensteyn Oct 24 '14 at 11:12
  • @PM 2Ring - Just a note. After the print statements in your digests() function, you need to clear at least sha. I can't say whether you should clear md5 or not. I would just use "del sha". If you don't, every file after the first will have an incorrect hash. To prove it, make a tmp dir and copy a file into it. Now make 2 copies of that file, and run your script. You'll get 3 different hashes, which isn't what you want. Edit: I thought the function was reading over a set of files, not just reading a single file at a time... Disregard for this use. ;) – Terry Wendt Sep 30 '18 at 13:15
  • 1
    @TerryWendt You had me worrying there for a second. :) Yes, digests only processes a single file on each call. So even if you did call it in a loop it will make new md5 & sha contexts on each call. FWIW, you may enjoy my resumable SHA-256 hash. – PM 2Ring Sep 30 '18 at 15:26
6

You could always use something like GNU parallel:

echo "/path/to/file" | parallel 'md5sum {} & sha256sum {}'

Alternatively, just run one of the two in the background:

md5sum /path/to/file & sha256sum /path/to/file

Or, save the output to different files and run multiple jobs in the background:

for file in *; do
    md5sum "$file" > "$file".md5 &
    sha256sum "$file" > "$file".sha &
done

That will launch as many md5sum and sha256sum instances as you have files and they will all run in parallel, saving their output to the corresponding file names. Careful though, this can get heavy if you have many files.

terdon
  • 242,166
  • 1
    See the comment to Marco; my worry is that although the command will be parallel, the slow disk gets accessed twice for the same data. – Lekensteyn Oct 23 '14 at 12:09
  • But wouldn't the existence of the disk cache make your worries unnecessary? – Twinkles Oct 23 '14 at 14:19
  • 2
    @Twinkles To quote Lekensteyn above, "The problem with that approach is that one command may be faster than the other, resulting in a disk cache that gets emptied and refilled later with the same data." – Matt Nordhoff Oct 23 '14 at 14:28
  • 2
    @MattNordhoff Yet another thing an intelligent I/O scheduler should notice and optimize for. One may think: "How hard can it be for an I/O scheduler to take this scenario into account?" But with enough different scenarios an I/O scheduler should take into account, it suddenly becomes a hard problem. So I agree that one shouldn't assume that caching will take care of the problem. – kasperd Oct 23 '14 at 16:38
  • 1
    Assuming the IO is significantly slower than any of the tools involved, both tools should be slowed down to the same speed because of IO. Therefore, if one tool manages to get a few blocks of data more than the other, the other tool would quickly catch up with the computations using the data in the disk cache. That's the theory; I'd love to see some experimental results proving it… – liori Oct 24 '14 at 11:39
4

Out of curiosity whether a multi-threaded Python script would reduce the running time, I created this digest.py script, which uses threading.Thread, threading.Queue and hashlib to calculate the hashes for multiple files.

The multi-threaded Python implementation is indeed slightly faster than using pee with coreutils. Java on the other hand is... meh. The results are available in this commit message:

For comparison, for a file of 2.3 GiB (min/avg/max/sd secs for n=10):

  • pee sha256sum md5sum < file: 16.5/16.9/17.4/.305
  • python3 digest.py -sha256 -md5 < file: 13.7/15.0/18.7/1.77
  • python2 digest.py -sha256 -md5 < file: 13.7/15.9/18.7/1.64
  • jacksum -a sha256+md5 -F '#CHECKSUM{i} #FILENAME': 32.7/37.1/50/6.91

The hash output is compatible with output produced by coreutils. Since the length is dependent on the hashing algorithm, this tool does not print it. Usage (for comparison, pee was also added):

$ ./digest.py -sha256 -md5 digest.py
c217e5aa3c3f9cfaca0d40b1060f6233297a3a0d2728dd19f1de3b28454975f2  digest.py
b575edf6387888a68c93bf89291f611c  digest.py
$ ./digest.py -sha256 -md5 <digest.py
c217e5aa3c3f9cfaca0d40b1060f6233297a3a0d2728dd19f1de3b28454975f2  -
b575edf6387888a68c93bf89291f611c  -
$ pee sha256sum md5sum <digest.py
c217e5aa3c3f9cfaca0d40b1060f6233297a3a0d2728dd19f1de3b28454975f2  -
b575edf6387888a68c93bf89291f611c  -
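
For illustration, here is a minimal sketch of the threading approach described above. This is not the actual digest.py (the helper names multi_digest and hash_worker are made up for this sketch, and argument handling is simplified): the main thread reads the file once and feeds every block to one queue per algorithm, while a worker thread per algorithm drains its queue and updates its hash state. Since CPython's hashlib releases the GIL for large updates, the workers can genuinely overlap.

#! /usr/bin/env python

''' Sketch of a multi-threaded multi-digest calculator (illustrative only). '''

import hashlib
import sys
import threading

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

def hash_worker(algorithm, blocks, results):
    ''' Consume blocks from a queue until a None sentinel arrives. '''
    state = hashlib.new(algorithm)
    while True:
        block = blocks.get()
        if block is None:
            break
        state.update(block)
    results[algorithm] = state.hexdigest()

def multi_digest(fname, algorithms, blocksize=1 << 16):
    ''' Read fname once and compute all requested digests concurrently. '''
    results = {}
    queues = [queue.Queue(maxsize=16) for _ in algorithms]
    workers = [threading.Thread(target=hash_worker, args=(algo, q, results))
               for algo, q in zip(algorithms, queues)]
    for w in workers:
        w.start()
    with open(fname, 'rb') as f:
        while True:
            block = f.read(blocksize)
            for q in queues:
                # An empty read means EOF; hand the workers their stop sentinel.
                q.put(block if block else None)
            if not block:
                break
    for w in workers:
        w.join()
    return results

if __name__ == '__main__':
    algorithms = ['md5', 'sha256']
    results = multi_digest(sys.argv[1], algorithms)
    for algo in algorithms:
        # coreutils-style output: digest, two spaces, file name
        print("%s  %s" % (results[algo], sys.argv[1]))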
Lekensteyn
  • 20,830
  • I was going to suggest comparing pee "openssl sha256" "openssl md5" < file, but, honestly, I just tried it, and it didn't beat digest.py. It narrowed the gap, though. – Matt Nordhoff Oct 24 '14 at 12:41
3

Try RHash

There are packages for Cygwin and Debian.

Example

$ echo foo | rhash --md5 --sha1 --bsd -
MD5   ((stdin)) = d3b07384d113edec49eaa6238ad5ff00
SHA1  ((stdin)) = f1d2d2f924e986ac86fdf7b36c94bcdf32beec15

Nice to know: lotsa hashes

If you wanna go crazy: Try the --all option to get ALL supported hashes (and the --bsd formatting option to know what these hashes are):

$ echo foo | rhash --all --bsd - | sort
AICH  ((stdin)) = 6hjnf6je5gdkzbx566zwzff434zl53av
BTIH  ((stdin)) = 22a9c158a3ea04608f0e6ea826e3188c773eb4dd
CRC32 ((stdin)) = 7e3265a8
CRC32C ((stdin)) = 9626347b
ED2K  ((stdin)) = 3ee037f347c64cc372ad18857b0db91f
EDON-R256 ((stdin)) = 747b550af4c4916340680669f885ec391addf22cece025d1cb11df978401793a
EDON-R512 ((stdin)) = 521ec4b41abb75a54969c8070c3558b7f3981833165fd208d3b48de2bc23b64fa2a1d80ea94d87b176ecb99c8495f9ee19307c9ad54c23f37e034579b6ced4d8
GOST  ((stdin)) = eb9382405525bf1cc8403ed621caecfe8339cd7157e383fe9c36782ca0aeab5f
GOST-CRYPTOPRO ((stdin)) = 72e0992f1e7caec2f8406b53d7ed09263fb6df1bae9129731f97a50a9de04115
HAS-160 ((stdin)) = 6bb6e92d882dc41746064f8c2d8e81df02f13f0c
MD4   ((stdin)) = 3ee037f347c64cc372ad18857b0db91f
MD5   ((stdin)) = d3b07384d113edec49eaa6238ad5ff00
RIPEMD-160 ((stdin)) = ec0af898b7b1ab23ccf8c5036cb97e9ab23442ab
SHA1  ((stdin)) = f1d2d2f924e986ac86fdf7b36c94bcdf32beec15
SHA-224 ((stdin)) = e7d5e36e8d470c3e5103fedd2e4f2aa5c30ab27f6629bdc3286f9dd2
SHA-256 ((stdin)) = b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c
SHA3-224 ((stdin)) = 5f6b734bdedd9fc2bdf02d18f16ef83bbbb9178aebe5e8f6ae79e9a2
SHA3-256 ((stdin)) = 5218df10c0ebe3b38d74fe0040d13198ac49646a43bad373b91ed887dd734fcf
SHA3-384 ((stdin)) = a4d62fdfee48479a8951de809d9f3604309e8783d754d94c0842c89ddb544ee963bf64063644251e0521ca44aca97350
SHA3-512 ((stdin)) = 6f1b16155d5f87af947270b2202c9432b64ff07880e3bd104a50605bc0f949d4e4bf30cddbb257a7f3a54881429f45efdb43fbe14371f9f7f5cb16789db9175d
SHA-384 ((stdin)) = 8effdabfe14416214a250f935505250bd991f106065d899db6e19bdc8bf648f3ac0f1935c4f65fe8f798289b1a0d1e06
SHA-512 ((stdin)) = 0cf9180a764aba863a67b6d72f0918bc131c6772642cb2dce5a34f0a702f9470ddc2bf125c12198b1995c233c34b4afd346c54a2334c350a948a51b6e8b4e6b6
SNEFRU-128 ((stdin)) = 6bf837fd63236ae6d4a7df110085177c
SNEFRU-256 ((stdin)) = 27f8e3841ee9d88c6a9e5a0b0c02e7d8c3dbffbec3e2d8f22b6419236002aebd
TIGER ((stdin)) = 89c010f8e5ddcf01c7d71c7d8352d5436e40fe5200ca8ce0
TTH   ((stdin)) = a2mppcgs5cpjv6aoap37icdcfv3wyu7pbrec6fy
WHIRLPOOL ((stdin)) = 404818c0ea953193b372a3e72c96b91a53d0d07eb99d8cb8c2aaebf56657e74de2b6a510866283d0501b95aa0ba0ddc3b7669ea5fc9422cc666a953e241d8b9e
0

Jacksum is a free and platform-independent utility for computing and verifying checksums, CRCs and hashes (message digests), as well as timestamps of files. (excerpted from the jacksum man page)

It is large file aware; it can process file sizes up to 8 exabytes (= 8,000,000,000 gigabytes), provided that your operating system and file system are large file aware, too. (excerpted from http://www.jonelo.de/java/jacksum/)

Usage example:

jacksum -a md5+sha256 -F "#ALGONAME{i} (#FILENAME) = #CHECKSUM{i}" jacksum-testfile

Sample output:

md5 (jacksum-testfile) = d41d8cd98f00b204e9800998ecf8427e
sha256 (jacksum-testfile) = e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

On Ubuntu, run apt-get install jacksum to get it.

Alternatively, source codes are available at

pallxk
  • 1,235