
Based on this: Simultaneously calculate multiple digests (md5, sha256)?

I have a folder containing a large number of files for which I want to compute the SHA256 hashes.

I currently use the following code segment to compute the hashes in parallel:

#!/bin/bash
# start one background sha256sum job per file, with no limit on concurrency
for file in *; do
    sha256sum "$file" > "$file".sha &
done

The problem is that my computer only has 16 physical cores.

So, my question is: how can I use GNU parallel to run this using only the 16 physical cores available on my system, so that once one hash has completed, the next file is automatically picked up?

    For future reference, asking a new question and referencing the one that didn't quite answer your question, as you did here, is exactly the way we prefer things happen. Thanks and welcome to the site! – terdon Dec 09 '19 at 13:17
  • @terdon Thank you for educating me about this.

    Yeah, some of the folders I was processing contained more than 200 files, so with the script above it would try to run 200+ processes on a 16-core system (Hyper-Threading has been disabled).

    It'll run eventually, but it's massively oversubscribed at the start, so I was hoping for something that would cycle through the files rather than oversubscribing the CPU.

    Thank you.

    – alpha754293 Mar 10 '21 at 05:39

2 Answers


Using xargs (and assuming that you have an implementation of this utility that supports -0 and -P):

printf '%s\0' * | xargs -0 -L 1 -P 16 sh -c 'sha256sum "$1" > "$1".sha' sh

This would pass all names in the current directory as a NUL-terminated list to xargs. The xargs utility would call an in-line sh script for each one of these names (-L 1), starting at most 16 concurrent processes (-P 16). The in-line script takes a single argument and runs sha256sum on it, redirecting the output to a file with the same name plus a .sha suffix.

Note that this would also possibly pick up .sha files created in a previous run of the same pipeline. To avoid this, use a slightly more sophisticated glob than * to match the particular names that you'd want to process. For example, in bash:

shopt -s extglob
printf '%s\0' !(*.sha) | xargs ...as above...

Note also that running sha256sum on large files in parallel is likely to be disk-bound rather than CPU-bound, so you may well see similar speeds with a smaller number of parallel tasks.
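
If the disk does turn out to be the bottleneck, the same pipeline can simply be run with a lower job count; the 4 below is an arbitrary illustration to be tuned against your storage (the same caveat about excluding *.sha names applies):

printf '%s\0' * | xargs -0 -L 1 -P 4 sh -c 'sha256sum "$1" > "$1".sha' sh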


For a GNU parallel equivalent, replace xargs with parallel.
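
That would look something like the following sketch: GNU parallel reads the NUL-delimited list with -0 and substitutes each name for {}:

printf '%s\0' * | parallel -0 -P 16 'sha256sum {} > {}.sha'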


In the zsh shell, you can do it like this (with EXTENDED_GLOB set, the ^ in ^(*.sha) negates the pattern, so it matches every name except those ending in .sha):

autoload -U zargs
setopt EXTENDED_GLOB

zargs -P 16 -L 1 -- (^(*.sha)) -- sh -c 'sha256sum "$1" > "$1".sha' sh
Kusalananda

With GNU parallel, you can avoid the shell loop entirely and just run:

parallel -P 16 sha256sum {} ">"{}.sha ::: *

That will run sha256sum on each file (or directory, though your original script had the same caveat) matched by the glob * and save the output in fileName.sha. For example:

$ ls
file1  file2  file3  file4  file5
$ parallel -P 16 sha256sum {} ">"{}.sha ::: *
$ ls
file1      file2      file3      file4      file5
file1.sha  file2.sha  file3.sha  file4.sha  file5.sha

However, bear in mind what @Kusalananda pointed out about the main bottleneck in this sort of task being I/O rather than the CPU. You might want to run fewer than 16 jobs in parallel.
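
GNU parallel's -P (also -j/--jobs) option accepts a percentage of the detected CPUs, which is a convenient way to leave headroom; for instance, to use only half of them:

parallel -P 50% sha256sum {} ">"{}.sha ::: *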

terdon
    If you have 16 CPU threads, -P 16 is not needed. – Ole Tange Dec 09 '19 at 15:33
  • @OleTange the OP stated they have 16 physical cores, so I'm assuming many more threads. parallel defaults to the maximum available threads, not cores, right? – terdon Dec 09 '19 at 15:56
  • parallel defaults to CPU threads. If your CPU does not have hyperthreading (many AMD and ARM CPUs do not), then CPU cores == CPU threads. The behaviour can be changed with --use-cores-instead-of-threads and --use-sockets-instead-of-threads. So there is nothing wrong in what you write. – Ole Tange Dec 09 '19 at 16:12
    @OleTange OK, thanks. As always, feel very free to edit and add or correct anything. Note to anyone else: Ole is the author of parallel, so can be assumed to know what they're talking about. – terdon Dec 09 '19 at 16:14
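
Following up on Ole Tange's comment, a minimal sketch of letting parallel count physical cores rather than hyperthreads, using the flag he names:

# parallel then defaults to one job per physical core instead of per thread
parallel --use-cores-instead-of-threads sha256sum {} ">"{}.sha ::: *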