
When I run a command with xargs -n 1 -P 0 for parallel execution, the output is all jumbled. Is there a way to do parallel execution, but make sure that the entire output of the first execution is written to stdout before the output of the second execution starts, the entire output of the second execution is written to stdout before the output of the third execution starts, etc.?

For example, hashing many files containing a lot of data can be done like this:

printf "%s\0" * | xargs -r0 -n 1 -P 0 sha256sum

I tested this on a small amount of data (9 GB) and it was done in 5.7 seconds. Hashing the same data using

sha256sum *

took 34.1 seconds. I often need to hash large amounts of data (which can take hours), so processing this in parallel can get things done a lot faster.

The problem here is that the order of the output lines is wrong. In this case, it can be fixed by simply sorting the lines by the second column. But it's not always this easy. For example, that simple fix already breaks when sticking with the hashing example from above but hashing numbered files in order:

printf "%s\0" {1..10000} | xargs -r0 -n 1 -P 0 sha256sum

This requires more advanced sorting. If we leave the hashing example altogether, things get more complicated still.
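
For the two examples above, the post-sorting could look roughly like this (assuming the standard sha256sum output of a hash followed by the file name): a plain sort on the second column for the first case, and a numeric sort on that column for the numbered files.

printf "%s\0" * | xargs -r0 -n 1 -P 0 sha256sum | sort -k2
printf "%s\0" {1..10000} | xargs -r0 -n 1 -P 0 sha256sum | sort -k2,2n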

In the comments, I was asked whether I merely want to prevent interleaving of output. This is not the case. I want order to be preserved.

UTF-8
    Write the individual outputs to their own files, then concatenate these after the xargs command is done. Not a real answer as I'm not showing how to do this (but then again, you don't show how you actually run your xargs pipeline either). – Kusalananda Jul 05 '19 at 13:52
  • @Kusalananda I'm not sure how I would be able to collect them in the correct order. – UTF-8 Jul 05 '19 at 14:45
  • Name the output files wisely. – Kusalananda Jul 05 '19 at 14:46
  • Are you asking about preserving order, or avoiding interleaving? — There is a tool to avoid interleaving. I can't remember what it is called, but it will involve (potentially) a lot of storage. – ctrl-alt-delor Jul 05 '19 at 16:57
  • @ctrl-alt-delor I actually care about the order. My current use case only produces a single short line of output per operation but their order is important. However, the dispatcher of these commands doesn't need to be xargs. I need something that can handle any character except the null character in its arguments and can run commands in parallel and thus far xargs seemed to be a good fit (until I saw the output was jumbled). – UTF-8 Jul 05 '19 at 17:57
  • Can you add that to the question, and add an example? – ctrl-alt-delor Jul 07 '19 at 20:19
  • @ctrl-alt-delor I did. – UTF-8 Jul 08 '19 at 10:44
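
For reference, a rough sketch of the write-to-files-then-concatenate idea from the comments, using the numbered-files example (the out directory and the .sha suffix are purely illustrative names):

mkdir -p out
printf "%s\0" {1..10000} | xargs -r0 -n 1 -P 0 sh -c 'sha256sum "$1" > "out/$1.sha"' sh
cat out/{1..10000}.sha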

1 Answer


You can do it with GNU Parallel (--keep-order):

printf "%s\0" {1..10000} | parallel --keep-order -r0 -n 1 -P 0 sha256sum

--keep-order uses 4 file handles for each job whose printing is delayed. This will normally not cause any delay.
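
As a quick illustration of what --keep-order does, run a few jobs of different lengths with enough job slots: the job for 3 finishes last, yet its output is still printed first because it came first in the argument list.

parallel -j3 --keep-order 'sleep {}; echo {}' ::: 3 1 2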

If you have, say, 1000 file handles and a single job takes more than 250 times as long as the average job, GNU Parallel will use the remaining 996 file handles for other jobs. If the long-running job is still not done by then, GNU Parallel runs out of file handles and has to wait for the long job to finish. It will warn:

parallel: Warning: No more file handles.
parallel: Warning: Try running 'parallel -j0 -N 100 --pipe parallel -j0'
parallel: Warning: or increasing 'ulimit -n' (try: ulimit -n `ulimit -Hn`)
parallel: Warning: or increasing 'nofile' in /etc/security/limits.conf
parallel: Warning: or increasing /proc/sys/fs/file-max

and will then pause until the long job finishes. There will be no data loss.
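
If you expect to hit that limit, one of the warning's suggestions can be applied up front, for example raising the soft limit to the hard limit for the current shell before starting the jobs (whether the hard limit is high enough depends on your system):

ulimit -n "$(ulimit -Hn)"
printf "%s\0" {1..10000} | parallel --keep-order -r0 -n 1 -P 0 sha256sum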

Ole Tange
  • Passing -n 1 -P 0 doesn't even seem necessary because the default values do quite a good job already. Thank you! – UTF-8 Jul 08 '19 at 14:13
  • @ole-tange Does --keep-order delay in some way the execution of the parallel processes or does it just buffer the output of each process? What happens if the time required by each process is asymmetric? So you are using 8 parallel processes and the first one takes much longer, do the remaining 7 keep processing? Will those other 7 take new items to process even when the first one did not finish? – M.E. Jun 06 '23 at 06:14
  • GNU Parallel buffers on disk. -k delays printing the output until all previous jobs are done. Under normal load this does not delay processing. In the extreme case, it may, if you do not have enough file handles. Compare: time parallel -kvj10 sleep ::: 10 0.0{1..300} to ulimit -n50; time parallel -kvj10 sleep ::: 10 0.0{1..300}. GNU Parallel needs 4 file handles per non-printed job. Your system suffers no penalty if you increase the number of file handles to 1000000. See https://unix.stackexchange.com/questions/625616/what-is-the-historical-reason-for-limits-on-file-descriptors-ulimit-n – Ole Tange Jun 06 '23 at 10:45