I want to get both the number of bytes and the sha1sum of a command's output.

In principle, one can always do something like:

BYTES="$( somecommand | wc -c )"
DIGEST="$( somecommand | sha1sum | sed 's/ .*//' )"

...but, for the use-case I am interested in, somecommand is rather time-consuming, and produces a ton of output, so I'd prefer to call it only once.

One way that comes to mind is something like:

evil() {
  {
    somecommand | \
      tee >( wc -c | sed 's/^/BYTES=/' ) | \
      sha1sum | \
      sed 's/ .*//; s/^/DIGEST=/'
  } 2>&1
}

eval "$( evil )"

...which seems to work, but makes me die a little inside.

I wonder if there is a better (more robust, more general) way to capture the output of different segments of a pipeline into separate variables.

EDIT: The problem I am working on at the moment is in bash, so I am mostly interested in solutions for this shell, but I do a lot of zsh programming also, so I have some interest in those solutions as well.

EDIT2: I tried to port Stéphane Chazelas' solution to bash, but it didn't quite work:

#!/bin/bash

cmd() {
  printf -- '%1000s'
}

bytes_and_checksum() {
  local IFS
  cmd | tee >(sha1sum > $1) | wc -c | read bytes || return
  read checksum rest_ignored < $1 || return
}

set -o pipefail
unset bytes checksum
bytes_and_checksum "$(mktemp)"
printf -- 'bytes=%s\n' $bytes
printf -- 'checksum=%s\n' $checksum

When I run the script above, the output I get is

bytes=
checksum=96d89030c1473585f16ec7a52050b410e44dd332

The value of checksum is correct. I can't figure out why the value of bytes is not set.

EDIT3: OK, thanks to @muru's tip, I fixed the problem:

#!/bin/bash

cmd() {
  printf -- '%1000s'
}

bytes_and_checksum() {
  local IFS
  read bytes < <( cmd | tee >(sha1sum > $1) | wc -c ) || return
  read checksum rest_ignored < $1 || return
}

set -o pipefail
unset bytes checksum
bytes_and_checksum "$(mktemp)"
printf -- 'bytes=%s\n' $bytes
printf -- 'checksum=%s\n' $checksum

Now:

bytes=1000
checksum=96d89030c1473585f16ec7a52050b410e44dd332

UNFORTUNATELY...

...my bytes_and_checksum function stalls (deadlock?) when cmd produces a lot more output than was the case in my toy example above.

Back to the drawing board...

kjo
  • why don't you just save the output of somecommand in a variable, or write it to a temporary file if it's prohibitively large? – Marcus Müller Nov 02 '23 at 15:12
  • and why the eval? simply call your function. – Marcus Müller Nov 02 '23 at 15:13
  • @MarcusMüller: regarding your second question, the code is meant to be part of a larger script, which will go on to do stuff with BYTES and DIGEST. – kjo Nov 02 '23 at 15:16
  • For your edit: https://unix.stackexchange.com/q/143958/70524 – muru Nov 02 '23 at 16:56
  • for cmd | read bytes to work in bash, you need shopt -s lastpipe with set +o monitor – Stéphane Chazelas Nov 04 '23 at 08:52
  • I can't reproduce your stalling with bash 5.2 – Stéphane Chazelas Nov 04 '23 at 08:52
  • @StéphaneChazelas: Thank you for looking into it. I will re-check and report back. – kjo Nov 04 '23 at 11:36
  • ps would help to show which processes hang around, and strace what they're doing/waiting for. – Stéphane Chazelas Nov 04 '23 at 11:39
  • @StéphaneChazelas: Apologies! I too cannot reproduce the stalling I claimed earlier. More precisely, I no longer have the code that exhibited the stalling I reported earlier, and every attempt I've made to reconstruct it from memory has resulted in code that produces the correct result in a reasonable time. – kjo Nov 04 '23 at 12:09
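
The lastpipe suggestion from the comments can be sketched as follows. This is only a minimal illustration, assuming bash 4.2 or later run as a non-interactive script; cmd is the toy command from the question. With lastpipe, the final stage of a pipeline runs in the current shell instead of a subshell, so the variable set by read survives.

```shell
#!/bin/bash
shopt -s lastpipe   # run the last pipeline stage in the current shell
set +o monitor      # lastpipe only takes effect while job control is off

cmd() { printf -- '%1000s'; }   # toy command from the question

unset bytes
cmd | wc -c | read -r bytes     # read now runs in this shell, not a subshell
printf 'bytes=%s\n' "$bytes"
```

Job control is already off in scripts by default, so the set +o monitor line mainly matters when trying this at an interactive prompt.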

2 Answers

It would be easier to use temp files. In zsh:

(){set -o localoptions -o pipefail; local IFS
  {cmd} > >(sha1sum > $1) | wc -c | read bytes || return
  read checksum rest_ignored < $1 || return
} =()

Beware that many wc implementations pad the number they output with whitespace. read with the default value of $IFS strips that padding.
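
As a small illustration of that stripping (the padded string below is a made-up stand-in for what some wc implementations print, not captured output):

```shell
#!/bin/bash
padded='    1000 '              # hypothetical padded wc output

# Default IFS (space, tab, newline): leading/trailing blanks are stripped.
read -r bytes <<<"$padded"
printf '[%s]\n' "$bytes"        # [1000]

# Empty IFS: read keeps the line exactly as it is.
IFS= read -r raw <<<"$padded"
printf '[%s]\n' "$raw"          # [    1000 ]
```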

Note that the exit status of sha1sum is lost.

There, =() creates a temp file holding the (empty) output of an empty command. That temp file is removed automatically when the command it's given to (here an anonymous function) returns.

In cmd > file | other-cmd, cmd's output is teed internally by zsh (its multios feature) since stdout is redirected twice, so here it goes to both sha1sum and wc. We wrap cmd in {...} to make sure zsh waits for the process substitution to finish.

Here, as the output of both sha1sum and wc is guaranteed to be no larger than a few bytes, both could also be sent to pipes, and you would not have to read from those pipes concurrently (which zsh can do, as it has an interface to select()/poll(), but bash cannot). The pipes can be read sequentially without causing deadlocks, so this is an easy case of teeing into different variables.

On Linux-based systems (where /dev/fd/x when x is a fd to a pipe behaves like a named pipe):

{
  IFS=$' \t' read bytes < <(cmd 3<&- | tee >(sha1sum > /dev/fd/3) | wc -c)
  IFS=$' \t' read sum rest <&3
} 3< <(:)

(would even work in bash).
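
A variant of the same idea for bash that does not depend on /dev/fd/3 reopening a pipe, offered only as a sketch: both short results are written as key=value lines to fd 3, which is the pipe read by a single command substitution. The function name bytes_and_digest, the key names, and the use of fd 3 are my own assumptions, not part of the code above. Since the command substitution reads until every writer has closed the pipe, it does not return before sha1sum has finished.

```shell
#!/bin/bash

# Hypothetical helper: runs "$@" once and sets the global
# variables bytes and digest.
bytes_and_digest() {
  local report key value
  report=$(
    set -o pipefail   # only affects this command-substitution subshell
    {
      "$@" 3>&- |
        tee >(sha1sum | sed 's/ .*//; s/^/digest=/' >&3) |
        wc -c | sed 's/^ */bytes=/' >&3
    } 3>&1            # fd 3 is the command-substitution pipe
  ) || return

  # Parse the two short key=value lines without eval.
  while IFS='=' read -r key value; do
    case $key in
      bytes)  bytes=$value ;;
      digest) digest=$value ;;
    esac
  done <<<"$report"
}

cmd() { printf -- '%1000s'; }   # toy command from the question

bytes_and_digest cmd &&
  printf 'bytes=%s\ndigest=%s\n' "$bytes" "$digest"
```

As with the other variants, the exit status of sha1sum itself is still lost, because it runs inside a process substitution; only failures of the pipeline stages are reported through pipefail.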

For details about the deadlocks you'd run into with larger outputs, see also "tee + cat: use an output several times and then concatenate results".

  • I deleted my previous comment, since it is now mostly obsolete, except for the "thank you" part. So, thank you again! – kjo Nov 04 '23 at 14:54

I am using a backup bash script which has the following helper "in-between" functions which take a "supposed filename" as an argument (see tar.gz example below):

function pipesum
{
  tee >(sha1sum | awk --assign F="${1##*/}" '$2=F' > "${1?}.sha1")
}
function pipelen
{
  tee >(wc -c > "${1?}.len")
}
function pipesumlen
{
  tee >(sha1sum | awk --assign F="${1##*/}" '$2=F' > "${1?}.sha1") >(wc -c > "${1?}.len")
}
function pipechecksum
{
  tee >(sha1sum --quiet -c <(awk '$2="-"' "${1?}") >&2)
}

Example:

$ echo 123 | pipesumlen filename
123
$ ls filename*
filename.len  filename.sha1
$ cat filename*
4
a8fdc205a9f19cc1c7507a60c4f01b13d11d7fd0 filename
$ echo 123 | pipechecksum filename.sha1
123
$ echo 1234 | pipechecksum filename.sha1
1234
-: FAILED
sha1sum: WARNING: 1 computed checksum did NOT match

I am using it in a script which is very time-, CPU- and IO-intensive, something like this:

tar | 
  pipesumlen mybackup.tar | 
  gzip > mybackup.tar.gz
<mybackup.tar.gz gunzip | 
  pipechecksum mybackup.tar.sha1 | 
  xz > mybackup.tar.xz

Thus my backup is checked against random memory/disk bit flips. It creates the "mybackup.tar.sha1" file as if "mybackup.tar" had actually been created and checksummed, while in truth, in this example, the uncompressed data is never written to disk.

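Caveat: this pipechecksum does not terminate the script on a checksum mismatch, even with set -euo pipefail, because the exit status of a command run inside a process substitution is not propagated to the caller. An alternative pipechecksum which does return nonzero on checksum mismatch: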

function pipechecksum
{
  { tee /dev/fd/$N | sha1sum --quiet -c <(awk '$2="-"' "${1?}") >&2; } {N}>&1
}

It seems fine, but I came up with it only today and cannot consider it proven.
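
The difference between the two pipechecksum variants can be reproduced without sha1sum. The sketch below is only an illustration, assuming bash 4.1 or later and a system that provides /dev/fd; failing_check, via_procsub and via_pipeline are hypothetical stand-ins, with failing_check playing the role of a checker that always reports a mismatch.

```shell
#!/bin/bash
set -o pipefail

# Hypothetical stand-in for a failing checker: consumes stdin, then fails.
failing_check() { cat >/dev/null; return 1; }

# Checker inside a process substitution: its exit status is discarded.
via_procsub() { tee >(failing_check); }

# Checker as a pipeline stage: the data goes to the original stdout via
# /dev/fd/$N, and pipefail reports the checker's failure.
via_pipeline() { { tee "/dev/fd/$N" | failing_check; } {N}>&1; }

echo 123 | via_procsub >/dev/null;  procsub_status=$?
echo 123 | via_pipeline >/dev/null; pipeline_status=$?
printf 'procsub=%s pipeline=%s\n' "$procsub_status" "$pipeline_status"
# prints: procsub=0 pipeline=1
```

One thing to keep in mind with the /dev/fd/$N form: on Linux, tee reopens whatever stdout points to, so when stdout is a regular file rather than a pipe or terminal the file is opened a second time with truncation.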

legolegs