5

In Linux, is it possible to run a pipe:

cmd1 | cmd2

in such a way that:

  1. cmd2 doesn't start running until cmd1 has completely finished, and

  2. If cmd1 has an error, cmd2 doesn't run at all and the exit status of the pipe is the exit status of cmd1.

To give an example, how to make this pipe:

false | echo ok

output nothing and return a non-zero status?


Failed Solution 1

set -o pipefail

The pipe does have a non-zero exit status, but cmd2 is still run even if cmd1 fails.

Failed Solution 2

cmd1 && cmd2

This is not a pipe: there is no I/O redirection from cmd1 to cmd2.

Failed Solution 3

mkfifo /tmp/fifo
cmd1 > /tmp/fifo && cmd2 < /tmp/fifo

It blocks: opening the FIFO for writing blocks until a reader opens the other end, but because of &&, cmd2 (the reader) is not started until cmd1 finishes.

Suboptimal Solution

touch /tmp/file
cmd1 > /tmp/file && cmd2 < /tmp/file

This seems to be working. But it has several shortcomings:

  1. It writes data to disk, where I/O is slower. (Admittedly you can use tmpfs, but that's an additional system requirement.)

  2. You have to choose the temporary filename carefully. Otherwise it can accidentally overwrite an existing file. mktemp may help, but an unnamed pipe saves you the naming chore entirely.

  3. The filesystem on which the temporary file resides may not be big enough to hold the entire data.

  4. The temporary file isn't cleaned up automatically.

Cyker

5 Answers

4

We don't know the size of cmd1's output, but pipes have a limited buffer size. Once that amount of data has been written to the pipe, any subsequent write will block until someone reads from the pipe (much like your failed solution 3).

You must use a mechanism that guarantees not to block. For very large data, use a temporary file. Otherwise, if you can afford keeping the data in memory (that was the idea with pipes, after all), use this:

result=$(cmd1) && cmd2 < <(printf '%s' "$result")
unset result

Here the result of cmd1 is stored in the variable result. If cmd1 is successful, cmd2 is executed and is fed with the data in result. Finally, result is unset to release the associated memory.
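For instance, with the question's own example (a sketch; bash is assumed for the <(...) process substitution):

```shell
# A failing cmd1: "ok" is never printed, and the overall status is false's
result=$(false) && echo ok < <(printf '%s' "$result")
echo "status: $?"    # status: 1

# A succeeding cmd1: cmd2 runs only now, fed the captured output
result=$(printf 'hello') && tr a-z A-Z < <(printf '%s' "$result")    # HELLO
unset result
```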

Note: formerly, I used a here-string (<<< "$result") to feed cmd2 with data but Stéphane Chazelas observed that bash would then create a temporary file, which you don't want.

Answers to questions in comment:

  • Yes, commands can be chained ad libitum:

    result=$(cmd1) \
    && result=$(cmd2 < <(printf '%s' "$result")) \
    && result=$(cmd3 < <(printf '%s' "$result")) \
    ...
    && cmdN < <(printf '%s' "$result")
    unset result
    
  • No, the above solution is not suitable for binary data because:

    1. Command substitution $(...) eats all trailing newlines.
    2. Behavior is unspecified for NUL characters (\0) in the result of a command substitution (e.g. Bash would discard them).
  • Yes, to circumvent all those problems with binary data, you can use an encoder like base64 (or uuencode, or a home-made one that only takes care of NUL characters and the trailing newlines):

    result=$(cmd1 > >(base64)) && cmd2 < <(printf '%s' "$result" | base64 -d)
    unset result
    

    Here, I had to use a process substitution (>(...)) in order to keep cmd1's exit value intact.

That said, this is quite a hassle just to ensure that data are not written to disk. An intermediary temporary file is a better solution. See Stéphane's answer, which addresses most of your concerns about it.

xhienne
  • Does this variable method work well with binary data? Do we need base64 or something else to prevent data corruption? – Cyker Aug 26 '17 at 07:19
  • Plus, can we chain the same variable repeatedly so that we can have more commands in the pipe? – Cyker Aug 26 '17 at 07:20
  • @Cyker No, null bytes would be discarded from the result. – Kusalananda Aug 26 '17 at 07:42
  • @Cyker Answer updated – xhienne Aug 26 '17 at 12:31
  • 3
    @Kusalananda, that depends on the shell, of the 5 Bourne-like shells that support that <<< zsh operator, NULs would be preserved in zsh, discarded in bash and mksh and everything past the first NUL would be lost in ksh93 and yash. And in any of those shells, the trailing newlines would be lost and one would be added back by <<<. Also note that <<< is implemented with a temporary files in most. So in addition to being stored in memory, the data would also be written on the filesystem. – Stéphane Chazelas Aug 26 '17 at 12:48
  • @StéphaneChazelas Thanks. Would you know if <<< is implemented with temporary files in bash? I could check the code, but if you know... – Kusalananda Aug 26 '17 at 12:55
  • @StéphaneChazelas I'm not 100% certain, but I think Bash creates a temp file. Ok. – Kusalananda Aug 26 '17 at 13:33
  • 2
    @Kusalananda ls -ld /proc/self/fd/0 <<< x on Linux gives a deleted regular temp file in ${TMPDIR-/tmp} – Stéphane Chazelas Aug 26 '17 at 13:38
  • Thanks @StéphaneChazelas and Kusalananda, answer updated consequently – xhienne Aug 26 '17 at 14:48
3

The whole point of piping commands is to run them concurrently with one reading the output of the other. If you want to run them sequentially, and if we keep the plumbing metaphor, you'll need to pipe the output of the first command to a bucket (store it) and then empty the bucket into the other command.

But doing it with pipes means having two processes for the first command (the command and another process reading its output from the other end of the pipe to store in the bucket), and two for the second one (one emptying the bucket into one end of the pipe for the command to read it from the other end).

For the bucket, you'll need either memory or the file system. Memory doesn't scale well, and you'd still need the pipes. The filesystem makes much more sense: that's what /tmp is for. Note that the disks will likely never see the data anyway, as it may not be flushed there until much later (after you remove the temp file), and even if it is, it will likely remain in memory (cached). And when it doesn't fit in the cache, that's when the data would have been too big to hold in memory in the first place.

Note that temporary files are used all the time in shells. In most shells, here documents and here strings are implemented with temp files.

In:

cat << EOF
foo
EOF

Most shells create a temp file, open it for writing and for reading, delete it, fill it up with foo, and then run cat with its stdin duplicated from the fd open for reading. The file is deleted even before it is filled up (that gives the system a clue that whatever is written there doesn't need to survive a power loss).

You could do the same here with:

tmp=$(mktemp) && {
  rm -f -- "$tmp" &&
    cmd1 >&3 3>&- 4<&- &&
    cmd2 <&4 4<&- 3>&-
} 3> "$tmp" 4< "$tmp"

Then, you don't have to worry about clean-up as the file is deleted from the start. No need for extra processes to get the data in and out of buckets, cmd1 and cmd2 do it by themselves.
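As a concrete sketch, with printf and tr standing in for cmd1 and cmd2:

```shell
# Both fds are opened on the temp file before the body runs, so the file
# can be unlinked immediately; fd 3 is the write end, fd 4 reads from offset 0.
tmp=$(mktemp) && {
  rm -f -- "$tmp" &&
    printf 'hello\n' >&3 3>&- 4<&- &&
    tr a-z A-Z <&4 4<&- 3>&-
} 3> "$tmp" 4< "$tmp"    # prints HELLO
```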

If you wanted to store the output in memory, using a shell for that would not be a good idea. First, shells other than zsh can't store arbitrary data in their variables, so you'd need to use some form of encoding. And then, to pass that data around, you'd end up duplicating it in memory several times, if not writing it to disk when using a here-doc or here-string.

You could use perl instead for instance:

 perl -MPOSIX -e '
   sub status() {return WIFEXITED($?) ? WEXITSTATUS($?) : WTERMSIG($?) | 128}
   $/ = undef;
   open A, "-|", "cmd1" or die "open A: $!\n";
   $out = <A>;
   close A;
   $status = status;
   exit $status if $status != 0;

   open B, "|-", "cmd2" or die "open B: $!\n";
   print B $out;
   close B;
   exit status'
1

Here is a frankly awful version stitching together different tools from moreutils:

chronic sh -c '! { echo 123 ; false ; }' | mispipe 'ifne -n false' 'ifne echo ok'

It still isn't quite what you want: it returns 1 in case of failure, and zero otherwise. However, it doesn't start the second command unless the first succeeded, it returns a failing or succeeding code according to whether the first command worked or not, and it doesn't use files.

The more generic version is:

chronic sh -c '! '"$CMD1" | mispipe 'ifne -n false' "ifne $CMD2"

This pulls together three of the moreutils tools:

  • chronic runs a command quietly, unless it fails. In this case, we're running a shell to run your first command so that we can invert the success/failure result: it will run the command quietly if it fails, and print the output at the end if it succeeds.
  • mispipe pipes two commands together, returning the exit status of the first. This is similar to the effect of set -o pipefail. The commands are provided as strings so that it can tell them apart.
  • ifne runs a program if the standard input is non-empty, or if it is empty with -n. We're using it twice:

    • The first is ifne -n false. This runs false, and uses it as the exit code, iff the input is empty (meaning that chronic ate it, meaning that cmd1 failed).

      When the input is not empty, it doesn't run false, passes the input through like cat, and exits 0. The output will be piped into the next command by mispipe.

    • The second is ifne cmd2. This runs cmd2 iff the input is non-empty. That input is the output of ifne -n false, which will be non-empty exactly when the output of chronic was non-empty, which happens when the command succeeded.

      When the input is empty, cmd2 never runs, and ifne exits zero. mispipe discards the exit value anyway.


There are (at least) two remaining flaws in this approach:

  1. As mentioned, it loses the actual exit code of cmd1, reducing it to boolean true/false. If the exit code has meaning, that's lost. It would be possible to save the code to a file in the sh command, and re-load it later (ifne -n sh -c 'read code <FILENAME ; rm -f FILENAME; exit $code' or something) if that's necessary.
  2. If cmd1 can ever succeed with no output, everything falls apart anyway.

Plus, of course, it's multiple fairly-rare commands piped together, quoted carefully, with a non-obvious meaning.

Michael Homer
1

First off, your example false | echo ok is nonsensical since false would not output anything to its standard output and echo would not read from its standard input. The "solution" to this is false && echo ok.

cmd1 && cmd2

This will run cmd1 and will not start cmd2 until cmd1 has successfully finished executing.

In a pipeline, such as

cmd1 | cmd2

the two commands are always started concurrently (this is what you notice in your "Failed Solution 1"). The thing that synchronizes them is cmd2 reading from the output of cmd1. A pipeline is a way of passing output from one program into the input of another, concurrently running, program.
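That concurrency is easy to observe (a sketch): the right-hand side of the pipe prints before the left-hand side has finished.

```shell
# "cmd2 started" appears immediately, a full second before the writer's
# output: both processes were launched at the same time.
{ sleep 1; echo 'from cmd1'; } | { echo 'cmd2 started'; cat; }
```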

To simulate that cmd1 is outputting something that cmd2 reads, but to get rid of the concurrency, you would have to store the output from cmd1 in a temporary file that cmd2 reads:

cmd1 >outfile && cmd2 <outfile

The temporary file may be handled like this:

trap 'rm -f "$tmpfile"' EXIT
tmpfile=$(mktemp)

cmd1 >"$tmpfile" && cmd2 <"$tmpfile"

This sets up a trap that will trigger upon exiting the shell. The trap will remove the temporary file.
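Put together with the question's example commands (a sketch), the failing cmd1 suppresses cmd2, and the file is removed when the shell exits:

```shell
#!/bin/sh
trap 'rm -f "$tmpfile"' EXIT
tmpfile=$(mktemp)

# false writes nothing and fails, so "ok" is never printed
false >"$tmpfile" && echo ok <"$tmpfile"
echo "status: $?"    # status: 1
```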

If you have $TMPDIR on a memory filesystem, you will not incur any I/O penalty for writing to disk.

If you are worried about the size of the file, then you will be forced to store it on disk no matter what (a pipe would not be able to hold the contents either; this is what you noticed in your "Failed Solution 3").


Looking at xhienne's solution for Bash:

result=$(cmd1) && cmd2 <<< "$result"
unset result

This works if the result is text whose trailing newlines don't matter (the command substitution strips them), but it fails if the result contains null bytes (these will be discarded by bash).
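Both losses are easy to demonstrate (a sketch, assuming bash; newer bash versions print a null-byte warning on stderr):

```shell
# Trailing newlines are eaten by the command substitution: 4 bytes, not 7
result=$(printf 'line\n\n\n')
printf '%s' "$result" | wc -c

# NUL bytes are discarded by bash: 2 bytes, the \0 is gone
result=$(printf 'a\0b')
printf '%s' "$result" | wc -c
```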

To mitigate this, we could base64-encode the result:

set -o pipefail # ksh/zsh/bash
result=$( cmd1 | base64 ) && base64 -d <<<"$result" | cmd2
unset result

This is a terrible idea in terms of both memory and CPU usage, especially if the result is large (the base64 encoding of $result will be one third bigger than the binary). You are far better off writing the binary result to disk and reading it from there.

Note too that bash implements <<< using a temporary file in any case (at least through bash 5.0; newer versions may use a pipe when the data fits in the pipe buffer).

Kusalananda
-1

run a pipe cmd1 | cmd2 in such a way that:

cmd2 doesn't start running until cmd1 has completely finished

This is impossible in general. Read pipe(7), which reminds you that pipes have a limited capacity (typically 4 KiB or 64 KiB) and use some kernel memory for their buffer.

So the output of cmd1 goes into the pipe. When the pipe becomes full, any write(2) done by cmd1 to STDOUT_FILENO will block (unless cmd1 is specially coded to handle non-blocking I/O to stdout, which is very unusual) until cmd2 has read(2) from the other end of that pipe. If cmd2 has not started, that will never happen.
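A sketch of that blocking, with dd and sleep as stand-ins for cmd1 and cmd2; GNU timeout(1) is used only so the demo terminates:

```shell
# dd tries to write 1 MiB into the pipe; sleep never reads, so dd blocks
# once the pipe buffer is full, and timeout kills the stalled pipeline.
timeout 1 sh -c 'dd if=/dev/zero bs=1M count=1 2>/dev/null | sleep 5'
echo "status: $?"    # status: 124, i.e. the writer never completed
```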

I strongly recommend reading a book like Advanced Linux Programming, which explains all this in detail (an entire book is needed to explain it).