I have a series of commands I run like this:

cmd1 < input > foo
cmd2 < foo > bar
cmd3 foo bar > output

Is there a way to do this without the intermediate files foo and bar?

I'd also like to avoid running cmd1 twice:

cmd3 <(cmd1 < input) <(cmd1 < input | cmd2) > output

All 3 commands can take hours to run, and file sizes are in the 1GB to 100GB range (bioinformatics).

Here's a contrived but runnable example:

function cmd1 { sed -r 's/[246]/x/g'; }
function cmd2 { sed -r 's/[135]/-/g'; }
function cmd3 { paste "$1" "$2"; }
seq 10 > input
cmd3 <(cmd1 < input) <(cmd1 < input | cmd2)  # cmd1 runs twice

output

1       -
x       x
3       -
x       x
5       -
x       x
7       7
8       8
9       9
10      -0

Not sure if this is helpful, but I want data to flow like this:

input --> cmd1 ---> cmd2  -->|
                |            |--> cmd3  --> output
                ------------>|

https://unix.stackexchange.com/a/43536 came close, but not quite.

Kusalananda
obk

1 Answer

Something like this, possibly?

rm -f fifo
mkfifo fifo

cmd1 <input | tee fifo | cmd2 | cmd3 fifo /dev/stdin >output

This creates a named pipe called fifo. The first command writes to both the named pipe and the second command's standard input using tee. The third command reads from the named pipe and standard input.
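As a rough sketch, the same pipeline can be tried on the contrived example from the question. The temporary directory and the `workdir` and `result` variables are assumptions made for this illustration, and `sed -E` stands in for the question's `sed -r` for portability. With only ten short lines everything fits in the pipe buffers, so this run cannot stall.

```shell
# The contrived commands from the question (-E used instead of -r).
cmd1() { sed -E 's/[246]/x/g'; }
cmd2() { sed -E 's/[135]/-/g'; }
cmd3() { paste "$1" "$2"; }

# Hypothetical scratch directory so the named pipe does not clash with anything.
workdir=$(mktemp -d)
mkfifo "$workdir/fifo"
seq 10 > "$workdir/input"

# cmd1 runs once; tee copies its output to the named pipe and to cmd2.
# cmd3 reads cmd1's output from the named pipe and cmd2's output from stdin.
result=$(cmd1 < "$workdir/input" | tee "$workdir/fifo" | cmd2 |
         cmd3 "$workdir/fifo" /dev/stdin)

printf '%s\n' "$result"
rm -rf "$workdir"
```

This should print the same ten tab-separated lines as the two-invocation version in the question, with cmd1 executed only once.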

None of the intermediate data is stored on disk, but the pipeline would likely deadlock in two situations: if cmd3 does not consume from the named pipe before that pipe's buffer is full, or if cmd2 produces very little data compared to the amount it consumes, so that cmd3 is left waiting on standard input and does not read enough from fifo before the buffer fills. In either case tee can no longer write to fifo, which also stops it from writing to cmd2. You could possibly remedy this by using something like pv to buffer the data from the named pipe, as with cmd3 <( pv --quiet <fifo ) /dev/stdin, or on the writer side with tee >( pv --quiet >fifo ) (or some variation thereof).
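One possible variation on the reader-side buffer, sketched on the question's contrived example: instead of process substitution, pv drains fifo into a second named pipe that cmd3 reads. The `buffer` helper, the second pipe name `buffered`, the scratch directory, and the 64M buffer size are all assumptions made for this illustration. If pv is not installed the helper falls back to cat, which adds no extra buffering. Note that any buffer of finite size only postpones a stall; it does not rule one out for inputs in the 100GB range.

```shell
# The contrived commands from the question (-E used instead of -r).
cmd1() { sed -E 's/[246]/x/g'; }
cmd2() { sed -E 's/[135]/-/g'; }
cmd3() { paste "$1" "$2"; }

# Hypothetical helper: pv as an in-memory buffer when available.
if command -v pv >/dev/null 2>&1; then
    buffer() { pv --quiet --buffer-size 64M; }   # 64M is an arbitrary choice
else
    buffer() { cat; }                            # fallback, no extra buffering
fi

workdir=$(mktemp -d)
mkfifo "$workdir/fifo" "$workdir/buffered"
seq 10 > "$workdir/input"

# Reader-side buffer: drain fifo in the background into the second pipe.
buffer < "$workdir/fifo" > "$workdir/buffered" &

# Same pipeline as above, except cmd3 reads from the buffered pipe.
result=$(cmd1 < "$workdir/input" | tee "$workdir/fifo" | cmd2 |
         cmd3 "$workdir/buffered" /dev/stdin)
wait   # let the background buffer process finish

printf '%s\n' "$result"
rm -rf "$workdir"
```

The output should match the unbuffered pipeline; the only difference is that tee can keep writing to fifo while cmd3 is waiting on cmd2, for as long as the buffer has room.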

Kusalananda
  • How does the "deadlock" arise? cmd1 fails(?) to handle(?) the scenario where its output file handle is temporarily(?) full? – obk Feb 17 '22 at 00:09
  • @obk If cmd2 produces very little data compared to what it reads, for example, or if cmd3 does not read enough from fifo, allowing the pipe buffer to fill, which stops tee from being able to write to fifo and to cmd2. – Kusalananda Feb 17 '22 at 06:49