What's wrong with this branched data flow using `tee`, fifos, and `paste`?

Question

In the spirit of this question, I want to create a branched data flow from a single source:

cmd1 ──> tee ──────────────> │          
         ├─────> cmd2 ─────> │ cmd4  
         └─────> cmd3 ─────> │

Unlike that question, I want my output interleaved, not one command at a time. I can do that with named pipes and paste:

$ mkfifo fifo1 fifo2
$ seq 1 100 \
    | tee \
      >(awk '$0+=1' > fifo1) \
      >(awk '$0+=2' > fifo2) \
    | paste - fifo1 fifo2

This seems to work fine; i.e., it prints

1 2 3
2 3 4
...
100 101 102

That's a notional example to illustrate the concept. My real pipeline looks like this:

find "$1" -type d -print0 \
  | tee \
    >(xargs -0 -n1 du -bs | cut -f1 > fifo1) \
    >(xargs -0 -n1000 stat --printf="%G\n" > fifo2) \
  | xargs -0 -n1 echo \
  | paste - fifo1 fifo2

This also seems to work fine in most cases. But when I run it on a huge filesystem, it eventually hangs in the middle. From reading the above question, I suspect that I have a deadlock happening. But I don't quite see where it could be -- it seems like paste should be keeping all the data flowing.

I guess I still don't understand buffering and data flow in tee, pipes, and fifos. Could anyone explain what I'm missing here? Where is the blockage and how do I fix it? (Or investigate further?)

Does it deadlock if you use -n1 instead of -n1000 in the second process substitution? — Kusalananda, May 18 '21 at 16:28

Ole Tange · Answer 1 · 2021-05-18T22:10:58.207

Writing to and reading from several pipes in parallel requires extreme discipline.

Assume that a pipe can buffer 128 KB. Now assume that every filename is 1 MB long. A filename will not fit in the pipe, so the writer blocks until the reader has read something. But if the reader is itself blocked waiting for input from another pipe, that never happens.
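The blocking can be observed directly. This is a minimal sketch, assuming the default pipe buffer is well under 1 MiB (it is 64 KiB on current Linux); writer_state is a hypothetical helper name, not part of any tool.

```shell
# Sketch: does a writer finish before the reader starts reading?
# $1 = number of bytes to write into the pipe.
# Assumption: the pipe buffer is smaller than 1 MiB.
writer_state() {
    dir=$(mktemp -d) || return 1
    {
        head -c "$1" /dev/zero     # the writer
        : > "$dir/done"            # reached only after every byte was written
    } | {
        sleep 1                    # the reader is busy elsewhere for a second
        if [ -e "$dir/done" ]; then echo "finished"; else echo "blocked"; fi
        cat > /dev/null            # now drain the pipe so the writer can exit
    }
    rm -rf "$dir"
}

writer_state 1024       # fits in the buffer: prints "finished"
writer_state 1048576    # exceeds the buffer: prints "blocked"
```

In the second call the writer only completes because the reader eventually drains the pipe. If the reader were waiting on some other pipe instead, both would wait forever.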

In your example, xargs -n1000 reads 1000 filenames (1000 MB under the assumption above) before stat outputs anything, so paste will be waiting for input from fifo2 while tee is blocked trying to write to stdout.
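The stall can be reproduced on a small scale. In this sketch, sort -n stands in for any branch that holds back its output until it has read a lot of input (an illustrative substitute, not the pipeline from the question). branch_demo is a hypothetical name, timeout is assumed to be the GNU coreutils version, and the pipe buffer is assumed to have its default size.

```shell
# Sketch: a branch that emits nothing until EOF stalls the whole flow.
# $1 = how many lines to feed in. Prints paste's line count, or "deadlock"
# if the pipeline is still stuck after 3 seconds.
branch_demo() {
    dir=$(mktemp -d) || return 1
    mkfifo "$dir/fifo1"
    # Process substitution needs bash, so run the pipeline under bash -c.
    timeout 3 bash -c '
        seq 1 "$1" | tee >(sort -n > "$2/fifo1") | paste - "$2/fifo1" | wc -l
    ' _ "$1" "$dir" || echo "deadlock"
    rm -rf "$dir"
}

branch_demo 100       # everything fits in the pipe buffers: prints 100
branch_demo 200000    # about 1.3 MB of input: prints "deadlock"
```

With the large input, paste reads one chunk from stdin and then waits on fifo1, sort waits for EOF from tee, and tee is blocked writing to the full stdout pipe, so nobody can make progress.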

Try this instead:

du_cut() { du -bs "$1" | cut -f1; }
stat_print() { stat --printf="%G\n" "$1"; }
doit() { printf "%s\t%s\n" "$(du_cut "$1")" "$(stat_print "$1")"; }
export -f du_cut stat_print doit
find "$1" -type d -print0 | parallel -0 --tag doit
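If GNU parallel is not available, the same idea (compute every column for one directory before moving on, so no two pipes have to stay in step) can be sketched with find -exec. This is a slower, sequential alternative rather than the approach above; dir_report is a hypothetical name, and GNU du (-b) and GNU stat (--printf) are assumed, as in the question.

```shell
# Sketch: one output line per directory with path, size in bytes and group.
# All three columns are produced by the same short-lived sh process,
# so there are no fifos and nothing to deadlock on.
dir_report() {
    find "$1" -type d -exec sh -c '
        size=$(du -bs "$1" | cut -f1)
        group=$(stat --printf="%G" "$1")
        printf "%s\t%s\t%s\n" "$1" "$size" "$group"
    ' sh {} \;
}
```

Calling dir_report "$1" from a script prints tab-separated lines in the order find visits the directories.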
