
I don't understand how data flows through a pipeline and hope someone can clarify what is going on.

I thought a pipeline of commands processes files (text, arrays of strings) line by line, provided each command itself works line by line: each line of text passes through the whole pipeline, and no command waits for the previous one to finish processing its entire input.

But that does not seem to be the case.

Here is a test example. There are some lines of text. I uppercase them and repeat each line twice. I do so with cat text | tr '[:lower:]' '[:upper:]' | sed 'p'.

To follow the process we can run it "interactively" by omitting the input filename from cat. Each part of the pipeline, run on its own, works line by line:

$ cat | tr '[:lower:]' '[:upper:]'
alkjsd
ALKJSD
sdkj
SDKJ
$ cat | sed 'p'
line1
line1
line1
line 2
line 2
line 2

But the complete pipeline waits for me to finish the input with EOF and only then prints the result:

$ cat | tr '[:lower:]' '[:upper:]' | sed 'p'
I am writing...
keep writing...
now ctrl-D
I AM WRITING...
I AM WRITING...
KEEP WRITING...
KEEP WRITING...
NOW CTRL-D
NOW CTRL-D

Is this the expected behavior? Why isn't it line by line?

  • It's not the pipe, it's cat buffering until stdin closes. – goldilocks Jan 31 '15 at 21:46
  • but tr and sed do process lines from cat before stdin closes – xealits Jan 31 '15 at 21:49
  • The defaults used by stdio (which I believe all of the mentioned programs use) are that stderr is unbuffered, and stdout is line buffered when writing to a terminal and fully buffered otherwise (for example if it is writing to a file or a pipe). Some of the commands have flags which can change the stdout buffering, but it looks like tr does not. – kasperd Feb 01 '15 at 20:45

2 Answers


There's a general buffering rule followed by the C standard I/O library (stdio) that most unix programs use. If output is going to a terminal, it is flushed at the end of each line; otherwise it is flushed only when the buffer (8K on my Linux/amd64 system; could be different on yours) is full or the stream is closed.
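
A quick way to see which of the two cases a command falls into is to ask whether its standard output is a terminal. The sketch below is only an illustration: the function name stdout_buffering is made up, and it reports which stdio default would apply, not what any particular program actually does.

```shell
#!/bin/sh
# Hypothetical helper: report which stdio default would apply to stdout.
# [ -t 1 ] is true when file descriptor 1 (stdout) is a terminal.
stdout_buffering() {
    if [ -t 1 ]; then
        echo "line buffered (terminal)"
    else
        echo "fully buffered (not a terminal)"
    fi
}

stdout_buffering          # stdout is whatever the script was started with
stdout_buffering | cat    # stdout is a pipe here
```

The second call always reports the pipe case; the first depends on where the script's own output is going.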

If all your utilities were following the general rule, you would see output delayed in all of your examples (cat|sed, cat|tr, and cat|tr|sed). But there's an exception: GNU cat never buffers its output. It either doesn't use stdio or it changes the default stdio buffering policy.

I can be fairly sure you're using GNU cat and not some other unix cat because the others wouldn't behave this way. Traditional unix cat has a -u option to request unbuffered output. GNU cat ignores the -u option because its output is always unbuffered.

So whenever you have a pipe with cat on the left, on a GNU system, the passage of data through the pipe will not be delayed. cat isn't even going line by line; your terminal is doing that. While you're typing input for cat, your terminal is in "canonical" mode: line-based, with editing keys like backspace and ctrl-U offering you the chance to edit the line you have typed before sending it with Enter.

In the cat|tr|sed example, tr is still receiving data from cat as soon as you press Enter, but tr is following the stdio default policy: its output is going to a pipe, so it doesn't flush after each line. It writes to the second pipe when the buffer is full, or when an EOF is received, whichever comes first.
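
One way to watch this happen without typing is to timestamp the lines as they leave tr. This is a rough sketch, not a measurement tool: stamp_arrivals is a made-up helper, it assumes a date that supports +%s (GNU and BSD do), and the resolution is whole seconds.

```shell
#!/bin/sh
# Hypothetical helper: prefix each input line with the number of whole
# seconds elapsed since the function started reading.
stamp_arrivals() {
    start=$(date +%s)
    while IFS= read -r line; do
        now=$(date +%s)
        printf '%ss %s\n' "$((now - start))" "$line"
    done
}

# Producer: two lines written one second apart.
# tr's stdout is a pipe here, so the stdio default is full buffering.
{ echo one; sleep 1; echo two; } |
    tr '[:lower:]' '[:upper:]' |
    stamp_arrivals
```

If tr follows the stdio default, I would expect both lines to carry about the same stamp, because tr only writes when its input ends. Taking tr out of the pipeline should show the lines arriving a second apart.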

sed is also following the stdio default policy, but its output is going to a terminal so it will write each line as soon as it has finished with it. This has an effect on how much you must type before something shows up on the other end of the pipeline: if sed were block-buffering its output, you'd have to type twice as much (to fill tr's output buffer and sed's output buffer).

GNU sed has a -u option, so if you reversed the order and used cat|sed -u|tr you would see the output appear instantly again. (The sed -u option might be available elsewhere, but I don't think it's an ancient unix tradition like cat -u.) As far as I can tell there's no equivalent option for tr.
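
Spelled out, the reversed pipeline could look like the sketch below. The function name twice_upper is made up, and the probe for -u is my own addition so the example still runs where sed does not accept that option (in which case the output would be delayed as before).

```shell
#!/bin/sh
# Hypothetical helper: duplicate each line with sed, then uppercase with tr.
# Uses sed -u when this sed accepts it (assumption: GNU sed or compatible).
twice_upper() {
    # Probe with stdin from /dev/null so the real input is not consumed.
    if sed -u 'p' </dev/null >/dev/null 2>&1; then
        sed -u 'p' | tr '[:lower:]' '[:upper:]'
    else
        sed 'p' | tr '[:lower:]' '[:upper:]'
    fi
}

printf 'line1\n' | twice_upper
```

Here tr is the last command, so when you run it interactively its output goes to the terminal and is line buffered; sed is the one writing to a pipe, and -u takes care of that.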

There is a utility called stdbuf which lets you alter the buffering mode of any command that uses the stdio defaults. It's a bit fragile since it uses LD_PRELOAD to accomplish something the C library wasn't designed to support, but in this case it seems to work:

cat | stdbuf -o 0 tr '[:lower:]' '[:upper:]' | sed 'p'
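
If unbuffered output is more than you need, stdbuf also accepts -oL for line buffering. The sketch below wraps the pipeline in a made-up function, upper_twice, and falls back to plain tr when stdbuf is not installed; I'm assuming the GNU coreutils stdbuf and a dynamically linked tr that uses stdio.

```shell
#!/bin/sh
# Hypothetical helper: uppercase each line and print it twice.
# With stdbuf available, tr's stdout is switched to line buffering (-oL).
upper_twice() {
    if command -v stdbuf >/dev/null 2>&1; then
        stdbuf -oL tr '[:lower:]' '[:upper:]' | sed 'p'
    else
        # Fallback: same result, but tr may hold its output until EOF.
        tr '[:lower:]' '[:upper:]' | sed 'p'
    fi
}

printf 'line1\n' | upper_twice
```

Run interactively as cat | upper_twice, this should echo each line twice as soon as you press Enter, provided stdbuf is able to affect tr on your system.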
9

This actually took me some thought to understand and even more to answer. Great question (I'll upvote it next).

You neglected to try tr | sed in your debugging items above:

>tr '[:lower:]' '[:upper:]' | sed 'p'
i am writing
still writing
now ctrl-d
I AM WRITING
I AM WRITING
STILL WRITING
STILL WRITING
NOW CTRL-D
NOW CTRL-D
>

So evidently tr buffers. Learn something new every day!

EDIT:

As I think this over, we have isolated the cause but not provided an explanation. If you cat | tr, it writes right away; if you cat | sed, it writes right away; but if you tr | sed, it waits for EOF. I would suggest the answer is buried in the tr or sed source code, then, and is not a pipe issue.

EDIT:

I see Wumpus provided the explanation while I was typing the last edit. Thanks!

  • Indeed they buffer! And the test with roughly 8 KB of input, as Wumpus mentioned, shows the buffer is indeed 8 KB. I'd like to accept both answers to share some reputation around, but I will take Wumpus's as the more complete one. Thanks anyway! – xealits Jan 31 '15 at 23:53
  • No problem, mine was the empirical answer, his was the knowledgeable one. – Poisson Aerohead Feb 01 '15 at 00:26
  • See also this question which shows how to use stdbuf which might also be helpful. http://unix.stackexchange.com/questions/182537/write-python-stdout-to-file-immediately?newsletter=1&nlcode=17737|321c – Joe Feb 06 '15 at 19:46