First, let's clear up a couple of things.
The shell flushes its output after every built-in command, such as `echo` in this case. It has to: it must mimic the behavior as if it were an external command (wherever possible; here `echo` is a shell built-in rather than the external `/bin/echo` purely for performance reasons). External commands obviously have to flush (or discard for good) their output when they quit. The built-in `echo` has to behave the same way.
It's the expected behavior that if an `echo` is followed by a long wait or computation in your script, the `echo`ed data is already available on the output.
This behavior is independent of where the standard output is connected: it could be a terminal, a pipe, a file, it doesn't matter. Each `echo` is flushed independently. (Not to be confused with the default behavior of libc, where buffering depends on whether the standard output goes to a terminal or not, as you can see with most of the standard utilities, like `cat`, `grep`, `head`, and so on, and also `tee` of course. The shell flushes explicitly after every built-in command, rather than relying on libc's default buffering.)
Use `strace ./run.sh > out1.txt` to see a million `write()` calls performed by your shell.
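As a quick self-contained sketch of the same observation (assumes a Linux system; if `strace` isn't installed, the commands still run, you just don't get the trace):

```shell
# Each built-in echo should show up as its own write(2) in the trace,
# whether stdout goes to a terminal, a pipe, or a file.
# strace prints the trace on stderr, so it stays visible here.
if command -v strace >/dev/null 2>&1; then
  strace -e trace=write sh -c 'echo one; echo two; echo three' > /dev/null
else
  sh -c 'echo one; echo two; echo three' > /dev/null
fi
echo "trace demo done"
```

You should see one `write(1, ...)` per `echo` in the trace, not a single combined write.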
I assume that you have multiple CPU cores in your system and no significant load from other processes. In this setup the kernel assigns `bash run.sh` to one of the cores and `tee` to another. This way no heavyweight process switching takes place, and if they both can actually run, then they both do run simultaneously.
Presumably, if you confine the two processes to a single core (I believe you can do this with the `taskset` command; I'll let you experiment with it), you'd get vastly different results, with `tee` slowing things down significantly. It's not just an extra process that needs to run serially, interleaved with `run.sh`: the kernel would also need to switch between the two processes many times, and this switching itself is quite costly.
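A sketch of that experiment (assumes Linux with `taskset` from util-linux; `demo_run.sh` is a small stand-in loop I'm making up here, not the original `run.sh`):

```shell
# Create a stand-in script that performs many small, unbuffered write()s.
cat > /tmp/demo_run.sh <<'EOF'
#!/bin/sh
i=0
while [ "$i" -lt 10000 ]; do
  echo "$i"
  i=$((i + 1))
done
EOF
chmod +x /tmp/demo_run.sh

# Pin the whole pipeline to CPU core 0, forcing the shell and tee to
# interleave on one core. Prefix with `time` and compare against the
# unpinned run to see the context-switching cost.
if command -v taskset >/dev/null 2>&1; then
  taskset -c 0 sh -c '/tmp/demo_run.sh | tee /tmp/out2.txt > /tmp/out1.txt'
else
  sh -c '/tmp/demo_run.sh | tee /tmp/out2.txt > /tmp/out1.txt'
fi
wc -l < /tmp/out1.txt
```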
`time` measures the entire pipeline, that is, `run.sh` and `tee` combined. If you're interested in measuring only one of the commands, invoke `time` in a subshell, like:

```shell
$ ( time ./run.sh ) | tee out2.txt > out1.txt
$ ./run.sh | ( time tee out2.txt ) > out1.txt
```
For the `real` time, the wall clock time elapsed is printed. That is, it's as if you literally printed the timestamp before and after the pipeline and computed the difference, or used an external stopwatch. If two processes in a pipeline each spin one CPU core for 10 seconds, and both of them can run all the time, fully in parallel, the real time will be 10 seconds.
The `user` and `sys` times, however, add up across CPU cores. If two parallel processes, each on its own CPU core, spin the CPU to its maximum for 10 seconds (the real time being 10 seconds, as we've just seen), the user time will be 20 seconds.
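A minimal sketch of that effect, with two CPU-bound busy loops running concurrently (the exact timings depend on your machine, so none are claimed here):

```shell
# Two busy loops run in the background at the same time. With two free
# cores, `real` is roughly one loop's duration, while `user` is roughly
# the sum of both loops. To see the numbers, wrap the whole thing:
#   time sh -c '<this script>'
spin() { i=0; while [ "$i" -lt 100000 ]; do i=$((i + 1)); done; }
spin & spin &
wait
echo "both spinners finished"
```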
Now, let's put all of this together:
There's only one question left to answer: Why is it faster to write a tiny chunk of data to a pipe than to a file?
I don't have a direct answer for this; I'm just reasoning backwards from the timing results you measured: it must be faster to write to a pipe than to a file. What follows is somewhat speculative, but hopefully reasonable.
Pipes have a fixed size (I believe 64 kB). It could easily be that the entire buffer is allocated when the pipe is created, so no further dynamic allocation takes place in the kernel. (If the size is reached, the writer side blocks until some space is freed up by the reader.) For files, however, there is no such limit. Whatever is passed from user space to the kernel has to be copied there (it's not feasible to block the writer until the data is actually written to disk). Therefore I find it likely that some sort of dynamic memory allocation takes place when writing to a file, making this part of the story more expensive.
In the case of pipes, the only additional thing the kernel might need to do is wake up the processes that have just become able to run, i.e. were waiting for data to appear in the pipe. In the case of files, the kernel needs to update the file's in-memory metadata (size, modification time) and start a timer (or update an existing one) to schedule eventually writing the data out to disk.
There's no strict rule that writing to a file has to be more expensive than writing to a pipe; it just so happens that it is, as demonstrated by the numbers you measured.
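One way to poke at this claim yourself (a sketch; `gen` is a made-up stand-in generator, and the timings are left to you by prefixing each command with `time` in bash):

```shell
# gen emits many small lines, i.e. many tiny unbuffered write()s.
gen() { i=0; while [ "$i" -lt 20000 ]; do echo "$i"; i=$((i + 1)); done; }

gen > /tmp/direct.txt          # tiny write()s go straight to a file
gen | cat > /tmp/via_pipe.txt  # the same write()s go into a pipe first;
                               # cat then writes the file in larger chunks

cmp /tmp/direct.txt /tmp/via_pipe.txt && echo "outputs are identical"
```

The two files are byte-identical; only the path the bytes take through the kernel differs, which is exactly the variable being discussed.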
By adding `tee`, you happen to reduce the work required of `run.sh`, since its million `write()`s now happen to be somewhat cheaper. This lets the entire `run.sh` run faster, and thus results in a smaller wall clock time.
You add a second process which runs mostly in parallel with it and presumably does less work. It uses buffered `write()`s for both of its output files, i.e. only a few syscalls compared to the unbuffered case. As for its input, it might perform a million tiny `read()`s, but I'd guess that due to randomness in timing and whatnot, many of `bash`'s `write()`s will be combined and arrive in a single `read()`, thereby probably requiring noticeably fewer than a million `read()`s. (It would be really cool to see how many `read()`s it performs. `strace`ing is not an option, since the measuring itself would significantly alter the timings. I'd go with patching `tee` to increment a counter on each `read()` and dump that number at the end. I'll leave that to the dear readers as an exercise.)
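If you don't want to patch and rebuild `tee`, `dd` offers a crude approximation: it reports on stderr how many reads it performed ("records in"), so substituting `dd` for the reading side gives a rough idea of how much coalescing happens (a sketch, not equivalent to the patched-`tee` experiment):

```shell
# dd with a large block size reads whatever is available per read(2).
# Its "N+M records in" stderr line means N full + M partial reads,
# i.e. roughly how many read() calls were needed for the whole stream.
seq 100000 | dd bs=64k of=/dev/null 2>&1 | grep 'records in'
```

Far fewer "records" than input lines means the writer's many small `write()`s were coalesced in the pipe before being read.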
So, `tee` is faster than `run.sh` and as such doesn't delay the completion of the pipeline. However, it adds its own share to the user and sys times, making them larger than they were before.
Update: I was curious, so I patched `tee` to see how many times it `read()`s.
With just one terminal open on the desktop, it's around 660,000-670,000. With a browser open in the background with a page or two, it's around 500,000-600,000. With the browser just starting up (heavier work), it's around 400,000. This makes sense: the more other things there are to do, the more likely it is that `tee` doesn't read its data immediately and some of `bash`'s `write()`s can accumulate. You get the idea, and now the rough numbers as well, which of course can vary a lot across computers.