tee is faster than no tee

Question

I have this simple bash script:

#!/bin/bash
for i in {1..1000000}
do
    echo "hello ${i}"
done

That prints a message a million times.

I try to compare the performance of dumping all the output into a single file,  vs. using tee to split the output into two files.

$ time ./run.sh > out1.txt
real    0m9.535s
user    0m6.678s
sys     0m2.803s
$ 
$ time ./run.sh | tee out2.txt > out1.txt
real    0m6.705s
user    0m6.895s
sys     0m5.214s

It turns out that writing to two files simultaneously is faster than only writing to one (these results were consistent across multiple runs).

Can anyone explain how this is possible? Also, how should I interpret the user and sys outputs of time? Why is sys time higher when using tee?

Depending on what run.sh does, it might be an effect of caching. You should run both variants more than once alternatingly. Writing two files could use more system time. — Bodo, Aug 11 '22 at 16:21
@Bodo this seems to be consistent across multiple runs, with and without tee — areller, Aug 11 '22 at 16:30
This is very relevant: How can I time a pipe? But still doesn't explain why tee is faster, it just clarifies that you're not really measuring the time taken right. — terdon, Aug 11 '22 at 16:45
I'll bet it's also faster through tee if you discard one of the output streams: time ./run.sh | tee out1.txt > /dev/null — Sotto Voce, Aug 11 '22 at 16:50
I'm guessing this is due to buffering. The tee variant may be able to write larger chunks (possibly whatever size the pipe buffer is), whereas the first variant may write line by line. Benchmarking with hyperfine on my FreeBSD system shows that the tee variant is about twice as fast. This is not an answer because I'm just waving my arms in the air. — Kusalananda, Aug 11 '22 at 16:59
See also bash run.sh | cat >file. It would be as fast as the tee variant. — Kusalananda, Aug 11 '22 at 17:18
Ah, that lends weight to your original suggestion of its being a buffering thing, @Kusalananda. Probably both cat and tee handle I/O better than the shell. Which isn't surprising. — terdon, Aug 11 '22 at 17:30
@terdon Unfortunately it's just a hunch, and I know for a fact that there are users here that know these things more intimately than I do. I'm hoping one of them will notice and answer. — Kusalananda, Aug 11 '22 at 17:44
@Kusalananda I've seen a lot of questions/complaints on various sites about how piping output of certain commands through tee changes the rate which output lines appear. Without tee, there's a steady rate of lines, but with tee there are pauses and then a flood of lines at once. It's buffering - both in the pipe feeding tee and in tee's output writes. — Sotto Voce, Aug 11 '22 at 18:19
in the second command ./run.sh's stdout is a pipe, which speeds up writes issued by echo because tee is always reading from the other end of the pipe. also tee reads from the stdin pipe end can return more than one line, which reduces the number of both reads and writes issued by tee. the sys time value is duplicated because two processes are executed in the pipeline. — don_aman, Aug 11 '22 at 21:07
@don_aman, so why would tee be better at reading from echo than the plain file (filesystem driver) is? Similarly, I don't get how the pipe would make the echo write in bigger blocks; it probably does write a line at a time, the same to the file and the pipe... — ilkkachu, Aug 11 '22 at 21:37
@terdon, are they not measuring it right? Bash's time times the whole pipeline, isn't that what they'd want here? (Unless one wants to include a 'sync' in the timed part) — ilkkachu, Aug 11 '22 at 21:39
@ilkkachu of course that depends on the implementation, but in most cases pipes (which, for shell pipelines, don't even use filesystem entries) only use a small buffer, compared to how complex filesystem io can be (including caching). anyways, in the given commands no filesystem entries are actually read from, so that wouldn't be relevant. more than one line can be read because of the pipe buffer, and tee simply writes what it reads. you can use strace to see how some (or even most) reads return more than one line. — don_aman, Aug 11 '22 at 21:59
@ Kusalananda and @terdon Buffering is precisely the answer here. When the bare script is run, each invocation of echo requires bash to write and flush a small buffer of bytes (the single line/number) to stdout a million times. With the pipe (in this case specifically to tee), bash is able to buffer the output of multiple invocations of echo (very many) and in the case of using tee, the number of writes/flushes to stdout can be a number MUCH smaller than a million. Something to test the theory would be making each iteration put out a lot more data, to reduce the impact of this buffering. — t0w0i7ne, Aug 11 '22 at 23:32
@t0w0i7ne I don't get that. Why would shell flush every line when stdout is redirected to a block device, yet buffer when stdout is redirected to a pipe? I can see the difference between a 4K file buffer and a 64K pipe buffer (although the file would be cached with write-behind), but surely a block device would not be line-buffered? — Paul_Pedant, Aug 12 '22 at 06:30
@Paul_Pedant I think there is a difference in behavior in the shell with pipes and file redirects, but I'm not sure. I'd be interested to see if the time increases even more without the file redirection. EDIT: I just tried various methods: "> /dev/null" ~3s, "> test.txt" ~4-4.5s, "| tee test2.txt" ~3.75-4.5s, straight to stdout ~5.5-6s (this result makes the pipe and file redirect similar). — t0w0i7ne, Aug 15 '22 at 18:27
I get opposite results: tee version is 1 second slower than plain one; am I the only? — mattia.b89, Jan 15 '23 at 08:27
It turns out that writing to two files simultaneously is faster than only writing to one - for you. In Linux 6.1.5 a plain redirect without tee is faster. — Artem S. Tashkinov, Jan 16 '23 at 10:11
@mattia.b89 tee version is 7 seconds lower for me. There is definitely some cache and buffer in play that make the result very dependent of the setup. — Uriel, Jan 19 '23 at 15:50
This is true and I have noticed that tee has a tendency to race against (and win) other commands. Take, for example, cat file1 | tee > (command1) > (command2) > (command3) > (command4). With tee present, the order of execution is often (not always) command1, command3, command2 and command4. It could mess up your script when the order matters. — dhm, Jan 20 '23 at 09:53

egmont · Answer 1 · 2023-01-20T13:10:52.580

First, let's clear a couple of things.

Flushing

The shell flushes its output after every built-in command, such as echo in this case.

It has to. It has to mimic the behavior as if it was an external command (wherever it's possible, like here it is, it's a built-in shell command rather than the external /bin/echo purely for performance reasons). External commands obviously have to flush (or disregard for good) the output when they quit. The built-in echo has to behave the same way.

It's the expected behavior that if an echo is followed by a long wait or computation in your script, the echo-ed data is already available on the output.

This behavior is independent of where the standard output is connected to. It could be the terminal, a pipe, a file, doesn't matter. Each echo is flushed independently. (Not to be confused with the default behavior of libc, where the behavior depends on whether the standard output goes to a terminal or not, as you can see with most of the standard utilities, like cat, grep, head, and so on and so forth, and also tee of course. The shell does flush explicitly after every built-in command, rather than relying on libc's default buffering.)

Use strace ./run.sh > out1.txt to see a million write() calls performed by your shell.

Multi core

I assume that you have multiple CPU cores in your system, and no significant load due to other processes. In this setup the kernel assigns bash run.sh to one of the cores, and tee to another. This way no heavy-weight process switching takes place, and if they both can actually run then they both do run simultaneously.

Presumably if you confine the two processes into a single core (I believe you can do this with the taskset command, I'd let you experiment with it), then you'd get vastly different results, tee significantly slowing down the process. It's not just an extra process that needs to run serially, interleaved against run.sh, but the kernel would also need to switch between the two processes many times and this switching itself is also quite costly.

The time command

time measures the entire pipeline, that is, run.sh and tee combined. If you're interested in measuring one of the commands only, invoke time in a subshell, like:

$ ( time ./run.sh ) | tee out2.txt > out1.txt
$ ./run.sh | ( time tee out2.txt ) > out1.txt

For the real time, the wall clock time elapsed is printed. That is, as if you literally printed the timestamp before and after the pipeline and computed the difference, or used an external stopwatch. If two processes, each spinning one CPU core for 10 seconds, are running in a pipeline while both of them can run all the time, fully in parallel, the real time will be 10 seconds.

user and sys time, however, add up across CPU cores. If two parallel processes, each on their own CPU core, spin the CPU to its maximum for 10 seconds (the real time being 10 seconds as we've just seen), the user time will be 20 seconds.

Now, let's put these all together:

There's only one question left to answer: Why is it faster to write a tiny chunk of data to a pipe than to a file?

I don't have a direct answer for this, and I'm just drawing the conclusion backwards, i.e. from the timing results that you measured, that it must be faster to write to a pipe than to a file. The following are sort of speculations, but hopefully reasonable ones.

Pipes have a fixed size (I believe 64kB). It could easily be that the entire size is allocated when the pipe is created, therefore no more dynamic allocation in the kernel takes place. (If the size is reached, the writer side blocks until some space is freed up by the reader.) For files, however, there's no such limit. Whatever is passed from user space to the kernel has to be copied there (it's not feasible to block the writer until the data is actually written to disk). Therefore I find it likely that some sort of dynamic memory allocation might take place when writing to a file, making this part of the story more expensive.

In case of pipes, the only additional thing the kernel might need to do is to wake up the processes that have just become able to run, i.e. were waiting for data to appear in the pipe. In case of files, the kernel needs to update the file's in-memory metadata (size, modification time), and start a timer (or update an existing one) to schedule eventually writing out this data to the disk.

There's no strict rule that writing to a file would have to be more expensive than writing to a pipe, it just apparently so happens that it is, as it's demonstrated by the numbers you measured.

By adding tee, you happen to reduce the work required by run.sh, since its million write()s now happen to be somewhat cheaper. This makes the entire run.sh be able to run faster, and thus result in a smaller wall clock time.

You add a second process which runs mostly in parallel to it, and presumably does less work. It uses buffered write() for both of its output files, i.e. only a few syscalls compared to the unbuffered case. For its input, well, it might perform a million tiny read()s, but I'd guess that due to randomness in timing and what not, many of bash's write()s will probably be combined and will arrive in a single read(), thereby probably requiring noticeably fewer than a million read()s. (It would be really cool to see how many read()s it performs. strace'ing is not an option since then the measuring itself would significantly alter the timings. I'd go with patching tee to increment a counter on each read() and dump that number at the end. I'd leave it to the dear readers as an exercise.)

So, tee is faster than run.sh and as such doesn't delay the completion of the pipeline. However, it adds its own share to the user and sys times, making them larger than they were before.

Update:

I was curious so I patched tee to see how many times it read()s.

With just one terminal on the desktop, it's around 660 000 - 670 000. With a browser being opened in the background with a page or two, it's around 500 000 - 600 000. With the browser just starting up (more heavy work), it's around 400 000. Makes sense: the more other things to do, the more likely that tee doesn't read its data immediately and some of bash's write()'s can accumulate. You get the idea, and now the rough numbers as well, which of course can vary a lot across computers.

tee is faster than no tee

1 Answers1