21

I want to tar a directory and write the result to stdout, then pipe it to a compression program, like this:

tar -cvf - /tmp/source-dir | lzip -o /media/my-usb/result.lz -

I have been using pipes all the time with commands that output a few lines of text. Now I am wondering what happens when a (fast) command with very large output, such as tar, is piped into a very slow compression command. Will tar wait for its output to be consumed by lzip? Or will it just run as fast as it can and dump everything into RAM? The latter would be a disaster on a low-RAM system.

Livy
  • 455
  • 4
  • 10
  • 3
    For what it's worth, compression operations like lzip on a stream of data are much faster than file-system operations like tar. Unless you have an ancient UNIX box based on an early Sparc or Motorola 68020 chip. – O. Jones Jan 03 '20 at 12:48
  • @O. Jones: I doubt xz at maximum compression would be faster than an SSD. – Oskar Skog Jan 03 '20 at 23:14
  • 1
    @O.Jones If you compress with the maximum option values possible (e.g. lzip -s256MiB -m273) it will take a very long time. – Livy Jan 04 '20 at 14:12

2 Answers

46

When the data producer (tar) tries to write to the pipe too quickly for the consumer (lzip) to have time to read all of it, it will block until lzip has had time to read what tar is writing. There is a small buffer associated with the pipe, but its size is likely to be smaller than the size of most tar archives. There is no risk of filling up your system's RAM with your pipeline.

"Blocking" simply means that when tar does a call to the write() library function (or equivalent), the call won't return until the data has been delivered to the pipe buffer, which could take a bit of time if lzip is slow to read from that same buffer. You should be able to see this in top where tar would slow down and sleep a lot compared to lzip (assuming tar is in fact quicker than lzip).

You would therefore not fill up a significant amount of RAM with your pipeline. To do that (if you wanted to), you could use something like pv in the middle, with some large buffer (here, a gigabyte):

tar -cvf - /tmp/source-dir | pv --buffer-size 1G | lzip -o /media/my-usb/result.lz -

This would still block tar whenever pv blocks. pv would block when its buffer is full and it can't write to lzip.
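If you do add such a buffer, pv can also report how full it is, which tells you whether tar really is outrunning lzip (this assumes a pv version that supports -T / --buffer-percent):

tar -cvf - /tmp/source-dir | pv -T --buffer-size 1G | lzip -o /media/my-usb/result.lz -

A buffer that sits near 100% means the producer is well ahead of lzip; one that stays near 0% means the large buffer is buying you nothing.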


The reverse situation works in a similar way, i.e. if you have a slow left-hand side of a pipe writing to a fast right-hand side, the consumer on the right would block on read() until there is data to be read from the pipe.
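A trivial way to see that side of it (a sketch, nothing specific to tar or lzip):

( while true; do date; sleep 1; done ) | cat

Here cat spends essentially all of its time blocked in read(), printing a line only when the slow producer on the left writes one.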

This (data I/O) is the only thing that synchronises the processes taking part in a pipeline. Apart from reading and writing (and occasionally blocking while waiting for someone else to read or write), they would run independently of each other.

Kusalananda
  • 333,661
  • 2
    FWIW, the size of the pipe buffer depends on the OS and version. Really old Linux had pipe buffers around 4 KiB in size. Later it defaulted to something like 64 KiB, and it provides an fcntl() with which an open pipe's buffer can be resized (e.g. to 1 MiB) as required. – Stephen Harris Jan 02 '20 at 17:21
  • 2
    From https://unix.stackexchange.com/questions/11946/how-big-is-the-pipe-buffer: PIPE_BUF is the largest value that can be written atomically. sysctl fs.pipe-user-pages-soft returns 16384 on my machine, and sysctl fs.pipe-max-size returns 1048576. – Stephen Harris Jan 02 '20 at 17:31
  • 1
    All that adding buffer here really will do is just increase the memory usage of the pipeline. Except in a few pathological cases, there's usually little to gain by adding large buffers between commands in a pipeline. – Lie Ryan Jan 03 '20 at 09:05
  • @LieRyan This is absolutely true. In this case it's a useless addition. It would be more useful if it were, for example, part of a pipeline stage that receives data on a remote system ( ... | ssh user@host 'pv --buffer-size 1G | lzip ...' or something similar), or if pv or some other utility were used to throttle the data stream in some way. – Kusalananda Jan 03 '20 at 09:20
  • 4
    @LieRyan I didn't know about pv, but use cases which come to mind for a large buffer are reading from a time-metered connection or from a USB stick you want to remove as soon as possible, or generally from any process which locks a critical resource. – Peter - Reinstate Monica Jan 03 '20 at 22:48
  • 2
    Piping from a network into an I/O-bound process that does writes in chunks is another place where it comes in handy. Several jobs ago (so the details are hazy), I had a job that effectively did periodic flushes to disk, hanging for a significant while as they were ongoing; without pv and a large buffer, this stopped the download, meaning a substantial loss in throughput. Sure, if your I/O system flushes at a reasonably steady rate, it doesn't make much of a difference; but that's not always the case. – Charles Duffy Jan 04 '20 at 06:05
  • @LieRyan: you could also imagine cases where the command you're piping into can usually or nearly keep up with the disk, but processing rate varies with data. e.g. simple repeating patterns compress quickly; redundancy over a longer distance and/or varying in pattern takes longer to find. 64kiB is probably too small and would lead to stalls in the pipeline. A few MiB is very reasonable a lot of the time, but yeah usually 1GiB would be overkill for most jobs and just lead to more cache and TLB thrashing. – Peter Cordes Jan 04 '20 at 09:24
2

GNU tar has a --lzip option to "filter the archive through lzip", so you may want to use this instead:

tar --lzip -cvf /media/my-usb/result.lz /tmp/source-dir

Answering the question: in your case the system will manage the pipe properly using the default pipe buffer size, blocking tar whenever that buffer fills up.
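If the reason for building the pipe by hand is to pass options to lzip itself, recent GNU tar can also be given a compressor command with arguments via -I (--use-compress-program); a sketch along these lines (the lzip options shown are only examples):

tar -I 'lzip -9 -s256MiB -m273' -cvf /media/my-usb/result.lz /tmp/source-dir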

Kusalananda
  • 333,661
Yurko
  • 718
  • 3
    I think that the questioner wants to know what "properly" actually entails. – JdeBP Jan 02 '20 at 17:29
  • 7
    All the compression and decompression options in tar just fork an appropriate sub-process and connect it internally to stdout (to write an archive) or stdin (to read an archive). The underlying mechanism is identical to the shell's pipework. – Paul_Pedant Jan 02 '20 at 18:58
  • 2
    I know about the --lzip option of tar, but it lacks the flexibility of setting options for lzip itself. By piping commands, I can pass in a lot of options for lzip. – Livy Jan 03 '20 at 12:43
  • 1
    Doesn't just assume GNU, but assumes recent GNU. Why write code that reduces your portability unless you get some kind of benefit in return? – Charles Duffy Jan 03 '20 at 16:25