
I just hit this problem, and learned a lot from the chosen answer to Create random data with dd and get "partial read warning". Is the data after the warning now really random?

Unfortunately, the suggested solution, head -c, is not portable.

For folks who insist that dd is the answer, please carefully read the linked answer, which explains in great detail why dd cannot be the answer. Also, please observe this:

$ dd bs=1000000 count=10 if=/dev/random of=random
dd: warning: partial read (89 bytes); suggest iflag=fullblock
0+10 records in
0+10 records out
143 bytes (143 B) copied, 99.3918 s, 0.0 kB/s
$ ls -l random ; du -kP random
-rw-rw-r-- 1 me me 143 Apr 22 19:19 random
4       random
$ pwd
/tmp
Low Powah
  • dd is portable. If you don't mind the warning, or adjust your blocksize, is there a problem with using dd? – muru Apr 22 '16 at 21:35
  • @muru OP referred to head -c not being portable. – Guido Apr 22 '16 at 22:30
  • @Guido yes, but dd is. – muru Apr 22 '16 at 22:30
  • @muru dd doesn't do the job, for reasons explained in the linked answer. In my experiments, requesting 10 * 2^20 bytes with dd yields less than 200 bytes. If you don't understand or believe that, I urge you to read the linked answer which clearly explains how it can be so. – Low Powah Apr 22 '16 at 22:55
  • @LowPowah I did read the linked post and I understand it, but I wonder why you can't adjust your blocksize. – muru Apr 22 '16 at 23:01
  • @muru Of course I can set the bs= parameter, but that doesn't prevent dd from returning before reading the requested number of bytes (n = bs * count). – Low Powah Apr 22 '16 at 23:04
  • Are you saying that you can't get the requested number of bytes for any value of bs? – muru Apr 22 '16 at 23:05
  • @muru I am saying that dd bs=1000000 count=10 if=/dev/random of=/tmp/random results in a file containing less than 200 bytes. Now do you understand why dd isn't the right tool for the job? – Low Powah Apr 22 '16 at 23:08
  • No, I still don't get it. If that bs causes problems, why aren't you using a lower bs? Why not dd bs=1000 count=10000? Is something forcing you to use that bs? – muru Apr 22 '16 at 23:11
  • @muru the only way to get a guaranteed number of bytes from dd is to either use a bs of 1 byte (as read() will return at least 1 byte) or to not use bs= and instead use obs= (and, optionally, ibs=) separately and pipe it into another dd with your count and an ibs= set to the obs= of the first. If you use bs= at all dd will write partial reads without buffering them to a known size. Using (i)bs=1000 count=10000 only guarantees 10k writes of up to 1000 bytes and will happily write out less than 10k * 1000 bytes if any of the reads return less. – Adrian Günter Apr 16 '18 at 06:43
  • @AdrianGünter which still doesn't explain why OP can't use a bs of 1... – muru Apr 16 '18 at 06:47
  • @muru Because dd if=/dev/zero of=/dev/null bs=1 count=10000000 takes far longer than with larger block sizes. It's simply not practical for many/most situations. Piping to another dd works and allows arbitrarily large reads and writes. – Adrian Günter Apr 16 '18 at 06:52
  • @AdrianGünter for all that, you haven't shown a concrete example for avoiding a 1b block size – muru Apr 16 '18 at 08:08
  • @muru: Simplest - dd if=/dev/random | dd count=128 | wc -c will reliably write 64KiB on systems where dd's default blocksize is 512 bytes. The blocksize can be adjusted by setting obs= on the first dd and ibs= (or just bs=) on the second to the same value: dd if=/dev/random obs=4K | dd bs=4K count=16 | wc -c also writes 64KiB. The key is to never set the bs= value on the first dd as this will ensure full output blocks are accumulated before writes. On some implementations you need to set ibs= of first to a value other than obs=: dd if=... ibs=1K obs=4K | dd bs=4K ... – Adrian Günter Apr 16 '18 at 15:30
  • @muru try dd if=/dev/random of=/dev/null obs=1317 and let it run for 30 seconds or so on a system that isn't entropy starved, then kill it with Ctrl-c. If you read the status output as [<full_blocks>+<partial_blocks>] records (in|out) you will see that dd read in many (or entirely) partial blocks – many more blocks than it wrote – and that every output block it wrote was a full block, i.e., 1317 bytes. You can verify this with dd if=/dev/random obs=1317 | pv -bn >/dev/null; pv will report bytes read in multiples of 1,317. – Adrian Günter Apr 16 '18 at 16:41

3 Answers


Unfortunately, to manipulate the content of a binary file, dd is pretty much the only tool in POSIX. Although most modern implementations of text processing tools (cat, sed, awk, …) can manipulate binary files, this is not required by POSIX: some older implementations do choke on null bytes, input not terminated by a newline, or invalid byte sequences in the ambient character encoding.

It is possible, but difficult, to use dd safely. The reason I spend a lot of energy steering people away from it is that there's a lot of advice out there that promotes dd in situations where it is neither useful nor safe.

The problem with dd is its notion of blocks: it assumes that a call to read returns one block; if read returns less data, you get a partial block, which throws things like skip and count off. Here's an example that illustrates the problem, where dd is reading from a pipe that delivers data relatively slowly:

yes hello | while read line; do echo $line; done | dd ibs=4 count=1000 | wc -c

On a bog-standard Linux (Debian jessie, Linux kernel 3.16, dd from GNU coreutils 8.23), I get a highly variable number of bytes, ranging from about 3000 to almost 4000. Change the input block size to a divisor of 6, and the output is consistently the full ibs × count bytes one would naively expect: the input to dd arrives in bursts of 6 bytes, and as long as a block doesn't span multiple bursts, dd gets to read a complete block every time.
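For instance, this variant of the experiment keeps the total request at 4000 bytes but uses ibs=2, a divisor of the 6-byte bursts (a sketch; timing differs between runs, but the count should be 4000 every time):

yes hello | while read line; do echo $line; done | dd ibs=2 count=2000 | wc -c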

This suggests a solution: use an input block size of 1. No matter how the input is produced, there's no way for dd to read a partial block if the input block size is 1. (This is not completely obvious: dd could read a block of size 0 if it's interrupted by a signal — but if it's interrupted by a signal, the read system call returns -1. A read returning 0 is only possible if the file is opened in non-blocking mode, and in that case a read had better not be considered to have been performed at all. In blocking mode, read only returns 0 at the end of the file.)

dd ibs=1 count="$number_of_bytes"

The problem with this approach is that it can be slow (but not shockingly slow: only about 4 times slower than head -c in my quick benchmark).
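For instance, repeating the slow-pipe experiment above with an input block size of 1 (a sketch; it takes longer, but the byte count is exact no matter how the input dribbles in):

yes hello | while read line; do echo $line; done | dd ibs=1 count=4000 | wc -c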

POSIX defines other tools that read binary data and convert it to a text format: uuencode (outputs in historical uuencode format or in Base64), od (outputs an octal or hexadecimal dump). Neither is well-suited to the task at hand. uuencode can be undone by uudecode, but counting bytes in the output is awkward because the number of bytes per line of output is not standardized. It's possible to get well-defined output from od, but unfortunately there's no POSIX tool to go the other way round (it can be done but only through slow loops in sh or awk, which defeats the purpose here).
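To make concrete why going back from od output defeats the purpose, here is a sketch of the kind of slow shell loop it would take to turn an od dump into bytes again (the value of n is illustrative, and it assumes the shell's printf emits a NUL byte for the octal escape \000, which POSIX requires but which is worth verifying on the target implementation):

# Dump exactly n bytes as three-digit octal, then re-emit each byte
# from its octal code with one printf call per byte.
n=1000
od -An -vto1 -N"$n" |
while read -r line; do
  for byte in $line; do
    printf "\\$byte"
  done
done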

  • Thank you for a very comprehensive answer. It seems like there is no simple and safe way which is also portable. Maybe the answer is to write a C program if one wants to work with arbitrary bytes in units smaller than lines. I am intrigued by the possibility of a uuencode/uudecode solution. Can you please explain a little more why such a solution would not be safe or portable? (I'm defining safe to mean guaranteed not to lose data, given that everything else works perfectly.) – Low Powah Apr 23 '16 at 00:27
  • @LowPowah uuencode won't lose data, the problem is counting the input bytes. You can easily count the number of lines, but the number of bytes per line is not standardized. You can pipe into awk and do the counting there, but if you do that I think you'll lose any speed advantage. Furthermore the output of uuencode (in either format) can't easily be split according to input bytes, since it processes bytes by blocks. The output of od is easy to work with but difficult to convert back to binary afterwards. – Gilles 'SO- stop being evil' Apr 23 '16 at 00:34
  • On my system ibs=1 with 7.9 MB of data degrades the performance from 62 MB/s down to 2.4 MB/s. – ceving Jan 17 '20 at 16:32
  • od -An -vtx1 -N10 would read 10 bytes and be POSIX. – Stéphane Chazelas Oct 31 '20 at 07:17
  • @StéphaneChazelas Yes, but it also encodes those bytes. Sure, you can pipe the output to an awk program that will decode them (or can you? I don't remember if awk can portably output null bytes), but that's not really helpful here. – Gilles 'SO- stop being evil' Nov 01 '20 at 11:54

Newer versions of the GNU implementation of dd have a count_bytes iflag. For example:

cat /dev/zero | dd count=1234 iflag=count_bytes | wc -c

will output something like

2+1 records in
2+1 records out
1234 bytes (1.2 kB, 1.2 KiB) copied, 0.000161684 s, 7.6 MB/s
1234
jdizzle

Part of the point of using dd at all is that the user gets to pick the block size it uses. If dd fails for too large block sizes, IMO it's the user's responsibility to try smaller block sizes. I could ask for a TB from dd in one block, but that doesn't mean I'll get it.

If you want an exact number of bytes, this will be horrendously slow, but should work:

dd bs=1 count=1000000

If even a block size of 1 results in partial reads, …

muru