0

In the bash manual, about the read builtin command:

-d delim The first character of delim is used to terminate the input line, rather than newline.

Is it possible to specify a character as the delim of read that never matches (unless it can match EOF, if EOF is even a character?), so that read always reads the entire file at once?

Thanks.

Tim
  • 101,790

3 Answers

4

Since bash can't store NUL bytes in its variables anyway, you can always do:

IFS= read -rd '' var < file

which will store the content of the file up to the first NUL byte, or the whole file if it contains no NUL bytes (text files, at least by the POSIX definition, don't contain NUL bytes).
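For example (a quick sketch; /tmp/demo is a made-up path). Note that read returns a non-zero status here, since it hits end-of-file without ever seeing the NUL delimiter, but the variable is still populated:

printf 'foo\nbar\n' > /tmp/demo
IFS= read -rd '' var < /tmp/demo   # exit status 1 (EOF), but $var is set
printf %s "$var"                   # foo, bar, trailing newline intact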

Another option is to store the content of the file as an array of its lines (each including its line delimiter, if any):

readarray array < file

You can then join them with:

IFS=; var="${array[*]}"

If the input contains NUL bytes, everything past the first occurrence on each line will be lost.
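To illustrate both steps together (a sketch; /tmp/demo is a made-up path):

printf 'one\ntwo\n' > /tmp/demo
readarray array < /tmp/demo
declare -p array            # roughly: declare -a array=([0]=$'one\n' [1]=$'two\n')
IFS=; var="${array[*]}"     # join with an empty separator
printf %s "$var"            # prints the file back verbatim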

In POSIX sh syntax, you can do:

var=$(cat < file; echo .); var=${var%.}

We add a . which we remove afterwards to work around the fact that command substitution strips all trailing newline characters.
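A quick way to convince yourself that the trick preserves trailing newlines (a sketch; /tmp/demo is a made-up path):

printf 'a\n\n\n' > /tmp/demo
var=$(cat < /tmp/demo; echo .); var=${var%.}
printf %s "$var" | wc -c    # 4: the a plus all three newlines survive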

If the file contains NUL bytes, the behaviour will vary between implementations. zsh is the only shell that will preserve them (it's also the only shell that can store NUL bytes in its variables). bash and a few other shells just remove them, while some others choke on them and discard everything past the first NUL occurrence.

You could also store the content of the file in some encoded form like:

var=$(uuencode -m - < file)

And get it back with:

printf '%s\n' "$var" | uudecode
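As a round-trip check (a sketch: it assumes the GNU sharutils implementation, where uudecode -o /dev/stdout forces the decoded data to standard output):

var=$(uuencode -m - < file)
printf '%s\n' "$var" | uudecode -o /dev/stdout | cmp - file && echo round-trip OK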

Or with NULs encoded as \0000, so as to be able to use it in arguments to printf %b in bash (assuming you're not using a locale where the charset is BIG5, GB18030, GBK or BIG5-HKSCS):

var=; while true; do
  if IFS= read -rd '' rec; then
    # a NUL was found: escape backslashes, then encode the NUL as \0000
    var+=${rec//\\/\\\\}\\0000
  else
    # EOF reached with no trailing NUL: append the final record and stop
    var+=${rec//\\/\\\\}
    break
  fi
done < file

And then:

printf %b "$var"

to get it back.
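If you want to verify the round trip on a test file, something like this should do (a sketch, run after the loop above with the same file):

printf %b "$var" | cmp - file && echo round-trip OK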

  • If the first read reads a file up to the first NUL, will the second read be able to read past NUL? – Tim Jan 02 '19 at 18:00
  • @Tim, yes, IFS= read -rd '' is often used in a loop to read NUL-delimited records in zsh or bash. You'll find many examples on unix.SE – Stéphane Chazelas Jan 02 '19 at 18:20
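For reference, that looping pattern typically looks like this (a sketch) when consuming NUL-delimited output such as find -print0:

while IFS= read -rd '' f; do
  printf 'found: %s\n' "$f"
done < <(find . -type f -print0)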
3

The answer is generally "no", simply because, as a general rule, there is no actual character in a file that conclusively marks the end of the file.

You are probably well advised to try a different approach, such as one of those suggested here: https://stackoverflow.com/questions/10984432/how-to-read-the-file-content-into-a-variable-in-one-go. Use of:

IFS="" contents=$(<file)

is particularly elegant; it causes Bash to read the contents of file into the variable contents, except for NUL bytes, which Bash variables can't hold (due to their internal use of C-style, NUL-terminated strings). IFS="" sets the internal field separator to empty so as to disable word splitting (and hence avoid the removal of newlines).
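For instance (a sketch; /tmp/demo is a made-up path):

printf 'one\ntwo\n' > /tmp/demo
IFS="" contents=$(</tmp/demo)
printf '%s' "$contents"    # one, two; the final newline is stripped (see the comments below)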

Note: Since (for lack of reputation points) I can't comment on the answer suggesting the use of read with the -N option, I note here that that answer is, by definition, not guaranteed to work as it stands, because the file size is unknown in advance.

ozzy
  • 845
  • That however removes all trailing newline characters. In bash, that also removes all NUL bytes. – Stéphane Chazelas Dec 29 '18 at 19:52
  • @StéphaneChazelas Thanks for pointing that out; I adapted the answer to fix the issue as much as possible. – ozzy Dec 29 '18 at 20:33
  • I don't get your point; should I add another 2 0s to 40000000 because maybe someone is going to read a 4G file into a bash variable, with a 128 bytes read buffer? ;-) –  Dec 29 '18 at 20:57
  • @pizdelect Honestly, I didn't know about the 128 bytes read buffer. Still, I'd be inclined to code something more like fs=$(du -b file | cut -f1); read -rN${fs} contents <file, to make the code seem a tad less arbitrary : ) – ozzy Dec 29 '18 at 21:18
  • @ozzy fwiw, that would better use wc -c instead of du .. | cut, and set LC_CTYPE=C, since the -N is the length in characters, not in bytes. All in all, I'm sorry that you didn't like my answer; I had tried to keep to the actual question and correct some misconceptions about the EOF "character", not second guess the OP (why use read? why read a whole file in a bash variable? why use that much bash in the 1st place ;-)) –  Dec 30 '18 at 10:51
  • @pizdelect As to the byte/character difference: fair point. I think you'd need wc -m though, rather than wc -c (which according to the man page also reports bytes). As to the use of the arbitrary constant: for a one-off script, I'm sure it will do, but I'd try to avoid it; it doesn't seem clean to me. – ozzy Dec 30 '18 at 11:24
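Putting the suggestions from this comment thread together (a sketch; file is a placeholder name, and LC_CTYPE=C makes the character count match the byte count reported by wc -c):

fs=$(wc -c < file)
LC_CTYPE=C read -rN "$fs" contents < file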
2

In bash, use the -N (number of characters) option.

read -rN 40000000 foo

Omit the -r option if you really want backslashes in the file to act as escape characters.

from help read:

-N nchars   return only after reading exactly NCHARS characters, unless
            EOF is encountered or read times out, ignoring any delimiter

EOF is not a character, but a status: a read (the system call, not the shell builtin) has returned a zero length. But getchar() and other functions will conveniently return EOF, which is an integer with a value (-1) that cannot conflict with any valid character from any charset. Thence the confusion, compounded by the fact that some old operating systems really did use an EOF marker (usually ^Z) because they only kept track of whole blocks in the filesystem metadata.
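You can observe that status at the shell level too (a sketch): on empty input, read fails immediately, without any "EOF character" being involved:

printf '' | { IFS= read -r line; echo "status=$?"; }    # status=1: zero-length read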

Curiously, read -N0 seems to do a "slow slurp" (it will read the whole file just the same, but doing a system call for each character). I'm not sure this is an intended feature ;-)

strace -fe trace=read ./bash -c 'echo yes | read -N0'
...
[pid  8032] read(0, "y", 1)             = 1
[pid  8032] read(0, "e", 1)             = 1
[pid  8032] read(0, "s", 1)             = 1
[pid  8032] read(0, "\n", 1)            = 1
[pid  8032] read(0, "", 1)              = 0

Notice that the buffer that bash's read builtin uses is only 128 bytes, so you shouldn't read large files with it. Also, if your file contains a lot of multi-byte UTF-8 characters, you should use LC_CTYPE=C read ...; otherwise bash will alternate reads of 128 bytes with byte-by-byte reads, making it even slower.

  • I'm not sure read -N0 even makes any sense. – ilkkachu Dec 29 '18 at 17:23
  • note that Bash can't store NUL bytes in variables, and read drops them so you can't get an exact copy of a file that contains them. – ilkkachu Dec 29 '18 at 17:33
  • @StéphaneChazelas Oh no, it's not that bad ;-) It will try to read the 40000000 in 128-byte chunks (strace it or look at lbuf in lib/sh/zread.c). –  Dec 29 '18 at 20:00
  • @ilkkachu it's not at all obvious why truncating the file at the first NUL byte is preferable to stripping NUL bytes. If you do the latter on a UTF-16 file, you will get something readable. –  Dec 29 '18 at 20:17
  • @StéphaneChazelas OK for the dumb character conversion (it will read 128 bytes in one shot, then another 128 one by one). But no, bash's read builtin will read 128 bytes chunks even on seekable files. –  Dec 29 '18 at 20:25
  • I didn't say anything about preference. I just said it won't result in an exact copy of the data. – ilkkachu Dec 30 '18 at 13:35