Should the shell read (a script) one character at a time?

Question

While reading a script, the shell reads it from a file, from a pipe, or possibly from some other source (stdin?). In some corner cases the input may not be seekable (there is no way to rewind the file position to an earlier point).

It has been said that read reads stdin one byte at a time until it finds an unescaped newline character.

Should the shell also read one character at a time from its script input?
I mean the script itself, not an additional data file that it might use.

If so: why is that needed? Is it defined in some spec?

Do all shells work similarly? Which ones do not?

score 3 · Answer 1 · answered Jan 08 '19 at 09:11

the shell will read from the script file or from a device descriptor

Or from a pipe, which is probably the easiest way to get a non-seekable input fd.

Should the shell also read one character at a time from its script input?

Yes, if it wants to support scripts that run commands that read from stdin and expect to get their input from lines of the script itself.

As in something like this:

$ cat foo.sh
#!/bin/sh
line | sed -e 's/^/* /'
xxx
echo "end."

$ cat foo.sh | bash
* xxx
end.

The line command reads a single line from standard input (the xxx line), and the shell reads the other lines as commands. For this to work, line must also take care not to read too far, as otherwise the shell would not see the following lines. With GNU utilities, head -n1 would read too much, as would e.g. sed. The line utility from util-linux takes care to read one byte at a time so as not to read past the newline.
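The same effect can be seen without the line utility. This is only a sketch: it assumes the shell's read builtin stops right after the newline on a pipe (as the question says it does), and the function name rest_after_first_line is made up for illustration.

```sh
#!/bin/sh
# Consume exactly one line from stdin, then hand the rest of the stream to cat.
# On a pipe there is no way to seek back, so anything the first reader
# over-reads would be lost to the second one.
rest_after_first_line() {
    IFS= read -r first_line   # builtin read: stops right after the newline
    cat                       # prints whatever is still left in the pipe
}

printf 'one\ntwo\n' | rest_after_first_line   # prints: two

# For comparison (not run here): per the answer above, GNU head reads too much,
# so in the following pipeline cat would probably have nothing left to print.
#   printf 'one\ntwo\n' | { head -n1 >/dev/null; cat; }
```

A shell reading its script from stdin is in the same position as the first reader here: it has to stop exactly at the end of the command it is about to run.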

The foo.sh script doesn't work with e.g. dash, as dash reads the script a full block at a time:

$ cat foo.sh | dash
* 
dash: 3: xxx: not found
end.

Dash and Busybox read full blocks, the others I tested (Bash, Ksh, mksh and Zsh) read byte-by-byte.
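A rough way to probe a given shell for this, as a sketch only: the function name script_shares_stdin is hypothetical, and the probe relies on the read builtin taking its input from the same stdin the script is coming from.

```sh
#!/bin/sh
# script_shares_stdin SHELL
# Pipe a two-line script into SHELL. The first line runs `read`, which should
# pick up the second line ("payload") as data, but only if SHELL has not
# already swallowed that line while reading the script.
script_shares_stdin() {
    out=$(printf 'IFS= read -r x; echo "got:$x"\npayload\n' | "$1" 2>/dev/null)
    [ "$out" = "got:payload" ]
}

# Try whichever of these shells happen to be installed.
for candidate in bash dash; do
    command -v "$candidate" >/dev/null 2>&1 || continue
    if script_shares_stdin "$candidate"; then
        echo "$candidate: stops reading at the end of the current command"
    else
        echo "$candidate: reads ahead"
    fi
done
```

With a shell that reads ahead, the payload line is either lost or run as a command, which is the same failure dash shows with foo.sh.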

Note that foo.sh is a rather convoluted script, and it doesn't work properly if run as, e.g., bash foo.sh, since in that case stdin doesn't point to the script itself, and the xxx line would be taken as a command. It would probably be better to use a here-doc if one wants to include data within the script itself. This works with any shell, when run as sh bar.sh, sh < bar.sh or cat bar.sh | sh:

$ cat bar.sh
#!/bin/sh
sed -e 's/^/* /' <<EOF
xxx
EOF
echo "end."

@Isaac, yes. I was trying to say that you make a distinction between a file and "device descriptor" in the first sentence of your question. I can't see what you're aiming at with that distinction, especially with stdin as an example of the second group. If you did something like sh < a.sh, then the stdin of the script would be a file instead. — ilkkachu, Jan 08 '19 at 11:16
Ah, just that a file in the filesystem is usually seekable, while a file descriptor (like stdin 0,1,2,...) is usually not. That is exactly your description of what happens with the pipe, but in different words. Hope that makes sense. — , Jan 08 '19 at 11:21
@Isaac, Hmm. Bash still reads byte-by-byte if we do something like mkfifo p; cat foo.sh > p & bash p. That seems a bit overly careful: I can't really see there would be a chance for commands in the script to read the same pipe in that case (they can't really get the fd directly, Bash seems to open it at number 255, the script would need to know that). None of the other shells seem to do a byte-by-byte read in that case. — ilkkachu, Jan 08 '19 at 12:31
It is a POSIX requirement, read my (added) answer. (any comment?). — , Jan 09 '19 at 02:07

score 1 · Accepted Answer · 2019-04-20T18:31:53.130

Yes, for a POSIX-compliant shell. The bash developer had this to say:

POSIX requires it for scripts that are read from stdin. When reading from a script given as an argument, bash reads blocks.

And, indeed, the POSIX spec says this (emphasis mine):

When the shell is using standard input and it invokes a command that also uses standard input, the shell shall ensure that the standard input file pointer points directly after the command it has read when the command begins execution. It shall not read ahead in such a manner that any characters intended to be read by the invoked command are consumed by the shell (whether interpreted by the shell or not) or that characters that are not read by the invoked command are not seen by the shell.

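That is, for a script read from stdin, the shell must not consume any input beyond the command it is about to execute. On a non-seekable input such as a pipe, that in practice means reading one character at a time.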

In C locale, one char is one byte.

It seems that posh, mksh, lksh, attsh, yash, ksh, zsh and bash conform to this requirement.

However, ash (busybox sh) and dash do not.

@ilkkachu That is assuming you know which is the first encoded byte. The point is that: reading at a random file offset might require reading several bytes (one-byte-at-a-time) to get in sync again and find the "first (encoded) byte". — , Jan 18 '19 at 17:39
@ilkkachu If any read sequence is found to be invalid: it could be rejected (right there and then), but generally it isn't. Valid C char values (any byte value) could generate any sequence of bytes. A shell must be able to deal with such sequences, even if invalid in some particular locale encoding. Try, for example: printf 'echo one\xea\xd5two' | sh -s. — , Jan 18 '19 at 17:48
