Can `head` read/consume more input lines than it outputs?

Question

Given the following 3 scripts:

printf 'a\nb\nc\n' > file && { head -n 1; cat; } < file
printf 'a\nb\nc\n' | { head -n 1; cat; }
{ head -n 1; cat; } < <(printf 'a\nb\nc\n')

I'd expect the output from each to be:

a
b
c

but for some of those, on some systems, that is not the case. For example, on cygwin:

$ printf 'a\nb\nc\n' > file && { head -n 1; cat; } < file
a
b
c

$ printf 'a\nb\nc\n' | { head -n 1; cat; }
a

$ { head -n 1; cat; } < <(printf 'a\nb\nc\n')
a

What is causing the different output from those scripts?

Additional info - this is apparently not just a head problem:

$ printf 'a\nb\nc\n' | { sed '1q'; cat; }
a
$ printf 'a\nb\nc\n' | { awk '1;{exit}'; cat; }
a
$ { sed '1q'; cat; } < <(printf 'a\nb\nc\n')
a
$ { awk '1;{exit}'; cat; } < <(printf 'a\nb\nc\n')
a

What would be a robust, POSIX way in shell (i.e. without just invoking awk or similar once to do everything) to read some number of lines from input and leave the rest for a different command regardless of whether the input is coming from a pipe or a file?

This question was inspired by comments under an answer to sort the whole .csv based on the value in a certain column.

I know you are asking for POSIX so I'm commenting instead of an answer. With GNU sed, you can use the -u option to avoid the buffering issue. Wonder why similar option isn't available for other commands though. — Sundeep, Jul 04 '23 at 07:41
@Sundeep but the OP is just asking for a "POSIX way" in the shell, in response to this comment. "POSIX way" could mean "honoring the POSIX requirement". So maybe you have valuable information to post? Does Perl honor the POSIX requirement for seekable files? — jubilatious1, Jul 04 '23 at 15:16
@jubilatious1 By "POSIX way" I mean using tools defined by POSIX and using POSIX standard definitions for options, regexps, etc. — Ed Morton, Jul 04 '23 at 18:22
@EdMorton I'll confess I'm confused. There are some well-established idioms that do important functions, such as truncating a file in-place in sh using tail and perl (see this Answer). So it seems that some tools which may not be "defined" by POSIX still honor/respect POSIX definitions? If so, wouldn't those answers be useful? — jubilatious1, Jul 04 '23 at 21:39
As a POSIX solution, how about: (i=<LINES_TO_READ>; while [ $((i=i-1)) -ge 0 ]; do read; done; <command to read the rest>) ? — aviro, Jul 05 '23 at 10:57
@jubilatious1 if you have an answer you think is useful feel free to post it, it's just not what I'm personally interested in at this time, I'm just looking for an answer that anyone on a POSIX system can use but maybe others would find non-POSIX solutions useful too, idk. — Ed Morton, Jul 05 '23 at 12:45
@aviro if you have a solution you think is useful please post it as an answer so people can comment on it and up/down vote it and others in future can easily see it. — Ed Morton, Jul 05 '23 at 12:47

score 14 · Accepted Answer · edited Jul 04 '23 at 06:25

head may read its whole input. It must read at least what it outputs (otherwise it is logically impossible to implement), but it may read more.

Typically head asks the operating system to read a fixed-size buffer (by calling the read system call or similar). It then looks for newline characters in that buffer, and prints output until it reaches the desired number of lines.

All POSIX compliant implementations of head call lseek to reset the file position on the input to be just after the end of the part that has been copied to the output. However this is only possible if the file is seekable: this includes ordinary files, but not pipes. If the input is a pipe, whatever head has read is discarded from the pipe and cannot be put back. This explains the difference you observed between <file (regular file) and | or <() (pipe).

The relevant section of the POSIX standard is:

When a standard utility reads a seekable input file and terminates without an error before it reaches end-of-file, the utility shall ensure that the file offset in the open file description is properly positioned just past the last byte processed by the utility. For files that are not seekable, the state of the file offset in the open file description for that file is unspecified. A conforming application shall not assume that the following three commands are equivalent:
tail -n +2 file
(sed -n 1q; cat) < file
cat file | (sed -n 1q; cat)
The second command is equivalent to the first only when the file is seekable. The third command leaves the file offset in the open file description in an unspecified state. Other utilities, such as head, read, and sh, have similar properties.

Some head implementations, such as the head builtin of ksh93 (enabled after running builtin head, provided it was included at build time), also try not to leave the cursor past the last line they have output when the input is not seekable. ksh93 does this by reading the input one byte at a time (as shells' read builtins typically do) or, on systems that offer such a possibility (not Linux), by peeking at the contents of the pipe before reading it. But such implementations are rather the exception, as there's a steep performance penalty.
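As a rough illustration of the byte-at-a-time approach, here is a minimal sketch that relies only on the shell's read builtin, along the lines aviro suggested in the comments. The function name take_lines is hypothetical, and the sketch assumes the shell's read does not consume input past the newline on a pipe, which is what shells typically do but is worth checking on your system.

```shell
# Hypothetical helper: copy the first n lines of stdin to stdout using only
# the shell's read builtin, leaving the remainder of stdin for the next command.
take_lines() {
    n=$1
    # IFS= and -r keep leading/trailing blanks and backslashes intact.
    while [ "$n" -gt 0 ] && IFS= read -r line; do
        printf '%s\n' "$line"
        n=$((n - 1))
    done
}

# Same shape as the pipeline in the question, with head replaced by the helper.
printf 'a\nb\nc\n' | { take_lines 1; cat; }
```

This is slow for large line counts, drops a final line that lacks a trailing newline, and cannot handle NUL bytes in the lines it reads, so it is best suited to skipping or copying a handful of header lines.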

Thanks for the answer. I added a bit to it just now to ask what WOULD be a robust way to read a few lines then leave the rest so if you have something in mind for that please do let us know. — Ed Morton, Jul 03 '23 at 18:38
@EdMorton Do everything inside the same tool. If you really need multiple tools (for example to sort a file except for N header lines), have a single “driver” tool do the splitting and joining. (Or use a different approach, e.g. prepending a section indicator to each line if you want to sort only part of a file.) — Gilles 'SO- stop being evil', Jul 03 '23 at 18:46
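
One possible reading of the "single driver tool" suggestion, for the sort-except-header case, is to let awk do the splitting and hand only the body to sort. This is a sketch under assumptions, not necessarily what the commenter had in mind: the function name sort_body and the sample data are made up, and it relies on awk flushing its own output before starting the sort command, which common awk implementations do.

```shell
# Hypothetical helper: pass the first n lines through untouched, sort the rest.
sort_body() {
    awk -v n="$1" '
        NR <= n { print; next }   # header lines go straight to stdout
        { print | "sort" }        # all other lines are piped to one sort process
    '
}

# Made-up sample input: one header line followed by unsorted rows.
printf 'name\ncherry\napple\nbanana\n' | sort_body 1
```

Because a single process reads all of the input, it does not matter whether that input is a pipe or a regular file.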
