Is this tail behavior in Grouping Commands specified by POSIX?

Question

Using tail combined with other standard tools in Grouping Commands can make some powerful constructions. For example, to get the first and last line of a file:

$ seq 10 > file
$ { head -n1; tail -n1; } <file
1
10

When feeding file contents from a pipe to the grouped commands, tail fails to produce output, because a pipe is not seekable:

$ seq 10 | { head -n1; tail -n1; }
1

Now, when the content is big enough, tail works:

$ seq 10000 | { head -n1; tail -n1; }
1
10000

That's because after the first lseek failure, tail knows the file descriptor is not seekable, and because the contents of the pipe haven't all been read yet, it starts reading the content until the end.

From the user's point of view, I expect the behavior to be consistent regardless of the size of the input. I've looked through the POSIX documentation for tail and lseek and did not find any description of this.

Is this behavior specified by POSIX? If not, how can I make the result always consistent?

I have tested with GNU tail and FreeBSD tail, both have the same behavior.

it might be worth pointing out that splitting input like that with tail probably isn't particularly useful, and might actually make for more work under the hood. as stéphane mentions, it requires extra input validation for a tail that can simply seek through to the end of input, because it has to compare that input offset to one it might lseek() to, and the outcome is no different than head -n1 file; tail -n1 file. i find that stuff more useful when slicing input off of the top: while IFS= read -r v; do { printf %s\\n "$v"; head; } >&"$((1+(x=!x)))"; done <in >out1 2>out2 – mikeserv, Nov 01 '15 at 02:33
actually, i guess when you group tail at the least it shouldn't print overlap, so there's that, of course. my apologies. – mikeserv, Nov 01 '15 at 02:44

Stéphane Chazelas · Accepted Answer · 2018-05-07T16:12:50.673

Note that the problem here is not with tail but with head, which reads more from the pipe than the first line it is meant to output (so there's nothing left for tail to read).

And yes, it's POSIX conformant.

head is required to leave the cursor within stdin just after the last line it has output when the input is seekable, but not otherwise.

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap01.html:

When a standard utility reads a seekable input file and terminates without an error before it reaches end-of-file, the utility shall ensure that the file offset in the open file description is properly positioned just past the last byte processed by the utility. For files that are not seekable, the state of the file offset in the open file description for that file is unspecified.

For head to be able to do that for a non-seekable file, it would have to read one byte at a time, which would be terribly inefficient¹. That's what the read or line utilities do, or GNU sed with the -u option.

So you can replace head -n 20 with gsed -u 20q if you want that behaviour.
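
As a rough sketch of the byte-at-a-time approach, the shell's own read builtin can stand in for head -n 1 in the question's pipeline. The function name first_then_last is made up for illustration, and the sketch assumes a POSIX shell whose read does not consume input past the newline when stdin is a pipe:

```shell
# Hypothetical helper: print the first and the last line of stdin.
# read consumes only the first line, so tail still sees the rest.
first_then_last() {
    # Accept a final line that lacks a trailing newline; stop on empty input.
    IFS= read -r first || [ -n "$first" ] || return 0
    printf '%s\n' "$first"
    tail -n 1
}

seq 10 | first_then_last
```

With the 10-line input from the question this prints 1 and then 10, because nothing beyond the first line was taken out of the pipe before tail started.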

Though here, you'd rather want:

sed -e 1b -e '$b' -e d

instead. Here there is only one tool invocation, so there is no problem with an internal buffer that can't be shared between two tool invocations. Note however that for large files, it's going to be less efficient, as sed reads the whole file, while for seekable files tail would skip most of it by seeking near the end of the file.
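
A minimal sketch of that single-invocation approach, wrapped in a function so it can be tried on inputs of different sizes. The name first_and_last is illustrative only:

```shell
# Hypothetical wrapper around the sed command above.
#   1b  on line 1, branch to the end of the script (the line is auto-printed)
#   $b  same for the last line
#   d   delete every other line
first_and_last() {
    sed -e 1b -e '$b' -e d
}

seq 10 | first_and_last       # small input from a pipe
seq 10000 | first_and_last    # larger input from a pipe
```

Both calls print the first and last line regardless of input size. A one-line input is printed only once, because line 1 branches to the end of the script before the $ address is tested.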

See the related discussion about buffering at Why is using a shell loop to process text considered bad practice?.

Note that tail must output the tail of the stream on stdin. While, as an optimisation and for seekable files, implementations may seek to the end of the file to get the trailing data from there, they are not allowed to seek back to a point before the initial position at the time tail was invoked (Busybox tail used to have that bug).

So for instance in:

{ cat; tail -n 1; } < file

Even though tail could seek back to the last line of file, it does not. Its stdin is an empty stream as cat left the cursor at the end of the file; it's not allowed to reclaim data from that stream by seeking further backward in the file.

(The text above about tail not seeking back before its initial position is crossed out pending clarification by the Open Group, considering that several implementations do not behave that way.)

¹ The head builtin of ksh93 (enabled if you put /opt/ast/bin ahead of $PATH), for sockets (a type of non-seekable file), instead peeks at the input (using recvfrom(..., MSG_PEEK)) before actually reading it, to see how much it needs to read without reading too much. It falls back to reading one byte at a time for other types of files. That is slightly more efficient, and I believe it is the main reason why ksh93 implements its pipes with socketpair()s instead of pipe(). Note that it's not completely foolproof, as there's a race condition that could be triggered if another process read from the socket between the peek and the read.

edited May 07 '18 at 16:12

answered Oct 29 '15 at 17:29


I want to know about tail's behavior. After the first failed lseek, it started to read the content, whereas in seq 10000 | tail -n1 it did not even try to perform any lseek. Does POSIX define tail's behavior if lseek fails? – cuonglm Oct 29 '15 at 17:34
@cuonglm, if my reading of Stephane's answer is correct, with seq 10 | { head -n1; tail -n1; }, tail is left with nothing to read because head has greedily slurped up the input. In any case, it would not make sense for tail to attempt an lseek on a non-seekable file – iruvar Oct 29 '15 at 17:40
@cuonglm, it only lseeks for seekable files (which it probably determines with a fstat(0)). See also my edit. – Stéphane Chazelas Oct 29 '15 at 17:46
@StéphaneChazelas: Thanks for the thorough answer (as always :) ). One last thing I'm wondering: when the first lseek attempt fails, does POSIX allow tail to give up on lseek and change its behavior to reading the file until the end? – cuonglm Oct 29 '15 at 19:04
The lseek() is an implementation choice for optimisation; POSIX specifies the behaviour, which is to get the tail of the input, not the details of how it should be done. In any case, I don't see GNU tail attempting, let alone failing, an lseek when the input is a pipe. – Stéphane Chazelas Oct 29 '15 at 19:16
@StéphaneChazelas: In strace -fe lseek sh -c 'seq 10 | { head -n1; tail -n1; }', I saw tail do only one lseek. And as in my comment above, in seq 10000 | tail -n1, tail didn't even do any lseek. – cuonglm Oct 30 '15 at 01:25
@cuonglm, most likely what you're seeing is head attempting an lseek() (to bring the cursor to the end of the first line), not tail. – Stéphane Chazelas Oct 30 '15 at 09:16
@StéphaneChazelas: Ah, my bad. Thanks for clarifying. – cuonglm Oct 30 '15 at 10:08
tac does that same buggy tail thing, though it's not specified one way or the other, of course. { cat; tac; } < file – mikeserv Nov 01 '15 at 00:01
great quote for the head behavior. Is there a similar reference for "it [tail] is not allowed to seek back to a point that would be before the initial position at the time tail was invoked"? – jfs Apr 25 '18 at 20:29
I ask because FreeBSD's tail utility breaks this behavior (it can mmap). – jfs Apr 29 '18 at 21:03
@jfs, hmmm, you're right, there doesn't seem to be anything in POSIX requiring that (even allowing that by some reading) and not many implementations do it. Not even busybox. So either they've reverted the change or it's my memory playing tricks on me (the latter more likely). I'll raise the question to the Austin Group. – Stéphane Chazelas Apr 30 '18 at 06:21
@jfs, see edit and discussion at https://www.mail-archive.com/austin-group-l@opengroup.org/msg02417.html – Stéphane Chazelas Apr 30 '18 at 16:37
Why sed -e 1b -e '$b' -e d and not the much simpler sed -n '1p;$p'? – May 01 '18 at 05:11
Also, on a medium size file seq 100000000>file, the sed command takes more than 10 seconds while a head -n1 file; tail -n1 file takes just 0.011 seconds. – May 01 '18 at 05:14
@isaac, sed -n '1p;$p' would print the content of a one-line file twice. Good point about your second comment. Thanks. See edit. – Stéphane Chazelas May 01 '18 at 06:05
Ah, interesting, thanks. Maybe sed -n '1{p;b};$p' then? – May 01 '18 at 22:25
@isaac, that would work with GNU sed, but portably, you'd need: sed -ne '1{p;b' -e '}' -e '$p'. b};$p branches to the "};$p" label in sed implementations based on the original one. If you wanted a shorter one, you could do sed -e 1b -e '$!d' but I find it less legible than sed -e 1b -e '$b' -e d. – Stéphane Chazelas May 02 '18 at 08:35
