120

If I want to tail a 25 GB textfile, does the tail command read the whole file?

Since a file might be scattered on a disk I imagine it has to, but I do not understand such internals well.

5 Answers

126

No, tail doesn't read the whole file. It seeks to the end, then reads blocks backwards until the expected number of lines has been found; it then displays the lines in the proper order up to the end of the file, and possibly keeps monitoring the file if the -f option is used.
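The backward scan described above can be sketched in a few lines of Python. This is only an illustration of the idea, not tail's actual code: the function name `tail_lines` and the 8192-byte default buffer are assumptions made for the example.

```python
import os

def tail_lines(path, n=10, bufsize=8192):
    """Return the last n lines of a seekable file, reading it backwards.

    Hypothetical helper for illustration; not how any real tail is written.
    """
    if n <= 0:
        return b""
    with open(path, "rb") as f:
        pos = f.seek(0, os.SEEK_END)   # jump to EOF without reading anything
        data = b""
        newlines = 0
        # Read blocks backwards until we have seen more than n newlines
        # (enough to delimit n lines) or we hit the start of the file.
        while pos > 0 and newlines <= n:
            step = min(bufsize, pos)
            pos -= step
            f.seek(pos)
            chunk = f.read(step)
            newlines += chunk.count(b"\n")
            data = chunk + data
    # Walk back over n line terminators, ignoring the one that ends the file.
    idx = len(data) - 1 if data.endswith(b"\n") else len(data)
    for _ in range(n):
        idx = data.rfind(b"\n", 0, idx)
        if idx == -1:
            return data                # fewer than n lines: return everything
    return data[idx + 1:]
```

For a multi-gigabyte file with ordinary line lengths, a call such as `tail_lines("huge-file", 10)` would touch only the last block or two, which is why the file size hardly matters.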

Note, however, that tail has no choice but to read all the data when given non-seekable input, for example when reading from a pipe.
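For contrast, here is a minimal sketch of the non-seekable case. Every line has to pass through the program, and only the last n are kept in memory. The name `tail_stream` is hypothetical, and real implementations buffer differently.

```python
from collections import deque

def tail_stream(stream, n=10):
    """Return the last n lines of a binary stream that cannot be seeked.

    Hypothetical helper: the whole stream is consumed from start to end,
    but the bounded deque keeps at most n lines in memory at any time.
    """
    return b"".join(deque(stream, maxlen=n))
```

Something like `tail_stream(sys.stdin.buffer, 10)` would behave this way when the input is a pipe: the cost grows with the size of the input, not with n.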

Similarly, when asked to output lines starting from a given line counted from the beginning of the file, using the tail -n +linenumber syntax or the non-standard tail +linenumber option where supported, tail obviously reads the whole file (unless interrupted).

jlliagre
  • 61,204
  • 16
    Damn, too fast :-). Here's the relevant source code. Print the last N_LINES lines from the end of file FD. Go backward through the file, reading 'BUFSIZ' bytes at a time (except probably the first), until we hit the start of the file or have read NUMBER newlines. – phemmer Nov 28 '13 at 09:54
  • 1
    Also, tail +n will read the whole file - first to find desired number of newlines, then to output the rest. – SF. Nov 28 '13 at 10:20
  • @SF. Indeed, answer updated. – jlliagre Nov 28 '13 at 10:25
  • 5
    Note that not all tail implementations do it or do it properly. For instance busybox 1.21.1 tail is broken in that regard. Also note that the behavior varies when tailing stdin and where stdin is a regular file and the initial position in the file is not at the beginning when tail is invoked (like in { cat > /dev/null; tail; } < file) – Stéphane Chazelas Nov 28 '13 at 10:59
  • 4
    @StephaneChazelas *nix -- the world of weird edge-cases becoming normal. (Though seekable vs non-seekable inputs is definitely a valid point.) – user Nov 28 '13 at 11:49
70

You can see how tail works for yourself. As you can see below, for one of my files read is called three times and roughly 10K bytes are read in total:

strace 2>&1  tail ./huge-file >/dev/null  | grep -e "read" -e "lseek" -e "open" -e "close"
open("./huge-file", O_RDONLY)           = 3
lseek(3, 0, SEEK_CUR)                   = 0
lseek(3, 0, SEEK_END)                   = 80552644
lseek(3, 80551936, SEEK_SET)            = 80551936
read(3, ""..., 708) = 708
lseek(3, 80543744, SEEK_SET)            = 80543744
read(3, ""..., 8192) = 8192
read(3, ""..., 708) = 708
close(3)                                = 0
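The offsets in the trace follow a simple pattern: the first read covers the partial block at the end of the file, and each later read steps back one full block. The sketch below reproduces that arithmetic. The function name is made up for the example, and the 8192-byte block size is inferred from the trace rather than taken from tail's source.

```python
def backward_read_plan(size, bufsize=8192):
    """Yield (offset, length) pairs for reading a file of `size` bytes backwards.

    Hypothetical helper: the first read is the odd-sized remainder at the
    end, so that every following read starts on a bufsize-aligned offset.
    """
    if size <= 0:
        return
    length = size % bufsize or bufsize   # trailing partial block, or a full one
    pos = size - length
    yield pos, length
    while pos > 0:
        pos -= bufsize
        yield pos, bufsize
```

With the file size from the trace (80552644), the first two pairs are (80551936, 708) and (80543744, 8192), matching the two lseek/read pairs shown. The final 708-byte read in the trace is tail reading forward again to print the rest of the file.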
jlliagre
  • 61,204
  • I don't see how this answers the question. Can you explain what's going on here? – Iain Samuel McLean Elder Nov 29 '13 at 15:27
  • 12
    strace shows which system calls tail makes when it runs. For an introduction to system calls, see http://en.wikipedia.org/wiki/System_call. Briefly: open opens a file and returns a handle (3 in this example), lseek sets the position you are going to read from, and read just reads; as you can see, it returns how many bytes were read. –  Nov 29 '13 at 15:34
  • 2
    So analyzing system calls you can sometimes understand how a program works. –  Nov 29 '13 at 15:41
26

Since a file might be scattered on a disk I imagine it has to [read the file sequentially], but I do not understand such internals well.

As you now know, tail just seeks to the end of the file (with the system call lseek), and works backwards. But in the remark quoted above, you're wondering "how does tail know where on disk to find the end of the file?"

The answer is simple: tail does not know. User-level processes see files as continuous streams of bytes, so all tail can know is the offset from the start of the file. But in the filesystem, the file's "inode" (the metadata record that a directory entry points to) is associated with a list of numbers denoting the physical location of the file's data blocks. When you read from the file, the kernel / the device driver figures out which part you need, works out its location on disk and fetches it for you.

That's the kind of thing we have operating systems for: so you don't have to worry about where your file's blocks are scattered.
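To make the point concrete, here is a small sketch of what a user-level program actually supplies when it reads from the middle of a file: a logical byte offset and a length, nothing more. `read_at` is a hypothetical helper name; the `os` calls are thin wrappers around the open, lseek and read system calls.

```python
import os

def read_at(path, offset, length):
    """Read up to `length` bytes starting at logical byte `offset`.

    The caller never says where the data lives on disk; mapping the
    offset to physical blocks is left entirely to the kernel.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)   # position by offset only
        return os.read(fd, length)
    finally:
        os.close(fd)
```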

alexis
  • 5,759
2

If head or tail appears to be reading the entire file, a likely reason is that the file contains few or no newline characters. I tripped over this a few months ago with a very large (gigabytes) JSON blob that had been serialized with no whitespace whatsoever, not even in strings.

If you have GNU head/tail you can use -c N to print the first/last N bytes instead of lines. POSIX specifies -c for tail; for head it has historically been an extension, so it may not be portable.
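Reading by byte count avoids the problem because no newline search is needed: one seek and one read are enough. A minimal sketch of the idea, with `tail_bytes` as a hypothetical name:

```python
import os

def tail_bytes(path, n):
    """Return the last n bytes of a seekable file (or all of it if shorter)."""
    with open(path, "rb") as f:
        size = f.seek(0, os.SEEK_END)   # file size, found without reading data
        f.seek(max(size - n, 0))        # clamp so short files start at offset 0
        return f.read()
```

On a newline-free blob like the JSON file mentioned above, this reads only the requested bytes no matter how large the file is.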

zwol
  • 7,177
1

The comment at line 525 of the source code describes the implementation:

 /* Print the last N_LINES lines from the end of file FD.
   Go backward through the file, reading 'BUFSIZ' bytes at a time (except
   probably the first), until we hit the start of the file or have
   read NUMBER newlines.
   START_POS is the starting position of the read pointer for the file
   associated with FD (may be nonzero).
   END_POS is the file offset of EOF (one larger than offset of last byte).
   Return true if successful.  */
muru
  • 72,889
Seeni
  • 111
  • 3