3

The Geostationary Operational Environmental Satellite (GOES)-R Product User Guide (PUG) from the National Oceanic and Atmospheric Administration (NOAA) contains the following rather wordy description of a plain text file (§4.3) (emphasis mine):

The Unix text file format is used in a small subset of the Level 1b and 2+ semi-static source data files. The Unix text file format, less the end-of-file character, is embedded in GRB metadata packets to store the XML-based netCDF Markup Language (NcML) representation of the netCDF file specifications, which includes the values for product metadata.

The Unix text file format is a sequence of lines (i.e., records), potentially variable in length, of electronic text. For the GOES-R ground system, the electronic text, newline, and end-of-file characters conform to the American Standard Code for Information Interchange (ASCII). At the end of each line is the newline character. At the end of file, there is an end-of-file character.

Is this an accurate description of the contents of a file? I thought that the end of file was a condition that the operating system or a library routine was returning when no more data can be read from a file (or other stream). Is this byte actually contained in the file?

gerrit
  • 3,487
  • I'd say they're completely wrong about an 'end-of-file' character. To start with, there is no EOF character. Some apps use EOT (^D) to signal the end of the file (or input), but that's app-specific, far from universal or required. DOS/Windows uses (or used to use, I dunno any more) ^Z. – cas Aug 28 '19 at 11:27
  • @cas What about the file separator, 0x1C if I'm not mistaken? They could of course also be erroneously naming the EOT as EOF, as you point out. – Philip Couling Aug 28 '19 at 22:15
  • 2
    Their definition is vacuously consistent, because they define "unix text file" as being terminated by a mysterious EOF character, but also specify that that character is omitted in the form used by them ("less the end-of-file character"). It's like defining "mouse" as "a rodent not wearing a hat". –  Aug 28 '19 at 22:47
  • 4
    I suspect it's a legacy issue where NOAA or NASA had at some stage been using a non-Unix OS that uses EOF characters and it's possible that this sentence was accidentally copied from an earlier document. – Anthony Geoghegan Aug 29 '19 at 10:03
  • @mosvy That's funny, but I believe incorrect. In this context "Unix Text File Format" appears to refer to a specific file format which is based on Unix text files. It's clearly not trying to say "any generic Unix text file". There does appear to be some prior knowledge expected on the part of the reader about what this EOF character is. Relying on such reader knowledge is terrible practice in a spec, but there it is. Someone else mentioned a UTF-8 BOM, and this does appear to be similar: readers may expect there to be one, but they explicitly rule it out. – Philip Couling Aug 29 '19 at 11:26
  • @PhilipCouling, I read it as describing both "the Unix text file format" in general (the last paragraph, with the mistaken mention of eof), and the more specific use in metadata packets (with the eof explicitly excluded). That sentence about the GOES-R ground system in the middle does make it a bit fuzzy though, so I'm not sure. – ilkkachu Aug 29 '19 at 15:30

3 Answers

4

The Unix text file format is a sequence of lines (i.e., records), potentially variable in length, of electronic text. At the end of each line is the newline character. At the end of file, there is an end-of-file character.

Is this an accurate description of the contents of a file?

Up to but excluding that last bolded part, yes. But I don't know of any Unixy systems that would use an end-of-file character; they all store the length of a file down to the byte, making such markers unnecessary.

Then again, it appears there have been systems that did use an end-of-file character. At least Wikipedia claims that:

The CP/M file system only recorded the lengths of files in multiples of 128-byte "records", so by convention a Control-Z character was used to mark the end of meaningful data if it ended in the middle of a record.

Having file lengths stored only up to a block would require some sort of custom to encode the end of the last line within the data stream. Any programs handling binary data would of course also have to deal with the more granular file sizes somehow. With binary files it might be easier to ignore the trailing "extra" bytes, though.
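To make that convention concrete, here is a minimal sketch of how a reader could recover the text from a file whose length is only known in whole records. The function names are hypothetical, and filling the entire unused tail with Control-Z is my simplification; the Wikipedia quote only says that a Control-Z marks the end of meaningful data.

```python
RECORD_SIZE = 128   # file lengths are known only in multiples of this
CTRL_Z = b"\x1a"    # ASCII 26 (SUB), used by convention as the end marker


def pad_to_records(text: bytes) -> bytes:
    """Store text as whole records, filling the unused tail with Control-Z."""
    remainder = len(text) % RECORD_SIZE
    if remainder == 0:
        # The text ends exactly on a record boundary, so no marker is needed.
        return text
    return text + CTRL_Z * (RECORD_SIZE - remainder)


def meaningful_text(stored: bytes) -> bytes:
    """Return everything before the first Control-Z, or all of it if none."""
    end = stored.find(CTRL_Z)
    return stored if end == -1 else stored[:end]


stored = pad_to_records(b"hello\r\n")
recovered = meaningful_text(stored)
```

Note that this only works for text: a binary file may legitimately contain byte 26, which is why binary formats on such systems had to find the real end of the data some other way.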

I think I've seen Control-Z used as an EOF marker on MS-DOS, but it wasn't necessary there either.

That quoted text seems to have a mistaken idea of text files in current systems. If we look at what the POSIX standard says, there's no mention of an end-of-file character or marker for text files, just that they contain no NUL bytes and consist of lines (ending in newlines).
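As a rough illustration of that definition, a check along these lines would be enough. The function name is made up, and the sketch deliberately ignores the line-length limit that POSIX also imposes.

```python
def looks_like_posix_text(data: bytes) -> bool:
    """Rough check: no NUL bytes, and every line ends in a newline."""
    if b"\x00" in data:
        return False
    # A file with zero lines (empty) qualifies; otherwise the last
    # line must be terminated by a newline like every other line.
    return data == b"" or data.endswith(b"\n")
```

Nothing in the check looks for a terminating marker byte; the only thing the last line needs is its newline.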

See also: What's the last character in a file?

As for this part...

For the GOES-R ground system, [...] and end-of-file characters conform to the American Standard Code for Information Interchange (ASCII).

Like others have said in the comments, there's no character for end-of-file in ASCII, at least not with that name (*). Control-Z mentioned above is 26, or "substitute" (SUB), "used to indicate garbled or invalid characters". So, based on just that text, it would be hard to know what the EOF character would be, were it used.

(* There's "end of text" (ETX, code 3), "end of transmission" (EOT, code 4), "end of transmission block" (ETB, 23), "end of medium" (EM, 25) and also "file separator" (FS, 28). Close, but not exact.)

I thought that the end of file was a condition that the operating system or a library routine was returning when no more data can be read from a file (or other stream).

That's what it is, indeed. The system call read() returns zero bytes (with no error) when the end of a file is reached, while some stdio functions (e.g. getchar()) have a special return value for it, unsurprisingly called EOF.
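The same behaviour can be seen from Python, where os.read() is a thin wrapper around read(). This is only a sketch: the helper name and the sample payload are mine. It reads a temporary file in small chunks until a call returns zero bytes, and the payload deliberately contains bytes 4 (EOT) and 26 (SUB) to show that they come back as ordinary data.

```python
import os
import tempfile


def read_until_eof(path, chunk_size=4):
    """Read a file with os.read() and count the calls, including the empty one."""
    # O_BINARY only exists on Windows; elsewhere it contributes nothing.
    fd = os.open(path, os.O_RDONLY | getattr(os, "O_BINARY", 0))
    try:
        chunks = []
        calls = 0
        while True:
            chunk = os.read(fd, chunk_size)
            calls += 1
            if chunk == b"":  # zero bytes and no error: end of file
                break
            chunks.append(chunk)
        return b"".join(chunks), calls
    finally:
        os.close(fd)


# Control-D (0x04) and Control-Z (0x1A) inside a file are just data bytes.
payload = b"a\x04b\x1ac\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
data, calls = read_until_eof(tmp.name)
os.unlink(tmp.name)
```

The loop stops because a read came back empty, not because any particular byte was seen.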

See also: Difference between EOT and EOF

ilkkachu
  • 138,973
  • IIRC, the POSIX definition of "text file" imposes a limit on the line length (LINE_MAX), which also has the unexpected consequence that shell scripts are not "text files", because a shell is supposed to accept lines of any length in scripts. –  Aug 29 '19 at 15:10
  • @mosvy, yep, I ignored the length limitation on purpose here. But you are right, that difference does seem a bit surprising. FWIW, It's spelled out in the description of sh: "The input file shall be a text file, except that line lengths shall be unlimited." – ilkkachu Aug 29 '19 at 15:21
3

This looks to be something very specific to the file format they are discussing. As a general rule, files don't NEED an EOF character. None is added unless a program explicitly writes one.

Checking an ASCII table I don't see an EOF character. They might be referring to an EOT or FS character, but that's not clear. https://www.cs.cmu.edu/~pattis/15-1XX/common/handouts/ascii.html

It is, however, common in some file formats to put a marker at the end of the file, particularly in simple file formats which are intended for communication. This protects against files being inadvertently truncated. If you know a file must end with a specific marker, and that marker only comes at the end, then you can easily tell whether you received the whole file or just part of it. As I read it, they are referring to this type of marker.
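A minimal sketch of that idea follows. The marker value and function names are invented for illustration and are not taken from the GOES-R guide, and the sketch assumes the format guarantees that the marker never appears inside the body.

```python
END_MARKER = b"#END\n"  # hypothetical marker, chosen only for this example


def add_marker(body: bytes) -> bytes:
    """Append the end marker before sending or storing the data."""
    return body + END_MARKER


def is_complete(received: bytes) -> bool:
    """True if the marker occurs exactly once, at the very end."""
    return received.endswith(END_MARKER) and received.count(END_MARKER) == 1


message = add_marker(b"line 1\nline 2\n")
```

With this, is_complete(message) holds, while any truncated prefix such as message[:-3] fails the check.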

-1

The 'end of file' character they're referring to is probably a single newline occurring as the last character in the file. Most conventional text files on UNIX and UNIX-like systems end in such a way so that you can use the cat command (or something similar) to display the file contents and be certain that the next command prompt will be on its own line.

Some poorly behaved applications actually fail to parse files correctly if they don't see this final newline. In this respect, it's a bit like the Unicode byte-order-mark in UTF-8 encoded text: it's not needed at all (actually, it's not even supposed to be there according to most standards), but some applications refuse to interpret things as UTF-8 without it.


From the perspective of the OS itself though, no such 'character' exists. The filesystem stores the correct size for the file, and when asked to read the file, the OS returns exactly that much data in total, so there's no point in even having such a concept, let alone having a character for it.
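A quick way to convince yourself of this is to compare the size the filesystem reports with what you wrote and what you get back. This is just a sketch using a temporary file; the variable names are mine.

```python
import os
import tempfile

content = b"first line\nsecond line\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(content)

# Length recorded in the file's metadata, exact to the byte.
size_on_disk = os.stat(tmp.name).st_size

# Reading the whole file returns exactly that many bytes and no more.
with open(tmp.name, "rb") as f:
    read_back = f.read()
os.unlink(tmp.name)
```

Here size_on_disk equals len(content), and read_back is byte-for-byte what was written, with nothing appended to mark the end.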

Some people confuse the EOT control code (^D) for this concept, as it's widely used on UNIX-like systems to signal the end of a stream of interactive input, but that's just a convention derived from the original usage (to signal the end of a transmission over some communication link). Note that this is significantly different from DOS systems, where ^Z was used to actually signal the end of a file both in interactive input and in actual files. The EOT control code doesn't actually show up in the data stream the application sees, it's interpreted by the terminal which signals an end-of-file condition to the application when it encounters ^D.

  • Hmm, interpreting EOT for this purpose seems risky for a binary file! – gerrit Aug 28 '19 at 21:18
  • 2
    This is misleading. 1) the standard is NOT referring to a trailing LF. 2) LF in *nix programs is usually a line terminator not separator. This makes processing simpler, and isn't related to pretty "cat" output. Blank files therefore don't usually need a solo LF. – Philip Couling Aug 29 '19 at 00:04
  • There are two separate EOT byte sequences and both definitely do show up in a program's data stream. ASCII 0x04 is uncommon. "^D" is a keyboard combination translated by the terminal into a byte sequence often represented the same way. Programs reading directly from the TTY or a file will receive these bytes! It's the shell that strips them before passing to a subprocess. – Philip Couling Aug 29 '19 at 00:05
  • @PhilipCouling And it's that interpretation of LF usage that leads to those misbehaving applications that have issues with files that don't have a LF on the last line. The assumption that the final line of a file must end with a newline of some sort is a recurring source of issues in software in general (not just software for UNIX-like systems). Most likely, the software in question here has some components that make that (incorrect) assumption. – Austin Hemmelgarn Aug 29 '19 at 14:30
  • @gerrit Except it's the terminal or the shell doing the interpreting, not the application itself. Most ask for this, but they can specifically request just receiving data directly as-is. Without that bypass though, the application never sees the EOT character in the input stream, because it gets turned into out-of-band signaling of the end of the input. – Austin Hemmelgarn Aug 29 '19 at 14:31
  • 3
    "The 'end of file' character they're referring to is probably a single newline occurring as the last character in the file." -- doubtful, since the quoted text ends with "At the end of each line is the newline character. At the end of file, there is an end-of-file character." and that seems to imply that a) there is a newline at end of the last line (since it's each line) and b) the end-of-file is somehow different than a newline (since why else call it something else). – ilkkachu Aug 29 '19 at 14:36