
I was trying to compute the SHA-256 hash of a simple string, namely "abc". I found that using the sha256sum utility like this:

sha256sum file_with_string

gives results identical to:

sha256sum # enter, to read input from stdin
abc
^D

namely:

edeaaff3f1774ad2888673770c6d64097e391bc362d7d6fb34982ddf0efd18cb

Note that before the end-of-input signal another newline was fed to stdin.


What bugged me at first was that when I decided to verify it with an online checksum calculator, the result was different:

ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad

I figured it might have had something to do with the second newline I fed to stdin, so I tried inserting ^D twice this time (instead of using newline) with the following result:

abcba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad

Now, this is of course poorly formatted (due to the lack of a newline character), but that aside, it matches the one above.

After that, I realized I clearly fail to understand something about input parsing in the shell. I double-checked and there's no redundant newline in the file I specified initially, so why am I experiencing this behavior?

mdx
  • I don't understand. How is ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad different from printf abc | sha256sum? – Arkadiusz Drabczyk Oct 10 '20 at 13:25
  • It is not. However, sha256sum file where file contains the actual abc - very much is. And that's the method I used - with an actual file. – mdx Oct 10 '20 at 13:29
  • @TuRtoise: no, it isn't. Try printf abc > file; sha256sum file – Arkadiusz Drabczyk Oct 10 '20 at 13:30
  • @Arkadiusz Drabczyk, you're right, this way all works as expected. However, when I create and fill the file by hand, I'm getting the behaviour I described. Edit: by hand I mean in a text editor, e.g. vim. – mdx Oct 10 '20 at 13:38
  • Why should text files end with a newline? Your vim behaves well and terminates the line. – Kamil Maciorowski Oct 10 '20 at 14:15
  • @TuRtoise when you edit a file in vim, vim will automatically add a terminating newline if there isn't already one. You can check with printf abc > foo; cp foo bar; vim bar, then in vim simply enter :wq without doing any other modification. Then compare the two files with diff foo bar. –  Oct 10 '20 at 14:57
  • @TuRtoise If you want to understand how ^D works (my explanation is probably not the best, and there's a lot of stupid misinformation online -- that ^D sends a "signal", etc), try the following: open two terminal windows, and in the first run tty, and take the path printed by it (eg. /dev/pts/7) and run while :; do printf NEWCAT:; cat -v; done >/dev/pts/7 in the second terminal. Then experiment with entering ^D after either a newline or some other text. –  Oct 10 '20 at 15:25
  • Like others have mentioned, that last newline is an expected part of a text file. The vi(m) on my Debian also shows "[Incomplete last line]" or "[noeol]" when opening a file that doesn't have the final newline. On the other hand, many other editors (try e.g. nano) actually allow you to move to the start of the line after the last actual line, or even remove that final newline, making it more visible by treating it as starting a new line, instead of just being a line terminator. (Calling it a terminator would seem more in line with the definition requiring one at the end of each line.) – ilkkachu Oct 10 '20 at 21:11
  • "before the end-of-input signal another newline was fed to stdin". No. Not "another". One newline, and that's it. Also as explained in the answers, ^D is not actually sent, it just ends the stream. – jcaron Oct 10 '20 at 21:54
  • @jcaron, "another" if they counted the one at the end of the sha256 command (which they also explicitly mentioned in the code block). – ilkkachu Oct 10 '20 at 21:58
  • "After that, I realized I clearly fail to understand something about input parsing in the shell." Do you mean in the terminal, POSIX TTY semantics? The only things the shell is parsing are the commands sha256sum file_with_string or sha256sum. Use strace sha256sum to see the read system call it makes, and see what input you submit when you hit control-D on an empty line (creating a read()=0 meaning EOF) vs. a non-empty line (just submitting the line). (You can do this with strace cat > /dev/null as well. Similar to @user414777) Anyway, is that what this question is about? – Peter Cordes Oct 11 '20 at 13:20
  • The byte value 4 corresponds to Ctrl-D (EOT, \004), and 10 is newline (LF, line feed, \012, \n). With the typical stty settings, the terminal converts a Ctrl-D (EOT) at the beginning of line to an end-of-file indication for the process reading from that TTY device, as indicated by read(0, "abc\n", 32768) = 4 on the stderr of strace -e read sha256sum. Thus it's the same as printf 'abc\n' | strace -e read sha256sum. – pts Oct 12 '20 at 15:07

2 Answers

The difference is the newline. First, let's just collect the sha256sums of abc and abc\n:

$ printf 'abc\n' | sha256sum 
edeaaff3f1774ad2888673770c6d64097e391bc362d7d6fb34982ddf0efd18cb  -
$ printf 'abc' | sha256sum 
ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad  -

So, the ba...ad sum is for the string abc, while the ed...cb one is for abc\n. Now, if your file is giving you the ed...cb output, that means your file has a newline. And, given that "text files" (as defined by POSIX) require a trailing newline, most editors will add one for you when you create a new file.

To get a file without a newline, use the printf approach above. Note how the file command reports when a file has no newline:

$ printf 'abc' > file
$ file file
file: ASCII text, with no line terminators

And:

$ printf 'abc\n' > file2
$ file file2
file2: ASCII text

And now:

$ sha256sum file file2
ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad  file
edeaaff3f1774ad2888673770c6d64097e391bc362d7d6fb34982ddf0efd18cb  file2
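
To check an existing file rather than create a new one, it can help to look at its last byte directly. The following is a minimal sketch, assuming the usual coreutils tools (mktemp, tail, od, sha256sum) are available; the scratch directory, the file name "edited" and the variable names are illustrative and not part of the setup above.

```shell
# Work in a scratch directory so no existing files are touched.
tmpdir=$(mktemp -d)
printf 'abc\n' > "$tmpdir/edited"   # stands in for a file saved by an editor

# Show the last byte of the file in hex; 0a means it ends with a newline (LF).
last_byte=$(tail -c 1 "$tmpdir/edited" | od -An -tx1 | tr -d ' \n')
echo "last byte: $last_byte"

# Command substitution strips trailing newlines, so this hashes the content
# without its final newline while leaving the file itself unchanged.
sum_stripped=$(printf '%s' "$(cat "$tmpdir/edited")" | sha256sum | cut -d' ' -f1)
echo "$sum_stripped"

rm -r "$tmpdir"
```

Note that command substitution removes every trailing newline, not just one, so this trick is only suitable when the file is meant to hold a single short string.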
terdon
  • Can you cite documentation stating that text files require a trailing newline? In the last few decades I've seen many plain text files with various line endings (e.g. CR, LF, CRLF), with or without a line ending at the end. I've never seen a universal rule stating that a newline is required. Also, the file command doesn't display a warning about the missing newline, it just describes what it has detected. – pts Oct 12 '20 at 15:02
  • @pts It's a POSIX thing, platforms that have CRLF or CR are not covered. See What conditions must be met for a file to be a text file as defined by POSIX? – terdon Oct 12 '20 at 15:59
13
sha256sum # enter, to read input from stdin
abc
^D

so I tried inserting ^D twice this time (instead of using newline)

When you press ^D (VEOF) on a tty in canonical mode (the default in any command line window, xterm, etc.), the terminal driver ("line discipline") immediately makes the already-buffered data available to the process reading from the tty, without waiting for a newline.

When you enter abc, <newline>, then ^D, sha256sum will read the "abc\x0a" string (i.e. terminated by a LF) after the <newline>, and the empty string "" (i.e. a read of size 0) after the ^D, which sha256sum will interpret as end-of-file.

When you enter abc, then ^D twice, sha256sum will read the "abc" string after the first ^D, and then again the empty string "" after the second ^D.

So the output will have an extra newline in the former case, and the sha256sum checksum will be different.
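
The two byte streams can be reproduced without a terminal. This is only a sketch, and it assumes that a pipe delivers the same bytes sha256sum would have read from the tty in each case; od is used just to display them in hex, and the variable names are illustrative.

```shell
# Case 1: abc, <newline>, ^D  ->  sha256sum reads the four bytes "abc\n".
bytes_newline=$(printf 'abc\n' | od -An -tx1 | tr -d ' \n')
# Case 2: abc, ^D, ^D  ->  sha256sum reads only the three bytes "abc".
bytes_plain=$(printf 'abc' | od -An -tx1 | tr -d ' \n')

echo "with newline:    $bytes_newline"   # 6162630a
echo "without newline: $bytes_plain"     # 616263

# The trailing 0a (LF) is the only difference, and it changes the checksum.
printf 'abc\n' | sha256sum
printf 'abc'   | sha256sum
```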

In the case of a regular file, sha256sum will keep reading until it reaches the end-of-file, where, just as in the two cases above, a read will return an empty string. The situation is similar, and sha256sum is completely unaware of whether its input is a terminal, a pipe, or a regular file.
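
One way to convince yourself of this is to feed the same bytes to sha256sum through a pipe, a redirection, and a file name. A minimal sketch, assuming mktemp and the coreutils sha256sum are available; the temporary file and the variable names are made up for illustration.

```shell
tmpfile=$(mktemp)
printf 'abc\n' > "$tmpfile"

# cut keeps only the hex digest and drops the trailing file name column.
from_pipe=$(printf 'abc\n' | sha256sum | cut -d' ' -f1)
from_redirect=$(sha256sum < "$tmpfile" | cut -d' ' -f1)
from_file=$(sha256sum "$tmpfile" | cut -d' ' -f1)

# All three digests are identical because the bytes read are identical.
printf '%s\n' "$from_pipe" "$from_redirect" "$from_file"

rm -f "$tmpfile"
```

A terminal is harder to script, but the same reasoning applies: only the bytes delivered before the zero-length read matter.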

  • "the output will have an extra newline in the former case," -- it should only have the one newline that the user entered before hitting ^D. One more than in the other case, yes, but one that was entered and is supposed to be there and is not extra. – ilkkachu Oct 10 '20 at 20:45
  • it seemed good to me, just that particular phrasing felt a bit off. – ilkkachu Oct 10 '20 at 21:34
  • hexdump -C is definitely your friend in such cases. It's really the only thing you can be sure will tell you byte-for-byte what the file actually contains, contrary to most editors which will automatically convert stuff one way or another. – jcaron Oct 10 '20 at 21:56
  • od -tx1 can be used if you don't have hexdump. – Jasen Oct 11 '20 at 06:26
  • od -tx1 = output as hexadecimal 1-byte units – Adam Oct 11 '20 at 08:05
  • strace cat > /dev/null is useful for seeing those canonical TTY read semantics in action. Or of course strace sha256sum. – Peter Cordes Oct 11 '20 at 13:24
  • Apologies for the delay - a very useful post for sure. I think terdon's got the gist in a bit more structured and concise way, so I'm accepting his answer. Nevertheless, +1 – mdx Oct 18 '20 at 12:31