In "How do I bring HEREDOC text into a shell script variable?", someone reports a problem with a here document that has a quoted delimiter word and sits inside $(...) command substitution: a backslash \ at the end of a line inside the document triggers line continuation (the backslash and newline are removed and the two lines are joined), while the same here document outside command substitution works as expected.

Here is a simplified example document:

cat <<'EOT'
abc ` def
ghi \
jkl
EOT

This includes one backtick and one backslash at the end of a line. The delimiter is quoted, so no expansions occur inside the body. In all Bourne-alikes I can find this outputs the contents verbatim. If I put the same document inside a command substitution as follows:

x=$(cat <<'EOT'
abc ` def
ghi \
jkl
EOT
)
echo "$x"

then they no longer behave identically:

  • dash, ash, zsh, ksh93, BusyBox ash, mksh, and SunOS 5.10 POSIX sh all give the verbatim contents of the document, as before.
  • Bash 3.2 gives a syntax error for an unmatched backtick. With matched backticks, it attempts to run the contents as a command.
  • Bash 4.3 collapses "ghi" and "jkl" onto a single line, but has no error. The --posix option does not affect this. Kusalananda tells me (thanks!) that pdksh behaves the same way.
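
A minimal sketch of how this comparison could be scripted, assuming a single-quoted string is an acceptable reference for the verbatim body. The variable names (expected, actual, verdict) are illustrative only. A shell that rejects the construct outright, as Bash 3.2 is described as doing above, would fail to parse the script rather than print a verdict.

```shell
#!/bin/sh
# Reference copy of the body. Inside single quotes the backtick and the
# backslash-newline are literal, so this holds the verbatim text.
expected='abc ` def
ghi \
jkl'

# The construct under test: a quoted-delimiter here-document inside $(...).
actual=$(cat <<'EOT'
abc ` def
ghi \
jkl
EOT
)

# "verdict" is an illustrative name: "verbatim" means the body survived
# unchanged, "altered" means the shell modified it in some way.
if [ "$actual" = "$expected" ]; then
    verdict=verbatim
else
    verdict=altered
fi
printf 'here-document inside $(...): %s\n' "$verdict"
```

Running the same file under each shell of interest gives one line per shell to compare.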

In the original question, I said this was a bug in Bash's parser. Is it? [Update: yes] The relevant text from POSIX (all from the Shell Command Language definition) that I can find is:

  • §2.6.3 Command Substitution:

    With the $(command) form, all characters following the open parenthesis to the matching closing parenthesis constitute the command. Any valid shell script can be used for command, except a script consisting solely of redirections which produces unspecified results.

  • §2.7.4 Here-Document:

    If any part of word is quoted, the delimiter shall be formed by performing quote removal on word, and the here-document lines shall not be expanded.

  • §2.2.1 Escape Character (Backslash):

    If a <newline> follows the <backslash>, the shell shall interpret this as line continuation. The <backslash> and <newline> shall be removed before splitting the input into tokens.

  • §2.3 Token Recognition:

    When an io_here token has been recognized by the grammar (see Shell Grammar), one or more of the subsequent lines immediately following the next NEWLINE token form the body of one or more here-documents and shall be parsed according to the rules of Here-Document.

    When it is not processing an io_here, the shell shall break its input into tokens by applying the first applicable rule below to the next character in its input. ...

    ...

    4. If the current character is <backslash>, single-quote, or double-quote and it is not quoted, it shall affect quoting for subsequent characters up to the end of the quoted text. The rules for quoting are as described in Quoting. During token recognition no substitutions shall be actually performed, and the result token shall contain exactly the characters that appear in the input (except for <newline> joining), unmodified, including any embedded or enclosing quotes or substitution operators, between the <quotation-mark> and the end of the quoted text.

My interpretation of this is that all characters after $( until the terminating ) comprise the shell script, verbatim; a here document appears, so here-document processing occurs instead of ordinary tokenisation; the here document has a quoted delimiter, meaning that its contents are processed verbatim; and the escape character never comes into it. I can see an argument, however, that this case is simply not addressed, and that both behaviours are permissible. It's possible that I've skipped over some relevant text somewhere, too.


  • Is this situation made clearer elsewhere?
  • What should a portable script be able to rely on (in theory)?
  • Is the specific treatment given by any of these shells (Bash 3.2/Bash 4.3/everyone else) mandated by the standard? Forbidden? Permitted?
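
Whatever the standard turns out to require, a script could sidestep the question by keeping the here document out of the command substitution. The sketch below does that with a function; print_doc is a hypothetical helper name. It relies on the observation above that the document is output verbatim when it is not inside $(...), and it is an assumption, not something tested here, that the affected Bash versions treat a function body the same way.

```shell
#!/bin/sh
# print_doc is a hypothetical helper name. Its here-document is read when
# the function is defined, outside any $(...).
print_doc() {
cat <<'EOT'
abc ` def
ghi \
jkl
EOT
}

# Only the function call sits inside the command substitution.
x=$(print_doc)
printf '%s\n' "$x"
```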
Paulo Tomé
Michael Homer
  • Can you show us how you produce your output in the second case? – Julie Pelletier Jan 29 '17 at 06:07
  • @JuliePelletier echo "$x", but any way of inspecting the variable works. I've edited that line into the bottom. – Michael Homer Jan 29 '17 at 06:13
  • FWIW: pdksh behaves like bash 4.4 and 4.3. – Kusalananda Jan 29 '17 at 08:17
  • Have you reported this as a bug against bash yet? I agree it's a bug in bash, and I can reproduce it both in 4.4 and latest devel version. – geirha Jan 30 '17 at 08:33
  • @geirha It's not clear (yet) that it is a bug. If we find something definitive I will file it with whichever shells turn out to need it. – Michael Homer Jan 30 '17 at 08:38
  • @MichaelHomer, well even if we don't consider it a bug, just different, valid interpretations of a vague standard, Chet will nonetheless want the behaviours to match the other POSIX shells. mksh is yet another shell that does the sane thing btw (not interpreting the \). – geirha Jan 30 '17 at 10:59
  • Fair enough. I'll give it a couple of days first. If nothing else we could ask for a formal interpretation. – Michael Homer Jan 31 '17 at 01:01
  • Looks like it's an easy fix. This patch seems to work at least: ignore_quoted_newline_in_quoted_heredoc.patch – geirha Jan 31 '17 at 08:04
  • I think you are interpreting this correctly and imo the standard is pretty clear since "The shell shall expand the command substitution by executing command in a subshell environment [...] and replacing the command substitution [...] with the standard output of the command [...]" So it runs the command in a subshell and replaces $(...) with whatever that output is... Now, when running the command in your example in a subshell (in bash) it does output the expected result. It's only when turning it into command substitution that it collapses "ghi" and "jkl". So this is a bug imo – don_crissti Feb 03 '17 at 00:30
  • Maybe bash has it correct, and the others not. §2.3, numbered lower so higher priority (right?), says "result token shall contain exactly the characters that appear in the input (except for <newline> joining)". Is bash not doing the newline joining in what it's sending to the command substitution? Again, command substitution has a lower number than the Here-Document rules. I just love ambiguities :/ – Chindraba Feb 06 '17 at 07:43
  • I have not heard of this lower-numbering rule previously and it seems somewhat at odds with the organisational system. – Michael Homer Feb 06 '17 at 07:50
  • @MichaelHomer, mostly speculation on my part. A set of rules can be set up where the first rule encountered is used, and later rules handle what hasn't been handled yet. OTOH, as rules are encountered they modify the results of prior rules. Sort of like the difference between OR and AND. Just a different way of considering how the rules might be interpreted by the developers when implementing them. – Chindraba Feb 06 '17 at 22:51
  • I give it two more days, then I report it as a bash bug if you haven't by then. – geirha Feb 10 '17 at 19:47
  • @geirha I reported a Bash bug; I'm not going to bother about pdksh since it doesn't seem to have even a shadow of current maintenance. – Michael Homer Feb 11 '17 at 06:23
  • mksh is now widely considered pdksh's future (OpenBSD ksh being another one, but the MirBSD ksh author also maintains the Debian package bringing it to a much wider audience and mksh and oksh generally feed each other). Good catch for the bash bug btw. – Stéphane Chazelas Feb 24 '17 at 21:28

1 Answer

This was asked on Bash's mailing list, and the maintainer confirmed it was a bug.

They also mentioned that the text in POSIX "is not necessarily ambiguous, but it does require close reading", so I asked for clarification on that. Their answer, including a description of the issue and an interpretation of the standard, was as follows:

The command substitution is a red herring; it's relevant only in that it pointed out where the bug was.

The delimiter to the here-document is quoted, so the lines are not expanded. In this case, the shell reads lines from the input as if they were quoted. If a backslash appears in a context where it is quoted, it does not act as an escape character (see below), and the special handling of backslash-newline does not take place. In fact, if any part of the delimiter is quoted, the here-document lines are read as if single-quoted.

The text in Posix 2.2.1 is written awkwardly, but means that the backslash is only treated specially when it's not quoted. You can quote a backslash and inhibit all expansion only with single quotes or another backslash.

The close reading part is the "not expanded" text implying the single quotes. The standard says in 2.2 that here documents are "another form of quoting," but the only form of quoting in which words are not expanded at all is single quotes. So it's a form of quoting that is just about exactly like single quotes, but not single quotes.
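
The distinction described above can be seen in a small sketch. The function names unquoted and quoted and the variable names are illustrative only. The here documents are placed in functions so that only the calls appear inside $(...), which keeps the demonstration clear of the bug under discussion.

```shell
#!/bin/sh
# Unquoted delimiter: the body is expanded, and backslash-newline acts as
# a line continuation.
unquoted() {
cat <<EOT
ghi \
jkl
EOT
}

# Quoted delimiter: the body is read as if single-quoted, so the backslash
# and the newline after it are kept.
quoted() {
cat <<'EOT'
ghi \
jkl
EOT
}

# The same two lines written as an ordinary single-quoted string.
single='ghi \
jkl'

u=$(unquoted)
q=$(quoted)

printf 'unquoted delimiter: %s\n' "$u"
printf 'quoted delimiter:   %s\n' "$q"
if [ "$q" = "$single" ]; then
    printf '%s\n' 'quoted-delimiter body matches the single-quoted string'
fi
```

With the unquoted delimiter the two lines are joined into "ghi jkl"; with the quoted delimiter the result is the same as the single-quoted string.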

ilkkachu
Kevin