shell: keep trailing newlines ('\n') in command substitution

Question

I want to be able to capture the exact output of a command substitution, including the trailing newline characters.

I realise that they are stripped by default, so some manipulation may be required to keep them, and I want to keep the original exit code.

For example, given a command with a variable number of trailing newlines and exit code:

f(){ for i in $(seq "$((RANDOM % 3))"); do echo; done; return $((RANDOM % 256));}
export -f f

I want to run something like:

exact_output f

And have the output be:

Output: $'\n\n'
Exit: 5

I'm interested in both bash and POSIX sh.

Newline is part of $IFS, so it will not be captured as an argument. — Deathgrip, Aug 01 '17 at 16:02
@Deathgrip It has nothing to do with IFS (try ( IFS=:; subst=$(printf 'x\n\n\n'); printf '%s' "$subst" )). Only newlines get stripped. \t and space do not, and IFS doesn't affect it. — Petr Skocik, Aug 01 '17 at 17:02
Related: How can I work with binary in bash, to copy bytes verbatim without any conversion? — Stéphane Chazelas, Aug 02 '17 at 11:10
See also: tcsh preserve newlines in command substitution `...` for tcsh — Stéphane Chazelas, Aug 02 '17 at 11:11
Similar: How can I set an environment variable which contains newline characters? — Stéphane Chazelas, Aug 02 '17 at 11:13

Stéphane Chazelas · Accepted Answer · 2023-06-20T07:25:33.260

POSIX shells

The usual trick to get the complete stdout of a command is to do:

output=$(cmd; ret=$?; echo .; exit "$ret")
ret=$?
output=${output%.}

The idea is to add an extra .\n. Command substitution will only strip that \n. And you strip the . with ${output%.}.
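As a minimal sketch of the trick above wrapped into a reusable function: the names capture and sample are hypothetical and not part of the original answer, and the output is assumed to contain no NUL bytes.

```shell
#!/bin/sh
# capture: hypothetical helper name, not from the original answer.
# Runs "$@", stores its exact stdout in $output, and returns its exit status.
capture() {
    output=$("$@"; ret=$?; echo .; exit "$ret")
    ret=$?
    output=${output%.}   # drop the sentinel; the newlines before it survive
    return "$ret"
}

# sample: hypothetical command that prints two trailing newlines and exits with 5.
sample() { printf 'a\n\n'; return 5; }

capture sample
status=$?
printf '%s' "$output" | od -c    # dump the bytes so the trailing newlines are visible
echo "exit status: $status"
```

The exit status is saved immediately after the assignment because the parameter expansion that strips the sentinel would otherwise overwrite $?.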

Note that in shells other than zsh, that will still not work if the output has NUL bytes. With yash, that won't work if the output is not text.

Also note that in some locales, it matters what character you insert at the end. . should generally be fine (see below), but some others might not be. For instance x (as used in some other answers) or @ would not work in a locale using the BIG5, GB18030 or BIG5HKSCS charsets. In those charsets, the encoding of a number of characters ends in the same byte as the encoding of x or @ (0x78, 0x40).

For instance, ū in BIG5HKSCS is 0x88 0x78 (and x is 0x78 like in ASCII, all charsets on a system must have the same encoding for all the characters of the portable character set which includes English letters, @ and .). So if cmd was printf '\x88' (which by itself is not a valid character in that encoding, but just a byte-sequence) and we inserted x after it, ${output%x} would fail to strip that x as $output would actually contain ū (the two bytes making up a byte sequence that is a valid character in that encoding).

Using . or / should be generally fine, as POSIX requires:

“The encoded values associated with <period>, <slash>, <newline>, and <carriage-return> shall be invariant across all locales supported by the implementation.”, which means that these will have the same binary representation in any locale/encoding.
“Likewise, the byte values used to encode <period>, <slash>, <newline>, and <carriage-return> shall not occur as part of any other character in any locale.”, which means that the above cannot happen, as no partial byte sequence could be completed by these bytes/characters to a valid character in any locale/encoding. (see 6.1 Portable Character Set)

The above does not apply to other characters of the Portable Character Set.

Another approach, as discussed by @Isaac, would be to change the locale to C (which would also guarantee that any single byte can be correctly stripped), only for the stripping of the last character (${output%.}). It would typically be necessary to use LC_ALL for that (in principle LC_CTYPE would be enough, but that would be overridden by any LC_ALL that is already set). It would also be necessary to restore the original value afterwards (or, e.g., to use the non-POSIX local inside a function). Beware, though, that some shells don't support changing the locale while running (even though POSIX requires it).
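A rough sketch of that save-and-restore dance in plain POSIX sh, assuming a shell that honours locale changes at run time. The names strip_sentinel, stripped, saved_lc_all and lc_all_was_set are made up for illustration.

```shell
#!/bin/sh
# strip_sentinel: hypothetical helper. Removes one trailing "." from $1 while
# the C locale is in effect, leaves the result in $stripped, and then puts
# LC_ALL back exactly as it was (set, empty, or unset).
strip_sentinel() {
    if [ "${LC_ALL+set}" ]; then
        saved_lc_all=$LC_ALL
        lc_all_was_set=1
    else
        lc_all_was_set=0
    fi
    LC_ALL=C
    stripped=${1%.}
    if [ "$lc_all_was_set" -eq 1 ]; then
        LC_ALL=$saved_lc_all
    else
        unset LC_ALL
    fi
}

output=$(printf 'a\n\n'; echo .)
strip_sentinel "$output"
printf '%s' "$stripped" | od -c
```

The stripping cannot be done in a subshell, since the modified variable would be lost, which is why the locale has to be changed and restored in the current shell.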

By using . or /, all that can be avoided.

bash/zsh alternatives

With bash and zsh, assuming the output has no NULs, you can also do:

IFS= read -rd '' output < <(cmd)

To get the exit status of cmd, you can do wait "$!"; ret=$? in some versions of bash, but not in zsh. In zsh, though, you can write it as cmd | IFS= read -rd '' output and get the exit status from $pipestatus[1].
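Put together for bash, this might look like the sketch below. It assumes a bash version in which $! refers to the process substitution (which, as noted, is not true of every version) and output free of NUL bytes; sample is a hypothetical stand-in for cmd.

```shell
#!/bin/bash
# sample: hypothetical command with two trailing newlines and exit status 5.
sample() { printf 'a\n\n'; return 5; }

# read reaches end of input without finding a NUL delimiter, so it returns
# non-zero even though $output has been filled in; that status is ignored.
IFS= read -rd '' output < <(sample) || true

# Assumption: this bash sets $! to the PID of the process substitution.
wait "$!"
ret=$?

printf '%q\n' "$output"
echo "exit status: $ret"
```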

rc/es/akanga

For completeness, note that rc/es/akanga have an operator for that. In them, command substitution, expressed as `cmd (or `{cmd} for more complex commands) returns a list (by splitting on $ifs, space-tab-newline by default). In those shells (as opposed to Bourne-like shells), the stripping of newline is only done as part of that $ifs splitting. So you can either empty $ifs or use the ``(seps){cmd} form where you specify the separators:

ifs = ''; output = `cmd

or:

output = ``()cmd

In any case, the exit status of the command is lost. You'd need to embed it in the output and extract it afterwards which would become ugly.

fish

In fish, command substitution is with (cmd) and doesn't involve a subshell.

set var (cmd)

Creates a $var array with all the lines in the output of cmd if $IFS is non-empty, or with the output of cmd stripped of up to one (as opposed to all in most other shells) newline character if $IFS is empty.

So there's still an issue in that (printf 'a\nb') and (printf 'a\nb\n') expand to the same thing even with an empty $IFS.

To work around that, the best I could come up with was:

function exact_output
  set -l IFS . # non-empty IFS
  set -l ret
  set -l lines (
    cmd
    set ret $status
    echo
  )
  set -g output ''
  set -l line
  test (count $lines) -le 1; or for line in $lines[1..-2]
    set output $output$line\n
  end
  set output $output$lines[-1]
  return $ret
end

Since version 3.4.0 (released in March 2022), you can do instead:

set output (cmd | string collect --allow-empty --no-trim-newlines)

With older versions, you could do:

read -z output < (begin; cmd; set ret $status; end | psub)

With the caveat that $output is an empty list instead of a list with one empty element if there's no output.

Version 3.4.0 also added support for $(...) which behaves like (...) except that it can also be used inside double quotes in which case it behaves like in the POSIX shell: the output is not split on lines but all trailing newline characters are removed.

Bourne shell

The Bourne shell did not support the $(...) form nor the ${var%pattern} operator, so it can be quite hard to achieve there. One approach is to use eval and quoting:

eval "
  output='`
    exec 4>&1
    ret=\`
      exec 3>&1 >&4 4>&-
      (cmd 3>&-; echo \"\$?\" >&3; printf \"'\") |
        awk 3>&- -v RS=\\\\' -v ORS= -v b='\\\\\\\\' '
          NR > 1 {print RS b RS RS}; {print}; END {print RS}'
    \`
    echo \";ret=\$ret\"
  `"

Here, we're generating a

output='output of cmd
with the single quotes escaped as '\''
';ret=X

to be passed to eval. As for the POSIX approach, if ' was one of those characters whose encoding can be found at the end of other characters, we'd have a problem (a much worse one as it would become a command injection vulnerability), but thankfully, like ., it's not one of those, and that quoting technique is generally the one that is used by anything that quotes shell code. Note that \ does have that issue, so it shouldn't be used for quoting (which also rules out "...", inside which you need backslashes to escape some characters). Here, we only use it after a ', which is OK.
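The quoting step on its own can be sketched in POSIX sh as follows. This uses sed rather than the awk pipeline of the answer and does not deal with the exit status; sh_quote, value and copy are hypothetical names.

```shell
#!/bin/sh
# sh_quote: hypothetical helper. Prints $1 wrapped in single quotes, with each
# embedded single quote rewritten as '\'' so that the result is safe to eval.
sh_quote() {
    printf '%s\n' "$1" | sed "s/'/'\\\\''/g; 1s/^/'/; \$s/\$/'/"
}

nl='
'
value="it's here$nl$nl"

# The closing quote comes after the trailing newlines, so command substitution
# only strips the single newline that printf emitted after that quote.
eval "copy=$(sh_quote "$value")"

[ "$copy" = "$value" ] && echo "round trip preserved trailing newlines"
```

The closing single quote plays the same role as the . sentinel in the POSIX approach: it is the last non-newline byte, so everything before it survives the substitution.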

tcsh

See tcsh preserve newlines in command substitution `...`

(not taking care of the exit status, which you could address by saving it in a temporary file (echo $status > $tempfile:q after the command))

Thanks - and especially for the clue on the different charsets.
If zsh can store NUL in a variable, why wouldn't IFS= read -rd '' output < <(cmd) work? It needs to be able to store a string's length... does it encode '' as a 1-byte string of \0 rather than a 0-byte string? — Tom Hale, Aug 02 '17 at 11:19
@TomHale, yes, read -d '' is treated as read -d $'\0' (in bash as well though there $'\0' is the same as '' everywhere). — Stéphane Chazelas, Aug 02 '17 at 11:26
Where can "rc/es/akanga" be found? Google didn't turn up much when I searched for "akanga". Are they the same shell? — Melab, Aug 28 '20 at 03:18
@Melab, rc is the shell of research Unix v10 and plan9. These days, you can find it in ports of plan9 utilities (now opensourced) such as 9base, though the most popular implementation on Unix-like systems is the clone by Byron Rakitzis and from which es and akanga are derived. See the wikipedia entry. — Stéphane Chazelas, Aug 29 '20 at 05:47
Instead of using a specific character as the guard, wouldn't it make more sense to use a specific byte sequence? For instance, output="$(cmd; ret="${?}"; printf '\056'; exit "${ret}")"; output="${output%"$(printf '\056')"}" should be safe in any locale and not require locale changes in the middle of the script, which doesn't seem to be portable. — Ionic, Apr 06 '21 at 08:29
@Ionic, that wouldn't help at all. printf '\56' is the same as printf . on ASCII based systems regardless of the locale and if that byte, combined with the last byte(s) of the output was forming a valid character in the locale's charset, ${output%"$(printf '\56')"} would (in some shells) fail to remove it the same way ${output%.} would. On the other hand, on some hypothetical non-ASCII based systems, \56 might be the encoding of newline for instance which would defeat the purpose. — Stéphane Chazelas, Apr 06 '21 at 09:34
It would help in the sense that specific characters couldn't suddenly be interpreted as multi-byte sequences in other character sets. Also, we don't care for the interpretation of the byte value - like you explained correctly. The only thing we care about is adding and removing the same byte, no matter what it is (minus NUL, which is often problematic). It will not help regarding the issue of the shell interpreting single bytes as part of a multi-byte character sequence, true, but I might have something in mind which might. Need to test around first, before I spill it, though. — Ionic, Apr 07 '21 at 13:35
Turns out that my idea is unusable. Originally, I was thinking about reading the variable line-by-line and using a different tool to delete the added character, exploiting the fact that LC_ALL='C' program ... to run program in "byte mode" will be supported regardless of the shell. However, pipes spawn subshells, so you won't be able to get the modified data up to the parent process, where it's needed. This could be worked around by command substitution, but then we'd be back at square one... — Ionic, Apr 13 '21 at 06:00
Use echo . and do not go by the (ASCII) encoding of the period. POSIX requires the period (and slash and maybe other characters) to be encoded identically across all supported locales and them to not be part of any multibyte encoding. (This is for . and / in pathnames.) — mirabilos, Jan 25 '22 at 06:33

score 5 · Answer 2 · edited Jan 26 '22 at 22:14

For the new question, this script works:

#!/bin/bash
f()           { for i in $(seq "$((RANDOM % 3 ))"); do
                    echo;
                done; return $((RANDOM % 256));
              }
exact_output(){ out=$( $1; ret=$?; echo x; exit "$ret" ); ret=$?   # save the status now; later commands overwrite $?
                unset OldLC_ALL ; [ "${LC_ALL+set}" ] && OldLC_ALL=$LC_ALL
                LC_ALL=C ; out=${out%x};
                unset LC_ALL ; [ "${OldLC_ALL+set}" ] && LC_ALL=$OldLC_ALL
                 printf 'Output:%10q\nExit :%2s\n' "${out}" "$ret"
               }
exact_output f
echo Done

On execution:

Output:$'\n\n\n'
Exit :25
Done

The longer description

The usual wisdom for POSIX shells to deal with the removal of \n is:

add an x

s=$(printf "%s" "${1}x"); s=${s%?}

That is required because the trailing newlines are removed by command substitution, per the POSIX specification:

removing sequences of one or more <newline> characters at the end of the substitution.

About a trailing `x`.

It has been said in this question that an x could be confused with the trailing byte of some character in some encoding. But guessing which character is safe in every language and every possible encoding is a difficult proposition, to say the least.

However, that is simply incorrect.

The only rule that we need to follow is to add exactly what we remove.

It should be easy to understand that if we add something to an existing string (or byte sequence) and later we remove exactly the same something, the original string (or byte sequence) must be the same.

Where do we go wrong? When we mix characters and bytes.

If we add a byte, we must remove a byte, if we add a character we must remove the exact same character.

The second option, adding a character (and later removing the exact same character) may become convoluted and complex, and, yes, code pages and encodings may get in the way.

However, the first option is quite possible, and, after explaining it, it will become plain simple.

Let's add a byte, an ASCII byte (<127), and to keep things as simple as possible, let's say an ASCII character in the range a-z or, more precisely, a byte in the hex range 0x61 - 0x7a. Let's choose any of those, maybe an x (really a byte of value 0x78). We can add such a byte by concatenating an x to a string (let's assume an é):

$ a=é
$ b=${a}x

If we look at the string as a sequence of bytes, we see:

$ printf '%s' "$b" | od -vAn -tx1c
  c3  a9  78
 303 251   x

A string that ends in an x.

If we remove that x (byte value 0x78), we get:

$ printf '%s' "${b%x}" | od -vAn -tx1c
  c3  a9
 303 251

It works without a problem.

A little more difficult example.

Let's say that the string we are interested in ends in byte 0xc3:

$ a=$'\x61\x20\x74\x65\x73\x74\x20\x73\x74\x72\x69\x6e\x67\x20\xc3'

And let's add a byte of value 0xa9:

$ b=$a$'\xa9'

The string has become this now:

$ echo "$b"
a test string é

Exactly what I wanted: the last two bytes form one character in UTF-8 (so anyone can reproduce these results in a UTF-8 console).

If we remove a character, the original string will be changed. But that is not what we added; we added a byte value (0xa9 here), and a byte is what we must remove.

What we need is to avoid misinterpreting bytes as characters: an action that removes the byte we added, 0xa9. In fact, ash, bash, lksh and mksh all seem to do exactly that:

$ c=$'\xa9'
$ echo ${b%$c} | od -vAn -tx1c
 61  20  74  65  73  74  20  73  74  72  69  6e  67  20  c3  0a
  a       t   e   s   t       s   t   r   i   n   g     303  \n

But not ksh or zsh.

However, that is very easy to solve: let's tell all those shells to do byte removal:

$ LC_ALL=C; echo ${b%$c} | od -vAn -tx1c

That's it: all shells tested work (except yash). Only the last part of the string is shown:

ash             :    s   t   r   i   n   g     303  \n
dash            :    s   t   r   i   n   g     303  \n
zsh/sh          :    s   t   r   i   n   g     303  \n
b203sh          :    s   t   r   i   n   g     303  \n
b204sh          :    s   t   r   i   n   g     303  \n
b205sh          :    s   t   r   i   n   g     303  \n
b30sh           :    s   t   r   i   n   g     303  \n
b32sh           :    s   t   r   i   n   g     303  \n
b41sh           :    s   t   r   i   n   g     303  \n
b42sh           :    s   t   r   i   n   g     303  \n
b43sh           :    s   t   r   i   n   g     303  \n
b44sh           :    s   t   r   i   n   g     303  \n
lksh            :    s   t   r   i   n   g     303  \n
mksh            :    s   t   r   i   n   g     303  \n
ksh93           :    s   t   r   i   n   g     303  \n
attsh           :    s   t   r   i   n   g     303  \n
zsh/ksh         :    s   t   r   i   n   g     303  \n
zsh             :    s   t   r   i   n   g     303  \n

It is just that simple: tell the shell to remove a LC_ALL=C character, which is exactly one byte for all byte values from 0x00 to 0xff.

Beware that some shells don't support changing the locale at run time (even though this is required by POSIX).

Solution that should generally work without changing the locale

While the above should work with any (except newline or null) byte as sentinel value, it can be made easier, without changing the locale:

Using . or / should be generally fine, as POSIX requires:

“The encoded values associated with <period>, <slash>, <newline>, and <carriage-return> shall be invariant across all locales supported by the implementation.”, which means that these will have the same binary representation in any locale/encoding.
“Likewise, the byte values used to encode <period>, <slash>, <newline>, and <carriage-return> shall not occur as part of any other character in any locale.”, which means that the above cannot happen, as no partial byte sequence could be completed by these bytes/characters to a valid character in any locale/encoding. (see 6.1 Portable Character Set)

The above does not apply to other characters of the Portable Character Set.

Solution for comments:

For the example discussed in the comments, one possible solution (which fails in zsh) is:

#!/bin/bash
LC_ALL=zh_HK.big5hkscs
a=$(printf '\210\170');
b=$(printf '\170');
unset OldLC_ALL ; [ "${LC_ALL+set}" ] && OldLC_ALL=$LC_ALL
LC_ALL=C ; a=${a%"$b"};
unset LC_ALL ; [ "${OldLC_ALL+set}" ] && LC_ALL=$OldLC_ALL
printf '%s' "$a" | od -vAn -c

That will remove the problem of encoding.

zsh added printf -v for compatibility with bash in December 2015 — Stéphane Chazelas, Aug 02 '17 at 10:46
I agree that fixing the locale to C to make sure ${var%?} always strips one byte is more correct in theory, but: 1- LC_ALL and LC_CTYPE override $LANG, so you'd need to set LC_ALL=C 2- you can't do the var=${var%?} in a subshell as the change would be lost, so you'd need to save and restore the value and state of LC_ALL (or resort to non-POSIX local scope features) 3- changing the locale midway through the script is not fully supported in some shells like yash. On the other end, in practice . is never a problem in real-life charsets, so using it avoids mingling with LC_ALL. — Stéphane Chazelas, Aug 02 '17 at 14:48
Note that the only multibyte characters that mksh supports are UTF-8 encoded ones and only when the utf8-mode option is set. You'll notice that a='é' b=$'\xa9' mksh -o utf8-mode -c 'echo "${a%"$b"}"' doesn't break that é character apart like in ksh93 or zsh and without UTF-8 mode a='éé' mksh -c 'echo "${a%?}"' does break it. — Stéphane Chazelas, Sep 09 '20 at 16:15
In which way does your last script "fail in zsh"? It produces the same result for me. It fails with yash though which doesn't support changing locales midway through the script, nor non-text in its variables or command arguments. — Stéphane Chazelas, Sep 09 '20 at 16:23
Just add a period or slash then, POSIX requires they are encoded the same across all locales. — mirabilos, Jan 25 '22 at 06:14

Petr Skocik · Answer 3 · 2017-08-02T10:33:11.880

2

You can output a character after the normal output and then strip it:

#capture the output of "$@" (arguments run as a command)
#into the exact_output variable
exact_output() 
{
    exact_output=$( "$@" && printf X ) && 
    exact_output=${exact_output%X}
}

This is a POSIX compliant solution.

edited Aug 02 '17 at 10:33

answered Aug 01 '17 at 16:43

Based on the responses, I see my question was unclear. I just updated it. – Tom Hale Aug 02 '17 at 10:17

score 0 · Answer 4 · edited Jan 25 '22 at 02:26

Here's a bash function that encapsulates the LC_ALL=C technique described by @Isaac.

# This function provides a general solution to the problem of preserving
# trailing newlines in a command substitution.
#
#    cmdsub <command goes here>
#
# If the command succeeded, the result will be found in variable CMDSUB_RESULT.
cmdsub() {
  local -r BYTE=$'\x78'
  local result
  if result=$("$@"; ret=$?; echo "$BYTE"; exit "$ret"); then
    local LC_ALL=C
    CMDSUB_RESULT=${result%"$BYTE"}
  else
    return "$?"
  fi
}

Notes:

$'\x78' was chosen for the dummy byte in order to test the corner case discussed in this Q&A discussion, but any byte could have been used except newline (0x0A) and NUL (0x00).
Encapsulating it within a function had the added benefit that we could make LC_ALL a local variable, thus avoiding the need to save and restore its value.
I considered using bash 4.3's nameref feature to allow the caller to supply the name of the variable into which the result should be stored, but decided it would be better to support older bash.
In principle, setting LC_CTYPE should be enough; however, if LC_ALL were already set “externally”, it would override the former.

Successfully tested the BIG5HKSCS corner case using bash 4.1:

#!/bin/bash
LC_ALL=zh_HK.big5hkscs
cmdsub() {
  local -r BYTE=$'\x78'
  local result
  if result=$("$@"; ret=$?; echo "$BYTE"; exit "$ret"); then
    local LC_ALL=C
    CMDSUB_RESULT=${result%"$BYTE"}
  else
    return "$?"
  fi
}
cmd() { echo -n $'\x88'; }
if cmdsub cmd; then
  v=$CMDSUB_RESULT
  printf '%s' "$v" | od -An -tx1
else
  printf "The command substitution had a non-zero status code of %s\n" "$?"
fi

Result was 88 as expected.

I received a suggested edit: "use the exit status that was set in the $(…) ... and in order not to change the function layout to much, just store it in a variable instead of returning it." Thank you, but I like the idea of returning the exit status of the command substitution so that I can use the function in an if statement, as shown in the example at the bottom. — Robin A. Meade, Jan 25 '22 at 19:27
