For the new question, this script works:
#!/bin/bash
f() { for i in $(seq "$((RANDOM % 3 ))"); do
echo;
done; return $((RANDOM % 256));
}
exact_output(){ out=$( $1; ret=$?; echo x; exit "$ret" ); ret=$?  # save the status before later commands overwrite $?
unset OldLC_ALL ; [ "${LC_ALL+set}" ] && OldLC_ALL=$LC_ALL
LC_ALL=C ; out=${out%x};
unset LC_ALL ; [ "${OldLC_ALL+set}" ] && LC_ALL=$OldLC_ALL
printf 'Output:%10q\nExit :%2s\n' "${out}" "$ret"
}
exact_output f
echo Done
On execution:
Output:$'\n\n\n'
Exit :25
Done
The longer description
The usual wisdom for POSIX shells to deal with the removal of trailing \n is to add an x:
s=$(printf "%s" "${1}x"); s=${s%?}
That is required because the trailing newline(s) are removed by command substitution, per the POSIX specification:
removing sequences of one or more <newline> characters at the end of the substitution.
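A minimal sketch of that stripping and of the sentinel workaround, assuming any POSIX-like sh. The variable names plain and exact are only illustrative:

```shell
#!/bin/sh
# Plain command substitution: every trailing newline is dropped.
plain=$(printf 'a\n\n\n')

# Sentinel version: the x is the last byte of the output, so the
# newlines before it are no longer "trailing" and survive the capture.
exact=$(printf 'a\n\n\n'; printf x)
# Remove exactly what was added.
exact=${exact%x}

# Dump both values so the difference is visible.
printf '%s' "$plain" | od -An -c
printf '%s' "$exact" | od -An -c
```

The first dump should show only the a, while the second should also show the three newlines.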
About a trailing x.
It has been said in this question that an x could be confused with the trailing byte of some character in some encoding. But guessing which character would be a better choice for some language in some possible encoding is a difficult proposition, to say the least.
However, that concern is simply incorrect.
The only rule that we need to follow is to add exactly what we remove.
It should be easy to understand that if we add something to an existing string (or byte sequence) and later we remove exactly the same something, the original string (or byte sequence) must be the same.
Where do we go wrong? When we mix characters and bytes.
If we add a byte, we must remove a byte; if we add a character, we must remove the exact same character.
The second option, adding a character (and later removing the exact same character), may become convoluted and complex, and, yes, code pages and encodings may get in the way.
However, the first option is quite possible and, once explained, becomes quite simple.
Let's add a byte, an ASCII byte (value below 128), and to keep things as simple as possible, let's say an ASCII character in the range a-z. Or, as we should be saying it, a byte in the hex range 0x61 - 0x7a. Let's choose any of those, maybe an x (really a byte of value 0x78). We can add such a byte by concatenating an x to a string (let's assume an é):
$ a=é
$ b=${a}x
If we look at the string as a sequence of bytes, we see:
$ printf '%s' "$b" | od -vAn -tx1c
c3 a9 78
303 251 x
A byte sequence that ends in an x.
If we remove that x (byte value 0x78), we get:
$ printf '%s' "${b%x}" | od -vAn -tx1c
c3 a9
303 251
It works without a problem.
A slightly more difficult example.
Let's say that the string we are interested in ends in byte 0xc3:
$ a=$'\x61\x20\x74\x65\x73\x74\x20\x73\x74\x72\x69\x6e\x67\x20\xc3'
And let's add a byte of value 0xa9:
$ b=$a$'\xa9'
The string has become this now:
$ echo "$b"
a test string é
Exactly what I wanted: the last two bytes form one character in UTF-8 (so anyone can reproduce these results in a UTF-8 console).
If we remove a character, the original string will be changed. But a character is not what we added; we added a byte value, which may happen to be written as a character (like the x earlier), but is a byte anyway.
What we need is to avoid misinterpreting bytes as characters, that is, an action that removes the byte we used, 0xa9. In fact, ash, bash, lksh and mksh all seem to do exactly that:
$ c=$'\xa9'
$ echo ${b%$c} | od -vAn -tx1c
61 20 74 65 73 74 20 73 74 72 69 6e 67 20 c3 0a
a t e s t s t r i n g 303 \n
But not ksh or zsh.
However, that is very easy to solve: let's tell all those shells to do byte removal:
$ LC_ALL=C; echo ${b%$c} | od -vAn -tx1c
That's it: all shells tested work (except yash). Only the last part of the string is shown:
ash : s t r i n g 303 \n
dash : s t r i n g 303 \n
zsh/sh : s t r i n g 303 \n
b203sh : s t r i n g 303 \n
b204sh : s t r i n g 303 \n
b205sh : s t r i n g 303 \n
b30sh : s t r i n g 303 \n
b32sh : s t r i n g 303 \n
b41sh : s t r i n g 303 \n
b42sh : s t r i n g 303 \n
b43sh : s t r i n g 303 \n
b44sh : s t r i n g 303 \n
lksh : s t r i n g 303 \n
mksh : s t r i n g 303 \n
ksh93 : s t r i n g 303 \n
attsh : s t r i n g 303 \n
zsh/ksh : s t r i n g 303 \n
zsh : s t r i n g 303 \n
It is just that simple: tell the shell to remove an LC_ALL=C character, which is exactly one byte for all byte values from 0x00 to 0xff.
Beware that some shells don't support changing the locale at runtime (even though POSIX requires it).
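The save, switch and restore steps can be packaged in a small helper. This is only a sketch: the names strip_sentinel, stripped and old_lc_all are hypothetical, and the restore step is a slight variation that assigns the old value back instead of unsetting first, on the assumption that this keeps any export attribute LC_ALL had:

```shell
#!/bin/sh
# strip_sentinel VALUE SENTINEL
# Stores VALUE minus a trailing SENTINEL in the global variable "stripped".
# The result is not printed, because capturing it with $(...) would
# strip trailing newlines again.
strip_sentinel() {
    # Remember whether LC_ALL was set, and to what.
    unset old_lc_all
    if [ "${LC_ALL+set}" ]; then old_lc_all=$LC_ALL; fi

    # In the C locale every byte is one character, so the suffix
    # removal below works on bytes.
    LC_ALL=C
    stripped=${1%"$2"}

    # Put LC_ALL back exactly as it was (set or unset).
    if [ "${old_lc_all+set}" ]; then
        LC_ALL=$old_lc_all
    else
        unset LC_ALL
    fi
}

# Usage: append the byte 0xa9 to a string ending in 0xc3, then remove it.
b=$(printf 'a test string \303\251')
c=$(printf '\251')
strip_sentinel "$b" "$c"
printf '%s' "$stripped" | od -An -tx1
```

In shells that cannot change the locale at runtime this helper has no advantage over the plain expansion.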
Solution that should generally work without changing the locale
While the above should work with any byte (except newline or null) as the sentinel value, it can be made easier, without changing the locale:
Using . or / should generally be fine, as POSIX requires:
- “The encoded values associated with <period>, <slash>, <newline>, and <carriage-return> shall be invariant across all locales supported by the implementation.”, which means that these have the same binary representation in any locale/encoding.
- “Likewise, the byte values used to encode <period>, <slash>, <newline>, and <carriage-return> shall not occur as part of any other character in any locale.”, which means that the confusion described above cannot happen, as no partial byte sequence can be completed by these bytes/characters into a valid character in any locale/encoding.
(see 6.1 Portable Character Set)
The above does not apply to other characters of the Portable Character Set.
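A sketch of how the . sentinel could be used to capture a command's exact output together with its exit status, without touching the locale. The function names capture and demo and the variables out and ret are hypothetical, chosen only for this illustration:

```shell
#!/bin/sh
# capture CMD [ARGS...]
# Runs the command, leaving its exact output in $out and its exit
# status in $ret.
capture() {
    # The subshell appends the sentinel and then exits with the
    # command's own status, which becomes the status of the assignment.
    out=$("$@"; ret=$?; printf .; exit "$ret")
    ret=$?
    # Remove the sentinel; "." is a literal character in shell patterns.
    out=${out%.}
}

# Example command: prints trailing newlines and returns a non-zero status.
demo() { printf 'line\n\n'; return 3; }

capture demo
printf 'status=%s\n' "$ret"
printf '%s' "$out" | od -An -c
```

Because the sentinel is one of the bytes POSIX guarantees cannot be part of another character, the removal should be safe in whatever locale the script happens to run in.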
Solution for comments:
For the example discussed in the comments, one possible solution (which fails in zsh) is:
#!/bin/bash
LC_ALL=zh_HK.big5hkscs
a=$(printf '\210\170');
b=$(printf '\170');
unset OldLC_ALL ; [ "${LC_ALL+set}" ] && OldLC_ALL=$LC_ALL
LC_ALL=C ; a=${a%"$b"};
unset LC_ALL ; [ "${OldLC_ALL+set}" ] && LC_ALL=$OldLC_ALL
printf '%s' "$a" | od -vAn -c
That will remove the problem of encoding.
$IFS, so it will not be captured as an argument. – Deathgrip Aug 01 '17 at 16:02
IFS (try ( IFS=:; subst=$(printf 'x\n\n\n'); printf '%s' "$subst" ). Only newlines get stripped. \t and space do not, and IFS doesn't affect it. – Petr Skocik Aug 01 '17 at 17:02
tcsh – Stéphane Chazelas Aug 02 '17 at 11:11