Convert to uppercase, except for escaped characters

Question

The methods I found break things further down the line by also affecting linebreaks.
For example...

$ message="First Line\nSecond Line"; 
$ echo "${message^^}"
FIRST LINE\NSECOND LINE

Is there an elegant way to convert a string to uppercase, but leaving escaped characters alone, to get the following output instead?

FIRST LINE\nSECOND LINE

I could just do something convoluted like changing "\n" to 0001 or something along those lines, apply the conversion and then return 0001 to "\n". But maybe there is a better way.

Is this for later inclusion as part of some other data, possibly in XML or JSON format? If so, a parser for that format may have routines for turning strings into uppercase in the way you describe, for example ascii_upcase in the JSON parser jq, or the XPath function upper-case() for XML. – Kusalananda, Jul 24 '22 at 11:50
@Kusalananda For me this is only about text processing, but someone else stumbling across this question might have such a use case. – Ocean, Jul 25 '22 at 09:52

Stéphane Chazelas · Answer 1 · 2022-04-26T07:48:31.807

With zsh instead of bash:

$ message="First Line\nSecond Line"
$ set -o extendedglob
$ print -r -- ${message//(#b)((\\?)|(?))/$match[2]$match[3]:u}
FIRST LINE\nSECOND LINE

In bash (or any shell) and with the GNU implementation of sed, you can do the same with:

$ printf '%s\n' "$message" | sed -E 's/(\\.)|(.)/\1\u\2/g'
FIRST LINE\nSECOND LINE

Some potentially more efficient variants as they minimise the number of substitutions:

zsh

print -r -- ${message//(#b)((\\?)|([^\\]##))/$match[2]$match[3]:u}

or

print -r -- ${message//(#b)((\\?)#)([^\\]##)/$match[1]$match[3]:u}

their GNU sed translations:

printf '%s\n' "$message" | sed -E 's/(\\.)|([^\\]+)/\1\U\2/g'

or

printf '%s\n' "$message" | sed -E 's/((\\.)*)([^\\]+)/\1\U\3/g'

Beware they convert \Mx (Meta-x, an escape sequence supported by zsh's print for instance and that expands to the 0xf8 byte ('x' + 0x80)) to \MX (0xd8). They also convert \x7a to \x7A or \u007a to \u007A or \Cx to \CX but that shouldn't be a problem as those expand to the same.
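
For comparison, the same idea (copy each backslash together with the character after it, uppercase everything else) can be sketched in plain bash without sed. This is only an illustration: the function name upper_except_escaped is made up here, it needs bash 4 or later for ${var^^}, and it shares the \Mx caveat above.

```bash
#!/usr/bin/env bash
# upper_except_escaped is a hypothetical helper name, not part of any tool.
upper_except_escaped() {
  local src=$1 out='' ch
  local -i i=0 n=${#src}
  while (( i < n )); do
    ch=${src:i:1}
    if [[ $ch == '\' ]] && (( i + 1 < n )); then
      out+=${src:i:2}    # copy the backslash and the escaped character verbatim
      (( i += 2 ))
    else
      out+=${ch^^}       # ordinary character (or a trailing lone backslash)
      (( i += 1 ))
    fi
  done
  printf '%s\n' "$out"
}

upper_except_escaped 'First Line\nSecond Line'   # prints: FIRST LINE\nSECOND LINE
```

A character-by-character loop like this is much slower than the sed or zsh one-liners on long strings, so it is mainly useful where GNU sed is not available.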

score 3 · Answer 2 · edited Apr 25 '22 at 19:05

I'd be tempted to interpret the escape sequences into literal characters:

message="First Line\nSecond Line"
declare -u Message                       # uppercase on assignment
printf -v Message -- "${message//%/%%}"  # assign
declare -p Message                       # inspect

result

declare -u Message="FIRST LINE
SECOND LINE"

answered Apr 25 '22 at 19:03 · glenn jackman
edited Apr 25 '22 at 19:05 · Stéphane Chazelas

Beware that with message='\141' for instance, you'd get declare -u Message="A" instead of declare -u Message="a" – Stéphane Chazelas Apr 25 '22 at 19:07
Note that any \ will be doubled to \\. – Apr 25 '22 at 22:45
Not giving printf a format string is what forces you to double every %, so that none of them is treated as a conversion. However, printf -v Message '%b' "${message}" will interpret backslash escapes exactly as echo -e does, without touching any % characters. – Apr 25 '22 at 22:57
Please read: https://unix.stackexchange.com/q/700508/232326 – Apr 27 '22 at 19:26
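
The %b suggestion from the comments can be tried as below. This is a hedged sketch: it assumes bash 4 or later (for declare -u), the sample string is made up so that it contains a %, and escapes that expand to lower-case letters (such as \141) still end up uppercased, as noted above.

```bash
message='100% done\nSecond Line'    # made-up sample containing a literal %
declare -u Message                  # uppercase on assignment
printf -v Message '%b' "$message"   # %b expands \n; the % in the data is left alone
printf '%s\n' "$Message"
```

With the format string fixed to '%b', the data no longer needs the ${message//%/%%} doubling, and Message ends up holding two upper-case lines separated by a real newline.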

G-Man Says 'Reinstate Monica' · Accepted Answer · 2022-05-09T23:15:35.573

echo "$message"  |  sed -e 's/^[[:lower:]]/\u&/' -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g' \
                                                 -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g'

-e 's/^[[:lower:]]/\u&/' If the first character in the string (or, more generally, the first character on a line) is a lower-case letter, capitalize it. Because the first character on a line can’t be escaped. Duh. That’s a no-brainer.
-e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g' Look at the line two characters at a time. If a lower-case letter is preceded by something other than a backslash, leave the preceding character alone, and capitalize the lower-case letter.

You might think that this would be enough to process the entire line. Unfortunately, since it processes the line two characters at a time, it gets only every other letter:
```
$ echo "first line\nsecond line" | sed -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g'
fIrSt LiNe\nSeCoNd LiNe
```
so,
-e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g' Do the exact same thing a second time. This will pick up the letters that were skipped on the first pass.

Alternative version:

echo "$message" | sed -e 's/^[[:lower:]]/\u&/' \
                                  -e ': loop; s/\([^\]\)\([[:lower:]]\)/\1\u\2/g; t loop'

Basically the same as the first version, but, instead of repeating the second s command, it iterates it with a loop.

Unfortunately, this will not work correctly for double backslashes:  foo\\bar will become FOO\\bAR, even though the b should be capitalized, since the \\ is an escaped backslash, and so should not cause the b to be escaped.
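
The double-backslash limitation is easy to reproduce. The sketch below assumes GNU sed (for \u, \U and -E) and contrasts the loop version with an alternation that consumes each backslash together with the character after it, the same approach as the sed commands in the first answer. The variable names are only for illustration.

```bash
input='foo\\bar'   # two literal backslashes followed by "bar"

# Loop version from above: the b directly follows a backslash, so it is skipped.
looped=$(printf '%s\n' "$input" |
  sed -e 's/^[[:lower:]]/\u&/' \
      -e ':loop' -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g' -e 't loop')

# Alternation: "\\" is consumed as one escaped pair, so "bar" is ordinary text.
paired=$(printf '%s\n' "$input" | sed -E 's/(\\.)|([^\\]+)/\1\U\2/g')

printf '%s\n' "$looped" "$paired"
# prints:
# FOO\\bAR
# FOO\\BAR
```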

No, the first character could be escaped, like when you want to insert a tab at the beginning, which would be "\t". – Ocean, May 09 '22 at 15:19
One of us is not understanding the other. If the line begins with \t, then the first character is \. t is the *second* character. If I’m misunderstanding you, please explain more clearly. – G-Man Says 'Reinstate Monica', May 09 '22 at 22:36
Semantics. If a line begins with "\t", then the first character is an escaped "t". But one can also say that "\" is the first character. Depends on how you look at it, I guess. It could also be an escaped "\" by having "\\t", so one gets "\t" instead of the tab character. Since these constructs are supposed to represent a single character (\t is tab), I treat them as single entities, which was the origin of the misunderstanding. – Ocean, May 10 '22 at 12:08

score 1 · Answer 4 · answered Jul 24 '22 at 11:42

I'd consider evaluating the \n and other escape sequences at the point that the variable was defined. Here $message actually contains a newline.

message=$(printf '%b' 'First Line\nSecond Line')
echo "${message^^}"

Output

FIRST LINE
SECOND LINE
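
In bash, ksh93 and zsh the same effect is available without a command substitution through ANSI-C quoting ($'...'), which expands the escapes when the string is written. This is an alternative to the printf line above rather than part of the original answer; unlike $(...), it does not strip trailing newlines.

```bash
message=$'First Line\nSecond Line'   # $'...' turns \n into a real newline
upper=${message^^}                   # bash 4+: uppercase the whole value
printf '%s\n' "$upper"
```

This only helps when the string is a literal in the script; text that arrives with escapes already in it (from a file or another program) still needs printf '%b' or one of the escape-aware conversions from the other answers.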

answered Jul 24 '22 at 11:42 · Chris Davies

Kadir · Answer 5 · 2022-04-26T08:31:51.590

0

The variable can be iterated line by line. Then concatenate the output again.

bash:

$ message="First Line\nSecond Line";
$ message=$(echo -e ${message} |while read -r line; do echo -n "${line^^}\n" ; done) && message=${message%??}
$ echo ${message} 
FIRST LINE\nSECOND LINE

answered Apr 26 '22 at 07:09 · Kadir
edited Apr 26 '22 at 08:31

See Understanding "IFS= read -r line", When is double-quoting necessary? and Why is printf better than echo? – Stéphane Chazelas Apr 26 '22 at 07:34
That will likely leave linefeeds alone, but the OP asked for all escaped characters to be left alone. – Henrik supports the community Apr 26 '22 at 08:04
Backslash processing should be removed from the while read loop for sure. Just edited the answer. – Kadir Apr 26 '22 at 08:35
(1) For starters, ${message} should be "$message". See ${variable_name} doesn’t mean what you think it does …. (2) You should explain your answer better, in particular (IMO) the %?? part. (You don’t need to explain it to me; I figured it out.) Please do not respond in comments; edit your answer to make it clearer and more complete. … (Cont’d) – G-Man Says 'Reinstate Monica' May 07 '22 at 19:04
(Cont’d) … (3) This is a classic example of providing a solution for the example while ignoring the larger question. foo\012bar will turn into FOO\nBAR, \g\h\i\j\k\l\m\n\o\p\q will turn into \G\H\I\J\K\L\M\n\O\P\Q, and any of \a, \b, \c, \e, \f, \r, \t, \v, and \\ will cause problems. Also, leading and trailing spaces, and multiple spaces. (4) Strictly speaking, the question didn’t say that you should clobber the original variable. If you need a multi-step process, you should assign the intermediate value to a temp variable. – G-Man Says 'Reinstate Monica' May 07 '22 at 19:04
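
Several of the points raised in the comments (quoting, IFS= read -r, printf instead of echo, a separate result variable) can be folded into the loop as sketched below. It assumes bash 4 or later, the name result is made up, and it still only preserves \n: printf '%b' expands every other escape, so the objection in (3) stands.

```bash
message='First Line\nSecond Line'
result=''
# printf '%b\n' expands the escapes; read -r then sees one real line at a time.
while IFS= read -r line; do
  result+="${line^^}\\n"      # re-append a literal backslash-n after each line
done < <(printf '%b\n' "$message")
result=${result%??}           # %?? strips the final two characters: the extra \n
printf '%s\n' "$result"       # prints: FIRST LINE\nSECOND LINE
```

Setting IFS to empty keeps leading and trailing spaces on each line, and feeding the loop through process substitution (rather than a pipe) lets result survive after the loop ends.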

jubilatious1 · Answer 6 · 2022-07-24T11:33:13.433

Using Raku (formerly known as Perl_6)

~$ echo 'a\nb'
a\nb
~$ echo 'a\nb' | raku -pe 's:g/ <!after "\\"> (.) /{$0.uc}/;'
A\nB
~$ echo "a\\nb"
a\nb
~$ echo "a\\nb" | raku -pe 's:g/ <!after "\\"> (.) /{$0.uc}/;'
A\nB

Above uses a negative look-behind assertion, <!after "\\">, to select out all characters except those immediately after a \ backslash. Selected characters are then uppercased with Raku's .uc routine.

Certainly it's safer to provide the regex with a custom <-[ … ]> negative character class, sparing backslashed characters like \n and \t from being uppercased. (FYI, custom positive character classes are written <+[ … ]> or more simply <[ … ]> in Raku).

Below, using Raku's "Q-lang" (quoting language) to feed the substitution operator a string. In all four examples below \n is returned (not uppercase \N). Note in the third example how \n is operationally-interpreted as a newline character, and this remains unchanged in the fourth example, telling us that \n still exists in that string (i.e. it has NOT been uppercased to \N):

~$ raku -e 'put Q<a\nb>'
a\nb
~$ raku -e 'put Q<a\nb>' | raku -pe 's:g/ <!after "\\"> (<-[nt]>) /{$0.uc}/;'
A\nB
~$ raku -e 'put Q:b<a\nb>'
a
b
~$ raku -e 'put Q:b<a\nb>' | raku -pe 's:g/ <!after "\\"> (<-[nt]>) /{$0.uc}/;'
A
B

NOTE, see: "Place an escape sign before every non-alphanumeric characters" for Raku answers to a related question on StackOverflow.

References:
https://docs.raku.org/language/quoting
https://docs.raku.org/language/regexes#Literals_and_metacharacters
https://raku.org
