6

The methods I found break things further down the line by also affecting linebreaks.
For example...

$ message="First Line\nSecond Line"; 
$ echo "${message^^}"
FIRST LINE\NSECOND LINE

Is there an elegant way to convert a string to uppercase while leaving escaped characters alone, so that I get the following output instead?

FIRST LINE\nSECOND LINE

I could just do something convoluted like changing "\n" to 0001 or something along those lines, apply the conversion and then return 0001 to "\n". But maybe there is a better way.

Ocean
  • Is this for later inclusion as part of some other data, possibly in XML or JSON format? If so, a parser for that format may have routines for turning strings into uppercase in the way you describe, for example ascii_upcase in the JSON parser jq, or the XPath function upper-case() for XML. – Kusalananda Jul 24 '22 at 11:50
  • @Kusalananda For me this is only about text processing, but someone else stumbling across this question might have such a use case. – Ocean Jul 25 '22 at 09:52

6 Answers

6

With zsh instead of bash:

$ message="First Line\nSecond Line"
$ set -o extendedglob
$ print -r -- ${message//(#b)((\\?)|(?))/$match[2]$match[3]:u}
FIRST LINE\nSECOND LINE

In bash (or any shell) and with the GNU implementation of sed, you can do the same with:

$ printf '%s\n' "$message" | sed -E 's/(\\.)|(.)/\1\u\2/g'
FIRST LINE\nSECOND LINE

Some potentially more efficient variants, since they minimise the number of substitutions:

  • zsh

    print -r -- ${message//(#b)((\\?)|([^\\]##))/$match[2]$match[3]:u}
    

    or

    print -r -- ${message//(#b)((\\?)#)([^\\]##)/$match[1]$match[3]:u}
    
  • their GNU sed translations:

    printf '%s\n' "$message" | sed -E 's/(\\.)|([^\\]+)/\1\U\2/g'
    

    or

    printf '%s\n' "$message" | sed -E 's/((\\.)*)([^\\]+)/\1\U\3/g'
    

Beware they convert \Mx (Meta-x, an escape sequence supported by zsh's print for instance and that expands to the 0xf8 byte ('x' + 0x80)) to \MX (0xd8). They also convert \x7a to \x7A or \u007a to \u007A or \Cx to \CX but that shouldn't be a problem as those expand to the same.
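If neither zsh nor GNU sed is available, the same left-to-right idea (copy a backslash together with the character after it, uppercase everything else) can be sketched in plain bash. This is only a minimal sketch: the function name upper_keep_escapes is made up for illustration, it assumes bash 4 or later for the ${var^^} expansion, and it shares the \Mx caveat described above.

```shell
#!/usr/bin/env bash
# upper_keep_escapes: hypothetical helper name, not part of the answer above.
# Uppercases every character except the one that directly follows a backslash.
upper_keep_escapes() {
  local in=$1 out='' i=0 c
  while (( i < ${#in} )); do
    c=${in:i:1}
    if [[ $c == '\' ]]; then
      # Copy the backslash and the next character (if any) unchanged.
      out+=${in:i:2}
      (( i += 2 ))
    else
      # Any other character is uppercased (requires bash 4+).
      out+=${c^^}
      (( i += 1 ))
    fi
  done
  printf '%s\n' "$out"
}

upper_keep_escapes 'First Line\nSecond Line'   # prints FIRST LINE\nSECOND LINE
upper_keep_escapes 'foo\\bar'                  # prints FOO\\BAR
```

Because the scan consumes a backslash and its successor as one unit, an escaped backslash does not shield the letter after it. It is likely slower than the sed variants on long strings, since it handles one character per iteration.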

3

I'd be tempted to interpret the escape sequences into literal characters:

message="First Line\nSecond Line"
declare -u Message                       # uppercase on assignment
printf -v Message -- "${message//%/%%}"  # assign
declare -p Message                       # inspect

result

declare -u Message="FIRST LINE
SECOND LINE"
glenn jackman
  • 3
    Beware that with message='\141' for instance, you'd get declare -u Message="A" instead of declare -u Message="a" – Stéphane Chazelas Apr 25 '22 at 19:07
  • Note that any \ will be doubled \\. –  Apr 25 '22 at 22:45
  • 1
    Not giving printf a format is what makes it interpret every %, which you then have to avoid by doubling each one. However, printf -v Message '%b' "${message}" will interpret backslashed characters exactly as echo -e does, without touching any % characters. –  Apr 25 '22 at 22:57
  • Please read: https://unix.stackexchange.com/q/700508/232326 –  Apr 27 '22 at 19:26
1
echo "$message"  |  sed -e 's/^[[:lower:]]/\u&/' -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g' \
                                                 -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g'
  • -e 's/^[[:lower:]]/\u&/'  If the first character in the string (or, more generally, the first character on a line) is a lower-case letter, capitalize it.  Because the first character on a line can’t be escaped.  Duh.  That’s a no-brainer.

  • -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g'  Look at the line two characters at a time.  If a lower-case letter is preceded by something other than a backslash, leave the preceding character alone, and capitalize the lower-case letter.

    You might think that this would be enough to process the entire line.  Unfortunately, since it processes the line two characters at a time, it gets only every other letter:

    $ echo "first line\nsecond line" | sed -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g'
    fIrSt LiNe\nSeCoNd LiNe
    

    so,

  • -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g'  Do the exact same thing a second time.  This will pick up the letters that were skipped on the first pass.


Alternative version:

echo "$message" | sed -e 's/^[[:lower:]]/\u&/' \
                                  -e ': loop; s/\([^\]\)\([[:lower:]]\)/\1\u\2/g; t loop'

Basically the same as the first version, but, instead of repeating the second s command, it iterates it with a loop.


Unfortunately, this will not work correctly for double backslashes:  foo\\bar will become FOO\\bAR, even though the b should be capitalized, since the \\ is an escaped backslash, and so should not cause the b to be escaped.

  • No, the first character could be escaped, like when you want to insert a tab at the beginning, which would be "\t". – Ocean May 09 '22 at 15:19
  • One of us is not understanding the other.  If the line begins with \t, then the first character is \.  t is the second character.  If I’m misunderstanding you, please explain more clearly. – G-Man Says 'Reinstate Monica' May 09 '22 at 22:36
  • Semantics. If a line begins with "\t", then the first character is an escaped "t". But one can also say that "\" is the first character. Depends on how you look at it, I guess. It could also be an escaped "\" by having "\\t", so one gets "\t" instead of the tab character. Since these constructs are supposed to represent a single character (\t is tab), I treat them as single entities, which was the origin of the misunderstanding. – Ocean May 10 '22 at 12:08
1

I'd consider evaluating the \n and other escape sequences at the point that the variable was defined. Here $message actually contains a newline.

message=$(printf '%b' 'First Line\nSecond Line')
echo "${message^^}"

Output

FIRST LINE
SECOND LINE
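
If the escaped form is needed again later, one option is to convert the real newlines and tabs back afterwards. The following is a minimal sketch of that round trip, not something taken from the answer itself: the function name upper_via_expansion is hypothetical, it assumes bash 4 or later, and it only restores \n and \t. Any other escape that printf '%b' expands stays expanded, and one that expands to a letter gets uppercased as a real character.

```shell
#!/usr/bin/env bash
# upper_via_expansion: hypothetical helper name used only for this sketch.
upper_via_expansion() {
  local expanded upper
  # Expand backslash escapes; note that $(...) strips trailing newlines.
  expanded=$(printf '%b' "$1")
  # Uppercase the expanded text (requires bash 4+).
  upper=${expanded^^}
  # Turn real newlines and tabs back into their two-character escapes.
  upper=${upper//$'\n'/\\n}
  upper=${upper//$'\t'/\\t}
  printf '%s\n' "$upper"
}

upper_via_expansion 'First Line\nSecond Line'   # prints FIRST LINE\nSECOND LINE
```

This is essentially the placeholder idea from the question, with the real control character acting as the placeholder.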
Chris Davies
0

The variable can be iterated line by line. Then concatenate the output again.

bash:

$ message="First Line\nSecond Line";
$ message=$(echo -e "${message}" | while read -r line; do echo -n "${line^^}\n"; done) && message=${message%??}
$ echo "${message}"
FIRST LINE\nSECOND LINE
Kadir
0

Using Raku (formerly known as Perl_6)

~$ echo 'a\nb'
a\nb
~$ echo 'a\nb' | raku -pe 's:g/ <!after "\\"> (.) /{$0.uc}/;'
A\nB
~$ echo "a\\nb"
a\nb
~$ echo "a\\nb" | raku -pe 's:g/ <!after "\\"> (.) /{$0.uc}/;'
A\nB

Above uses a negative look-behind assertion, <!after "\\">, to select out all characters except those immediately after a \ backslash. Selected characters are then uppercased with Raku's .uc routine.

Certainly it's safer to provide the regex with a custom <-[ … ]> negative character class, sparing backslashed characters like \n and \t from being uppercased. (FYI, custom positive character classes are written <+[ … ]> or more simply <[ … ]> in Raku).

Below, using Raku's "Q-lang" (quoting language) to feed the substitution operator a string. In all four examples below \n is returned (not uppercase \N). Note in the third example how \n is operationally-interpreted as a newline character, and this remains unchanged in the fourth example, telling us that \n still exists in that string (i.e. it has NOT been uppercased to \N):

~$ raku -e 'put Q<a\nb>'
a\nb
~$ raku -e 'put Q<a\nb>' | raku -pe 's:g/ <!after "\\"> (<-[nt]>) /{$0.uc}/;'
A\nB
~$ raku -e 'put Q:b<a\nb>'
a
b
~$ raku -e 'put Q:b<a\nb>' | raku -pe 's:g/ <!after "\\"> (<-[nt]>) /{$0.uc}/;'
A
B

NOTE, see: "Place an escape sign before every non-alphanumeric characters" for Raku answers to a related question on StackOverflow.

References:
https://docs.raku.org/language/quoting
https://docs.raku.org/language/regexes#Literals_and_metacharacters
https://raku.org

jubilatious1