
I have the following file:

$ cat test
Villes visit\U000000e9es

How can I interpret those \UXXXXXXXX codes? That is, how can I get:

$ cat test | pipe into something
Villes visitées
ChennyStar
  • That's not a UTF-8 code, that's a C-style representation of a character by its Unicode codepoint. Maybe you mean that you want to replace that \U000000e9 with the UTF-8 encoding of the U+00E9 character? – Stéphane Chazelas Sep 05 '22 at 09:01
  • Yes, that's what I want, replace the C-style \U000000e9 by é – ChennyStar Sep 05 '22 at 09:04
  • The question is: by é encoded as per the user's locale? That is, as 0xe9 if the locale uses iso8859-1, or as UTF-8 (0xc3 0xa9) regardless of the user's locale? – Stéphane Chazelas Sep 05 '22 at 09:06
  • I think it's by the user's locale. It's a file that's produced by someone else, so I don't have all the details. But it's produced using the C preprocessor (cpp), something like (in its simplest form) $ echo "villes visitées" | cpp -P, which prints out villes visit\U000000e9es. – ChennyStar Sep 05 '22 at 09:20
  • To clarify, \U000000e9 is unambiguous. That's the LATIN SMALL LETTER E WITH ACUTE character. The question is more about how you want to represent it, so more about how the output will be used. – Stéphane Chazelas Sep 05 '22 at 09:34

1 Answer


With perl:

$ perl -C -pe 's/\\U([[:xdigit:]]{8})/chr hex$1/ge' <yourfile
Villes visitées

Assuming the locale uses UTF-8 as its charmap¹, that would convert \UXXXXXXXX to the UTF-8 encoding of the U+XXXXXXXX character. To get UTF-8 output regardless of the user's locale, change the -C to -CO.

To convert it to the é character in the correct encoding for the user's locale (assuming there's such a character in the user's locale charset):

perl -Mopen=locale -pe 's/\\U([[:xdigit:]]{8})/chr hex$1/ge' <yourfile

So for instance, in a fr_CH.iso88591 locale, that would convert it to the 0xe9 byte (the encoding of é in ISO8859-1), in a zh_HK.big5hkscs locale to 0x88 0x6d (its encoding in BIG5-HKSCS), and in a fr_FR.UTF-8 locale to 0xc3 0xa9 (its UTF-8 encoding). In an ar_AE.iso88596 locale, as ISO8859-6 doesn't have an é character, you'd get Villes visit\x{00e9}es.
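To check such byte sequences yourself, one option is to convert é with iconv and dump the result. This is only a sketch: the helper name e_acute_bytes is hypothetical, and which charset names iconv accepts depends on the system (see iconv -l).

```shell
# Hypothetical helper: print, as hex, the bytes encoding U+00E9 in charset $1.
e_acute_bytes() {
  # \303\251 is the UTF-8 encoding of U+00E9, written in octal so that this
  # script itself stays pure ASCII.
  printf '\303\251' | iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \t\n'
  echo
}

e_acute_bytes ISO-8859-1   # charset of the fr_CH.iso88591 locale
e_acute_bytes UTF-8        # charset of the fr_FR.UTF-8 locale
```

For a charset that lacks the character, such as ISO-8859-6, iconv is expected to report a conversion error instead of printing bytes.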

Or you could use ICU's uconv (in the icu-devtools package on Debian-based systems) to apply the Hex/C-Any transform:

uconv -x hex/c-any <your-file

It understands \uXXXX and \UXXXXXXXX sequences (more if you use hex-any) and outputs in UTF-8. Pipe to iconv -f utf-8 to get the output in the user's locale (see also iconv's -c option to skip characters that can't be encoded).

$ printf '%s\n' '&#233; &#xe9; \x{e9} U+00E9 \u00e9 \U000000e9 \U0001F427 \ud83d\udc27' | uconv -x hex/c-any
&#233; &#xe9; \x{e9} U+00E9 é é 🐧 🐧
$ printf '%s\n' '&#233; &#xe9; \x{e9} U+00E9 \u00e9 \U000000e9 \U0001F427 \ud83d\udc27' | uconv -x hex-any
é é é é é é 🐧 🐧

(Both also recognise Java-style surrogate pairs, though those shouldn't occur in your output if it's from cpp -P.)

For the perl one to understand both \uXXXX and \UXXXXXXXX like uconv's hex/c-any does, change the perl code to:

s/(?|\\u([[:xdigit:]]{4})|\\U([[:xdigit:]]{8}))/chr hex$1/ge

zsh's print builtin also understands those \uXXXX and \UXXXXXXXX sequences (without requiring all 4 or 8 digits) and many more, so you could also do:

print -- "$(<your-file)"

You'll get an error if there are characters not present in the locale's charmap.

Some printf implementations also support them for their %b format directive:

printf '%b\n' "$(cat <your-file)"

Like zsh's print, it supports more than just \u/\U: at least \n, \b, \r and the like, the \0ooo octal sequences, and often more such as \xHH.


¹ See the output of the locale charmap command. In locales that use other charmaps, what you get is mostly useless in your case: if all the code points on a line are in the 0x0 .. 0xff range, you get the ISO8859-1 encoding (the byte value of the code point); if not (if there's at least one code point above 0xff in the line), you get the UTF-8 encoding (and some warnings about it).