Emoticons seem to be specified using a format of U+xxxxx
wherein each x is a hexadecimal digit.
For example, U+1F615 is the official Unicode Consortium code for the "confused face".
As I am often confused, I have a strong affinity for this symbol.
The U+1F615 representation confuses me because I thought the only possible encodings for Unicode characters required 8, 16, 24 or 32 bits, whereas 5 hex digits require 5 x 4 = 20 bits.
I've discovered that this symbol seems to be represented by a completely different hex string in bash:
$ echo -n 😕 | hexdump
0000000 f0 9f 98 95
0000004
$echo -e "\xf0\x9f\x98\x95"
$PS1=$'\xf0\x9f\x98\x95 >'
😕 >
I would have expected U+1F615 to convert to something like \x00 \x01 \xF6 \x15.
I don't see the relationship between these two encodings.
When I look up a symbol in the official Unicode Consortium list, I would like to be able to use that code directly without having to convert it manually in this tedious fashion, i.e.:
- finding the symbol on some web page
- copying it to the clipboard of the web browser
- pasting it in bash to echo through a hexdump to discover the REAL code.
Can I use this 20-bit code to determine what the 32-bit code is?
Does a relationship exist between these 2 numbers?
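For what it's worth, the two numbers are related through UTF-8: U+1F615 is a code point (just a number), and f0 9f 98 95 is that number's UTF-8 byte sequence. Below is a minimal sketch of the conversion in plain bash arithmetic. The function name utf8_escape is made up for illustration, and it assumes the argument is a valid code point written as bare hex digits; surrogates and out-of-range values are not checked.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the UTF-8 bytes of a code point as \x escapes.
# Usage: utf8_escape 1F615
utf8_escape() {
  local cp=$(( 16#$1 ))   # parse the hex digits into an integer
  if (( cp < 0x80 )); then
    # 1 byte: 0xxxxxxx
    printf '\\x%02x' "$cp"
  elif (( cp < 0x800 )); then
    # 2 bytes: 110xxxxx 10xxxxxx
    printf '\\x%02x\\x%02x' \
      $(( 0xC0 | (cp >> 6) )) \
      $(( 0x80 | (cp & 0x3F) ))
  elif (( cp < 0x10000 )); then
    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    printf '\\x%02x\\x%02x\\x%02x' \
      $(( 0xE0 | (cp >> 12) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  else
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    printf '\\x%02x\\x%02x\\x%02x\\x%02x' \
      $(( 0xF0 | (cp >> 18) )) \
      $(( 0x80 | ((cp >> 12) & 0x3F) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  fi
  printf '\n'
}

utf8_escape 1F615   # prints \xf0\x9f\x98\x95
```

The code point's bits are split into groups of six from the right, and each group is placed into a continuation byte of the form 10xxxxxx; the leading byte carries the remaining bits plus a prefix that says how many bytes follow. That is why the digits of 1F615 do not appear literally in f0 9f 98 95.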
If \U1F615 is followed by another valid hexadecimal digit, then that digit will be assumed to be part of the escape sequence. To make it work regardless of what follows, it needs enough leading zeros to be exactly eight digits long: \U0001F615
– kasperd Dec 30 '15 at 09:18
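The padding described in the comment above can be sketched as a small bash helper. The name pad_U_escape is hypothetical; it only builds the eight-digit escape text and does not itself emit the character.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn bare hex digits into an eight-digit \U escape.
# Usage: pad_U_escape 1F615
pad_U_escape() {
  # %08X zero-pads the code point to exactly eight uppercase hex digits
  printf '\\U%08X\n' "$(( 16#$1 ))"
}

pad_U_escape 1F615   # prints \U0001F615
```

Assuming bash 4.2 or later and a UTF-8 locale, the resulting string can then be handed to printf as a format, e.g. printf "$(pad_U_escape 1F615)", or typed inside $'...' quoting.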