How does od treat symbols after `\x7f`?

Question

The following command passes to od symbols from \x00 to \xff:

$ seq 0 255 | awk '{printf("%c", $0)}' | od -c

But what I get is:

0000000  \0 001 002 003 004 005 006  \a  \b  \t  \n  \v  \f  \r 016 017
0000020 020 021 022 023 024 025 026 027 030 031 032 033 034 035 036 037
0000040       !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
0000060   0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
0000100   @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
0000120   P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
0000140   `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
0000160   p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~ 177
0000200 302 200 302 201 302 202 302 203 302 204 302 205 302 206 302 207
0000220 302 210 302 211 302 212 302 213 302 214 302 215 302 216 302 217
0000240 302 220 302 221 302 222 302 223 302 224 302 225 302 226 302 227
0000260 302 230 302 231 302 232 302 233 302 234 302 235 302 236 302 237
0000300 302 240 302 241 302 242 302 243 302 244 302 245 302 246 302 247
0000320 302 250 302 251 302 252 302 253 302 254 302 255 302 256 302 257
0000340 302 260 302 261 302 262 302 263 302 264 302 265 302 266 302 267
0000360 302 270 302 271 302 272 302 273 302 274 302 275 302 276 302 277
0000400 303 200 303 201 303 202 303 203 303 204 303 205 303 206 303 207
0000420 303 210 303 211 303 212 303 213 303 214 303 215 303 216 303 217
0000440 303 220 303 221 303 222 303 223 303 224 303 225 303 226 303 227
0000460 303 230 303 231 303 232 303 233 303 234 303 235 303 236 303 237
0000500 303 240 303 241 303 242 303 243 303 244 303 245 303 246 303 247
0000520 303 250 303 251 303 252 303 253 303 254 303 255 303 256 303 257
0000540 303 260 303 261 303 262 303 263 303 264 303 265 303 266 303 267
0000560 303 270 303 271 303 272 303 273 303 274 303 275 303 276 303 277
0000600

What's wrong with characters after \x7f?

@slm en_US.UTF-8. Sorry, I decided to simplify the original command. You can see it in the revisions. Initially it was echo ibase=16 80 | tr ' ' \; | bc | awk '{printf("%c",$0)}' | od -c. So, it's awk that produces two bytes. — x-yuri, Aug 01 '18 at 09:00
Note: It seems to me that echo $(seq 0 255) | tr ' ' \; | bc is equivalent to seq 0 255 (which is much simpler). Or am I missing something? — Malte Skoruppa, Aug 01 '18 at 10:17
@MalteSkoruppa, no, you aren't. They just had the bc there for base conversion in another version, so that might be a remainder of that. echo $(...) is a bit redundant in itself too. But { echo obase=16; seq 0 255; } | bc would be a somewhat useful use of the bc. — ilkkachu, Aug 01 '18 at 10:36

Stéphane Chazelas · Accepted Answer · 2018-08-01T12:16:12.180

Depending on the awk implementation, printf("%c", n) outputs either the byte with value n or the character whose code point is n.

If the locale's charset is UTF-8 (see the output of locale charmap), both behaviours give the same result for values 0 to 127, because UTF-8 encodes characters U+0000 to U+007F as the single bytes 0x00 to 0x7F.

But for anything over 127, you get the corresponding byte value (truncated to 8 bits) for the awk implementations in the first category, or the UTF-8 encoding for the others (at least GNU awk, probably the one you're using).
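The 302 200 ... 303 277 pairs in the question's od output are what you would expect from two-byte UTF-8 encoding of code points 128 to 255. As a rough sketch of that arithmetic (utf8_two_bytes is a hypothetical helper name, not something from awk or od):

```shell
# Illustrative helper (an assumption, not part of the original answer):
# print, in octal, the two UTF-8 bytes that encode a code point in 128..2047.
utf8_two_bytes() {
  cp=$1
  lead=$(( 0xC0 | (cp >> 6) ))    # 110xxxxx: upper bits of the code point
  cont=$(( 0x80 | (cp & 0x3F) ))  # 10xxxxxx: lower six bits
  printf '%o %o\n' "$lead" "$cont"
}

utf8_two_bytes 128   # 302 200
utf8_two_bytes 255   # 303 277
```

These match the first and last pairs od printed after 177.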

gawk 'BEGIN{printf "%c", 8364}'

(8364 being 0x20AC) prints a € Euro sign (U+20AC), encoded as 0xe2 0x82 0xac in UTF-8, while

mawk 'BEGIN{printf "%c", 8364}'

prints a single 0xAC byte. On its own, that byte does not encode any character in UTF-8, so the output is invalid text; your terminal may render it as �, the replacement character.
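To check those byte values without relying on any particular awk, here is a small sketch using only printf, od and shell arithmetic (the two function names are made up for illustration):

```shell
# U+20AC written out as its three UTF-8 bytes (octal 342 202 254), shown in hex.
euro_utf8_hex() { printf '\342\202\254' | od -An -tx1 | tr -d ' \n'; }

# Truncating a value to 8 bits keeps only its low byte.
low_byte_hex() { printf '%x' $(( $1 & 0xFF )); }

euro_utf8_hex; echo      # e282ac
low_byte_hex 8364; echo  # ac
```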

Note that "code point" here typically means the Unicode code point for multi-byte character sets, and the charset value (so the byte value) for single-byte ones. In a locale using the iso8859-15 charset, the Euro sign has code point 0xA4 (not 0x20AC), so printf("%c", 0xA4) prints a Euro sign (byte value 0xA4) regardless of the awk implementation.

So if you want to print bytes by value (values from 1 to 255; not all awk implementations handle 0 properly), use:

LC_ALL=C awk 'BEGIN{printf "%c", value}'

The C locale's charset is guaranteed to be single-byte and every system has a C locale.
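Applied to the loop from the question, that could look like the sketch below (all_bytes is a hypothetical wrapper name; 0 is skipped because of the caveat above). If every value comes out as one byte, wc -c should report 255 rather than the larger count you get when values above 127 are UTF-8 encoded:

```shell
# Emit the values 1..255 once each; LC_ALL=C forces a single-byte charset,
# so each printf "%c" is expected to produce exactly one byte.
all_bytes() {
  LC_ALL=C awk 'BEGIN { for (i = 1; i < 256; i++) printf "%c", i }'
}

all_bytes | wc -c     # byte count
all_bytes | od -c     # inspect the bytes themselves
```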

You can also use:

printf '\200'

(here the byte value is expressed in octal). Some printf implementations also support hexadecimal:

printf '\x80'
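When the value is only known at run time, one common idiom is to build the octal escape with a nested printf. A minimal sketch (byte is a hypothetical helper name, and it assumes a decimal argument):

```shell
# Hypothetical helper: write one byte given its decimal value.
byte() {
  # The inner printf turns the number into three octal digits; the outer
  # printf then interprets the resulting \ooo escape in its format string.
  printf "\\$(printf '%03o' "$1")"
}

byte 128 | od -An -to1   # shows 200
```

This sticks to octal escapes, so it does not depend on the \x support that only some printf implementations have.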

Some printf implementations also support:

printf '\u20ac'

This prints a character based on its Unicode code point, generally encoded in the locale's charset (so 0xA4 in iso8859-15 locales, 0xe2 0x82 0xac in UTF-8 ones, and varying behaviour in locales whose charset doesn't have the Euro sign). Some implementations, such as the printf builtin of ksh93, output it encoded in UTF-8 regardless of the locale's charset.
