My terminal is set to UTF-8. If I type this:
echo -n 'ýá' | xxd
I can see this output:
00000000: c3bd c3a1
Which is fine. Now I would like to remove the 'ý' character from the string, so I use:
echo -n 'ýá' | tr -d 'ý' | xxd
But the result is:
00000000: a1
The tr command also removes the next c3 byte, but that byte is part of the 'á' character. Why does it work this way? Is this a bug? Or should I set something?
tr works at the byte level. Each of the UTF-8 characters you show is encoded as 2 bytes, and one of those bytes (c3) is common to both characters. I think it is not a bug, but the way this primitive tool works. (You can test with some characters that are encoded as single bytes and see what happens when one of them appears twice in the pattern.) – sudodus Mar 23 '20 at 14:16
tr is not prepared for UTF-8? – user3719454 Mar 23 '20 at 14:17
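The byte-level behaviour described in the comment above can be reproduced with a small sketch. It assumes a POSIX shell with od, tr and sed available, and it forces LC_ALL=C so that every tool treats its input as plain bytes (some tr implementations are multibyte-aware in a UTF-8 locale and would behave differently). The helper name hex and the variable names are made up for illustration. The point is that tr -d takes a set of bytes, while a sed substitution matches a sequence of bytes:

```shell
# Hex-dump stdin as one compact string, e.g. "c3a1" (hypothetical helper).
hex() { od -An -tx1 | tr -d ' \n'; }

# In UTF-8, 'ý' is the bytes c3 bd and 'á' is c3 a1. Build them with octal
# escapes so the example does not depend on the encoding of this script.
y=$(printf '\303\275')
input=$(printf '\303\275\303\241')

# Byte-wise tr: the argument is a SET of bytes {c3, bd}, so every c3 in the
# input is deleted, including the one that belongs to 'á'.
bytewise=$(printf '%s' "$input" | LC_ALL=C tr -d "$y" | hex)

# sed matches the two-byte SEQUENCE c3 bd, so the c3 of 'á' survives.
sequence=$(printf '%s' "$input" | LC_ALL=C sed "s/$y//g" | hex)

echo "tr:  $bytewise"
echo "sed: $sequence"
```

With these assumptions the tr line leaves only a1, matching the output in the question, while the sed line leaves c3a1, which is an intact 'á'. In a UTF-8 locale, something like sed 's/ý//g' is therefore one possible way to remove a single multibyte character.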