I have the following string

echo -e "a12\x8fb12\x9f" | xxd
0000000: 6131 328f 6231 329f 0a                   a12.b12..

and want to delete the sequences 12\x9f and 12\x8f with sed.

I can do it with this command

sed -e 's_12\x8f__g' -e 's_12\x9f__g'

but why does this command not work?

sed -e 's_12[\x8f\x9f]__g'
Baba
  • 3,279
  • 1
    I would assume that sed is not completely binary safe. You could try working only on the hex side, like this: echo -e "a12\x8fb12\x9f" | sed -e 's_12[\x8f]__g' | xxd -ps | sed 's/../\0 /g' | sed -r 's/31 32 (8f|9f) ?//g' | xxd -r -ps | xxd – LatinSuD Jun 26 '14 at 13:08
  • 4
    Use LC_ALL=C sed..., \x8f by itself won't make a character in a UTF-8 locale. – Stéphane Chazelas Jun 26 '14 at 13:14

1 Answer

That's because [...] matches a single character: sed tries to match characters against the set specified in [...]. In UTF-8 locales, \x8f can only occur as part of a multi-byte character; on its own it does not form a character. You'll notice that . doesn't match it either (and that's a POSIX requirement).

For instance:

sed 's/[eé\xa9]//'

would not make sense. é is a character (encoded as 0xc3 0xa9); 0xa9 is not a character, but as a byte it can be found inside a character (like é); e is a character (encoded as 0x65). You can't expect sed to somehow be able to match 0xa9 both inside a character and as a standalone byte.
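To make the byte/character distinction concrete, here is a small sketch that dumps the bytes behind é and e. It assumes UTF-8 encoding for é (written with octal escapes so the example does not depend on the terminal's encoding); the variable names are only illustrative.

```shell
# é in UTF-8 is the two bytes 0xc3 0xa9 (octal \303 \251).
e_acute=$(printf '\303\251' | od -An -tx1 | tr -d ' \n')
# Plain e is the single byte 0x65.
plain_e=$(printf 'e' | od -An -tx1 | tr -d ' \n')
echo "e-acute bytes: $e_acute, plain e bytes: $plain_e"
```

The byte 0xa9 appears only as the second half of é here, which is why it cannot be treated as a character of its own in a UTF-8 locale.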

To match arbitrary byte data with a text utility like sed, you'll want to use a locale where characters are bytes, which is typically the case with LC_ALL=C:

LC_ALL=C sed 's/12[\x8f\x9f]//g'

Or portably:

LC_ALL=C sed "$(printf 's/12[\217\237]//g')"
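As a quick check, the portable form can be run against the sample input from the question. This is only a sketch: the input is rebuilt with printf octal escapes (\217 = 0x8f, \237 = 0x9f) instead of echo -e, od stands in for xxd, and the variable name result is illustrative.

```shell
# Feed the question's bytes through sed in the C locale,
# then hex-dump what comes out.
result=$(printf 'a12\217b12\237\n' |
  LC_ALL=C sed "$(printf 's/12[\217\237]//g')" |
  od -An -tx1 | tr -d ' \n')
echo "$result"   # hex of the remaining bytes
```

With both sequences removed, only a, b and the trailing newline should be left (61 62 0a).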

Note that you can't portably use sed to process data that contains NUL characters, that doesn't end in a newline character, or in which newline characters are more than a few kilobytes apart. Use perl -p/-n instead in that case.