In order to test that some code handles an invalid UTF-8 sequence properly, I need to insert an invalid UTF-8 sequence into an existing text file. However, my attempts to do so result in a valid UTF-8 sequence, not an invalid one.

The byte 0xff is always invalid in UTF-8, so I tried to insert it, following the instructions on the "Inserting Text" page of the Emacs manual:

C-q 377 <RET>

A funny-looking character got inserted into the file:

-OP INC       ÿ3713 SOUTHS
              ^

However, when I examine the contents of the file using the Unix command od -c <path>:

0001240   C  \t 303 277   3   7   1   3       S   O   U   T   H   S   H
                ^^^ ^^^

I can see that a valid UTF-8 sequence 303 277 (octal) has been inserted, not the single byte 377 that I tried to insert.
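The od output can be cross-checked outside Emacs. The following is a minimal Python sketch, not part of the original question, showing that the character U+00FF encodes to the octal bytes 303 277 in UTF-8, while the lone byte 0xff does not decode. The helper name `is_valid_utf8` is made up for illustration.

```python
# U+00FF (ÿ) encoded as UTF-8 yields the two bytes seen in the od output.
encoded = "\u00ff".encode("utf-8")
octal_bytes = [format(b, "o") for b in encoded]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(octal_bytes)                   # octal form of the encoded bytes
print(is_valid_utf8(encoded))        # the two-byte sequence is valid
print(is_valid_utf8(b"\xff"))        # the single byte 0xff is not
```

In other words, what reached the file was the character ÿ, correctly encoded, rather than the raw byte.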

How can I insert an invalid UTF-8 sequence into a file?

Wayne Conrad

2 Answers

C-q 377 RET inserts the character with octal code 377 (U+00FF, LATIN SMALL LETTER Y WITH DIAERESIS), which is then encoded as two bytes when the file is saved as UTF-8. If you want to insert a byte instead of a character, you can do it with:

M-: (insert (unibyte-string #o377)) RET

As @legoscia mentioned, Emacs will probably ask you to choose another coding system when you save a buffer containing such a byte, but you can choose utf-8 at that prompt anyway.

Stefan

After some trial and error, I managed to convert ÿ to the byte 377 using M-x recode-region, specifying that it was really in raw-text but was interpreted as latin-1.

When saving the file, Emacs didn't want to save it as UTF-8, offering to save it as raw-text instead, which seems to have had the desired effect.
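To confirm that the saved file really contains an invalid sequence, rather than relying on how an editor displays it, the bytes can be checked directly. This is a rough Python sketch under the assumption that the file is small enough to read into memory. The function name `first_invalid_utf8_offset` is hypothetical, not something from Emacs or this thread.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    Hypothetical helper: relies on the `start` attribute that Python sets
    on UnicodeDecodeError to report where decoding failed.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Sample bytes modeled on the line from the question: one with the raw
# byte 0xff, one with the valid two-byte encoding of U+00FF.
with_raw_byte = b"C\t\xff3713"
with_encoded_char = b"C\t\xc3\xbf3713"

print(first_invalid_utf8_offset(with_raw_byte))
print(first_invalid_utf8_offset(with_encoded_char))
```

For a real file, the bytes can be obtained with `pathlib.Path(path).read_bytes()` and passed to the same function. A result other than None suggests the raw byte was written as intended.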

legoscia