How to insert invalid UTF-8 sequence

Question

In order to test that some code handles an invalid UTF-8 sequence properly, I need to insert an invalid UTF-8 sequence into an existing text file. However, my attempts to do so result in a valid UTF-8 sequence, not an invalid one.

The byte 0xff is always invalid in UTF-8, so I tried to insert it, following the instructions on the "Inserting Text" page of the Emacs manual:

C-q 377 <RET>

A funny-looking character got inserted into the file:

-OP INC       ÿ3713 SOUTHS
              ^

However, when I examine the contents of the file using the Unix command od -c <path>:

0001240   C  \t 303 277   3   7   1   3       S   O   U   T   H   S   H
                ^^^ ^^^

I can see that a valid UTF-8 sequence 303 277 (octal) has been inserted, not the single byte 377 that I tried to insert.
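
The observation above can be reproduced outside Emacs. The following is a minimal Python sketch (Python is used here only for illustration; it is not part of the original question) showing that the character U+00FF encodes to the two bytes 303 277 (octal), while the lone byte 377 is rejected by a UTF-8 decoder. The variable names are illustrative.

```python
# U+00FF (octal 377) is a character; UTF-8 encodes it as two bytes.
encoded = "\u00ff".encode("utf-8")
octal_bytes = [format(b, "o") for b in encoded]
print(octal_bytes)  # ['303', '277']

# The lone byte 0xFF (octal 377) is not valid anywhere in UTF-8,
# so a strict decoder raises UnicodeDecodeError.
try:
    b"\xff".decode("utf-8")
    lone_ff_is_valid = True
except UnicodeDecodeError:
    lone_ff_is_valid = False
print(lone_ff_is_valid)  # False
```

This matches the od output: what was inserted was the character ÿ, stored as its UTF-8 encoding, rather than the raw byte.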

How can I insert an invalid UTF-8 sequence into a file?

score 9 · Answer 1 · answered Apr 10 '15 at 17:50

C-q 377 RET inserts the character whose code point is octal 377 (U+00FF, LATIN SMALL LETTER Y WITH DIAERESIS), which UTF-8 encodes as the two bytes 303 277. To insert a raw byte instead of a character, evaluate:

M-: (insert (unibyte-string #o377)) RET

As @legoscia mentioned, Emacs will probably ask you to choose another coding system when you save the buffer after inserting such a byte, but you can choose utf-8 at that prompt anyway.

score 5 · Accepted Answer · answered Apr 10 '15 at 14:48

After some trial and error, I managed to convert ÿ to the byte 377 using M-x recode-region, specifying that it was really in raw-text but was interpreted as latin-1.

When saving the file, Emacs didn't want to save it as UTF-8, offering to save it as raw-text instead, which seems to have had the desired effect.
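
One way to confirm that the saved file really contains an invalid sequence, rather than relying on how it looks in the editor, is to decode its bytes strictly. The sketch below is an assumption-laden illustration in Python, not something from the original answer: the function name `first_invalid_utf8_offset` and the sample byte strings are hypothetical, modelled on the line shown in the question.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Hypothetical sample data modelled on the line from the question.
valid = "INC\t\u00ff3713 SOUTHS".encode("utf-8")  # contains 303 277
invalid = b"INC\t\xff3713 SOUTHS"                 # contains a lone 377

print(first_invalid_utf8_offset(valid))    # None
print(first_invalid_utf8_offset(invalid))  # 4
```

For a real file, the same function can be applied to the result of reading the file in binary mode, for example `open(path, "rb").read()`, where `path` stands for whatever file was edited.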

legoscia

I picked this answer because it tab completes: it's easier to type. – Wayne Conrad Apr 24 '15 at 15:53
