13

I have some UTF-8-encoded text files which display strange escape codes in Emacs. For instance, this text:

In ista quaestione primo exponam quid intelligendum est per hoc nomen ‘Deus’; secundo, respondebo ad quaestionem.

shows up like this in Emacs:

[Screenshot: the Emacs buffer showing octal escape codes such as \342\200\230 in place of the quotation marks]

This only happens in Emacs. Other editors show the text correctly. How can I fix this problem?


Update 1

If I call revert-buffer-with-coding-system and select utf-8, the file gets read correctly. So, as Gilles has correctly guessed, Emacs isn't detecting the file encoding. If I add the line ; -*- coding: utf-8 -*- to the file, Emacs opens and displays it correctly.


Update 2

I re-encoded the file as "UTF-8 with BOM", and now it displays all right in Emacs. I don't know what the difference between the two types is, but Emacs seems to be aware of the BOMed one only.

NVaughan

4 Answers

11

For some reason, Emacs isn't recognizing the file as UTF-8. You can force Emacs to reopen the file as UTF-8 by running the command C-x RET r (revert-buffer-with-coding-system) and entering utf-8.

The reason why Emacs didn't recognize this file as UTF-8 (but recognizes others) is likely that it contains some invalid UTF-8 sequence. After reinterpreting the file as UTF-8, such a sequence will still appear as a backslash followed by three octal digits in a different color (the escape-glyph face). You can search for such a sequence by running C-M-s (isearch-regexp) and looking for

[^^@-~[:multibyte:]]

where ^@ is entered by typing C-q C-SPC (it's the character ^@ = 0, not the two-character sequence circumflex-at; the character before it is the circumflex character).
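The same check can be done outside Emacs. The following is a minimal Python sketch, not part of the original answer: the helper name find_invalid_utf8 is hypothetical, and it simply reports the offsets of bytes that a strict UTF-8 decoder rejects. It also shows that the octal escapes \342\200\230 are the three bytes of the UTF-8 encoding of U+2018, the left single quotation mark.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, byte) pairs for bytes a strict UTF-8 decoder rejects."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            # Decode the remainder; success means nothing invalid is left.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice being decoded.
            start = pos + err.start
            end = pos + err.end
            bad.extend((i, data[i]) for i in range(start, end))
            pos = end  # resume right after the offending bytes
    return bad


# The octal escapes Emacs displays are the raw bytes of a valid UTF-8 sequence.
escapes = bytes([0o342, 0o200, 0o230])
quote = escapes.decode("utf-8")
print(hex(ord(quote)))                    # 0x2018
print(find_invalid_utf8(escapes))         # []
print(find_invalid_utf8(b"ab\xffcd"))     # [(2, 255)]
```

To check a real file, read it in binary mode and pass the bytes to the helper. An empty result means the file is valid UTF-8 and the detection problem lies elsewhere.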

You can force Emacs to recognize the file as UTF-8 by adding a coding system file variable: put something like -*-coding: utf-8-*- on the first line, or put something like this near the end of the file (you can replace # by any prefix, but Local Variables: and End: must appear exactly like this with the trailing colon):

# Local Variables:
# coding: utf-8
# End:

Emacs chooses the encoding used to interpret a file based on several settings, primarily the language environment and the variables auto-coding-alist and auto-coding-regexp-alist. Since you have the same problem with this file even when running emacs -Q, I think that this isn't an issue with those settings, but with the file content.

  • If I open the file without the coding system file variable (i.e. when the file displays wrongly) and run the regex search, all of my `\342`, `\200`, `\230`, etc. get selected. But if I open it "correctly" (using the coding variable), then no search results appear. – NVaughan Mar 22 '15 at 19:52
  • @NVaughan Hmmm. Then I don't understand why this file isn't recognized as UTF-8 when others are (especially under `emacs -Q`). – Gilles 'SO- stop being evil' Mar 22 '15 at 20:14
2

It's late to answer the question about the BOM, but I'll do it anyhow.

The byte order mark (BOM) is a sequence of three bytes, \xef\xbb\xbf, which, at the beginning of a file, indicates to systems and applications that the contents are encoded as UTF-8. Strictly speaking, these bytes are metadata and are not treated as part of the contents.

Some applications honor the BOM and write all UTF-8 files with it. Emacs honors it when reading (it selects the utf-8-with-signature coding system and keeps the BOM on saving), but it does not add one to files saved as plain utf-8. Other applications may honor it in reading, but not write it; and others don't know about it and may throw an error message when they encounter it. In other words, the situation is messy. I prefer to use it wherever possible.
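To make the difference between the two variants concrete, here is a small Python sketch (an illustration added here, with hypothetical helper names has_utf8_bom and strip_utf8_bom). It relies on Python's standard utf-8-sig codec, which writes the BOM when encoding and drops it when decoding.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes b'\xef\xbb\xbf'


def has_utf8_bom(data: bytes) -> bool:
    """True if the byte string starts with the UTF-8 byte order mark."""
    return data.startswith(BOM)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading BOM, if one is present."""
    return data[len(BOM):] if has_utf8_bom(data) else data


text = "\u2018Deus\u2019"
plain = text.encode("utf-8")         # UTF-8 without BOM
with_bom = text.encode("utf-8-sig")  # UTF-8 with BOM

print(has_utf8_bom(plain))           # False
print(has_utf8_bom(with_bom))        # True
print(len(with_bom) - len(plain))    # 3
```

Decoding the BOMed bytes with plain utf-8 leaves a U+FEFF character at the start of the string, which is why tools that don't know about the BOM can misbehave.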

1

For UNIX-like systems only.

In many cases the straightforward encoding definition in ~/.bashrc or ~/.bash_profile

LANG=en_US.UTF-8

accompanied by

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8

in ~/.profile should solve your issue.

P.S. After making these changes you need to log out and log back in for them to take effect.
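Since several variables are involved, it can help to see which one actually takes effect. The sketch below is an illustration under stated assumptions, not part of the original answer: the helper names effective_locale and names_utf8 are hypothetical, and the lookup follows the POSIX precedence for character handling (LC_ALL first, then LC_CTYPE, then LANG), treating empty values as unset.

```python
import os


def effective_locale(env=None):
    """Return (variable, value) for the first non-empty locale variable, or None."""
    env = os.environ if env is None else env
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # empty strings count as unset
            return name, value
    return None


def names_utf8(locale_value: str) -> bool:
    """True if the codeset part of a value like 'en_US.UTF-8' is UTF-8."""
    # Codeset sits between '.' and an optional '@modifier'.
    codeset = locale_value.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


print(effective_locale({"LANG": "en_US.UTF-8"}))
print(effective_locale({"LANG": "en_US.UTF-8", "LC_ALL": "C"}))
print(names_utf8("en_US.UTF-8"))
```

Calling effective_locale() with no argument inspects the current environment, which is a quick way to confirm after logging back in that the new settings were picked up.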

Alioth
  • Although what you say may be useful, this does not appear to answer this question, since the problem was only with **some** utf-8 files. – JeanPierre Aug 14 '16 at 12:06
  • I suppose that after a strict encoding definition in the configuration files this problem might disappear for **all** files forever and ever :-) – Alioth Aug 15 '16 at 07:36
0

On Windows, Emacs started recognizing and displaying characters correctly once I set the environment variable suggested by @Alioth:

LANG=en_US.UTF-8
kisp