How to display Unicode UTF-8 as Unicode?

Question

I have some UTF-8-encoded text files which display strange escape codes in Emacs. For instance, this text:

In ista quaestione primo exponam quid intelligendum est per hoc nomen ‘Deus’; secundo, respondebo ad quaestionem.

Shows like this in Emacs:

[Screenshot: the same text, with the curly quotes shown as octal escapes such as \342\200\230]

This only happens in Emacs. Other editors show the text correctly. How can I fix this problem?

Update 1

If I call revert-buffer-with-coding-system and select utf-8, the file gets read correctly. So, as Gilles has correctly guessed, Emacs isn't detecting the file encoding. If I add the line ; -*- coding: utf-8 -*- to the file, Emacs opens and displays it correctly.

Update 2

I re-encoded the file as "UTF-8 with BOM", and now it displays all right in Emacs. I don't know what the difference between the two encodings is, but Emacs seems to recognize only the one with the BOM.

Emacs isn't recognizing the file as UTF-8. What is the content of your init file? What version of Emacs are you running? Does it change anything if you start Emacs with `emacs -q` or `emacs -Q`? — Gilles 'SO- stop being evil', Mar 22 '15 at 19:06
I have no problem with other UTF-8 files. I'm running GNU Emacs 24.4.4. No difference with `emacs -q` or `emacs -Q`. — NVaughan, Mar 22 '15 at 19:17
Ah, if it works with other files and in a pristine configuration then the reason is probably that the file also contains invalid UTF-8 somewhere. Let me see how to tell with Emacs... — Gilles 'SO- stop being evil', Mar 22 '15 at 19:21
possibly related: http://emacs.stackexchange.com/q/4100/2264 — Sean Allred, Mar 22 '15 at 19:35

score 11 · Accepted Answer · answered Mar 22 '15 at 19:38

For some reason, Emacs isn't recognizing the file as UTF-8. You can force Emacs to reopen the file as UTF-8 by running the command C-x RET r (revert-buffer-with-coding-system) and entering utf-8.

The reason why Emacs didn't recognize this file as UTF-8 (but recognizes others) is likely that it contains some invalid UTF-8 sequence. Such a sequence will still appear as a backslash followed by three octal digits in a different color (the escape-glyph face) after the file is reinterpreted as UTF-8. You can search for such a sequence by running C-M-s (isearch-forward-regexp) and looking for

[^^@-~[:multibyte:]]

where ^@ is entered by typing C-q C-SPC (it's the character ^@ = 0, not the two-character sequence circumflex-at; the character before it is the circumflex character).
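To cross-check outside Emacs, a short Python sketch can report where a file stops being valid UTF-8. This is only an illustration of the idea, not what Emacs does internally; the function name find_invalid_utf8 and the sample bytes are made up for the example.

```python
from __future__ import annotations


def find_invalid_utf8(data: bytes) -> list[tuple[int, bytes]]:
    """Return (offset, offending bytes) for each invalid UTF-8 sequence in data."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice, so shift by pos
            start, end = pos + err.start, pos + err.end
            problems.append((start, data[start:end]))
            pos = end  # resume right after the bad bytes
    return problems


# Hypothetical sample: valid text followed by a stray 0xFF byte
sample = "nomen ‘Deus’".encode("utf-8") + b"\xff ok"
print(find_invalid_utf8(sample))
```

For a real file, read it in binary mode (open(path, "rb")) and pass the bytes to the function; an empty list means the whole file is valid UTF-8.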

You can force Emacs to recognize the file as UTF-8 by adding a coding system file variable: put something like -*-coding: utf-8-*- on the first line, or put something like this near the end of the file (you can replace # by any prefix, but Local Variables: and End: must appear exactly like this with the trailing colon):

# Local Variables:
# coding: utf-8
# End:

Emacs chooses the encoding in which to interpret a file based on several settings, primarily the language environment and the variables auto-coding-alist and auto-coding-regexp-alist. Since you have the same problem with this file even when running emacs -Q, I think that this isn't an issue with those settings, but with the file content.
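For reference, the escapes seen in the undecoded buffer are just the UTF-8 bytes of the non-ASCII characters written in octal. The sketch below reproduces that view in Python; raw_byte_view is a hypothetical helper name, and it only approximates how Emacs renders raw bytes.

```python
def raw_byte_view(text: str) -> str:
    """Show ASCII as-is and every non-ASCII UTF-8 byte as a backslash plus three octal digits."""
    return "".join(
        chr(byte) if byte < 0x80 else f"\\{byte:03o}"
        for byte in text.encode("utf-8")
    )


print(raw_byte_view("‘Deus’"))  # \342\200\230Deus\342\200\231
```

The left quote U+2018 is encoded as the bytes E2 80 98, which are 342, 200 and 230 in octal.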

If I open the file without the coding system file variable (i.e. when the file displays wrongly) and run the regex search, all of my `\342`, `\200`, `\230`, etc. get selected. But if I open it "correctly" (using the coding variable), then no search results appear. — NVaughan, Mar 22 '15 at 19:52
@NVaughan Hmmm. Then I don't understand why this file isn't recognized as UTF-8 when others are (especially under `emacs -Q`). — Gilles 'SO- stop being evil', Mar 22 '15 at 20:14

score 2 · Answer 2 · answered Jan 21 '18 at 14:26

It's late to answer the question about the BOM, but I'll do it anyhow.

The byte order mark (BOM) is a sequence of three bytes \xef\xbb\xbf which, at the beginning of a file, indicates to systems and applications that the contents are encoded as UTF-8. Properly they're metadata, not treated as part of the contents.
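As a small illustration, assuming Python is at hand, the three bytes can be checked for and stripped like this. The helper names are invented for the example; codecs.BOM_UTF8 and the "utf-8-sig" codec are part of the standard library.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the UTF-8 byte order mark EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8, dropping a leading BOM if there is one."""
    # "utf-8-sig" behaves like "utf-8" when no BOM is present
    return data.decode("utf-8-sig")


with_bom = "Deus".encode("utf-8-sig")  # encoding with this codec prepends the BOM
without_bom = "Deus".encode("utf-8")

print(has_utf8_bom(with_bom), has_utf8_bom(without_bom))
print(decode_utf8(with_bom) == decode_utf8(without_bom))
```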

Many applications -- Emacs is one of them -- honor the BOM when reading a file. Emacs keeps the BOM when saving a file that already has one (coding system utf-8-with-signature), but the plain utf-8 coding system does not add it. Other applications may honor it in reading, but not write it; and others don't know about it and may throw an error message when they encounter it. In other words, the situation is messy. I prefer to use it wherever possible.

score 1 · Answer 3 · answered Aug 14 '16 at 10:01

For UNIX-like systems only.

In many cases, a straightforward encoding definition in ~/.bashrc or ~/.bash_profile

LANG=en_US.UTF-8

together with

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8

in ~/.profile should solve your issue.

P.S. After making these changes, you need to log out and log back in for them to take effect.
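To see which of these variables actually wins, here is a minimal Python sketch of the usual POSIX precedence for the character-encoding category (LC_ALL, then LC_CTYPE, then LANG). The function names are made up for the example, and the check for a UTF-8 codeset is a simple heuristic on the locale name, not a query of the system's locale database.

```python
import os


def effective_ctype_locale(env) -> str:
    """Return the locale name that governs character encoding, per POSIX precedence."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # unset or empty variables are skipped
            return value
    return "C"  # typical default when nothing is set


def looks_like_utf8(locale_name: str) -> bool:
    """Heuristic: does the codeset part (after '.', before '@') name UTF-8?"""
    codeset = locale_name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


current = effective_ctype_locale(os.environ)
print(current, looks_like_utf8(current))
```

Run it from the same shell that starts Emacs; if the second value printed is False, the exports above have not taken effect in that session.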

Alioth


Although what you say may be useful, this does not appear to answer this question, since the problem was only with **some** UTF-8 files. – JeanPierre Aug 14 '16 at 12:06
I suppose that with a strict encoding definition in the configuration files, this problem might disappear for **all** files forever and ever :-) – Alioth Aug 15 '16 at 07:36

score 0 · Answer 4 · answered Jun 02 '22 at 07:29

On Windows, Emacs started recognizing and displaying the characters correctly after I set the environment variable suggested by @Alioth:

LANG=en_US.UTF-8

kisp

