9

I consolidated many text files (win, mac, unix) into a single orgmode file. For some characters I was seeing numbers instead of the right characters. Things like \314\203.

I used "revert-buffer-with-coding-system" and chose utf-8-hfs-unix. That fixed it.

But now every time I save, Emacs asks me to choose a coding system. If I choose raw-text it stops asking but when I open the file again the numbers are back.

How do I fix this?

Jason Mirk
  • 2
    How about adding `-*- coding: utf-8-hfs-unix;-*-` in the first line of your file? (https://www.gnu.org/software/emacs/manual/html_node/emacs/Specify-Coding.html#Specify-Coding) – JeanPierre Jan 19 '16 at 00:12
  • This partially worked. No more numbers. The problem is that when I save, Emacs asks me again for a coding system. Only if I choose raw-text can I save without it asking me every time. – Jason Mirk Jan 19 '16 at 00:19
  • 2
    Could it be that your file contains chars from multiple incompatible charsets? – JeanPierre Jan 19 '16 at 00:33
  • I think so but how do I fix it? – Jason Mirk Jan 19 '16 at 14:25
  • 1
    I had trouble the first few days using Emacs (a few years ago), but I adopted the approach in the following link and have never looked back -- **How to reset emacs to save files in utf-8-unix character encoding?** -- http://stackoverflow.com/a/20736147/2112489 It is similar to the previous answer by elethen, but has some stuff that I added a few years ago. There are still some special characters, however, that trigger a prompt. Since I so rarely encounter those characters (usually when editing a file after optical character recognition), I never spent more time on the issue. – lawlist Jan 23 '16 at 01:02
  • My guess is that you have a mix of DOS and Unix line endings. Can you select the whole buffer and run `dos2unix` on it? `C-x h` `C-u M-| dos2unix RET`. – Kaushal Modi Jan 23 '16 at 02:29
  • What does the `file` command-line tool say about your file? (`file myfile` in a shell) – JeanPierre Jan 27 '16 at 14:17

1 Answer

8

This happened to me for a while too, before I had an idea of what was going on. Here's an example of how something like this can happen (I'm on Windows, in case it's something specific to this build):

Let's say you have a file that's encoded in UTF-8, and you paste some text from a website that's encoded with the Latin-1 or Windows-1252 code page, e.g. an O with an umlaut, or curly quotes.

Now you have a sequence of UTF-8 encoded characters followed by something which either doesn't make sense as UTF-8 or may be misinterpreted. If Emacs can't interpret a byte as part of a valid UTF-8 sequence, it displays the raw value, e.g. octal \326 (which is an O with an umlaut in the Latin-1 code page). This is because in UTF-8, \326 is the lead byte of a two-byte sequence, so it is supposed to be followed by a byte with 10 in the highest two bits, and if it's not, the decoder doesn't know what to do with it.
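To make the bit pattern concrete, here is a small Python sketch. It is only an illustration: Python's UTF-8 codec stands in for Emacs's decoder, and the helper names are made up for this example.

```python
# Illustration only: Python's UTF-8 codec stands in for Emacs's decoder.
# All function names here are hypothetical helpers, not Emacs or Python APIs.

def is_two_byte_lead(byte: int) -> bool:
    """True if the top three bits are 110 (lead byte of a 2-byte sequence)."""
    return byte & 0b11100000 == 0b11000000


def is_continuation(byte: int) -> bool:
    """True if the top two bits are 10 (UTF-8 continuation byte)."""
    return byte & 0b11000000 == 0b10000000


def first_invalid_offset(data: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


lone = b"\xd6"                        # octal \326
as_latin1 = lone.decode("latin-1")    # O with umlaut (U+00D6)
as_utf8 = "\u00d6".encode("utf-8")    # the same character, properly encoded

print(oct(lone[0]))                                    # 0o326
print(is_two_byte_lead(lone[0]))                       # True
print(is_continuation(ord("x")))                       # False
print(first_invalid_offset(b"abc" + lone + b"x"))      # 3
print(first_invalid_offset(b"abc" + as_utf8 + b"x"))   # None
```

The stray Latin-1 byte is rejected because the `x` after it is not a continuation byte, while the two-byte form `\xc3\x96` decodes without complaint.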

E.g. if you were to go to https://www.gnu.org/software/emacs/manual/html_node/emacs/Intro.html#Intro and copy some text that included curly quotes, like "The ‘G’ in GNU", and pasted it into a UTF-8 encoded buffer, you'd end up with "The \221G\222 in GNU".
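The \221 and \222 are simply the Windows-1252 byte values for the two curly quotes, written in octal. A rough Python sketch of that mapping follows; `octal_escapes` is a hypothetical helper that only approximates how Emacs shows raw bytes.

```python
# Sketch: where \221 and \222 come from. Python's "cp1252" codec is used
# as a stand-in for the Windows-1252 code page.

def octal_escapes(data: bytes) -> str:
    """Show ASCII bytes as characters and all other bytes as \\ooo escapes."""
    return "".join(chr(b) if b < 0x80 else "\\%03o" % b for b in data)


text = "The \u2018G\u2019 in GNU"     # curly single quotes around G

as_cp1252 = text.encode("cp1252")     # one byte per quote: 0x91 and 0x92
as_utf8 = text.encode("utf-8")        # three bytes per quote

print(octal_escapes(as_cp1252))       # The \221G\222 in GNU
print(octal_escapes(as_utf8))         # The \342\200\230G\342\200\231 in GNU

# The cp1252 bytes are not valid UTF-8: 0x91 is a continuation byte
# with no lead byte in front of it.
try:
    as_cp1252.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)                     # False
```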

So... what to do?

For one thing, you can look at the buffer with different coding systems to see if one of them displays those characters correctly, e.g. Windows-1252 and Latin-1 are fairly common:

M-x revert-buffer-with-coding-system windows-1252 RET
M-x revert-buffer-with-coding-system latin-1 RET

If the document looks better this way, you can save it with this new encoding. There are a lot of different coding systems though.
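The same trial-and-error can be sketched outside Emacs. The Python below is an assumption-laden illustration: the codec names are Python's, not Emacs coding-system names, and `decodes_cleanly` is a hypothetical helper. It also shows why you still have to look at the result: Latin-1 maps every byte to some character, so it never fails, even when the output is wrong.

```python
# Sketch: try several codecs on the same bytes and compare the results.

def decodes_cleanly(data: bytes, codec: str) -> bool:
    """True if every byte in data is valid under the given codec."""
    try:
        data.decode(codec)
    except UnicodeDecodeError:
        return False
    return True


# Sample bytes as a Windows-1252 source would produce them:
# "caf" + e-acute, then a word in curly single quotes.
sample = "caf\u00e9 \u2018quoted\u2019".encode("cp1252")

for codec in ("utf-8", "cp1252", "latin-1"):
    print(codec, decodes_cleanly(sample, codec))
# utf-8 False
# cp1252 True
# latin-1 True

# Latin-1 "succeeds", but turns the opening quote into the invisible
# control character U+0091 rather than a curly quote.
print(repr(sample.decode("latin-1")[5]))   # '\x91'
print(repr(sample.decode("cp1252")[5]))    # '‘'
```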

To go back to reading the file as UTF-8, just do

M-x revert-buffer-with-coding-system utf-8 RET

As for why this happens, I'm not sure - it would seem that Emacs would know how something was encoded in the clipboard and translate it accordingly, but it doesn't seem to do this.

For some more explanation, see https://stackoverflow.com/questions/1543613/how-does-utf-8-variable-width-encoding-work and http://kunststube.net/encoding/.
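If the file really is mostly UTF-8 with a few stray Windows-1252 bytes pasted in, one possible repair outside Emacs is to decode it as UTF-8 and fall back to Windows-1252 only for the bytes that fail. The Python below is a sketch under that assumption, not a general fix; the handler name `cp1252-fallback` and the function `repair_mixed` are invented for this example.

```python
import codecs

# Sketch: decode as UTF-8, falling back to Windows-1252 for invalid bytes.
# Assumes the stray bytes really are Windows-1252; that is a guess.

def _cp1252_fallback(err):
    """Error handler: reinterpret the undecodable bytes as Windows-1252."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Bytes undefined in cp1252 become U+FFFD instead of raising again.
    return bad.decode("cp1252", errors="replace"), err.end


codecs.register_error("cp1252-fallback", _cp1252_fallback)


def repair_mixed(data: bytes) -> str:
    """Return text from bytes that mix valid UTF-8 with cp1252 strays."""
    return data.decode("utf-8", errors="cp1252-fallback")


# "na" + i-diaeresis + "ve " in UTF-8, then O-umlaut and curly quotes
# in Windows-1252.
mixed = "na\u00efve ".encode("utf-8") + "\u00d6 \u2018G\u2019".encode("cp1252")

fixed = repair_mixed(mixed)
print(fixed)                    # naïve Ö ‘G’
clean = fixed.encode("utf-8")   # bytes that are now valid UTF-8 throughout
```

This heuristic can guess wrong: two adjacent Windows-1252 bytes may happen to form a valid UTF-8 sequence and will then be decoded as one unintended character, so check the result before saving over the original.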

Brian Burns
  • 2
    I opened it with Visual Studio Code and it read it perfectly. Saved, opened in emacs. Everything looks fine now! – Jason Mirk Jan 28 '16 at 01:53
  • @JasonMirk Interesting - maybe it took a guess at what coding system to use for any oddball characters, e.g. Latin-1, and translated them to UTF-8? – Brian Burns Jan 28 '16 at 02:18
  • I think so. It's all good man. Saul Goodman. – Jason Mirk Jan 28 '16 at 15:46
  • This is not the solution, but it helped, along with @JeanPierre's comments. The idea is to read the file with the desired encoding (M-x revert-buffer-with-coding-system), then search for non-ASCII characters in order to find those that remain invalid (M-x search-forward-regexp [[:nonascii:]] RET). – emagar Oct 17 '18 at 13:48