
In GUI Emacs 25.3.1, I have a text file in which accented characters show up as octal character codes. For example, there are many \212 characters that should be Š. If I use replace-string to fix the occurrences, the buffer looks fine, but after I exit, restart Emacs, and reopen the file I just fixed, it no longer displays properly: each location now holds two characters, for example each Š is now \305\240.

How can I view my text file with octal codes and see the proper accents in Emacs, or how can I convert it so that I can save and reload it in the future and still see the proper accents?


Following one answer, I ran C-x RET f windows-1252 RET in the original file, but it gives this error:

These default coding systems were tried to encode text
in the buffer ‘Sn Index.txt’:
  (windows-1252 (2 . 4194273) (32 . 4194246) (36 . 4194273) (46 . 4194273) (67
  . 4194246) (71 . 4194276) (87 . 4194186) (93 . 4194276) (132 . 4194186) (138
  . 4194276) (174 . 4194186))
However, each of them encountered characters it couldn’t encode:
  windows-1252 cannot encode these: \341 \306 \344 \212 ...

Click on a character (or switch to this window by ‘C-x o’
and select the characters by RET) to jump to the place it appears,
where ‘C-u C-x =’ will give information about it.

Select one of the safe coding systems listed below,
or cancel the writing with C-g and edit the buffer
   to remove or modify the problematic characters,
or specify any other coding system (and risk losing
   the problematic characters).

  raw-text no-conversion

The encoding reported by C-h C RET is the same for the original and modified files:

  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for terminal output:
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for inter-client cut and paste:
  nil
Defaults for subprocess I/O:
  decoding: U -- utf-8-unix (alias: mule-utf-8-unix)

  encoding: U -- utf-8-unix (alias: mule-utf-8-unix)


Priority order for recognizing coding systems when reading files:
  1. utf-8 (alias: mule-utf-8)
  2. iso-2022-7bit 
  3. iso-latin-1 (alias: iso-8859-1 latin-1)
  4. iso-2022-7bit-lock (alias: iso-2022-int-1)
  5. iso-2022-8bit-ss2 
  6. emacs-mule 
  7. raw-text 
  8. iso-2022-jp (alias: junet)
  9. in-is13194-devanagari (alias: devanagari)
  10. chinese-iso-8bit (alias: cn-gb-2312 euc-china euc-cn cn-gb gb2312)
  11. utf-8-auto 
  12. utf-8-with-signature 
  13. utf-16 
  14. utf-16be-with-signature (alias: utf-16-be)
  15. utf-16le-with-signature (alias: utf-16-le)
  16. utf-16be 
  17. utf-16le 
  18. japanese-shift-jis (alias: shift_jis sjis)
  19. chinese-big5 (alias: big5 cn-big5 cp950)
  20. undecided 

  Other coding systems cannot be distinguished automatically
  from these, and therefore cannot be recognized automatically
  with the present coding system priorities.

Particular coding systems specified for certain file names:

  OPERATION TARGET PATTERN      CODING SYSTEM(s)
  --------- --------------      ----------------
  File I/O      "\\.dz\\'"              (no-conversion . no-conversion)
                "\\.txz\\'"             (no-conversion . no-conversion)
                "\\.xz\\'"              (no-conversion . no-conversion)
                "\\.lzma\\'"            (no-conversion . no-conversion)
                "\\.lz\\'"              (no-conversion . no-conversion)
                "\\.g?z\\'"             (no-conversion . no-conversion)
                "\\.\\(?:tgz\\|svgz\\|sifz\\)\\'"
                                        (no-conversion . no-conversion)
                "\\.tbz2?\\'"           (no-conversion . no-conversion)
                "\\.bz2\\'"             (no-conversion . no-conversion)
                "\\.Z\\'"               (no-conversion . no-conversion)
                "\\.elc\\'"             utf-8-emacs
                "\\.el\\'"              prefer-utf-8
                "\\.utf\\(-8\\)?\\'"    utf-8
                "\\.xml\\'"             xml-find-file-coding-system
                "\\(\\`\\|/\\)loaddefs.el\\'"
                                        (raw-text . raw-text-unix)
                "\\.tar\\'"             (no-conversion . no-conversion)
                "\\.po[tx]?\\'\\|\\.po\\."
                                        po-find-file-coding-system
                "\\.\\(tex\\|ltx\\|dtx\\|drv\\)\\'"
                                        latexenc-find-file-coding-system
                ""                      (undecided)
  Process I/O   nothing specified
  Network I/O   nothing specified
Drew
WilliamKF
  • Specify your Emacs version and whether you are using terminal mode or a GUI version of Emacs. If it is recent enough then it is capable of displaying Unicode characters (and even if it is somewhat older, it is likely capable of displaying the accented characters you have). The problem is likely that the font you are currently using is not capable of displaying those characters. Change to another font. – Drew Apr 18 '18 at 03:54
  • `\212` is the windows-1252 encoding of Š, while `\305\240` is the UTF-8 one. Please post the output of `C-h C RET` for the original file and the edited one. – matteol Apr 18 '18 at 06:06
  • @drew Updated question per your request. – WilliamKF Apr 18 '18 at 18:07
  • @matteol Encoding added. – WilliamKF Apr 18 '18 at 18:10
  • How is the file created? Can you post a sample file, without any sensitive or private information? Include it in an archive (tar, zip, ...) to avoid any change in the content from a server trying to be smart. – matteol Apr 18 '18 at 18:52
  • It is just a .txt file I downloaded off a website in Slovenia. E.g., http://www.genealogy.si/srd_indeks/G.txt – WilliamKF Apr 18 '18 at 23:10
  • I've had better luck with `C-x RET r`: revert-buffer-with-coding-system (and then select windows-1252) – amitp Apr 19 '18 at 04:47

2 Answers

2

I guess the encoding of your file is not properly detected. Typing

C-x RET r windows-1252 RET

should display your file properly as it forces Emacs to interpret the file as windows-1252 and not utf-8 (the default).
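To make the byte-level picture concrete, here is a minimal Python sketch (not Emacs code) of what happens to the `\212` byte from the question. It assumes Python's `cp1252` codec is an adequate stand-in for Emacs's windows-1252 coding system; the variable names are only illustrative.

```python
# The byte the question shows as \212 (octal) is 0x8A.
raw = bytes([0o212])

# Interpreted as windows-1252 (Python codec name "cp1252"), it is the intended letter.
letter = raw.decode("cp1252")
print(letter)

# Written out as UTF-8, the same letter takes two bytes, \305\240 in octal.
utf8_bytes = letter.encode("utf-8")
print([oct(b) for b in utf8_bytes])

# A lone 0x8A is not valid UTF-8, so a UTF-8 reader cannot turn it into a character.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)
```

The two-byte result matches the `\305\240` pair the question reports seeing after the file was saved and reopened.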

If you want this file to always be opened with this encoding you can use a file-local variable.

Damien Cassou
1

You can tell Emacs the correct encoding by adding a file-local variable. Put

-*- coding: windows-1250 -*-

in the first line of the file or

;; Local Variables:
;; coding: windows-1250-dos
;; End:

at the end.

I used windows-1250 because the first line of the linked file reads "G bloiški, Ivan NOVAKOVIĆ" when decoded as windows-1250, but "G bloiški, Ivan NOVAKOVIÆ" when decoded as windows-1252.
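The difference between the two code pages can be checked outside Emacs with a short Python sketch. It assumes the last letter of the surname is stored as the single byte 0xC6 (the `\306` listed in the question's error message), and it uses Python's `cp1250` and `cp1252` codecs as stand-ins for the Emacs coding systems.

```python
# Assumption: the final letter of the surname is the single byte 0xC6 (\306 in octal).
raw_name = b"NOVAKOVI\xc6"

as_1250 = raw_name.decode("cp1250")  # Central European code page
as_1252 = raw_name.decode("cp1252")  # Western European code page
print(as_1250)
print(as_1252)

# The \212 byte (0x8A) decodes to the same letter in both code pages,
# so that byte alone cannot tell the two encodings apart.
s_1250 = bytes([0o212]).decode("cp1250")
s_1252 = bytes([0o212]).decode("cp1252")
print(s_1250, s_1252)
```

This is why Š looked right under windows-1252 even though the file as a whole fits windows-1250 better.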

matteol
  • The file for K doesn't work with windows-1250: http://www.genealogy.si/srd_indeks/K.txt How can we figure out the right encoding? These are likely all source files created on Windows, containing Slovenian surnames with the accents used in that language. – WilliamKF Apr 19 '18 at 20:45
  • It appears NOVAKOVIĆ is correct, but windows-1250 fails for the letter K file? – WilliamKF Apr 19 '18 at 20:47
  • I usually find the correct encoding by trial and error. I don't speak Slovenian; which errors do you find with windows-1250? – matteol Apr 20 '18 at 05:06
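The trial-and-error approach mentioned in the last comment could be sketched in Python as below. The helper name `try_encodings` and the candidate list are hypothetical choices for illustration, not something from the thread; the sample bytes are the `\212` and `\306` bytes from the question's error message.

```python
def try_encodings(raw, candidates=("utf-8", "cp1250", "cp1252", "iso8859_2")):
    """Decode raw bytes with each candidate codec; None marks a failed decode."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Sample: the \212 and \306 bytes from the question (0x8A and 0xC6).
sample = b"\x8a\xc6"
results = try_encodings(sample)
for name, text in results.items():
    # repr() keeps control characters and failures visible in the output.
    print(name, repr(text))
```

In practice one would pass a line or two read from the file in binary mode and judge by eye which decoding produces plausible surnames. A candidate that decodes without error is not necessarily the right one.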