5

TL;DR:

I have a string with characters that are unencodable in UTF-8, and I need to remove these characters before writing the string to a file.

For instance, the following snippet creates a file that contains an invalid UTF-8 byte sequence. I would like this sequence to just not be present in the output file.

(let ((coding-system-for-write 'utf-8))
  (write-region "This is invalid: \x19008e" nil "~/outputfile"))

By checking this character with describe-char, I can see Emacs knows it is not encodable in UTF-8. So there must be a way to detect these characters and remove them.
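A quick way to probe this programmatically (a sketch; `encode-coding-char` returns the encoded string, or nil when the coding system cannot safely encode the character):

```elisp
;; Returns the UTF-8 encoding of the char as a unibyte string,
;; or nil if the char cannot be encoded.
(encode-coding-char ?a 'utf-8)        ; => "a"
(encode-coding-char ?\x19008e 'utf-8) ; nil: per describe-char, not encodable
```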


The Background

I have an Emacs Lisp script which converts docstrings to HTML files, which must be entirely valid UTF-8. The following snippet shows essentially what I've been doing so far.

(with-temp-file output-file
  (insert symbol-name "\n" symbol-documentation)
  (set-buffer-file-coding-system 'no-conversion))

Where symbol-documentation is the string returned by describe-variable (the content of the Help buffer).

This seemed to work fine at first, but now I've realised this was causing some files to contain byte-sequences that are not valid UTF-8. Some perpetrators are tibetan-precomposition-rule-alist, language-info-alist, composition-function-table. By manually checking their documentation, I can see my Emacs indeed fails to comprehend part of the content.

Q: How can I ensure only valid UTF-8 byte sequences are written to the output file?
I'm perfectly fine with skipping these invalid sequences entirely (instead of trying to understand them). I just want to make sure the output file is valid.

I tried changing no-conversion to utf-8, but then my script gets interrupted by a "Select coding system:" prompt, and the problem still happens.

Malabarba
  • 22,878
  • 6
  • 78
  • 163
  • Maybe write with `utf-16` (this will work), and convert it to `utf-8` with some command-line tool? – abo-abo Dec 25 '14 at 19:29
  • @abo-abo Good idea, I'll try to look for a tool for that. – Malabarba Dec 25 '14 at 19:42
  • How large is the output and how often do you need to produce it? I've written once a UTF-8 encoder in Emacs Lisp, but if this has to be fast, you'd probably do better using some existing library, like `iconv` for example. – wvxvw Dec 25 '14 at 20:01
  • @wvxvw Yes, it kinda needs to be fast. I'm looking into iconv now. – Malabarba Dec 25 '14 at 20:03
  • `iconv -cf UTF-8 -t UTF-8 ./outputfile > ./converted` should do it, I believe. – wvxvw Dec 25 '14 at 20:21
  • Using `utf-8` coding system is The Right Thing. Using `no-conversion` is just an admission that you don't know what you're doing (because, really, there is always some conversion going on). – Stefan Dec 28 '14 at 17:17
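The `iconv` round trip suggested in the comments can be sketched as follows (a sketch with a throwaway demo file; the `-c` flag tells iconv to discard unconvertible sequences instead of aborting):

```shell
# Create a demo file containing an invalid UTF-8 byte (\200).
printf 'ok\200ok' > outputfile
# Round-trip through iconv: -c silently drops byte sequences that
# cannot be converted, so the invalid byte is stripped.
iconv -c -f UTF-8 -t UTF-8 outputfile > converted
cat converted  # prints "okok"
```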

2 Answers

5

Why not use the same mechanism that describe-char uses to determine whether a character is encodable in UTF-8, i.e. encode-coding-char? From its documentation:

Encode CHAR by CODING-SYSTEM and return the resulting string. If CODING-SYSTEM can't safely encode CHAR, return nil. The 3rd optional argument CHARSET, if non-nil, is a charset preferred on encoding.

(require 'subr-x)  ; for `string-join'

(defun my-remove-unencodeable-chars (string)
  (string-join
   (delq nil (mapcar (lambda (ch) (encode-coding-char ch 'utf-8 'unicode))
                     string))))
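For example, with the string from the question (the unencodable character is simply dropped):

```elisp
(my-remove-unencodeable-chars "This is invalid: \x19008e")
;; => "This is invalid: "
```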
Iqbal Ansari
  • 7,468
  • 1
  • 28
  • 31
  • 1
    Thanks, `encode-coding-char` is just what I was looking for. I've made a small edit to the code (`mapcar` can work directly on strings, no need to split first). – Malabarba Dec 26 '14 at 20:28
1

You want to let-bind coding-system-for-write to utf-8 around the call to write-region. This won't prompt you, even if there are characters that can't be encoded.
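A minimal sketch applying this to the snippet from the question (using the question's hypothetical variables, output-file etc.):

```elisp
;; `with-temp-file' writes via `write-region', so the binding applies.
(let ((coding-system-for-write 'utf-8))
  (with-temp-file output-file
    (insert symbol-name "\n" symbol-documentation)))
```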

Stefan
  • 26,154
  • 3
  • 46
  • 84