5

TL;DR:

I have a string with characters that are unencodable in UTF-8, and I need to remove these characters before writing the string to a file.

For instance, the following snippet creates a file that contains an invalid UTF-8 byte sequence. I would like this sequence to just not be present in the output file.

(let ((coding-system-for-write 'utf-8))
  (write-region "This is invalid: \x19008e" nil "~/outputfile"))

By checking this character with describe-char, I can see Emacs knows it is not encodable in UTF-8. So there must be a way to detect these characters and remove them.
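A quick way to probe this programmatically (a sketch; `encode-coding-char` returns the encoded string, or nil when the coding system cannot safely encode the character):

```elisp
;; Returns the UTF-8 encoding of the char as a unibyte string,
;; or nil if the char cannot be encoded.
(encode-coding-char ?a 'utf-8)        ; => "a"
(encode-coding-char ?\x19008e 'utf-8) ; nil: per describe-char, not encodable
```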


The Background

I have an Emacs Lisp script which converts docstrings to HTML files, which must be entirely valid UTF-8. The following snippet shows essentially what I've been doing so far.

(with-temp-file output-file
  (insert symbol-name "\n" symbol-documentation)
  (set-buffer-file-coding-system 'no-conversion))

Where symbol-documentation is the string returned by describe-variable (the content of the Help buffer).

This seemed to work fine at first, but now I've realised this was causing some files to contain byte-sequences that are not valid UTF-8. Some perpetrators are tibetan-precomposition-rule-alist, language-info-alist, composition-function-table. By manually checking their documentation, I can see my Emacs indeed fails to comprehend part of the content.

Q: How can I ensure only valid UTF-8 byte sequences are written to the output file?
I'm perfectly fine with skipping these invalid sequences entirely (instead of trying to understand them). I just want to make sure the output file is valid.

I tried changing no-conversion to utf-8, but then my script gets interrupted by a "Select coding system:" prompt, and the problem still happens.

Malabarba
  • 22,878
  • 6
  • 78
  • 163
  • Maybe write with `utf-16` (this will work), and convert it to `utf-8` with some command-line tool? – abo-abo Dec 25 '14 at 19:29
  • @abo-abo Good idea, I'll try to look for a tool for that. – Malabarba Dec 25 '14 at 19:42
  • How large is the output and how often do you need to produce it? I've written once a UTF-8 encoder in Emacs Lisp, but if this has to be fast, you'd probably do better using some existing library, like `iconv` for example. – wvxvw Dec 25 '14 at 20:01
  • @wvxvw Yes, it kinda needs to be fast. I'm looking into iconv now. – Malabarba Dec 25 '14 at 20:03
  • `iconv -cf UTF-8 -t UTF-8 ./outputfile > ./converted` should do it, I believe. – wvxvw Dec 25 '14 at 20:21
  • Using `utf-8` coding system is The Right Thing. Using `no-conversion` is just an admission that you don't know what you're doing (because, really, there is always some conversion going on). – Stefan Dec 28 '14 at 17:17
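The `iconv` round trip suggested in the comments can be sketched as follows (a sketch with a throwaway demo file; the `-c` flag tells iconv to discard unconvertible sequences instead of aborting):

```shell
# Create a demo file containing an invalid UTF-8 byte (\200).
printf 'ok\200ok' > outputfile
# Round-trip through iconv: -c silently drops byte sequences that
# cannot be converted, so the invalid byte is stripped.
iconv -c -f UTF-8 -t UTF-8 outputfile > converted
cat converted  # prints "okok"
```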

2 Answers

5

Why not use the same mechanism that describe-char uses to determine whether a character is encodable in UTF-8, i.e. encode-coding-char? From its documentation:

Encode CHAR by CODING-SYSTEM and return the resulting string. If CODING-SYSTEM can't safely encode CHAR, return nil. The 3rd optional argument CHARSET, if non-nil, is a charset preferred on encoding.

(require 'subr-x)  ; for `string-join'

(defun my-remove-unencodeable-chars (string)
  (string-join
   (delq nil (mapcar (lambda (ch) (encode-coding-char ch 'utf-8 'unicode))
                     string))))
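For example, with the string from the question (the unencodable character is simply dropped):

```elisp
(my-remove-unencodeable-chars "This is invalid: \x19008e")
;; => "This is invalid: "
```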
Iqbal Ansari
  • 7,468
  • 1
  • 28
  • 31
  • 1
    Thanks, `encode-coding-char` is just what I was looking for. I've made a small edit to the code (`mapcar` can work directly on strings, no need to split first). – Malabarba Dec 26 '14 at 20:28
1

You want to let-bind coding-system-for-write to utf-8 around the call to write-region. This won't prompt you, even if there are characters that can't be encoded.
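A minimal sketch applying this to the snippet from the question (using the question's hypothetical variables, output-file etc.):

```elisp
;; `with-temp-file' writes via `write-region', so the binding applies.
(let ((coding-system-for-write 'utf-8))
  (with-temp-file output-file
    (insert symbol-name "\n" symbol-documentation)))
```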

Stefan
  • 26,154
  • 3
  • 46
  • 84