In an ongoing project, we are inserting multibyte text into a buffer and experiencing strange behavior. When the question is first viewed, we receive something like:

Quote Environment à la Strunk\342\200\231s \342\200\230The Elements of Style\342\200\231

which, obviously, should be

Quote Environment à la Strunk’s ‘The Elements of Style’

Note the special quotes; reference question here. Interestingly, using

(set-buffer-multibyte nil)
(set-buffer-multibyte t)

in succession will fix the issue. Note not one or the other, but both in this order. What is going on here? It's interesting to note that à is correctly displayed.

Shameless plug: that ongoing project posted this question.

Sean Allred
  • Additionally, we tried calling string-as-unibyte or string-as-multibyte on the string right before inserting it, but neither fixed it. – Malabarba Dec 04 '14 at 01:31
  • Possibly related but hopefully not: http://emacs.stackexchange.com/q/2931/2264 – Sean Allred Dec 04 '14 at 01:32
  • Unless you really know what you're doing, `string-as-unibyte` and `string-as-multibyte` are never the right answer. Their existence is actually a nuisance more than anything since it encourages people to have misconceived notions about "multibyte". – Stefan Dec 04 '14 at 14:00
  • The above display seems to indicate that you inserted a string of *bytes* into a normal buffer (which contains *characters* rather than *bytes*). Where does the string that you insert come from? – Stefan Dec 04 '14 at 14:01
  • @Stefan See [this line](https://github.com/vermiculus/sx.el/blob/master/sx-request.el#L134). I don't believe `string-as-unibyte` is the culprit. As I understand [the underlying function](https://github.com/vermiculus/sx.el/blob/master/sx-encoding.el#L134), it has no side-effects. It is the content portion of `(buffer-string)` of a `url` response buffer. – Sean Allred Dec 04 '14 at 14:04
  • @SeanAllred: I didn't mean to say that `string-as-unibyte` is the culprit. But adding it will just confuse the matter. As a general rule, stay away from it. – Stefan Dec 04 '14 at 14:10
  • @Stefan Is there an appropriate alternative in this case? – Sean Allred Dec 04 '14 at 14:11
  • To convert between "unibyte" (i.e. an array of bytes) and "multibyte" (aka a string of characters), use `decode-coding-string` and `encode-coding-string`. – Stefan Dec 04 '14 at 16:05

1 Answer

IIUC the problem is that the string you insert is undecoded, i.e. it's a string of bytes you received over the network. So you need to find out what character encoding is used/assumed by the sender (typically utf-8 these days), then call decode-coding-string with this encoding, and then insert the string it returns (which will be a string of characters).
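As a minimal sketch of that decode-then-insert step: the helper name `my/insert-decoded` and the default of `utf-8` are assumptions made for illustration, not something taken from your code. The octal escapes in the usage example stand in for the raw bytes you received.

```elisp
(defun my/insert-decoded (bytes &optional coding)
  "Decode the unibyte string BYTES with CODING and insert it at point.
CODING defaults to `utf-8', which is only an assumption about the sender."
  ;; `decode-coding-string' turns a string of bytes into a string of
  ;; characters; only the decoded result is inserted into the buffer.
  (insert (decode-coding-string bytes (or coding 'utf-8))))

;; Usage: the bytes \342\200\231 are the UTF-8 encoding of the
;; right single quotation mark.
(with-temp-buffer
  (my/insert-decoded "Strunk\342\200\231s")
  (buffer-string))
```

If the sender turns out to use a different encoding, pass that coding system as the second argument instead of relying on the default.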

This said, maybe this is not The Right Solution: if your string arrives over the network via HTTP, it may very well be that the encoding is actually provided by the server in the response headers, so that the URL package should do the decoding for you. In that case, maybe you're hitting a bug (or a lack of functionality) in the URL package. To figure that out, we'd need to see the header of the HTTP response, the way you call the URL package and the answer you get.
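Until that is sorted out, the encoding named in the header could be picked up by hand. The following is only a sketch: both function names are hypothetical, the regexp handles just the simple `charset=NAME` form, and falling back to `utf-8` is an assumption rather than anything the URL package does.

```elisp
(defun my/content-type-charset (content-type)
  "Return the coding system named in the header value CONTENT-TYPE, or nil.
CONTENT-TYPE is a string such as \"application/json; charset=utf-8\"."
  (when (string-match "charset=\"?\\([^\"; \t]+\\)" content-type)
    (let ((coding (intern (downcase (match-string 1 content-type)))))
      ;; Only return names that Emacs knows as coding systems.
      (and (coding-system-p coding) coding))))

(defun my/decode-body (body content-type)
  "Decode the unibyte string BODY using the charset in CONTENT-TYPE.
Fall back to `utf-8' (an assumption) when no usable charset is given."
  (decode-coding-string body
                        (or (my/content-type-charset content-type)
                            'utf-8)))

;; Usage with a hypothetical response body and header value.
(my/decode-body "\342\200\230The Elements of Style\342\200\231"
                "application/json; charset=utf-8")
```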

Stefan
  • Thanks Stefan, this worked. For the record, the header said `Content-Type: application/json; charset=utf-8`, so decoding with `utf-8` was all that was necessary. I agree that might be something the url package should do. – Malabarba Dec 20 '14 at 18:18
  • @Malabarba: You might like to report it as a bug (tho IIUC we can't just silently automatically decode, so it would have to be provided as an option or an extra function). – Stefan Dec 21 '14 at 02:08