
I recently ran into the problem of decoding HTML entities. I have the following two strings (note that two encoding methods are used: named and numbered).

The old &quot;how to fold xml&quot; question
Babel doesn&#39;t wrap results in verbatim

And I need to convert them to

The old "how to fold xml" question
Babel doesn't wrap results in verbatim

Searching around, I found this old question on SO (which is what I'm doing for the moment), but I refuse to believe Emacs has no built-in way of doing this. We have several web browsers, at least two of which I know are built-in, not to mention mail clients and feed readers.

Is there not a built-in way of decoding html entities?
I'm looking for a function that takes a string from the first example and returns a string from the second example.

Malabarba
  • If there is anything, I bet it must be in the nxml code since it is able to parse DTDs and can validate entities in the document. – wasamasa Nov 05 '14 at 09:55
  • `libxml-parse-html-region` does this, of course, but it may do more than you want, in that it parses HTML tags as well… (And not all Emacs are built with LibXML support either, I guess). –  Nov 05 '14 at 09:56

2 Answers


Emacs includes a pure-Elisp XML parser in `xml.el`, whose `xml-parse-string` function does the job, though it looks like an undocumented internal function. I'm not sure whether there are any HTML-only entities that won't be handled properly when the string is treated as an XML fragment.

This wrapper function stops at the first tag in the input string and silently drops everything from there on, although you could make it stricter:

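(require 'xml)

(defun decode-entities (html)
  "Decode entity references in the string HTML."
  (with-temp-buffer
    ;; Insert the text, leaving point at the start of the buffer.
    (save-excursion (insert html))
    (xml-parse-string)))

(decode-entities "The old &quot;how to fold xml&quot; question")
;; => "The old \"how to fold xml\" question"

(decode-entities "doesn&#39;t")
;; => "doesn't"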

(decode-entities "string with trailing tag: <tag/>")
;; => "string with trailing tag: "
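As for making it stricter, here is a minimal sketch of one way to do that. It relies on `xml-parse-string` leaving point at the first `<` it encounters, so anything left in the buffer afterwards means the input was not plain character data. The name `my/decode-entities-strict` is made up for illustration and is not part of `xml.el`.

```elisp
(require 'xml)

(defun my/decode-entities-strict (html)
  "Decode entity references in the string HTML.
Signal an error if HTML contains any markup."
  (with-temp-buffer
    ;; Insert the text, leaving point at the start of the buffer.
    (save-excursion (insert html))
    (let ((decoded (xml-parse-string)))
      ;; `xml-parse-string' stops at the first `<'.  If point is not
      ;; at the end of the buffer, some of the input was not consumed.
      (unless (eobp)
        (error "Input contains markup: %S" html))
      decoded)))

(my/decode-entities-strict "doesn&#39;t")
;; => "doesn't"
```

Calling it on "string with trailing tag: <tag/>" signals an error instead of quietly returning a truncated string.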

In an Emacs built with LibXML support, another, slightly hackish way is to write a wrapper around `libxml-parse-html-region`. Since the LibXML parser assumes its argument is a complete HTML document, the wrapper function has to extract the parsed character data from the returned document structure, here using `pcase`. Trying to decode a string that contains any HTML tags will signal an error:

(defun decode-entities/libxml (html)
  (with-temp-buffer
    (insert html)
    (let ((document
           (libxml-parse-html-region (point-min) (point-max))))
      (pcase document
        (`(html nil
                (body nil
                      (p nil
                         ,(and (pred stringp)
                               content))))
          content)
        (_ (error "Unexpected parse result: %S" document))))))

Results:

(decode-entities/libxml "The old &quot;how to fold xml&quot; question")
     ; => "The old \"how to fold xml\" question"
(decode-entities/libxml "doesn&#39;t") ; => "doesn't"

(decode-entities/libxml "<html>")              ; produces an error

It does seem a little backward to decode a document fragment by parsing it as a complete document, only to immediately strip off the surrounding tags. On the other hand, using LibXML should be fast and give accurate results.
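If signalling an error on embedded tags is too strict for your use case, a more lenient variant is possible. The sketch below assumes `dom.el` is available (it ships with Emacs 25 and later) and uses `dom-texts` to collect all text nodes from the parsed document instead of matching one fixed shape with `pcase`. The name `my/decode-entities-lenient` is hypothetical, and the `fboundp` check is only a rough test for LibXML support.

```elisp
(require 'dom)

(defun my/decode-entities-lenient (html)
  "Return the text content of the string HTML with entities decoded.
Any tags in HTML are dropped.  Needs an Emacs built with LibXML."
  (unless (fboundp 'libxml-parse-html-region)
    (error "This Emacs was built without LibXML support"))
  (with-temp-buffer
    (insert html)
    ;; Parse the buffer as a full HTML document, then concatenate
    ;; every text node, using the empty string as the separator.
    (dom-texts (libxml-parse-html-region (point-min) (point-max)) "")))

(my/decode-entities-lenient "doesn&#39;t")
;; => "doesn't"
```

The trade-off is that markup is discarded silently, so this only makes sense when you want the text and do not care whether the input contained tags.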

  • Sorry, I hadn't seen your xml edit. Looks awesome. – Malabarba Nov 26 '14 at 12:34
  • Thanks - I edited the answer to put the simpler `xml.el` solution first. –  Nov 30 '14 at 19:17
  • @Malabarba Note that `lisp/xml.el` has always included the function `xml-substitute-special`, which performs the same entity decoding as [Jon O.'s `decode-entities`](https://emacs.stackexchange.com/a/3141/15748). It does not, however, omit trailing tags. – Basil Aug 11 '17 at 18:33
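
For reference, a short sketch of what the `xml-substitute-special` approach from the last comment looks like. It takes the string directly, so no temporary buffer is needed; the input strings are the ones from the question.

```elisp
(require 'xml)

(xml-substitute-special "The old &quot;how to fold xml&quot; question")
;; => "The old \"how to fold xml\" question"

(xml-substitute-special "Babel doesn&#39;t wrap results in verbatim")
;; => "Babel doesn't wrap results in verbatim"

;; Unlike the `decode-entities' wrapper, tags are left in place:
(xml-substitute-special "string with trailing tag: <tag/>")
;; => "string with trailing tag: <tag/>"
```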

web-mode.el does this with web-mode-dom-entities-replace.

itsjeyd
fxbois