Emacs includes a pure-Elisp XML parser in xml.el, whose xml-parse-string function does the job, though it seems a bit like an undocumented internal function. I'm not sure whether there are any HTML-only entities that won't be handled properly when the string is treated as an XML fragment.
This wrapper function stops at the first tag in the input string and silently drops everything from there on, although you could make it stricter:
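(require 'xml)

(defun decode-entities (html)
  "Return HTML with its XML entities and character references decoded."
  (with-temp-buffer
    ;; Insert the text but leave point at the start of the buffer,
    ;; because `xml-parse-string' parses forward from point.
    (save-excursion (insert html))
    (xml-parse-string)))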
(decode-entities "The old &quot;how to fold xml&quot; question")
;; => "The old \"how to fold xml\" question"
(decode-entities "doesn&#39;t")
;; => "doesn't"
(decode-entities "string with trailing tag: <tag/>")
;; => "string with trailing tag: "
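One way to make the wrapper stricter, as a minimal sketch: since xml-parse-string leaves point at the start of the next thing to parse, anything left in the buffer afterwards must be markup that was skipped. The function name decode-entities-strict and the error message are made up for illustration; they are not part of xml.el.

```elisp
(require 'xml)

;; Hypothetical stricter variant; the name and the error text are
;; illustrative only.
(defun decode-entities-strict (html)
  "Decode entities in HTML, signaling an error if any input is left unparsed."
  (with-temp-buffer
    (save-excursion (insert html))
    (let ((text (xml-parse-string)))
      ;; `xml-parse-string' stops at the first "<", so point not being
      ;; at end of buffer means the input contained markup.
      (unless (eobp)
        (error "Unparsed input remains: %S"
               (buffer-substring-no-properties (point) (point-max))))
      text)))
```

With this version, a string such as "string with trailing tag: <tag/>" signals an error instead of being quietly truncated.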
In Emacs with LibXML support, another slightly hackish way would be to write a wrapper around libxml-parse-html-region. Since the LibXML parser assumes its argument is a complete HTML document, the wrapper function has to extract the parsed character data from the returned document structure, using pcase. Trying to decode a string that contains any HTML tags will produce an error:
(defun decode-entities/libxml (html)
  (with-temp-buffer
    (insert html)
    (let ((document
           (libxml-parse-html-region (point-min) (point-max))))
      (pcase document
        (`(html nil
                (body nil
                      (p nil
                         ,(and (pred stringp)
                               content))))
         content)
        (_ (error "Unexpected parse result: %S" document))))))
Results:
(decode-entities/libxml "The old &quot;how to fold xml&quot; question")
; => "The old \"how to fold xml\" question"
(decode-entities/libxml "doesn&#39;t") ; => "doesn't"
(decode-entities/libxml "<html>") ; produces an error
It does seem a little backward to decode a document fragment by parsing it as a complete document, only to immediately strip off the surrounding tags. On the other hand, using LibXML should be fast and give accurate results.
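If signaling an error on embedded tags is too strict for your use, a looser variant could collect all the text in the parsed document instead of matching one exact shape. The sketch below assumes an Emacs built with LibXML support and relies on dom-texts from dom.el; the function name decode-entities/libxml-text is hypothetical.

```elisp
(require 'dom)

;; Hypothetical variant; the function name is illustrative only.
;; Assumes Emacs was built with LibXML support.
(defun decode-entities/libxml-text (html)
  "Return the text content of HTML with entities decoded and tags dropped."
  (with-temp-buffer
    (insert html)
    (let ((document (libxml-parse-html-region (point-min) (point-max))))
      ;; `dom-texts' joins the text nodes under DOCUMENT (it skips
      ;; script and comment nodes); the empty separator keeps it from
      ;; inserting spaces between fragments.
      (if document
          (dom-texts document "")
        ""))))
```

Note that this changes the behavior described above: markup in the input is discarded rather than rejected, so it is only appropriate when losing the tags is acceptable.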