Org mode - Parsing rich HTML directly when pasting?

Question

Currently, in note-taking tools like Evernote and Quiver, I can copy HTML content directly from my favorite browser and paste it into the app with all the formatting and links preserved. However, in orgmode it seems that all the formatting info is lost.

I've seen somebody suggest using eww to browse the web and copy the content via eww-org. However, that is really tedious (I don't think there would be a lot of people browsing the web using eww instead of modern browsers nowadays. I'll have to open that link again in eww and do the copying, not to mention sometimes eww doesn't render the contents nicely).

Is it possible to let Emacs directly parse the copied HTML when pasting? Even if there's no existing tool for that yet, is it feasible to make one?

This is almost the only thing that stops me from switching to orgmode from other notetaking tools.

Please clarify what you mean by "directly parse the copied HTML" — mankoff, May 05 '15 at 01:56
@mankoff OK I guess I wasn't clear enough in my description. What I want is, for example, if the original HTML had `<b>text</b>`, then after I `Cmd + C` on it, it can be converted to `*text*` in `org mode` by some means when pasting. Or if not, at least preserve the original HTML code so that I could view it in its original proper format later. The current situation is that somehow only plain text is pasted. – xji, May 05 '15 at 11:21
For example, we have here `I've seen somebody suggest using eww to browse the web and copy the content via eww-org. However that is really tedious (I don't think there would be a lot of people browsing the web using eww instead of modern browsers nowadays. I'll have to open that link again in eww and do the copying, not to mention sometimes eww doesn't render the contents nicely).`. If I copy this paragraph, I want to be able to reproduce its formatting in `orgmode`. – xji, May 05 '15 at 11:22
Can you explain how the answer below doesn't meet your requirements? Is just because it doesn't work on paste? — mankoff, May 05 '15 at 11:55
@mankoff Basically yes. For this to work I'd still have to view source code of the web page I'm browsing or save the page altogether, copy the corresponding source code into Emacs and then run the function. A big improvement, yes, but it still involves multiple steps... Weird though, why are some applications able to capture the formatting even if I copy directly from the web page inside of a browser. If they can retrieve the source, surely Emacs can also do it? I guess it involves some more complicated interaction with system clipboard? — xji, May 05 '15 at 12:11
Ah! I get it now. The answer below works with text in the clipboard, but your issue is that the clipboard doesn't contain the right text. I'm not sure how to address this. Perhaps AquaMacs has better support for advanced clipboard access? What platform and version of emacs are you using? — mankoff, May 05 '15 at 13:04
@mankoff OS X 10.10. I'm using `Emacs` built from `homebrew`. — xji, May 05 '15 at 13:10
OK. `osascript` allows access to the rich text clipboard outside of emacs. Let me know if the code below works for you... — mankoff, May 05 '15 at 14:07
@mankoff Wonderful! It worked! You're the man! I think you could even consider submitting it as an Emacs package etc. haha. This could make `org mode` so much more user-friendly. Actually I like it more with the formatting without intermediate RTF conversion because it preserves more info. For example `#+BEGIN_QUOTE` and `#+BEGIN_EXAMPLE` in your answer would not be preserved with the additional conversion. — xji, May 05 '15 at 14:20
The latest version cleans some html but not as much - I dropped textutil and only use pandoc, but filter the html through json format. I think it preserves the right amount of formatting now. This also means it can all be done with pipes (|) and not use a temp file. — mankoff, May 05 '15 at 14:37
A related package is https://github.com/Lindydancer/highlight2clipboard which allows you to copy highlighted text in Emacs and paste it into other applications with the highlighting retained. (Currently, this works under Windows and OS X.) — Lindydancer, Oct 16 '15 at 20:06
@Lindydancer Thanks for the suggestion. However I can't seem to get it to work. Copying the file as well as its dependency `htmlize.el` into load path and requiring it seems to break Emacs startup somehow. — xji, Oct 18 '15 at 16:01
@XiangJi, what version of Emacs are you using? What operating system? Does it happen on a clean system (without the rest of your normal init files). Which problems do you see with the emacs startup? — Lindydancer, Oct 19 '15 at 05:55
@mankoff Can you explain what filtering the HTML through the JSON format does? — incandescentman, Apr 04 '17 at 07:21
@incandescentman That's just an attempt to strip some formatting information, e.g. some CSS within the web page. You may try it yourself and see what difference it makes. – xji, Apr 04 '17 at 09:13

score 20 · Accepted Answer · edited Jun 10 '20 at 14:24

is it feasible to make one?

Since this is emacs, yes.

My approach is to use third-party tools that can take HTML and convert it to plain text or even directly to Org format. I think this is an ugly hack, and there may be better ways to do this, but it works for my test cases.

(defun kdm/html2org-clipboard ()
  "Convert clipboard contents from HTML to Org and then paste (yank)."
  (interactive)
  ;; The final perl step turns UTF-8 non-breaking spaces (bytes C2 A0) into plain spaces.
  (kill-new (shell-command-to-string "osascript -e 'the clipboard as \"HTML\"' | perl -ne 'print chr foreach unpack(\"C*\",pack(\"H*\",substr($_,11,-3)))' | pandoc -f html -t json | pandoc -f json -t org | perl -pe 's/\\xC2\\xA0/ /g'"))
  (yank))

Unfortunately, HTML is incredibly complex now - no longer some simple hand-written tags. This complex HTML tagging requires the complicated shell command above. It does the following:

1. osascript gets the HTML text from the clipboard. It is hex encoded, so
2. perl converts the hex to a string.
3. We could convert that HTML to Org directly with pandoc, but the HTML is full of complicated tags and therefore produces a ton of Org code. In order to simplify the HTML to the minimal set of tags needed to capture the formatting, I
   a. convert the HTML to json, and then
   b. convert the json to Org (these two steps simplify the HTML).
4. Replace non-standard (non-breaking) spaces with standard ones.

Note that osascript is for MacOS. To modify steps 1-2 for Linux, replace the argument of shell-command-to-string with

"xclip -o -t text/html | pandoc -f html -t json | pandoc -f json -t org"

In any case, the output of the pandoc command is returned to emacs, and inserted into the buffer.

Bind the new Emacs command to a key similar to "paste" but that means "paste-and-convert-from-html" to you, and it should work.

Alternatively, if you don't want to think about which paste command to use, here is a Linux version that will convert HTML when that is available on the clipboard and will otherwise fall back to plain text:

"xclip -o -t TARGETS | grep -q text/html && (xclip -o -t text/html | pandoc -f html -t json | pandoc -f json -t org) || xclip -o"
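
The same fallback could be written as a small shell function, which is easier to read and extend than the one-liner. This is only a sketch: clipboard_as_org is a hypothetical name, `-selection clipboard` is an added assumption (xclip reads the PRIMARY selection by default, so drop it to match the command above), and `--wrap=none` is the option suggested in the comments.

```shell
# Hypothetical helper: print the X11 clipboard as Org text, or as plain text
# when no text/html target is offered.
clipboard_as_org() {
    # Ask xclip which targets the clipboard owner offers.
    if xclip -o -selection clipboard -t TARGETS | grep -qx 'text/html'; then
        # HTML is available: simplify it via pandoc's JSON format, then emit Org.
        xclip -o -selection clipboard -t text/html |
            pandoc -f html -t json | pandoc -f json -t org --wrap=none
    else
        # No HTML on the clipboard: fall back to plain text.
        xclip -o -selection clipboard
    fi
}
```

Saved as a script on the PATH, it could be called from shell-command-to-string in place of the long string. Unlike the `&& … ||` form, the if/else form does not append the plain-text clipboard after a failed pandoc run.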

edited Jun 10 '20 at 14:24

avv

answered May 04 '15 at 12:41

mankoff

One addition: It seems that `pandoc` automatically uses [Non-breaking space](https://en.wikipedia.org/wiki/Non-breaking_space) quite a lot instead of normal spaces when converting formatted inline text (bold, italics, code etc.), and these are not recognized by `orgmode` by default. You'd have to add it (U+00A0) to `org-emphasis-regexp-components` in order for that text to be formatted correctly in `orgmode`. – xji Jun 18 '15 at 14:57
Notably, the "released" version of xclip does not support the -t option, so xclip must be built from GitHub. Also, you might need to pipe pandoc input and output through `iconv utf-8` – malcook Mar 10 '16 at 17:51
`xclip` is on OS X also (perhaps only w/ X11 and/or Developer Tools installed?), so the improved answer could work on OS X too. – mankoff Mar 11 '16 at 09:33
@JIXiang How would I modify the accepted answer so that it also converts non-breaking spaces to normal spaces? – incandescentman Apr 03 '17 at 23:36
@incandescentman I originally modified org-mode's package file so that it recognizes non-breaking space as a separator. However it turned out to be tedious with version changes. I then raised an issue on pandoc's repo which you can search about. Essentially you can use a "filter" in pandoc to perform automatic substitution. But that sometimes also fails. So now I just mostly manually select the pasted content and perform a substitution. My last substitution is almost always this one so I just scroll up my substitution history and apply. – xji Apr 04 '17 at 06:56
@JIXiang @mankoff I usually don't want to include all the extraneous HTML blocks this method produces. How would I add this line `(replace-regexp "#\\+BEGIN_HTML\\(?:.*\\|\n\\)*#\\+END_HTML" "")` to the function above? – incandescentman Apr 04 '17 at 17:10
Note that `pandoc` automatically wraps the output with linebreaks. If you don't want that you can add `--wrap=none` to the end of the command. – xji Apr 07 '17 at 09:07

kuanyui · Answer 2 · 2021-09-30T05:56:51.123

3

I wrote a Firefox add-on, Copy as Org-Mode, which does this directly in the browser (copying the rendered rich HTML instead of the raw HTML source). It can even convert HTML tables to Org-mode format.

edited Sep 30 '21 at 05:56

answered Sep 30 '21 at 05:38

kuanyui

Рустам Усманов · Answer 3 · 2020-10-02T12:26:37.283

Short answer:
consider using org-web-tools-read-url-as-org.
Long answer:
If you just want to parse HTML, you can use the dom module, for example:

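(require 'dom)
;; `html' is a sample string; substitute the HTML you want to parse.
(let* ((html "<html><head><title>Example</title></head></html>")
       (dom (with-temp-buffer
              (insert html)
              (libxml-parse-html-region (point-min) (point-max))))
       (title (car (dom-children (car (dom-by-tag dom 'title))))))
  title)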

If you want to automatically render HTML into something more or less readable, you can use eww, w3m and other text browsers. Personally I prefer to convert web documents into Org mode. I know several tools for converting HTML to Org mode: the already mentioned pandoc, org-web-tools and html2org.
My favorite tool is org-web-tools, because pandoc and html2org convert the whole document. org-web-tools, conversely, detects parts that are likely to render poorly and removes them. Below is a short description of how it is done:
The org-web-tools-read-url-as-org function loads HTML from the specified URL, converts it into an Org mode document and then displays it in the selected window. The main advantage of this approach is that its code path org-web-tools--url-as-readable-org -> org-web-tools--eww-readable -> eww-score-readability calculates a score for each part of the document and removes parts that are probably not worth viewing (ads, poorly rendered menus).
By the way, html2org is an extremely simple tool, which can probably be hacked easily to render specific parts of an HTML page.

Please elaborate, explaining how this is an answer. Perhaps summarize what you are pointing to with those links. A link-only answer will likely be deleted. – Drew, Sep 29 '20 at 04:35
Thanks for the links. Though the question was about copying and pasting specific passages from the page into the corresponding org format, not the whole page/URL. The accepted answer has been working for that purpose for me. — xji, Oct 04 '20 at 20:16

score 0 · Answer 4 · answered May 02 '21 at 17:43

Assuming macOS, to get the clipboard as HTML, install:

git clone https://github.com/chbrown/macos-pasteboard
cd macos-pasteboard
make install

Then you can use the function

function pbpaste-html() {
    command pbv public.html public.utf8-plain-text
}

Together with pandoc:

# pbpaste-html is the function defined above.
input="$(gmktemp)"
pbpaste-html > "$input"

pandoc --wrap=none --from "html" --to "org" "$input" -o "-" | pbcopy

Then paste the result in emacs.
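
The temporary file is probably not essential, since pandoc reads standard input when no file is given. A sketch of a pipe-only variant, assuming pbv and pandoc are installed as described above (pbpaste_org is a hypothetical name):

```shell
# Hypothetical wrapper: print the macOS clipboard as Org text on stdout.
# Relies on pbv from macos-pasteboard and on pandoc being on the PATH.
pbpaste_org() {
    pbv public.html public.utf8-plain-text |
        pandoc --wrap=none --from html --to org
}
```

Running `pbpaste_org | pbcopy` would then put the converted text back on the clipboard, as in the commands above.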
