html to orgmode via pandoc- get rid of all #+BEGIN_HTML blocks

Question

I tried converting HTML web pages to an Org mode file using a simple pandoc command:

pandoc -o output.org R\ Seminar:\ Introduction\ to\ ggplot2.htm

This works quite well, but I'm left with tons of HTML blocks that look like this:

#+BEGIN_HTML
  <div class="rcode">
#+END_HTML

#+BEGIN_HTML
  <div class="source">
#+END_HTML

#+BEGIN_EXAMPLE
    #declare data and x and y aesthetics, but no shapes yet
    ggplot(data = Milk, aes(x=Time, y=protein))
#+END_EXAMPLE

#+BEGIN_HTML
  </div>
#+END_HTML

#+BEGIN_HTML
  </div>
#+END_HTML

#+BEGIN_HTML
  <div class="rimage default">
#+END_HTML

Any clue how to get rid of these annoying #+BEGIN_HTML blocks during conversion?

Duplicate? https://emacs.stackexchange.com/questions/12121/org-mode-parsing-rich-html-directly-when-pasting If that doesn't answer your question, please explain why and provide an MWE. — mankoff, Jul 19 '16 at 11:46
Thanks, but the same issue arises with that answer, since it also uses pandoc for the conversion. It still adds #+BEGIN_HTML blocks all over when pasting. I want to get rid of these blocks. — zeltak, Jul 19 '16 at 12:30
You can probably use replace-regexp to remove it all: `(replace-regexp "#\\+BEGIN_HTML\\(?:.*\\|\n\\)*#\\+END_HTML" "")` but without all those divs, the output may not be exactly what you started with. — amitp, Jul 19 '16 at 20:49
Thanks. I tried evaluating the above code snippet in the Org buffer, but I get a debugger error: https://paste.xinu.at/oqJs/. Any clue? — zeltak, Jul 20 '16 at 09:10
Hi, since I have this exact problem too, I'm bumping this question. The code proposed in the comment on the question gives the same error as the one the author encountered. And because of these #+BEGIN_HTML blocks swarming everywhere in the file, the converted Org file is unusable. Thank you. — Martin Probst, Jun 04 '17 at 04:11
This question was asked (by me) and answered on the pandoc mailing list. See https://groups.google.com/forum/m/?utm_medium=email&utm_source=footer#!msg/pandoc-discuss/GwVP6mu38ZE/1FkyuCmHGgAJ — Ista, Aug 31 '17 at 21:19

aplaice · Answer 1 · 2018-08-31T22:11:17.260

From the solution linked to by Ista (direct link to the solution), you can create a pandoc filter, say in a file named nodivs-filter.hs:

import Text.Pandoc.JSON

main = toJSONFilter nodivs
  where nodivs (Div _ bs) = bs
        nodivs b          = [b]

You then compile the filter with ghc: ghc nodivs-filter.hs. Finally, you use the filter when converting, as follows:

pandoc --filter ./nodivs-filter input-file.html -o output.org

In order to compile the pandoc filter, you need to have the relevant libraries. For instance, on Ubuntu, you'd need the libghc-pandoc-types-dev package (sudo apt-get install libghc-pandoc-types-dev). More generally, you could also try installing via cabal (cabal install pandoc).

To understand the Haskell filter

The relevant hackage documentation is here and here.

Rewriting the program in long form, and adding comments (starting with -- and hopefully useful for somebody not used to Haskell):

import Text.Pandoc.JSON

main = toJSONFilter nodivs

-- Type signature (convert a block to a list of blocks)
nodivs :: Block -> [Block]
-- Case when our input block is a Div.
-- Div constructors have the form
--   Div Attr [Block]
-- _ means we ignore the attributes (Attr).
nodivs (Div _ bs) = bs
-- Fall-through case (any other type of block).
-- bs (above) is a list of blocks, so to have consistent types
-- we must wrap our fall-through block in a one-member list of blocks.
nodivs b          = [b]
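
For readers more comfortable with Python than Haskell, the same unwrapping idea can be sketched on pandoc's JSON representation using only the standard library. This is an illustration, not the filter from the mailing list: it assumes each AST node is serialized as an object with a "t" (constructor name) and a "c" (contents) key, and that a Div's contents are a two-element list of attributes and child blocks. The names strip_divs and filter_stream are made up for this example.

```python
import json


def strip_divs(node):
    """Return a copy of a pandoc JSON fragment with every Div unwrapped."""
    if isinstance(node, dict):
        # Ordinary node or top-level document: clean each field.
        return {key: strip_divs(value) for key, value in node.items()}
    if isinstance(node, list):
        result = []
        for child in node:
            child = strip_divs(child)  # bottom-up: clean nested content first
            if isinstance(child, dict) and child.get("t") == "Div":
                _attr, blocks = child["c"]  # Div Attr [Block]
                result.extend(blocks)       # like: nodivs (Div _ bs) = bs
            else:
                result.append(child)        # like: nodivs b = [b]
        return result
    return node  # strings, numbers, booleans, None


def filter_stream(inp, out):
    """Read a JSON document from inp and write the Div-free version to out."""
    json.dump(strip_divs(json.load(inp)), out)
```

Calling filter_stream(sys.stdin, sys.stdout) at the bottom of a script (here hypothetically saved as strip_divs.py) would let it sit in a pipeline such as pandoc -t json input-file.html | python3 strip_divs.py | pandoc -f json -o output.org.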

Some alternatives

These all come from this thread on pandoc's github.

Disable the `native_divs` extension

In your case:

pandoc -f html-native_divs -t org -o output.org R\ Seminar:\ Introduction\ to\ ggplot2.htm

(-f html-native_divs means from html, without native_divs)
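
If several saved pages need converting, the same flags can be driven from a small script. The following is a rough sketch that assumes pandoc is on the PATH; the helper names pandoc_command and convert are invented for this example. Passing the arguments as a list avoids having to backslash-escape spaces in file names.

```python
import subprocess
from pathlib import Path


def pandoc_command(html_path):
    """Build the pandoc invocation shown above for one HTML file."""
    src = Path(html_path)
    return [
        "pandoc",
        "-f", "html-native_divs",  # read HTML with native_divs disabled
        "-t", "org",
        "-o", str(src.with_suffix(".org")),
        str(src),
    ]


def convert(html_path):
    """Run pandoc on one file; raises CalledProcessError if pandoc fails."""
    subprocess.run(pandoc_command(html_path), check=True)
```

For example, convert("page.htm") would write page.org next to the input file.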

Use pandoc 2.0

AFAICT from the above-mentioned thread, the defaults will become slightly more convenient.

score 1 · Answer 2 · 2017-07-02T12:13:43.410


You can replace those blocks like this:

(replace-regexp (rx (optional "\n")
                    "#+BEGIN_HTML"
                    (minimal-match (1+ anything))
                    "#+END_HTML"
                    (optional "\n"))
                "")
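
The same clean-up can also be done outside Emacs. Below is a minimal sketch that mirrors the rx form above with Python's re module; the function name strip_html_blocks is made up for this example, and like the Emacs version it assumes every #+BEGIN_HTML has a matching #+END_HTML.

```python
import re

# Optional newline, #+BEGIN_HTML, shortest possible run of any characters
# (including newlines), #+END_HTML, optional newline.
HTML_BLOCK = re.compile(r"\n?#\+BEGIN_HTML.+?#\+END_HTML\n?", re.DOTALL)


def strip_html_blocks(org_text):
    """Return org_text with every #+BEGIN_HTML ... #+END_HTML block removed."""
    return HTML_BLOCK.sub("", org_text)
```

The non-greedy .+? plays the role of minimal-match: each block is removed on its own, so the #+BEGIN_EXAMPLE blocks sitting between two HTML blocks are left intact.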

edited Jul 02 '17 at 12:13

answered Jul 02 '17 at 11:48

Melioratus · Answer 3 · 2017-11-30T18:08:32.933


Try this

Use pandoc to convert the HTML to LaTeX:

pandoc -o output.latex R\ Seminar:\ Introduction\ to\ ggplot2.htm

Then use pandoc to convert the LaTeX to Org:
```
pandoc -o output.org output.latex
```

Tested using
pandoc version: 1.19.2.1
Compiled with pandoc-types 1.17.0.4, texmath 0.9, skylighting 0.1.1.4

edited Nov 30 '17 at 18:08

answered Nov 30 '17 at 17:46


This actually works surprisingly well! – lkahtz Jan 22 '21 at 11:25
