
I tried converting HTML web pages to an Org mode file using a simple pandoc command:

pandoc -o output.org R\ Seminar:\ Introduction\ to\ ggplot2.htm  

This works quite well, but I'm left with tons of HTML blocks that look like this:

#+BEGIN_HTML
  <div class="rcode">
#+END_HTML

#+BEGIN_HTML
  <div class="source">
#+END_HTML

#+BEGIN_EXAMPLE
    #declare data and x and y aesthetics, but no shapes yet
    ggplot(data = Milk, aes(x=Time, y=protein))
#+END_EXAMPLE

#+BEGIN_HTML
  </div>
#+END_HTML

#+BEGIN_HTML
  </div>
#+END_HTML

#+BEGIN_HTML
  <div class="rimage default">
#+END_HTML

Any clue how to get rid of these annoying #+BEGIN_HTML blocks during conversion?

Drew
zeltak
  • Duplicate? https://emacs.stackexchange.com/questions/12121/org-mode-parsing-rich-html-directly-when-pasting If that doesn't answer your question, please explain why and provide an MWE. – mankoff Jul 19 '16 at 11:46
  • Thanks, but the same issue arises in that answer, since it also uses pandoc for the conversion. It still adds #+BEGIN_HTML
    blocks all over when pasting. I want to get rid of these blocks.
    – zeltak Jul 19 '16 at 12:30
  • You can probably use regexp-replace to remove it all: `(replace-regexp "#\\+BEGIN_HTML\\(?:.*\\|\n\\)*#\\+END_HTML" "")` but without all those divs, the output may not be exactly what you started with. – amitp Jul 19 '16 at 20:49
  • Thanks. I tried evaluating the above code snippet in the Org buffer, but I get a debugger error: https://paste.xinu.at/oqJs/. Any clue? – zeltak Jul 20 '16 at 09:10
  • Hi, since I have this exact problem too, I'm bumping this question. The code proposed in the comment on the question gives the same error as the one the author encountered. And because of these #+BEGIN_HTML blocks swarming everywhere in the file, the converted Org file is unusable. Thank you. – Martin Probst Jun 04 '17 at 04:11
  • I'm also still very interested in a solution for this. – zeltak Jun 06 '17 at 06:39
  • This question was asked (by me) and answered on the pandoc mailing list. See https://groups.google.com/forum/m/?utm_medium=email&utm_source=footer#!msg/pandoc-discuss/GwVP6mu38ZE/1FkyuCmHGgAJ – Ista Aug 31 '17 at 21:19

3 Answers

3

Following the solution linked to by Ista (direct link to the solution), you can create a pandoc filter, say in a file named nodivs-filter.hs:

import Text.Pandoc.JSON

main = toJSONFilter nodivs
  where nodivs (Div _ bs) = bs
        nodivs b          = [b]

You then compile the filter with ghc: ghc nodivs-filter.hs. Finally, you use the filter when converting, as follows:

pandoc --filter ./nodivs-filter input-file.html -o output.org

In order to compile the pandoc filter, you need to have the relevant libraries installed. For instance, on Ubuntu, you'd need the libghc-pandoc-types-dev package (sudo apt-get install libghc-pandoc-types-dev). More generally, you could also try installing via cabal (cabal install pandoc), which depends on pandoc-types, the package that provides Text.Pandoc.JSON.

To understand the Haskell filter

The relevant hackage documentation is here and here.

Rewriting the program in long form and adding comments (lines starting with --, hopefully useful for somebody not used to Haskell):

import Text.Pandoc.JSON

main = toJSONFilter nodivs

-- Type signature: convert a block into a list of blocks
nodivs :: Block -> [Block]
-- Case where the input block is a Div.
-- The Div constructor has the form
--   Div Attr [Block]
-- The _ pattern ignores the attribute (Attr); bs binds the Div's contents
nodivs (Div _ bs) = bs
-- Fall-through case (any other type of block).
-- bs (above) is a list of blocks, so to keep the types consistent
-- we must wrap the fall-through block in a one-element list
nodivs b          = [b]

Some alternatives

These all come from this thread on pandoc's GitHub.

Disable the native_divs extension

In your case:

pandoc -f html-native_divs -t org -o output.org R\ Seminar:\ Introduction\ to\ ggplot2.htm

(-f html-native_divs means from html, without native_divs)
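If several saved pages need converting, the same flags can be wrapped in a small shell function. This is only a sketch: the function name html_to_org is made up for illustration, and it assumes each output file should sit next to its input with the extension swapped for .org.

```shell
# Hypothetical helper (the name is an assumption, not part of pandoc):
# convert each HTML file given as an argument to an Org file,
# with the native_divs extension disabled as in the command above.
html_to_org() {
  for src in "$@"; do
    # ${src%.*} strips the last extension, e.g. "page.htm" -> "page"
    pandoc -f html-native_divs -t org -o "${src%.*}.org" "$src"
  done
}

# Example call (quoting avoids having to escape spaces and colons):
#   html_to_org "R Seminar: Introduction to ggplot2.htm"
```

Quoting "$src" means file names containing spaces, like the one in the question, do not need backslash escapes.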

Use pandoc 2.0

AFAICT from the above-mentioned thread, the defaults will become slightly more convenient.

aplaice
1

You can replace those blocks like this:

(replace-regexp (rx (optional "\n")
                    "#+BEGIN_HTML"
                    (minimal-match (1+ anything))
                    "#+END_HTML"
                    (optional "\n"))
                "")
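Outside Emacs, roughly the same cleanup can be done on the already-converted file with sed. The following is a minimal sketch, assuming the markers always start at the beginning of a line, as in the output shown in the question; the function name strip_html_blocks is hypothetical. Like the regexp above, it discards the div markup entirely.

```shell
# Hypothetical helper: read Org text on stdin and drop every
# #+BEGIN_HTML ... #+END_HTML block, markers included.
strip_html_blocks() {
  # In a basic regular expression "+" is a literal character, so no escaping is needed.
  # cat -s then squeezes the runs of blank lines left behind into single blank lines.
  sed '/^#+BEGIN_HTML/,/^#+END_HTML/d' | cat -s
}

# Example call:
#   strip_html_blocks < output.org > cleaned.org
```

Because it writes to a separate file, the original output.org stays available in case removing the blocks loses something you wanted.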
1

Try this

  1. Use pandoc to convert the HTML to LaTeX.

    pandoc -o output.latex R\ Seminar:\ Introduction\ to\ ggplot2.htm 
    
  2. Use pandoc to convert the LaTeX to Org.

    pandoc -o output.org output.latex
    

Tested using
pandoc version: 1.19.2.1
Compiled with pandoc-types 1.17.0.4, texmath 0.9, skylighting 0.1.1.4
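
The two steps above could probably also be chained with a pipe, which avoids the intermediate output.latex file. This is an untested sketch, not something checked against the pandoc version listed above; the function name html_to_org_via_latex is made up, and the formats are given explicitly with -f and -t because pandoc cannot guess them from a file extension when reading from stdin.

```shell
# Hypothetical helper: $1 is the input HTML file, $2 the Org file to write.
html_to_org_via_latex() {
  # First pandoc writes LaTeX to stdout; second pandoc reads it from stdin.
  pandoc -f html -t latex "$1" | pandoc -f latex -t org -o "$2"
}

# Example call:
#   html_to_org_via_latex "R Seminar: Introduction to ggplot2.htm" output.org
```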

Melioratus