If I convert a docx file to org using Pandoc as follows:

pandoc -s test.docx -o test.org

I get a nice org file with markup. However, every heading also gets an unwanted property drawer whose CUSTOM_ID is generated from the heading text. For example:

* Heading One
  :PROPERTIES:
  :CUSTOM_ID: heading-one
  :END:

I don't want all these heading properties. Can I export without these drawers?

I found a similar question, but it did not help me:

in org-mode, a function to delete all properties drawers?

Otherwise, I would like to be able to strip the exported org file of all of its property drawers. Is this possible?

Edman
  • You could try exporting the resulting org file to a different org file, using the exporter that's built in to org mode in emacs, by using `C-c C-e O o`. This will produce a `test.org.org` file which has no drawers in it. Untested. – NickD Dec 16 '19 at 06:59
  • I tested this and it worked well with only one difference with the file created with tarleb's method: the org.org file has no space between header and text, whereas with tarleb's method the spaces are preserved. – Edman Dec 17 '19 at 10:40
  • I suspect that depending on the complexity of the document, there might be more differences as well. Glad you found a method that works for you! And the answer is informative: maybe it's time to learn something more about pandoc... – NickD Dec 17 '19 at 14:43

2 Answers

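The drawers are added only if a header has additional attributes (here, the identifier that pandoc generates for each heading, which the org writer emits as CUSTOM_ID). One can use a simple Lua filter to remove all attributes from headers in pandoc's internal document format: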

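-- Replace every header with a copy that keeps its level and content
-- but carries an empty attribute set (no identifier, classes, or key-value pairs).
function Header (header)
  return pandoc.Header(header.level, header.content, pandoc.Attr())
end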

Save the above to a file named remove-header-attr.lua and call pandoc with the additional parameter --lua-filter=remove-header-attr.lua, for example: pandoc -s --lua-filter=remove-header-attr.lua test.docx -o test.org
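
If the org file has already been generated, the fallback raised in the question (stripping the drawers after the fact) could be sketched in shell as below. This is only a minimal sketch, not something provided by pandoc or Org itself: the function name strip_property_drawers is made up for illustration, and it assumes every drawer opens with a line holding only :PROPERTIES: and closes with a line holding only :END: (indentation allowed). It deletes all property drawers, not just the CUSTOM_ID ones, and would also hit such lines inside example or source blocks.

```shell
# Hypothetical helper (name chosen for illustration): print an org file
# with every property drawer removed.
# Assumption: a drawer starts at a line containing only ":PROPERTIES:"
# and ends at the next line containing only ":END:", both optionally
# surrounded by whitespace.
strip_property_drawers() {
  # sed range address: delete from the opening line through the closing line.
  sed '/^[[:space:]]*:PROPERTIES:[[:space:]]*$/,/^[[:space:]]*:END:[[:space:]]*$/d' "$1"
}
```

For example, strip_property_drawers test.org > test-clean.org would write a cleaned copy and leave the original file untouched.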

tarleb

I think this is best solved at the level of the pandoc conversion, with a pandoc filter.

Create a file, say noattrs-filter.hs, containing:

import Text.Pandoc.JSON

main = toJSONFilter noAttrs

noAttrs :: Block -> Block
noAttrs (Header n _ i) = Header n nullAttr i
noAttrs (Div _ b) = Div nullAttr b
noAttrs b = b

Compile the file with ghc:

ghc noattrs-filter.hs

and run your conversion with:

pandoc -s --filter ./noattrs-filter test.docx -o test.org

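In order to compile the pandoc filter, you need to have the relevant libraries. For instance, on Ubuntu, you'd need the libghc-pandoc-types-dev package (sudo apt-get install libghc-pandoc-types-dev). More generally, you could also try installing via cabal (cabal install pandoc, or just cabal install pandoc-types, since the filter only imports Text.Pandoc.JSON from that package; recent cabal versions may need the --lib flag to make the library visible to ghc).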

Understanding the Haskell filter

The relevant hackage documentation is here and here.

The same code with comments added (comments start with -- and are hopefully useful to somebody not used to Haskell):

import Text.Pandoc.JSON

main = toJSONFilter noAttrs

-- Type signature (convert a block into a slightly modified block)
noAttrs :: Block -> Block
-- Header constructors have the form:
--  Header Int Attr [Inline]
-- _ means we ignore the attribute (Attr), since we're discarding it, anyway
-- nullAttr is, as the name suggests, an attribute containing no info
noAttrs (Header n _ i) = Header n nullAttr i
-- Div constructors have the form:
--  Div Attr [Block]
noAttrs (Div _ b) = Div nullAttr b
-- for completeness, one could also handle 'CodeBlock's, which can also
-- have an attribute; that is left as an exercise for the reader.
-- 
-- we need a fallthrough
noAttrs b = b

(This is based on my own answer to a relatively similar question here.)

aplaice
  • Using a Lua filter, as answered by tarleb, rather than a compiled one, is probably the more portable solution (I simply prefer Haskell to Lua...). – aplaice Dec 17 '19 at 09:52
  • Unfortunately, I was not able to test this solution on Windows as I failed to produce an install of pandoc with the required libraries. I would appreciate feedback from others on Linux who have tried it. – Edman Dec 17 '19 at 10:42