46

I need my PDF files as plain text so I can search them in bulk from the command line. Is there a converter for Ubuntu, OpenBSD or a similar system?

Possibly related: the post on OCR with Ubuntu.

Garrett
otto

6 Answers

45

You have a lot of options!

pdftotext from poppler has already been mentioned.

There's a Haskell program called pdf2line which works well.

calibre's ebook-convert command-line program (or calibre itself) is another option; it can convert PDF to plain text or to other ebook formats (RTF, ePub). In my opinion it generates better results than pdftotext, although it is considerably slower.

ebook-convert file.pdf file.txt

AbiWord can convert between any formats it knows from the command-line, and at least optionally has a PDF import plugin:

abiword --to=txt file.pdf

Yet another option is podofotxtextract from the PoDoFo PDF tools library. I haven't really tried that.

If you combine the two Ghostscript tools, pdf2ps and ps2ascii, you have yet another option.
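
The two-step Ghostscript route can be sketched roughly as follows. This is only an illustration: pdf_to_text_gs is a hypothetical helper name, and it assumes the pdf2ps and ps2ascii wrapper scripts that ship with Ghostscript are on your PATH.

```shell
# Two-step conversion via PostScript, using the Ghostscript wrapper scripts.
# pdf_to_text_gs is a hypothetical helper name (not part of Ghostscript).
# The intermediate PostScript file goes to a temporary file and is removed.
pdf_to_text_gs() {
    in=$1
    out=${2:-${in%.pdf}.txt}       # default: same name with a .txt suffix
    ps=$(mktemp) || return 1
    pdf2ps "$in" "$ps" && ps2ascii "$ps" "$out"
    status=$?
    rm -f "$ps"
    return "$status"
}
```

Calling pdf_to_text_gs file.pdf should then leave the text in file.txt; a second argument overrides the output name.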

I can actually think of a few more methods, but I'll leave it at that for now. ;)

frabjous
  • 1
    calibre's ebook-convert... have you seen what it does to ligatures? bleargh. let's put it this way: it's not a very e ective program. pdftotext is much more faithful. i have never discovered any errors in its output. – ixtmixilix Feb 12 '12 at 01:23
  • 2
    You can use less for viewing pdf-files as text. It invokes a preprocessor, i.e. lesspipe, for invoking pdftotext or similar tools. – Daniel Näslund Mar 13 '12 at 11:37
  • pdftotext gives more accurate results than ebook-convert and it is very fast. ebook-convert is sluggish. – Amit Patel May 26 '15 at 09:56
  • pdftotext with -layout option rocks! calibre requires more than 600mb to install! That's crazy ) – Stalinko Nov 15 '18 at 06:14
10

You can convert PDFs to text on the command line with pdftotext (Ubuntu: poppler-utils; OpenBSD: xpdf-utils package).
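
For the bulk case in the question, a minimal sketch might look like this. pdfs_to_text is a hypothetical helper name, it only handles a single directory (no recursion), and the -layout flag is optional; drop it if you don't care about preserving column alignment.

```shell
# Convert every PDF in one directory to a .txt file next to it.
# pdfs_to_text is a hypothetical helper name; pdftotext must be on PATH.
pdfs_to_text() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue          # the glob matched nothing
        txt=${pdf%.pdf}.txt
        pdftotext -layout "$pdf" "$txt" || echo "failed: $pdf" >&2
    done
}
```

After running pdfs_to_text ~/papers (the path is just an example), something like grep -il 'search term' ~/papers/*.txt lists the text files that contain a match.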

You can use Recoll (Ubuntu: recoll; OpenBSD: no port, but there's one for FreeBSD) to search inside various formatted document types, including PDF. It has a GUI and builds an index automatically under the hood. It uses pdftotext to convert PDF to text.

Acrobat Reader (at least version 9 under Linux) has a limited multiple-file search capability (you can search in all the files in a directory).

4

pdftotext is likely what you are looking for: http://en.wikipedia.org/wiki/Pdftotext, unless the text you want to extract is actually stored as an image, which is not that common with PDF documents.

jlliagre
2

Comparison of how methods handle paragraphs

One issue with pdftotext from poppler-utils 22.12.0 is that it adds newlines within paragraphs when the paragraph is longer than the PDF page width, e.g. something like:

1:1 In the beginning God created the heaven and
the earth.
1:2 And the earth was without form, and void; and
darkness was upon the face of the deep. And the
Spirit of God moved upon the face of the waters.
1:3 And God said, Let there be light: and there
was light.
1:4 And God saw the light, that it was good: and
God divided the light from the darkness.

These extra newlines make the txt files really bad to read on a device like a Kindle.

I have found however that ebook-convert overcomes this very well, and produces something like:

1:1 In the beginning God created the heaven and the earth.

1:2 And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.

1:3 And God said, Let there be light: and there was light.

1:4 And God saw the light, that it was good: and God divided the light from the darkness.

which maintains paragraphs in single lines, regardless of how long the paragraph is, and adds a double newline between paragraphs, and behaves much better on a Kindle.
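
If you want to stay with pdftotext, the wrapped lines can also be re-joined afterwards. The following is only a sketch under a strong assumption: every paragraph starts with a chapter:verse marker such as "1:2 ", as in the sample above. unwrap_verses is a hypothetical helper name, and the heuristic will not fit documents with a different structure.

```shell
# Join hard-wrapped lines into one line per paragraph, separated by a
# blank line. Assumption: a paragraph starts with a chapter:verse marker
# such as "1:2 ". unwrap_verses is a hypothetical helper name.
unwrap_verses() {
    awk '
        NF == 0 { next }                   # ignore empty lines
        /^[0-9]+:[0-9]+ / {                # a marker starts a new paragraph
            if (buf != "") print buf "\n"
            buf = $0
            next
        }
        { buf = (buf == "" ? $0 : buf " " $0) }   # continuation line
        END { if (buf != "") print buf }
    '
}
```

It reads standard input, so it can be chained directly: pdftotext file.pdf - | unwrap_verses > file.txt.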

I tested the methods mentioned in other answers with a test PDF generated from a LibreOffice .odt file.

pdftotext output:

Title of my file
Table of Contents
H1 1......................................................................................................................................................1
H2 1 1...............................................................................................................................................1
H2 1 2...............................................................................................................................................1
H1 2......................................................................................................................................................1
H2 2 1...............................................................................................................................................1
H2 2 2...............................................................................................................................................1

H1 1 H2 1 1 H2 1 2 First very important paragraph. And now a very very very very very very very very very very very very very very very very very very very very very very very very long paragraph that gets split across two lines. Reference to H1 1 on page: 1 https://commons.wikimedia.org/wiki/File:Fractal_Broccoli.jpg

H1 2 H2 2 1 H2 2 2

ebook-convert output:

Title of my file

Table of Contents

H1 1......................................................................................................................................................1

H2 1 1...............................................................................................................................................1

H2 1 2...............................................................................................................................................1

H1 2......................................................................................................................................................1

H2 2 1...............................................................................................................................................1

H2 2 2...............................................................................................................................................1

H1 1

H2 1 1

H2 1 2

First very important paragraph.

And now a very very very very very very very very very very very very very very very very very very very very very very very very long paragraph that gets split across two lines.

Reference to H1 1 on page: 1

https://commons.wikimedia.org/wiki/File:Fractal_Broccoli.jpg

H1 2

H2 2 1

H2 2 2

Document Outline

H1 1 H2 1 1

H2 1 2

H1 2 H2 2 1

H2 2 2

The line break aspect was also asked more specifically at: How to convert pdf file to text without breaking lines

Tested on Ubuntu 23.04, poppler-utils 22.12.0, calibre 6.11.0.

Ciro Santilli OurBigBook.com
0

pdftotext caused various formatting problems for me (even with optional tweaks), but mutool convert worked perfectly.

dicktyr
-1

gPDFText converts ebook PDF content into ASCII text, reformatted for long-line paragraphs. It works for me, and it has a graphical interface.

  • 3
    Hi and welcome to the site. We like answers to be a bit more comprehensive here. For example, you could add where gPDFText can be obtained, how it can be installed and how it would be used to answer the OP's question. – terdon Aug 07 '14 at 16:45