I need PDF files in text so I can search over them in bulk from commandline. Is there some converter for Ubuntu, OBSD or similar distro?
Perhaps related post, OCR with Ubuntu here.
You have a lot of options!
pdftotext from poppler has already been mentioned.
There's a Haskell program called pdf2line which works well.
calibre's ebook-convert command-line program (or calibre itself) is another option. It can convert PDF to plain text or to other ebook formats (RTF, ePub). In my opinion it generates better results than pdftotext, although it is considerably slower.
ebook-convert file.pdf file.txt
AbiWord can convert between any formats it knows from the command-line, and at least optionally has a PDF import plugin:
abiword --to=txt file.pdf
Yet another option is podofotxtextract from the podofo PDF tools library. I haven't really tried that.
If you combine the two Ghostscript tools, pdf2ps and ps2ascii, you have yet another option.
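A minimal sketch of the Ghostscript route, assuming both pdf2ps and ps2ascii are installed and are called with an input file and an output file. The function name pdf_to_text_gs is made up for illustration and is not part of Ghostscript:

```shell
# Hypothetical helper: PDF -> PostScript -> plain text via Ghostscript tools.
# Usage: pdf_to_text_gs input.pdf [output.txt]
pdf_to_text_gs() {
    pdf=$1
    txt=${2:-${pdf%.pdf}.txt}      # default: same name with a .txt suffix
    ps=$(mktemp) || return 1       # intermediate PostScript file
    pdf2ps "$pdf" "$ps" && ps2ascii "$ps" "$txt"
    status=$?
    rm -f "$ps"                    # always clean up the intermediate file
    return "$status"
}
```

For example, `pdf_to_text_gs file.pdf` would leave the text in file.txt next to the PDF.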
I can actually think of a few more methods, but I'll leave it at that for now. ;)
pdftotext gives more accurate results than ebook-convert and it is very fast. ebook-convert is sluggish. – Amit Patel May 26 '15 at 09:56
pdftotext with the -layout option rocks! calibre requires more than 600 MB to install! That's crazy ) – Stalinko Nov 15 '18 at 06:14
You can convert PDFs to text on the command line with pdftotext (Ubuntu: poppler-utils package; OpenBSD: xpdf-utils package).
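Since the question is about searching many PDFs at once, here is a rough sketch of a bulk conversion followed by a plain grep. It assumes pdftotext is on the PATH; pdf_dir_to_text is a hypothetical helper name, and the -layout option is the one praised in the comments above:

```shell
# Hypothetical helper: convert every *.pdf in a directory to a sibling .txt.
# Usage: pdf_dir_to_text [directory]
pdf_dir_to_text() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue            # the glob matched nothing
        txt=${pdf%.pdf}.txt
        # Report failures but keep going with the remaining files.
        pdftotext -layout "$pdf" "$txt" || echo "failed: $pdf" >&2
    done
}
```

After running `pdf_dir_to_text ~/docs`, something like `grep -il 'search term' ~/docs/*.txt` lists the converted files that mention the term.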
You can use Recoll (Ubuntu: recoll; OpenBSD: no port, but there's one for FreeBSD) to search inside various formatted text document types, including PDF. There's a GUI, and it builds an index automatically under the hood. It uses pdftotext to convert PDF to text.
Acrobat Reader (at least version 9 under Linux) has a limited multiple-file search capability (you can search in all the files in a directory).
pdftotext is likely what you are looking for: http://en.wikipedia.org/wiki/Pdftotext, unless the text you want to extract is actually stored in graphical form (as images rather than characters), which is not that common with PDF documents.
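One rough way to tell the two cases apart is to check whether pdftotext extracts anything other than whitespace. This is only a heuristic sketch: pdf_has_text is a hypothetical name, and a PDF with a tiny text layer over scanned pages would still pass:

```shell
# Hypothetical helper: succeed if pdftotext finds any non-whitespace text.
# "-" as the output file makes pdftotext write to standard output.
pdf_has_text() {
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

Usage might look like `pdf_has_text file.pdf || echo "file.pdf probably needs OCR"`.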
Comparison of how methods handle paragraphs
One issue with pdftotext from poppler-utils 22.12.0 is that it adds newlines within paragraphs when the paragraph is longer than the PDF page width, e.g. something like:
1:1 In the beginning God created the heaven and
the earth.
1:2 And the earth was without form, and void; and
darkness was upon the face of the deep. And the
Spirit of God moved upon the face of the waters.
1:3 And God said, Let there be light: and there
was light.
1:4 And God saw the light, that it was good: and
God divided the light from the darkness.
These extra newlines make the txt files really bad to read on a device like a Kindle.
I have found, however, that ebook-convert overcomes this very well, and produces something like:
1:1 In the beginning God created the heaven and the earth.
1:2 And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
1:3 And God said, Let there be light: and there was light.
1:4 And God saw the light, that it was good: and God divided the light from the darkness.
which maintains paragraphs in single lines, regardless of how long the paragraph is, and adds a double newline between paragraphs, and behaves much better on a Kindle.
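If you would rather keep pdftotext for its speed, a similar result can sometimes be approximated by post-processing its output. The sketch below is not what ebook-convert does internally; it simply assumes paragraphs are separated by blank lines, which will not hold for every PDF (the verse example above has no blank lines, for instance). unwrap_paragraphs is a hypothetical name:

```shell
# Hypothetical helper: join hard-wrapped lines into one line per paragraph.
# Reads the named files, or standard input when no file is given.
unwrap_paragraphs() {
    awk '
        # A blank (or whitespace-only) line ends the current paragraph.
        /^[[:space:]]*$/ {
            if (buf != "") { print buf; print ""; buf = "" }
            next
        }
        # Any other line is appended to the paragraph being collected.
        { buf = (buf == "" ? $0 : buf " " $0) }
        # Flush the last paragraph if the input did not end with a blank line.
        END { if (buf != "") print buf }
    ' "$@"
}
```

It can be used in a pipeline such as `pdftotext file.pdf - | unwrap_paragraphs > file.txt`.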
I'm going to test methods mentioned in other answers with this test PDF generated from this LibreOffice .odt file:
pdftotext output:
Title of my file
Table of Contents
H1 1......................................................................................................................................................1
H2 1 1...............................................................................................................................................1
H2 1 2...............................................................................................................................................1
H1 2......................................................................................................................................................1
H2 2 1...............................................................................................................................................1
H2 2 2...............................................................................................................................................1
H1 1
H2 1 1
H2 1 2
First very important paragraph.
And now a very very very very very very very very very very very very very very very very very
very very very very very very very long paragraph that gets split across two lines.
Reference to H1 1 on page: 1
https://commons.wikimedia.org/wiki/File:Fractal_Broccoli.jpg
H1 2
H2 2 1
H2 2 2
ebook-convert output:
Title of my file
Table of Contents
H1 1......................................................................................................................................................1
H2 1 1...............................................................................................................................................1
H2 1 2...............................................................................................................................................1
H1 2......................................................................................................................................................1
H2 2 1...............................................................................................................................................1
H2 2 2...............................................................................................................................................1
H1 1
H2 1 1
H2 1 2
First very important paragraph.
And now a very very very very very very very very very very very very very very very very very very very very very very very very long paragraph that gets split across two lines.
Reference to H1 1 on page: 1
https://commons.wikimedia.org/wiki/File:Fractal_Broccoli.jpg
H1 2
H2 2 1
H2 2 2
Document Outline
H1 1 H2 1 1
H2 1 2
H1 2 H2 2 1
H2 2 2
The line break aspect was also asked more specifically at: How to convert pdf file to text without breaking lines
Tested on Ubuntu 23.04, poppler-utils 22.12.0, calibre 6.11.0.
pdftotext caused various formatting problems for me (even with optional tweaks), but mutool convert worked perfectly.
gPDFText converts ebook PDF content into ASCII text, reformatted for long-line paragraphs. It works for me and it has a graphical interface.
gPDFText can be obtained, how it can be installed and how it would be used to answer the OP's question. – terdon Aug 07 '14 at 16:45
pdftotext = pdfcat. – isomorphismes Mar 21 '15 at 18:55