How to OCR a PDF file and get the text stored within the PDF?

Question

First, apologies if this has been asked before - I searched for a while through the existing posts, but could not find support.

I am interested in a solution for Fedora to OCR a multipage non-searchable PDF and to turn this PDF into a new PDF file that contains the text layer on top of the image. On Mac OSX or Windows we could use Adobe Acrobat, but is there a solution on Linux, specifically on Fedora?

This seems to describe a solution - but unfortunately I am already lost when retrieving exact-image.

There is a problem with the nice pdfocr script that the page you are linking to recommends: it relies upon pdftk which is essentially deprecated (for two reasons, its dependence on libgcj and on iText5+). So a different solution is needed anyway... — Maxim, Mar 14 '17 at 06:04

score 116 · Accepted Answer · edited Jun 18 '22 at 11:13

ocrmypdf does a good job and can be used like this:

ocrmypdf in.pdf out.pdf

To install:

pip install ocrmypdf

or

sudo apt install ocrmypdf     # ubuntu
sudo dnf -y install ocrmypdf  # fedora
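
For more than one file, the single command above can be wrapped in a loop. The following is only a sketch: `ocr_out_name` and `ocr_dir` are hypothetical helper names, the `_ocr.pdf` suffix is an arbitrary choice, and it assumes ocrmypdf plus the Tesseract data for the requested language are already installed.

```shell
# Sketch: OCR every PDF in a directory with ocrmypdf.
# ocr_out_name and ocr_dir are illustrative helpers, not part of ocrmypdf.

# Map "dir/name.pdf" to "dir/name_ocr.pdf".
ocr_out_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# ocr_dir DIR [LANG]: OCR each *.pdf in DIR; LANG defaults to eng.
ocr_dir() {
    local dir=$1 lang=${2:-eng} f
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue                  # directory holds no PDFs
        case $f in *_ocr.pdf) continue ;; esac   # skip outputs of earlier runs
        # --skip-text leaves pages that already contain text untouched
        ocrmypdf -l "$lang" --skip-text "$f" "$(ocr_out_name "$f")"
    done
}
```

For example, `ocr_dir ~/scans eng+deu` would OCR with English and German, provided both language packs are installed.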

answered Feb 03 '18 at 19:23 by Eduard Florinescu · edited Jun 18 '22 at 11:13 by ingli

Used ocrmypdf on Fedora 30 (via dnf install) - worked like a charm. – Heinrich Ulbricht Jan 23 '20 at 13:02
Very good, thanks. Unlike the other OCR tools proposed in this thread, this one gives an output only slightly bigger than the original (image PDF). It would be even better if it could give a smaller output (text only): is that possible? – Duns Apr 16 '20 at 16:36
@Duns If you do that, the resulting text layout will have a lot of errors, so you cannot use this tool for it; there is no way to get decent-looking OCR output without the original layout. A smaller and better-looking solution would be Adobe ClearScan or an alternative to ClearScan: https://softwarerecs.stackexchange.com/questions/10242/are-there-other-products-than-adobes-that-support-clearscan-or-similiar – Eduard Florinescu Apr 16 '20 at 17:14
@MartinThoma I would leave it there; maybe there are older versions that still have the app, and maybe pypdfocr will be continued again at some point. – Eduard Florinescu Sep 07 '20 at 08:15
OCRmyPDF worked like a dream for me too. It’s based on Tesseract under the hood, so (among other things) handles many languages well: I just used it for a document in a mixture of English and Georgian (ქართული ენა) and got near-perfect results. – PLL Feb 17 '21 at 09:33
Ubuntu 20.04: When creating an OCR PDF, ocrmypdf states that jbig2enc is not installed and is needed for compression and higher-quality PDF files. jbig2enc must be built from source, and it depends on libtool (which contains both libtoolize and glibtoolize; install with sudo apt install libtool) and on Leptonica (install with sudo apt install libleptonica-dev). Then follow the jbig2enc instructions: git clone https://github.com/agl/jbig2enc, then run ./autogen.sh, ./configure, make, and sudo make install. – lawlist Sep 05 '22 at 01:00
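
The build steps from the comment above could be collected into one function, roughly as below. This is a sketch, not a tested recipe: `run` and `build_jbig2enc` are hypothetical names, the configure step is assumed to be the usual `./configure`, and `autogen.sh` may need further build tools (such as automake) that the comment does not list. Setting `DRY_RUN=1` only prints the commands, which is a cautious way to review them first.

```shell
# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Build and install jbig2enc from source, stopping at the first failure.
build_jbig2enc() {
    run sudo apt install libtool libleptonica-dev &&
    run git clone https://github.com/agl/jbig2enc &&
    run cd jbig2enc &&
    run ./autogen.sh &&
    run ./configure &&
    run make &&
    run sudo make install
}
```

Calling `DRY_RUN=1; build_jbig2enc` lists the seven commands without running any of them.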

score 22 · Answer 2 · edited Sep 07 '20 at 08:22

After learning that Tesseract can now also produce searchable PDFs, I found the script pdfsandwich: http://www.tobias-elze.de/pdfsandwich/

after installing dependencies (this might not be the complete list)

sudo dnf install svn ocaml unpaper tesseract

I followed the script's guide for compiling from source

Compile from sources

pdfsandwich is open source software (license: GPL). You can download the sources either as .tar.bz2 package from the download area on the project website or check them out by subversion:

svn checkout svn://svn.code.sf.net/p/pdfsandwich/code/trunk/src pdfsandwich

If OCaml is installed on your system, you can compile and install as follows:

cd pdfsandwich
./configure
make
sudo make install

and this now allows me to run

pdfsandwich multipaged-non-searchable.pdf

resulting in a searchable PDF.

Here is a list of repositories (e.g., Debian Stable, AUR, Homebrew) containing pdfsandwich.

answered Aug 04 '16 at 15:39 by ingli · edited Sep 07 '20 at 08:22 by Matthias Braun

for a related, but separate question, building on this one, see https://unix.stackexchange.com/questions/306051/tesseract-is-it-possible-to-change-font-output-in-ocred-pdf – ingli Aug 27 '16 at 18:25
FWIW: pdfsandwich is also available in Ubuntu's apt package repository. Other distros might have it as well. – Laurence Gonsalves Mar 14 '18 at 06:25
Just came across https://fedoramagazine.org/4-cool-new-projects-try-copr-october-2018/ showing a COPR package for fedora that packages pdfsandwich – ingli Oct 26 '18 at 08:59

score 8 · Answer 3 · answered Oct 18 '18 at 04:14

An easy tool available in Ubuntu is 'ocrfeeder'. It allows the generation of PDFs with OCR text overlaid on the original documents. It makes use of Tesseract plus other OCR engines (not sure which) and provides image rotation, 'unpaper', etc., as well.

score 6 · Answer 4 · edited Jan 20 '22 at 07:07

I had this same problem so I wrote this over the weekend. Give it a shot; it works great! It is a simple wrapper around tesseract. It uses pdftoppm to convert a PDF into a bunch of TIFF files, then it uses tesseract to perform OCR (Optical Character Recognition) on them and produce a searchable PDF as output. All intermediate temporary files are automatically deleted when the script completes.

Source code: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF
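
The pipeline described above (pdftoppm to TIFF, then tesseract to PDF) can be sketched in a few lines. This is an illustration of the idea and not the actual PDF2SearchablePDF script: `searchable_name` and `make_searchable` are hypothetical names, 300 dpi is an assumed resolution, and pdftoppm (poppler-utils) and tesseract are assumed to be installed.

```shell
# searchable_name: "mypdf.pdf" -> "mypdf_searchable.pdf"
searchable_name() {
    printf '%s_searchable.pdf\n' "${1%.pdf}"
}

# make_searchable INPUT.pdf [LANG]
# The body runs in a subshell so the EXIT trap cleans up only this call.
make_searchable() (
    in=$1
    lang=${2:-eng}
    out=$(searchable_name "$in")
    tmp=$(mktemp -d) || exit 1
    trap 'rm -rf "$tmp"' EXIT

    # 1. Rasterize each page to a TIFF at 300 dpi (pg-1.tif, pg-2.tif, ...)
    pdftoppm -tiff -r 300 "$in" "$tmp/pg" || exit 1

    # 2. List the page images; pdftoppm zero-pads page numbers,
    #    so glob order matches page order
    printf '%s\n' "$tmp"/pg-*.tif > "$tmp/pages.txt"

    # 3. OCR all listed pages into one PDF with a text layer
    tesseract "$tmp/pages.txt" "${out%.pdf}" -l "$lang" pdf
)
```

Passing tesseract a text file that lists the images, together with the `pdf` output config, yields a single multi-page PDF; the temporary TIFFs are removed when the function returns.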

Instructions to install & use `pdf2searchablepdf`:

Tested on Ubuntu 18.04 on 11 Nov 2019 and on Ubuntu 20.04 Nov. 2020.

Install:

git clone https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF.git
./PDF2SearchablePDF/install.sh
sudo apt update
sudo apt install tesseract-ocr

Use:

# General:
pdf2searchablepdf [options] <input.pdf|dir_of_imgs> [lang]
# Make a PDF searchable:
pdf2searchablepdf mypdf.pdf
# Make an entire directory of images into a single searchable PDF:
pdf2searchablepdf directory_of_imgs

You'll now have a pdf called mypdf_searchable.pdf, which contains searchable text!

Done. It has no python dependencies, as it's currently written entirely in bash.

See pdf2searchablepdf -h for the help menu and more options and examples.

References or Related Resources:

PDF2SearchablePDF: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF
https://askubuntu.com/questions/473843/how-to-turn-a-pdf-into-a-text-searchable-pdf/1187881#1187881
https://askubuntu.com/questions/16268/whats-the-best-simplest-ocr-solution
https://askubuntu.com/questions/150100/extracting-embedded-images-from-a-pdf/1187844#1187844
pdfsandwich: Alternative software wrapper I just discovered, that is worth checking out too! http://www.tobias-elze.de/pdfsandwich/

answered Nov 11 '19 at 09:22 by Gabriel Staples · edited Jan 20 '22 at 07:07

Good utility. One thing you might do is add support for file names with spaces in them. Right now, that doesn't work (you get a usage message for pdftoppm). Just adding a few quotation marks in some of the commands should do it. – Wilson F Jan 04 '20 at 01:01
Thanks for the feedback! I'll see when I can make the change and test it. I opened an issue here: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF/issues/6 – Gabriel Staples Jan 05 '20 at 08:51
@WilsonF, done! v0.4.0 just released to resolve this issue. https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF/releases – Gabriel Staples Mar 15 '20 at 03:54
Excellent, thank you! (Trying to add an '@' tag in edit, but it isn't taking it.) – Wilson F Mar 19 '20 at 18:17
the problem is that the output file, for me (but in examples as well), is 100% bigger than original image pdf... – Duns Apr 16 '20 at 16:30
Yeah I've got an issue open for that. I'll fix it when able. It requires generating one high-res image for the OCR and another low-res image to be overlaid onto the OCRed text to regenerate a PDF after doing the OCR processing. See here: Allow user to set the compression level of the output pdf (https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF/issues/5) – Gabriel Staples Apr 16 '20 at 19:10
@Duns, I still have to work out many of the details for a better long-term solution built in to my tool, but meanwhile, for some PDF compression options, see my answer here: https://askubuntu.com/questions/113544/how-can-i-reduce-the-file-size-of-a-scanned-pdf-file/1303196#1303196. – Gabriel Staples Mar 02 '21 at 08:53
Excellent utility. Faster and more stable than my brief experiments with ocrmypdf and pdfsandwich. This on Ubuntu 18.04 and only a couple of PDF scanned documents as images. My issues arose when having relatively high resolutions (300dpi). – Patrick Refondini Dec 01 '21 at 14:57
@PatrickRefondini, are you saying your "issues arose" when using my pdf2searchablepdf tool on resolutions >= 300 dpi, or when using one of the other tools you mentioned? With which tool did you have issues on high resolutions? – Gabriel Staples Dec 01 '21 at 16:35
@GabrielStaples pdf2searchablepdf is fast and stable. I had issues with ocrmypdf and pdfsandwich when doing only a couple of tests with resolutions >= 300 dpi. – Patrick Refondini Dec 03 '21 at 15:13
