tesseract: is it possible to change font output in OCRed pdf?

Question

Following up on "How to OCR a pdf file and get the text stored within pdf?", I have successfully produced OCRed pdf pages.

In Evince, however, the letters are not shown: I cannot see the characters, but I can select them, copy them and paste them elsewhere successfully. This does not seem to be a bug in Evince: https://bugzilla.redhat.com/show_bug.cgi?id=1364201

When initiating an OCR of a pdf page with pdfsandwich, tesseract produces a page that "contains a font which doesn't have any usable glyphs (they named it GlyphLessFont). It has only .notdef and .null replacements (the squares). Evince uses the .notdef glyph if there is no glyph for the character. The reason that Okular highlights the text is that it does it in the image, not as regular text as Evince does."
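To check whether a given OCRed file actually carries this font, a rough sketch in Python could look for the font name in the raw file. The function name `uses_glyphless_font` is made up for illustration, and the check assumes that the font dictionary is stored uncompressed and that the name appears literally as `/GlyphLessFont`; a PDF rewritten with compressed object streams could slip past it.

```python
def uses_glyphless_font(pdf_bytes: bytes) -> bool:
    """Return True if the raw PDF bytes mention a font named GlyphLessFont.

    Assumption: the font dictionary is stored uncompressed, so the PDF name
    (written with a leading slash) is visible in the raw bytes. A font
    dictionary packed into a compressed object stream would be missed.
    """
    return b"/GlyphLessFont" in pdf_bytes
```

It can be called on the bytes of a file opened in binary mode, for example `uses_glyphless_font(open("page_ocr.pdf", "rb").read())`, where `page_ocr.pdf` is a placeholder file name. A `True` result is only a hint that the invisible text layer uses the glyphless font; it says nothing about whether the text itself is correct.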

pdftotext recognises the characters.

Now, the question is: can tesseract be told to use a different font?

Does pdftotext recognize the characters? (you may need to tweak character encoding) — grochmal, Aug 27 '16 at 13:28
Short answer, no. More discussion: https://github.com/tesseract-ocr/tesseract/issues/1769 — We Are All Monica, Sep 21 '20 at 03:18
@WeAreAllMonica pdftotext outputs recognizable characters now (at least as of 5.0alpha). — Geremia, May 07 '21 at 21:53

score 2 · Answer 1 · answered Mar 22 '17 at 21:39

You could customise the relevant portion of the source code (Renderer.h in the Tesseract GitHub repository) to your liking and change the font there. You will have to rebuild tesseract from source once you make the change.

Adit Sharda

Thanks, but would you mind adding a bit more to your answer on how I would approach that? Sorry, I am not a programmer, but I can follow guidelines for rebuilding :) Trying to understand https://github.com/tesseract-ocr/tesseract/blob/master/CONTRIBUTING.md does not even help me much to decide how to suggest such a change as a feature request. – ingli Mar 24 '17 at 14:56
