Is there any ligature-aware alternative for "pdfgrep" in command line?

Question

I always use pdfgrep to search inside multiple PDF files from the command line, but I ran into a problem with the ligature character "ﬁ" (U+FB01, see https://www.compart.com/en/unicode/U+FB01). The word "fixed" in my PDF is stored with "ﬁ", so pdfgrep -iR 'fixed point operator' does not find the term "fixed point operator". However, when I open the file in a PDF reader such as Foxit Reader or Evince, "ﬁ" is split into "f" and "i" and is therefore searchable. Is there a more reliable alternative to pdfgrep? Or does pdfgrep have an option that expands such characters?

The PDF file is http://direct.mit.edu/books/chapter-pdf/238450/9780262321037_can.pdf .

Ubuntu 20.04, amd64, kernel version Linux 5.6.0-1018-oem. pdfgrep has an option --unac, but when pdfgrep is installed with sudo apt-get install pdfgrep, using --unac reports "pdfgrep: UNAC support disabled at compile time!"

pdfgrep:
  Installed: 2.1.2-1build1
  Candidate: 2.1.2-1build1
  Version table:
 *** 2.1.2-1build1 500
        500 http://mirrors.huaweicloud.com/ubuntu focal/universe amd64 Packages
        100 /var/lib/dpkg/status

That character is called a ligature (= combined letters): https://en.wikipedia.org/wiki/Orthographic_ligature — Oskar Skog, Aug 29 '20 at 17:35
(1) I tried to download the PDF file, but it required a login. (2) Can you use the ligature in the pdfgrep command; e.g., pdfgrep -iR 'ﬁxed point operator'? — G-Man Says 'Reinstate Monica', Aug 30 '20 at 09:14
(1) I also searched, but did not find a free download. I downloaded it through organizational access, and I am not sure whether it is appropriate to upload that PDF file. (2) If I use the ligature "ﬁ" in the command, the search results are shown. But I cannot anticipate every place where a ligature might occur. — la la, Aug 30 '20 at 09:45
Just spitballing ideas here. (1) Maybe you could extract the text from the pdf (is pdftotext a thing?) and pipe it to a unicode canonicalization utility to get something greppable. (2) If you are comfortable with building packages yourself you can try figuring out how to build pdfgrep with unac enabled. I think the first idea sounds easier though. — jw013, Sep 02 '20 at 16:16
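The first idea in the comment above could be sketched roughly as follows. This is only a sketch: expand_ligatures and pdf_lig_grep are made-up helper names, it assumes pdftotext (from poppler-utils) is installed, and it only expands the five Latin f-ligatures U+FB00 to U+FB04 rather than doing full Unicode normalization.

```shell
# Expand the Latin f-ligatures (U+FB00..U+FB04) read from stdin into plain letters.
# Each ligature is its own code point, so the order of the rules does not matter.
expand_ligatures() {
    sed -e 's/ﬀ/ff/g' -e 's/ﬁ/fi/g' -e 's/ﬂ/fl/g' -e 's/ﬃ/ffi/g' -e 's/ﬄ/ffl/g'
}

# Hypothetical helper: case-insensitive fixed-string search in one PDF.
# $1 = phrase, $2 = PDF file. "pdftotext FILE -" writes the text to stdout.
pdf_lig_grep() {
    pdftotext "$2" - | expand_ligatures | grep -i -F -- "$1"
}

# Example (not run here): pdf_lig_grep 'fixed point operator' some.pdf
```

Because grep works line by line on the extracted text, a phrase that is broken across two lines in the PDF would still be missed.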

score 1 · Answer 1 · answered Dec 28 '20 at 23:56

To solve this problem, first use pdftotext to find out what the ligature looks like as UTF-8 text. For example, I ran this:

pdftotext -f 11 -l 13 ~/Mathematics/Analysis/MeasureTheory.pdf text && cat text

and got a line of output that looks like this:

   1.6.  Inﬁnite and σ-ﬁnite measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

From this I can see that the "fi" is in fact the single ligature character "ﬁ" (U+FB01). My terminal displays it as a telephone symbol ☎, although a browser renders it as fi.

So I continued with pdfgrep, using the ligature character itself as the pattern:

pdfgrep --page-range=11-13 ﬁ ~/Mathematics/Analysis/MeasureTheory.pdf

Finally, I got the desired results:

   1.6.  Inﬁnite and σ-ﬁnite measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
   2.4.  The general deﬁnition of the Lebesgue integral . . . . . . . . . . . . . . 118
   2.6.  Integration with respect to inﬁnite measures . . . . . . . . . . . . . . . . 124
   3.5.  Inﬁnite products of measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
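To avoid typing the ligature by hand, one option is to generate a pattern that accepts both spellings. A rough sketch, assuming pdfgrep's default extended-regex matching and a UTF-8 locale; ligature_pattern is a hypothetical helper, it only handles "fi" and "fl", and any regex metacharacters in the phrase are passed through unescaped:

```shell
# Turn a plain phrase into an extended regex that matches either the
# two-letter spelling or the single-character ligature for "fi" and "fl".
ligature_pattern() {
    printf '%s\n' "$1" | sed -e 's/fi/(fi|ﬁ)/g' -e 's/fl/(fl|ﬂ)/g'
}

# Example (not run here):
#   pdfgrep -iR "$(ligature_pattern 'fixed point operator')" .
```

Words stored with the three-letter ligatures (such as "ﬃ") would need extra rules.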
