275

Is there a way to search PDF files using grep, without converting to text first in Ubuntu?

Dervin Thunk

19 Answers

304

Install the package pdfgrep, then use the command:

find /path -iname '*.pdf' -exec pdfgrep pattern {} +

Here pattern is the text you are searching for; find substitutes each matching PDF's file name for {}, and the trailing + makes it pass as many file names as possible to each pdfgrep invocation.

——————

Simplest way to do that:

pdfgrep 'pattern' *.pdf
pdfgrep 'pattern' file.pdf 
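
If you also want each match prefixed with the file name and the page number (pdfgrep options mentioned in the comments below), something along these lines should work:

pdfgrep -H -n 'pattern' *.pdf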
AdminBee
enzotib
  • 8
    This works in mac osx (Mavericks) as well. Install it using brew. Simple. Thanks. – mikiemorales Jan 23 '14 at 01:28
  • 9
    Out of curiosity I checked the source of pdfgrep and it uses poppler to extract strings from the pdf. Almost exactly as @wag's answer only pagewise rather than, presumably, the entire document. – vowel-house-might Sep 16 '14 at 11:11
  • 13
    pdfgrep also has a recursive flag. So this answer could perhaps be reduced to: pdfgrep -R pattern /path/. Though it might be less effective if it goes through every file even if it isn't a PDF. And I notice that it has issues with international characters such as å, ä and ö. – Rovanion Jan 14 '16 at 12:11
  • 1
    Actually, the -n option is a pro for pdfgrep as it allows to include the page number in the output (might be helpful for further processing). – JepZ Nov 10 '17 at 20:18
  • 6
    This answer would be easier to use if it explained which bits of the command are meant to be copied literally and which are placeholders. What's pattern? What's {}? What's up with the +? I have no idea upon first reading... so off to the manpage I go, I suppose. – Mark Amery Apr 20 '18 at 14:44
  • 3
    @MarkAmery This answer is unnecessarily complex because it uses find. The usage is simply pdfgrep 'pattern' file.pdf. The {} is just a way to drop the file name in from find. – Jonathan Cross Aug 28 '18 at 10:30
  • 2
    As pdfgrep is quite slow, you can increase the speed by using parallel find: find . -type f -iname \*.pdf -print0 | xargs -0 -P 4 -L 1 pdfgrep -H -n pattern. That obviously depends on the number of CPUs and available IO. – reox Nov 14 '19 at 19:51
  • see here for examples too https://stackoverflow.com/questions/4643438/how-to-search-contents-of-multiple-pdf-files – Matthew Lock Sep 20 '21 at 06:13
  • 1
    Recursive search worked on my Debian just with find, e.g.: find . -iname '*.pdf' -exec pdfgrep 'wandtank' {} + – Christian Schulzendorff Aug 22 '23 at 05:44
  • Upon some initial testing, it doesn't successfully find the files that contain the words. – bugrasan Jan 22 '24 at 11:59
  • It worked excellently on Ubuntu 22.04 LTS – Pablo Adames Feb 19 '24 at 00:55
85

If you have poppler-utils installed (default on Ubuntu Desktop), you could "convert" it on the fly and pipe it to grep:

pdftotext my.pdf - | grep 'pattern'

This won't create a .txt file.
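
grep's usual options work on the converted text stream as well; for example, to show line numbers and two lines of context around each match ('pattern' is a placeholder, and the line numbers refer to the extracted text, not to PDF pages):

pdftotext my.pdf - | grep -n -C 2 'pattern'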

wag
  • 1
    so .. you extract the text before you grep it which means the answer is "no". – akira Jan 31 '11 at 15:18
  • 20
    @akira The OP probably meant "without opening the PDF in a viewer and exporting to text" – Michael Mrozek Jan 31 '11 at 17:36
  • @Michael Mrozek: then that should be clarified by OP, but "grep only" is pretty clear to me. – akira Jan 31 '11 at 18:38
  • 6
    @akira Where do you see "grep only"? – Michael Mrozek Jan 31 '11 at 18:55
  • @Michael Mrozek: "the power of grep, without converting the text" is pretty clear. "grep foo *.pdf". – akira Feb 01 '11 at 05:21
  • 6
    @akira Well, I already said what I think he probably meant; he doesn't want to export to text before processing it. I very much doubt he has a problem with any command that converts to text in any way; there's no reason not to – Michael Mrozek Feb 01 '11 at 05:52
  • @Michael Mrozek: "then that should be clarified by OP". – akira Feb 01 '11 at 07:58
  • 1
    Nice solution. What is the purpose of the - character preceding the pipe? I observed that without it, or with any other character, a file is created with that name and the grep is not executed as expected. Is this how all linux piping is done when the intermediate file is unnecessary? – sherrellbc Dec 04 '15 at 18:42
  • 3
    @sherrellbc The second argument of pdftotext is the filename it should write to. However, by convention, tools typically allow you to write to stdout instead of to a file by specifying a - instead. Similarly, some tools would write to stdout by default if you omit such an argument entirely (but this is not always possible without creating ambiguity). – Joost Sep 23 '16 at 14:06
  • @Joost what if one wants to pipe pdftotext to find? I guess while read is in order. – Erwann Feb 13 '22 at 06:12
33

pdfgrep was written for exactly this purpose and is available in Ubuntu.

It tries to be mostly compatible with grep and thus provides "the power of grep", only specialized for PDFs. That includes common grep options, such as --recursive, --ignore-case or --color.

In contrast to pdftotext | grep, pdfgrep can output the page number of a match in a performant way and is generally faster when it doesn't have to search the whole document (e.g. --max-count or --quiet).

The basic usage is:

pdfgrep PATTERN FILE...

where PATTERN is your search string and FILE a list of filenames (or wildcards in a shell).

See the manpage for more info.
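
For example, a recursive, case-insensitive search that also prints the page number of each match (the pattern and path here are placeholders):

pdfgrep --recursive --ignore-case -n 'pattern' /path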

hpdeifel
8

There is a duplicate question on Stack Overflow. The people there suggest a variation of harish.venkarts' answer:

find /path -name '*.pdf' -exec sh -c 'pdftotext "{}" - | grep --with-filename --label="{}" --color "your pattern"' \;

The advantage over the similar answer here is the --with-filename flag for grep. This is somewhat superior to pdfgrep as well, because the standard grep has more features.

https://stackoverflow.com/questions/4643438/how-to-search-contents-of-multiple-pdf-files

user7610
  • I think it would have been better to leave this as a comment (or edit) in the similar answer you are referring to. – Bernhard May 09 '14 at 12:07
7

No.

A PDF consists of chunks of data, some of them text, some of them pictures and some of them really magical fancy XYZ (e.g. .u3d files). Those chunks are compressed most of the time (e.g. with flate; see http://www.verypdf.com/pdfinfoeditor/compression.htm). In order to 'grep' a .pdf you have to reverse the compression, i.e. extract the text.
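
As a rough illustration, what typically shows up in the raw bytes of a PDF is the name of the compression filter rather than the document's text (file.pdf is a placeholder here):

grep -a -c 'FlateDecode' file.pdf

This counts the lines mentioning the /FlateDecode filter; the page text itself sits inside the compressed streams, invisible to grep.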

You can do that either per file with tools such as pdftotext (and then grep the result), or you run an 'indexer' (look at xapian.org or lucene) which builds a searchable index out of your .pdf files, and then you can use the search engine tools of that indexer to get at the content of the PDFs.

But no, you can not grep pdf files and hope for reliable answers without extracting the text first.

akira
  • 15
    Considering pdfgrep exists (see above), a flat "no" is incorrect. – Jonathan Cross Aug 28 '18 at 10:18
  • 3
    @JonathanCross, considering the question says "using the power of grep, without converting to text first", a flat "no" is correct. – Jivan Pal Nov 24 '20 at 08:58
  • A good way to ask for an alternative without explicitly requesting an alternative. Anyway, lucene as indexer works generally well. – ChoCho Apr 04 '23 at 15:02
  • @JonathanCross The verb convert is ambiguous. To insist on one interpretation instead of a possibly different one in the mind of the OP is uncharitable pedantry. – Pound Hash Sep 10 '23 at 21:08
  • @JonathanCross :) "use grep without convert to text" is the limiting part of the question. A question like "can I use $something on CLI to search for text in PDFs" would have yielded a different answer from me … or none at all because those pdftotext answers exist already. – akira Sep 21 '23 at 06:04
7

Recoll can search PDFs. It doesn't support regular expressions, but it has lots of other search options, so it might fit your needs.
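
Once the index is built, Recoll's command-line query tool can also be used from scripts; a minimal sketch (the query syntax is Recoll's own, not a regular expression):

recollq 'pattern'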

Michael Mrozek
5

Take a look at the common resource grep tool crgrep, which supports searching within PDF files.

It also allows searching other resources like content nested in archives, database tables, image meta-data, POM file dependencies and web resources - and combinations of these including recursive search.

Craig
4

Here is a quick script to search PDFs in the current directory:

#!/bin/bash

if [ $# -ne 1 ]; then
    echo "usage $0 VALUE" 1>&2
    exit 1
fi

echo 'SEARCH IS CASE SENSITIVE' 1>&2

find . -name '*.pdf' -exec /bin/bash -c 'pdftotext "{}" - | grep --with-filename --label="{}" --color "$0"' "$1" \;

Nico
3

Try this:

find /path -iname '*.pdf' -print0 | while IFS= read -r -d '' i; do echo "$i"; \
    pdftotext "$i" - | grep pattern; done

to print the lines where the pattern occurs inside each PDF.

enzotib
3

You could pipe it through strings first:

cat file.pdf | strings | grep <...etc...>
Andy Smith
  • 10
    Just use strings file.pdf | grep <...>, you don't need cat – phunehehe Jan 31 '11 at 14:31
  • Yeah - my mind seems to work better with streams... :-) – Andy Smith Jan 31 '11 at 14:57
  • 14
    Won't work if the text is compressed, which it is most of the time. – akira Jan 31 '11 at 15:18
  • 8
    Even if the text is uncompressed, it's generally small pieces of sentences (not even necessarily whole words!) finely intermixed with formatting information. Not very friendly for strings or grep. – Jander Jan 31 '11 at 16:08
  • Can you think of another reason why using strings for this wouldn't work? I found that using strings works on some PDFs but not others. – hourback Nov 24 '15 at 19:58
2

cd to the folder containing your PDF file and then:

pdfgrep 'pattern' your.pdf

or, if you want to search more than just one PDF file (e.g. all PDF files in your folder):

pdfgrep 'pattern'  `ls *.pdf`

or

pdfgrep 'pattern' $(ls *.pdf)
2

If you just want to search for PDF names/properties... or simple strings that are not compressed or encoded, then instead of strings you can use the following:

grep -a STRING file.pdf
cat -v file.pdf | grep STRING
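
For instance, document properties are often stored as plain, uncompressed strings, so something like this may turn up the title even when the page text itself is compressed (file.pdf is a placeholder):

grep -a '/Title' file.pdf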

From grep --help:

      --binary-files=TYPE   assume that binary files are TYPE;
                            TYPE is 'binary', 'text', or 'without-match'
  -a, --text                equivalent to --binary-files=text

and cat --help:

  -v, --show-nonprinting   use ^ and M- notation, except for LFD and TAB
phuclv
1

I assume you mean not converting it on disk; you can convert PDFs to stdout with pdftotext and then grep that. Grepping the PDF without any sort of conversion is not a practical approach, since PDF is mostly a binary format.

In the directory:

ls -1 ./*.pdf | xargs -L1 -I {} pdftotext {}  - | grep "keyword"

or in the directory and its subdirectories:

tree -fai . | grep -P '\.pdf$' | xargs -L1 -I {} pdftotext {}  - | grep "keyword"

Also, because some PDFs are scans, they need to be OCRed first. Here is a pretty simple way to find all PDFs that cannot be grepped and OCR them.

I noticed that if a PDF file doesn't have any fonts, it is usually not searchable. Knowing this, we can use pdffonts.
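
You can check a single file by hand first (file.pdf is a placeholder name):

pdffonts file.pdf | wc -l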

The first two lines of pdffonts output are the table header, so a searchable file produces more than two lines of output. Knowing this, we can create:

gedit check_pdf_searchable.sh

then paste this

#!/bin/bash
#set -vx
# If pdffonts lists no fonts (only the two header lines), the PDF has no
# embedded text layer, so run it through pypdfocr.
if (( $(pdffonts "$1" | wc -l) < 3 )); then
    echo "$1"
    pypdfocr "$1"
fi

then make it executable

chmod +x check_pdf_searchable.sh

then list all non-searchable pdfs in the directory:

ls -1 ./*.pdf | xargs -L1 -I {} ./check_pdf_searchable.sh {}

or in the directory and its subdirectories:

tree -fai . | grep -P '\.pdf$' | xargs -L1 -I {} ./check_pdf_searchable.sh {}
1
pdfgrep -r --include "*.pdf" -i 'pattern'

This searches the current directory recursively (-r), looks only at files whose names match *.pdf (--include), and ignores case (-i).
Gavin Gao
  • 3
    Welcome to the site, and thank you for your contribution. Could you add some explanation on what these options mean? This could also help explain how your approach differs from other answers to this question that also recommend pdfgrep. – AdminBee Aug 17 '20 at 09:53
1

ripgrep-all (or rga) enables ripgrep functionality on multiple file types, including PDFs.
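
A minimal usage sketch, assuming rga is installed ('pattern' and /path are placeholders); it takes ripgrep-style arguments:

rga 'pattern' /path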

Sjoerd
1

gpdf might be what you need if you're using GNOME! Check this in case you're not using GNOME; it has a list of CLI PDF viewers. Then you can use grep to find your pattern.

Rui F Ribeiro
Dharmit
0

The quickest way is:

grep -rinw "pattern" --include \*.pdf *
Stephen Kitt
Parth
  • 1
    Welcome to the site. Would you mind adding more explanation to your proposed solution to make it more accessible to the non-expert? For example, your grep command-line searches recursively in sub-directories which someone not familiar with grep might be unaware of. Also, you included the -i flag although ignoring the case may not always be what the user wants. In addition, please explain in what way your approach differs from the answer of e.g. @phuclv and others. – AdminBee Jan 21 '20 at 08:12
  • 1
    As AdminBee says, the question doesn’t ask for a case-insensitive search or a recursive directory search. The -n and -w options aren’t justified by the question, either. But, more importantly, this answer tells how to search through text files whose names end with .pdf — you’ve missed the point of the question. – G-Man Says 'Reinstate Monica' Jan 21 '20 at 08:22
0

Put this in your ~/.bashrc:

LESSOPEN="|/usr/bin/lesspipe %s"; export LESSOPEN    

Then you can use less:

less mypdf.pdf | grep "Hello, World"

See https://www.zeuthen.desy.de/~friebel/unix/lesspipe.html for more about this.

Prem
0

If you want to use a GUI you can try pdfgrepgui.

AdminBee