How to extract bookmarks from a PDF file

Question

I have a large PDF file and need its bookmarks extracted to a text file or an Excel spreadsheet. I also need to validate those bookmarks. How would I do that?

vinc17 · Answer 1 · 2019-07-30T14:39:21.683

You can use pdftk to extract data (in particular, bookmarks) from PDF files.

Example: with pdftk 2.02,

pdftk file.pdf dump_data_utf8 | grep '^Bookmark'

outputs the list of bookmarks, 4 lines for each bookmark, in the form:

BookmarkBegin
BookmarkTitle: <title in UTF8>
BookmarkLevel: <number>
BookmarkPageNumber: <number>

where, for instance, level 1 corresponds to sections, level 2 to subsections, and so on. Instead of dump_data_utf8, you can use dump_data, which will give you HTML/XML numeric entities for non-ASCII characters (e.g. &#232; for "è").
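Since the question asks for a text file or a spreadsheet, the 4-line records above can be turned into CSV (which Excel opens directly) with a few lines of Python. This is only a minimal sketch: the function names parse_bookmarks and bookmarks_to_csv and the column names are made up for illustration, and it assumes every record carries the three fields shown above.

```python
import csv
import io


def parse_bookmarks(dump_text):
    """Collect the Bookmark* records from pdftk dump_data output."""
    records = []
    for line in dump_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "BookmarkBegin":
            records.append({})  # start of a new 4-line record
        elif key.startswith("Bookmark") and records:
            records[-1][key] = value.strip()
    # Assumption: each record has a title, a level and a page number.
    return [
        {
            "level": int(r["BookmarkLevel"]),
            "page": int(r["BookmarkPageNumber"]),
            "title": r["BookmarkTitle"],
        }
        for r in records
    ]


def bookmarks_to_csv(bookmarks):
    """Render the parsed bookmarks as CSV text (hypothetical column names)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["level", "page", "title"])
    writer.writeheader()
    writer.writerows(bookmarks)
    return out.getvalue()
```

One way to use it: save the functions in a script (say bookmarks.py, a hypothetical name), add two lines that read sys.stdin and print bookmarks_to_csv(parse_bookmarks(...)), and pipe the output of pdftk file.pdf dump_data_utf8 into it.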

Note: Without the grep, you can get other interesting data, such as the metadata (creation date, author, keywords, title, etc.), the number of pages and the dimensions of each page. This pdftk utility can do other things on the PDF file(s); see its man page for a full description.
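For the validation part of the question, the NumberOfPages line that appears in the full output (without the grep) can be checked against each bookmark. The sketch below is one possible reading of "validate", not something pdftk does itself: the three rules (non-empty title, page within the document, no skipped levels) and the name validate_bookmarks are my own assumptions.

```python
def validate_bookmarks(dump_text):
    """Return a list of problems found in pdftk dump_data output."""
    page_count = None
    records = []
    for line in dump_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "NumberOfPages":
            page_count = int(value)
        elif key == "BookmarkBegin":
            records.append({})
        elif key.startswith("Bookmark") and records:
            records[-1][key] = value

    problems = []
    previous_level = 0
    for index, record in enumerate(records, start=1):
        title = record.get("BookmarkTitle", "")
        level = int(record.get("BookmarkLevel", 0))
        page = int(record.get("BookmarkPageNumber", 0))
        # Assumed rule 1: every bookmark has a title.
        if not title:
            problems.append(f"bookmark {index}: empty title")
        # Assumed rule 2: the target page exists in the document.
        if page < 1 or (page_count is not None and page > page_count):
            problems.append(f"bookmark {index} ({title!r}): page {page} out of range")
        # Assumed rule 3: a bookmark is at most one level deeper than the previous one.
        if level < 1 or level > previous_level + 1:
            problems.append(
                f"bookmark {index} ({title!r}): unexpected level {level} after level {previous_level}"
            )
        previous_level = level
    return problems
```

An empty list means none of these checks failed. Feed it the complete dump_data_utf8 output rather than the grep-filtered one, otherwise the page count is unknown and only the lower bound is checked.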

Just a note to self: pdftk can export bookmark data to a text file as well as import it back. https://www.pdflabs.com/blog/export-and-import-pdf-bookmarks/ – Prem, Sep 06 '21 at 20:12

score 5 · Answer 2 · answered Jun 29 '20 at 15:26

with qpdf

This should get you started:

qpdf --json your.pdf | jq '.objects' | grep -Po 'Title": \K.*'

That command will also yield the title of the PDF, though.

Have a look at the qpdf manual regarding its JSON output.

I'm pretty sure the command can be simplified, getting rid of grep, by using jq's wildcards.
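An alternative to the jq and grep pipeline is to ask qpdf for the outlines only (qpdf --json --json-key=outlines your.pdf) and walk the result in Python. This is a rough sketch under assumptions: as far as I understand the qpdf manual, each outline entry has a "title", a "kids" list and a "destpageposfrom1" page position, but check the JSON your qpdf version actually emits. The name flatten_outlines is made up.

```python
import json


def flatten_outlines(qpdf_json_text):
    """Flatten qpdf's nested "outlines" JSON into (level, title, page) tuples."""
    data = json.loads(qpdf_json_text)
    rows = []

    def walk(items, level):
        for item in items:
            # Key names are assumptions based on the qpdf JSON documentation.
            rows.append((level, item.get("title", ""), item.get("destpageposfrom1")))
            walk(item.get("kids", []), level + 1)

    walk(data.get("outlines", []), 1)
    return rows
```

The tuples come out in document order with the nesting depth as the first element, so they can be printed or written to CSV without picking up the PDF's own title.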

Matthias Braun


Kudos for qpdf! A JSON output of bookmarks using --json --json-key=outlines could not be simpler. Easy to parse for further processing, this is what I searched for, far too long. – CodeBrauer Aug 25 '20 at 16:17

Thanks! qpdf --json --json-key=outlines test.pdf > test.json, works like a charm :) – Alex G Dec 22 '23 at 09:05

score 3 · Answer 3 · answered Jul 10 '14 at 23:49

You can use the CLI of jpdftweak to extract bookmarks in CSV format:

java -jar -Xmx512M jpdftweak.jar "file.pdf" -savebookmarks "bmarks.csv" /dev/null

After validating and possibly modifying the bookmark data, you can load it back into the PDF file with the following command:

java -jar -Xmx512M jpdftweak.jar "file.pdf" -loadbookmarks "bmarks.csv" "file_updated.pdf"

The -Xmx512M Java parameter is optional but can help with processing larger PDF files that require more memory.

You might want to read this related Q&A as well.
