Extract TOC of epub file

Question

I recently came across a command that prints the table of contents (TOC) of a PDF file:

mutool show file.pdf outline

I'd like a command for the EPUB format that is just as simple to use and gives equally clean output.

Is there something like that?

cas · Accepted Answer · 2016-05-20T05:07:18.013

.epub files are .zip files containing XHTML and CSS and some other files (including images, various metadata files, and maybe an XML file called toc.ncx containing the table of contents).

The following script uses unzip -p to extract toc.ncx to stdout, pipes it through the xml2 command, then uses sed to extract just the text of each chapter heading.

It takes one or more filename arguments on the command line.

#! /bin/sh

# This script needs InfoZIP's unzip program
# and the xml2 tool from http://ofb.net/~egnor/xml2/
# and sed, of course.

for f in "$@" ; do
    echo "$f:"
    unzip -p "$f" toc.ncx | 
        xml2 | 
        sed -n -e 's:^/ncx/navMap/navPoint/navLabel/text=:  :p'
    echo
done

It outputs the epub's filename followed by a :, then indents each chapter title by two spaces on the following lines. For example:

book.epub:
  Chapter One
  Chapter Two
  Chapter Three
  Chapter Four
  Chapter Five

book2.epub:
  Chapter One
  Chapter Two
  Chapter Three
  Chapter Four
  Chapter Five

If an epub file doesn't contain a toc.ncx, you'll see output like this for that particular book:

book3.epub:
caution: filename not matched:  toc.ncx
error: Extra content at the end of the document

The first error line is from unzip, the second from xml2. xml2 will also warn about other errors it finds - e.g. an improperly formatted toc.ncx file.

Note that the error messages are on stderr, while the book's filename is still on stdout.

xml2 is available pre-packaged for Debian, Ubuntu and other Debian derivatives, and probably most other Linux distros too.

For simple tasks like this (i.e. where you just want to convert XML into a line-oriented format for use with sed, awk, cut, grep, etc), xml2 is simpler and easier to use than xmlstarlet.

BTW, if you want to print the epub's title as well, change the sed script to:

sed -n -e 's:^/ncx/navMap/navPoint/navLabel/text=:  :p
           s!^/ncx/docTitle/text=!  Title: !p'

or replace it with an awk script:

awk -F= '/(navLabel|docTitle)\/text/ {print $2}'
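Note that the sed expression above only matches top-level navPoint entries, so sub-chapters nested inside another navPoint are skipped. As a rough sketch (not part of the original script), the function below reads the flattened xml2 output on stdin and indents each heading by its nesting depth. The function name ncx_outline is made up for illustration, and the sample input is hand-written in the shape xml2 is assumed to produce.

```shell
# Hypothetical helper: print NCX headings indented by nesting depth.
# Input: flattened xml2 output of a toc.ncx on stdin.
ncx_outline() {
    awk '
        /^\/ncx\/navMap(\/navPoint)+\/navLabel\/text=/ {
            eq    = index($0, "=")               # first "=" separates path and text
            path  = substr($0, 1, eq - 1)
            title = substr($0, eq + 1)
            depth = gsub(/\/navPoint/, "", path) # count of navPoint levels in the path
            indent = ""
            for (i = 0; i < depth; i++) indent = indent "  "
            print indent title
        }
    '
}

# Hand-written sample input, so the sketch can be tried without xml2 installed.
printf '%s\n' \
    '/ncx/docTitle/text=My Book' \
    '/ncx/navMap/navPoint/navLabel/text=Part I' \
    '/ncx/navMap/navPoint/navPoint/navLabel/text=Chapter 1' \
    '/ncx/navMap/navPoint' \
    '/ncx/navMap/navPoint/navLabel/text=Part II' |
    ncx_outline
```

In the script it would take the place of the sed command: unzip -p "$f" toc.ncx | xml2 | ncx_outline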

score 1 · Answer 2 · answered Jan 25 '17 at 21:13

While the answer provided by @cas works in some cases, it assumes EPUB version 2.0, with an NCX document named toc.ncx at the top level of the zip container. Out of 223 epubs I have in one folder, only 5 still meet this assumption, and those only contain it for compatibility with older reader systems. The toc.ncx is not a required file; what is required is META-INF/container.xml, which points to the OPF package file, and that in turn lists all the other elements of the epub. This makes scripting in sh a little more complex, but possible. Here's a script that pulls the title and author from the OPF file (located via container.xml):

#! /bin/sh

for f in "$@" ; do
    printf '%s   ' "$f"
    opf=$(unzip -p "$f" META-INF/container.xml | 
        xml2 | 
        sed -n -e 's:^/container/rootfiles/rootfile/@full-path=::p')
    unzip -p "$f" "$opf" |
        xml2 |
        sed -n -e 's!^/package/metadata/dc:title=!  !p' | tr '\n' ' '
    unzip -p "$f" "$opf" |
        xml2 |
        sed -n -e 's!^/package/metadata/dc:creator=!    !p' | tr '\n' ' '
    echo
done

Yes, it parses the opf twice, in order to ensure the order of the results - this generates a tab separated 3-column file (those are tabs in the sed lines between the two bangs), suitable for spreadsheet import.

Going one more step to find the ncx file is a bit trickier, since using xml2 to generate a single line for each tag and attribute works against us here: we need the value of the href attribute of the item whose media-type attribute is equal to application/x-dtbncx+xml. We can cheat a bit and hope the original item element is all on one line, use grep to extract just that line, then process it with xml2 to get the href value.
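If the item element happens to be split over several lines, the grep shortcut will miss it. One possible alternative, sketched here under assumptions, is to work on the flattened xml2 output of the whole OPF and pair up the attributes of each item. It assumes that xml2 prints a bare /package/manifest/item line between sibling items and that the OPF elements carry no namespace prefix. The function name ncx_href is made up for illustration, and the sample input is hand-written.

```shell
# Hypothetical helper: read flattened xml2 output of an OPF file on stdin and
# print the href of the manifest item whose media-type is the NCX type.
ncx_href() {
    awk '
        # Print the href of the item just finished if it is the NCX, then reset.
        function emit() {
            if (type == "application/x-dtbncx+xml" && href != "") print href
            href = ""; type = ""
        }
        BEGIN {
            item = "/package/manifest/item"
            h = item "/@href="
            m = item "/@media-type="
        }
        $0 == item        { emit(); next }    # separator line: a new item starts
        index($0, h) == 1 { href = substr($0, length(h) + 1); next }
        index($0, m) == 1 { type = substr($0, length(m) + 1); next }
        END { emit() }                        # the last item has no separator after it
    '
}

# Hand-written sample input; attribute order within an item does not matter.
printf '%s\n' \
    '/package/manifest/item/@id=c1' \
    '/package/manifest/item/@href=ch1.xhtml' \
    '/package/manifest/item/@media-type=application/xhtml+xml' \
    '/package/manifest/item' \
    '/package/manifest/item/@media-type=application/x-dtbncx+xml' \
    '/package/manifest/item/@href=toc.ncx' |
    ncx_href
```

With a real file this would be used as: unzip -p "$f" "$opf" | xml2 | ncx_href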

Since that href is a URL relative to the OPF file, we also need the directory part of the OPF path. Putting it all together gives us:

#! /bin/sh

for f in "$@" ; do
    echo "$f""  "
    opf=$(unzip -p "$f" META-INF/container.xml | 
        xml2 | 
        sed -n -e 's:^/container/rootfiles/rootfile/@full-path=::p')
    ncx=$(unzip -p "$f" "$opf" |
        grep -F 'application/x-dtbncx+xml' |
        xml2 |
        sed -n -e 's!^/item/@href=!!p')
    opf_filename=${opf##*/}
    opf_path=${opf%"$opf_filename"}
    unzip -p "$f" "${opf_path}${ncx}" |
        xml2 |
        sed -n -e 's:^/ncx/navMap/navPoint/navLabel/text=:  :p
                   s!^/ncx/docTitle/text=!Title: !p'
done

This still makes assumptions, the strongest being that these are EPUB 2 compatible files and therefore contain an ncx file somewhere. EPUB 3 documents use a different, HTML-based nav format. Even so, I do get TOCs for all 223 of my test files (though some lack titles in the ncx).
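The path handling in the last script can also be pulled out into a small function, which makes the two cases (OPF in a subdirectory, OPF at the archive root) easier to see. This is only a sketch: the name resolve_href is made up, and it does not normalise hrefs containing ../ components.

```shell
# Hypothetical helper: join the directory part of the OPF path ($1) with a
# manifest href ($2) that is relative to the OPF file.
resolve_href() {
    case $1 in
        */*) printf '%s/%s\n' "${1%/*}" "$2" ;;   # OPF lives in a subdirectory
        *)   printf '%s\n' "$2" ;;                # OPF sits at the archive root
    esac
}

resolve_href OEBPS/content.opf toc.ncx    # prints OEBPS/toc.ncx
resolve_href content.opf toc.ncx          # prints toc.ncx
```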
