.epub files are .zip files containing XHTML, CSS, and some other files (including images, various metadata files, and maybe an XML file called toc.ncx containing the table of contents).

The following script uses unzip -p to extract toc.ncx to stdout, pipes it through the xml2 command, then uses sed to extract just the text of each chapter heading.

It takes one or more filename arguments on the command line.
#! /bin/sh

# This script needs InfoZIP's unzip program
# and the xml2 tool from http://ofb.net/~egnor/xml2/
# and sed, of course.

for f in "$@" ; do
    echo "$f:"
    unzip -p "$f" toc.ncx |
        xml2 |
        sed -n -e 's:^/ncx/navMap/navPoint/navLabel/text=:  :p'
    echo
done
It prints each epub's filename followed by a colon, then each chapter title on its own line, indented by two spaces. For example:
book.epub:
  Chapter One
  Chapter Two
  Chapter Three
  Chapter Four
  Chapter Five

book2.epub:
  Chapter One
  Chapter Two
  Chapter Three
  Chapter Four
  Chapter Five
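To see what the sed step is doing on its own, here is a minimal sketch that feeds it a hand-written, simplified sample of the line-oriented path=value format that xml2 produces. The sample lines and the function names sample_xml2_output and chapter_titles are made up for illustration; real xml2 output for a toc.ncx will contain more lines than this.

```shell
#!/bin/sh

# Hand-written, simplified sample of xml2-style output (illustrative only,
# not captured from a real book).
sample_xml2_output() {
cat <<'EOF'
/ncx/docTitle/text=Example Book
/ncx/navMap/navPoint/@id=ch1
/ncx/navMap/navPoint/navLabel/text=Chapter One
/ncx/navMap/navPoint/content/@src=ch1.xhtml
/ncx/navMap/navPoint/navPoint/navLabel/text=Section 1.1
/ncx/navMap/navPoint/@id=ch2
/ncx/navMap/navPoint/navLabel/text=Chapter Two
/ncx/navMap/navPoint/content/@src=ch2.xhtml
EOF
}

# Same filter as in the script: keep only the top-level navLabel text lines
# and replace the path prefix with a two-space indent.
chapter_titles() {
    sed -n -e 's:^/ncx/navMap/navPoint/navLabel/text=:  :p'
}

sample_xml2_output | chapter_titles
```

Because the pattern is anchored with ^ and names exactly one navPoint level, the nested "Section 1.1" line in the sample is not printed; only top-level entries are.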
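If an epub file doesn't contain a toc.ncx at the top level of the archive (unzip matches the name against the full path stored in the zip, so a copy in a subdirectory such as OEBPS/ won't match), you'll see output like this for that particular book: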
book3.epub:
caution: filename not matched: toc.ncx
error: Extra content at the end of the document
The first error line is from unzip, the second from xml2. xml2 will also warn about other errors it finds, e.g. an improperly formatted toc.ncx file.
Note that the error messages are on stderr, while the book's filename is still on stdout.
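That split makes it easy to keep or discard the errors with ordinary redirection. A small sketch follows; list_toc is a hypothetical stand-in that only mimics the script's behaviour for a book without a toc.ncx (filename on stdout, error on stderr), so the example runs without unzip or xml2 installed.

```shell
#!/bin/sh

# Hypothetical stand-in for the real script: prints the filename on stdout
# and an unzip-style error message on stderr.
list_toc() {
    echo "$1:"
    echo "caution: filename not matched: toc.ncx" >&2
    echo
}

# Keep the normal output, throw the error messages away.
list_toc book3.epub 2>/dev/null

# Keep only the error messages, throw the normal output away.
list_toc book3.epub 2>&1 >/dev/null
```

With the real script you would apply the same redirections to the script invocation itself.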
xml2 is available pre-packaged for Debian, Ubuntu and other Debian derivatives, and probably most other Linux distros too.
For simple tasks like this (i.e. where you just want to convert XML into a line-oriented format for use with sed, awk, cut, grep, etc.), xml2 is simpler and easier to use than xmlstarlet.
BTW, if you want to print the epub's title as well, change the sed script to:

sed -n -e 's:^/ncx/navMap/navPoint/navLabel/text=:  :p
           s!^/ncx/docTitle/text=!  Title: !p'
or replace it with an awk script:
awk -F= '/(navLabel|docTitle)\/text/ {print $2}'
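One thing to watch with -F= is that $2 stops at the next = sign, so a title that itself contains = gets cut short. A possible variant, sketched here with made-up sample lines and a hypothetical function name titles_awk, strips everything up to the first = instead of splitting on it:

```shell
#!/bin/sh

# Print the text after the FIRST '=' on each navLabel/docTitle text line,
# so titles that contain '=' themselves are kept whole.
titles_awk() {
    awk '/(navLabel|docTitle)\/text=/ { sub(/^[^=]*=/, ""); print }'
}

# Made-up xml2-style lines for illustration.
printf '%s\n' \
    '/ncx/docTitle/text=E=mc2: A Biography' \
    '/ncx/navMap/navPoint/@id=ch1' \
    '/ncx/navMap/navPoint/navLabel/text=Chapter 1 = Beginnings' |
titles_awk
```

Like the original awk one-liner, this matches navLabel at any nesting depth and prints the titles without indentation.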