How to extract table of contents from fb2 book?

Question

I have a book in fb2 format. I want to print the table of contents, containing names and numbers of "parts", "chapters", "episodes" and so on.

Is there a way I can do this from the terminal? There is a similar question, but for the epub format.

I know fb2 is an XML format. But is there a tool to extract only the TOC? The entries are inside the <section>, <title> and <subtitle> tags.

If there is not, I guess it is possible to write an XSL file based on the official FB2_to_txt.xsl file. Or maybe ebook-convert could do this?

The book that I am working on has the following structure:

<?xml version="1.0" encoding="utf8"?>
<FictionBook xmlns:l="http://www.w3.org/1999/xlink" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.gribuser.ru/xml/fictionbook/2.0">
  <description>
    <title-info>
      <genre>fiction</genre>
      <author>
        <first-name>John</first-name>
        <last-name>Doe</last-name>
      </author>
      <book-title>Fiction Book</book-title>
      <annotation>
        <p>Hello</p>
      </annotation>
      <keywords>john, doe, fiction</keywords>
      <date value="2011-07-18">18.07.2011</date>
      <coverpage></coverpage>
      <lang>en</lang>
    </title-info>
    <document-info>
      <author>
        <first-name></first-name>
        <last-name></last-name>
        <nickname></nickname>
      </author>
      <program-used>Fb2 Gem</program-used>
      <date value="2011-07-18">18.07.2011</date>
      <src-url></src-url>
      <src-ocr></src-ocr>
      <id></id>
      <version>1.0</version>
    </document-info>
    <publish-info>
    </publish-info>
  </description>
  <body>
    <title>
      <p>John Doe</p>
      <empty-line/>
      <p>Fiction Book</p>
    </title>
    <section>
      <title>
        <p>Part 1</p>
        <p>Some name of Part 1</p>
      </title>
      <section>
        <title>
          <p>Chapter 1</p>
          <p>Some name of Chapter 1</p>
        </title>
        <subtitle>Episode 1</subtitle>
        <p>Line one of the first episode</p>
        <p>Line two of the first episode</p>
        <p>Line three of the first episode</p>
        <subtitle>Episode 2</subtitle>
        <p>Line one of the second episode</p>
        <p>Line two of the second episode</p>
        <p>Line three of the second episode</p>
      </section>
    </section>
    <section>
      <title>
        <p>Part 2</p>
        <p>Some name of Part 2</p>
      </title>
      <section>
        <title>
          <p>Chapter 3</p>
          <p>Some name of Chapter 3</p>
        </title>
        <subtitle>Episode 3</subtitle>
        <p>Line one of the third episode</p>
        <p>Line two of the third episode</p>
        <p>Line three of the third episode</p>
        <subtitle>Episode 4</subtitle>
        <p>Line one of the fourth episode</p>
        <p>Line two of the fourth episode</p>
        <p>Line three of the fourth episode</p>
      </section>
    </section>
  </body>
</FictionBook>

I want to get the following output:

Part 1
Some name of Part 1
Chapter 1
Some name of Chapter 1
Episode 1
Episode 2
Part 2
Some name of Part 2
Chapter 3
Some name of Chapter 3
Episode 3
Episode 4

Kusalananda · Accepted Answer · 2023-05-05T08:11:16.130

Using xmlstarlet:

xmlstarlet select --template \
    --value-of '//_:section/_:title/_:p | //_:subtitle' \
    -nl file.xml

or, using short options,

xmlstarlet sel -t \
    -v '//_:section/_:title/_:p | //_:subtitle' \
    -n file.xml

The XPath query used here will extract the values of the p nodes of the title nodes beneath each section, and also the values of all subtitle nodes.

The prefix _: before each node name in the expression is an anonymous placeholder for the namespace identifier that the document is using.

The output of either of the two commands above, given your example document, would be

Part 1
Some name of Part 1
Chapter 1
Some name of Chapter 1
Episode 1
Episode 2
Part 2
Some name of Part 2
Chapter 3
Some name of Chapter 3
Episode 3
Episode 4
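The namespace point above is not specific to xmlstarlet. As an illustration only, here is a minimal Python sketch using the standard library's xml.etree.ElementTree, which has no `_:` shortcut, so the namespace URI from the root element has to be spelled out. The trimmed SAMPLE document and the helper names (`qname`, `text_of`, `toc_lines`) are assumptions made for this example, not part of any FB2 tool. ElementTree's limited XPath has no `|` union, so the sketch walks the tree in document order instead.

```python
import xml.etree.ElementTree as ET

# Default namespace declared on the <FictionBook> root element.
FB2_NS = "http://www.gribuser.ru/xml/fictionbook/2.0"

# Trimmed-down stand-in for the book in the question (illustrative only).
SAMPLE = """<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0">
  <body>
    <title><p>Fiction Book</p></title>
    <section>
      <title><p>Part 1</p></title>
      <section>
        <title><p>Chapter 1</p></title>
        <subtitle>Episode 1</subtitle>
        <p>Line one of the first episode</p>
      </section>
    </section>
  </body>
</FictionBook>"""


def qname(name):
    """Build the namespace-qualified tag that ElementTree uses internally."""
    return f"{{{FB2_NS}}}{name}"


def text_of(elem):
    """Collect an element's text, including text inside inline child tags."""
    return "".join(elem.itertext()).strip()


def toc_lines(root):
    """Return section/title/p texts and subtitle texts in document order."""
    lines = []

    def walk(elem, parent):
        is_section_title = (
            elem.tag == qname("title")
            and parent is not None
            and parent.tag == qname("section")
        )
        if is_section_title:
            lines.extend(text_of(p) for p in elem.findall(qname("p")))
        elif elem.tag == qname("subtitle"):
            lines.append(text_of(elem))
        for child in elem:
            walk(child, elem)

    walk(root, None)
    return lines


root = ET.fromstring(SAMPLE)
# An unqualified name matches nothing, because every element is namespaced.
print(len(root.findall(".//section")))  # prints 0
print("\n".join(toc_lines(root)))
```

On the sample, the unqualified lookup finds no sections, while the qualified walk prints Part 1, Chapter 1 and Episode 1 and skips the book's own title.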

If you also want the book's title, remove the _:section restriction from the expression (this makes the p nodes of the book's title match too).

Another way to get the title and subtitle of each section (avoiding the book's title), which may look a bit cleaner (as it shows that the subtitles are picked up from the sections and not just from anywhere), is to first restrict the matching to sections and then get the data from these:

xmlstarlet select --template \
    --match '//_:section' \
    --value-of '_:title/_:p | _:subtitle' \
    -nl file.xml
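The same "match the sections first, then read from each one" idea can be sketched in Python with the standard library, for readers without xmlstarlet. This is only an illustration under assumptions: the `fb` prefix, the trimmed SAMPLE document and the `section_toc` helper are made up for the example.

```python
import xml.etree.ElementTree as ET

FB2 = "http://www.gribuser.ru/xml/fictionbook/2.0"
NS = {"fb": FB2}  # "fb" is an arbitrary prefix chosen for this sketch

# Trimmed-down stand-in for the book in the question (illustrative only).
SAMPLE = """<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0">
  <body>
    <title><p>Fiction Book</p></title>
    <section>
      <title><p>Part 1</p><p>Some name of Part 1</p></title>
      <section>
        <title><p>Chapter 1</p></title>
        <subtitle>Episode 1</subtitle>
        <p>Line one of the first episode</p>
        <subtitle>Episode 2</subtitle>
      </section>
    </section>
  </body>
</FictionBook>"""


def text_of(elem):
    """Collect an element's text, including text inside inline child tags."""
    return "".join(elem.itertext()).strip()


def section_toc(root):
    """For every section, in document order, list its own title/p and subtitle texts."""
    lines = []
    for section in root.findall(".//fb:section", NS):
        # Only direct children are inspected; nested sections are
        # visited by the outer loop in their own turn.
        for child in section:
            if child.tag == f"{{{FB2}}}title":
                lines.extend(text_of(p) for p in child.findall("fb:p", NS))
            elif child.tag == f"{{{FB2}}}subtitle":
                lines.append(text_of(child))
    return lines


root = ET.fromstring(SAMPLE)
print("\n".join(section_toc(root)))
```

To try it on a real book, replace the `ET.fromstring(SAMPLE)` call with `ET.parse("file.xml").getroot()`, where file.xml is the same file name used in the commands above.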

@GillesQuénot https://xmlstar.sourceforge.net/doc/UG/ch05.html#idm47989546158480 — Kusalananda, May 05 '23 at 13:30
@GillesQuénot Oh, yes, it's an xmlstarlet-specific thing. — Kusalananda, May 05 '23 at 14:09

Gilles Quénot · Answer 2 · 2023-05-05T18:15:19.890


With xidel, an XPath 3-aware FOSS (GPLv3) command-line tool:

XPath 2, constructing a sequence:

xidel -e '(//section/title/p, //subtitle)'  file.xml

XPath1:

xidel -e '//section/title/p | //subtitle'  file.xml

Part 1
Some name of Part 1
Chapter 1
Some name of Chapter 1
Episode 1
Episode 2
Part 2
Some name of Part 2
Chapter 3
Some name of Chapter 3
Episode 3
Episode 4

xidel is a Swiss army knife for querying XML/HTML/JSON. It's smart enough to manage the default namespace on its own.

edited May 05 '23 at 18:15

answered May 05 '23 at 14:10


Thanks for your answer. It is absolutely valid, but I marked @Kusalananda's answer as accepted, just because I had used xmlstarlet previously and had not heard of xidel. – Ashark May 06 '23 at 13:12

score -1 · Answer 3 · answered May 05 '23 at 07:37


It appears to me that the output contains the result of the XPath expression (//title/p | //subtitle). So you just need to find a tool suitable for your environment that can execute that XPath expression and display the result.

See https://www.baeldung.com/linux/evaluate-xpath for some suggested command line tools. There is also Saxon's Gizmo tool (my company's product).

Michael Kay


Compiled (C) command-line tools are faster than Java libs (and Saxon restricts some XPath3 use, which needs the commercial version). We usually promote FOSS software here. It's worth mentioning that Gizmo uses Saxon, so it has the same frustrating limitation. xidel is open source, quick and efficient, and can run XPath3/XQuery3 queries. – Gilles Quénot May 05 '23 at 13:48
https://stackoverflow.com/a/76087934/465183 There's also BaseX as FOSS software to run XPath3/XQuery3, with no limitation in use, like update. – Gilles Quénot May 05 '23 at 14:04
The article I pointed to mentions a number of FOSS tools, no idea why you criticised it or downvoted it. Note also that Saxon is available as FOSS software, and is also available as compiled code in the form of SaxonC. – Michael Kay May 06 '23 at 16:46

I upvoted your answer; it was at -1. I was just explaining that Saxon needs commercial support to allow updates. I think you got the downvote because this is not a real answer and it promotes commercial libs. – Gilles Quénot May 06 '23 at 16:48
