Extracting text from an xml file

Question

If I have an XML file containing entries like this:

<root>
  <d:entry d:title="OYSTER">
    <span class="foot">
      <span role="text">
      foo</span>
    </span>
    <span class="sg">
      <span id="004">
    <span role="text">
      <span class="pos">
        <span class="baz">tart</span>
        <d:pos></d:pos>
      </span>
    </span>
    <span id="005" class="star">
      <span class="NAME">GUYBRUSH THREEPWOOD
      <d:def></d:def></span>
      <span role="text" class="bar">:</span>
      <span role="text" class="grog">
        <span class="ex">pirate
        </span>
        <span class="parrot">.</span>
      </span>
    </span>
      </span>
    </span>
  </d:entry>
</root>

how can I extract the text 'GUYBRUSH THREEPWOOD' by supplying the (d:)title 'OYSTER' and the class 'NAME'?

The best way is to use an HTML parser but first, add the expected output to your question. Don't post it in the comments. — Nasir Riley, May 16 '21 at 16:20
Is that the actual XML? There is a namespace, d, that you haven't declared. — Kusalananda, May 16 '21 at 16:59

Kusalananda · Answer 1 · 2021-05-16T20:27:03.217

This uses xq (part of yq, a collection of jq wrappers for YAML, XML, and TOML, from https://kislyuk.github.io/yq/), because xmlstarlet is too strict about your missing namespace declaration (see the end of this answer for an xmlstarlet solution anyway).

xq -r --arg title "OYSTER" --arg class "NAME" '
    (.. | select(."@d:title"? == $title)) |
    (.. | select(."@class"?   == $class))."#text"' file.xml

This recursively selects any document node whose d:title attribute has the value OYSTER (the initial @ in the expression marks an attribute name rather than a node name).

Given these nodes (only one in the example), they are searched recursively for any node that has a class attribute with value NAME.

For each such node, the node's text value (stored under the #text key) is extracted.

The two strings OYSTER and NAME are tied to internal variables on the command line with the --arg option.

The output, given the document in the question:

GUYBRUSH THREEPWOOD
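
If you are less familiar with jq, the two-stage recursive search can be mimicked in a few lines of Python. This is only an illustrative sketch, not what xq does internally: the walk and extract helpers are made-up names, and the dictionary is a trimmed-down stand-in for the JSON that xq produces from the XML.

```python
def walk(node):
    """Yield node and every value nested inside it, roughly like jq's `..`."""
    yield node
    if isinstance(node, dict):
        for value in node.values():
            yield from walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item)


def extract(doc, title, cls):
    """Collect "#text" of nodes with class `cls` below nodes titled `title`."""
    results = []
    for entry in walk(doc):
        # Stage 1: any object whose "@d:title" attribute equals the title.
        if isinstance(entry, dict) and entry.get("@d:title") == title:
            for node in walk(entry):
                # Stage 2: any object at or below it whose "@class" matches.
                if isinstance(node, dict) and node.get("@class") == cls:
                    results.append(node.get("#text"))
    return results


# Trimmed-down stand-in for the converted document (assumption: same key layout).
doc = {
    "root": {
        "d:entry": {
            "@d:title": "OYSTER",
            "span": [
                {"@class": "foot", "span": {"@role": "text", "#text": "foo"}},
                {"@class": "NAME", "d:def": None, "#text": "GUYBRUSH THREEPWOOD"},
            ],
        }
    }
}

print(extract(doc, "OYSTER", "NAME"))
```

The nesting of the two loops mirrors the pipe between the two parenthesised expressions in the jq filter: the second search only ever sees what the first one selected.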

If nodes other than d:entry can carry a d:title attribute, or nodes other than span can carry a class attribute, and you want to avoid matching these attributes on the wrong type of node, restrict the search to the appropriate nodes:

xq -r --arg title "OYSTER" --arg class "NAME" '
    (.. | ."d:entry"? | select(."@d:title"? == $title)) |
    (.. | .span?[]?   | select(."@class"?   == $class))."#text"' file.xml

As a reference, since xq is actually calling jq with a JSON document behind the scenes, the following is the JSON document that your XML document is translated into:

{
  "root": {
    "d:entry": {
      "@d:title": "OYSTER",
      "span": [
        {
          "@class": "foot",
          "span": {
            "@role": "text",
            "#text": "foo"
          }
        },
        {
          "@class": "sg",
          "span": {
            "@id": "004",
            "span": [
              {
                "@role": "text",
                "span": {
                  "@class": "pos",
                  "span": {
                    "@class": "baz",
                    "#text": "tart"
                  },
                  "d:pos": null
                }
              },
              {
                "@id": "005",
                "@class": "star",
                "span": [
                  {
                    "@class": "NAME",
                    "d:def": null,
                    "#text": "GUYBRUSH THREEPWOOD"
                  },
                  {
                    "@role": "text",
                    "@class": "bar",
                    "#text": ":"
                  },
                  {
                    "@role": "text",
                    "@class": "grog",
                    "span": [
                      {
                        "@class": "ex",
                        "#text": "pirate"
                      },
                      {
                        "@class": "parrot",
                        "#text": "."
                      }
                    ]
                  }
                ]
              }
            ]
          }
        }
      ]
    }
  }
}

Assuming the document has a proper declaration of the d namespace, xmlstarlet may be used to extract the wanted text like so:

xmlstarlet sel -t \
    -m '//d:entry[@d:title = "OYSTER"]' \
    -v './/span[@class = "NAME"]' -nl file.xml

Or, with internal variables set on the command line with --var (note that the double quotes are part of each value, since xmlstarlet evaluates the value as an XPath expression),

xmlstarlet sel -t --var title='"OYSTER"' --var class='"NAME"' \
    -m '//d:entry[@d:title = $title]' \
    -v './/span[@class = $class]' -nl file.xml

Both commands start by matching any d:entry node whose d:title attribute is OYSTER. For each matching node, they then search its descendants (the leading . keeps the search relative to the matched entry) for span nodes whose class attribute has the value NAME. The value of each such node is output.
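
If neither xq nor xmlstarlet is at hand, the same two-step match can be sketched with Python's standard xml.etree.ElementTree module. Like xmlstarlet, it needs the d prefix to be declared; the URI http://example.com/d below is a placeholder rather than the real one from your file, the sample document is shortened, and extract is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (assumption); use whatever your real file declares.
D = "http://example.com/d"

xml_text = f"""\
<root xmlns:d="{D}">
  <d:entry d:title="OYSTER">
    <span class="foot"><span role="text">foo</span></span>
    <span id="005" class="star">
      <span class="NAME">GUYBRUSH THREEPWOOD
      <d:def></d:def></span>
    </span>
  </d:entry>
</root>
"""


def extract(xml_source, title, cls):
    """Return the text of spans with class `cls` inside entries titled `title`."""
    root = ET.fromstring(xml_source)
    found = []
    # Step 1: every d:entry whose d:title attribute matches.
    for entry in root.iter(f"{{{D}}}entry"):
        if entry.get(f"{{{D}}}title") != title:
            continue
        # Step 2: every span below that entry whose class attribute matches.
        for span in entry.iter("span"):
            if span.get("class") == cls:
                # .text holds the text before the first child element.
                found.append((span.text or "").strip())
    return found


print(extract(xml_text, "OYSTER", "NAME"))
```

ElementTree spells namespaced names as {uri}name, which is why both the element and the attribute lookups are built from the D constant instead of using the d: prefix directly.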

thank you. would it be an alternative to add a declaration? (i'm 100% ignorant about xml.) — humbug, May 16 '21 at 18:38
@humbug I'm assuming that your actual XML file has a root element saying something like <root xmlns:d="http://some.url.here">, right? — Kusalananda, May 16 '21 at 18:43
