3

Is there a way to find a pattern (of a child tag) and replace the entire parent tag, using regular expressions?  I'm working from a Linux server without a graphics environment.

I have XML like:

<?xml version="1.0" encoding="UTF-8"?>  
<bookstore>  
  <book category="COOKING">  
    <title lang="en">Everyday Italian</title>  
    <author>Giada De Laurentiis</author>  
    <year>2005</year>  
    <price>30.00</price>  
  </book>  
  <book category="CHILDREN">  
    <title lang="en">Harry Potter</title>  
    <author>J K. Rowling</author>  
    <year>2005</year>  
    <price>29.99</price>  
  </book>  
  <book category="WEB">  
    <title lang="en">Learning XML</title>  
    <author>Erik T. Ray</author>  
    <year>2003</year>  
    <price>39.95</price>  
  </book>  
</bookstore>  

I need a shell script that finds the pattern:

<author>J K. Rowling</author>

then replaces its complete block:

  <book category="CHILDREN">  
    <title lang="en">Harry Potter</title>  
    <author>J K. Rowling</author>  
    <year>2005</year>  
    <price>29.99</price>  
  </book>  

with:

  <book category="CHILDREN">  
    <title lang="en">Hamlet</title>  
    <author>William Shakespeare</author>  
  </book>

to finally get:

<?xml version="1.0" encoding="UTF-8"?>  
<bookstore>  
  <book category="COOKING">  
    <title lang="en">Everyday Italian</title>  
    <author>Giada De Laurentiis</author>  
    <year>2005</year>  
    <price>30.00</price>  
  </book>  
  <book category="CHILDREN">  
    <title lang="en">Hamlet</title>  
    <author>William Shakespeare</author>  
  </book>  
  <book category="WEB">  
    <title lang="en">Learning XML</title>  
    <author>Erik T. Ray</author>  
    <year>2003</year>  
    <price>39.95</price>  
  </book>  
</bookstore> 

with something like <book*<author>J K. Rowling</author>*</book>, where * is a wildcard for all text or code between <book and <author>...

I have an idea, using Perl, contemplating these logic steps:

  1. Search the line number where the pattern is
  2. Identify the line number of parent block open and close tags
  3. Remove all this content, inside these lines.
  4. Add the new block inside these lines

But, if it is possible, I prefer the first approach.

  • Why do you want to do this using a regular expression? Why not use the proper tool for the job? – Toby Speight Dec 18 '21 at 16:27
  • @oswaldog, re: your "reopen" edit just now, neither of the answers in the linked duplicate use sed; why don't they address your concern? – Jeff Schaller Dec 30 '21 at 04:15
  • @JeffSchaller: What are you looking at?  *Three* of the answers to the duplicate target use sed, including the accepted answer, which was written by you. – G-Man Says 'Reinstate Monica' Dec 30 '21 at 14:56
  • None of the answers to the duplicate target, including the awk, Perl and XMLStarlet ones, shows how to do what this question asks for: *replace an entire block.* – G-Man Says 'Reinstate Monica' Dec 30 '21 at 14:56
  • Whew, I must have misread the GUI somehow, @G-ManSays'ReinstateMonica'. I saw answers that did not use sed and somehow missed the others, including my own! Anyway, an accepted answer is just an indication that it worked for the asker; the other two (higher-voted!) answers do not use sed. My comment was mainly in reaction to this author proposing a reopening based on "not using sed", but to your point, if there's other issues that separate the questions, then this should be reopened. – Jeff Schaller Dec 30 '21 at 15:03
  • @oswaldog: (1) Your sample data raise a problem: What if the input contains a work whose author is J.K. Rowling (as presented on *her* web site), J. K. Rowling (preferred by Wikipedia), JK Rowling, J K Rowling, or any other variant?  If you want to handle pattern-matching (and, perhaps, correction / normalization / standardization) in the author field, you should say so explicitly. Otherwise, you might want to use a less problematic example, like John Grisham. … (Cont’d) – G-Man Says 'Reinstate Monica' Dec 30 '21 at 15:24
  • (Cont’d) …  (2) You say “I prefer the first approach.”  What “first approach”? – G-Man Says 'Reinstate Monica' Dec 30 '21 at 15:24

2 Answers

2

My preferred approach is to use xmlstarlet to manipulate XML data. We declare an xmlstarlet variable $book to reference the subtree that we need to edit:

xmlstarlet <682660.xml ed                                               \
    --var book '//book[author="J K. Rowling"]'                          \
    --update '$book' --value ''                                         \
    --update '$book/@category' --value 'CHILDREN'                       \
    --subnode '$book' --type 'elem' --name 'title'  --value 'Hamlet'    \
    --subnode '$book/title' --type attr --name 'lang' --value 'en'      \
    --subnode '$book' --type 'elem' --name 'author' --value 'William Shakespeare'

Output

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="CHILDREN">
    <title lang="en">Hamlet</title>
    <author>William Shakespeare</author>
  </book>
  <book category="WEB">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>

You could also just delete the relevant <book/> subtree and append a new one, but that might break sequential processing so I didn't do that here.
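
For comparison, here is a minimal sketch of the "swap the whole subtree in place" idea using Python's standard xml.etree.ElementTree instead of xmlstarlet. It is an illustration under assumptions, not part of the xmlstarlet approach above: the function name replace_books is hypothetical, the sample document is shortened, and only books that are direct children of the root element are considered. Assigning to the same index keeps the new element at the position of the old one.

```python
import xml.etree.ElementTree as ET

# Shortened sample data, based on the XML in the question.
SOURCE = """<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
  </book>
  <book category="CHILDREN">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>"""

REPLACEMENT = """<book category="CHILDREN">
    <title lang="en">Hamlet</title>
    <author>William Shakespeare</author>
  </book>"""


def replace_books(root, author, new_xml):
    """Replace every direct <book> child whose <author> text equals `author`.

    Hypothetical helper for illustration. Returns the number of replacements.
    """
    count = 0
    for index, book in enumerate(list(root)):
        if book.tag == "book" and book.findtext("author") == author:
            new_book = ET.fromstring(new_xml)
            new_book.tail = book.tail  # keep the whitespace that followed the old element
            root[index] = new_book     # same index, so document order is preserved
            count += 1
    return count


root = ET.fromstring(SOURCE)
replaced = replace_books(root, "J K. Rowling", REPLACEMENT)
result = ET.tostring(root, encoding="unicode")
print(result)
```

Because the match is an exact string comparison, spelling variants of the author name would not be replaced, and every matching book is replaced, not just the first.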

Chris Davies
  • Hi @roaima, I didn't know anything about "xmlstarlet". Is it possible to pass a file as a parameter instead of the --update and --subnode options? – oswaldog Dec 17 '21 at 02:40
  • @oswaldog I don't know of a way to have xmlstarlet read secondary data from a file. It wasn't in your question so I didn't consider it – Chris Davies Dec 17 '21 at 16:32
1

When working with structured document formats, use tools explicitly made for working with these. Regular expressions are mainly for matching text, and an XML document is not really text but rather data structured in a particular way (and newlines etc. are not always significant). Likewise, sed is a tool for working with lines of text, and this is also not what XML is in general.

Using xq from https://kislyuk.github.io/yq/

xq -x '.book as $new | input |
    (
        .bookstore.book[] |
        select(.author == "J K. Rowling")
    ) |= $new' insert.xml file.xml

This uses xq to convert both your bookstore XML and the element that you want to insert (in insert.xml) into JSON. It then applies a particular jq expression to the generated JSON document to extract each entry of the .bookstore.book array. Each element in that array that has an .author field equal to exactly J K. Rowling is then replaced with the element read from insert.xml.

In more detail: We read the contents of the new .book object into an internal variable called $new and then proceed to get the main document by calling input. The select() statement acts on each individual element of the .bookstore.book array and pulls out the ones with a particular author. The result of this is a number of "paths" to these matching book entries. These are updated using |= (the update operator) to the $new value created earlier.

If you want to provide the new XML on the command line instead of via a file, then use a here-document:

xq -x '.book as $new | input |
    (
        .bookstore.book[] |
        select(.author == "J K. Rowling")
    ) |= $new' - file.xml <<'NEW_XML'
<book category="CHILDREN">
  <title lang="en">Hamlet</title>
  <author>William Shakespeare</author>
</book>
NEW_XML

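Note that the input filename insert.xml was replaced by a dash (-) on the command line, so that the new element is read from standard input, which is where the here-document is delivered.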

The result, given the data in your question, would be

<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="CHILDREN">
    <title lang="en">Hamlet</title>
    <author>William Shakespeare</author>
  </book>
  <book category="WEB">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>

The xq utility can do in-place editing, if you use its --in-place (or -i) option.


For reference, xq is translating your XML to the following internal JSON representation, which is then handled by jq:

{
  "bookstore": {
    "book": [
      {"@category":"COOKING","title":{"@lang":"en","#text":"Everyday Italian"},"author":"Giada De Laurentiis","year":"2005","price":"30.00"},
      {"@category":"CHILDREN","title":{"@lang":"en","#text":"Harry Potter"},"author":"J K. Rowling","year":"2005","price":"29.99"},
      {"@category":"WEB","title":{"@lang":"en","#text":"Learning XML"},"author":"Erik T. Ray","year":"2003","price":"39.95"}
    ]
  }
}

The data to insert would be converted to something equivalent to

{
  "book": {
    "@category": "CHILDREN",
    "title": {
      "@lang": "en",
      "#text": "Hamlet"
    },
    "author": "William Shakespeare"
  }
}
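
To make the effect of the update on this representation concrete, here is a minimal Python sketch of the same replacement on the two structures shown above. It only mirrors the logic of the select() and |= steps. It is not how xq or jq work internally, and the function name replace_by_author is hypothetical.

```python
import copy

# The bookstore document, in the JSON form shown above.
bookstore = {
    "bookstore": {
        "book": [
            {"@category": "COOKING",
             "title": {"@lang": "en", "#text": "Everyday Italian"},
             "author": "Giada De Laurentiis", "year": "2005", "price": "30.00"},
            {"@category": "CHILDREN",
             "title": {"@lang": "en", "#text": "Harry Potter"},
             "author": "J K. Rowling", "year": "2005", "price": "29.99"},
            {"@category": "WEB",
             "title": {"@lang": "en", "#text": "Learning XML"},
             "author": "Erik T. Ray", "year": "2003", "price": "39.95"},
        ]
    }
}

# The element to insert, in the JSON form shown above.
insert = {
    "book": {
        "@category": "CHILDREN",
        "title": {"@lang": "en", "#text": "Hamlet"},
        "author": "William Shakespeare",
    }
}


def replace_by_author(doc, author, new_book):
    """Replace each book entry whose author equals `author` with a copy of `new_book`."""
    doc["bookstore"]["book"] = [
        copy.deepcopy(new_book) if book.get("author") == author else book
        for book in doc["bookstore"]["book"]
    ]
    return doc


updated = replace_by_author(bookstore, "J K. Rowling", insert["book"])
titles = [book["title"]["#text"] for book in updated["bookstore"]["book"]]
print(titles)
```

The sketch assumes that .bookstore.book is an array, which is the case here because the document contains several book elements.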
Kusalananda
  • Hi, thank you so much!!! I didn't know these tools, amazing, but I noticed the tag is repeated in the final output, at the entry changed. – oswaldog Dec 16 '21 at 19:48
  • @oswaldog You're right, the tag is repeated, because I was not reading the output closely enough this morning. Give me a moment and I'll clear it up. – Kusalananda Dec 16 '21 at 20:01
  • @oswaldog Now fixed. Thanks for the heads-up! – Kusalananda Dec 16 '21 at 20:05
  • You are a guru! @they, really very grateful. Lastly: do you think I can lose data when converting from XML to JSON and vice versa? – oswaldog Dec 17 '21 at 13:26
  • @oswaldog There are definitely JSON documents that can't be represented as XML, but when converting back and forth like this there is no issue. As for performing the actual replacement of data, you could definitely replace more than you bargained for, but that would be just carelessness (not realizing that there are several books by the same author and then replacing all by one single book, etc.) – Kusalananda Dec 17 '21 at 20:16