Remove all nodes != tag value

Question

I know that XML parsers are the ideal way to go here, but none are available in my environment, and none can be added to it.

Let's take XML with the following structure:

<CONTAINER>
  <FOLDER NAME="I_RS_INT">
  </FOLDER>
  <FOLDER NAME="I_R_INR">
  </FOLDER>
  <FOLDER NAME="I_RS_TRN">
  </FOLDER>
</CONTAINER>

In a bash script, I wish to remove all nodes where the <FOLDER NAME= value matches *RS*, OR remove all nodes where the <FOLDER NAME= value != $var_folder.

Any help greatly appreciated!

magor · Accepted Answer · 2016-10-27T09:22:16.537

2

This should do it:

sed -e '/<FOLDER NAME=.*RS.*>/ { N; d; }' /tmp/xml

For every line that matches the pattern between the two / characters, the commands in the {} are executed. N appends the next line to the pattern space, and d then deletes the whole pattern space before sed moves on to the next line. This removes exactly two lines per match, so it only fits the example as given, where each FOLDER element is one opening line followed by one closing line. This works in any POSIX-compatible sed.
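To make the two-line deletion concrete, here is a minimal sketch that runs the same sed command against the sample from the question. The variable names xml and result are only for illustration, and the input is assumed to have each tag on its own line exactly as in the question.

```shell
#!/bin/sh
# Sample input copied from the question (one tag per line).
xml='<CONTAINER>
  <FOLDER NAME="I_RS_INT">
  </FOLDER>
  <FOLDER NAME="I_R_INR">
  </FOLDER>
  <FOLDER NAME="I_RS_TRN">
  </FOLDER>
</CONTAINER>'

# On a matching opening line, N pulls in the following line and
# d discards both; every other line is printed unchanged.
result=$(printf '%s\n' "$xml" | sed -e '/<FOLDER NAME=.*RS.*>/ { N; d; }')

printf '%s\n' "$result"
```

With this input only the I_R_INR folder and the CONTAINER wrapper should remain.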

To remove every line from <FOLDER NAME=.*RS.*> through </FOLDER>, try the following:

 awk '/<FOLDER NAME=.*RS.*>/,/<\/FOLDER>/ {next} {print}' xmlfile

The next command skips the remaining rules for the current line and moves on to the next input line, so lines inside the range are never printed. Every other line reaches the plain print.
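The question also asks for the opposite case: dropping every FOLDER whose NAME is not equal to $var_folder. The following is one possible sketch using a skip flag in awk rather than a range pattern. It assumes each tag sits on its own line and that FOLDER elements are not nested; the ITEM children and the names xml, keep, skip and result are hypothetical and only serve the illustration.

```shell
#!/bin/sh
# Hypothetical sample: same folders as the question, with child elements added.
xml='<CONTAINER>
  <FOLDER NAME="I_RS_INT">
    <ITEM/>
  </FOLDER>
  <FOLDER NAME="I_R_INR">
    <ITEM/>
  </FOLDER>
  <FOLDER NAME="I_RS_TRN">
  </FOLDER>
</CONTAINER>'

var_folder="I_R_INR"

result=$(printf '%s\n' "$xml" | awk -v keep="$var_folder" '
  /<FOLDER NAME=/ {
      # Start skipping unless the line carries exactly NAME="<keep>".
      skip = (index($0, "NAME=\"" keep "\"") == 0)
  }
  !skip { print }                  # print only while not inside a skipped folder
  /<\/FOLDER>/ { skip = 0 }        # the closing tag ends the skipped region
')

printf '%s\n' "$result"
```

Using index() instead of a regex match keeps the comparison literal, so characters in $var_folder that are special in regular expressions are not interpreted as patterns.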

edited Oct 27 '16 at 09:22

answered Oct 27 '16 at 07:48


is this looking to remove folders from the filesystem - i.e rmdir -rf?? – Chris Finlayson Oct 27 '16 at 07:57
I just need to remove the nodes from the XML itself - Sorry if question wasn't clear enough – Chris Finlayson Oct 27 '16 at 07:57
yeah i thought it's to remove the folders. i'll modify. – magor Oct 27 '16 at 08:04
modified, should be better now :D – magor Oct 27 '16 at 08:20
Cheers for that. Your update successfully removes the offending <FOLDER NAME... lines, but I require it to take out the entire node (with its child elements), otherwise it will invalidate the xml. Again, apologies if not being absolutely clear in my question. – Chris Finlayson Oct 27 '16 at 08:47
My solution will remove the matched line and the following line, based on the given example. To remove lines until the </FOLDER> is met, another solution must be used. looking into it. – magor Oct 27 '16 at 09:05
Let us continue this discussion in chat. – Chris Finlayson Oct 27 '16 at 09:12
You are a gentleman and a scholar! Thanks very much for the great solution. – Chris Finlayson Oct 27 '16 at 09:29

Kusalananda · Answer 2 · 2021-05-02T20:33:19.253

You should do this with an XML parser. For example, using XMLStarlet on the command line:

$ xmlstarlet ed -d '/CONTAINER/FOLDER[contains(@NAME, "RS")]' data.xml
<?xml version="1.0"?>
<CONTAINER>
  <FOLDER NAME="I_R_INR">
  </FOLDER>
</CONTAINER>

Or,

$ var="I_R_INR"
$ xmlstarlet ed -d "/CONTAINER/FOLDER[@NAME != '$var']" data.xml
<?xml version="1.0"?>
<CONTAINER>
  <FOLDER NAME="I_R_INR">
  </FOLDER>
</CONTAINER>

Note that these two are not equivalent as the first example performs a substring match while the second performs an exact match.
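The difference only shows up once the data contains a name that neither includes "RS" nor equals $var. As a rough illustration in plain shell (no XML involved), the sketch below applies both rules to a list of names; I_R_OTH is a hypothetical extra name that is not in the question, and kept_substring and kept_exact are illustrative variable names.

```shell
#!/bin/sh
var="I_R_INR"
kept_substring=""
kept_exact=""

# I_R_OTH is a hypothetical name added to expose the difference.
for name in I_RS_INT I_R_INR I_RS_TRN I_R_OTH; do
  # Rule 1: delete when NAME contains "RS", so keep everything else.
  case $name in
    *RS*) ;;
    *) kept_substring="$kept_substring $name" ;;
  esac
  # Rule 2: delete when NAME != $var, so keep only the exact match.
  if [ "$name" = "$var" ]; then
    kept_exact="$kept_exact $name"
  fi
done

# Strip the leading space left by the concatenation.
kept_substring=${kept_substring# }
kept_exact=${kept_exact# }

echo "substring rule keeps: $kept_substring"
echo "exact rule keeps: $kept_exact"
```

The substring rule keeps both I_R_INR and I_R_OTH, while the exact rule keeps only I_R_INR.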

With the xq wrapper around jq:

$ xq -x --arg substring "RS" 'del(.CONTAINER.FOLDER[] | select(."@NAME" | contains($substring)))' file.xml
<CONTAINER>
  <FOLDER NAME="I_R_INR"></FOLDER>
</CONTAINER>

$ xq -x --arg name "I_R_INR" 'del(.CONTAINER.FOLDER[] | select(."@NAME" != $name))' file.xml
<CONTAINER>
  <FOLDER NAME="I_R_INR"></FOLDER>
</CONTAINER>

score 0 · Answer 3 · edited May 23 '17 at 11:33

OK, seriously - parsing XML with regular expressions is bad news. XML is NOT a regular language, so no regular expression can handle it correctly. Anything you write will be hacky and brittle as a result.
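As a small illustration of that brittleness, the sketch below feeds a line-based sed command (matched line plus the following line deleted) a document that is equivalent XML but formatted differently: the first FOLDER is written on a single line. The input and the variable names xml and result are assumptions made up for this demonstration.

```shell
#!/bin/sh
# Same kind of data, but the first FOLDER opens and closes on one line.
xml='<CONTAINER>
  <FOLDER NAME="I_RS_INT"></FOLDER>
  <FOLDER NAME="I_R_INR">
  </FOLDER>
</CONTAINER>'

# The line-based rule deletes the matching line plus the next one,
# which here is the opening tag of the folder that should be kept.
result=$(printf '%s\n' "$xml" | sed -e '/<FOLDER NAME=.*RS.*>/ { N; d; }')

printf '%s\n' "$result"
```

The output is left with a closing </FOLDER> that has no opening tag, so the result is no longer well-formed XML even though the input was.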

However, XML does have something similar to regular expressions, called XPath.

To tackle your problem, I'd do it like this:

#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
#process the file as XML
my $twig = XML::Twig -> parsefile ( 'your_file.xml' );

#iterate 'FOLDER' elements
foreach my $folder ( $twig -> get_xpath ('//FOLDER' ) ) {
   #delete any that regex match /RS/
   if ( $folder -> att('NAME') =~ m/RS/ ) { 
      $folder -> delete;
   }
}

#print the result. 
$twig -> set_pretty_print('indented_a');
$twig -> print;

score -1 · Answer 4 · answered Jan 21 '17 at 11:22

sed -r '/<FOLDER NAME=.*RS.*>/{ :X N; /<\/FOLDER>/d; bX }' file
<CONTAINER>
  <FOLDER NAME="I_R_INR">
  </FOLDER>
</CONTAINER>

mug896
