3

I know that XML parsers are the ideal way to go here, but none are available in my environment and none can be added.

Let's take XML with the following structure:

<CONTAINER>
  <FOLDER NAME="I_RS_INT">
  </FOLDER>
  <FOLDER NAME="I_R_INR">
  </FOLDER>
  <FOLDER NAME="I_RS_TRN">
  </FOLDER>
</CONTAINER>

In a bash script, I want to remove every FOLDER node whose NAME matches *RS*, or alternatively remove every FOLDER node whose NAME is not equal to $var_folder.

Any help greatly appreciated!

Jeff Schaller

4 Answers

2

This should do it:

sed -e '/<FOLDER NAME=.*RS.*>/ { N; d; }' /tmp/xml

For every line that matches the pattern between the two / characters, the commands inside the {} are executed. N appends the next line to the pattern space, and d then deletes the whole pattern space and starts the next cycle. This relies on the closing </FOLDER> being on the line directly after the opening tag, as in the sample. It works in any POSIX-compatible sed.
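
The N; d pair removes exactly two lines. For FOLDER elements that span more lines, a range address is one alternative. This is a minimal sketch, assuming the opening and closing tags each sit on their own line and that FOLDER elements are not nested; the function name `remove_rs_folders`, the tighter `[^"]*` pattern, and the `<ITEM/>` child in the demo input are illustrative and not part of the original question:

```shell
# Sketch: delete every FOLDER block whose NAME attribute contains "RS".
# Assumes one tag per line and no nested FOLDER elements.
remove_rs_folders() {
  # Range address: from the matching opening tag through the next </FOLDER>.
  # [^"]* keeps the match inside the quoted attribute value.
  sed '/<FOLDER NAME="[^"]*RS[^"]*"/,/<\/FOLDER>/d' "$@"
}

# Demo on stdin; pass a file name instead to read from a file.
remove_rs_folders <<'EOF'
<CONTAINER>
  <FOLDER NAME="I_RS_INT">
    <ITEM/>
  </FOLDER>
  <FOLDER NAME="I_R_INR">
  </FOLDER>
</CONTAINER>
EOF
```

Only the I_R_INR folder and the CONTAINER tags remain in the output.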

Try the following to remove every line from <FOLDER NAME=.*RS.*> through </FOLDER>:

 awk '/<FOLDER NAME=.*RS.*>/,/<\/FOLDER>/ {next} {print}' xmlfile

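The next command skips the remaining rules for the current line and moves on to the next input line, so lines inside the range are never printed. Every other line falls through to the plain print.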
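
The second case in the question, removing every FOLDER whose NAME differs from $var_folder, can be handled along the same lines. This is a minimal sketch, assuming one tag per line, NAME as the first attribute, and no nested FOLDER elements; the function name `keep_only_folder` and the awk variables `keep` and `skip` are illustrative:

```shell
# Sketch: keep only the FOLDER block whose NAME equals the first argument.
keep_only_folder() {
  name=$1   # exact folder name to keep
  shift     # remaining arguments are input files (stdin if none)
  awk -v keep="$name" '
    /<FOLDER NAME="/ {
      # index() is a plain substring search, so the name needs no regex escaping.
      # The closing quote in the search string makes this an exact name match.
      skip = (index($0, "<FOLDER NAME=\"" keep "\"") == 0)
    }
    !skip { print }
    /<\/FOLDER>/ { skip = 0 }   # stop skipping after the closing tag
  ' "$@"
}

var_folder="I_R_INR"
keep_only_folder "$var_folder" <<'EOF'
<CONTAINER>
  <FOLDER NAME="I_RS_INT">
  </FOLDER>
  <FOLDER NAME="I_R_INR">
  </FOLDER>
  <FOLDER NAME="I_RS_TRN">
  </FOLDER>
</CONTAINER>
EOF
```

Note that awk interprets backslash escapes in values passed with -v, so this sketch assumes folder names without backslashes.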

magor
2

You should do this with an XML parser. For example, using XMLStarlet on the command line:

$ xmlstarlet ed -d '/CONTAINER/FOLDER[contains(@NAME, "RS")]' data.xml
<?xml version="1.0"?>
<CONTAINER>
  <FOLDER NAME="I_R_INR">
  </FOLDER>
</CONTAINER>

Or,

$ var="I_R_INR"
$ xmlstarlet ed -d "/CONTAINER/FOLDER[@NAME != '$var']" data.xml
<?xml version="1.0"?>
<CONTAINER>
  <FOLDER NAME="I_R_INR">
  </FOLDER>
</CONTAINER>

Note that these two are not equivalent as the first example performs a substring match while the second performs an exact match.


With the xq wrapper around jq:

$ xq -x --arg substring "RS" 'del(.CONTAINER.FOLDER[] | select(."@NAME" | contains($substring)))' file.xml
<CONTAINER>
  <FOLDER NAME="I_R_INR"></FOLDER>
</CONTAINER>
$ xq -x --arg name "I_R_INR" 'del(.CONTAINER.FOLDER[] | select(."@NAME" != $name))' file.xml
<CONTAINER>
  <FOLDER NAME="I_R_INR"></FOLDER>
</CONTAINER>
Kusalananda
0

OK, seriously - parsing XML with regular expressions is bad news. XML is NOT a regular language, so no regular expression can handle it correctly. Anything you write will be hacky and brittle as a result.
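
As a small illustration of that brittleness, here is a line-based sed filter fed a FOLDER element written on a single line. The one-line element is a made-up variation on the sample input, and `naive_remove` is just an illustrative name:

```shell
# A line-based filter that assumes the closing tag is on the next line.
naive_remove() {
  sed -e '/<FOLDER NAME=.*RS.*>/ { N; d; }' "$@"
}

# Equivalent XML, but the first FOLDER is opened and closed on one line.
naive_remove <<'EOF'
<CONTAINER>
  <FOLDER NAME="I_RS_INT"></FOLDER>
  <FOLDER NAME="I_R_INR">
  </FOLDER>
</CONTAINER>
EOF
```

The filter deletes the matched line together with the line after it, so the opening tag of I_R_INR disappears and an orphaned </FOLDER> is left behind. The result is no longer well-formed XML.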

However, XML does have something similar to regular expressions, called XPath.

To tackle your problem, I'd do it like this:

#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
#process the file as XML
my $twig = XML::Twig -> parsefile ( 'your_file.xml' );

#iterate 'FOLDER' elements
foreach my $folder ( $twig -> get_xpath ('//FOLDER' ) ) {
   #delete any that regex match /RS/
   #'// ""' avoids an undef warning when a FOLDER has no NAME attribute
   if ( ( $folder -> att('NAME') // '' ) =~ m/RS/ ) {
      $folder -> delete;
   }
}

#print the result. 
$twig -> set_pretty_print('indented_a');
$twig -> print;
Sobrique
-1
sed -r '/<FOLDER NAME=.*RS.*>/{ :X N; /<\/FOLDER>/d; bX }' file
<CONTAINER>
  <FOLDER NAME="I_R_INR">
  </FOLDER>
</CONTAINER>
mug896