I'm using xmllint --shell on a large XML file, and using its write command to save XML snippets for use in testing. Each saved snippet needs a few lines from the original XML file (the declaration, the namespace, and the root node). I want to add these lines to the snippet file without copying them over manually. Instead, I'd like to use sed to add them, so that I can write a function to automate this very tedious task. To illustrate, here is a sample of what I'm trying to accomplish.

Source XML (source.xml):

<?xml version="1.0" encoding="UTF-8"?>
<foo:root xmlns:foo="TheFooNameSpaceIsImportant">
    <foo:Entry>
        <foo:SomeNode>Foo1</foo:SomeNode>
        <foo:AnotherNode>Bar1</foo:AnotherNode>
    </foo:Entry>
    <foo:Entry>
        <foo:SomeNode>Foo2</foo:SomeNode>
        <foo:AnotherNode>Bar2</foo:AnotherNode>
    </foo:Entry>
    <foo:Entry>
        <foo:SomeNode>Foo3</foo:SomeNode>
        <foo:AnotherNode>Bar3</foo:AnotherNode>
    </foo:Entry>
    <!-- tens of thousands of others -->
    <foo:Entry>
        <foo:SomeNode>Foo20432</foo:SomeNode>
        <foo:AnotherNode>Bar20432</foo:AnotherNode>
    </foo:Entry>

</foo:root>

Saved XML snippet (sample.xml):

<foo:Entry>
    <foo:SomeNode>Foo</foo:SomeNode>
    <foo:AnotherNode>Bar</foo:AnotherNode>
</foo:Entry>

So I need to wrap this snippet in the top two lines and the bottom line of source.xml. But the following fails because of the < character:

$ sed -i 1i"`head -n 2 source.xml`" sample.xml
sed: -e expression #1, char 43: unknown command: `<'

Is there a way to escape this character when it's being fed from a sub-command like this?

jktravis
  • 2,216

3 Answers

4

The sed command "i" expects \ followed by text, as the error message from BSD sed explains when it is given the command you provide above.

However, it only expects one line of text. To insert more than that, every line of the text that is followed by another line must end with a backslash:

sed "1i\\
$(head -n 2 source.xml | sed 's/$/\\/')
" sample.xml
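To make the continuation rule concrete, here is a minimal sketch that runs in a scratch directory. The demo file contents and the `header` and `result` variable names are assumptions for illustration. It is a slight variation on the command above: it doubles any backslashes in the header and leaves the last header line without a trailing backslash.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)" || exit 1

# Tiny stand-ins for the real files (illustrative content only).
printf '%s\n' '<?xml version="1.0" encoding="UTF-8"?>' \
    '<foo:root xmlns:foo="TheFooNameSpaceIsImportant">' \
    '</foo:root>' > source.xml
printf '%s\n' '<foo:Entry/>' > sample.xml

# Prepare the text for the "i" command: double any backslashes, then
# end every header line except the last with a backslash.
header=$(head -n 2 source.xml | sed -e 's/\\/&&/g' -e '$!s/$/\\/')

# Inside double quotes, \\ becomes a single backslash and the newline
# after it is literal, so sed sees "1i\" followed by the escaped text.
result=$(sed "1i\\
$header" sample.xml)
printf '%s\n' "$result"
```

Nothing is edited in place here; the result goes to standard output, so sample.xml stays untouched until the output looks right.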

This (nested sed calls) gets a little bit ridiculous. As I've written about elsewhere, the tool of choice for in-place scripted file edits is not sed, but ex:

ex -sc '1,2ya | n! | 0pu | x' source.xml sample.xml

The -s flag starts ex in silent mode, for batch processing. -c specifies the command to be run.

1,2ya yanks (i.e. copies) the first two lines of the first file, source.xml.

| is the command separator.

n! goes to the next file, discarding any changes made to the current file. (We haven't made any in this case so n would work just as well.)

0pu "puts" (i.e. pastes) the lines we copied earlier, placing them just after line "0" (that is, pasting them above the first line).

x exits, saving the changes that have been made to the current file.

Unlike sed -i, which is not specified in POSIX (and which behaves differently in BSD sed, where the -i switch requires a backup file extension argument, even if empty), the above ex command is fully POSIX compliant.
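If portability is the main concern, the whole wrapping task from the question (top two lines, snippet, bottom line) can also be sketched with nothing but head, cat, tail and mv. This is only an illustration under assumptions: the function name `wrap_snippet`, its variable names, and the temporary-file naming are hypothetical, and it assumes the closing root tag really is the last line of the source file.

```shell
# wrap_snippet SOURCE SNIPPET   (hypothetical helper name)
# Rewrites SNIPPET as: first two lines of SOURCE, SNIPPET, last line of SOURCE.
# Note: plain sh has no local variables, so these names are global.
wrap_snippet() {
    wrap_src=$1
    wrap_snip=$2
    wrap_tmp=$wrap_snip.tmp.$$      # temp file next to the snippet
    {
        head -n 2 "$wrap_src" &&
        cat "$wrap_snip" &&
        tail -n 1 "$wrap_src"
    } > "$wrap_tmp" && mv "$wrap_tmp" "$wrap_snip" || {
        rm -f "$wrap_tmp"           # clean up and report failure
        return 1
    }
}

# Demo in a scratch directory with illustrative file contents.
cd "$(mktemp -d)" || exit 1
printf '%s\n' '<?xml version="1.0" encoding="UTF-8"?>' \
    '<foo:root xmlns:foo="TheFooNameSpaceIsImportant">' \
    '    <foo:Entry/>' \
    '</foo:root>' > source.xml
printf '%s\n' '<foo:Entry>' \
    '    <foo:SomeNode>Foo</foo:SomeNode>' \
    '</foo:Entry>' > sample.xml

wrap_snippet source.xml sample.xml
cat sample.xml
```

Writing to a temporary file and renaming it only on success means a failed run leaves the original snippet as it was.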

Wildcard
  • 36,499
  • Cool! Thanks! I'll look at using ex for these kinds of things in the future! – jktravis Mar 11 '16 at 15:51
  • 1
    A common problem with ex in scripts is that if there's a problem, like the first file has fewer than 2 lines, or one of the files is not readable or writable, ex will end up hanging waiting for user input. The error handling with ed or ex is generally tricky. – Stéphane Chazelas Mar 11 '16 at 16:02
  • That's true. In which case the thing to do is type q! and press enter, then try the command again, removing the -s option, to see what error you're getting. – Wildcard Mar 11 '16 at 16:04
  • @StéphaneChazelas, would there be any issue with using something like printf '1,2ya\nn\n0p\nx\n' | ex source.xml sample.xml? As far as I can tell, ex will always reliably exit in this case, having either successfully performed the expected edit or having done nothing. – Wildcard Feb 03 '17 at 03:39
1

When you insert/append multiple lines you have to escape the end of line so that sed knows when to stop inserting/appending. In your case you could run

head -n 2 source.xml | sed '1i\
1i\\
s/\\/&&/g
$!s/$/\\/' | sed -f - sample.xml

The first sed processes the input (adds the 1i\ command before those two lines, escapes any backslashes and also the end of lines if not the last line) and passes it as a sed script to the second command. Add -i to the second sed if you wish to edit in place.
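The same generate-a-script idea can be extended to add the closing tag as well. The following is a sketch under assumptions, not the answer's exact pipeline: the helper name `escape_text`, the script file name `wrap.sed`, the output name `wrapped.xml`, and the demo file contents are all made up for illustration. It writes the generated sed script to a file, which makes it easy to inspect, and uses `$a\` to append the last line of the source after the snippet.

```shell
# Work in a scratch directory with small stand-in files.
cd "$(mktemp -d)" || exit 1

cat > source.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<foo:root xmlns:foo="TheFooNameSpaceIsImportant">
    <foo:Entry>
        <foo:SomeNode>Foo1</foo:SomeNode>
    </foo:Entry>
</foo:root>
EOF
cat > sample.xml <<'EOF'
<foo:Entry>
    <foo:SomeNode>Foo</foo:SomeNode>
    <foo:AnotherNode>Bar</foo:AnotherNode>
</foo:Entry>
EOF

# escape_text (hypothetical helper): double backslashes and end every
# line except the last with a backslash, as "i" and "a" text requires.
escape_text() {
    sed -e 's/\\/&&/g' -e '$!s/$/\\/'
}

# Generate the sed script: insert the header before line 1 and
# append the closing tag after the last line.
{
    printf '1i\\\n'
    head -n 2 source.xml | escape_text
    printf '$a\\\n'
    tail -n 1 source.xml | escape_text
} > wrap.sed

sed -f wrap.sed sample.xml > wrapped.xml
cat wrapped.xml
```

Looking at wrap.sed before running it shows exactly which text sed will insert and append, which helps when the header contains characters that need escaping.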

don_crissti
  • 82,805
1

Don't use sed with XML. XML is a contextual data structure, and regex simply doesn't support that well. https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

Use a parser. perl has XML::Twig which works quite nicely:

#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Parse the snippet and take a copy of its root element.
my $xml_to_insert = XML::Twig -> parse ( '<foo:Entry>
    <foo:SomeNode>Foo</foo:SomeNode>
    <foo:AnotherNode>Bar</foo:AnotherNode>
</foo:Entry>') -> root -> copy;

# Parse the full document from the __DATA__ section below.
my $xml = XML::Twig -> parse ( \*DATA );

# Paste the snippet in as the first child of the root, then pretty-print.
$xml_to_insert -> paste ( 'first_child', $xml -> root );
$xml -> set_pretty_print ( 'indented_a');
$xml -> print;


__DATA__
<?xml version="1.0" encoding="UTF-8"?>
<foo:root xmlns:foo="TheFooNameSpaceIsImportant">
    <foo:Entry>
        <foo:SomeNode>Foo1</foo:SomeNode>
        <foo:AnotherNode>Bar1</foo:AnotherNode>
    </foo:Entry>
    <foo:Entry>
        <foo:SomeNode>Foo2</foo:SomeNode>
        <foo:AnotherNode>Bar2</foo:AnotherNode>
    </foo:Entry>
    <foo:Entry>
        <foo:SomeNode>Foo3</foo:SomeNode>
        <foo:AnotherNode>Bar3</foo:AnotherNode>
    </foo:Entry>
    <!-- tens of thousands of others -->
    <foo:Entry>
        <foo:SomeNode>Foo20432</foo:SomeNode>
        <foo:AnotherNode>Bar20432</foo:AnotherNode>
    </foo:Entry>

</foo:root>

Output:

<?xml version="1.0" encoding="UTF-8"?>
<foo:root xmlns:foo="TheFooNameSpaceIsImportant">
  <foo:Entry>
    <foo:SomeNode>Foo</foo:SomeNode>
    <foo:AnotherNode>Bar</foo:AnotherNode>
  </foo:Entry>
  <foo:Entry>
    <foo:SomeNode>Foo1</foo:SomeNode>
    <foo:AnotherNode>Bar1</foo:AnotherNode>
  </foo:Entry>
  <foo:Entry>
    <foo:SomeNode>Foo2</foo:SomeNode>
    <foo:AnotherNode>Bar2</foo:AnotherNode>
  </foo:Entry>
  <foo:Entry>
    <foo:SomeNode>Foo3</foo:SomeNode>
    <foo:AnotherNode>Bar3</foo:AnotherNode>
  </foo:Entry>
  <!-- tens of thousands of others -->
  <foo:Entry>
    <foo:SomeNode>Foo20432</foo:SomeNode>
    <foo:AnotherNode>Bar20432</foo:AnotherNode>
  </foo:Entry>
</foo:root>

This is longer and more verbose for the sake of illustration - but it essentially takes your snippet, and copy-pastes it into your structure. Nice and simple.

XML::Twig also supports "parsefile_inplace" which allows you to do much the same as sed -i. So your example would look a bit more like:

my $xml_to_insert = XML::Twig -> parsefile ( 'source.xml' ) -> root -> copy;

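XML::Twig -> new ( pretty_print => 'indented_a',
                   twig_handlers => {
                       'foo:root' => sub {
                            my ( $twig, $elt ) = @_;
                            $xml_to_insert -> paste ( 'first_child', $elt );
                            # parsefile_inplace only writes what is printed or flushed
                            $twig -> flush;
                        } }) -> parsefile_inplace ('sample.xml');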

Or if that looks a bit too convoluted:

sub insert_source {
    my ( $twig, $branch ) = @_;
    my $xml_to_insert = XML::Twig -> parsefile ( 'source.xml' ) -> root -> copy;
    $xml_to_insert -> paste ( 'first_child', $branch );
    # parsefile_inplace only writes what is printed or flushed
    $twig -> flush;
}

my $xml = XML::Twig -> new ( twig_handlers => { 'foo:root' => \&insert_source } );
$xml -> parsefile_inplace ( 'sample.xml');

Sobrique
  • 4,424
  • Thanks, but I'm not using sed to parse the source. The sample file will be used in an XSLT as a test case with XSpec. But since the source document is very large, I explore the document using xmllint --shell, find the cases I want, save them, and then write my XSpec test cases. – jktravis Mar 11 '16 at 16:21
  • You can almost certainly do the whole task using XML::Twig too in that case. It supports large source files, thanks to being able to purge or flush as you go. – Sobrique Mar 11 '16 at 16:27
  • 1
    +1. This is the correct answer and should not have been downvoted. sed (or any regexp-based extraction) is not the right tool for parsing or extracting structured data like XML. grepping out a single line from an XML file...yeah, maybe...but XML files often don't have newlines, it's quite common for them to be a single line tens of thousands of characters long. – cas Mar 12 '16 at 01:25