How to remove nodes from a HUGE (>2gb) XML file?

Question

I'm working with several huge (>2gb) XML files and their size is causing problems.

(My application uses XMLReader in a PHP script to parse smaller ~500mb files, and that works fine, but XMLReader won't open these large files.)

So - my idea is to eliminate big parent nodes of the file that I know I don't need.

For example, if the structure of the file looks like this:

<record id="1">
    <a>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </a>
    <b>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </b>
    <c>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </c>
</record>
...
<record id="999999">
    <a>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </a>
    <b>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </b>
    <c>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </c>
</record>

For my purposes, I only need the data in parent node <a> for each record. If I could eliminate parent nodes <b> and <c> from every record, I could reduce the size of the file substantially, so it would be small enough to work with normally.

What's the best way to do something like this?

Most of the "XML-aware" utilities I've tried choke on a file this large, so I'm hoping that I can do this with something like sed or grep.

XML, like (X)HTML, is not a regular language; as a result, parsing or manipulating it with regex is not feasible. You may be able to do some minimal manipulation with awk or sed, but you should not rely on regex to parse or manipulate arbitrary XML. – HalosGhost, Sep 18 '14 at 17:28
If you like Perl: http://search.cpan.org/~mirod/XML-Twig-3.48/Twig.pm was designed to handle huge XML files. – charlesbridge, Sep 18 '14 at 17:55
@HalosGhost - Thanks for your comment. Just curious though... Why can't you manipulate it with regex? For example, wouldn't <b>(.|\n)*?</b> match all of the <b> nodes, regardless of what else was within them? – mattstuehler, Sep 18 '14 at 18:10
Sure, but the trouble becomes really obvious once you see attributes () or self-closing tags (<hr />). Regexes are only truly functional on regular languages; XML/(X)HTML are not regular languages. – HalosGhost, Sep 18 '14 at 18:25
Do you have a program called xsltproc? If so, an XML/XSL master somewhere (or me, tomorrow) can produce a parser to fit your need. – Archemar, Sep 18 '14 at 21:18
Note that XMLReader should be able to work, because it's a streaming API. I guess you're on 32-bit and it hasn't been built with "large file support", so it's using a 32-bit int for file offsets (which only goes up to 2GB) :(. – sourcejedi, Sep 19 '14 at 19:02

score 1 · Answer 1 · edited Apr 16 '19 at 18:27

Personally, I would write something in C (about as low-level as you can get before assembly) and loop through all the nodes using libxml.

Here are some examples : http://www.xmlsoft.org/examples/

Use GCC to compile your code.

edited Apr 16 '19 at 18:27

Rui F Ribeiro


answered Sep 18 '14 at 17:23

Simon Paquet


gena2x · Accepted Answer · 2014-09-19T17:50:43.763

1

You may use awk:

$ awk '/<b>/ {hide=1} /<\/record>/ {hide=0} !hide' my.xml > mynew.xml

This hides everything starting at the line that contains <b> and resumes output at the line that contains </record>.
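
One limitation of the command above: it hides everything from <b> to the end of the record, so it only works while <b> and <c> are the last children of <record>. Here is a minimal sketch of a variant that hides just the <b>...</b> and <c>...</c> blocks. It assumes, as in the question's sample, that each opening and closing tag sits on its own line, and that <b> and <c> are siblings that are never nested inside themselves or each other. The function name strip_bc_lines is made up for illustration.

```shell
# Hypothetical helper: drop <b>...</b> and <c>...</c> blocks, keep everything else.
# Assumes one tag per line and no nesting of <b>/<c> inside <b>/<c>.
strip_bc_lines() {
  awk '
    /<(b|c)([ \t>]|$)/ { hide = 1 }   # opening <b> or <c> (attributes allowed): start hiding
    !hide              { print }      # print only lines outside a hidden block
    /<\/(b|c)[ \t]*>/  { hide = 0 }   # closing </b> or </c>: resume after this line
  ' "$@"
}

# Usage (file names taken from the command above):
#   strip_bc_lines my.xml > mynew.xml
```

Since awk reads one line at a time, memory use should stay small no matter how large the file is, as long as the individual lines are short.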

Per your comment: if your XML is one big line, just split it into lines, run the filter, and then strip the newline characters again afterwards.

$ cat my.xml | sed 's/>/>\n/g' | awk ....... | tr -d '\n' > .....
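
A fuller sketch of the split/filter/join idea, under a few assumptions: the tags are plain <b>...</b> and <c>...</c> (siblings, not nested in themselves or each other, no self-closing form), no comment or CDATA section contains text that looks like those tags, and the file ends with a closing tag. It swaps sed for tr for two reasons: \n in the replacement part of s/// is not portable (GNU sed accepts it), and sed has to hold a whole input line in memory, which is a concern when most of a >2gb file sits on one line. The function name strip_bc_stream is made up for illustration.

```shell
# Hypothetical helper: remove <b>...</b> and <c>...</c> from XML with no reliable line breaks.
strip_bc_stream() {
  tr -d '\n' |        # remove the stray line breaks first
  tr '>' '\n' |       # end every tag with a newline instead of ">"
  awk '
    BEGIN { ORS = "" }                 # no newline after each print: this re-joins the lines
    /<(b|c)([ \t]|$)/ { hide = 1 }     # token ends an opening <b ...> or <c ...> tag
    !hide             { print $0 ">" } # put back the ">" that tr replaced
    /<\/(b|c)[ \t]*$/ { hide = 0 }     # token ends a closing </b> or </c> tag
    END { print "\n" }                 # finish the output with one newline
  '
}

# Usage (file names taken from the answer):
#   strip_bc_stream < my.xml > mynew.xml
```

Because the first tr removes every line break, the handful of stray ones in the input are gone from the output too, which matches what the tr -d '\n' step in the pipeline above does.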

Throw away xml, start using YAML or JSON!

edited Sep 19 '14 at 17:50

answered Sep 18 '14 at 19:14

gena2x


This is an elegant solution. However, I made a mistake in my original post... My XML doesn't actually have line breaks after each node (I just showed it that way for readability). My actual XML file flows continuously on one line. (However, oddly, there are a handful of randomly scattered line breaks throughout.) So, I need a solution that doesn't depend on line breaks. – mattstuehler Sep 19 '14 at 15:40
Just add a line break after each > and then remove all the breaks (updated answer). – gena2x Sep 19 '14 at 16:33
