I'm working with several huge (>2 GB) XML files, and their size is causing problems.
(My application uses XMLReader in a PHP script to parse smaller ~500 MB files, and that works fine, but XMLReader won't open these larger files.)
So my idea is to strip out the big parent nodes that I know I don't need.
For example, suppose the structure of the file looks like this:
<record id="1">
  <a>
    <detail>blah</detail>
    ....
    <detail>blah</detail>
  </a>
  <b>
    <detail>blah</detail>
    ....
    <detail>blah</detail>
  </b>
  <c>
    <detail>blah</detail>
    ....
    <detail>blah</detail>
  </c>
</record>
...
<record id="999999">
  <a>
    <detail>blah</detail>
    ....
    <detail>blah</detail>
  </a>
  <b>
    <detail>blah</detail>
    ....
    <detail>blah</detail>
  </b>
  <c>
    <detail>blah</detail>
    ....
    <detail>blah</detail>
  </c>
</record>
For my purposes, I only need the data in parent node <a> for each record. If I could eliminate parent nodes <b> and <c> from every record, I could reduce the size of the file substantially, so it would be small enough to work with normally.
What's the best way to do something like this?
Most of the "XML-aware" utilities I've tried choke on a file this large, so I'm hoping that I can do this with something like sed or grep.
[...] awk or sed, but you should not rely on regex to allow the manipulation or parsing of arbitrary XML. – HalosGhost Sep 18 '14 at 17:28
[...] <b>(.|\n)*?</b> match all of the <b> nodes, regardless of what else was within them? – mattstuehler Sep 18 '14 at 18:10
[...] (<b test-attrib="things">) or self closing tags (<hr />). Regex are only truly functional on regular languages; XML/(X)HTML are not regular languages. – HalosGhost Sep 18 '14 at 18:25
[...] int for file offsets (which only goes up to 2GB) :(. – sourcejedi Sep 19 '14 at 19:02
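Given the warnings above about regex, one alternative is a streaming (SAX) filter, which reads the input in chunks and never builds the whole tree in memory. The following is only a minimal sketch in Python, not something tested on a multi-gigabyte file. It assumes the file is well-formed with a single root element wrapping the records (the root is not shown in the question), and the names `DropElements` and `strip_elements` are made up for illustration. It drops every element named `b` or `c` wherever it occurs, so it would need adjusting if those names also appear inside `<a>`.

```python
import xml.sax
from xml.sax.saxutils import XMLGenerator


class DropElements(XMLGenerator):
    """Re-emit SAX events as XML, skipping the named elements and everything inside them."""

    def __init__(self, out, drop):
        super().__init__(out, encoding="utf-8")
        self._drop = set(drop)
        self._depth = 0  # greater than 0 while inside a dropped subtree

    def startElement(self, name, attrs):
        if self._depth or name in self._drop:
            self._depth += 1  # entering, or nesting deeper into, a dropped subtree
        else:
            super().startElement(name, attrs)

    def endElement(self, name):
        if self._depth:
            self._depth -= 1  # leaving one level of a dropped subtree
        else:
            super().endElement(name)

    def characters(self, content):
        if not self._depth:
            super().characters(content)

    def ignorableWhitespace(self, content):
        if not self._depth:
            super().ignorableWhitespace(content)


def strip_elements(src, dst, drop=("b", "c")):
    """Stream XML from src to dst, omitting the elements listed in drop."""
    xml.sax.parse(src, DropElements(dst, drop))


# Hypothetical usage (file names are placeholders):
#   with open("huge.xml", "rb") as src, open("smaller.xml", "wb") as dst:
#       strip_elements(src, dst)
```

Because the handler only tracks a depth counter, memory use should stay flat regardless of file size. Whitespace that sat between the removed elements is kept, so the output will contain some blank lines where `<b>` and `<c>` used to be.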