4

I want to check well-formedness of a big XML file. (about 4GB.)

However, when I try xmlwf, all it tells me is

filename.xml: Value too large for defined data type

What to do with it? Is there any other way to check it?

(I am using debian linux and gentoo linux)

Karel Bílek
  • I'd guess any XML parser would work, since they are all required to reject documents which aren't well-formed. A quick search suggests checking whether xmlstarlet does what you want. – derobert Feb 22 '13 at 18:34
  • 1
    From man xmlwf: "-r Normally xmlwf memory-maps the XML file before parsing; this can result in faster parsing on many platforms. -r turns off memory-mapping and uses normal file IO calls instead. Of course, memory-mapping is automatically turned off when reading from standard input." By the way, I assume you are using a 64-bit setup... – Deer Hunter Feb 22 '13 at 19:22
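
The man page excerpt above points at two ways to keep xmlwf from memory-mapping the file. A minimal sketch, assuming xmlwf is installed; the function names are hypothetical, and whether either variant gets past the "Value too large for defined data type" error on a given system is untested here:

```shell
# check_wf_nommap and check_wf_stdin are hypothetical helper names for this sketch.

# Variant 1: -r makes xmlwf use normal file I/O calls instead of memory-mapping.
check_wf_nommap() {
    xmlwf -r "$1"
}

# Variant 2: let the shell open the file and hand it over on standard input;
# per the man page, memory-mapping is turned off in that case as well.
check_wf_stdin() {
    xmlwf < "$1"
}
```

Calling `check_wf_stdin filename.xml` should print nothing if the file is well-formed and an error message with the position otherwise.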

4 Answers

2
xmllint --noout 4GB.xml

That sort of works.

It runs out of memory too, but at least it checks part of the file before it dies.
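
If memory is the limit, xmllint also has a --stream option that uses its streaming reader instead of building the whole tree in memory. A sketch, assuming xmllint from libxml2 is installed; the function name is hypothetical, and this has not been tried on a 4GB file:

```shell
# check_wf_stream is a hypothetical helper name for this sketch.
# --stream: parse with the streaming API rather than building the full tree.
# --noout:  suppress output of the parsed document; errors still go to stderr.
check_wf_stream() {
    xmllint --stream --noout "$1"
}
```

`check_wf_stream 4GB.xml` should exit with status 0 for a well-formed file and a non-zero status, plus an error message, otherwise.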

Karel Bílek
2

You might like to try dtdgen, a program I wrote many years ago to generate a DTD for a document. It not only tells you whether a large file is well-formed, it also tells you what's in it (I wrote it because I wanted to know both).

Kazark
0

It's an older question, but as I haven't seen it suggested yet:

Perl with XML::Twig can handle large XML files thanks to its 'purge' method, which discards in-memory data as you go.

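use strict;
use warnings;

use XML::Twig;

# Parse the file as a stream; parsefile() dies on the first well-formedness error.
my $twig = XML::Twig->new(
    twig_handlers => {
        # Runs for every element once it has been fully parsed;
        # purge frees everything parsed so far.
        _all_ => sub { $_->purge }
    }
)->parsefile( 'my_xml_file.xml' );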

The _all_ handler is triggered for each element of the twig, and discards the in-memory data. That's important on a 4G file, because the in-memory footprint of parsed XML is roughly 10x the file size. The parser will still throw an error and abort if the XML is not well-formed:

mismatched tag at line 12, column 27, byte 274 at C:/Perl/lib/XML/Parser.pm line 187.

(but bear in mind because it aborts, it'll only show you the first error it encounters).

Works on my (much smaller than 4G) sample data anyway.

Sobrique
0

I haven't tried it myself, but try this out:

xmllint --valid 4GB.xml