Recipe 22.8 Processing Files Larger Than Available Memory

22.8.1 Problem

You want to work with a large XML file, but you can't read it into memory to form a DOM or other kind of tree because it's too big.

22.8.2 Solution

Use SAX (as described in Recipe 22.3) to process events instead of building a tree.
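As a rough sketch of the SAX approach, the handler below prints book titles without ever building a tree. It assumes the layout implied by the XML::Twig example in this recipe (a books root holding book elements, each with a title child). The names TitleHandler, on_title, and sax_titles are made up for illustration.

```perl
use strict;
use warnings;

package TitleHandler;               # hypothetical handler class
use base qw(XML::SAX::Base);

# SAX builds no tree, so remember the chain of open elements ourselves.
sub start_document {
    my ($self) = @_;
    $self->{path} = [];
}

sub start_element {
    my ($self, $el) = @_;
    push @{ $self->{path} }, $el->{Name};
    $self->{text} = '' if $self->in_title;
}

sub characters {
    my ($self, $chars) = @_;
    # Text can arrive in several chunks, so accumulate it.
    $self->{text} .= $chars->{Data} if $self->in_title;
}

sub end_element {
    my ($self, $el) = @_;
    $self->{on_title}->($self->{text}) if $self->in_title;
    pop @{ $self->{path} };
}

# True while the innermost open element is /books/book/title.
sub in_title {
    my ($self) = @_;
    return join('/', @{ $self->{path} }) eq 'books/book/title';
}

package main;
use XML::SAX::ParserFactory;

# Call $on_title once per title; nothing else is kept in memory.
sub sax_titles {
    my ($filename, $on_title) = @_;
    my $handler = TitleHandler->new(on_title => $on_title);
    my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
    $parser->parse_uri($filename);
}

if (@ARGV) {
    sax_titles($ARGV[0], sub { print "$_[0]\n" });
}
```

Run it as `perl titles.pl books.xml`. Memory use stays flat because the only state kept is the stack of open element names and the text of the current title.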

Alternatively, use XML::Twig to build trees only for the parts of the document you want to work with (as specified by XPath expressions):

    use XML::Twig;

    my $twig = XML::Twig->new( twig_handlers => {
                                   $XPATH_EXPRESSION => \&HANDLER,
                                   # ...
                               });
    $twig->parsefile($FILENAME);
    $twig->flush( );

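You can call a lot of DOM-like functions from within a handler. To keep memory use down, either register the expressions as twig_roots, so that only the matching elements (and whatever those elements enclose) go into a tree, or call flush or purge from within the handler to release the parts of the document you have already processed.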
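A minimal sketch of a handler that edits part of a document and streams the result follows. It assumes a books root containing book elements, each with a title child; upcase_titles is a made-up name. Calling flush inside the handler writes out everything parsed so far and frees it, so only one book is in memory at a time.

```perl
use strict;
use warnings;
use XML::Twig;

# Copy $in_file to $out_fh, upper-casing every book title on the way.
sub upcase_titles {
    my ($in_file, $out_fh) = @_;

    my $twig = XML::Twig->new(
        twig_handlers => {
            '/books/book' => sub {
                my ($t, $book) = @_;        # handlers get the twig and the element
                my $title = $book->first_child('title');
                $title->set_text(uc $title->text) if $title;
                $t->flush($out_fh);         # print and free what has been parsed
            },
        },
    );
    $twig->parsefile($in_file);
    $twig->flush($out_fh);                  # print whatever follows the last book
}

upcase_titles($ARGV[0], \*STDOUT) if @ARGV;
```

If you want to extract data rather than rewrite the file, call purge in place of flush. It frees the same elements without printing them.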

22.8.3 Discussion

DOM modules turn the entire document into a tree, regardless of whether you use all of it. With SAX modules, no trees are built; if your task depends on document structure, you must keep track of that structure yourself. A happy middle ground is XML::Twig, which creates DOM trees only for the bits of the file that you're interested in. Because you work with files a piece at a time, you can cope with very large files by processing pieces that fit in memory.

For example, to print the titles of books in books.xml (Example 22-1), you could write:

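    use XML::Twig;

    my $twig = XML::Twig->new( twig_roots => { '/books/book' => \&do_book });
    $twig->parsefile("books.xml");
    $twig->purge( );

    sub do_book {
      my($t, $book) = @_;            # handlers receive the twig and the element
      my($title) = $book->find_nodes("title");
      print $title->text, "\n";
      $t->purge( );                  # release the book we just processed
    }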

For each book element, XML::Twig calls do_book on its contents. That subroutine finds the title node and prints its text. Rather than having the entire file parsed into a DOM structure, we keep only one book element at a time.

Consult the XML::Twig manpages for details on how much DOM and XPath the module supports; it's not complete, but it's growing all the time. XML::Twig uses XML::Parser for its XML parsing, and as a result the functions available on nodes are slightly different from those provided by XML::LibXML's DOM parsing.

22.8.4 See Also

Recipe 22.6; the documentation for the module XML::Twig



Perl Cookbook, Second Edition
ISBN: 0596003137
Year: 2003
Pages: 501
