Recipe 12.5. Parsing Large XML Documents

12.5.1. Problem

You want to parse a large XML document. This document is so large that it's impractical to use SimpleXML or DOM because you cannot hold the entire document in memory. Instead, you must load the document in one section at a time.

12.5.2. Solution

Use the XMLReader extension:

<?php
$reader = new XMLReader();
$reader->open('card-catalog.xml');

/* Loop through document */
while ($reader->read()) {
    /* If you're at an element named 'author' */
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->localName == 'author') {
        /* Move to the text node and print it out */
        $reader->read();
        print $reader->value . "\n";
    }
}
?>

12.5.3. Discussion

There are two major types of XML parsers: ones that hold the entire document in memory at once, and ones that hold only a small portion of the document in memory at any given time.

The first kind are called tree-based parsers, since they store the document in a data structure known as a tree. The SimpleXML and DOM extensions, from Recipes 12.3 and 12.4, are tree-based parsers. Using a tree-based parser is easier for you, but requires PHP to use more RAM. With most XML documents, this isn't a problem. However, when your XML document is quite large, this can cause major performance issues.

The other kind of XML parser is a stream-based parser. Stream-based parsers don't store the entire document in memory; instead, they read in one node at a time and allow you to interact with it in real time. Once you move on to the next node, the old one is thrown away, unless you explicitly store it yourself for later use. This makes stream-based parsers faster and less memory-hungry, but you may have to write more code to process the document.

The easiest way to process XML data using a stream-based parser is using the XMLReader extension. This extension is based on the C# XmlTextReader API. If you're familiar with the SAX (Simple API for XML) interface from PHP 4, it's still available in PHP 5, but the XMLReader extension is more intuitive, feature-rich, and faster.

XMLReader is enabled by default as of PHP 5.1. If you're running PHP 5.0.x, grab the extension from PECL at http://pecl.php.net/package/xmlReader and install it yourself.

Begin by creating a new instance of the XMLReader class and specifying the location of your XML data:

<?php
// Create a new XMLReader object
$reader = new XMLReader();

// Load from a file or URL
$reader->open('document.xml');

// Or, load from a PHP variable
$reader->XML($document);
?>

Most of the time, you'll use the XMLReader::open( ) method to pull in data from an external source, but you can also load it from an existing PHP variable with XMLReader::XML( ).
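Both loading methods return false when they cannot set up the source, so it's worth checking the result before you start reading. The following is a minimal sketch of that idea, not part of the original recipe: the rootElementName( ) helper and the temporary file are hypothetical stand-ins for your own code and document, and the type declarations assume PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the name of the first element the reader finds.
function rootElementName(XMLReader $reader): ?string
{
    while ($reader->read()) {
        if ($reader->nodeType == XMLReader::ELEMENT) {
            return $reader->name;
        }
    }
    return null;
}

$xml = '<books><book isbn="1565926811"/></books>';

// Load from a PHP variable
$fromString = new XMLReader();
if (!$fromString->XML($xml)) {
    throw new RuntimeException('Could not load XML from string');
}
$nameFromString = rootElementName($fromString);
$fromString->close();

// Load from a file (a temporary file stands in for a real document)
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, $xml);
$fromFile = new XMLReader();
if (!$fromFile->open($path)) {
    throw new RuntimeException("Could not open $path");
}
$nameFromFile = rootElementName($fromFile);
$fromFile->close();
unlink($path);

print "$nameFromString $nameFromFile\n";
```

Either way, the reader behaves the same once the source is loaded; only the setup call differs.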

Once the object is configured, you begin processing the data. At the start, you're positioned at the top of the document. You can maneuver through the document using a combination of the two navigation methods XMLReader provides: XMLReader::read( ) and XMLReader::next( ). The first method reads in the piece of XML data that immediately follows the current position. The second method moves to the next sibling element after the current position.

For example, look at the XML in Example 12-8.

Example 12-8. Card catalog in XML

<books>
    <book isbn="1565926811">
        <title>PHP Cookbook</title>
        <author>Sklar</author>
        <author>Trachtenberg</author>
        <subject>PHP</subject>
    </book>
    <book isbn="0596003137">
        <title>Perl Cookbook</title>
        <author>Christiansen</author>
        <author>Torkington</author>
        <subject>Perl</subject>
    </book>
</books>

When the object is positioned at the first <book> element, the read( ) method moves you to the next node underneath <book>. (This is technically the whitespace between <book> and <title>.) In comparison, next( ) skips the entire PHP Cookbook subtree and moves you on toward the next <book> element. (Again, the first node you land on is technically the whitespace that follows </book>.)
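To see the difference between the two methods for yourself, here is a small sketch. It assumes a cut-down catalog with the whitespace between tags removed, so that each step lands directly on an element; stepFromFirstBook( ) is a hypothetical helper written for this illustration.

```php
<?php
// Whitespace between tags is left out on purpose so each step lands on an element
$catalog = '<books>'
    . '<book isbn="1565926811"><title>PHP Cookbook</title></book>'
    . '<book isbn="0596003137"><title>Perl Cookbook</title></book>'
    . '</books>';

// Hypothetical helper: go to the first <book>, take one step, report where you are
function stepFromFirstBook(string $xml, string $method): string
{
    $reader = new XMLReader();
    $reader->XML($xml);

    // Advance to the first <book> element
    while ($reader->read()) {
        if ($reader->nodeType == XMLReader::ELEMENT && $reader->localName == 'book') {
            break;
        }
    }

    // $method is either 'read' or 'next'
    $reader->$method();
    $position = $reader->name . ' (depth ' . $reader->depth . ')';
    $reader->close();

    return $position;
}

print stepFromFirstBook($catalog, 'read') . "\n"; // title (depth 2)
print stepFromFirstBook($catalog, 'next') . "\n"; // book (depth 1)
```

The read( ) call descends into the subtree, while next( ) stays at the same depth and skips everything inside the first <book>.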

These methods return true when they're able to successfully move to another node, and false when they cannot. So, it's typical to use them inside a while loop, like this:

/* Loop through document */
while ($reader->read()) {
    /* Process XML */
}

This causes the object to read in the entire XML document one piece at a time. Inside the while( ), examine $reader and process it accordingly.

A common aspect to check is the node type. This lets you know if you've reached an element (and then check the name of that element), a closing element, an attribute, a piece of text, some whitespace, or any other part of an XML document. Do this by referencing the nodeType property:

/* Loop through document */
while ($reader->read()) {
    /* If you're at an element named 'author' */
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->localName == 'author') {
        /* Process author element */
    }
}

This code checks if the node is an element and, if so, that its name is author. For a complete list of possible values stored in nodeType, check out Table 12-1.

Table 12-1. XMLReader node type values
Node type	Description
`XMLReader::NONE`	No node type
`XMLReader::ELEMENT`	Start element
`XMLReader::ATTRIBUTE`	Attribute node
`XMLReader::TEXT`	Text node
`XMLReader::CDATA`	CDATA node
`XMLReader::ENTITY_REF`	Entity Reference node
`XMLReader::ENTITY`	Entity Declaration node
`XMLReader::PI`	Processing Instruction node
`XMLReader::COMMENT`	Comment node
`XMLReader::DOC`	Document node
`XMLReader::DOC_TYPE`	Document Type node
`XMLReader::DOC_FRAGMENT`	Document Fragment node
`XMLReader::NOTATION`	Notation node
`XMLReader::WHITESPACE`	Whitespace node
`XMLReader::SIGNIFICANT_WHITESPACE`	Significant Whitespace node
`XMLReader::END_ELEMENT`	End Element
`XMLReader::END_ENTITY`	End Entity
`XMLReader::XML_DECLARATION`	XML Declaration node
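When you're getting a feel for a new document, it can help to print the node types the reader reports as it goes. This is only an illustrative sketch: nodeTypeLabel( ) and traceNodes( ) are hypothetical helpers, and the label map covers just a handful of the constants from Table 12-1.

```php
<?php
// Hypothetical helper: turn a few nodeType integers into readable labels
function nodeTypeLabel(int $type): string
{
    $labels = array(
        XMLReader::ELEMENT                => 'ELEMENT',
        XMLReader::TEXT                   => 'TEXT',
        XMLReader::CDATA                  => 'CDATA',
        XMLReader::COMMENT                => 'COMMENT',
        XMLReader::WHITESPACE             => 'WHITESPACE',
        XMLReader::SIGNIFICANT_WHITESPACE => 'SIGNIFICANT_WHITESPACE',
        XMLReader::END_ELEMENT            => 'END_ELEMENT',
    );

    return isset($labels[$type]) ? $labels[$type] : "OTHER($type)";
}

// Hypothetical helper: list every node read() visits as "LABEL name"
function traceNodes(string $xml): array
{
    $reader = new XMLReader();
    $reader->XML($xml);

    $trace = array();
    while ($reader->read()) {
        $trace[] = nodeTypeLabel($reader->nodeType) . ' ' . $reader->name;
    }
    $reader->close();

    return $trace;
}

print implode("\n", traceNodes('<a>hi<!-- note --><b/></a>')) . "\n";
// ELEMENT a
// TEXT #text
// COMMENT #comment
// ELEMENT b
// END_ELEMENT a
```

Notice that the empty element <b/> produces a start element but no matching end element, and that attributes never show up in the read( ) loop at all.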

From there, you can decide how to handle that element and the data it contains. For example, printing out all the author names in the card catalog:

$reader = new XMLReader();
$reader->open('card-catalog.xml');

/* Loop through document */
while ($reader->read()) {
    /* If you're at an element named 'author' */
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->localName == 'author') {
        /* Move to the text node and print it out */
        $reader->read();
        print $reader->value . "\n";
    }
}

Sklar
Trachtenberg
Christiansen
Torkington

Once you've reached the <author> element, call $reader->read( ) to advance to the text inside it. From there, you can find the author names inside of $reader->value.

The XMLReader::value property gives you access to a node's value. This only applies to nodes where this is a meaningful concept, such as text nodes or CDATA nodes. In all other cases, such as element nodes, this property is set to the empty string.
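One way to tell whether a node carries a value is the hasValue property. The sketch below, using a hypothetical nodeValues( ) helper and a made-up fragment, gathers the value of every node that reports one and ignores the rest.

```php
<?php
// Hypothetical helper: collect the value of every node that reports one
function nodeValues(string $xml): array
{
    $reader = new XMLReader();
    $reader->XML($xml);

    $values = array();
    while ($reader->read()) {
        // Element and end-element nodes have no value, so they are skipped
        if ($reader->hasValue) {
            $values[] = $reader->value;
        }
    }
    $reader->close();

    return $values;
}

print implode("\n", nodeValues('<p>one<b>two</b><![CDATA[3 < 4]]></p>')) . "\n";
// one
// two
// 3 < 4
```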

Table 12-2 contains a complete listing of XMLReader object properties, including value.

Table 12-2. XMLReader object properties
Name	Type	Description
`attributeCount`	int	Number of node attributes
`baseURI`	string	Base URI of the node
`depth`	int	Tree depth of the node, starting at 0
`hasAttributes`	bool	If the node has attributes
`hasValue`	bool	If the node has a text value
`isDefault`	bool	If the attribute value is defaulted from DTD
`isEmptyElement`	bool	If the node is an empty element tag
`localName`	string	Local name of the node
`name`	string	Qualified name of the node
`namespaceURI`	string	URI of the namespace associated with the node
`nodeType`	int	Node type of the node
`prefix`	string	Namespace prefix associated with the node
`value`	string	Text value of the node
`xmlLang`	string	`xml:lang` scope of the node
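As a quick illustration of a few of these properties, the following sketch reports the name, depth, attribute count, and empty-element flag for each element in a small document. The describeElements( ) helper and the sample XML, including the <cover/> element, are assumptions made up for this example.

```php
<?php
// Hypothetical helper: summarize each element using properties from Table 12-2
function describeElements(string $xml): array
{
    $reader = new XMLReader();
    $reader->XML($xml);

    $rows = array();
    while ($reader->read()) {
        if ($reader->nodeType != XMLReader::ELEMENT) {
            continue;
        }
        $rows[] = sprintf(
            '%s depth=%d attrs=%d empty=%s',
            $reader->name,
            $reader->depth,
            $reader->attributeCount,
            $reader->isEmptyElement ? 'yes' : 'no'
        );
    }
    $reader->close();

    return $rows;
}

$sample = '<books><book isbn="1565926811" lang="en">'
    . '<title>PHP Cookbook</title><cover/></book></books>';

print implode("\n", describeElements($sample)) . "\n";
// books depth=0 attrs=0 empty=no
// book depth=1 attrs=2 empty=no
// title depth=2 attrs=0 empty=no
// cover depth=2 attrs=0 empty=yes
```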

There's one remaining major piece of XMLReader functionality: attributes. XMLReader has a special set of methods to access attribute data when it's on top of an element node, including the following: moveToAttribute( ), moveToFirstAttribute( ), and moveToNextAttribute( ).

The moveToAttribute( ) method lets you specify an attribute name. For example, here's code using the card catalog XML to print out all the ISBN numbers:

<?php
$reader = new XMLReader();
$reader->XML($catalog);

/* Loop through document */
while ($reader->read()) {
    /* If you're at an element named 'book' */
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->localName == 'book') {
        $reader->moveToAttribute('isbn');
        print $reader->value . "\n";
    }
}
?>

Once you've found the <book> element, call moveToAttribute('isbn') to advance to the isbn attribute, so you can read its value and print it out.
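When you don't know the attribute names ahead of time, moveToFirstAttribute( ) and moveToNextAttribute( ) let you walk through all of them. Here is a minimal sketch of that loop; elementAttributes( ) is a hypothetical helper, and the lang attribute is invented for the example.

```php
<?php
// Hypothetical helper: return all attributes of the first element with the given name
function elementAttributes(string $xml, string $elementName): array
{
    $reader = new XMLReader();
    $reader->XML($xml);

    $attributes = array();
    while ($reader->read()) {
        if ($reader->nodeType == XMLReader::ELEMENT && $reader->localName == $elementName) {
            // moveToFirstAttribute() returns false when there are no attributes
            if ($reader->moveToFirstAttribute()) {
                do {
                    $attributes[$reader->name] = $reader->value;
                } while ($reader->moveToNextAttribute());

                // Step back from the last attribute to the element that owns it
                $reader->moveToElement();
            }
            break;
        }
    }
    $reader->close();

    return $attributes;
}

$book = '<book isbn="0596003137" lang="en"><title>Perl Cookbook</title></book>';
foreach (elementAttributes($book, 'book') as $name => $value) {
    print "$name: $value\n";
}
// isbn: 0596003137
// lang: en
```

If you only need a single attribute and don't want to move the reader, XMLReader::getAttribute( ) returns its value directly.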

In the examples in this recipe, we print out information on all books. However, it's easy to modify them to retrieve data only for one specific book. For example, this code combines pieces of the examples to print out all the data for Perl Cookbook in an efficient fashion:

<?php
$reader = new XMLReader();
$reader->XML($catalog);

// Perl Cookbook ISBN is 0596003137
// Use array to make it easy to add additional ISBNs
$isbns = array('0596003137' => true);

/* Loop through document to find first <book> */
while ($reader->read()) {
    /* If you're at an element named 'book' */
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'book') {
        break;
    }
}

/* Loop through <book>s to find right ISBNs */
do {
    if ($reader->moveToAttribute('isbn') &&
        isset($isbns[$reader->value])) {
        while ($reader->read()) {
            switch ($reader->nodeType) {
            case XMLReader::ELEMENT:
                print $reader->localName . ": ";
                break;
            case XMLReader::TEXT:
                print $reader->value . "\n";
                break;
            case XMLReader::END_ELEMENT:
                if ($reader->localName == 'book') {
                    break 2;
                }
            }
        }
    }
} while ($reader->next());
?>

title: Perl Cookbook
author: Christiansen
author: Torkington
subject: Perl

The first while( ) iterates sequentially until it finds the first <book> element.

Having lined yourself up correctly, you then break out of the loop and start checking ISBN numbers. That's handled inside a do... while( ) loop that uses $reader->next( ) to move down the <book> list. You cannot use a regular while( ) here or you'll skip over the first <book>. Also, this is a perfect example of when to use $reader->next( ) instead of $reader->read( ).

If the ISBN matches a value in $isbns, then you want to process the data inside the current <book>. This is handled using yet another while( ) and a switch( ).

There are three different switch( ) cases: an opening element, element text, and a closing element. If you're at an opening element, you print out the element's name and a colon. If you're at a text node, you print out the textual data. And if you're at a closing element, you check to see whether you're closing the <book>. If so, then you've reached the end of the data for that particular book, and you need to return to the do... while( ) loop. This is handled using break 2;, which jumps back two levels instead of the usual one level.
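If you'd rather collect the data than print it, the same approach can be wrapped in a function. The following is one possible sketch under a few assumptions: bookFields( ) is a hypothetical name, it looks up a single ISBN with getAttribute( ) instead of moveToAttribute( ), and it stops at the first matching <book>. Here break 2; leaves both the inner while( ) and the do... while( ), because if and elseif don't count as levels.

```php
<?php
$catalog = <<<XML
<books>
    <book isbn="1565926811">
        <title>PHP Cookbook</title>
        <author>Sklar</author>
        <author>Trachtenberg</author>
        <subject>PHP</subject>
    </book>
    <book isbn="0596003137">
        <title>Perl Cookbook</title>
        <author>Christiansen</author>
        <author>Torkington</author>
        <subject>Perl</subject>
    </book>
</books>
XML;

// Hypothetical helper: return array(name, text) pairs for the first matching book
function bookFields(string $xml, string $isbn): array
{
    $reader = new XMLReader();
    $reader->XML($xml);
    $fields = array();

    /* Loop through document to find first <book> */
    while ($reader->read()) {
        if ($reader->nodeType == XMLReader::ELEMENT && $reader->localName == 'book') {
            break;
        }
    }

    /* Loop through <book>s, skipping each subtree that doesn't match */
    do {
        if ($reader->nodeType == XMLReader::ELEMENT
            && $reader->localName == 'book'
            && $reader->getAttribute('isbn') === $isbn) {
            $current = null; // name of the child element being read
            while ($reader->read()) {
                if ($reader->nodeType == XMLReader::ELEMENT) {
                    $current = $reader->localName;
                } elseif ($reader->nodeType == XMLReader::TEXT && $current !== null) {
                    $fields[] = array($current, $reader->value);
                } elseif ($reader->nodeType == XMLReader::END_ELEMENT) {
                    if ($reader->localName == 'book') {
                        break 2; // done with this book; stop searching
                    }
                    $current = null;
                }
            }
        }
    } while ($reader->next());

    $reader->close();
    return $fields;
}

foreach (bookFields($catalog, '0596003137') as $field) {
    print $field[0] . ': ' . $field[1] . "\n";
}
```

The result is a list of pairs rather than an associative array because a book can have more than one <author>.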

12.5.4. See Also

Recipe 12.3 for parsing simple XML documents; Recipe 12.4 for parsing complex XML documents; documentation on XMLReader at http://www.php.net/xmlreader; more information about the underlying libxml2 C library's XMLReader functions at http://xmlsoft.org/xmlreader.html.
