Recipe 12.5. Parsing Large XML Documents


12.5.1. Problem

You want to parse a large XML document. The document is so large that it's impractical to use SimpleXML or DOM, because you cannot hold the entire document in memory. Instead, you must load the document one section at a time.

12.5.2. Solution

Use the XMLReader extension:

<?php
$reader = new XMLReader();
$reader->open('card-catalog.xml');

/* Loop through document */
while ($reader->read()) {
    /* If you're at an element named 'author' */
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'author') {
        /* Move to the text node and print it out */
        $reader->read();
        print $reader->value . "\n";
    }
}
?>

12.5.3. Discussion

There are two major types of XML parsers: ones that hold the entire document in memory at once, and ones that hold only a small portion of the document in memory at any given time.

The first kind are called tree-based parsers, since they store the document in a data structure known as a tree. The SimpleXML and DOM extensions, from Recipes 12.3 and 12.4, are tree-based parsers. Using a tree-based parser is easier for you, but requires PHP to use more RAM. With most XML documents, this isn't a problem. However, when your XML document is quite large, this can cause major performance issues.

The other kind of XML parser is a stream-based parser. Stream-based parsers don't store the entire document in memory; instead, they read in one node at a time and allow you to interact with it in real time. Once you move on to the next node, the old one is thrown away, unless you explicitly store it yourself for later use. This makes stream-based parsers faster and less memory-hungry, but you may have to write more code to process the document.

The easiest way to process XML data with a stream-based parser is the XMLReader extension. This extension is based on the C# XmlTextReader API. If you're familiar with the SAX (Simple API for XML) interface from PHP 4, it's still available in PHP 5, but the XMLReader extension is more intuitive, feature-rich, and faster.

XMLReader is enabled by default as of PHP 5.1. If you're running PHP 5.0.x, grab the extension from PECL at http://pecl.php.net/package/xmlReader and install it yourself.

Begin by creating a new instance of the XMLReader class and specifying the location of your XML data:

<?php
// Create a new XMLReader object
$reader = new XMLReader();

// Load from a file or URL
$reader->open('document.xml');

// Or, load from a PHP variable
$reader->XML($document);
?>

Most of the time, you'll use the XMLReader::open( ) method to pull in data from an external source, but you can also load it from an existing PHP variable with XMLReader::XML( ).
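Both loading methods return false on failure, so it can be worth checking the result before reading. The following is a minimal sketch, not a definitive pattern: the helper firstElementName() and the temporary file are assumptions made for illustration, and the sample XML is a cut-down version of the card catalog.

```php
<?php
// Hypothetical helper (not part of XMLReader): returns the name of the
// first element the reader encounters, or null if there is none.
function firstElementName(XMLReader $reader): ?string
{
    while ($reader->read()) {
        if ($reader->nodeType == XMLReader::ELEMENT) {
            return $reader->name;
        }
    }
    return null;
}

$xml = '<books><book isbn="1565926811"/></books>';

// Load from a file: write the sample to a temporary file first.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, $xml);

$fromFile = new XMLReader();
if (!$fromFile->open($path)) {
    exit("Could not open $path\n");
}
$fileRoot = firstElementName($fromFile);
$fromFile->close();
unlink($path);

// Load the same data from a PHP variable.
$fromString = new XMLReader();
if (!$fromString->XML($xml)) {
    exit("Could not load XML string\n");
}
$stringRoot = firstElementName($fromString);
$fromString->close();

print "$fileRoot $stringRoot\n";
```

Either way, the reader behaves the same once the data source is set; only the loading call differs.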

Once the object is configured, you begin processing the data. At the start, you're positioned at the top of the document. You can maneuver through the document using a combination of the two navigation methods XMLReader provides: XMLReader::read( ) and XMLReader::next( ). The first method reads in the piece of XML data that immediately follows the current position. The second method moves to the next sibling element after the current position.

For example, look at the XML in Example 12-8.

Example 12-8. Card catalog in XML

<books>
    <book isbn="1565926811">
        <title>PHP Cookbook</title>
        <author>Sklar</author>
        <author>Trachtenberg</author>
        <subject>PHP</subject>
    </book>
    <book isbn="0596003137">
        <title>Perl Cookbook</title>
        <author>Christiansen</author>
        <author>Torkington</author>
        <subject>Perl</subject>
    </book>
</books>

When the object is positioned at the first <book> element, the read( ) method moves you to the next node underneath <book>. (This is technically the whitespace between <book> and <title>.) In comparison, next( ) skips the entire PHP Cookbook subtree and moves you to the node that follows the closing </book> tag; calling it again brings you to the next <book> element.

These methods return true when they're able to successfully move to another node, and false when they cannot. So, it's typical to use them inside a while loop, like this:
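To make the difference between the two navigation methods concrete, here is a small sketch. The function name elementTrail() and its $useNext flag are hypothetical; the code positions the reader on the first <book> and then records the name of every element node it lands on while advancing with either next( ) or read( ).

```php
<?php
$catalog = <<<'XML'
<books>
    <book isbn="1565926811">
        <title>PHP Cookbook</title>
        <author>Sklar</author>
    </book>
    <book isbn="0596003137">
        <title>Perl Cookbook</title>
        <author>Christiansen</author>
    </book>
</books>
XML;

// Hypothetical helper: lists the element names visited from the first
// <book> onward, advancing with next() or read() depending on $useNext.
function elementTrail(string $xml, bool $useNext): array
{
    $reader = new XMLReader();
    $reader->XML($xml);

    // Walk down to the first <book> element with read().
    $found = false;
    while ($reader->read()) {
        if ($reader->nodeType == XMLReader::ELEMENT &&
            $reader->localName == 'book') {
            $found = true;
            break;
        }
    }
    if (!$found) {
        return [];
    }

    $trail = [];
    do {
        // Record element nodes only; whitespace and end tags are ignored.
        if ($reader->nodeType == XMLReader::ELEMENT) {
            $trail[] = $reader->localName;
        }
    } while ($useNext ? $reader->next() : $reader->read());

    $reader->close();
    return $trail;
}

print implode(', ', elementTrail($catalog, true)) . "\n";
print implode(', ', elementTrail($catalog, false)) . "\n";
```

With next( ), the trail should contain only the two <book> elements, because each subtree is skipped. With read( ), every <title> and <author> appears as well.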

/* Loop through document */
while ($reader->read()) {
    /* Process XML */
}

This causes the object to read in the entire XML document one piece at a time. Inside the while( ), examine $reader and process it accordingly.

A common aspect to check is the node type. This lets you know if you've reached an element (and then check the name of that element), a closing element, an attribute, a piece of text, some whitespace, or any other part of an XML document. Do this by referencing the nodeType attribute:

/* Loop through document */
while ($reader->read()) {
    /* If you're at an element named 'author' */
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'author') {
        /* Process author element */
    }
}

This code checks if the node is an element and, if so, that its name is author. For a complete list of possible values stored in nodeType, check out Table 12-1.

Table 12-1. XMLReader node type values

Node type                           Description
XMLReader::NONE                     No node type
XMLReader::ELEMENT                  Start element
XMLReader::ATTRIBUTE                Attribute node
XMLReader::TEXT                     Text node
XMLReader::CDATA                    CDATA node
XMLReader::ENTITY_REF               Entity Reference node
XMLReader::ENTITY                   Entity Declaration node
XMLReader::PI                       Processing Instruction node
XMLReader::COMMENT                  Comment node
XMLReader::DOC                      Document node
XMLReader::DOC_TYPE                 Document Type node
XMLReader::DOC_FRAGMENT             Document Fragment node
XMLReader::NOTATION                 Notation node
XMLReader::WHITESPACE               Whitespace node
XMLReader::SIGNIFICANT_WHITESPACE   Significant Whitespace node
XMLReader::END_ELEMENT              End Element
XMLReader::END_ENTITY               End Entity
XMLReader::XML_DECLARATION          XML Declaration node
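When you're exploring an unfamiliar document, it can help to see which node types the reader actually reports. The sketch below is one way to do that; nodeTypeLabel() and traceNodeTypes() are hypothetical helper names, and the label map covers only a handful of the constants from Table 12-1.

```php
<?php
// Hypothetical helper: maps a few nodeType values to readable labels.
// Types missing from the map are reported as OTHER(n).
function nodeTypeLabel(int $type): string
{
    static $labels = [
        XMLReader::ELEMENT                => 'ELEMENT',
        XMLReader::TEXT                   => 'TEXT',
        XMLReader::CDATA                  => 'CDATA',
        XMLReader::COMMENT                => 'COMMENT',
        XMLReader::WHITESPACE             => 'WHITESPACE',
        XMLReader::SIGNIFICANT_WHITESPACE => 'SIGNIFICANT_WHITESPACE',
        XMLReader::END_ELEMENT            => 'END_ELEMENT',
    ];
    return $labels[$type] ?? "OTHER($type)";
}

// Hypothetical helper: returns the label of every node read() visits.
function traceNodeTypes(string $xml): array
{
    $reader = new XMLReader();
    $reader->XML($xml);

    $trace = [];
    while ($reader->read()) {
        $trace[] = nodeTypeLabel($reader->nodeType);
    }
    $reader->close();
    return $trace;
}

print implode("\n", traceNodeTypes('<author>Sklar<!-- note --></author>')) . "\n";
```

Note that attributes never show up in such a trace: read( ) does not stop on them, so they must be reached through the attribute methods instead.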


From there, you can decide how to handle that element and the data it contains. For example, printing out all the author names in the card catalog:

$reader = new XMLReader();
$reader->open('card-catalog.xml');

/* Loop through document */
while ($reader->read()) {
    /* If you're at an element named 'author' */
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'author') {
        /* Move to the text node and print it out */
        $reader->read();
        print $reader->value . "\n";
    }
}

Sklar
Trachtenberg
Christiansen
Torkington

Once you've reached the <author> element, call $reader->read( ) to advance to the text inside it. From there, you can find the author names inside of $reader->value.

The XMLReader::value attribute gives you access to a node's value. This only applies to nodes where a value is a meaningful concept, such as text nodes or CDATA nodes. In all other cases, such as element nodes, this attribute is set to the empty string.

Table 12-2 contains a complete listing of XMLReader object properties, including value.

Table 12-2. XMLReader object properties

Name             Type     Description
attributeCount   int      Number of node attributes
baseURI          string   Base URI of the node
depth            int      Tree depth of the node, starting at 0
hasAttributes    bool     If the node has attributes
hasValue         bool     If the node has a text value
isDefault        bool     If the attribute value is defaulted from DTD
isEmptyElement   bool     If the node is an empty element tag
localName        string   Local name of the node
name             string   Qualified name of the node
namespaceURI     string   URI of the namespace associated with the node
nodeType         int      Node type of the node
prefix           string   Namespace prefix associated with the node
value            string   Text value of the node
xmlLang          string   xml:lang scope of the node
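As an illustration of how a few of these properties combine, the following sketch builds an indented outline of a document from depth, name, and isEmptyElement. The function name outlineElements() and the two-space indent are assumptions chosen for the example.

```php
<?php
// Hypothetical helper: returns one line per element, indented by depth.
// Empty element tags such as <c/> are flagged with "(empty)".
function outlineElements(string $xml): array
{
    $reader = new XMLReader();
    $reader->XML($xml);

    $lines = [];
    while ($reader->read()) {
        if ($reader->nodeType != XMLReader::ELEMENT) {
            continue;
        }
        $line = str_repeat('  ', $reader->depth) . $reader->name;
        if ($reader->isEmptyElement) {
            $line .= ' (empty)';
        }
        $lines[] = $line;
    }
    $reader->close();
    return $lines;
}

print implode("\n", outlineElements('<books><book><title>PHP Cookbook</title><note/></book></books>')) . "\n";
```

Checking isEmptyElement matters in practice because an empty element tag produces no matching END_ELEMENT node, which can trip up code that counts opening and closing tags.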


There's one remaining major piece of XMLReader functionality: attributes. XMLReader has a special set of methods to access attribute data when it's on top of an element node, including the following: moveToAttribute( ), moveToFirstAttribute( ), and moveToNextAttribute( ).

The moveToAttribute( ) method lets you specify an attribute name. For example, here's code using the card catalog XML to print out all the ISBN numbers:

<?php
$reader = new XMLReader();
$reader->XML($catalog);

/* Loop through document */
while ($reader->read()) {
    /* If you're at an element named 'book' */
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'book') {
        $reader->moveToAttribute('isbn');
        print $reader->value . "\n";
    }
}
?>

Once you've found the <book> element, call moveToAttribute('isbn') to advance to the isbn attribute, so you can read its value and print it out.
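When you don't know the attribute names in advance, moveToFirstAttribute( ) and moveToNextAttribute( ) let you walk over all of them. The sketch below assumes a hypothetical helper, collectAttributes(), that gathers every attribute of each matching element into an array; the lang attribute in the sample data is invented for the example and is not part of the card catalog.

```php
<?php
// Hypothetical helper: returns one name => value array per element
// whose local name matches $elementName.
function collectAttributes(string $xml, string $elementName): array
{
    $reader = new XMLReader();
    $reader->XML($xml);

    $result = [];
    while ($reader->read()) {
        if ($reader->nodeType == XMLReader::ELEMENT &&
            $reader->localName == $elementName) {
            $attributes = [];
            // moveToFirstAttribute() returns false if there are none.
            if ($reader->moveToFirstAttribute()) {
                do {
                    $attributes[$reader->name] = $reader->value;
                } while ($reader->moveToNextAttribute());
                // Step back onto the element that owns the attributes.
                $reader->moveToElement();
            }
            $result[] = $attributes;
        }
    }
    $reader->close();
    return $result;
}

$sample = '<books><book isbn="1565926811" lang="en"/><book isbn="0596003137"/></books>';
print_r(collectAttributes($sample, 'book'));
```

Calling moveToElement( ) after the loop is a defensive choice here: it returns the cursor to the owning element so later code can rely on nodeType being an element again.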

In the examples in this recipe, we print out information on all books. However, it's easy to modify them to retrieve data only for one specific book. For example, this code combines pieces of the examples to print out all the data for Perl Cookbook in an efficient fashion:

<?php
$reader = new XMLReader();
$reader->XML($catalog);

// Perl Cookbook ISBN is 0596003137
// Use array to make it easy to add additional ISBNs
$isbns = array('0596003137' => true);

/* Loop through document to find first <book> */
while ($reader->read()) {
    /* If you're at an element named 'book' */
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'book') {
        break;
    }
}

/* Loop through <book>s to find right ISBNs */
do {
    if ($reader->moveToAttribute('isbn') &&
        isset($isbns[$reader->value])) {
        while ($reader->read()) {
            switch ($reader->nodeType) {
            case XMLReader::ELEMENT:
                print $reader->localName . ": ";
                break;
            case XMLReader::TEXT:
                print $reader->value . "\n";
                break;
            case XMLReader::END_ELEMENT:
                /* Leave both the switch and the inner while */
                if ($reader->localName == 'book') {
                    break 2;
                }
            }
        }
    }
} while ($reader->next());
?>

title: Perl Cookbook
author: Christiansen
author: Torkington
subject: Perl

The first while( ) iterates sequentially until it finds the first <book> element.

Having lined yourself up correctly, you then break out of the loop and start checking ISBN numbers. That's handled inside a do... while( ) loop that uses $reader->next( ) to move down the <book> list. You cannot use a regular while( ) here or you'll skip over the first <book>. Also, this is a perfect example of when to use $reader->next( ) instead of $reader->read( ).

If the ISBN matches a value in $isbns, then you want to process the data inside the current <book>. This is handled using yet another while( ) and a switch( ).

There are three different switch( ) cases: an opening element, element text, and a closing element. If you're at an opening element, you print out the element's name and a colon. If you're at a text node, you print out the textual data. And if you're at a closing element, you check to see whether you're closing the <book>. If so, then you've reached the end of the data for that particular book, and you need to return to the do... while( ) loop. This is handled using break 2;, which jumps out of two levels (the switch( ) and the inner while( )) instead of the usual one.
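The same idea can also be packaged as a function that returns the data instead of printing it. The following is a sketch under stated assumptions rather than a replacement for the code above: fieldsForIsbn() is a hypothetical name, it looks up a single ISBN, and it uses getAttribute( ) and the depth property in place of moveToAttribute( ) and break 2.

```php
<?php
// Hypothetical helper: returns [name, text] pairs for the child elements
// of the first <book> whose isbn attribute equals $isbn.
function fieldsForIsbn(string $xml, string $isbn): array
{
    $reader = new XMLReader();
    $reader->XML($xml);

    // Walk down to the first <book> element.
    $found = false;
    while ($reader->read()) {
        if ($reader->nodeType == XMLReader::ELEMENT &&
            $reader->localName == 'book') {
            $found = true;
            break;
        }
    }

    $fields = [];
    if ($found) {
        do {
            // next() also stops on whitespace between books, so re-check.
            if ($reader->nodeType != XMLReader::ELEMENT ||
                $reader->localName != 'book' ||
                $reader->getAttribute('isbn') !== $isbn) {
                continue;
            }
            $bookDepth = $reader->depth;
            $current = null;
            while ($reader->read()) {
                // The closing </book> sits at the same depth as <book>.
                if ($reader->nodeType == XMLReader::END_ELEMENT &&
                    $reader->depth == $bookDepth) {
                    break;
                }
                if ($reader->nodeType == XMLReader::ELEMENT) {
                    $current = $reader->localName;
                } elseif ($reader->nodeType == XMLReader::TEXT && $current !== null) {
                    $fields[] = [$current, $reader->value];
                }
            }
            break; // Stop after the first matching book.
        } while ($reader->next());
    }

    $reader->close();
    return $fields;
}

$catalog = '<books><book isbn="1565926811"><title>PHP Cookbook</title></book>'
         . '<book isbn="0596003137"><title>Perl Cookbook</title><subject>Perl</subject></book></books>';
foreach (fieldsForIsbn($catalog, '0596003137') as [$name, $text]) {
    print "$name: $text\n";
}
```

Because getAttribute( ) reads the value without moving the cursor, the reader stays on the <book> element, which keeps the depth comparison straightforward.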

12.5.4. See Also

Recipe 12.3 for parsing simple XML documents; Recipe 12.4 for parsing complex XML documents; documentation on XMLReader at http://www.php.net/xmlreader; more information about the underlying libxml2 C library's XMLReader functions at http://xmlsoft.org/xmlreader.html.




PHP Cookbook, 2nd Edition
PHP Cookbook: Solutions and Examples for PHP Programmers
ISBN: 0596101015
Year: 2006
Pages: 445
