Section 8.3. Parsing XML | PHP 5 Power Programming

8.3. Parsing XML

Two techniques are used for parsing XML documents in PHP: SAX (Simple API for XML) and DOM (Document Object Model). With SAX, the parser goes through your document and fires an event for every start tag, end tag, or other item it finds. You decide how to deal with the generated events. With DOM, the whole XML file is parsed into a tree that you can walk through using functions from PHP. PHP 5 provides another way of parsing XML: the SimpleXML extension. But first, we explore the two mainstream methods.

8.3.1. SAX

We now leave the somewhat boring theory behind and start with an example. Here, we're parsing the example XHTML file we saw earlier. We do that by using the XML functions available in PHP (http://php.net/xml). First, we create a parser object:

 $xml = xml_parser_create('UTF-8');

The optional parameter, 'UTF-8', denotes the encoding to use while parsing. When this function executes successfully, it returns an XML parser handle for use with all the other XML parsing functions.
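As a small sketch that is not part of the example script, you can read the setting back from the handle to confirm it. The variable names are ours, and we assume the xml extension is enabled. Note that in PHP 8 the handle is an XMLParser object rather than a resource, but the functions are used the same way.

```php
<?php
// Create a parser whose handlers receive character data as UTF-8.
$parser = xml_parser_create('UTF-8');

// Read back the target encoding to confirm the setting.
$encoding = xml_parser_get_option($parser, XML_OPTION_TARGET_ENCODING);
echo "Target encoding: $encoding\n";
```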

Because SAX works by handling events, you need to set up the handlers. In this basic example, we focus on the two most important handlers: one for start and end tags, and one for character data (content):

 xml_set_element_handler($xml, 'start_handler', 'end_handler');
 xml_set_character_data_handler($xml, 'character_handler');

These statements register the handlers, but the handler functions themselves must still be implemented before parsing starts. Let's look at how the handler functions should be implemented.

In the previous statement, the start_handler is passed three parameters: the XML parser object, the name of the tag, and an associative array containing the attributes defined for the tag.

 function start_handler ($xml, $tag, $attributes) {
     global $level;
     /* Print the tag name, followed by its attributes */
     echo "\n". str_repeat('  ', $level). ">>>$tag";
     foreach ($attributes as $key => $value) {
         echo " $key='$value'";
     }
     $level++;
 }

The tag name is passed with all characters uppercased if case folding is enabled (the default). You can turn off this behavior by setting an option on the XML parser object, as follows:

 xml_parser_set_option($xml, XML_OPTION_CASE_FOLDING, false);
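To see the effect of this option in isolation, here is a minimal sketch that parses the same tiny document with case folding on and off. The helper collect_tags() and the sample document are hypothetical, the handlers are closures (which assumes PHP 5.3 or later), and the option is passed as the integers 1 and 0.

```php
<?php
// Parse a small document and return the tag names seen by the start handler.
// collect_tags() is a hypothetical helper used only for this illustration.
function collect_tags($xmlString, $caseFolding)
{
    $tags = array();
    $parser = xml_parser_create('UTF-8');
    // 1 enables case folding (the default), 0 disables it.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, $caseFolding ? 1 : 0);
    xml_set_element_handler(
        $parser,
        function ($parser, $tag, $attributes) use (&$tags) {
            $tags[] = $tag;
        },
        function ($parser, $tag) {
            // End tags are ignored in this sketch.
        }
    );
    xml_parse($parser, $xmlString, true);
    return $tags;
}

$doc = '<html><body background="bg.png"/></html>';
print_r(collect_tags($doc, true));  // tag names arrive as HTML and BODY
print_r(collect_tags($doc, false)); // tag names arrive as html and body
```

With case folding enabled, attribute names are uppercased as well, which is why the output later in this section shows XMLNS and LANG.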

The end handler is not passed the attributes array, only the XML parser object and the tag name:

 function end_handler ($xml, $tag) {
     global $level;
     $level--;
     echo str_repeat('  ', $level). "<<<$tag";
 }

To make our test script work, we need to implement the character handler to show all content. We wrap the text in this handler so that it fits nicely on our terminal screen:

 function character_handler ($xml, $data) {
     global $level;
     /* Wrap the text so that it fits the remaining screen width */
     $data = explode("\n", wordwrap($data, 76 - ($level * 2)));
     foreach ($data as $line) {
         echo str_repeat('  ', ($level + 1)). '|'. $line. "|\n";
     }
 }

After we implement all the handlers, we can start parsing our XML file:

 xml_parse($xml, file_get_contents('test1.xhtml'));
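The call above ignores the return value of xml_parse(). As a hedged sketch of how a failure could be reported, the following uses xml_get_error_code(), xml_error_string(), and xml_get_current_line_number(); the helper parse_or_error() and the sample strings are hypothetical and not part of the example script.

```php
<?php
// Parse a string and return null on success or an error message on failure.
// parse_or_error() is a hypothetical helper, not part of the example script.
function parse_or_error($xmlString)
{
    $parser = xml_parser_create('UTF-8');
    // xml_parse() returns 1 on success and 0 on failure.
    if (xml_parse($parser, $xmlString, true)) {
        return null;
    }
    return sprintf(
        'XML error: %s at line %d',
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)
    );
}

// A well-formed document yields null.
var_dump(parse_or_error('<p>ok</p>'));
// A mismatched tag yields an error message; the exact wording depends on the XML library.
echo parse_or_error("<p>\n<b>unclosed</p>"), "\n";
```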

The first part of the output of our script looks like this:

 >>>HTML XMLNS='http://www.w3.org/1999/xhtml' XML:LANG='en' LANG='en'
     ||
     ||
     |  |
   >>>HEAD
       ||
       ||
       |    |
     >>>TITLE
         |XML Example|
     <<<TITLE

It doesn't look very pretty. There's a lot of whitespace because the character data handler is called for every bit of data. We can improve the results by putting all data in a buffer, and only outputting the data when the tag closes or when another tag starts. The new script looks like this:

 <?php
     /* Initialize variables */
     $level = 0;
     $char_data = '';

     /* Create the parser handle */
     $xml = xml_parser_create('UTF-8');

     /* Set the handlers */
     xml_set_element_handler($xml, 'start_handler', 'end_handler');
     xml_set_character_data_handler($xml, 'character_handler');

     /* Start parsing the whole file in one run */
     xml_parse($xml, file_get_contents('test1.xhtml'));

     /****************************************************************
      * Functions
      */

     /*
      * Flushes collected data from the character handler
      */
     function flush_data ()
     {
         global $level, $char_data;

         /* Trim data and dump it when there is data */
         $char_data = trim($char_data);
         if (strlen($char_data) > 0) {
             echo "\n";
             // Wrap it nicely, so that it fits on a terminal screen
             $data = explode("\n", wordwrap($char_data, 76 - ($level * 2)));
             foreach ($data as $line) {
                 echo str_repeat('  ', ($level + 1)). "[". $line. "]\n";
             }
         }

         /* Clear the data in the buffer */
         $char_data = '';
     }

     /*
      * Handler for start tags
      */
     function start_handler ($xml, $tag, $attributes)
     {
         global $level;

         /* Flush collected data from the character handler */
         flush_data();

         /* Dump attributes as a string */
         echo "\n". str_repeat('  ', $level). "$tag";
         foreach ($attributes as $key => $value) {
             echo " $key='$value'";
         }

         /* Increase indentation level */
         $level++;
     }

     /*
      * Handler for end tags
      */
     function end_handler ($xml, $tag)
     {
         global $level;

         /* Flush collected data from the character handler */
         flush_data();

         /* Decrease indentation level and print end tag */
         $level--;
         echo "\n". str_repeat('  ', $level). "/$tag";
     }

     /*
      * Handler for character data
      */
     function character_handler ($xml, $data)
     {
         global $level, $char_data;

         /* Add the character data to the buffer */
         $char_data .= ' '. $data;
     }
 ?>

The output now looks much cleaner:

 HTML XMLNS='http://www.w3.org/1999/xhtml' XML:LANG='en' LANG='en'
   HEAD
     TITLE
         [XML Example]
     /TITLE
   /HEAD
   BODY BACKGROUND='bg.png'
     P
         [Moved to]
       A HREF='http://example.org/'
           [example.org]
       /A
         [.]
       BR
       /BR
         [foo  &  bar]
     /P
   /BODY
 /HTML
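The script above hands the whole file to xml_parse() in a single call. xml_parse() also accepts the document in pieces, with its third argument marking the final piece, which keeps memory use low for large inputs. The following is a minimal sketch of that approach; the helper count_elements_chunked(), the chunk sizes, and the in-memory stream that stands in for a file are our own illustrative choices, and the closures assume PHP 5.3 or later.

```php
<?php
// Count start tags while feeding the parser in fixed-size chunks.
// count_elements_chunked() and the chunk size are illustrative choices.
function count_elements_chunked($stream, $chunkSize = 4096)
{
    $count = 0;
    $parser = xml_parser_create('UTF-8');
    xml_set_element_handler(
        $parser,
        function ($parser, $tag, $attributes) use (&$count) {
            $count++;
        },
        function ($parser, $tag) {
            // End tags are ignored in this sketch.
        }
    );
    while (!feof($stream)) {
        $chunk = fread($stream, $chunkSize);
        // The third argument tells the parser whether this is the last piece.
        if (!xml_parse($parser, $chunk, feof($stream))) {
            throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
        }
    }
    return $count;
}

// Usage with an in-memory stream standing in for a large file.
$stream = fopen('php://memory', 'w+');
fwrite($stream, '<html><head><title>XML Example</title></head><body/></html>');
rewind($stream);
echo count_elements_chunked($stream, 16), "\n"; // 4 elements: html, head, title, body
```

Because the parser keeps its state between calls, tags that happen to be split across two chunks are still reported correctly.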

8.3.2. DOM

Parsing a simple X(HT)ML file with a SAX parser is a lot of work. Using the DOM (http://www.w3.org/TR/DOM-Level-3-Core/) method is much easier, but you pay a price: memory usage. Although it might not be noticeable in our small example, it's definitely noticeable when you parse a 20MB XML file with the DOM method. Rather than firing events for every element in the XML file, DOM creates a tree in memory containing your XML file. Figure 8.1 shows the DOM tree that represents the file from the previous section.

Figure 8.1. DOM tree.

We can show all the content without tags by walking through the tree of objects. We do so in this example by recursively going over all node children:

  1 <?php
  2   $dom = new DomDocument();
  3   $dom->load('test2.xml');
  4   $root = $dom->documentElement;
  5
  6   process_children($root);
  7
  8   function process_children($node)
  9   {
 10       $children = $node->childNodes;
 11
 12       foreach ($children as $elem) {
 13           if ($elem->nodeType == XML_TEXT_NODE) {
 14               if (strlen(trim($elem->nodeValue))) {
 15                   echo trim($elem->nodeValue)."\n";
 16               }
 17           } else if ($elem->nodeType == XML_ELEMENT_NODE) {
 18               process_children($elem);
 19           }
 20       }
 21   }
 22 ?>

The output is the following:

 XML Example
 Moved to
 example.org
 .
 foo & bar

The example shows some very simple DOM processing. Apart from loading the document, we only read properties of the node objects and do not call any methods on them. In line 4, we retrieve the root element of the DOM document that was loaded in line 3. For every element we encounter, we call process_children() (in lines 6 and 18), which iterates over the list of child nodes (line 12). If the node is a text node, we echo its value (lines 13-16), and if it's an element, we call process_children() recursively (lines 17-18). The DOM extension is more powerful than what is shown in this example. It implements almost all the functionality described in the DOM2 specification.
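The same traversal can be sketched without an external file by loading the markup from a string with loadXML() and returning the text instead of echoing it. The function collect_text() and the inline fragment are hypothetical stand-ins for process_children() and test2.xml.

```php
<?php
// Return the trimmed, non-empty text nodes below $node in document order.
// collect_text() is a hypothetical variant of process_children().
function collect_text(DOMNode $node)
{
    $texts = array();
    foreach ($node->childNodes as $child) {
        if ($child->nodeType == XML_TEXT_NODE) {
            $value = trim($child->nodeValue);
            if ($value !== '') {
                $texts[] = $value;
            }
        } elseif ($child->nodeType == XML_ELEMENT_NODE) {
            // Recurse into child elements and append what they contain.
            $texts = array_merge($texts, collect_text($child));
        }
    }
    return $texts;
}

$dom = new DOMDocument();
$dom->loadXML('<p>Moved to <a href="http://example.org/">example.org</a>.<br/>foo &amp; bar</p>');
print_r(collect_text($dom->documentElement));
```

Returning an array rather than printing makes the function easier to reuse; the recursion itself is unchanged.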

The following example uses the getAttributeNode() method of the DomElement class to return the background attribute of the body tag:

 <?php
     $dom = new DomDocument();
     $dom->load('test2.xml');
     $root = $dom->documentElement;

     process_children($root);

     function process_children($node)
     {
         $children = $node->childNodes;

         foreach ($children as $elem) {
             if ($elem->nodeType == XML_ELEMENT_NODE) {
                 if ($elem->nodeName == 'body') {
                     echo $elem->getAttributeNode('background')->value. "\n";
                 }
                 process_children($elem);
             }
         }
     }
 ?>

We still need to recursively search through the tree to find the correct element, but because we know about the structure of the document, we can simplify the example:

 1 <?php
 2     $dom = new DomDocument();
 3     $dom->load('test2.xml');
 4     $body = $dom->documentElement->getElementsByTagName('body')->item(0);
 5     echo $body->getAttributeNode('background')->value. "\n";
 6 ?>

Line 4 is the main processing line. First, we request the documentElement of the DOM document, which is the root node of the DOM tree. From that element, we request all descendant elements with tag name body by using getElementsByTagName. Then, we take the first item in the list (because we know that the first body tag in the file is the correct one). In line 5, we request the background attribute with getAttributeNode, and display its value by reading the value property.
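When only the attribute's value is needed, DomElement also offers getAttribute(), which returns the value as a string without going through the attribute node. The following is a minimal sketch; the inline document is a hypothetical stand-in for test2.xml.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<html><head/><body background="bg.png"><p>text</p></body></html>');

$body = $dom->getElementsByTagName('body')->item(0);

// getAttribute() returns the value as a string ('' if the attribute is absent).
$background = $body->getAttribute('background');
echo $background, "\n";

// hasAttribute() distinguishes a missing attribute from an empty one.
var_dump($body->hasAttribute('bgcolor'));
```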

8.3.2.1 Using XPath

By using XPath, we can further simplify the previous example. XPath is a query language for XML documents, and it is also used in XSLT for matching nodes. We can use XPath to query a DOM document for certain nodes and attributes, similar to using SQL to query a database:

 <?php
     $dom = new DomDocument();
     $dom->load('test2.xml');
     $xpath = new DomXPath($dom);
     $nodes = $xpath->query("*[local-name()='body']", $dom->documentElement);
     echo $nodes->item(0)->getAttributeNode('background')->value. "\n";
 ?>
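The query matches on local-name() because the elements of an XHTML document live in the XHTML namespace, so a plain path step such as body would not match them. An alternative, sketched below under the assumption that the document declares the XHTML namespace as its default, is to register a prefix with registerNamespace() and ask for the attribute value directly with evaluate(). The prefix x and the inline document are arbitrary choices for this illustration.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    . '<head><title>XML Example</title></head>'
    . '<body background="bg.png"><p>text</p></body>'
    . '</html>'
);

$xpath = new DOMXPath($dom);
// "x" is an arbitrary prefix chosen for this sketch.
$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');

// evaluate() can return a typed result, here the attribute value as a string.
$background = $xpath->evaluate('string(/x:html/x:body/@background)');
echo $background, "\n";
```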

8.3.2.2 Creating a DOM Tree

The DOM extension can do more than parse XML: it can also create an XML document from scratch. In your script, you build a tree of objects that you can then dump to disk as an XML file. This is the ideal way to write XML files, although it is not the easiest thing to do from within a script; we're going to do it anyway. In this example, we create a file with content similar to the example XML file we used in the previous section. We cannot guarantee that the file will be exactly the same because the DOM extension might not handle the whitespace in the XML file as cleanly as a human would. Let's start by creating the DOM object and the root node:

 <?php
     $dom = new DomDocument();
     $html = $dom->createElement('html');
     $html->setAttribute("xmlns", "http://www.w3.org/1999/xhtml");
     $html->setAttribute("xml:lang", "en");
     $html->setAttribute("lang", "en");
     $dom->appendChild($html);

First, a DomDocument object is created with new DomDocument(). All elements are created by calling the createElement() method of the DomDocument class, and text nodes by calling createTextNode(). The name of the element (in this case, html) is passed to the method, and an object of the type DomElement is returned. The returned object is used to add attributes to the element. After the DomElement has been created, we add it to the DomDocument by calling the appendChild() method. Then, we add the head to the html element and a title element to the head element:

     $head = $dom->createElement('head');
     $html->appendChild($head);
     $title = $dom->createElement('title');
     $title->appendChild($dom->createTextNode("XML Example"));
     $head->appendChild($title);

As before, we first create a DomElement object (for example, head) by calling the createElement() method of the DomDocument object, and then we add the newly created object to the existing DomElement object (for example, $html) with appendChild(). We then add the body element with its background attribute. Then, we add the 'p' element, which contains the main content of our X(HT)ML document, as a child of the body element:

     /* Create the body element */
     $body = $dom->createElement('body');
     $body->setAttribute("background", "bg.png");
     $html->appendChild($body);

     /* Create the p element */
     $p = $dom->createElement('p');
     $body->appendChild($p);

The contents of our <p> element are more complicated. They consist (in order) of a text node ("Moved to "), an <a> element, another text node (our dot), the <br> element, and finally, a third text node ("foo & bar"):

     /* Add the "Moved to" */
     $text = $dom->createTextNode("Moved to ");
     $p->appendChild($text);

     /* Add the a */
     $a = $dom->createElement('a');
     $a->setAttribute("href", "http://example.org/");
     $a->appendChild($dom->createTextNode("example.org"));
     $p->appendChild($a);

     /* Add the ".", br and "foo & bar" */
     $text = $dom->createTextNode(".");
     $p->appendChild($text);
     $br = $dom->createElement('br');
     $p->appendChild($br);
     $text = $dom->createTextNode("foo & bar");
     $p->appendChild($text);

When we're finished creating the DOM of our X(HT)ML document, we echo it to the screen:

     echo $dom->saveXML();
 ?>

The output resembles our original document, but without some of the whitespace (which is added here for readability):

 <?xml version="1.0"?>
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
   <head>
     <title>XML Example</title>
   </head>
   <body background="bg.png">
     <p>Moved to <a href="http://example.org/">example.org</a>. <br/>foo &amp; bar</p>
   </body>
 </html>
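If you want the serializer itself to add indentation, DOMDocument has a formatOutput property that can be set before calling saveXML(). The following is a minimal sketch on a reduced tree (it is not the full script above); it also shows that the ampersand passed to createTextNode() is escaped when the tree is written out.

```php
<?php
$dom = new DOMDocument('1.0');
// Ask the serializer to add indentation and line breaks.
$dom->formatOutput = true;

$html = $dom->createElement('html');
$dom->appendChild($html);

$body = $dom->createElement('body');
$body->setAttribute('background', 'bg.png');
$html->appendChild($body);

$p = $dom->createElement('p');
// createTextNode() takes raw text; special characters are escaped on output.
$p->appendChild($dom->createTextNode('foo & bar'));
$body->appendChild($p);

$xml = $dom->saveXML();
echo $xml;
```

Indentation is only added where it does not change the content, so elements with mixed text and child elements, such as the p element in the full example, are typically left on one line.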