Traversing the DOM with PHP's DOM Classes

Because PHP's DOM parser works by creating standard objects to represent XML structures, an understanding of these objects and their capabilities is essential to using this technique effectively. This section examines the classes that form the blueprint for these objects in greater detail.

DomDocument Class

A DomDocument object is typically the first object created by the DOM parser when it completes parsing an XML document. It may be created by a call to xmldoc():

 $doc = xmldoc("<?xml version='1.0'?><element>potassium</element>"); 

Or, if your XML data is in a file (rather than a string), you can use the xmldocfile() function to create a DomDocument object:

 $doc = xmldocfile("element.xml"); 

Treading the Right Path

If you're using Windows, you'll need to give xmldocfile() the full path to the XML file. Don't forget to include the drive letter!

When you examine the structure of the DomDocument object with print_r(), you can see that it contains basic information about the XML document, including the XML version, the encoding and character set, and the URL of the document:

DomDocument Object
(
    [name] =>
    [url] =>
    [version] => 1.0
    [standalone] => -1
    [type] => 9
    [compression] => -1
    [charset] => 1
)

Peekaboo!

You'll notice that many examples in this book (particularly in this chapter) use the print_r() function to display the structure of a particular PHP variable. In case you're not familiar with this function, you should know that it provides an easy way to investigate the innards of a particular variable, array, or object. Use it whenever you need to look inside an object to see what makes it tick; and, if you're feeling really adventurous, you might also want to take a look at the var_dump() and var_export() functions, which provide similar functionality.

Each of these properties provides information on some aspect of the XML document:

  • name: Name of the XML document

  • url: URL of the document

  • version: XML version used

  • standalone: Whether or not the document is a standalone document

  • type: Integer corresponding to one of the DOM node types (see Table 3.1)

  • compression: Whether or not the file was compressed

  • charset: Character set used by the document

The application can use this information to make decisions about how to process the XML data. For example, as Listing 3.3 demonstrates, it may reject documents based on the version of XML being used.

Listing 3.3 Using DomDocument Properties to Verify XML Version Information
<?php
// XML data
$xml_string = "<?xml version='1.0'?><element>potassium</element>";

// create a DOM object
if (!$doc = xmldoc($xml_string))
{
    die("Error in XML");
}
// version check
else if ($doc->version > 1.0)
{
    die("Unsupported XML version");
}
else
{
    // XML processing code here
}
?>
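On current PHP releases, where xmldoc() is no longer available, the same check can be sketched with the DOM extension that superseded domxml in PHP 5. This is an assumption about the reader's environment, not part of the original listing; check_xml_version() is a hypothetical helper name, and xmlVersion is the modern counterpart of the version property shown above.

```php
<?php
// Sketch only: uses the DOM extension bundled with PHP 5 and later,
// not the domxml functions described in this chapter.

// Hypothetical helper: returns the XML version string, or false on a parse error.
function check_xml_version(string $xml)
{
    // collect parse errors quietly instead of emitting warnings
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $doc->xmlVersion : false;
}

$version = check_xml_version("<?xml version='1.0'?><element>potassium</element>");

if ($version === false) {
    echo "Error in XML\n";
} elseif ($version !== '1.0') {
    echo "Unsupported XML version\n";
} else {
    echo "XML version $version accepted\n";
}
```

Comparing the version as a string avoids relying on PHP's numeric conversion of "1.0".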

In addition to the properties described previously, the DomDocument object also comes with the following methods:

  • root(): Returns a DomElement object representing the document element

  • dtd(): Returns a DTD object containing information about the document's DTD

  • add_root(): Creates a new document element, and returns a DomElement object representing that element

  • dumpmem(): Dumps the XML structure into a string variable

  • xpath_new_context(): Creates an XPathContext object for XPath evaluation

While parsing XML data, you'll find that the root() method is the one you use most often, whereas the add_root() and dumpmem() methods come in handy when you're creating or modifying an XML document tree in memory (discussed in detail in the "Manipulating DOM Trees" section).

X Marks the Spot

In case you're wondering, XPath, or the XML Path Language, provides an easy way to address specific parts of an XML document. The language uses directional axes, coupled with conditional tests, to create node collections matching a specific criterion, and also provides standard constructs to manipulate these collections.

PHP's XPath implementation is discussed in detail in the upcoming section titled "Traversing the DOM with PHP's XPath Classes."
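As a small taste of what an axis step combined with a test looks like, here is a sketch using DOMXPath from the modern DOM extension (PHP 5+). It is an assumption that the reader has that extension available; it is not the xpath_new_context() API covered later in the chapter, and the sample document is invented for illustration.

```php
<?php
// Sketch only: DOMXPath belongs to the modern DOM extension (PHP 5+).
$doc = new DOMDocument();
$doc->loadXML('<earth><country>Albania</country><country>Zimbabwe</country></earth>');

$xpath = new DOMXPath($doc);

// child steps plus a positional test: the last <country> under <earth>
$nodes = $xpath->query('/earth/country[last()]');
$lastCountry = $nodes->item(0)->textContent;

// a count() expression evaluated directly to a number
$total = (int) $xpath->evaluate('count(/earth/country)');

echo "$lastCountry is the last of $total countries\n";
```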

In Listing 3.4, the variable $fruit contains the root node (the element named fruit).

Listing 3.4 Accessing the Document Element via the DOM
<?php
// create a DomDocument object
$doc = xmldoc("<?xml version='1.0' encoding='UTF-8' standalone='yes'?><fruit>watermelon</fruit>");

// root node
$fruit = $doc->root();
?>
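For comparison, a minimal sketch of the same step with the modern DOM extension (PHP 5+), assuming that extension rather than domxml: the documentElement property plays the role of root(), and saveXML() plays the role of dumpmem().

```php
<?php
// Sketch only: modern DOM extension, not the domxml API used in Listing 3.4.
$doc = new DOMDocument();
$doc->loadXML("<?xml version='1.0' encoding='UTF-8' standalone='yes'?><fruit>watermelon</fruit>");

// the document element (what root() returns in domxml)
$fruit = $doc->documentElement;

$name = $fruit->tagName;       // element name
$text = $fruit->textContent;   // character data inside the element

// serialize just this element back to a string
$xml = $doc->saveXML($fruit);

echo "$name contains $text\n";
echo $xml, "\n";
```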

To DTD or Not to DTD

The dtd() method of the DomDocument object creates a DTD object, which contains basic information about the document's Document Type Definition. Here's what it looks like:

Dtd Object
(
    [systemId] => weather.dtd
    [name] => weather
)

This DTD object exposes two properties: the systemId property reveals the filename of the DTD document, whereas the name property contains the name of the document element.
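The two properties above have direct counterparts in the modern DOM extension (PHP 5+), where the document type node is reached through the doctype property. The sketch below assumes that extension; the weather document is invented to match the sample output, and the external DTD file is not read because no DTD-loading option is passed.

```php
<?php
// Sketch only: modern DOM extension; the XML below is a made-up sample.
$doc = new DOMDocument();
$doc->loadXML(
    '<?xml version="1.0"?>' .
    '<!DOCTYPE weather SYSTEM "weather.dtd">' .
    '<weather><city>Oslo</city></weather>'
);

// null when the document has no DOCTYPE declaration
$dtd = $doc->doctype;

if ($dtd !== null) {
    echo "name: {$dtd->name}\n";
    echo "systemId: {$dtd->systemId}\n";
}
```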

DomElement Class

The PHP parser represents every element within the XML document as an instance of the DomElement class, which makes it one of the most important in this lineup. When you view the structure of a DomElement object, you see that it has two distinct properties that represent the element name and type, respectively. You'll remember from Listing 3.2 that these properties can be used to identify individual elements and extract their values. Here is an example:

DomElement Object
(
    [type] => 1
    [tagname] => vegetable
)

A special note should be made here of the type property, which indicates the type of node under discussion. This type property contains an integer value mapping to one of the parser's predefined node types. Table 3.1 lists the important types.

Table 3.1. DOM Node Types

Integer   Node type                 Description
-------   ----------------------    ----------------------
1         XML_ELEMENT_NODE          Element
2         XML_ATTRIBUTE_NODE        Attribute
3         XML_TEXT_NODE             Text
4         XML_CDATA_SECTION_NODE    CDATA section
5         XML_ENTITY_REF_NODE       Entity reference
7         XML_PI_NODE               Processing instruction
8         XML_COMMENT_NODE          Comment
9         XML_DOCUMENT_NODE         XML document
12        XML_NOTATION_NODE         Notation

If you plan to use the type property within a script to identify node types (as I will be doing shortly in Listing 3.5), you should note that it is considered preferable to use the named constants rather than their corresponding integer values, both for readability and to ensure stability across API changes.
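To make the advice about named constants concrete, here is a hedged sketch that classifies child nodes by type. It assumes the modern DOM extension (PHP 5+), where the property is called nodeType rather than type; the constants themselves carry the same names and values as in Table 3.1. The sample document is invented.

```php
<?php
// Sketch only: modern DOM extension; nodeType replaces domxml's type property.
$doc = new DOMDocument();
$doc->loadXML('<list><!-- note --><item>one</item>text<item>two</item></list>');

$counts = [];
foreach ($doc->documentElement->childNodes as $node) {
    // compare against named constants, never bare integers
    switch ($node->nodeType) {
        case XML_ELEMENT_NODE:
            $label = 'element';
            break;
        case XML_TEXT_NODE:
            $label = 'text';
            break;
        case XML_COMMENT_NODE:
            $label = 'comment';
            break;
        default:
            $label = 'other';
    }
    $counts[$label] = ($counts[$label] ?? 0) + 1;
}

ksort($counts);
foreach ($counts as $label => $n) {
    echo "$label: $n\n";
}
```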

The DomElement object also exposes a number of useful object methods:

  • children(): Returns an array of DomElement objects representing the children of this node

  • parent(): Returns a DomElement object representing the parent of this node

  • attributes(): Returns an array of DomAttribute objects representing the attributes of this node

  • get_attribute(): Returns the value of an attribute of this node

  • new_child(): Creates a new DomElement object, and attaches it as a child of this node (note that this newly created node is placed at the end of the existing child list)

  • set_attribute(): Sets the value of an attribute of this node

  • set_content(): Sets the content of this node

Again, the two most commonly used ones are the children() and attributes() methods, which return an array of DomElement and DomAttribute objects, respectively. The get_attribute() method can be used to return the value of a specific attribute of an element (refer to Listing 3.8 for an example), whereas the new_child(), set_attribute(), and set_content() methods are used when creating or modifying XML trees in memory, and are discussed in detail in the section entitled "Manipulating DOM Trees."

Note that PHP's DOM implementation does not currently offer any way of removing an attribute previously set with the set_attribute() method.
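That limitation applies to the domxml extension described here. As a point of comparison, and assuming the modern DOM extension (PHP 5+) is available, the sketch below builds a small tree, sets an attribute, and then removes it again with removeAttribute(). The element names are invented for illustration.

```php
<?php
// Sketch only: modern DOM extension, shown for comparison with domxml.
$doc = new DOMDocument('1.0');

// counterpart of add_root(): create the document element and attach it
$root = $doc->createElement('market');
$doc->appendChild($root);

// counterpart of new_child(): the new element is appended at the end
$vegetable = $doc->createElement('vegetable', 'cabbages');
$root->appendChild($vegetable);

// counterpart of set_attribute()
$vegetable->setAttribute('color', 'green');
$before = $doc->saveXML($root);

// removing the attribute again
$vegetable->removeAttribute('color');
$after = $doc->saveXML($root);

echo $before, "\n", $after, "\n";
```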

Choices

Most of the object methods discussed in this chapter can also be invoked as functions by prefixing the method name with domxml_ and passing a reference to the object as the first function argument. The following snippets demonstrate this:

<?php
// these two are equivalent
$root1 = $doc->root();
$root2 = domxml_root($doc);

// these two are equivalent
$children1 = $root1->children();
$children2 = domxml_children($root2);
?>

Listing 3.5 demonstrates one of these in action by combining the children() method of a DomElement object with a recursive function and HTML's unordered lists to create a hierarchical tree mirroring the document structure (similar in concept, though not in approach, to Listing 2.5). At the end of the process, a count of the total number of elements encountered is displayed.

Listing 3.5 Representing an XML Document as a Hierarchical List
<?php
// XML file
$xml_file = "letter.xml";

// parse it
if (!$doc = xmldocfile($xml_file))
{
    die("Error in XML document");
}

// get the root node
$root = $doc->root();

// get its children
$children = get_children($root);

// element counter
// start with 1 so as to include document element
$elementCount = 1;

// start printing
print_tree($children);

// this recursive function accepts an array of nodes as argument,
// iterates through it and prints a list for each element found
function print_tree($nodeCollection)
{
    global $elementCount;

    // iterate through array
    echo "<ul>";
    for ($x=0; $x<sizeof($nodeCollection); $x++)
    {
        // add to element count
        $elementCount++;

        // print element as list item
        echo "<li>" . $nodeCollection[$x]->tagname;

        // go to the next level of the tree
        $nextCollection = get_children($nodeCollection[$x]);

        // recurse!
        print_tree($nextCollection);
    }
    echo "</ul>";
}

// function to return an array of children, given a parent node
function get_children($node)
{
    $temp = $node->children();
    $collection = array();

    // iterate through children array
    for ($x=0; $x<sizeof($temp); $x++)
    {
        // filter out all nodes except elements
        // and create a new array
        if ($temp[$x]->type == XML_ELEMENT_NODE)
        {
            $collection[] = $temp[$x];
        }
    }

    // return array containing child nodes
    return $collection;
}

echo "Total number of elements in document: $elementCount";
?>

Listing 3.5 is fairly easy to understand. The first step is to obtain a reference to the root of the document tree via the root() method; this reference serves as the starting point for the recursive print_tree() function. This function obtains a reference to the children of the root node, processes them, and then calls itself again to process the next level of nodes in the tree. The process continues until all the nodes in the tree have been exhausted. An element counter is used to track the number of elements found, and to display a total count of all the elements in the document.
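The recursion described above can also be sketched with the modern DOM extension (PHP 5+), which is an assumption about the reader's setup rather than the chapter's domxml API. This variant differs from Listing 3.5 in two ways: it returns the markup as a string instead of echoing it, and it includes the document element in the list. render_tree() is a hypothetical name and the letter document is invented.

```php
<?php
// Sketch only: modern DOM extension; render_tree() is a hypothetical helper.

// Returns a nested <li> for $element and bumps $count once per element seen.
function render_tree(DOMElement $element, int &$count): string
{
    $count++;
    $html = '<li>' . $element->tagName;

    // collect markup for child elements only, skipping text and comments
    $childHtml = '';
    foreach ($element->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $childHtml .= render_tree($child, $count);
        }
    }

    if ($childHtml !== '') {
        $html .= '<ul>' . $childHtml . '</ul>';
    }
    return $html . '</li>';
}

$doc = new DOMDocument();
$doc->loadXML('<letter><to>Anna</to><body><para>Hi</para><para>Bye</para></body></letter>');

$count = 0;
$tree = '<ul>' . render_tree($doc->documentElement, $count) . '</ul>';

echo $tree, "\n";
echo "Total number of elements in document: $count\n";
```

Passing the counter by reference avoids the global variable used in the original listing.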

DomText Class

Character data within an XML document is represented by the DomText class. Here's what it looks like:

DomText Object
(
    [type] => 3
    [content] => cabbages
)

The type property represents the node type (XML_TEXT_NODE in this case, as can be seen from Table 3.1), whereas the content property holds the character data itself. In order to illustrate this, consider Listing 3.6, which takes an XML-encoded list of country names, parses it, and puts that list into a PHP array.

Listing 3.6 Using DomText Object Properties to Retrieve Character Data from an XML Document
<?php
// XML data
$xml_string = "<?xml version='1.0'?>
<earth>
    <country>Albania</country>
    <country>Argentina</country>
    <!-- and so on -->
    <country>Zimbabwe</country>
</earth>";

// create array to hold country names
$countries = array();

// create a DOM object from the XML data
if(!$doc = xmldoc($xml_string))
{
    die("Error parsing XML");
}

// start at the root
$root = $doc->root();

// move down one level to the root's children
$nodes = $root->children();

// iterate through the list of children
foreach ($nodes as $n)
{
    // for each <country> element
    // get the text node under it
    // and add it to the $countries[] array
    $text = $n->children();
    if ($text[0]->content != "")
    {
        $countries[] = $text[0]->content;
    }
}

// uncomment this line to see the contents of the array
// print_r($countries);
?>

Fairly simple: a loop is used to iterate through all the <country> elements, adding the character data found within each to the global $countries array.
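A comparable loop on the modern DOM extension (PHP 5+, assumed here in place of domxml) might look like the sketch below. It selects the <country> elements directly with getElementsByTagName(), so the whitespace and comment nodes between them never enter the loop.

```php
<?php
// Sketch only: modern DOM extension, same sample data as Listing 3.6.
$xml_string = "<?xml version='1.0'?>
<earth>
    <country>Albania</country>
    <country>Argentina</country>
    <!-- and so on -->
    <country>Zimbabwe</country>
</earth>";

$doc = new DOMDocument();
$doc->loadXML($xml_string);

$countries = [];
foreach ($doc->getElementsByTagName('country') as $country) {
    // textContent gathers the character data under the element
    $name = trim($country->textContent);
    if ($name !== '') {
        $countries[] = $name;
    }
}

print_r($countries);
```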

Taking up Space

It's important to remember that XML, unlike HTML, does not ignore whitespace, but treats it as literal character data. Consequently, if your XML document includes whitespace or line breaks, PHP's DOM parser identifies them as text nodes, and creates DomText objects to represent them. This is a common cause of confusion for DOM newbies, who are often stumped by the "extra" nodes that appear in their DOM tree.
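The "extra" nodes are easy to see by counting them. The sketch below assumes the modern DOM extension (PHP 5+) rather than domxml, and uses an invented two-country document with indentation between the elements. The preserveWhiteSpace property shown at the end is that extension's switch for asking the parser to drop ignorable blank text nodes.

```php
<?php
// Sketch only: modern DOM extension; the document is a made-up sample.
$xml = "<earth>\n  <country>Albania</country>\n  <country>Zimbabwe</country>\n</earth>";

$doc = new DOMDocument();
$doc->loadXML($xml);

$total = 0;
$blank = 0;
foreach ($doc->documentElement->childNodes as $node) {
    $total++;
    // a text node holding nothing but spaces and line breaks
    if ($node->nodeType === XML_TEXT_NODE && trim($node->nodeValue) === '') {
        $blank++;
    }
}
echo "$total child nodes, $blank of them whitespace-only text\n";

// ask the parser to discard ignorable whitespace before loading
$compact = new DOMDocument();
$compact->preserveWhiteSpace = false;
$compact->loadXML($xml);
echo $compact->documentElement->childNodes->length,
     " child nodes with preserveWhiteSpace off\n";
```

Filtering on node type, as Listing 3.5 does, remains the most portable defense.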

DomAttribute Class

A call to the attributes() method of the DomElement object generates an array of DomAttribute objects, each of which looks like this:

DomAttribute Object
(
    [name] => color
    [value] => green
)

The attribute name can be accessed via the name property, and the corresponding attribute value can be accessed via the value property. Listing 3.7 demonstrates how this works by using the value of the color attribute to highlight each vegetable or fruit name in the corresponding color.

Listing 3.7 Accessing Attribute Values with the DomAttribute Object
<?php
// XML data
$xml_string = "<?xml version='1.0'?>
<sentence>
What a wonderful profusion of colors and smells in the market - <vegetable
color='green'>cabbages</vegetable>, <vegetable color='red'>tomatoes</vegetable>,
<fruit color='green'>apples</fruit>, <vegetable
color='purple'>aubergines</vegetable>, <fruit color='yellow'>bananas</fruit>
</sentence>";

// parse it
if (!$doc = xmldoc($xml_string))
{
    die("Error in XML document");
}

// get the root node
$root = $doc->root();

// get its children
$children = $root->children();

// iterate through child list
for ($x=0; $x<sizeof($children); $x++)
{
    // if element node
    if ($children[$x]->type == XML_ELEMENT_NODE)
    {
        // get the text node under it
        $text = $children[$x]->children();
        $cdata = $text[0]->content;

        // check its attributes to see if "color" is present
        $attributes = $children[$x]->attributes();
        if (is_array($attributes) && ($index = is_color_attribute_present($attributes)))
        {
            // if it is, colorize the element content
            echo "<font color=" . $index . ">" . $cdata . "</font>";
        }
        else
        {
            // else print it as is
            echo $cdata;
        }
    }
    // if text node
    else if ($children[$x]->type == XML_TEXT_NODE)
    {
        // simply print the content
        echo $children[$x]->content;
    }
}

// function to iterate through attribute list
// and return the value of the "color" attribute if available
function is_color_attribute_present($attributeList)
{
    // default when no "color" attribute is found
    $color = false;
    foreach($attributeList as $attrib)
    {
        if ($attrib->name == "color")
        {
            $color = $attrib->value;
            break;
        }
    }
    return $color;
}
?>

There is, of course, a simpler way to do this: just use the DomElement object's get_attribute() method. Listing 3.8, which generates equivalent output to Listing 3.7, demonstrates this alternative (and much shorter) approach.

Listing 3.8 Accessing Attribute Values (a Simpler Approach)
<?php
// XML data
$xml_string = "<?xml version='1.0'?>
<sentence>
What a wonderful profusion of colors and smells in the market - <vegetable
color='green'>cabbages</vegetable>, <vegetable color='red'>tomatoes</vegetable>,
<fruit color='green'>apples</fruit>, <vegetable
color='purple'>aubergines</vegetable>, <fruit color='yellow'>bananas</fruit>
</sentence>";

// parse it
if (!$doc = xmldoc($xml_string))
{
    die("Error in XML document");
}

// get the root node
$root = $doc->root();

// get its children
$children = $root->children();

// iterate through child list
for ($x=0; $x<sizeof($children); $x++)
{
    // if element node
    if ($children[$x]->type == XML_ELEMENT_NODE)
    {
        // get the text node under it
        $text = $children[$x]->children();
        $cdata = $text[0]->content;

        // check to see if element contains the "color" attribute
        if ($children[$x]->get_attribute("color"))
        {
            // "color" attribute is present, colorize text
            echo "<font color=" . $children[$x]->get_attribute("color") . ">" . $cdata . "</font>";
        }
        else
        {
            // otherwise just print the text as is
            echo $cdata;
        }
    }
    // if text node
    else if ($children[$x]->type == XML_TEXT_NODE)
    {
        // print content as is
        echo $children[$x]->content;
    }
}
?>
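Both approaches, walking the attribute list and asking for one attribute by name, carry over to the modern DOM extension (PHP 5+). The sketch below assumes that extension instead of domxml, shortens the sample sentence, and builds the output in a string so it can be inspected; the escaping with htmlspecialchars() is an addition not present in the original listings.

```php
<?php
// Sketch only: modern DOM extension; shortened, made-up sample sentence.
$doc = new DOMDocument();
$doc->loadXML(
    "<sentence>Fresh <vegetable color='green'>cabbages</vegetable>" .
    " and <fruit>pears</fruit></sentence>"
);

$out = '';
$seen = [];
foreach ($doc->documentElement->childNodes as $node) {
    if ($node->nodeType === XML_ELEMENT_NODE) {
        // the long way: walk every attribute (name and value properties)
        foreach ($node->attributes as $attr) {
            $seen[] = $attr->name . '=' . $attr->value;
        }

        // the short way: ask for the attribute by name
        $text = htmlspecialchars($node->textContent);
        if ($node->hasAttribute('color')) {
            $color = htmlspecialchars($node->getAttribute('color'));
            $out .= '<font color="' . $color . '">' . $text . '</font>';
        } else {
            $out .= $text;
        }
    } elseif ($node->nodeType === XML_TEXT_NODE) {
        // text between elements is printed as is
        $out .= htmlspecialchars($node->nodeValue);
    }
}

echo $out, "\n";
```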

A Composite Example

Now that you know how it works, how about seeing how it plays out in real life? This example takes everything you learned thus far, and uses that knowledge to construct an HTML file from an XML document.

I'll be using a variant of the XML invoice (Listing 2.21) from Chapter 2, adapting the SAX-based approach demonstrated there to the new DOM paradigm. As you'll see, although the two techniques are fundamentally different, they can nonetheless achieve a similar effect. Listing 3.9 is the marked-up invoice.

Listing 3.9 An XML Invoice (invoice.xml)
<?xml version="1.0"?>
<invoice>
    <customer>
        <name>Joe Wannabe</name>
        <address>
            <line>23, Great Bridge Road</line>
            <line>Bombay, MH</line>
            <line>India</line>
        </address>
    </customer>
    <date>2001-09-15</date>
    <reference>75-848478-98</reference>
    <items>
        <item cid="AS633225">
            <desc>Oversize tennis racquet</desc>
            <price>235.00</price>
            <quantity>1</quantity>
            <subtotal>235.00</subtotal>
        </item>
        <item cid="GT645">
            <desc>Championship tennis balls (can)</desc>
            <price>9.99</price>
            <quantity>4</quantity>
            <subtotal>39.96</subtotal>
        </item>
        <item cid="U73472">
            <desc>Designer gym bag</desc>
            <price>139.99</price>
            <quantity>1</quantity>
            <subtotal>139.99</subtotal>
        </item>
        <item cid="AD848383">
            <desc>Custom-fitted sneakers</desc>
            <price>349.99</price>
            <quantity>1</quantity>
            <subtotal>349.99</subtotal>
        </item>
    </items>
    <delivery>Next-day air</delivery>
</invoice>

Listing 3.10 parses the previous XML data to create an HTML page, suitable for printing or viewing in a browser.

Listing 3.10 Formatting an XML Document with the DOM
<html>
<head>
<basefont face="Arial">
</head>
<body bgcolor="white">
<font size="+3">Sammy's Sports Store</font>
<br>
<font size="-2">14, Ocean View, CA 12345, USA http://www.sammysportstore.com/</font>
<p>
<hr>
<center>INVOICE</center>
<hr>
<?php
// arrays to associate XML elements with HTML output
$startTagsArray = array(
    'CUSTOMER' => '<p> <b>Customer: </b>',
    'ADDRESS' => '<p> <b>Billing address: </b>',
    'DATE' => '<p> <b>Invoice date: </b>',
    'REFERENCE' => '<p> <b>Invoice number: </b>',
    'ITEMS' => '<p> <b>Details: </b> <table width="100%" border="1" cellspacing="0" cellpadding="3"><tr><td><b>Item description</b></td><td><b>Price</b></td><td><b>Quantity</b></td><td><b>Sub-total</b></td></tr>',
    'ITEM' => '<tr>',
    'DESC' => '<td>',
    'PRICE' => '<td>',
    'QUANTITY' => '<td>',
    'SUBTOTAL' => '<td>',
    'DELIVERY' => '<p> <b>Shipping option:</b> ',
    'TERMS' => '<p> <b>Terms and conditions: </b> <ul>',
    'TERM' => '<li>'
);

$endTagsArray = array(
    'LINE' => ',',
    'ITEMS' => '</table>',
    'ITEM' => '</tr>',
    'DESC' => '</td>',
    'PRICE' => '</td>',
    'QUANTITY' => '</td>',
    'SUBTOTAL' => '</td>',
    'TERMS' => '</ul>',
    'TERM' => '</li>'
);

// array to hold sub-totals
$subTotals = array();

// XML file
$xml_file = "/home/sammy/invoices/invoice.xml";

// parse document
$doc = xmldocfile($xml_file);

// get the root node
$root = $doc->root();

// get its children
$children = $root->children();

// start printing
print_tree($children);

// this recursive function accepts an array of nodes as argument,
// iterates through it and:
//      - marks up elements with HTML
//      - prints text as is
function print_tree($nodeCollection)
{
    global $startTagsArray, $endTagsArray, $subTotals;

    foreach ($nodeCollection as $node)
    {
        // how to handle elements
        if ($node->type == XML_ELEMENT_NODE)
        {
            // print HTML opening tags
            echo $startTagsArray[strtoupper($node->tagname)];

            // recurse
            $nextCollection = $node->children();
            print_tree($nextCollection);

            // once done, print closing tags
            echo $endTagsArray[strtoupper($node->tagname)];
        }

        // how to handle text nodes
        if ($node->type == XML_TEXT_NODE)
        {
            // print text as is
            echo($node->content);
        }

        // PI handling code would come here
        // this doesn't work too well in PHP 4.1.1
        // see the sidebar entitled "Process Failure"
        // for more information
    }
}

// this function gets the character data within an element
// it accepts an element node as argument
// and dives one level deeper into the DOM tree
// to retrieve the corresponding character data
function getNodeContent($node)
{
    $content = "";
    $children = $node->children();
    if ($children)
    {
        foreach ($children as $child)
        {
            $content .= $child->content;
        }
    }
    return $content;
}
?>

Figure 3.2 shows what the output looks like.

Figure 3.2. Sammy's Sports Store invoice.


As with the SAX example (refer to Listing 2.23), the first thing to do is define arrays to hold the HTML markup for specific tags; in Listing 3.10, this markup is stored in the $startTagsArray and $endTagsArray variables.

Next, the XML document is read by the parser, and an appropriate DOM tree is generated in memory. An array of objects representing the first level of the tree (the children of the root node) is obtained and the function print_tree() is called. This print_tree() function is a recursive function, and it forms the core of the script.

The print_tree() function accepts a node list as argument, and iterates through this list, examining each node and processing it appropriately. As you can see, the function is set up to perform specific tasks, depending on the type of node:

  • If the node is an element, the function looks up the $startTagsArray and $endTagsArray variables, and prints the corresponding HTML markup.

  • If the node is a text node, the function simply prints the contents of the text node as is.

Additionally, if the node is an element, the print_tree() function obtains a list of the element's children, if any exist, and proceeds to call itself with that node list as argument. And so the process repeats itself until the entire tree has been parsed.

As Listing 3.10 demonstrates, this technique provides a handy way to recursively scan through a DOM tree and perform different actions based on the type of node encountered. You can use this technique to count, classify, and process the different types of elements encountered (Listing 3.5 demonstrated a primitive element counter); or even construct a new tree from the existing one.
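The core of that technique, a recursive walk driven by two lookup arrays, can be condensed into a short sketch. It assumes the modern DOM extension (PHP 5+) rather than domxml, trims the tag arrays to four entries, and returns a string instead of echoing. render_nodes() is a hypothetical name; elements missing from an array simply contribute no markup.

```php
<?php
// Sketch only: modern DOM extension; render_nodes() is a hypothetical helper.
$startTags = ['ITEMS' => '<table>', 'ITEM' => '<tr>', 'DESC' => '<td>', 'PRICE' => '<td>'];
$endTags   = ['ITEMS' => '</table>', 'ITEM' => '</tr>', 'DESC' => '</td>', 'PRICE' => '</td>'];

function render_nodes(DOMNodeList $nodes, array $start, array $end): string
{
    $html = '';
    foreach ($nodes as $node) {
        if ($node->nodeType === XML_ELEMENT_NODE) {
            $key = strtoupper($node->tagName);
            // opening markup, then the subtree, then closing markup
            $html .= $start[$key] ?? '';
            $html .= render_nodes($node->childNodes, $start, $end);
            $html .= $end[$key] ?? '';
        } elseif ($node->nodeType === XML_TEXT_NODE) {
            // text is passed through, escaped for HTML
            $html .= htmlspecialchars($node->nodeValue);
        }
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadXML('<items><item><desc>Gym bag</desc><price>139.99</price></item></items>');

$html = render_nodes($doc->childNodes, $startTags, $endTags);
echo $html, "\n";
```

Passing the arrays as arguments, rather than declaring them global, keeps the function reusable with a different set of tag mappings.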

Process Failure

If you've been paying attention, you will have noticed that the XML invoice in Listing 3.9 is not exactly the same as the one shown in Listing 2.21. Listing 2.21 included an additional processing instruction (PI), a call to the PHP function displayTotal() , which is missing in Listing 3.9.

Why? Because the DOM extension that ships with PHP 4.1.1 has trouble with processing instructions, and tends to barf all over the screen when it encounters one. Later (beta) versions of the extension do, however, include a fix for the problem.
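For readers on later PHP versions, processing instructions are ordinary nodes in the modern DOM extension (PHP 5+) and can be picked out by type. The sketch below assumes that extension; the PI target "render" and its data are invented for illustration and are not the instruction used in Listing 2.21.

```php
<?php
// Sketch only: modern DOM extension; the PI target and data are made up.
$doc = new DOMDocument();
$doc->loadXML('<invoice><delivery>Next-day air</delivery><?render displayTotal?></invoice>');

$instructions = [];
foreach ($doc->documentElement->childNodes as $node) {
    if ($node->nodeType === XML_PI_NODE) {
        // target is the PI name, data is everything after it
        $instructions[$node->target] = $node->data;
    }
}

foreach ($instructions as $target => $data) {
    echo "PI target: $target, data: $data\n";
}
```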

XML and PHP
ISBN: 0735712271
EAN: 2147483647
Year: 2002
Pages: 84
