Working with XML in PHP

Now that we know what XML documents are and how to write them, it is time to look at how to use them within PHP. This is complicated ever so slightly by there being two ways of accessing and manipulating an XML document, both of which are supported by PHP.

The first is known as the Simple API for XML (SAX). This is a small serial-access parser for XML documents that calls functions you implement whenever it encounters content of a specific type (opening element tags, closing element tags, text data, and so forth). It is unidirectional, meaning that it works through your XML document telling you what it sees as it sees it (see Figure 23-1). This has the advantage of being fast and memory efficient, but it has the disadvantage of doing nothing other than giving you the tags and text as it sees them; you are responsible for interpreting any structure and hierarchy from the data.

Figure 23-1. A SAX parser working on a document.
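To make the callback style of SAX concrete, here is a minimal sketch using PHP's expat-based XML parser functions. The sample document and the $events array are assumptions made up for this illustration; the sketch simply records each event in the order the parser reports it.

```php
<?php
// Hypothetical sample document, used only for this illustration.
$xml = '<Shoes><Shoe><BrandName>Nyke</BrandName></Shoe></Shoes>';

// Every event the parser reports is appended here, in order.
$events = array();

// Called for every opening tag.
$onStart = function ($parser, $name, $attrs) use (&$events) {
    $events[] = "open $name";
};

// Called for every closing tag.
$onEnd = function ($parser, $name) use (&$events) {
    $events[] = "close $name";
};

// Called for text between tags; whitespace-only runs are ignored.
$onText = function ($parser, $data) use (&$events) {
    if (trim($data) !== '') {
        $events[] = "text $data";
    }
};

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler($parser, $onStart, $onEnd);
xml_set_character_data_handler($parser, $onText);

if (!xml_parse($parser, $xml, true)) {
    throw new RuntimeException(
        'XML error: ' . xml_error_string(xml_get_error_code($parser))
    );
}

echo implode("\n", $events), "\n";
```

Note that the parser reports only a flat stream of events. Any notion of which element is the parent of which has to be tracked by our own code, for example with a stack that is pushed in the start handler and popped in the end handler.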

The second method is known as the Document Object Model, or DOM. The XML DOM gives you the information in a given XML document in a hierarchical object-oriented fashion. As a document is loaded in the DOM, it forms a hierarchy of objects representing its structure, and you move through the document using these objects (see Figure 23-2). This is an intuitive way to access your data, because the data is specified hierarchically in the XML document. The DOM does, however, have the disadvantage of being somewhat slower and more memory intensive than SAX.

Figure 23-2. An XML DOM representation of a document.

Using SAX or DOM

Now that we have two options for accessing our data, we must choose between them. We are fortunate in that we can identify clear situations when one would be better than the other.

For less hierarchical data that could be considered more the results of an "information dump," we would want to use the SAX parser to reload the data. An example of this would be an object in PHP and its properties. If we want to dump a collection of these and their member data to disk in an XML file or even store them in a field in a database table for later depersistence, we would likely be working with straightforward files with a limited structure. Loading them back in with SAX would be fast and not require much extra work.

For most other situations, especially for data that is fundamentally hierarchical or less predictable in nature (such as user data), we would want to use the DOM. Any decrease in speed the DOM incurs would be offset by the savings in the code we otherwise would have had to write to rebuild the structure of the document.

One other advantage to the DOM is that it can be used to generate XML documents in addition to reading and parsing them. We can create and add new elements to our tree using the DOM and then resave the document to disk if we want. For systems based entirely on the SAX parser, XML generation will be done by hand (which is still not unreasonably challenging).
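As a rough sketch of generating a document with the DOM, the following builds a small tree in memory and serializes it. The element names and the sku attribute are hypothetical and exist only for this example.

```php
<?php
// Build a new, empty document and ask for indented output.
$dom = new DOMDocument('1.0', 'utf-8');
$dom->formatOutput = true;

// Create the root element and attach it to the document.
$shoes = $dom->createElement('Shoes');
$dom->appendChild($shoes);

// Add one child element with a (hypothetical) attribute.
$shoe = $dom->createElement('Shoe');
$shoe->setAttribute('sku', 'SR-150');
$shoes->appendChild($shoe);

// createElement accepts optional text content as its second argument.
$shoe->appendChild($dom->createElement('BrandName', 'Nyke'));
$shoe->appendChild($dom->createElement('Price', '109.99'));

// saveXML returns the document as a string; save($filename) would
// write it to disk instead.
$xml = $dom->saveXML();
echo $xml;
```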

Although both XML implementations in PHP5 are interesting and useful, we will find ourselves more often using the DOM, which is why we cover it further.

Using the DOM

Although the SAX parser is fast, easy to use, and permits us to load our data efficiently, it does not let us take full advantage of the XML documents, including truly appreciating their hierarchical structure. It also doesn't easily enable us to search through our documents looking for specific pieces of information.

Fortunately, the designers of the XML Specifications, foreseeing the usefulness of such functionality, created a specification for a DOM. This specification defines a number of "levels" for DOMs, the second of which encourages an object-oriented implementation.

The DOM is just a set of classes through which you can create, inspect, and manipulate XML documents. PHP5 ships with a new implementation of a DOM, known simply as "the DOM." Older versions of PHP did have a DOM implementation, but the system for PHP5 is vastly changed and improved, and we cover this one exclusively.

Setting Up PHP for the DOM

The new DOM is enabled by default in PHP5. For those users compiling PHP themselves, no extra command-line options are usually required. No configuration options are required in php.ini for this extension.

Getting Started in Code

The DOM implementation in PHP5 is a robust system of object classes, the most important of which is usually the DOMDocument class. It is this class that you will use to load and save documents, get access to the elements in your document, and search for content within the document. Creating a DOMDocument object is as simple as follows:

 $dom = new DOMDocument();

After you have this, you need to load in the XML content you want to parse. You can either give it the name of a file to load with the load member function, or you can just give it a string containing the XML content through the loadXML function:

 $result = $dom->load('c:/webs/health/claims.xml');
 if ($result === FALSE) {
   throw new CantLoadClaimsException();
 }
 // continue
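The loadXML variant works the same way on a string. The sketch below is one possible way to check the result; the sample strings are hypothetical, and the use of libxml_use_internal_errors to collect parse errors instead of emitting PHP warnings is an addition for illustration, not something the text above requires.

```php
<?php
// Collect parser errors quietly rather than emitting warnings.
libxml_use_internal_errors(true);

// A well-formed (hypothetical) document loads successfully.
$dom = new DOMDocument();
$loaded = $dom->loadXML('<Claims><Claim><Code>A12</Code></Claim></Claims>');

// A malformed document (the Claim element is never closed) does not.
$badDom = new DOMDocument();
$badLoaded = $badDom->loadXML('<Claims><Claim></Claims>');

// Each entry is a LibXMLError object with message and line properties.
$errors = libxml_get_errors();
libxml_clear_errors();

foreach ($errors as $error) {
    echo 'Line ', $error->line, ': ', trim($error->message), "\n";
}
```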

The DOMDocument class contains a number of methods to create new XML documents, including some for creation of elements, attributes, and text content. There are also methods for searching within a document (see the section "Adding Search Capabilities") and for validating with DTDs and XSDs (see the section "Validating XML").

The first property with which we will work, however, is the documentElement property, which returns the root node of the element hierarchy in our document.

The Element Hierarchy

As shown before, XML documents are organized in a hierarchical format, with all content nodes originating from a single document element, or root node. This root node is accessed by querying the documentElement property on the DOMDocument object:

 $rootNode = $dom->documentElement;

The $rootNode variable now contains an object of type DOMElement, which inherits directly from DOMNode. All nodes in the DOM are implemented as classes inheriting from the DOMNode class, which contains a number of basic properties and methods. You learn the type of the node by querying the nodeType property on a given node, which will typically have one of the values shown in Table 23-2. (A few other possible values exist, but we will not likely encounter those much.)

Table 23-2. Node Types in the PHP5 DOM
Node Type	Integer Value	Description
`XML_ELEMENT_NODE`	1	The node is an element, represented by the `DOMElement` class.
`XML_ATTRIBUTE_NODE`	2	The node is an attribute, represented by the `DOMAttr` class.
`XML_TEXT_NODE`	3	The node is a text content node, represented by the `DOMText` class.
`XML_CDATA_SECTION_NODE`	4	The node is a `CDATA` content node, represented by the `DOMCdataSection` class.
`XML_ENTITY_REF_NODE`	5	The node is an `ENTITY` reference node, represented by the `DOMEntityReference` class.
`XML_ENTITY_NODE`	6	The node is an `ENTITY` node, represented by the `DOMEntity` class.
`XML_PI_NODE`	7	The node is a processing instruction, represented by the `DOMProcessingInstruction` class.
`XML_COMMENT_NODE`	8	The node is an XML comment, represented by the `DOMComment` class.
`XML_DOCUMENT_NODE`	9	The node represents the entire XML document, accessed through the `DOMDocument` class.
`XML_DOCUMENT_TYPE_NODE`	10	The node is the Document Type Definition (DTD) associated with this document, represented by the `DOMDocumentType` class.
`XML_NOTATION_NODE`	12	The node is an XML notation node, represented by the `DOMNotation` class.
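A short sketch of checking nodeType against these constants follows. The sample document and the $labels lookup array are assumptions made for this illustration.

```php
<?php
// Hypothetical document whose root has a comment, a text node,
// and an element as children.
$dom = new DOMDocument();
$dom->loadXML('<Shoe><!-- on sale -->Runner<Price>109.99</Price></Shoe>');

// Map the node type constants we care about to readable labels.
$labels = array(
    XML_ELEMENT_NODE => 'element',
    XML_TEXT_NODE    => 'text',
    XML_COMMENT_NODE => 'comment',
);

$seen = array();
foreach ($dom->documentElement->childNodes as $child) {
    $seen[] = isset($labels[$child->nodeType])
        ? $labels[$child->nodeType]
        : 'other';
}

echo implode(', ', $seen), "\n";   // prints "comment, text, element"
```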

The node types with which we will work most of the time are elements, attributes, and text nodes. Element nodes correspond to the elements in our documents, and any attributes they contain are represented by an attribute node. Their contents are represented by a text node, as shown in Figure 23-3.

Figure 23-3. A sample node hierarchy in PHP.

One of the quirks to working with the DOM to which we will have to adjust initially is that there is a requirement in the XML Specification that the DOM preserve all text content in an XML document, including the whitespace between nodes. So, in fact, the diagram shown in Figure 23-3 is not quite correct. There will be text nodes in places that we would not otherwise expect them, as in Figure 23-4.

Figure 23-4. A more accurate sample node hierarchy in PHP.

Fortunately, if we do not care about whitespace and these extra newlines, spaces, and tabs, we can tell our DOMDocument object to cheat a little bit and collapse all extra whitespace, removing many of those unwanted text nodes. You can do this by setting the preserveWhiteSpace property on it to FALSE before the document is loaded. Changes to this member variable have no effect on already loaded documents:

 $dom->preserveWhiteSpace = FALSE;

With this change, our document hierarchy truly would look like that shown in Figure 23-3.
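To see the difference the property makes, this sketch loads the same (hypothetical) indented document twice and compares how many child nodes the root element ends up with.

```php
<?php
// A small indented document; the newlines and spaces between tags
// become text nodes when whitespace is preserved.
$xml = "<Shoes>\n  <Shoe>\n    <Model>Super Runner 150</Model>\n  </Shoe>\n</Shoes>";

// Default behavior: whitespace between elements is kept.
$withSpace = new DOMDocument();
$withSpace->loadXML($xml);

// The property must be set before the document is loaded.
$noSpace = new DOMDocument();
$noSpace->preserveWhiteSpace = false;
$noSpace->loadXML($xml);

$countWith    = $withSpace->documentElement->childNodes->length;
$countWithout = $noSpace->documentElement->childNodes->length;

echo "Children with whitespace preserved: $countWith\n";
echo "Children with whitespace removed:   $countWithout\n";
```

With whitespace preserved, the root's children are a text node, the Shoe element, and another text node; with it removed, only the Shoe element remains.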

Nodes and Elements

Because all nodes and elements inherit from the same base class, the DOMNode, there is a standard way of querying nodes for information and working our way through the document hierarchy without surprises.

Standard pieces of information to query on an element or node are as follows:

nodeType This returns the type of the node, specified as one of the constant values shown in Table 23-2.
nodeName This returns the full name of the given node. For element nodes, this is the full name of the tag, including any namespace prefix. To get just the tag name, use the localName property on DOMNode.
localName This returns the base name of the element without namespace prefixes.
prefix This returns the namespace prefix for the given node.
namespaceURI This returns the URI of the namespace for this node, or NULL if unspecified.
textContent This is the preferred way to get the text content of a DOMElement. It returns the text of the element's child DOMText nodes, concatenated with the text of any descendant elements.

To see these in action, look at the following example code, which takes a simple XML document and shows some properties being queried:

 <?php
 $xmldoc = <<<XMLDOC
 <?xml version="1.0" encoding="utf-8"?>
 <sh:Shoes xmlns:sh='http://localhost/shoestore'>
   <sh:Shoe>
     <sh:BrandName>Nyke</sh:BrandName>
     <sh:Model>Super Runner 150</sh:Model>
     <sh:Price>109.99</sh:Price>
   </sh:Shoe>
 </sh:Shoes>
 XMLDOC;
 $dom = new DOMDocument();
 $dom->preserveWhiteSpace = FALSE;
 $result = $dom->loadXML($xmldoc);
 if ($result == FALSE)
   die('Unable to load in XML text');
 $rootNode = $dom->documentElement;
 echo $rootNode->nodeName . "<br/>\n";
 echo $rootNode->localName . "<br/>\n";
 echo $rootNode->prefix . "<br/>\n";
 echo $rootNode->namespaceURI . "<br/>\n";
 ?>

The output of this script will be as follows:

 sh:Shoes
 Shoes
 sh
 http://localhost/shoestore
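The textContent property deserves a quick illustration of its own, because on an element that has child elements it gathers the text of all descendants rather than a single text node. The following is a minimal sketch using a hypothetical document without namespaces.

```php
<?php
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadXML(
    '<Shoe><BrandName>Nyke</BrandName><Model>Super Runner 150</Model></Shoe>'
);

$shoe = $dom->documentElement;

// On a leaf element, textContent is simply that element's text.
$brandText = $shoe->firstChild->textContent;   // "Nyke"

// On a parent element, the text of every descendant is concatenated.
$allText = $shoe->textContent;                 // "NykeSuper Runner 150"

echo $brandText, "\n", $allText, "\n";
```

For this reason, textContent is most useful on elements that contain only text, which is how the later examples in this section use it.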

Navigating through the hierarchy of elements is done through the following standard methods and properties available on all classes inheriting from DOMNode:

parentNode This returns the parent node of the current node, or NULL if there is none.
childNodes This returns a DOMNodeList containing all the child nodes of the current node. This list can be used in foreach loops. If you do not have preserveWhiteSpace turned off, this often contains a mixture of node types, so be sure to check for the appropriate typed node.
firstChild This returns the first child node of the current node, or NULL if there is no such node.
lastChild This returns the last child node of the current node, or NULL if there is no such node.
nextSibling This returns the current node's next sibling, or NULL if there is no such node.
ownerDocument This returns the containing DOMDocument node that ultimately "contains" or represents this node.

From this list of methods, we can see that we have the following three ways to iterate through the child nodes of a given node:

 //
 // if we don't preserve whitespace, then we can just get
 // the first node (it will be a DOMElement).  Otherwise,
 // we have to skip over the DOMText node that will be there!
 //
 $node = $dom->documentElement;  // <sh:Shoes>
 if ($dom->preserveWhiteSpace === FALSE)
   $shoe = $node->firstChild;  // <sh:Shoe>
 else
 {
   $shoe = $node->firstChild;
   while ($shoe->nodeType !== XML_ELEMENT_NODE)
     $shoe = $shoe->nextSibling;
 }
 echo "<br/>Method 1:<br/>\n";
 foreach ($shoe->childNodes as $child)
 {
   echo "Type: $child->nodeType, Name: $child->localName<br/>";
 }
 echo "<br/>Method 2:<br/>\n";
 $children = $shoe->childNodes;
 for ($x = 0;  $x < $children->length; $x++)
 {
   $child = $children->item($x);
   echo "Type: $child->nodeType, Name: $child->localName<br/>";
 }
 echo "<br/>Method 3:<br/>\n";
 $child = $shoe->firstChild;
 while ($child !== NULL)
 {
   echo "Type: $child->nodeType, Name: $child->localName<br/>";
   $child = $child->nextSibling;
 }

All three result in the same nodes being visited. Note the extra code we had to include at the top. In those cases where we are preserving whitespace, the first child of an element node is very likely not to be an element node, but a text node containing the whitespace between that element and the next element node.

The output of the preceding script with preserveWhiteSpace set to TRUE (the default) will be the same for all three loops, as follows. (Recall that nodeType 3 is XML_TEXT_NODE and 1 is XML_ELEMENT_NODE.)

 Type: 3, Name:
 Type: 1, Name: BrandName
 Type: 3, Name:
 Type: 1, Name: Model
 Type: 3, Name:
 Type: 1, Name: Price
 Type: 3, Name:

Attributes

To access attributes on a given node, you have two options. The first, and by far most common, is to use the hasAttribute and getAttribute methods, as follows:

 if ($element->hasAttribute('name'))
   echo 'Name is: ' . $element->getAttribute('name');
 else
   echo 'Element has no name!';

The other method for obtaining attributes is to use the attributes collection on the DOMNode class, which enables us to get at the actual DOMAttr objects representing these attributes, as follows:

 $attrs = $element->attributes;
 if ($attrs !== NULL)
 {
   foreach ($attrs as $attr)
   {
     if ($attr->name == 'name')
       echo 'Name is: ' . $attr->value;
   }
 }
 else
   echo 'Element has no attributes!';

Although slightly less convenient than the getAttribute method on the DOMElement class, this method enables us to view all attributes and their values when we are not absolutely certain as to which attributes will exist for a given element.
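As a small self-contained sketch of both approaches, the following collects every attribute on an element into an array and also shows what getAttribute gives back for an attribute that is absent. The Patient element and its region attribute are hypothetical.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<Patient name="Samuela Nortone" region="north"/>');
$patient = $dom->documentElement;

// Walk the attributes collection; each item is a DOMAttr object.
$found = array();
foreach ($patient->attributes as $attr) {
    $found[$attr->name] = $attr->value;
}

// getAttribute returns an empty string for a missing attribute,
// so hasAttribute is the reliable way to test for presence.
$missing = $patient->getAttribute('age');
$hasAge  = $patient->hasAttribute('age');

foreach ($found as $attrName => $attrValue) {
    echo "$attrName = $attrValue\n";
}
```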

An Example

To see all of this in action, we will continue the example of the health-care claims system. We will write a ClaimsDocument class, which we will use to return all the claims in a document, or search for claims given a user's name. We will declare some extremely simple classes (without interesting implementation) to hold the data we learn about claims and patients, as follows:

 class Patient
 {
   public $name;
   public $healthCareID;
   public $primaryPhysician;
 }
 class Claim
 {
   public $patient;
   public $code;
   public $amount;
   public $actingPhysicianID;
   public $treatment;
 }

We will also create a new class called the ClaimsDocument, as follows:

 class ClaimsDocument
 {
   public $errorText;
   private $dom;
 }

The first method we will add on this class is a public method that loads a given claim document and saves the DOMDocument representing it in a private member variable:

 //
 // loads in a claims XML document and saves the DOMDocument
 // object for it.
 //
 public function loadClaims($in_file)
 {
   // create a DOMDocument and load our XML data.
   $this->dom = new DOMDocument();
   // by setting this to false, we will not have annoying
   // empty TEXT nodes in our hierarchy.
   $this->dom->preserveWhiteSpace = FALSE;
   $result = $this->dom->load($in_file);
   if ($result === FALSE)
     throw new CantLoadClaimsException();
   return TRUE;
 }

We will next write a method to return an array of Claim objects for all of the <Claim> elements in the document:

 // returns an array containing all of the claims we loaded
 public function getAllClaims()
 {
   // 1. get the root node of the tree (Claims).
   $claimsNode = $this->dom->documentElement;
   // 2. now, for each child node, create a claim
   //    object.
   $claimsList = array();
   foreach ($claimsNode->childNodes as $childNode)
   {
     $claim = $this->loadClaim($childNode);
     $claimsList[] = $claim;
   }
   // return the list, matching how the method is called later.
   return $claimsList;
 }

As you can see, this method requires a new method called loadClaim:

 //
 // loads the data for a claim element.
 //
 private function loadClaim($in_claimNode)
 {
   $claim = new Claim();
   foreach ($in_claimNode->childNodes as $childNode)
   {
     switch ($childNode->localName)
     {
       case 'Patient':
         $claim->patient = $this->loadPatient($childNode);
         break;
       case 'Code':
         $claim->code = $childNode->textContent;
         break;
       case 'Amount':
         $claim->amount = $childNode->textContent;
         break;
       case 'ActingPhysicianID':
         $claim->actingPhysicianID = $childNode->textContent;
         break;
       case 'Treatment':
         $claim->treatment = $childNode->textContent;
         break;
     }
   }
   return $claim;
 }

This method, as it works, calls a function to load the patient data, called loadPatient:

 //
 // loads the data for a patient element.
 //
 private function loadPatient($in_patientNode)
 {
   $patient = new Patient();
   $patient->name = $in_patientNode->getAttribute('name');
   foreach ($in_patientNode->childNodes as $childNode)
   {
     switch ($childNode->localName)
     {
       case 'HealthCareID':
         $patient->healthCareID = $childNode->textContent;
         break;
       case 'PrimaryPhysician':
         $patient->primaryPhysician = $childNode->textContent;
         break;
     }
   }
   return $patient;
 }

Adding Search Capabilities

Finally, we will add a new public method that demonstrates how to use the facilities available via the DOMDocument to find nodes within our document. We will call this method findClaimsByName, and it will return all claims for the patient with the given name.

This function works by using the getElementsByTagName method on the DOMDocument class. This method takes the name of an element to find as an argument and returns a list of all those nodes in the document with the given element (tag) name:

 public function findClaimsByName($in_name)
 {
   if ($in_name == '')
   {
     throw new InvalidArgumentException();
   }
   $claims = array();
   // 1. use the DOMDocument to do the searching for us.
   $found = $this->dom->getElementsByTagName('Patient');
   foreach ($found as $patient)
   {
     // 2. for any found node, if the name is the one we
     //    want, then load the data.  these are in the parent
     //    node of the Patient node.
     if (trim($patient->getAttribute('name')) == $in_name)
     {
       $claims[] = $this->loadClaim($patient->parentNode);
     }
   }
   return $claims;
 }
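To try the same search technique outside the class, here is a stand-alone sketch. The claims document in it is hypothetical; its shape is only inferred from the element names the loader methods look for, and the patient names, IDs, and amounts are invented for the example.

```php
<?php
// Hypothetical claims document, built as a plain string.
$xml = '<Claims>'
     . '<Claim>'
     . '<Patient name="Samuela Nortone"><HealthCareID>4123</HealthCareID></Patient>'
     . '<Amount>250.00</Amount>'
     . '</Claim>'
     . '<Claim>'
     . '<Patient name="Ursula Example"><HealthCareID>9981</HealthCareID></Patient>'
     . '<Amount>75.50</Amount>'
     . '</Claim>'
     . '</Claims>';

$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadXML($xml);

// Find every Patient element anywhere in the document, then step up
// to the enclosing Claim element to read that claim's Amount.
$amounts = array();
foreach ($dom->getElementsByTagName('Patient') as $patient) {
    if (trim($patient->getAttribute('name')) !== 'Samuela Nortone') {
        continue;
    }
    $claimNode = $patient->parentNode;
    foreach ($claimNode->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE
            && $child->localName === 'Amount') {
            $amounts[] = $child->textContent;
        }
    }
}

echo implode(', ', $amounts), "\n";   // prints "250.00"
```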

Putting It All Together

To show the use of our new ClaimsDocument object, we can list some simple code to demonstrate loading the claims, listing the claims that were sent with it, and finding a claim by patient name.

We start by creating a ClaimsDocument object and wrapping our code in a try/catch block in case there is an error with the XML:

 try
 {
   $cl = new ClaimsDocument();
   // etc.
 }
 catch (Exception $e)
 {
   echo "¡Aiee! An Error occurred: " . $e->getMessage()
        . "<br/>\n";
 }

After creating the object, we load the claims document and have it return a list of all the claims it contains:

 $cl->loadClaims('claims.xml');
 $claims = $cl->getAllClaims();

If we then want to summarize these claims, we can write code as follows:

 $count = count($claims);
 echo "<u>Successfully loaded $count claim(s)</u>.  ";
 echo "Summarizing:";
 echo "<br/><br/>\n";
 foreach ($claims as $claim)
 {
   $name = $claim->patient->name;
   $id = $claim->patient->healthCareID;
   echo "Patient Name: <b>$name</b> (ID: <em>$id</em>)<br/>";
 }

Finally, to find a specific user and find out how much the user's claim was, we write this:

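 echo <<<EOM
 <br/><br/><u>Searching for specific user:</u><br/><br/>
 EOM;
 $matching = $cl->findClaimsByName('Samuela Nortone');
 // print each claim that the search returned.
 foreach ($matching as $claim)
 {
   $name = $claim->patient->name;
   $id = $claim->patient->healthCareID;
   echo "Patient Name: <b>$name</b> (ID: <em>$id</em>)<br/>\n";
   echo "Claim Amount: $claim->amount<br/>\n";
 }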

When put together, the complete code looks like this:

 try
 {
   $cl = new ClaimsDocument();
   $cl->loadClaims('claims.xml');
   $claims = $cl->getAllClaims();
   $count = count($claims);
   echo "<u>Successfully loaded $count claim(s)</u>.  ";
   echo "Summarizing:";
   echo "<br/><br/>\n";
   foreach ($claims as $claim)
   {
     $name = $claim->patient->name;
     $id = $claim->patient->healthCareID;
     echo "Patient Name: <b>$name</b> (ID: <em>$id</em>)<br/>";
   }
   echo <<<EOM
 <br/><br/><u>Searching for specific user:</u><br/><br/>
 EOM;
   $matching = $cl->findClaimsByName('Samuela Nortone');
   // print each claim that the search returned.
   foreach ($matching as $claim)
   {
     $name = $claim->patient->name;
     $id = $claim->patient->healthCareID;
     echo "Patient Name: <b>$name</b> (ID: <em>$id</em>)<br/>\n";
     echo "Claim Amount: $claim->amount<br/>\n";
   }
 }
 catch (Exception $e)
 {
   echo "¡Aiee! An Error occurred: " . $e->getMessage()
        . "<br/>\n";
 }