Using Native Data Structures


You may sometimes come across a situation that requires you to convert raw XML markup into native data structures such as variables, arrays, or custom objects. For these situations, PHP offers a very specialized little function named xml_parse_into_struct().

The xml_parse_into_struct() function takes up to four arguments (the second array is optional):

A reference to the XML parser
The raw XML data to be processed
Two arrays to hold the data in structured form

After xml_parse_into_struct() has processed the XML document, it populates the two arrays with detailed information on the structure of the XML document. One array holds a list of all the elements encountered by the parser in its journey through the XML document; the other is an index that maps each element name to the positions in the first array where that element appears, which also tells you how often each element occurs.
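Before moving on to a full document, here is a minimal sketch of the call itself. The inline XML string and the variable names $values and $index are my own placeholders rather than part of the listings in this chapter; the document contains no whitespace between tags, so no parser options are needed.

```php
<?php
// A tiny inline document (hypothetical; not one of the chapter's listings)
$xml = '<note id="1"><to>Ann</to><body>Hi</body></note>';

// create the parser and hand it the data plus two arrays to fill
$parser = xml_parser_create();
$values = array();
$index = array();

if (!xml_parse_into_struct($parser, $xml, $values, $index))
{
    die("XML parser error: " . xml_error_string(xml_get_error_code($parser)));
}

// $values holds one entry per tag event; $index maps tag names to positions
print_r($values);
print_r($index);
```

Note that tag and attribute names come back in uppercase, because case folding is enabled by default.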

An example might help to make this clearer. Consider the XML document shown in Listing 2.16.

Listing 2.16 XML-Compliant Bookmark List (links.xml)

 <?xml version="1.0"?>
 <bookmarks category="News">
       <link id="15696">
             <title>CNN</title>
             <url>http://www.cnn.com/</url>
             <last_access>2000-09-08</last_access>
       </link>
       <link id="3763">
             <title>Freshmeat</title>
             <url>http://www.freshmeat.net/</url>
             <last_access>2001-04-23</last_access>
       </link>
       <link id="84574">
             <title>Slashdot</title>
             <url>http://www.slashdot.com/</url>
             <last_access>2001-12-30</last_access>
       </link>
 </bookmarks>

Then take a look at the script in Listing 2.17, which parses the preceding XML data and creates native PHP arrays representing the document structure (you can view these arrays with the print_r() function).

Listing 2.17 Converting XML Data Structures into PHP Arrays

 <?php
 // XML data
 $xml_file = "links.xml";

 // read XML file
 if (!($fp = fopen($xml_file, "r")))
 {
       die("File I/O error: $xml_file");
 }

 // create string containing XML data
 $data = "";
 while ($chunk = fread($fp, 4096))
 {
       $data .= $chunk;
 }
 fclose($fp);

 // initialize parser
 $xml_parser = xml_parser_create();

 // turn off whitespace processing (skip values that are only whitespace)
 xml_parser_set_option($xml_parser, XML_OPTION_SKIP_WHITE, 1);

 // parse the data into the two arrays
 if (!xml_parse_into_struct($xml_parser, $data, $elementArray, $frequencyArray))
 {
       die("XML parser error: " . xml_error_string(xml_get_error_code($xml_parser)));
 }

 // all done, clean up!
 xml_parser_free($xml_parser);
 ?>

Quick Experiment

In Listing 2.17, comment out the line that turns off whitespace processing, and see what happens to the generated arrays.

After the script has finished processing, the individual elements of $elementArray correspond to the elements within the XML document. Each of these elements is itself an array containing information such as the element name, attributes, type, and depth within the XML tree. Take a look:

 Array
 (
     [0] => Array
         (
             [tag] => BOOKMARKS
             [type] => open
             [level] => 1
             [attributes] => Array
                 (
                     [CATEGORY] => News
                 )
         )

     [1] => Array
         (
             [tag] => LINK
             [type] => open
             [level] => 2
             [attributes] => Array
                 (
                     [ID] => 15696
                 )
         )

     [2] => Array
         (
             [tag] => TITLE
             [type] => complete
             [level] => 3
             [value] => CNN
         )

     [3] => Array
         (
             [tag] => URL
             [type] => complete
             [level] => 3
             [value] => http://www.cnn.com/
         )

     [4] => Array
         (
             [tag] => LAST_ACCESS
             [type] => complete
             [level] => 3
             [value] => 2000-09-08
         )

     [5] => Array
         (
             [tag] => LINK
             [type] => close
             [level] => 2
         )

     [6] => Array
         (
             [tag] => LINK
             [type] => open
             [level] => 2
             [attributes] => Array
                 (
                     [ID] => 3763
                 )
         )

 ... and so on ...
 )
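To get a feel for the tag, type, level, value, and attributes keys, it can help to walk the element array and print an indented outline. The following is only a sketch: describeEntries() is a hypothetical helper of my own, and the inline document is a shortened copy of Listing 2.16 so that the example stands alone.

```php
<?php
// Shortened inline copy of the bookmark list (one link only)
$xml = <<<XML
<?xml version="1.0"?>
<bookmarks category="News">
    <link id="15696">
        <title>CNN</title>
        <url>http://www.cnn.com/</url>
    </link>
</bookmarks>
XML;

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
if (!xml_parse_into_struct($parser, $xml, $elementArray, $frequencyArray))
{
    die("XML parser error: " . xml_error_string(xml_get_error_code($parser)));
}

// describeEntries() is a hypothetical helper: it returns one line per
// element, indented by depth, with attributes and any non-blank value
function describeEntries(array $elements): array
{
    $lines = array();
    foreach ($elements as $entry)
    {
        // closing tags and stray character data add nothing to the outline
        if ($entry['type'] === 'close' || $entry['type'] === 'cdata')
        {
            continue;
        }
        $line = str_repeat('  ', $entry['level'] - 1) . $entry['tag'];
        if (isset($entry['attributes']))
        {
            foreach ($entry['attributes'] as $name => $value)
            {
                $line .= " [$name=$value]";
            }
        }
        if (isset($entry['value']) && trim($entry['value']) !== '')
        {
            $line .= ': ' . trim($entry['value']);
        }
        $lines[] = $line;
    }
    return $lines;
}

echo implode("\n", describeEntries($elementArray)), "\n";
```

The level key is 1 for the document element, which is why the helper subtracts one before indenting.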

The second array, $frequencyArray, is a more compact associative array, with keys corresponding to the element names found within the document. Each key of this array is linked to a list of indexes, which points to the locations within $elementArray holding information on the corresponding element. Elements that contain other elements appear twice, once for the opening tag and once for the closing tag. Take a look:

 Array
 (
     [BOOKMARKS] => Array
         (
             [0] => 0
             [1] => 16
         )

     [LINK] => Array
         (
             [0] => 1
             [1] => 5
             [2] => 6
             [3] => 10
             [4] => 11
             [5] => 15
         )

     [TITLE] => Array
         (
             [0] => 2
             [1] => 7
             [2] => 12
         )

     [URL] => Array
         (
             [0] => 3
             [1] => 8
             [2] => 13
         )

     [LAST_ACCESS] => Array
         (
             [0] => 4
             [1] => 9
             [2] => 14
         )
 )

By studying the elements of $frequencyArray, it's easy to do the following:

Determine the frequency with which particular elements occur within the XML document
Identify individual element occurrences, and obtain their corresponding value or attributes from the $elementArray array via the specified index
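Both tasks can be sketched in a few lines. One caveat: because container elements are indexed once for the opening tag and once for the closing tag, simply counting the index entries would double-count them, so the hypothetical countElements() helper below only counts entries of type "open" or "complete". The inline document is a trimmed copy of Listing 2.16, used here so the example runs on its own.

```php
<?php
// Inline copy of the bookmark list, trimmed to the parts this sketch needs
$xml = <<<XML
<?xml version="1.0"?>
<bookmarks category="News">
    <link id="15696"><title>CNN</title></link>
    <link id="3763"><title>Freshmeat</title></link>
    <link id="84574"><title>Slashdot</title></link>
</bookmarks>
XML;

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
if (!xml_parse_into_struct($parser, $xml, $elementArray, $frequencyArray))
{
    die("XML parser error: " . xml_error_string(xml_get_error_code($parser)));
}

// countElements() is a hypothetical helper: it counts how many times
// an element occurs, ignoring "close" and "cdata" entries in the index
function countElements(array $elements, array $index, string $tag): int
{
    $count = 0;
    foreach ($index[$tag] ?? array() as $pos)
    {
        $type = $elements[$pos]['type'];
        if ($type === 'open' || $type === 'complete')
        {
            $count++;
        }
    }
    return $count;
}

// use the index to jump straight to each <link> and read its id attribute
$linkIds = array();
foreach ($frequencyArray['LINK'] as $pos)
{
    if (isset($elementArray[$pos]['attributes']['ID']))
    {
        $linkIds[] = $elementArray[$pos]['attributes']['ID'];
    }
}

echo countElements($elementArray, $frequencyArray, 'LINK'),
    " links: ", implode(', ', $linkIds), "\n";
```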

After the raw XML has been converted into this structured (albeit complex) representation and stored in memory, it's possible to manipulate it or perform tree-type traversal on it. It's possible, for example, to convert this structured representation into a tree object, and write an API to travel between the different branches of the tree, thereby replicating much of the functionality offered by PHP's DOM library. (I say "possible" instead of "advisable" for obvious reasons: Using PHP's native DOM functions to build an XML tree would be faster than simulating the same with SAX.)
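To make the tree idea concrete, here is a minimal sketch that folds the flat element array into nested PHP arrays. The buildNode() function and the shape of each node (tag, attributes, value, children) are my own assumptions, not a standard API, and mixed content is ignored for simplicity.

```php
<?php
// Inline two-link copy of the bookmark list so the sketch stands alone
$xml = <<<XML
<?xml version="1.0"?>
<bookmarks category="News">
    <link id="15696">
        <title>CNN</title>
        <url>http://www.cnn.com/</url>
    </link>
    <link id="3763">
        <title>Freshmeat</title>
        <url>http://www.freshmeat.net/</url>
    </link>
</bookmarks>
XML;

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
if (!xml_parse_into_struct($parser, $xml, $elementArray))
{
    die("XML parser error: " . xml_error_string(xml_get_error_code($parser)));
}

// buildNode() is a hypothetical helper: it reads the entry at $pos,
// recursively collects its children, and leaves $pos just past the element
function buildNode(array $elements, int &$pos): array
{
    $entry = $elements[$pos++];
    $node = array(
        'tag'        => $entry['tag'],
        'attributes' => $entry['attributes'] ?? array(),
        'value'      => isset($entry['value']) ? trim($entry['value']) : '',
        'children'   => array(),
    );

    // a "complete" element has no child elements, so we are done
    if ($entry['type'] === 'complete')
    {
        return $node;
    }

    // an "open" element: gather children until its own "close" entry
    while ($elements[$pos]['type'] !== 'close')
    {
        if ($elements[$pos]['type'] === 'cdata')
        {
            $pos++;    // skip stray character data between children
            continue;
        }
        $node['children'][] = buildNode($elements, $pos);
    }
    $pos++;            // step over the matching "close" entry
    return $node;
}

$pos = 0;
$tree = buildNode($elementArray, $pos);

// walk the tree: title of the second link
echo $tree['children'][1]['children'][0]['value'], "\n";
```

Each recursive call consumes its own closing entry, so any "close" entry seen inside the loop must belong to the element currently being built.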

Nevertheless, it might be instructive to see how this structured representation can be used to extract specific information from the document. Consider Listing 2.18, which is an enhancement to Listing 2.17. It manipulates the structured representation to create a third array containing only the URLs from each link.

Listing 2.18 Creating Custom Structures from Raw XML Data

 <?php
 // XML data
 $xml_file = "links.xml";

 // read XML file
 if (!($fp = fopen($xml_file, "r")))
 {
       die("File I/O error: $xml_file");
 }

 // create string containing XML data
 $data = "";
 while ($chunk = fread($fp, 4096))
 {
       $data .= $chunk;
 }
 fclose($fp);

 // initialize parser
 $xml_parser = xml_parser_create();

 // turn off whitespace processing (skip values that are only whitespace)
 xml_parser_set_option($xml_parser, XML_OPTION_SKIP_WHITE, 1);

 // parse the data into the two arrays
 if (!xml_parse_into_struct($xml_parser, $data, $elementArray, $frequencyArray))
 {
       die("XML parser error: " . xml_error_string(xml_get_error_code($xml_parser)));
 }

 // all done, clean up!
 xml_parser_free($xml_parser);

 // create array to hold URLs
 $urls = array();

 // look up $frequencyArray for <url> element
 // this element is itself an array, so iterate through it
 foreach ($frequencyArray["URL"] as $element)
 {
       // for each index found, look up $elementArray and retrieve the value
       // add this to the URLs array
       $urls[] = $elementArray[$element]["value"];
 }
 ?>

You're probably thinking that this might be easier using a character data handler and an element handler. You're right: it would be. Listing 2.18 is shown merely to demonstrate an alternative to the event-based approach you've become familiar with. Personally, I haven't ever used this function; I prefer to use the DOM for XML tree generation and traversal where required. (The DOM approach to XML processing is covered in Chapter 3.)
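For comparison, here is a rough sketch of the same URL extraction done with an element handler and a character data handler. The closures, the $buffer and $currentTag variables, and the inline two-link document are my own choices for illustration; the handler-registration functions themselves are the standard ones.

```php
<?php
// Inline two-link copy of the bookmark list so the sketch stands alone
$xml = <<<XML
<?xml version="1.0"?>
<bookmarks category="News">
    <link id="15696">
        <title>CNN</title>
        <url>http://www.cnn.com/</url>
    </link>
    <link id="3763">
        <title>Freshmeat</title>
        <url>http://www.freshmeat.net/</url>
    </link>
</bookmarks>
XML;

$urls = array();      // collected URLs
$currentTag = '';     // name of the element currently being read
$buffer = '';         // character data gathered for the current <url>

$parser = xml_parser_create();

// element handlers: remember which tag is open, store the URL on close
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attributes) use (&$currentTag, &$buffer) {
        $currentTag = $name;
        $buffer = '';
    },
    function ($parser, $name) use (&$urls, &$currentTag, &$buffer) {
        if ($name === 'URL')
        {
            $urls[] = trim($buffer);
        }
        $currentTag = '';
    }
);

// character data may arrive in several chunks, so append rather than assign
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$currentTag, &$buffer) {
        if ($currentTag === 'URL')
        {
            $buffer .= $data;
        }
    }
);

if (!xml_parse($parser, $xml, true))
{
    die("XML parser error: " . xml_error_string(xml_get_error_code($parser)));
}

print_r($urls);
```

Here nothing is kept in memory except the URLs themselves, which is the main practical difference from the array-based approach.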
