Parsing and Transforming XML Documents

   


The example above works well for parsing ultrasimple documents, but it doesn't take into account nested elements or attributes.

In the real world, XML documents are highly structured and precisely defined, and a great deal of effort goes into designing them. One reason is the extensible nature of XML: it gives you a set of rules for creating complex structured documents, but those rules allow an almost unlimited number of possible document structures. Keep that in mind when creating a PHP XML parsing application. Each application that you create will work with only one particular document structure. The extra effort spent designing the document structure pays off when you code applications that parse that data.

Having said that, this next example tackles a couple of problems you will find in coding parsers for your own XML documents. These problems include multiple instances of the same element (usually differentiated by attributes or content) and attributes: two things you are sure to find in "real" XML documents. The complexity of the example document, bebop.xml, is fairly basic, but as you will see, the complexity of coding a PHP parser to read more complex documents doesn't necessarily increase; it is really just doing a lot more of the same thing over and over. The basic function of the element and character handlers does not change. You just need to add additional cases to encompass the additional variations that your document requires.
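
As a smaller warm-up for the two problems named above, here is a minimal sketch that handles repeated elements and reads an attribute. It is not part of the book's scripts: the inline XML string, the $episodes array, and the use of closures as handlers are assumptions made for illustration.

```php
<?php
// Minimal sketch: tell repeated <episode> elements apart by their attribute.
// The inline XML and the $episodes array are illustrative assumptions.
$xml = '<dvd number="1">'
     . '<episode number="1"><title>Asteroid Blues</title></episode>'
     . '<episode number="2"><title>Stray Dog Strut</title></episode>'
     . '</dvd>';

$episodes = [];    // episode number => title
$current  = null;  // number of the episode currently being parsed
$inTitle  = false; // true while a <title> element is open

$parser = xml_parser_create();
// Case folding is on by default, so element names and attribute
// keys arrive in uppercase.
xml_set_element_handler(
    $parser,
    function ($p, $name, $attrs) use (&$episodes, &$current, &$inTitle) {
        if ($name === 'EPISODE') {
            $current = $attrs['NUMBER'];
            $episodes[$current] = '';
        } elseif ($name === 'TITLE') {
            $inTitle = true;
        }
    },
    function ($p, $name) use (&$inTitle) {
        if ($name === 'TITLE') {
            $inTitle = false;
        }
    }
);
xml_set_character_data_handler(
    $parser,
    function ($p, $data) use (&$episodes, &$current, &$inTitle) {
        if ($inTitle && $current !== null) {
            // Character data may arrive in several pieces, so append.
            $episodes[$current] .= $data;
        }
    }
);

xml_parse($parser, $xml, true);
print_r($episodes);
```

The same pattern scales to larger documents: each additional element only needs another branch in the start and end handlers.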

This next example shows how a more complex document can be parsed and how to present the resulting data. The XML file is the beginning of a series synopsis for the popular anime title Cowboy Bebop. Although the series information is far from complete, the XML structure allows for adding data to make the document more comprehensive. The output appears in Figure 10-3.

Figure 10-3. xml_series.php


The first part of the script, bebop.xml, is the actual XML file that is read into the parser.

Script 10-3 bebop.xml
<?xml version="1.0"?>
<series title="Cowboy Bebop" genre="Anime" subgenre="Science Fiction">
  <dvd number="1">
    <episode number="1">
      <title>Asteroid Blues</title>
      <synopsis>Jet and Spike track down a drug dealer.</synopsis>
      <characters>
        <character>Jet</character>
        <character>Spike</character>
      </characters>
    </episode>
    <episode number="2">
      <title>Stray Dog Strut</title>
      <synopsis>Ein joins the crew.</synopsis>
      <characters>
        <character>Jet</character>
        <character>Spike</character>
        <character>Ein</character>
      </characters>
    </episode>
    <episode number="3">
      <title>Honkey Tonk Woman</title>
      <synopsis>Introduction of Faye Valentine.</synopsis>
      <characters>
        <character>Jet</character>
        <character>Spike</character>
        <character>Ein</character>
        <character>Faye</character>
      </characters>
    </episode>
    <episode number="4">
      <title>Gateway Shuffle</title>
      <synopsis>Having fun at the casino.</synopsis>
    </episode>
    <episode number="5">
      <title>Ballad Of Fallen Angels</title>
      <synopsis>Spike's past comes back to haunt him.</synopsis>
    </episode>
  </dvd>
  <dvd number="2">
    <episode number="6">
      <title>Sympathy For The Devil</title>
      <synopsis>The mystery of the boy Wen.</synopsis>
    </episode>
    <episode number="7">
      <title>Heavy Metal Queen</title>
      <synopsis>Truckers in space.</synopsis>
    </episode>
    <episode number="8">
      <title>Waltz for Venus</title>
      <synopsis>Welcome to Venus.</synopsis>
    </episode>
    <episode number="9">
      <title>Jamming With Edward</title>
      <synopsis>Edward joins the crew.</synopsis>
      <characters>
        <character>Jet</character>
        <character>Spike</character>
        <character>Ein</character>
        <character>Faye</character>
        <character>Edward</character>
      </characters>
    </episode>
    <episode number="10">
      <title>Ganymede Elegy</title>
      <synopsis>Homecoming for Jet.</synopsis>
    </episode>
  </dvd>
</series>
 
Script 10-4 xml_series.php
  1. <html>
  2. <head>
  3. <title>XML - DVD SERIES PARSER</title>
  4. <style type=text/css>
  5. h1, h2, h3 {font-family: verdana, helvetica, sans-serif;}
  6. p, blockquote {font-family: verdana, helvetica, sans-serif; font-size: 10pt}
  7. .navy {color: navy; }
  8. .characters {font-size: 8pt; color: red}
  9. </style>
 10. </head>
 11. <body>
 12. <?php
 13. function startElement($xml_parser, $name, $attributes) {
 14.  global $TagsOpen, $counter;
 15.  switch($name) {
 16.    case "SERIES":
 17.      $TagsOpen["SERIES"] = 1;
 18.      ?>
 19.      <h1>DVD Series</h1>
 20.      <h2>Title: <span class=navy><?=$attributes["TITLE"]?></span>
 21.      <br>Genre: <span class=navy><?=$attributes["GENRE"]?></span>
 22.      <br>Subgenre: <span class=navy><?=$attributes["SUBGENRE"]?></span></h2>
 23.      <?php
 24.      break;
 25.    case "DVD":
 26.      $TagsOpen["DVD"] = 1;
 27.      ?>
 28.      <h3>DVD <span class=navy><?=$attributes["NUMBER"]?></span></h3>
 29.      <?php
 30.      break;
 31.    case "EPISODE":
 32.      $TagsOpen["EPISODE"] = 1;
 33.      ?>
 34.      <p><span class=navy>Episode: <?=$attributes["NUMBER"]?>
 35.      <?php
 36.      break;
 37.    case "TITLE":
 38.      $TagsOpen["TITLE"] = 1;
 39.      break;
 40.    case "SYNOPSIS":
 41.      $TagsOpen["SYNOPSIS"] = 1;
 42.      break;
 43.    case "CHARACTERS":
 44.      $TagsOpen["CHARACTERS"] = 1;
 45.      ?>
 46.      <blockquote>Characters:<span class=characters> 
 47.      <?php
 48.      break;
 49.    case "CHARACTER":
 50.      $TagsOpen["CHARACTER"] = 1;
 51.      $counter++;
 52.      break;
 53.   }
 54. }
 55.
 56. function endElement($parser, $name) {
 57.   global $TagsOpen, $counter;
 58.   switch($name) {
 59.     case "SERIES":
 60.       $TagsOpen["SERIES"] = 0;
 61.       break;
 62.     case "DVD":
 63.       $TagsOpen["DVD"] = 0;
 64.       break;
 65.     case "EPISODE":
 66.       $TagsOpen["EPISODE"] = 0;
 67.       break;
 68.     case "TITLE":
 69.       $TagsOpen["TITLE"] = 0;
 70.       ?>
 71.       </span>
 72.       <?php
 73.       break;
 74.     case "SYNOPSIS":
 75.       $TagsOpen["SYNOPSIS"] = 0;
 76.       break;
 77.     case "CHARACTERS":
 78.       $TagsOpen["CHARACTERS"] = 0;
 79.       ?>
 80.       </span></blockquote>
 81.       <?php
 82.       $counter = 0;
 83.       break;
 84.     case "CHARACTER":
 85.       $TagsOpen["CHARACTER"] = 0;
 86.       break;
 87.   }
 88. }
 89. 
 90. function characterData($parser, $data) {
 91.   global $TagsOpen, $counter;
 92.   switch($TagsOpen) {
 93.     case($TagsOpen["CHARACTER"] == 1):
 94.       if($counter == 1) {
 95.         echo " $data";
 96.       } else {
 97.         echo ", $data";
 98.       }
 99.       break;
 100.     case($TagsOpen["SYNOPSIS"] == 1):
 101.       echo "<br>$data\n";
 102.       break;
 103.     case($TagsOpen["TITLE"] == 1):
 104.       echo " - \"$data\"";
 105.   }
 106. }
 107. 
 108. function load_data($file) {
 109.   $fh = fopen($file, "r") or die ("<P>COULD NOT OPEN FILE!");
 110.   $data = fread($fh, filesize($file));
 111.   return $data;
 112. }
 113. 
 114.   $TagsOpen = array(  
 115.        "SERIES" => 0,
 116.        "DVD" => 0,
 117.        "EPISODE" => 0,
 118.        "TITLE" => 0,
 119.        "SYNOPSIS" => 0,
 120.        "CHARACTERS" => 0,
 121.        "CHARACTER" => 0
 122.        );
 123.
 124. $counter = 0;
 125.
 126. $file = "bebop.xml";
 127. $xml_parser = xml_parser_create();
 128. xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, true);
 129. xml_set_element_handler($xml_parser, "startElement", "endElement");
 130. xml_set_character_data_handler($xml_parser, "characterData");
 131. xml_parse($xml_parser, load_data($file)) or die ("<P>ERROR PARSING XML!");
 132. xml_parser_free($xml_parser);
 133. ?>
 134. </body>
 135. </html>
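
Line 110 of the script reads the entire XML file into memory before parsing. For larger files, one alternative is to feed the parser in chunks; xml_parse() accepts an optional third argument that marks the final chunk. The sketch below is an illustration only: the parse_file_in_chunks() helper, the 4096-byte default chunk size, and the element-counting handlers are hypothetical and are not part of Script 10-4.

```php
<?php
// Sketch of chunked parsing. parse_file_in_chunks() is a hypothetical
// helper; it assumes handlers are already registered on $parser.
function parse_file_in_chunks($parser, string $file, int $chunkSize = 4096): bool
{
    $fh = fopen($file, 'r');
    if ($fh === false) {
        return false;
    }
    $ok = true;
    while ($ok && !feof($fh)) {
        $chunk = fread($fh, $chunkSize);
        if ($chunk === false) {
            $ok = false;
            break;
        }
        // The third argument tells the parser whether this is the last chunk.
        $ok = xml_parse($parser, $chunk, feof($fh)) === 1;
    }
    fclose($fh);
    return $ok;
}

// Usage: count the start tags in a small temporary file.
$count  = 0;
$parser = xml_parser_create();
xml_set_element_handler(
    $parser,
    function ($p, $name, $attrs) use (&$count) { $count++; },
    function ($p, $name) {}
);
$tmp = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($tmp, '<series><dvd><episode/><episode/></dvd></series>');
$ok = parse_file_in_chunks($parser, $tmp, 8); // tiny chunks to exercise the loop
unlink($tmp);
echo $ok ? "parsed $count elements\n" : "parse failed\n";
```

Because the parser keeps its state between calls, tags that are split across two chunks are still recognized; the handlers do not need to change.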
 

Script 10-4. xml_series.php Line-by-Line Explanation

LINE

DESCRIPTION

1-11

Display the beginning part of the HTML page for the script, including some basic styles to format the output.

12

Begin parsing the page as PHP.

13-54

Create a function called startElement() to handle any start elements that the script encounters as it parses the XML. The function takes the following as its arguments (required by PHP):

  • $xml_parser

  • $name

  • $attributes

14

Allow the variables $TagsOpen and $counter to be accessed and modified by this function. $TagsOpen is an array that tracks which tags are open or closed. $counter is an integer that is used for formatting purposes when displaying character data.

15-53

Create a switch statement to evaluate the value of the start element name.

16-24

Add a case to the switch statement to check if the $name variable equals "SERIES".

17

Set the $TagsOpen["SERIES"] flag to true (1), since the SERIES tag is open.

18-23

Print out some information to the screen, including the attributes of the SERIES tag, which include the title, genre, and subgenre of the series.

24

Break out of the switch statement, since we are done evaluating the $name variable.

25-30

Add a case to the switch statement to check if the $name variable equals "DVD".

26

Set the $TagsOpen["DVD"] flag to true (1), since the DVD tag is open.

27-29

Print out some information to the screen regarding the DVD tag, including its attributes.

30

Break out of the switch statement, since we are done evaluating the $name variable.

31

Add a case to the switch statement to check if the $name variable equals "EPISODE".

32

Set the $TagsOpen["EPISODE"] flag to true (1), since the EPISODE tag is open.

33-35

Print out some information to the screen regarding the EPISODE tag, including its attributes.

36

Break out of the switch statement, since we are done evaluating the $name variable.

37

Add a case to the switch statement to check if the $name variable equals "TITLE".

38

Set the $TagsOpen["TITLE"] flag to true (1), since the TITLE tag is open. We don't need to print anything for TITLE, because there are no attributes associated with this tag.

39

Break out of the switch statement, since we are done evaluating the $name variable.

40

Add a case to the switch statement to check if the $name variable equals "SYNOPSIS".

41

Set the $TagsOpen["SYNOPSIS"] flag to true (1), since the SYNOPSIS tag is open. We don't need to print anything for SYNOPSIS because there are no attributes associated with this tag.

42

Break out of the switch statement, since we are done evaluating the $name variable.

43

Add a case to the switch statement to check if the $name variable equals "CHARACTERS".

44

Set the $TagsOpen["CHARACTERS"] flag to true (1), since the CHARACTERS tag is open.

45-47

Print out some formatting that will be used when we print out the individual character names.

48

Break out of the switch statement, since we are done evaluating the $name variable.

49

Add a case to the switch statement to check if the $name variable equals "CHARACTER".

50

Set the $TagsOpen["CHARACTER"] flag to true (1), since the CHARACTER tag is open. We don't need to print anything for CHARACTER, because there are no attributes associated with this tag.

51

Increment the $counter variable, as it is used to help format the output when the individual character names are displayed.

52

Break out of the switch statement, since we are done evaluating the $name variable.

53

Close the switch statement.

54

End the function declaration.

56

Create a function called endElement() to handle any end elements that the script encounters as it parses the XML. The function takes the following as its arguments (required by PHP):

  • $parser

  • $name

57

Allow the variables $TagsOpen and $counter to be accessed and modified by this function. $TagsOpen is an array that tracks which tags are open or closed. $counter is an integer that is used for formatting purposes when displaying character data.

58

Create a switch statement to evaluate the value of the end element name.

59

Add a case to the switch statement to check if the $name variable equals "SERIES".

60

Set the $TagsOpen["SERIES"] flag to false (0), since the SERIES tag has been closed.

61

Break out of the switch statement, since we are done evaluating the $name variable.

62

Add a case to the switch statement to check if the $name variable equals "DVD".

63

Set the $TagsOpen["DVD"] flag to false (0), since the DVD tag has been closed.

64

Break out of the switch statement, since we are done evaluating the $name variable.

65

Add a case to the switch statement to check if the $name variable equals "EPISODE".

66

Set the $TagsOpen["EPISODE"] flag to false (0), since the EPISODE tag has been closed.

67

Break out of the switch statement, since we are done evaluating the $name variable.

68

Add a case to the switch statement to check if the $name variable equals "TITLE".

69

Set the $TagsOpen["TITLE"] flag to false (0), since the TITLE tag has been closed.

70-72

Print out the closing span tag. The span was opened when the parser found an open EPISODE tag (line 34), so the episode number and the title share the same styling.

73

Break out of the switch statement, since we are done evaluating the $name variable.

74

Add a case to the switch statement to check if the $name variable equals "SYNOPSIS".

75

Set the $TagsOpen["SYNOPSIS"] flag to false (0), since the SYNOPSIS tag has been closed.

76

Break out of the switch statement, since we are done evaluating the $name variable.

77

Add a case to the switch statement to check if the $name variable equals "CHARACTERS".

78

Set the $TagsOpen["CHARACTERS"] flag to false (0), since the CHARACTERS tag has been closed.

79-81

Close out the blockquote and span tags that were started when the parser encountered an open CHARACTERS tag.

82

Reset the $counter variable to 0 so that the next CHARACTERS element starts a fresh comma-separated list. The characterData() function uses $counter to format the character names.

83

Break out of the switch statement, since we are done evaluating the $name variable.

84

Add a case to the switch statement to check if the $name variable equals "CHARACTER".

85

Set the $TagsOpen["CHARACTER"] flag to false (0), since the CHARACTER tag has been closed.

86

Break out of the switch statement, since we are done evaluating the $name variable.

87

Close the switch statement.

88

End the function declaration.

90

Create a function called characterData() to handle any character data that the script encounters as it parses the XML. The function takes the following as its arguments (required by PHP):

  • $parser

  • $data

91

Allow the variables $TagsOpen and $counter to be accessed and modified by this function. $TagsOpen is an array that tracks which tags are open or closed. $counter is an integer that is used for formatting purposes when displaying character data.

92

Create a switch statement whose cases test the flags in the $TagsOpen array. By examining this array, the function can determine which point in the XML file PHP is parsing and display the data in the proper format.

93

Add a case to the switch statement to check if the CHARACTER element is open. We check CHARACTER before any other tag, since it is the most deeply nested element in our XML file.

94-98

Check the value of the $counter variable. If $counter is 1, this is the first CHARACTER element in the current CHARACTERS element, so just print the character name. Otherwise, print a comma followed by the character name. This displays the entries separated by commas.

99

Break out of the switch statement, since we are done evaluating the $TagsOpen variable.

100

Add a case to the switch statement to check if the SYNOPSIS element tags are open. We check to see if the SYNOPSIS element is open next, because it is one node up from the deepest node.

101

If the SYNOPSIS tags are open, then print the data that exists inside SYNOPSIS tags.

102

Break out of the switch statement, since we are done evaluating the $TagsOpen variable.

103

Add a case to the switch statement to check if the TITLE element is open. We check TITLE last; it sits at the same level as SYNOPSIS, directly inside EPISODE.

104

If the TITLE tags are open, then print the data that exists inside TITLE tags.

105

Close the switch statement. The TITLE case needs no break, because it is the last case in the switch.

Note: We do not have to check any of the other elements because none of those elements contain any actual character data in between their open and close tags. They contain only other elements, so there is no need to use this function to check for data in those elements.

106

End the function declaration.

108

Create a function called load_data() to read the data from an XML file into the script so that it may be parsed. The function takes one argument, $file, which is the name (with an optional path) of the XML file that you want to parse.

109

Attempt to assign a file handle to the file. If unsuccessful, kill the script and print out an error message.

110

If the file opening was successful, read the entire file into the $data variable. Note that this holds the whole file in memory at once, which is inefficient for large files.

111

Return the $data variable to the calling program.

112

End the function declaration.

114-122

Define and initialize the $TagsOpen array. You need to define an array element for each element that needs to be tracked. The array item's key is the name of the element, and the value is either 1 or 0 (true or false). Set all the values to false (0) in the beginning, since no tags are supposed to be open before you even start parsing the file.

124

Initialize the $counter variable to 0.

126

Assign a file name to the $file variable. Note that you can use a full path to the file such as:

  • Windows: $file = 'C:\winnt\xml\myxmlfile.xml'; (single quotes keep the backslashes literal)

  • Linux: $file = "/home/me/xml/myxmlfile.xml";

127

Create a variable called $xml_parser and assign it a PHP XML parser using the xml_parser_create() function.

128

Set a parser option using the xml_parser_set_option() function. This function changes how the parser operates. It requires three arguments:

  • The XML parser that you created using the xml_parser_create() function.

  • The option you want to set.

  • The value of the option you are setting.

In this case, we are setting the option XML_OPTION_CASE_FOLDING to true, which converts element and attribute names to uppercase when the parser reads the XML file. This way, we do not have to worry if there is mixed case in the element names in the XML file. When checking to see if a particular element is encountered, you only need to check for the uppercase version of the element name.

129

Define the custom start and end element handler functions that you created above using the xml_set_element_handler() function. Whenever PHP encounters a start or end element, it will use the respective function.

130

Define the custom data handler function that you created above using the xml_set_character_data_handler() function. Whenever PHP encounters character data, it will use the characterData() function.

131

Begin parsing the XML with the xml_parse function. The xml_parse() function requires the name of the XML parser ($xml_parser) and the XML data that you want to parse. In this case, we provide the XML data by using the custom load_data($file) function. If the xml_parse() function fails, then kill the script and display an error message.

132

After the xml_parse() function has completed parsing the XML, free the memory associated with the parser by using the xml_parser_free() function.

133-135

Stop parsing the page as PHP and close out the HTML for the page.
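
Line 131 of the script stops with a generic message when parsing fails. PHP's XML extension can also report what went wrong and where, through xml_get_error_code(), xml_error_string(), and xml_get_current_line_number(). The following is a minimal sketch under stated assumptions: describe_xml_error() is a hypothetical helper, and the malformed input string is invented for the demonstration.

```php
<?php
// Sketch: report where parsing failed instead of a bare error message.
// describe_xml_error() is a hypothetical helper, not part of Script 10-4.
function describe_xml_error($parser): string
{
    return sprintf(
        'XML error: %s at line %d',
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)
    );
}

$parser  = xml_parser_create();
$bad     = "<series>\n<dvd>\n</series>"; // <dvd> is never closed
$message = '';
if (!xml_parse($parser, $bad, true)) {
    $message = describe_xml_error($parser);
}
echo $message, "\n";
```

In the book's script, the same helper could supply the text passed to die() on line 131, which makes errors in a hand-edited XML file much easier to locate.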


   