XML::Parser Perl Module

The XML::Parser Perl module was originally written by Larry Wall and serves as a wrapper for the C libraries of James Clark's Expat parser. Expat was one of the first XML parsers available, and it has proven itself over time to be very fast and powerful. Because Expat was developed so early, it doesn't implement the SAX, DOM, or XPath interfaces; at the time, those standards either didn't exist or were immature and not yet widely accepted.

XML::Parser is also the underlying parser for several other parser implementations, but that is now changing with the development of the SAX2 interface and the XML::SAX module. As you will see, the XML::Parser Perl module has enough features to accomplish just about any XML parsing task, which makes it a very powerful tool for XML processing. The underlying Expat parser and the XML::Parser module are also very stable because they have been around for quite some time.

XML::Parser-Based Application

Let's create and examine a relatively small application. Assume that your company has outsourced desktop PC system administration to a support contractor. The support contractor stores all of its information in a database and delivers reports to your company in XML. Your task is to develop a utility program that prints out a report for non-technical personnel who are not yet familiar with the format of an XML document. This XML document can be very large, depending on the number of software packages that the company purchases. Because the reporting program will have to run on one of the company's standard desktop PCs (which in some cases might be an older, slower machine), we need to develop a resource-efficient program. This is the type of application where the stream-style processing in XML::Parser proves to be very beneficial.

In Chapter 1, "Basics of XML Processing in Perl," we discussed the steps required to build an XML-based application. Let's reuse those steps and apply them to the current example.

Gather Requirements

As with any well-planned software application, we need to list our available data, the requirements, and the desired output data. Our goal for this example is to generate a report from XML data that contains our company's software inventory. For this simple example, let's assume that the XML data is generated by hand. Don't worry, not all XML data is generated by hand. In Chapters 6, "Generating XML Documents from Databases," and 7, "Transforming Miscellaneous Data Formats to XML (and Vice-Versa)," we'll show you how to generate XML data from a number of input sources, including text files and relational databases.

To perform the proper inventory, we'll need the following information about each piece of software:

Name
Operating System
Number of Copies Purchased
Number of Copies Currently Being Used

Our output report should have the following format:

    *Software Package Name*
    Total Number of Packages: X
    Total Number Available: Y
    Value ($Z per piece): $Z * Total Number of Packages
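As a quick check on the report format above, the three figures for one package follow directly from the two license counts and the unit price. The sketch below is illustrative only: the report_lines helper and its argument names are hypothetical and are not part of the chapter's program, and the sample figures are the Dreamweaver totals from the inventory shown later in this chapter.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the chapter's program): build the
# report lines for one package from its license counts and unit price.
sub report_lines {
    my ($name, $purchased, $in_use, $price) = @_;

    my $available = $purchased - $in_use;   # Y: copies not currently in use
    my $value     = $purchased * $price;    # $Z * total number of packages

    return (
        "*$name*",
        "Total Number of Packages: $purchased",
        "Total Number Available: $available",
        sprintf('Value ($%.2f per piece): $%.2f', $price, $value),
    );
}

# 25 copies purchased, 10 in use, $299.99 each.
print "$_\n" for report_lines("Dreamweaver", 25, 10, 299.99);
```

Keeping the arithmetic in one small function makes it easy to confirm the figures by hand before wiring them into parser callbacks.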

Describe the Format of the Stored Data

We know that we need to store a few pieces of information for each software package in our inventory. In the real world, a lot more data could be required for each piece of software (for example, licensing information, version number, patches or upgrades, and so forth); however, the pieces of information we have listed will suffice for our example.

We'll need to keep several points in mind while designing our format of the stored data. First, some of our data elements can have multiple child elements. For example, our base element will be the name of the software package. Some of our software packages are available for multiple operating systems, and the prices for each operating system could be different. Second, all our information (name, operating system name, total number of packages, total number of packages available) for each software package is required to appear at least once.

Design a Document Type Definition or XML Schema

Now that we have the required data fields and the rules associated with the data (for example, each field is required to appear once for each base element), we can design a DTD and a schema to support our requirements. Remember, I discussed DTDs and schemas in Chapter 1. Both DTDs and schemas describe the content of XML data.

First, let's develop a DTD that supports our requirements. The root element of our DTD is called software, and it is made up of multiple package elements. The root element and the package element are declared in the following section of the DTD:

    <!ELEMENT software (package*)>
    <!ELEMENT package (operating_system*, price)>
    <!ATTLIST package
       name CDATA #REQUIRED
       version CDATA #REQUIRED>

As you can see, we're saying that the software root element is made up of zero or more package elements. Each package element is made up of zero or more operating_system elements followed by exactly one price element. Note that the operating_system element can appear multiple times; this is required to support products that run on multiple operating systems. The package element also has two required attributes, name and version, so each package element can be assigned a name and a version. Note that the * after an element name indicates that the element can appear zero or more times. To insist on at least one occurrence, as our requirements strictly call for, you would write + instead.

The operating_system element is similar to the package element in that it has two child elements (licenses_purchased and licenses_in_use) and an attribute (name).

    <!ELEMENT operating_system (licenses_purchased, licenses_in_use)>
    <!ATTLIST operating_system
       name CDATA #REQUIRED>
    <!ELEMENT licenses_purchased (#PCDATA)>
    <!ELEMENT licenses_in_use (#PCDATA)>

Finally, the last child of the package element is the price element, which is defined by the following:

    <!ELEMENT price (#PCDATA)>

All these individual portions of the DTD are combined to create the DTD shown in Listing 3.1.

Listing 3.1 DTD for the software inventory XML document. (Filename: ch3_xml_parser_sw_inventory.dtd)

    <?xml version="1.0" encoding="UTF-8"?>
    <!ELEMENT software (package*)>
    <!ELEMENT package (operating_system*, price)>
    <!ATTLIST package
       name CDATA #REQUIRED
       version CDATA #REQUIRED>
    <!ELEMENT operating_system (licenses_purchased, licenses_in_use)>
    <!ATTLIST operating_system
       name CDATA #REQUIRED>
    <!ELEMENT licenses_purchased (#PCDATA)>
    <!ELEMENT licenses_in_use (#PCDATA)>
    <!ELEMENT price (#PCDATA)>

Now, let's build an XML schema based on the same set of data requirements. The corresponding XML schema is shown in Listing 3.2.

Listing 3.2 XML schema for the software inventory XML file. (Filename: ch3_xml_parser_sw_inventory.xsd)

    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               elementFormDefault="qualified">
       <xs:element name="licenses_in_use" type="xs:integer"/>
       <xs:element name="licenses_purchased" type="xs:integer"/>
       <xs:element name="operating_system">
          <xs:complexType>
             <xs:sequence>
                <xs:element ref="licenses_purchased"/>
                <xs:element ref="licenses_in_use"/>
             </xs:sequence>
             <xs:attribute name="name" type="xs:string" use="required"/>
          </xs:complexType>
       </xs:element>
       <xs:element name="package">
          <xs:complexType>
             <xs:sequence>
                <xs:element ref="operating_system" minOccurs="0" maxOccurs="unbounded"/>
                <xs:element ref="price"/>
             </xs:sequence>
             <xs:attribute name="name" type="xs:string" use="required"/>
             <xs:attribute name="version" type="xs:string" use="required"/>
          </xs:complexType>
       </xs:element>
       <xs:element name="price" type="xs:string"/>
       <xs:element name="software">
          <xs:complexType>
             <xs:sequence>
                <xs:element ref="package" minOccurs="0" maxOccurs="unbounded"/>
             </xs:sequence>
          </xs:complexType>
       </xs:element>
    </xs:schema>

Now that we have defined an XML DTD and an XML schema based on our requirements, let's build an XML file that conforms to the DTD and schema.

Note

Remember, you usually only use a DTD or a schema, not both at the same time. For the purpose of this example, I'm trying to demonstrate that they perform the same basic task: defining the structure and content of your XML data.

Listing 3.3 shows an XML file that contains the software inventory information.

Note

The XML file should contain a reference to the corresponding DTD or schema.

Listing 3.3 The software inventory data is stored in an XML file. (Filename: ch3_xml_parser_sw_inventory.xml)

    <?xml version="1.0"?>
    <software>
       <package name="Dreamweaver" version="4">
          <operating_system name="Microsoft Windows">
             <licenses_purchased>15</licenses_purchased>
             <licenses_in_use>8</licenses_in_use>
          </operating_system>
          <operating_system name="Apple Mac OS">
             <licenses_purchased>10</licenses_purchased>
             <licenses_in_use>2</licenses_in_use>
          </operating_system>
          <price>299.99</price>
       </package>
       <package name="Microsoft Visual C++ Enterprise Edition" version="6">
          <operating_system name="Microsoft Windows">
             <licenses_purchased>15</licenses_purchased>
             <licenses_in_use>8</licenses_in_use>
          </operating_system>
          <price>1299.00</price>
       </package>
       <package name="XML Spy" version="4.2">
          <operating_system name="Microsoft Windows">
             <licenses_purchased>15</licenses_purchased>
             <licenses_in_use>8</licenses_in_use>
          </operating_system>
          <price>399.00</price>
       </package>
       <package name="Borland JBuilder Enterprise" version="6">
          <operating_system name="Microsoft Windows">
             <licenses_purchased>15</licenses_purchased>
             <licenses_in_use>8</licenses_in_use>
          </operating_system>
          <operating_system name="Apple Mac OS">
             <licenses_purchased>10</licenses_purchased>
             <licenses_in_use>2</licenses_in_use>
          </operating_system>
          <operating_system name="Linux">
             <licenses_purchased>10</licenses_purchased>
             <licenses_in_use>2</licenses_in_use>
          </operating_system>
          <operating_system name="Solaris Sparc">
             <licenses_purchased>10</licenses_purchased>
             <licenses_in_use>2</licenses_in_use>
          </operating_system>
          <price>2999.00</price>
       </package>
    </software>

Note

As additional packages are added to the inventory, the file in Listing 3.3 can be edited either manually or by another program. I will show you how to create a program that does just that in Chapter 6, "Generating XML Documents from Databases." For now, we'll just focus on parsing this file and generating a formatted report.

Before looking at the Perl application that generates the output report, let's take a minute and think about how we would manually perform this task.

Print out the heading for the report.
Read the XML document (shown in Listing 3.3) line by line (starting at the top).
Locate the package element and write out the desired information from each element. Also, add the number of licenses and the cost to our running totals, then repeat this step for each new package element that we find.
Print the summary information after reaching the end of the document.

As you'll see, the program in Listing 3.4 (ch3_xml_parser_app.pl) follows the steps that I just laid out: it takes the name of the XML file (ch3_xml_parser_sw_inventory.xml) as a command-line argument and prints out a report from the information the file contains. You may find that it is helpful to make a short list of the main steps in the process like this one before you write any code. In our application, we'll utilize XML::Parser's more commonly used functions and attributes to parse the XML document and generate the output report.

Now that we've defined our input data (in our XML file) and the output report format, let's take a look at a Perl program built upon XML::Parser that accomplishes this task. Our Perl program that uses the XML::Parser module to parse the input file is shown in Listing 3.4. I will walk through the program and explain each major section of the program.

Listing 3.4 Program built using XML::Parser to parse the software inventory XML file. (Filename: ch3_xml_parser_app.pl)

     1.  use strict;
     2.  use XML::Parser;
     3.
     4.  my $parser = XML::Parser->new(Style => 'Stream',
     5.                                Handlers => {Init  => \&init,
     6.                                             Start => \&start,
     7.                                             Char  => \&char,
     8.                                             End   => \&end,
     9.                                             Final => \&final});
    10.
    11.  $parser->parsefile(shift);
    12.
    13.  ####################################
    14.  # These variables keep track       #
    15.  # of each package's data, and      #
    16.  # are reset at each package.       #
    17.  #                                  #
    18.  my $dist_count = 0;                #
    19.  my $dist_cost = 0;                 #
    20.  my $dist_out = 0;                  #
    21.  ####################################
    22.
    23.  ####################################
    24.  # These variables keep track       #
    25.  # of the totals which are          #
    26.  # accumulated throughout the       #
    27.  # parsing process                  #
    28.  #                                  #
    29.  my $total_count = 0;               #
    30.  my $total_cost = 0;                #
    31.  my $total_out = 0;                 #
    32.  ####################################
    33.
    34.  my $curr_val = "";  ## Retains the text value
    35.                      ## within the current node
    36.
    37.  my @os_avail = ();  ## An array of available
    38.                      ## operating systems
    39.
    40.  sub init {
    41.    my $e = shift;
    42.
    43.    print "\n***** Software Inventory Report *****\n";
    44.    print "-------------------------------------\n\n";
    45.  }
    46.
    47.  sub start {
    48.    my ($e, $tag, %attr) = @_;
    49.    $curr_val = "";  ## New element: discard text collected so far
    50.    if ($tag eq "package") {
    51.      print "*$attr{name} (version $attr{version})*\n";
    52.    }
    53.    elsif ($tag eq "operating_system") {
    54.      push(@os_avail, $attr{name});
    55.    }
    56.  }
    57.
    58.  sub char {
    59.    my ($e, $string) = @_;
    60.    $curr_val .= $string;  ## Append: text may arrive in pieces
    61.  }
    62.
    63.  sub end {
    64.    my ($e, $tag) = @_;
    65.
    66.    if ($tag eq "price") {
    67.      $dist_cost = $curr_val;
    68.      $total_cost += $curr_val;
    69.    }
    70.    elsif ($tag eq "licenses_purchased") {
    71.      $dist_count += $curr_val;
    72.      $total_count += $curr_val;
    73.    }
    74.    elsif ($tag eq "licenses_in_use") {
    75.      $dist_out += $curr_val;
    76.      $total_out += $curr_val;
    77.    }
    78.    elsif ($tag eq "package") {
    79.      print "Packages Purchased: $dist_count\n";
    80.      print "Packages Available: ".($dist_count - $dist_out)."\n";
    81.      print "Cost ($dist_cost per piece): \$".($dist_count*$dist_cost)."\n";
    82.      print "-------------------------------------\n\n";
    83.
    84.      ## Empty the package variables
    85.      $dist_count = 0;
    86.      $dist_out = 0;
    87.      @os_avail = ();
    88.    }
    89.  }
    90.
    91.  sub final {
    92.    my $e = shift;
    93.
    94.    print "Total software packages purchased: $total_count\n";
    95.    print "Total software packages available: ".($total_count - $total_out)."\n";
    96.    print "Total cost for $total_count software packages: $total_cost\n";
    97.  }

Initialization

Lines 1-38: Our program starts with the standard use strict pragma, and then a use statement loads the XML::Parser module. Next, we call the new function, which creates and initializes the parser object. Note that this call to new is very similar to a C++ constructor.

Note

Remember it's always a good idea to take advantage of the use strict Perl compiler pragma. The use strict pragma limits potentially unsafe code.

We pass two attributes to the new function, Style and Handlers. The Style attribute tells the parser which parsing style we would like to use. The Stream value tells the parser to use the stream method of parsing, which is the event-driven model. Because we decided to use the event-driven model with our own subroutines, we also use the Handlers attribute to pass the event handler functions; handlers passed this way take precedence over the ones the style would otherwise install. Handlers actually points to another hash, which contains the references to the subroutines that we will use to handle the predefined events. In this case, we assigned a handler subroutine to the Init, Start, Char, End, and Final events. All the subroutines named here (init, start, char, end, and final) are defined later in our program, and each is called by the XML::Parser module whenever the corresponding event is encountered.

Now that a subroutine has been assigned to each of the required handlers and the parser has been initialized, we can start parsing the XML document. The XML::Parser module has three functions for parsing XML data: parse, parsestring, and parsefile. The first two functions are basically the same: parse expects the XML data either as a string or as an open filehandle, and parsestring is an alias for parse that is available only to provide backward compatibility for older applications. In our case, we'll use the parsefile subroutine because our XML data is in a standalone file.
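The parsing functions described above can be compared side by side. The following is a minimal sketch, assuming XML::Parser is installed; the one-line sample document, the %starts hash, and the temporary file are made up for illustration, and the only handler set is a Start handler that counts opening tags.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use XML::Parser;

# Count start tags so we can see that every entry point drives the same handler.
my $count  = 0;
my $parser = XML::Parser->new(Handlers => { Start => sub { $count++ } });

my $xml = '<software><package name="XML Spy" version="4.2"/></software>';
my %starts;   # entry point => number of start tags seen

# parse() with the whole document held in a string.
$count = 0;
$parser->parse($xml);
$starts{parse_string} = $count;

# parsestring() is the backward-compatible alias for parse().
$count = 0;
$parser->parsestring($xml);
$starts{parsestring} = $count;

# Write the sample document to a temporary file for the remaining two calls.
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} $xml;
close($out) or die "Cannot write $path: $!";

# parse() with an open filehandle.
open(my $in, '<', $path) or die "Cannot read $path: $!";
$count = 0;
$parser->parse($in);
close($in);
$starts{parse_handle} = $count;

# parsefile() with a file name.
$count = 0;
$parser->parsefile($path);
$starts{parsefile} = $count;

printf "%-13s %d start tags\n", $_, $starts{$_} for sort keys %starts;
```

The same parser object is reused for every call, which is convenient when one program has to process several documents with the same handlers.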

Note

You are free to name your event handler subroutines anything that you choose. For ease of association, our subroutine names resemble the actual handlers. You can define any subroutine name as long as you assign the subroutine reference to the appropriate event handler. The Stream style also provides the option of using predefined event handler names instead of the Handlers facility. If you select the Stream style without passing your own handlers, XML::Parser looks in your package for subroutines with the following names and calls them: StartDocument, StartTag, Text, EndTag, and EndDocument.
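The predefined names can be seen in action with a short sketch. It assumes XML::Parser is installed and relies on the Stream style's documented conventions: the subroutines are looked up in the package named by the Pkg option, the accumulated text is passed to Text in $_, and the attributes of the current tag are available to StartTag in %_. The one-line document and the @events array are made up for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

my @events;   # record of the callbacks, in the order they fire

# Subroutines with the names the Stream style looks for.
sub StartDocument { push @events, "start-document" }

sub StartTag {
    my ($expat, $tag) = @_;
    # The Stream style places the tag's attributes in %_.
    my $name = defined $_{name} ? " name=$_{name}" : "";
    push @events, "start:$tag$name";
}

sub Text {
    # The Stream style places the accumulated text in $_.
    push @events, "text:$_" if /\S/;
}

sub EndTag {
    my ($expat, $tag) = @_;
    push @events, "end:$tag";
}

sub EndDocument { push @events, "end-document" }

# No Handlers option: the style dispatches to the subroutines above.
my $parser = XML::Parser->new(Style => 'Stream', Pkg => 'main');
$parser->parse('<package name="XML Spy"><price>399.00</price></package>');

print "$_\n" for @events;
```

Note that the Stream style copies to standard output anything for which no subroutine is defined, so it is usually worth defining StartTag, EndTag, and Text together.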

Because we are passing the XML document as a command-line argument, we use shift to retrieve the value from @ARGV. At this point, the process of parsing the XML document actually begins. XML::Parser parses through the whole document and calls event handlers to handle the encountered data while also verifying that the document is syntactically correct (that is, well-formed).

Note

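You can use XML::Parser to check that XML data is well-formed by calling the appropriate parse method (that is, parse or parsefile) without setting any event handlers. If the data is not well-formed, the call dies with an error message that reports where the problem was found.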
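A well-formedness check along these lines might look like the sketch below. It assumes XML::Parser is installed; the is_well_formed helper is a hypothetical name, and the second sample string deliberately contains a broken price tag.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: return 1 if $xml is well-formed, 0 otherwise.
# On failure, the parser's error message is left in $@.
sub is_well_formed {
    my ($xml) = @_;
    my $parser = XML::Parser->new;   # no Style, no Handlers: syntax check only
    return eval { $parser->parse($xml); 1 } ? 1 : 0;
}

my $good = '<package><price>399.00</price></package>';
my $bad  = '<package>< price399.00</price></package>';   # broken start tag

for my $doc ($good, $bad) {
    if (is_well_formed($doc)) {
        print "well-formed\n";
    }
    else {
        print "not well-formed: $@";
    }
}
```

Wrapping the call in eval keeps a malformed document from terminating the whole program, which matters when a batch of files has to be checked.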

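Next we declare and initialize eight global variables that we'll use to keep track of the values through the calls to the different handlers. Note that these assignments appear after the call to parsefile, so they execute only once parsing has finished. The handlers still work because Perl creates the variables at compile time and treats their initially undefined values as 0 or the empty string. In a larger program, it is safer to place the declarations above the call to parsefile.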

     1.  use strict;
     2.  use XML::Parser;
     3.
     4.  my $parser = XML::Parser->new(Style => 'Stream',
     5.                                Handlers => {Init  => \&init,
     6.                                             Start => \&start,
     7.                                             Char  => \&char,
     8.                                             End   => \&end,
     9.                                             Final => \&final});
    10.
    11.  $parser->parsefile(shift);
    12.
    13.  ####################################
    14.  # These variables keep track       #
    15.  # of each package's data, and      #
    16.  # are reset at each package.       #
    17.  #                                  #
    18.  my $dist_count = 0;                #
    19.  my $dist_cost = 0;                 #
    20.  my $dist_out = 0;                  #
    21.  ####################################
    22.
    23.  ####################################
    24.  # These variables keep track       #
    25.  # of the totals which are          #
    26.  # accumulated throughout the       #
    27.  # parsing process                  #
    28.  #                                  #
    29.  my $total_count = 0;               #
    30.  my $total_cost = 0;                #
    31.  my $total_out = 0;                 #
    32.  ####################################
    33.
    34.  my $curr_val = "";  ## Retains the text value
    35.                      ## within the current node
    36.
    37.  my @os_avail = ();  ## An array of available
    38.                      ## operating systems

init and start Event Handlers

Lines 40-56: The init subroutine is called once before the actual parsing begins. In our example, we take advantage of this subroutine to generate a header for our report. The start subroutine is called once for each opening tag (for example, <package>). Note that three arguments are passed into the start subroutine:

Expat object: This is the Expat object that XML::Parser creates to carry out the parse. We can use this object to gain access to object property values or object-specific functions.
Element name: This scalar contains the name of the current element. If the XML::Parser module finds the opening tag for the package element (that is, <package>), then the element name scalar will contain the string "package".
Hash of attributes: This argument is a hash of attributes for the current element. The data in the hash is stored in name-value pairs (for example, version=>"4").

    40.  sub init {
    41.    my $e = shift;
    42.
    43.    print "\n***** Software Inventory Report *****\n";
    44.    print "-------------------------------------\n\n";
    45.  }
    46.
    47.  sub start {
    48.    my ($e, $tag, %attr) = @_;
    49.    $curr_val = "";  ## New element: discard text collected so far
    50.    if ($tag eq "package") {
    51.      print "*$attr{name} (version $attr{version})*\n";
    52.    }
    53.    elsif ($tag eq "operating_system") {
    54.      push(@os_avail, $attr{name});
    55.    }
    56.  }
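To make the three arguments of the start handler concrete, here is a small sketch that records them for every opening tag. It assumes XML::Parser is installed; the @seen array and the shortened sample document are made up for illustration, and current_line is one of the position methods that the Expat object provides.

```perl
use strict;
use warnings;
use XML::Parser;

my @seen;   # one record per start tag, in document order

sub start {
    my ($expat, $tag, %attr) = @_;
    push @seen, {
        tag   => $tag,                   # element name
        line  => $expat->current_line,   # position reported by the Expat object
        attrs => { %attr },              # copy of the name => value pairs
    };
}

my $xml = <<'XML';
<software>
  <package name="Dreamweaver" version="4">
    <operating_system name="Apple Mac OS"/>
  </package>
</software>
XML

XML::Parser->new(Handlers => { Start => \&start })->parse($xml);

for my $rec (@seen) {
    my $attrs = join ", ",
        map { $_ . "=" . $rec->{attrs}{$_} } sort keys %{ $rec->{attrs} };
    printf "line %d: <%s> %s\n", $rec->{line}, $rec->{tag}, $attrs;
}
```

Copying %attr into a fresh hash reference keeps each record independent of the next call to the handler.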

char Event Handler

Lines 58-61: You have to be careful of the actions that are performed in this callback subroutine. Even though this is a short subroutine, it can be tricky.

The subroutine can be called multiple times for a single run of character data, and each call may deliver only part of the text. It is therefore not recommended to use this subroutine for any action requiring a precise count of calls (for example, printing characters, incrementing counters, and so forth), nor should you assume that a single call carries the complete value. It is safe to use it to accumulate the characters: append each piece to a variable, and empty that variable whenever a new element starts, so that by the time the end handler runs it holds the complete text of the element no matter how many calls were made. That is exactly what we do; we collect the value in $curr_val for later use.

    58.  sub char {
    59.    my ($e, $string) = @_;
    60.    $curr_val .= $string;  ## Append: text may arrive in pieces
    61.  }
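The effect of multiple calls is easy to observe. The sketch below assumes XML::Parser is installed; the vendor element is hypothetical and is not part of our inventory format. Its text contains an entity reference, which typically causes Expat to deliver the text in more than one piece; the exact number of calls is not guaranteed, so the example prints it rather than relying on it.

```perl
use strict;
use warnings;
use XML::Parser;

my $text  = "";   # text collected for the element currently being read
my $calls = 0;    # how many times the Char handler fired
my @values;       # completed text of each vendor element

my $parser = XML::Parser->new(Handlers => {
    # A new element starts: forget any text collected so far.
    Start => sub { $text = "" },

    # Append every piece; never assume one call carries the whole value.
    Char  => sub {
        my ($expat, $chunk) = @_;
        $calls++;
        $text .= $chunk;
    },

    # The element is complete, so $text now holds its full content.
    End   => sub {
        my ($expat, $tag) = @_;
        push @values, $text if $tag eq "vendor";
    },
});

$parser->parse('<package><vendor>Smith &amp; Jones</vendor></package>');

print "Char handler calls: $calls\n";
print "vendor: $values[0]\n";
```

Because the value is assembled in the End handler, the result is the same whether the text arrives in one call or in several.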

Note

All the event handlers are passed, as their first argument, the Expat object that XML::Parser creates to carry out the parse.

end Event Handler and final Subroutine

Lines 63-97: The end handler is called whenever the parser encounters an end tag (for example, </package>) for an element. In this example, we use the end tag to notify the application that we've completed a particular element and to tabulate the required information. Note that when we reach the end tag of a package element, we print out the summary information for that package and clear the values of any variables associated with it.

The final subroutine is similar to the init subroutine in that it is only called once. However, as you may have guessed from the name, the final subroutine is called after the parsing has completed. In our example, we take advantage of the fact that this is the last subroutine called and use it to print out summary information.

    63.  sub end {
    64.    my ($e, $tag) = @_;
    65.
    66.    if ($tag eq "price") {
    67.      $dist_cost = $curr_val;
    68.      $total_cost += $curr_val;
    69.    }
    70.    elsif ($tag eq "licenses_purchased") {
    71.      $dist_count += $curr_val;
    72.      $total_count += $curr_val;
    73.    }
    74.    elsif ($tag eq "licenses_in_use") {
    75.      $dist_out += $curr_val;
    76.      $total_out += $curr_val;
    77.    }
    78.    elsif ($tag eq "package") {
    79.      print "Packages Purchased: $dist_count\n";
    80.      print "Packages Available: ".($dist_count - $dist_out)."\n";
    81.      print "Cost ($dist_cost per piece): \$".($dist_count*$dist_cost)."\n";
    82.      print "-------------------------------------\n\n";
    83.
    84.      ## Empty the package variables
    85.      $dist_count = 0;
    86.      $dist_out = 0;
    87.      @os_avail = ();
    88.    }
    89.  }
    90.
    91.  sub final {
    92.    my $e = shift;
    93.
    94.    print "Total software packages purchased: $total_count\n";
    95.    print "Total software packages available: ".($total_count - $total_out)."\n";
    96.    print "Total cost for $total_count software packages: $total_cost\n";
    97.  }
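Instead of printing from the handlers, the Final handler can also hand a result back to the caller, because the value it returns becomes the return value of parse or parsefile. The following is a minimal sketch of that variation, assuming XML::Parser is installed; the %summary hash and the single-package sample document are made up for illustration and are not how Listing 3.4 reports its totals.

```perl
use strict;
use warnings;
use XML::Parser;

my %summary = (packages => 0, purchased => 0);
my $text    = "";   # text of the element currently being read

my $parser = XML::Parser->new(Handlers => {
    Start => sub { $text = "" },
    Char  => sub { $text .= $_[1] },
    End   => sub {
        my ($expat, $tag) = @_;
        $summary{packages}++         if $tag eq "package";
        $summary{purchased} += $text if $tag eq "licenses_purchased";
    },
    # Called once after parsing; its return value is what parse() returns.
    Final => sub { return { %summary } },
});

my $result = $parser->parse(<<'XML');
<software>
  <package name="Dreamweaver" version="4">
    <operating_system name="Microsoft Windows">
      <licenses_purchased>15</licenses_purchased>
      <licenses_in_use>8</licenses_in_use>
    </operating_system>
    <operating_system name="Apple Mac OS">
      <licenses_purchased>10</licenses_purchased>
      <licenses_in_use>2</licenses_in_use>
    </operating_system>
    <price>299.99</price>
  </package>
</software>
XML

print "Packages: $result->{packages}\n";
print "Licenses purchased: $result->{purchased}\n";
```

Returning a data structure keeps the parsing code separate from the report layout, so the same handlers could feed a differently formatted report.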

The output from our inventory report generator program is shown in Listing 3.5.

Listing 3.5 Output software inventory report. (Filename: ch3_xml_parser_report.txt)

    ***** Software Inventory Report *****
    -------------------------------------

    *Dreamweaver (version 4)*
    Packages Purchased: 25
    Packages Available: 15
    Cost (299.99 per piece): $7499.75
    -------------------------------------

    *Microsoft Visual C++ Enterprise Edition (version 6)*
    Packages Purchased: 15
    Packages Available: 7
    Cost (1299.00 per piece): $19485
    -------------------------------------

    *XML Spy (version 4.2)*
    Packages Purchased: 15
    Packages Available: 7
    Cost (399.00 per piece): $5985
    -------------------------------------

    *Borland JBuilder Enterprise (version 6)*
    Packages Purchased: 45
    Packages Available: 31
    Cost (2999.00 per piece): $134955
    -------------------------------------

    Total software packages purchased: 100
    Total software packages available: 60
    Total cost for 100 software packages: 4996.99

This example demonstrates how to use XML::Parser in the stream mode to parse an XML document and how to generate a report from the data contained in the XML document. We also defined the event handlers corresponding to the most important events. For additional information on the XML::Parser Perl module (including functions and attribute options), see the perldoc page that comes with its distribution.