SAX1: Simple API for XML Version 1

What is the Simple API for XML (SAX)? SAX provides an event-based API to retrieve (that is, to parse) data from an XML document. What is the data in an XML document? Depending on the task at hand, you may need to extract an element name, the character data inside an element, an attribute associated with a particular element, or all of the above. The SAX API is implemented by a large number of XML parsers, including Apache Xerces, MSXML from Microsoft, and the Oracle XML Parser.

SAX1 Event Handling

How does the event handling inside a SAX processor work? The XML document is read sequentially, from beginning to end, and the event-driven SAX processor calls a predefined subroutine (called an event handler in this case) whenever a particular condition (or set of conditions) is satisfied. The interface between the processor and the event handlers is small and easy to use: each handler is an ordinary subroutine that receives the data for one event.

For example, we can set up event handlers that the SAX processor calls whenever it encounters particular events. Here is a small subset of what the SAX processor considers to be an event:

The start of an XML document
The start tag of an element (<)
The end tag of an element (</)
The end of an XML document
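To make the order of these events concrete, here is a minimal sketch of a handler that simply records each event it receives. The EventLogger package is a hypothetical name of mine, and the events are fired by hand rather than by a real parser, so the script needs nothing beyond core Perl. The argument shapes (a hash with Name and Attributes keys) mirror what the handlers later in this chapter receive.

```perl
use strict;
use warnings;

# Hypothetical handler class: records the name of every event it receives.
{
    package EventLogger;

    sub new {
        my ($class) = @_;
        return bless { log => [] }, $class;
    }

    sub start_document {
        my ($self) = @_;
        push @{ $self->{log} }, 'start_document';
    }

    sub start_element {
        my ($self, $element) = @_;
        push @{ $self->{log} }, "start_element <$element->{Name}>";
    }

    sub end_element {
        my ($self, $element) = @_;
        push @{ $self->{log} }, "end_element </$element->{Name}>";
    }

    sub end_document {
        my ($self) = @_;
        push @{ $self->{log} }, 'end_document';
    }
}

# Assumption: we play the role of the parser and fire the events ourselves,
# in the order a SAX processor would for <class name="XML 101"></class>.
my $logger = EventLogger->new();
$logger->start_document({});
$logger->start_element({ Name => 'class', Attributes => { name => 'XML 101' } });
$logger->end_element({ Name => 'class' });
$logger->end_document({});

print "$_\n" for @{ $logger->{log} };
```

With a real SAX processor, the same four methods would be called in the same order; the only difference is who makes the calls.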

The SAX approach offers several advantages when parsing XML documents.

First, because you're reading the XML document sequentially, the data that you're parsing from the XML document is available as you process the file. This means that you don't need to wait for the entire file to be processed before you start seeing your results.
Second, SAX works well with large files. Because the file is processed as it is being read, the entire file doesn't reside in memory all at once.
Third, the SAX processor doesn't store any of the parsed data. What the application does with the event handler inside the application is up to you, the developer. All that the SAX processor does is call the event handler and pass in the proper data as a parameter. Some people would consider this a disadvantage compared to other parsing methods (for example, tree-based parsers); however, I consider this to be an advantage. It provides a lot of flexibility and enables (and almost forces) you to be a little creative while trying to parse a document. So, what you do inside of an event handler (what data you store, how you store it, and so forth) is up to you.
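As a rough illustration of that third point, the sketch below keeps only the data it cares about (the class names) inside the handler object and ignores everything else. The ClassNameCollector package is a hypothetical name, and the events are again simulated by hand, so treat this as one possible storage strategy rather than a prescribed one.

```perl
use strict;
use warnings;

# Hypothetical handler: the application, not the parser, decides what to keep.
{
    package ClassNameCollector;

    sub new {
        my ($class) = @_;
        return bless { classes => [] }, $class;
    }

    sub start_element {
        my ($self, $element) = @_;

        # Ignore every element except <class>.
        return unless $element->{Name} eq 'class';
        push @{ $self->{classes} }, $element->{Attributes}{name};
    }
}

# Assumption: events are fired by hand in place of a real parser.
my $collector = ClassNameCollector->new();
for my $name ('XML 101', 'Perl 101') {
    $collector->start_element({ Name => 'class', Attributes => { name => $name } });
}
$collector->start_element({ Name => 'room', Attributes => {} });

print join(', ', @{ $collector->{classes} }), "\n";   # prints "XML 101, Perl 101"
```

Nothing outside the classes array is retained, which is why this style of parsing can cope with documents much larger than available memory.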

Now that we have a good understanding of SAX and what it does, let's take a look at an example of how it does it.

XML::Parser::PerlSAX Perl Module

One of the more popular Perl SAX processors is XML::Parser::PerlSAX. XML::Parser::PerlSAX is a Perl SAX parser written by Ken MacLeod that was built using the XML::Parser module. The following example illustrates some of the SAX concepts that I discussed in the previous section and shows how to use XML::Parser::PerlSAX to parse an XML document.

XML::Parser::PerlSAX-Based Application

Before we look at the XML document that we want to parse for this example, let's take a look at a DTD and an XML schema that describe the XML document. Listing 3.6 shows the DTD for the course catalog file. As you can see, the root element is named <course_catalog> and it has two attributes: school and term. Note that term is an enumerated value that only allows four possible values. Each <class> element has one attribute (name) and two child elements (<description> and <schedule>). Note that each <schedule> element has the following child elements: <room>, <day>, <start_time>, <end_time>, and <credits>.

Listing 3.6 DTD for the course catalog XML document. (Filename: ch3_course_catalog.dtd)

<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT course_catalog (class+)>
<!ATTLIST course_catalog
   school CDATA #REQUIRED
   term (Fall | Winter | Spring | Summer) #REQUIRED>
<!ELEMENT school (#PCDATA)>
<!ELEMENT class (description, schedule+)>
<!ATTLIST class
   name CDATA #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT schedule ((room, day, start_time, end_time, credits))>
<!ELEMENT room (#PCDATA)>
<!ELEMENT day (#PCDATA)>
<!ELEMENT start_time (#PCDATA)>
<!ELEMENT end_time (#PCDATA)>
<!ELEMENT credits (#PCDATA)>

Listing 3.7 shows the XML schema for the course catalog XML document.

Listing 3.7 XML schema for the course catalog XML document. (Filename: ch3_course_catalog.xsd)

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified">
   <xs:element name="class">
      <xs:complexType>
         <xs:sequence>
            <xs:element ref="description"/>
            <xs:element ref="schedule" maxOccurs="unbounded"/>
         </xs:sequence>
         <xs:attribute name="name" type="xs:string" use="required"/>
      </xs:complexType>
   </xs:element>
   <xs:element name="course_catalog">
      <xs:complexType>
         <xs:sequence>
            <xs:element ref="class" maxOccurs="unbounded"/>
         </xs:sequence>
         <xs:attribute name="school" type="xs:string" use="required"/>
         <xs:attribute name="term" use="required">
            <xs:simpleType>
               <xs:restriction base="xs:NMTOKEN">
                  <xs:enumeration value="Fall"/>
                  <xs:enumeration value="Winter"/>
                  <xs:enumeration value="Spring"/>
                  <xs:enumeration value="Summer"/>
               </xs:restriction>
            </xs:simpleType>
         </xs:attribute>
      </xs:complexType>
   </xs:element>
   <xs:element name="credits" type="xs:float"/>
   <xs:element name="day" type="xs:string"/>
   <xs:element name="description" type="xs:string"/>
   <xs:element name="end_time" type="xs:string"/>
   <xs:element name="name" type="xs:string"/>
   <xs:element name="room" type="xs:string"/>
   <xs:element name="schedule">
      <xs:complexType>
         <xs:sequence>
            <xs:element ref="room"/>
            <xs:element ref="day"/>
            <xs:element ref="start_time"/>
            <xs:element ref="end_time"/>
            <xs:element ref="credits"/>
         </xs:sequence>
      </xs:complexType>
   </xs:element>
   <xs:element name="school" type="xs:string"/>
   <xs:element name="start_time" type="xs:string"/>
</xs:schema>

The XML document that we want to process is shown in Listing 3.8. As you can see, it is a short, straightforward XML document containing the course catalog from a local university for the fall semester. Note that this document has data stored both in elements (as character data) and in attributes associated with particular elements.

Listing 3.8 University course catalog stored in XML. (Filename: ch3_perlsax_catalog.xml)

<?xml version="1.0" encoding="UTF-8"?>
<!-- Filename: course_catalog_3-1.xml -->
<course_catalog school="XML and Perl University" term="Fall">
   <class name="XML 101">
      <description>Hands on introduction to XML.</description>
      <schedule>
         <room>Lecture Hall 1</room>
         <day>Monday and Wednesday</day>
         <start_time>9:00 AM</start_time>
         <end_time>10:00 AM</end_time>
         <credits>2.0</credits>
      </schedule>
   </class>
   <class name="Perl 101">
      <description>Hands on introduction to Perl.</description>
      <schedule>
         <room>Lecture Hall 2</room>
         <day>Tuesday and Thursday</day>
         <start_time>1:00 PM</start_time>
         <end_time>3:00 PM</end_time>
         <credits>2.5</credits>
      </schedule>
   </class>
   <class name="Writing  for Engineers 101">
      <description>Covers the topic of technical writing.</description>
      <schedule>
         <room>Lecture Hall 3</room>
         <day>Monday and Friday</day>
         <start_time>1:00 AM</start_time>
         <end_time>3:00 PM</end_time>
         <credits>2.0</credits>
      </schedule>
   </class>
</course_catalog>

As you can see, the root element of the document is named <course_catalog>. There are multiple occurrences of class elements, and each class element has several child elements that hold the important information associated with the class.

Let's say that you work for the university and your task is to parse the XML course catalog and generate a report that contains the pertinent information for each class. The output report should have the following format (repeating the class portion of the schedule once for each class):

School: <school name here> - <term> Semester Course Catalog

Class name: <class name here>
-------------------
Description:
Room:
Day:
Start time:
End time:
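One way to pin this format down before writing any parsing code is to express the per-class block as a small formatting function. The sketch below is only an illustration: format_class_block is a hypothetical helper of mine, and it assumes the underline should be exactly as wide as the "Class name: ..." line above it.

```perl
use strict;
use warnings;

# Hypothetical helper: build the report block for one class.
# Assumption: the dashed underline matches the width of the title line.
sub format_class_block {
    my (%field) = @_;

    my $title = "Class name: $field{name}";
    return join("\n",
        $title,
        '-' x length($title),
        "Description: $field{description}",
        "Room: $field{room}",
        "Day: $field{day}",
        "Start time: $field{start_time}",
        "End time: $field{end_time}",
    ) . "\n";
}

print format_class_block(
    name        => 'XML 101',
    description => 'Hands on introduction to XML.',
    room        => 'Lecture Hall 1',
    day         => 'Monday and Wednesday',
    start_time  => '9:00 AM',
    end_time    => '10:00 AM',
);
```

The repetition operator (x) builds the underline in one step, which is a compact alternative to printing one dash at a time in a loop.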

Developing the Application

Before we actually walk through the Perl program that parses the course catalog and generates our report, let's take a minute and think about how you would generate this report if you needed to do it by hand. To parse the document by hand, you would need to perform the following individual tasks:

Read the XML document (shown in Listing 3.8) line by line (starting at the top).
Locate the <course_catalog> element and write the desired attributes.
Look for the <class> elements that we're interested in, then find the schedule elements that belong to each class element.
Write out the element attributes and character data to a report.

Well, guess what? That's exactly how our report generator program will work! Remember, XML::Parser::PerlSAX is a sequential parser, so the document shown in Listing 3.8 will be processed one line at a time (as if you were reading it), starting at the first line in the document and finishing at the end of the document. As we're parsing the document, we'll need to check the element names in the document (returned inside the event handlers) against a list of the element names that are required to generate the output report.
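The check against a list of required element names can be sketched with a lookup hash instead of a chain of comparisons. This is not how Listing 3.9 does it; %label_for and report_line are hypothetical names that I'm using only to show the idea.

```perl
use strict;
use warnings;

# Hypothetical lookup table: element names we want in the report,
# mapped to the label printed in front of their character data.
my %label_for = (
    description => 'Description',
    room        => 'Room',
    day         => 'Day',
    start_time  => 'Start time',
    end_time    => 'End time',
);

# Return a formatted report line, or nothing if the element isn't wanted.
sub report_line {
    my ($element_name, $text) = @_;
    return unless exists $label_for{$element_name};
    return "$label_for{$element_name}: $text\n";
}

for my $pair (['room', 'Lecture Hall 1'], ['credits', '2.0'], ['day', 'Monday and Wednesday']) {
    my $line = report_line(@$pair);
    print $line if defined $line;   # the credits element is silently skipped
}
```

Adding or removing a report field then means editing one line of the table rather than the control flow.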

Note that the credits element appears in our XML document; however, the number of credits for each class is not part of our output report. For this example, we can print out the results whenever we come across one of the elements (or attributes) in which we're interested. Depending on the required output, we don't have to print the results. For example, depending on our requirements, we might parse the results and store them in a database, display them in HTML as part of a web page, or just count the number of elements in an XML document that match a particular set of criteria. Don't worry, I'll cover all these parsing-related tasks (as well as a few others) in upcoming chapters, so let's take a look at the Perl application shown in Listing 3.9, which generates the required output report.
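To show that printing is only one option, here is a minimal sketch of the counting case: a handler that tallies how many elements have a given name and produces no report at all. ElementCounter is a hypothetical package, and the start_element events are fired by hand in place of a parser.

```perl
use strict;
use warnings;

# Hypothetical handler: counts elements whose name matches a target.
{
    package ElementCounter;

    sub new {
        my ($class, $wanted) = @_;
        return bless { wanted => $wanted, count => 0 }, $class;
    }

    sub start_element {
        my ($self, $element) = @_;
        $self->{count}++ if $element->{Name} eq $self->{wanted};
    }
}

# Assumption: these names stand in for the start tags a parser would report.
my $counter = ElementCounter->new('class');
for my $name (qw(course_catalog class description schedule class description)) {
    $counter->start_element({ Name => $name, Attributes => {} });
}

print "Found $counter->{count} class elements\n";   # prints "Found 2 class elements"
```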

Listing 3.9 Program that builds a course catalog report using XML::Parser::PerlSAX. (Filename: ch3_perlsax_app.pl)

1.   use strict;
2.   use XML::Parser::PerlSAX;
3.
4.   # Instantiate a new parser object.
5.   my $saxHandler = SaxHandler->new();
6.   my $parser = XML::Parser::PerlSAX->new(Handler => $saxHandler);
7.   my $inputXmlFile = "ch3_perlsax_catalog.xml";
8.   my %parser_args = (Source => {SystemId => $inputXmlFile});
9.   $parser->parse(%parser_args);
10.
11.   exit;
12.
13.   # Create a new package.
14.   package SaxHandler;
15.   use strict;
16.
17.   my $current_element;
18.
19.   sub new {
20.     my $type = shift;
21.     return bless {}, $type;
22.   }
23.
24.   # start_element event handler
25.   sub start_element {
26.     my ($self, $element) = @_;
27.
28.     my %atts = %{$element->{Attributes}};
29.     my $numAtts = keys(%atts);
30.
31.     # Check to see if this element has attributes.
32.     if ($numAtts > 0) {
33.       my ($thisAtt, $key, $val);
34.       for $key (sort keys %atts) {  # sorted so that school precedes term
35.         $val = $atts{$key};
36.
37.         if ($key eq 'school') {
38.           print "\nSchool: $val - ";
39.         }
40.         elsif ($key eq 'term') {
41.           print "$val Semester Course Catalog\n\n";
42.         }
43.         elsif ($key eq 'name') {
44.           print "Class name: $val\n";
45.           for (my $i = 0; $i < (12 + length($val)); $i++) {
46.             print "-";
47.           }
48.           print "\n";
49.         }
50.       }
51.     }
52.     $current_element = $element->{Name};
53.   }
54.
55.   # characters event handler
56.   sub characters {
57.     my ($self, $character_data) = @_;
58.
59.     my $text = $character_data->{Data};
60.
61.     # Remove leading and trailing whitespace.
62.     $text =~ s/^\s*//;
63.     $text =~ s/\s*$//;
64.
65.     if (length($text)) {
66.       if (($current_element eq 'description')) {
67.         print "Description: $text\n";
68.       }
69.       elsif ($current_element eq 'room') {
70.         print "Room: $text\n";
71.       }
72.       elsif ($current_element eq 'day') {
73.         print "Day: $text\n";
74.       }
75.       elsif ($current_element eq 'start_time') {
76.         print "Start time: $text\n";
77.       }
78.       elsif ($current_element eq 'end_time') {
79.         print "End time: $text\n";
80.       }
81.     }
82.   }
83.
84.   # end_element event handler
85.   sub end_element {
86.     my ($self, $element) = @_;
87.
88.     if ($element->{Name} eq 'class') {
89.       print "\n";
90.     }
91.   }
92.
93.   # start_document event handler
94.   sub start_document {
95.     my ($self) = @_;
96.   }
97.
98.   # end_document event handler
99.   sub end_document {
100.     my ($self) = @_;
101.   }

I've listed the input XML file and the Perl program that parses the input file. Now, let's methodically walk through the Perl program, and I'll explain each of the event handlers, and then we'll see the report that is generated by our parsing program.

Initialization

1–22 This is the main portion of the program. First, we need a use statement to load the module that we want to use, in this case XML::Parser::PerlSAX. In this code block, we create a new XML::Parser::PerlSAX parser object and pass in any required options (as key-value pairs or a single hash). Also, we identify the XML file to be parsed, and call the parse method that actually starts the parsing of the document. This portion of the program is basically responsible for initializing the parser and passing in any required options.

The second half of this code block (starting at line 13) is the beginning of the handler, which is defined as an inline Perl package. This is a simple package that generates our output report.

1.   use strict;
2.   use XML::Parser::PerlSAX;
3.
4.   # Instantiate a new parser object.
5.   my $saxHandler = SaxHandler->new();
6.   my $parser = XML::Parser::PerlSAX->new(Handler => $saxHandler);
7.   my $inputXmlFile = "ch3_perlsax_catalog.xml";
8.   my %parser_args = (Source => {SystemId => $inputXmlFile});
9.   $parser->parse(%parser_args);
10.
11.   exit;
12.
13.   # Create a new package.
14.   package SaxHandler;
15.   use strict;
16.
17.   my $current_element;
18.
19.   sub new {
20.     my $type = shift;
21.     return bless {}, $type;
22.   }
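If you want to experiment with the initialization step without creating a file first, the following sketch parses a short XML string instead. It rests on two assumptions that you should check against the module's documentation: that the Source hash accepts a String key in addition to SystemId, and that the parser only calls the handler methods you define. NameRecorder is a hypothetical handler, and the script skips the parse if XML::Parser::PerlSAX isn't installed.

```perl
use strict;
use warnings;

# Hypothetical handler: remembers the name of each element as it starts.
{
    package NameRecorder;

    sub new {
        my ($class) = @_;
        return bless { names => [] }, $class;
    }

    sub start_element {
        my ($self, $element) = @_;
        push @{ $self->{names} }, $element->{Name};
    }
}

my $xml = '<course_catalog school="XML and Perl University" term="Fall">'
        . '<class name="XML 101"/></course_catalog>';

my $handler = NameRecorder->new();

# Load the module at run time so the script still runs where it is missing.
my $have_module = eval { require XML::Parser::PerlSAX; 1 };

if ($have_module) {
    my $parser = XML::Parser::PerlSAX->new(Handler => $handler);

    # Assumption: Source => { String => ... } parses in-memory XML.
    $parser->parse(Source => { String => $xml });
    print "Elements seen: ", join(', ', @{ $handler->{names} }), "\n";
}
else {
    print "XML::Parser::PerlSAX is not installed; skipping the parse.\n";
}
```

Apart from the source, the sequence is the same as in lines 1 through 9: create the handler, hand it to the parser, and call parse.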

start_element Event Handler

24–53 This is the start_element event handler, which the parser calls each time it reaches the opening tag of an element (for example, <course_catalog>). In the start_element handler, we have access to the attributes associated with this element. As you can see from the example, we receive a reference to a hash of attributes stored as key-value pairs. If we want to extract attributes in a particular order (for example, to satisfy the format of a particular report), we can loop through the hash of key-value pairs until we find the attribute that we're interested in. This enables us to skip over any attributes that aren't important to us. Keep in mind that Perl doesn't guarantee the order in which keys returns the entries of a hash, so code that depends on attribute order should impose one explicitly (for example, by sorting the keys).

XML::Parser::PerlSAX (and SAX in general) doesn't provide a method to identify the current element, so it is your responsibility to track your current location (that is, current element) within the XML document. One way to do this is to set a global variable equal to the current element ($current_element = $element->{Name}) in the start_element handler.

24.   # start_element event handler
25.   sub start_element {
26.     my ($self, $element) = @_;
27.
28.     my %atts = %{$element->{Attributes}};
29.     my $numAtts = keys(%atts);
30.
31.     # Check to see if this element has attributes.
32.     if ($numAtts > 0) {
33.       my ($thisAtt, $key, $val);
34.       for $key (sort keys %atts) {  # sorted so that school precedes term
35.         $val = $atts{$key};
36.
37.         if ($key eq 'school') {
38.           print "\nSchool: $val - ";
39.         }
40.         elsif ($key eq 'term') {
41.           print "$val Semester Course Catalog\n\n";
42.         }
43.         elsif ($key eq 'name') {
44.           print "Class name: $val\n";
45.           for (my $i = 0; $i < (12 + length($val)); $i++) {
46.             print "-";
47.           }
48.           print "\n";
49.         }
50.       }
51.     }
52.     $current_element = $element->{Name};
53.   }
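A single variable holding the current element works for this report, but it is never reset when an element ends, so after </description> it still says description. If you need to know where you are at all times, one alternative is to keep a stack of open elements. The sketch below is only an illustration of that idea; PathTracker is a hypothetical package and the events are fired by hand.

```perl
use strict;
use warnings;

# Hypothetical handler: tracks the chain of currently open elements.
{
    package PathTracker;

    sub new {
        my ($class) = @_;
        return bless { stack => [], paths => [] }, $class;
    }

    sub start_element {
        my ($self, $element) = @_;
        push @{ $self->{stack} }, $element->{Name};
        push @{ $self->{paths} }, join('/', @{ $self->{stack} });
    }

    sub end_element {
        my ($self, $element) = @_;
        pop @{ $self->{stack} };
    }

    # Innermost open element, or undef outside the root element.
    sub current_element {
        my ($self) = @_;
        return $self->{stack}[-1];
    }
}

# Assumption: these calls stand in for events from a real parser.
my $tracker = PathTracker->new();
$tracker->start_element({ Name => 'course_catalog', Attributes => {} });
$tracker->start_element({ Name => 'class',          Attributes => {} });
$tracker->start_element({ Name => 'description',    Attributes => {} });
$tracker->end_element({ Name => 'description' });

print "Current element: ", $tracker->current_element(), "\n";   # prints "Current element: class"
print "$_\n" for @{ $tracker->{paths} };
```

Storing the stack in the handler object also avoids the package-level variable, so two handler objects can be used at the same time without interfering with each other.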

characters Event Handler

55–82 The characters event handler is called whenever the parser encounters the content of an element (that is, the data between a pair of start and end tags). One thing you need to do is remove the leading and trailing whitespace from the character data. This is because our XML document has extra newlines and indentation that have been inserted for human readability, but aren't required by XML or the parser. The parser reports that whitespace as character data too, so we strip it and ignore anything that is empty afterward.

After removing the whitespace, we can search for particular element names using the $current_element scalar, which is declared at the top of the package and set inside the start_element handler. Note that we're explicitly searching for the element name because we don't want to include every element name. Did you notice that our original XML document included a credits element, but that we didn't need to include credits as part of the course catalog? Because we didn't try to match the credits element in the characters event handler, it won't show up in our generated report.

55.   # characters event handler
56.   sub characters {
57.     my ($self, $character_data) = @_;
58.
59.     my $text = $character_data->{Data};
60.
61.     # Remove leading and trailing whitespace.
62.     $text =~ s/^\s*//;
63.     $text =~ s/\s*$//;
64.
65.     if (length($text)) {
66.       if (($current_element eq 'description')) {
67.         print "Description: $text\n";
68.       }
69.       elsif ($current_element eq 'room') {
70.         print "Room: $text\n";
71.       }
72.       elsif ($current_element eq 'day') {
73.         print "Day: $text\n";
74.       }
75.       elsif ($current_element eq 'start_time') {
76.         print "Start time: $text\n";
77.       }
78.       elsif ($current_element eq 'end_time') {
79.         print "End time: $text\n";
80.       }
81.     }
82.   }
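One caveat worth knowing: a SAX parser is allowed to deliver the text of a single element in more than one characters call, in which case a handler that prints immediately would print a label for each piece. A common way around this is to collect the pieces in a buffer and act on them in end_element. The sketch below illustrates that pattern under the usual assumptions: TextBuffer is a hypothetical package, and the two characters calls simulate a parser splitting one description into two chunks.

```perl
use strict;
use warnings;

# Hypothetical handler: buffers character data until the element ends.
{
    package TextBuffer;

    sub new {
        my ($class) = @_;
        return bless { buffer => '', found => [] }, $class;
    }

    sub start_element {
        my ($self, $element) = @_;
        $self->{buffer} = '';            # start collecting afresh
    }

    sub characters {
        my ($self, $character_data) = @_;
        $self->{buffer} .= $character_data->{Data};
    }

    sub end_element {
        my ($self, $element) = @_;

        my $text = $self->{buffer};
        $text =~ s/^\s+//;               # remove leading whitespace
        $text =~ s/\s+$//;               # remove trailing whitespace
        push @{ $self->{found} }, "$element->{Name}: $text" if length $text;

        $self->{buffer} = '';
    }
}

# Assumption: the parser splits the description text into two chunks.
my $buffered = TextBuffer->new();
$buffered->start_element({ Name => 'description', Attributes => {} });
$buffered->characters({ Data => 'Hands on ' });
$buffered->characters({ Data => "introduction to XML.\n" });
$buffered->end_element({ Name => 'description' });

print "$_\n" for @{ $buffered->{found} };   # prints "description: Hands on introduction to XML."
```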

end_element Event Handler

84–91 The end_element event handler is called whenever the parser encounters the end tag of an element. For this particular example, we are only looking for the end tag of each class element, so that we can insert a newline between consecutive class elements.

84.   # end_element event handler
85.   sub end_element {
86.     my ($self, $element) = @_;
87.
88.     if ($element->{Name} eq 'class') {
89.       print "\n";
90.     }
91.   }

After running the XML::Parser::PerlSAX parser, you will see the output in Listing 3.10. As you can see, we've completed our initial task of generating a course catalog report based on the initial requirements.

Listing 3.10 Output report generated by a program using the XML::Parser::PerlSAX module. (Filename: ch3_perlsax_report.txt)

School: XML and Perl University - Fall Semester Course Catalog

Class name: XML 101
-------------------
Description: Hands on introduction to XML.
Room: Lecture Hall 1
Day: Monday and Wednesday
Start time: 9:00 AM
End time: 10:00 AM

Class name: Perl 101
--------------------
Description: Hands on introduction to Perl.
Room: Lecture Hall 2
Day: Tuesday and Thursday
Start time: 1:00 PM
End time: 3:00 PM

Class name: Writing  for Engineers 101
--------------------------------------
Description: Covers the topic of technical writing.
Room: Lecture Hall 3
Day: Monday and Friday
Start time: 1:00 AM
End time: 3:00 PM

XML::Parser::PerlSAX Event Handlers

Because XML::Parser::PerlSAX is based on SAX1, it only supports a subset of the functions provided by SAX2. Occasions may arise when you need to use a SAX1-based parser; however, I would suggest using a SAX2-based parser for new projects. SAX2 adds additional functionality, and those additions are discussed in the next section.

Note

At the time of this writing, the Perl SAX specifications are due to be published by the Perl XML community. Unfortunately, we were unable to make the publishing deadline to include this content. Please check at the official Perl XML project home page (http://www.xmlproj.com) for additional information.