XML::Sablotron Perl Module


This example demonstrates how to use the XML::Sablotron Perl module. Sablotron is an open source, multi-platform XML toolkit that was developed in C++ by the Ginger Alliance Open Resource Center (http://www.gingerall.com).

The Sablotron toolkit implements an XSLT processor, a DOM parser, and the XPath standard. It provides extensions and APIs for several programming languages, and it is the basis for the XML::Sablotron Perl module that was developed by Pavel Hlavnicka.

XML::Sablotron Perl Module Example

Let's take a look at a Perl program that uses the XML::Sablotron module. In this example, we filter an XML document by removing particular elements and generate another XML document as the output of the transformation. This type of transformation could be used between two companies that have agreed to exchange data in XML but have not agreed on a common DTD or XML schema. Our example illustrates another common situation: several elements (or attributes) need to be changed or filtered out of the output version of the XML document.

Input XML Document

For this example, let's assume that you work for a major accounting firm and that you're responsible for collecting and storing information for the annual corporate report. To provide maximum flexibility (and because you're a forward-thinking individual), the information is stored in an XML document. However, some of the information in the XML document is considered to be proprietary because you've signed non-disclosure agreements with your customers, promising not to publicly disclose certain information. So, several of the elements in your input XML document must be removed before sending the XML document to the production staff.

Before we look at the XML document containing the annual report information, let's look at the DTD. The DTD for the annual report XML document is shown in Listing 8.15.

Listing 8.15 DTD for the XML annual report. (Filename: ch8_sab_annual_report.dtd)
 <?xml version="1.0" encoding="UTF-8"?>
 <!ELEMENT annual_report (customer*)>
 <!ELEMENT customer (name, poc, years, revenue, telephone_num)>
 <!ATTLIST customer account_number CDATA #REQUIRED>
 <!ELEMENT name (#PCDATA)>
 <!ELEMENT poc (#PCDATA)>
 <!ELEMENT years (#PCDATA)>
 <!ELEMENT revenue (#PCDATA)>
 <!ELEMENT telephone_num (#PCDATA)>

As you can see, the XML document contains an <annual_report> root element that is comprised of multiple <customer> elements. Each customer element contains the customer account number, customer name, Point of Contact (POC) (that is, who you deal with directly at the company), how long you've had the company as a client, how much revenue your company has generated from the customer, and the telephone number of your POC at the company.

The XML schema for the annual report XML document is shown in Listing 8.16.

Listing 8.16 XML schema for the annual report. (Filename: ch8_sab_annual_report.xsd)
 <?xml version="1.0" encoding="UTF-8"?>
 <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  elementFormDefault="qualified">
    <xs:element name="annual_report">
       <xs:complexType>
          <xs:sequence>
             <xs:element ref="customer" minOccurs="0" maxOccurs="unbounded"/>
          </xs:sequence>
       </xs:complexType>
    </xs:element>
    <xs:element name="customer">
       <xs:complexType>
          <xs:sequence>
             <xs:element ref="name"/>
             <xs:element ref="poc"/>
             <xs:element ref="years"/>
             <xs:element ref="revenue"/>
             <xs:element ref="telephone_num"/>
          </xs:sequence>
          <xs:attribute name="account_number" type="xs:string" use="required"/>
       </xs:complexType>
    </xs:element>
    <xs:element name="name" type="xs:string"/>
    <xs:element name="poc" type="xs:string"/>
    <xs:element name="revenue" type="xs:string"/>
    <xs:element name="telephone_num" type="xs:string"/>
    <xs:element name="years" type="xs:integer"/>
 </xs:schema>

Removing the proprietary elements isn't a difficult task if the XML document contains information for only a few customers. If the file were small, you could just edit it by hand, right? Fortunately, your employer is one of the major accounting firms. Unfortunately, that means the XML document contains information on tens of thousands of past and present customers, so editing it by hand is not an option. Luckily for us, we can use an XSLT stylesheet to filter the XML document. Let's take a look at the input XML document shown in Listing 8.17, so that we're familiar with the format. After we understand what needs to be changed in the XML document, we'll move on to the stylesheet.

Listing 8.17 Input annual report XML document. (Filename: ch8_sab_annual_report.xml)
 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE annual_report SYSTEM "annual_report.dtd">
 <annual_report>
    <customer account_number="id_1">
       <name>Microsoft Corporation</name>
       <poc>Bill Gates</poc>
       <years>10</years>
       <revenue>1,000.00</revenue>
       <telephone_num>111-222-3333</telephone_num>
    </customer>
    <customer account_number="id_2">
       <name>Oracle Corporation</name>
       <poc>Larry Ellison</poc>
       <years>12</years>
       <revenue>1,700.00</revenue>
       <telephone_num>222-333-4444</telephone_num>
    </customer>
    <customer account_number="id_3">
       <name>Cisco Systems</name>
       <poc>John Chambers</poc>
       <years>5</years>
       <revenue>2000.00</revenue>
       <telephone_num>333-444-5555</telephone_num>
    </customer>
    <customer account_number="id_4">
       <name>Dell Computer Corporation</name>
       <poc>Michael Dell</poc>
       <years>6</years>
       <revenue>3000.00</revenue>
       <telephone_num>444-555-6666</telephone_num>
    </customer>
 </annual_report>

In the final generated annual report, we'd like to remove the <revenue> and <telephone_num> elements. While we want to promote the fact that we have these high-profile companies as clients, we don't want to show our revenues or release the private telephone numbers of their Chief Executive Officers (CEOs).

This is a fairly easy task if we use an XSLT stylesheet to filter the input XML document. Let's take a look at the XSLT stylesheet that performs this filtering for us.

XSLT Stylesheet

Now that we understand the format of the input XML document and understand which elements have to be removed, let's take a look at the XSLT stylesheet that performs our transformation. The XSLT stylesheet is shown in Listing 8.18. Let's step through the stylesheet and make sure that we understand how it works.

Listing 8.18 XSLT stylesheet to filter proprietary elements from the input XML document. (Filename: ch8_sab_annual_report.xslt)
 1.   <?xml version="1.0" encoding="utf-8"?>
 2.   <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 3.   <xsl:output method="xml" doctype-system="ch8_sab_annual_report.dtd"/>
 4.
 5.      <xsl:template match="/">
 6.         <annual_report>
 7.            <xsl:apply-templates/>
 8.         </annual_report>
 9.      </xsl:template>
 10.
 11.     <xsl:template match="annual_report">
 12.        <xsl:for-each select="customer">
 13.           <xsl:text>&#10;</xsl:text>
 14.           <customer>
 15.              <xsl:attribute name="account_number" >
 16.                 <xsl:value-of select="@account_number"/>
 17.              </xsl:attribute>
 18.
 19.              <xsl:text>&#10;</xsl:text>
 20.              <xsl:element name="name">
 21.                 <xsl:value-of select="name"/>
 22.              </xsl:element>
 23.
 24.              <xsl:text>&#10;</xsl:text>
 25.              <xsl:element name="poc">
 26.                 <xsl:value-of select="poc"/>
 27.              </xsl:element>
 28.
 29.              <xsl:text>&#10;</xsl:text>
 30.              <xsl:element name="years">
 31.                 <xsl:value-of select="years"/>
 32.              </xsl:element>
 33.              <xsl:text>&#10;</xsl:text>
 34.           </customer>
 35.        </xsl:for-each>
 36.     </xsl:template>
 37.  </xsl:stylesheet>

1-3 The opening portion of the XSLT stylesheet contains the standard XSLT stylesheet declarations that we discussed a little earlier in this chapter. The <xsl:output> element specifies some of the characteristics of the transformed document. In our case, we use only the method attribute, to specify that the output document will be XML, and the doctype-system attribute, to specify the system identifier that appears after the SYSTEM keyword in the DOCTYPE declaration.

 1.   <?xml version="1.0" encoding="utf-8"?>
 2.   <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 3.   <xsl:output method="xml" doctype-system="ch8_sab_annual_report.dtd"/>

5-9 The <xsl:template> element with match="/" defines the template that processes the document root, that is, the node that contains the entire XML document. The <xsl:apply-templates> element tells the XSLT processor to apply templates to a node set. Because no select attribute is given, the node set is the children of the current node, which in our case is the <annual_report> element of the input document. As you can see, the results are wrapped in the start and end tags of the output root element, <annual_report>.

 5.      <xsl:template match="/">
 6.         <annual_report>
 7.            <xsl:apply-templates/>
 8.         </annual_report>
 9.      </xsl:template>

11-37 This section of the XSLT stylesheet performs the majority of the work in this transformation. First, we define a template that matches the <annual_report> element. Starting at the <annual_report> element (our root element), we iterate through all the <customer> elements. The <xsl:text> element containing &#10; inserts a newline into the output XML document. Note that these newlines are not required (the whitespace between elements carries no meaning for this document), but we've added them for readability. Otherwise, your XML document would be printed on one long line.

 11.     <xsl:template match="annual_report">
 12.        <xsl:for-each select="customer">
 13.           <xsl:text>&#10;</xsl:text>
 14.           <customer>
 15.              <xsl:attribute name="account_number" >
 16.                 <xsl:value-of select="@account_number"/>
 17.              </xsl:attribute>
 18.
 19.              <xsl:text>&#10;</xsl:text>
 20.              <xsl:element name="name">
 21.                 <xsl:value-of select="name"/>
 22.              </xsl:element>
 23.
 24.              <xsl:text>&#10;</xsl:text>
 25.              <xsl:element name="poc">
 26.                 <xsl:value-of select="poc"/>
 27.              </xsl:element>
 28.
 29.              <xsl:text>&#10;</xsl:text>
 30.              <xsl:element name="years">
 31.                 <xsl:value-of select="years"/>
 32.              </xsl:element>
 33.              <xsl:text>&#10;</xsl:text>
 34.           </customer>
 35.        </xsl:for-each>
 36.     </xsl:template>
 37.  </xsl:stylesheet>

Between the opening and closing tags for the <customer> element is where our filtering takes place. Remember, each <customer> element has an account_number attribute and <name>, <poc>, <years>, <revenue>, and <telephone_num> child elements. At this point in the stylesheet, we have access to the attribute and all the child elements of each <customer> element. So, all we need to do to perform our filtering is create the items we want to keep (the account_number attribute and the <name>, <poc>, and <years> elements), and skip over those that we don't want (<revenue> and <telephone_num>).

As you can see, first we create an attribute in the output document by using <xsl:attribute>. We use the same name and retrieve the current value of the attribute. Note that to retrieve the current value, we need an @ symbol in front of the attribute name.

We then create each new element with <xsl:element>, using the same name as the original element, and retrieve the value of the corresponding input element with <xsl:value-of>. Note that we could have easily mapped the incoming element name to a new name in the output document. Also, instead of dropping the proprietary elements (revenue and telephone_num), we could have kept them and replaced their character data with a string such as "PROPRIETARY".

Now that we've discussed the XSLT stylesheet, let's take a look at the Perl program that actually does the work for us.

XML::Sablotron-Based Perl Filtering Program

Because we've looked at the input XML document and the XSLT stylesheet, we know what the output of the transformation process should be. But how do we do it? Let's take a look at the XML::Sablotron-based Perl program shown in Listing 8.19.

Listing 8.19 XML::Sablotron Perl program that filters an input XML document. (Filename: ch8_sab_app.pl)
 1.   use strict;
 2.   use XML::Sablotron;
 3.
 4.   # Open the XSLT stylesheet.
 5.   open (XSLT, "ch8_sab_annual_report.xslt");
 6.   undef $/;
 7.   my $xslt = <XSLT>;
 8.   close XSLT;
 9.
 10.  # Open the input XML document.
 11.  open (INPUT_XML, "ch8_sab_annual_report.xml");
 12.  undef $/;
 13.  my $inputXML = <INPUT_XML>;
 14.  close INPUT_XML;
 15.
 16.  # Call the performTransform() subroutine, store the
 17.  # results in $transformedDoc.
 18.  my $transformedDoc = performTransform ($xslt, $inputXML);
 19.
 20.  # Write the results to an output file.
 21.  open (XML_REPORT, "> ch8_sab_filtered_report.xml")
 22.    or die "Can't open ch8_sab_filtered_report.xml $!\n";
 23.  print XML_REPORT $transformedDoc;
 24.
 25.  close (XML_REPORT);
 26.
 27.  ###################################
 28.  sub performTransform {
 29.    my ($xsltDoc, $xmlDoc) = @_;
 30.
 31.    # Instantiate the new Sablotron objects.
 32.    my $sabObject = new XML::Sablotron;
 33.    my $sitObject = new XML::Sablotron::Situation;
 34.
 35.    # Pass the required arguments to the Sablotron objects.
 36.    $sabObject->addArg($sitObject, 'xslt', $xsltDoc);
 37.    $sabObject->addArg($sitObject, 'xml', $xmlDoc);
 38.
 39.    # Perform the transformation.
 40.    $sabObject->process($sitObject, 'arg:/xslt', 'arg:/xml', 'arg:/output');
 41.
 42.    # Retrieve the results.
 43.    my $result = $sabObject->getResultArg('arg:/output');
 44.
 45.    return ($result);
 46.  }

1-14 The opening section of the program contains the usual use strict pragma, as well as the use XML::Sablotron statement that is required to load the XML::Sablotron module.

This example is a little longer than it strictly needs to be because we're going to do a few things differently. In our previous examples, we read in the XML and XSLT stylesheet files directly (that is, we used the parse_file method in the XML::LibXSLT example). Here, we use a subroutine that performs the transformation. It accepts the input XML document and XSLT stylesheet as arguments and returns the transformed XML document.

As you can see, we open and read the XSLT stylesheet and the input XML document and store the contents in the $xslt and $inputXML scalars, respectively. Pay particular attention to the undef function in this example. The $/ variable is the input record separator, which is a newline by default. The construct undef $/ undefines the input record separator, so the entire file is read into the scalar. With the default separator, the construct <FILE> in scalar context would read only one line at a time. Because $/ is a global variable, the change stays in effect for the rest of the program, so the second undef $/ on line 12 is redundant but harmless.

 1.   use strict;
 2.   use XML::Sablotron;
 3.
 4.   # Open the XSLT stylesheet.
 5.   open (XSLT, "ch8_sab_annual_report.xslt");
 6.   undef $/;
 7.   my $xslt = <XSLT>;
 8.   close XSLT;
 9.
 10.  # Open the input XML document.
 11.  open (INPUT_XML, "ch8_sab_annual_report.xml");
 12.  undef $/;
 13.  my $inputXML = <INPUT_XML>;
 14.  close INPUT_XML;
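The slurp step above changes $/ for the whole program. A common alternative, sketched here, confines the change to a small helper so that the rest of the program keeps the default line-by-line behavior. The helper name slurpFile is hypothetical, and the three-argument open with a lexical filehandle is a substitution for the bareword filehandles used in Listing 8.19, not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Listing 8.19):
# read an entire file into a single scalar.
sub slurpFile {
    my ($path) = @_;

    # Three-argument open with a lexical filehandle;
    # stop with a message if the file cannot be opened.
    open(my $fh, '<', $path) or die "Can't open $path: $!\n";

    # local undefines $/ only until the do block exits,
    # so only this one read is performed in slurp mode.
    my $contents = do { local $/; <$fh> };

    close($fh);
    return $contents;
}
```

With this helper, lines 4-14 of the listing could be reduced to two calls such as my $xslt = slurpFile("ch8_sab_annual_report.xslt"); and the program would also report a missing input file instead of silently passing an empty value to the processor.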

16-25 Now we have the XSLT stylesheet and the XML input document in scalars. We call the locally defined performTransform subroutine and pass in the XSLT stylesheet and XML input document scalars, and the transformed document is returned. Note that if the XSLT stylesheet or XML input document were large, you would probably want to pass references to the scalars instead, to avoid copying the data.

 16.  # Call the performTransform() subroutine, store the
 17.  # results in $transformedDoc.
 18.  my $transformedDoc = performTransform ($xslt, $inputXML);
 19.
 20.  # Write the results to an output file.
 21.  open (XML_REPORT, "> ch8_sab_filtered_report.xml")
 22.    or die "Can't open ch8_sab_filtered_report.xml $!\n";
 23.  print XML_REPORT $transformedDoc;
 24.
 25.  close (XML_REPORT);

Note

For additional information about Perl references, see perldoc perlref.
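The reference-passing variation mentioned above might look like the following sketch. To keep it runnable without Sablotron installed, the hypothetical subroutine describeInputs only reports the size of each document; in the real program, the dereferenced values $$xsltRef and $$xmlRef would be handed to addArg() instead. The two short strings are placeholder data, not the chapter's files.

```perl
use strict;
use warnings;

# Hypothetical stand-in for performTransform(): it receives references,
# so the (possibly large) document strings are not copied into new variables.
sub describeInputs {
    my ($xsltRef, $xmlRef) = @_;

    # $$ref dereferences a scalar reference to reach the original string.
    return sprintf("stylesheet: %d characters, document: %d characters",
                   length($$xsltRef), length($$xmlRef));
}

# Placeholder data standing in for the slurped files.
my $xslt     = '<xsl:stylesheet/>';
my $inputXML = '<annual_report/>';

# The backslash operator creates a reference to each scalar.
print describeInputs(\$xslt, \$inputXML), "\n";
```

The only changes on the calling side are the backslashes in front of the scalars; inside the subroutine, every use of the data needs the extra $ to dereference it.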


28-46 This subroutine performs all the work for us in this program. After assigning the input scalars to local variables, we need to instantiate new XML::Sablotron and XML::Sablotron::Situation objects. The XML::Sablotron object is our XSLT processor, while the XML::Sablotron::Situation object is more of a helper object for the processor. The XML::Sablotron::Situation object provides several methods that are useful for debugging particular situations.

 28.  sub performTransform {
 29.    my ($xsltDoc, $xmlDoc) = @_;
 30.
 31.    # Instantiate the new Sablotron objects.
 32.    my $sabObject = new XML::Sablotron;
 33.    my $sitObject = new XML::Sablotron::Situation;
 34.
 35.    # Pass the required arguments to the Sablotron objects.
 36.    $sabObject->addArg($sitObject, 'xslt', $xsltDoc);
 37.    $sabObject->addArg($sitObject, 'xml', $xmlDoc);
 38.
 39.    # Perform the transformation.
 40.    $sabObject->process($sitObject, 'arg:/xslt', 'arg:/xml', 'arg:/output');
 41.
 42.    # Retrieve the results.
 43.    my $result = $sabObject->getResultArg('arg:/output');
 44.
 45.    return ($result);
 46.  }

After instantiating the objects, we call the addArg() method to pass arguments to the XSLT processor. The arguments to the addArg() method are the situation object, the buffer name, and the XML data. In our case, the situation object is named $sitObject, the buffers are named xslt and xml (and are later referred to using the arg:/name scheme), and the XML data is passed in as a scalar. After all the arguments are provided, we call the XML::Sablotron method process() to actually perform the transformation, telling it which buffers hold the stylesheet and the input document and which buffer should receive the output. Finally, the result is retrieved from the XSLT processor by calling the getResultArg() method, returned to the caller, and then written to an output file. The output file contains the transformed XML document and is discussed in the next section.

Note

The Sablotron module has dependencies on two additional packages. To use the Sablotron module, you also need to install the Sablotron and Expat XML parser libraries. For additional information and links to download these libraries, see perldoc XML::Sablotron.


Generated XML Output Document

The Perl program generates the filtered XML document that is shown in Listing 8.20. As you can see, the <revenue> and <telephone_num> elements have been removed from each occurrence of a <customer> element.

Listing 8.20 Output filtered XML document. (Filename: ch8_sab_filtered_report.xml)
 <?xml version="1.0"?>
 <!DOCTYPE annual_report SYSTEM "ch8_sab_annual_report.dtd">
 <annual_report>
    <customer account_number="id_1">
       <name>Microsoft Corporation</name>
       <poc>Bill Gates</poc>
       <years>10</years>
    </customer>
    <customer account_number="id_2">
       <name>Oracle Corporation</name>
       <poc>Larry Ellison</poc>
       <years>12</years>
    </customer>
    <customer account_number="id_3">
       <name>Cisco Systems</name>
       <poc>John Chambers</poc>
       <years>5</years>
    </customer>
    <customer account_number="id_4">
       <name>Dell Computer Corporation</name>
       <poc>Michael Dell</poc>
       <years>6</years>
 </customer></annual_report>
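Because the point of the exercise is to keep proprietary data out of the published report, it may be worth checking the generated document before handing it off. The following is a minimal sketch of such a check, assuming the filtered document is available as a string. The subroutine name findProprietary is hypothetical, and the simple pattern match is only a quick sanity test, not a substitute for parsing the document.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of any proprietary
# elements that still appear in the given XML string.
sub findProprietary {
    my ($xml) = @_;
    my @found;

    for my $name (qw(revenue telephone_num)) {
        # Match a start tag such as <revenue>, <revenue ...>, or <revenue/>.
        push @found, $name if $xml =~ /<\Q$name\E[\s>\/]/;
    }
    return @found;
}

# Sample data: one filtered customer record from Listing 8.20.
my $sample = '<customer account_number="id_1"><name>Microsoft Corporation</name>'
           . '<poc>Bill Gates</poc><years>10</years></customer>';

my @leaks = findProprietary($sample);
print @leaks ? "Still present: @leaks\n" : "No proprietary elements found\n";
```

In the filtering program, the same check could be applied to $transformedDoc just before it is written to ch8_sab_filtered_report.xml.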


XML and Perl
ISBN: 0735712891
Year: 2002
Pages: 145