Tree-Based XML Parser Modules
Tree-based XML parsing is one of the two most popular ways of parsing XML, along with stream-based parsing, which was discussed in Chapter 3. The Document Object Model (DOM) standard defines a standard way to parse and represent XML data in a tree-based parser. Tree-based parsers are also called in-memory parsers because, unlike stream-based parsers, which iterate over each piece of data and keep very little state information, tree-based parsers iterate over the XML document and create an in-memory representation of it for later use. You can then use Perl's standard facilities to access the information in the data structure. The
XML::Simple Perl Module
Back in Chapter 2, "Now Let's Get Our Hands Dirty," you saw a small example using the XML::Simple Perl module to parse a document. In this section, we'll discuss another example using the XML::Simple Perl module and explain the XML::Simple API and some of the more useful options.
The XML::Simple Perl module was written by Grant McLean. As an XML parser module, it is (as you may have guessed from the
XML::Simple Perl Module Example
Some of our work involves working with databases, usually through the Perl Database Interface (DBI). The Perl DBI provides a consistent, database-independent interface to an application, regardless of the database that is being used. This means that a database application using the Perl DBI can use a consistent interface for a number of databases. This is a powerful capability that enables you to develop a very flexible application. The same application (with literally a one-line change) can access an Oracle database, a Microsoft SQL Server database, or the
Note We will be discussing the Perl DBI in Chapter 6, "Generating XML Documents from Databases." If you can't wait until Chapter 6 for this discussion of the Perl DBI, take a look at the official Perl DBI page http://dbi.perl.org.
The reason that I mentioned the Perl DBI is that for it to
It's usually a good idea to store this type of configuration information outside your application. Storing this configuration information external to the application eliminates the need for a
Listing 4.1 Configuration file DTD. (Filename: ch4_simple_config.dtd)
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT db_config_information (db_user,db_server+)>
<!ELEMENT db_user (username,password)>
<!ELEMENT username (#PCDATA)>
<!ELEMENT password (#PCDATA)>
<!ELEMENT db_server (server_ip,db_driver,port)>
<!ATTLIST db_server hostname CDATA #IMPLIED
os CDATA #IMPLIED>
<!ELEMENT server_ip (#PCDATA)>
<!ELEMENT db_driver (#PCDATA)>
<!ELEMENT port (#PCDATA)>
As you can see, this is a very simple DTD that contains configuration information. Note that the DTD supports multiple database servers for each user. If this were a longer configuration file (for example, a few hundred lines), using XML for the configuration file format would clearly be an advantage. By storing the configuration information in XML, you can easily verify that the configuration file is well-
Listing 4.2 XML schema that describes the application configuration file. (Filename: ch4_simple_config.xsd)
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified">
<xs:element name="db_config_information">
<xs:complexType>
<xs:sequence>
<xs:element ref="db_user"/>
<xs:element ref="db_server" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="db_driver" type="xs:string"/>
<xs:element name="db_server">
<xs:complexType>
<xs:sequence>
<xs:element ref="server_ip"/>
<xs:element ref="db_driver"/>
<xs:element ref="port"/>
</xs:sequence>
<xs:attribute name="hostname" type="xs:string"/>
<xs:attribute name="os" type="xs:string"/>
</xs:complexType>
</xs:element>
<xs:element name="db_user">
<xs:complexType>
<xs:sequence>
<xs:element ref="username"/>
<xs:element ref="password"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="password" type="xs:string"/>
<xs:element name="port" type="xs:positiveInteger"/>
<xs:element name="server_ip" type="xs:string"/>
<xs:element name="username" type="xs:string"/>
</xs:schema>
We've shown the DTD and the XML schema; now let's take a look at the XML configuration file shown in Listing 4.3. It contains the user and database server information required to establish Perl DBI connections. As you can see, this configuration file contains information for the username "mark", and shows my connection information for two servers named "rocket" and "
Listing 4.3 XML configuration file. (Filename: ch4_simple_config.xml)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE db_config_information SYSTEM "ch4_simple_config.dtd">
<db_config_information>
<db_user>
<username>mark</username>
<password>mark's password</password>
</db_user>
<db_server hostname="rocket" os="Linux">
<server_ip>192.168.1.10</server_ip>
<db_driver>MySQL</db_driver>
<port>3306</port>
</db_server>
<db_server hostname="scooter" os="Microsoft Windows">
<server_ip>192.168.1.50</server_ip>
<db_driver>ODBC</db_driver>
<port>3379</port>
</db_server>
</db_config_information>
Tree-based parsers are easier to understand when you can see how the data is stored after it has been parsed. This makes it much easier to access the particular elements that you are interested in retrieving. Figure 4.2 shows a graphical representation of the XML configuration file as it is stored in memory after parsing by the XML::Simple Perl module. As you should expect, we have one root element (<db_config_information>) that has three child elements (<db_user>, <db_server[0]>, and <db_server[1]>). Note the index next to each <db_server> element. This is used to
Figure 4.2. Tree representation of the XML configuration file.
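One way to see the in-memory tree for yourself is to dump it with the core Data::Dumper module. The following is a minimal sketch rather than part of the chapter's listings: it assumes XML::Simple is installed, and it embeds an abbreviated copy of the configuration file (one server only) as a string so that it runs on its own.

```perl
use strict;
use warnings;
use XML::Simple;
use Data::Dumper;

# Abbreviated copy of the configuration file (one server only),
# embedded here so that the sketch is self-contained.
my $xml = <<'END_XML';
<db_config_information>
  <db_user>
    <username>mark</username>
    <password>mark's password</password>
  </db_user>
  <db_server hostname="rocket" os="Linux">
    <server_ip>192.168.1.10</server_ip>
    <db_driver>MySQL</db_driver>
    <port>3306</port>
  </db_server>
</db_config_information>
END_XML

# forcearray => 1 makes every child element an array reference,
# even when it occurs only once.
my $root = XMLin($xml, forcearray => 1);

$Data::Dumper::Sortkeys = 1;    # print hash keys in a stable order
$Data::Dumper::Indent   = 1;    # use compact indentation
my $dump = Dumper($root);
print $dump;
```

In the dump, child elements such as server_ip appear as array references, while the hostname and os attributes appear as plain strings, which is why the two are accessed differently later in this section.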
Now let's take a look at the Perl program in Listing 4.4. This program was built with the XML::Simple module and parses the XML configuration file that was shown earlier in Listing 4.3.
Listing 4.4 Configuration file parser built with the XML::Simple Perl module. (Filename: ch4_simple_app.pl)
1. use strict;
2. use XML::Simple;
3.
4. # Parse the input XML document
5. my $root = XMLin("./ch4_simple_config.xml", forcearray=>1);
6.
7. # Print config info.
8. print "Configuration Information\n";
9. print "User: $root->{db_user}->[0]->{username}->[0]\n";
10. print "Password: $root->{db_user}->[0]->{password}->[0]\n\n";
11.
12. # Print info about the first server (note indexes = 0).
13. print "Server #1\n";
14. print "Hostname: $root->{db_server}->[0]->{hostname}\n";
15. print "OS: $root->{db_server}->[0]->{os}\n";
16. print "IP Address: $root->{db_server}->[0]->{server_ip}->[0]\n";
17. print "DB Driver: $root->{db_server}->[0]->{db_driver}->[0]\n";
18. print "Port: $root->{db_server}->[0]->{port}->[0]\n\n";
19.
20. # Print info about the second server (note indexes = 1).
21. print "Server #2\n";
22. print "Hostname: $root->{db_server}->[1]->{hostname}\n";
23. print "OS: $root->{db_server}->[1]->{os}\n";
24. print "IP Address: $root->{db_server}->[1]->{server_ip}->[0]\n";
25. print "DB Driver: $root->{db_server}->[1]->{db_driver}->[0]\n";
26. print "Port: $root->{db_server}->[1]->{port}->[0]\n";
Initialization and Parsing 1-5
The first section of this program contains the usual pragma statement (use strict) that we include in all our programs. Because we're using the XML::Simple Perl module, we load it with the use XML::Simple statement.
As mentioned earlier, XML::Simple is built on top of the XML::Parser Perl module that we discussed in Chapter 3. We don't need a use XML::Parser statement because this is already handled for us inside the XML::Simple Perl module. The XML::Simple Perl module exports two functions: XMLin() and XMLout(). As you may have guessed from the
The XMLin() function parses XML data and returns a reference to a data structure that contains the information stored in the XML document. This function can accept an input filename, an undefined filename, or a string containing XML data. In our example, we have provided the name of an input XML document, ch4_simple_config.xml. If undef is provided as the filename, the XML::Simple module looks for an XML file that has the same base name as the program (by default, in the directory that contains the program). For example, if your parsing program is named foo.pl and you don't provide an input filename or a string containing XML data, the XML::Simple module looks for a file named foo.xml and tries to parse it. Passing a scalar that contains an XML string is helpful if the XML data is being generated dynamically, because it saves you the trouble of writing the XML data to a file and then opening the file. In this example, $root is a reference to a data structure that contains all the information stored in the XML configuration file. Now that the XML document has been parsed, the information is stored for us in this data structure, and we just need to go and retrieve it.
1. use strict;
2. use XML::Simple;
3.
4. # Parse the input XML document
5. my $root = XMLin("./ch4_simple_config.xml", forcearray=>1);
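The filename and string forms of XMLin() can be compared side by side. The sketch below is illustrative only: the one-element document and the temporary file (created with the core File::Temp module) are stand-ins chosen to keep the example self-contained, and are not part of the chapter's configuration files.

```perl
use strict;
use warnings;
use XML::Simple;
use File::Temp qw(tempfile);

# A tiny stand-in document; any well-formed XML would do.
my $xml = '<db_user><username>mark</username></db_user>';

# 1. Parse directly from a scalar that holds XML text.
my $from_string = XMLin($xml, forcearray => 1);

# 2. Write the same text to a temporary file and parse it by filename.
my ($fh, $filename) = tempfile(SUFFIX => '.xml', UNLINK => 1);
print {$fh} $xml;
close $fh or die "Cannot close $filename: $!";
my $from_file = XMLin($filename, forcearray => 1);

# Both calls return the same kind of structure.
print "From string: $from_string->{username}->[0]\n";
print "From file:   $from_file->{username}->[0]\n";
```

XMLin() tells the two apart by inspecting the argument: text that contains angle brackets is parsed as XML, and anything else is treated as a filename.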
Retrieving the Parsed Information 7-26
How do we retrieve the parsed information that is stored in the data structure? I'll show you in this section of the example. Because this is a relatively short and flat XML document (that is, it has few nested elements), we're going to extract all the information explicitly. If this XML configuration were longer or contained repeated elements (for example, multiple <db_user> elements), we would need to loop through each occurrence of the same element. Note that the DTD or the XML schema would need to be changed to support multiple <db_user> elements. In our program, the scalar $root is a reference to a data structure similar to the one shown in Figure 4.1. As you can see, we need to walk up and down the tree and retrieve our data. It may seem a bit confusing at first, but after you're comfortable with Perl references and nested data structures, it will seem like second nature.
7. # Print config info.
8. print "Configuration Information\n";
9. print "User: $root->{db_user}->[0]->{username}->[0]\n";
10. print "Password: $root->{db_user}->[0]->{password}->[0]\n\n";
11.
12. # Print info about the first server (note indexes = 0).
13. print "Server #1\n";
14. print "Hostname: $root->{db_server}->[0]->{hostname}\n";
15. print "OS: $root->{db_server}->[0]->{os}\n";
16. print "IP Address: $root->{db_server}->[0]->{server_ip}->[0]\n";
17. print "DB Driver: $root->{db_server}->[0]->{db_driver}->[0]\n";
18. print "Port: $root->{db_server}->[0]->{port}->[0]\n\n";
19.
20. # Print info about the second server (note indexes = 1).
21. print "Server #2\n";
22. print "Hostname: $root->{db_server}->[1]->{hostname}\n";
23. print "OS: $root->{db_server}->[1]->{os}\n";
24. print "IP Address: $root->{db_server}->[1]->{server_ip}->[0]\n";
25. print "DB Driver: $root->{db_server}->[1]->{db_driver}->[0]\n";
26. print "Port: $root->{db_server}->[1]->{port}->[0]\n";
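When the number of <db_server> elements is not known in advance, the hard-coded indexes can be replaced by a loop over the array that forcearray creates. This is a sketch of that idea, not a replacement for the listing: the XML is embedded as a string so that the example runs on its own, and the @report array is a name introduced here purely for illustration.

```perl
use strict;
use warnings;
use XML::Simple;

# Copy of the configuration file, embedded so the sketch is self-contained.
my $xml = <<'END_XML';
<db_config_information>
  <db_user>
    <username>mark</username>
    <password>mark's password</password>
  </db_user>
  <db_server hostname="rocket" os="Linux">
    <server_ip>192.168.1.10</server_ip>
    <db_driver>MySQL</db_driver>
    <port>3306</port>
  </db_server>
  <db_server hostname="scooter" os="Microsoft Windows">
    <server_ip>192.168.1.50</server_ip>
    <db_driver>ODBC</db_driver>
    <port>3379</port>
  </db_server>
</db_config_information>
END_XML

my $root = XMLin($xml, forcearray => 1);

# $root->{db_server} is an array reference with one hash per server,
# so we dereference it and visit each entry in document order.
my @report;
my $count = 0;
for my $server (@{ $root->{db_server} }) {
    $count++;
    push @report,
        "Server #$count",
        "Hostname: $server->{hostname}",
        "OS: $server->{os}",
        "IP Address: $server->{server_ip}->[0]",
        "DB Driver: $server->{db_driver}->[0]",
        "Port: $server->{port}->[0]";
}

print "$_\n" for @report;
```

The loop body is the same for every server, so adding a third <db_server> element to the configuration requires no change to the program.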
Note
For additional information about Perl references and nested data structures, see the perldoc perlref page.
The index in each reference indicates a particular occurrence of an element. For example, the statement
$root->{db_user}->[0]->{username}->[0]
refers to the following portion of the input XML document:
<db_user>
<username>mark</username>
This line refers to the first (and, in this case, only) <username> element that is a child of the <db_user> element. There is one subtle item to notice about the statement that refers to one of the element attributes,
$root->{db_server}->[0]->{hostname}
which refers to the following portion of the XML document:
<db_server hostname="rocket" os="Linux">
Did you notice that we don't need an index after hostname? Because hostname is an attribute of the <db_server> element, it can appear only once, so XML::Simple stores its value as a plain string rather than as an array reference. If you try to use an index, Perl tries to treat that string as a reference to an array, which is an error under use strict. Therefore, you don't need the index to extract the attribute value.
XML::Simple Perl Program Output
Listing 4.5 shows the output that is generated by the XML::Simple Perl program. As desired, we've extracted the contents of the XML document and printed the information that was contained in each element. In a real application, instead of printing out the contents of the input XML document, you could create a hash to store the key-value pairs for further processing. For example, $hashname{username} would contain "mark", and $hashname{password} would contain "mark's password".
Listing 4.5 Output generated by the XML::Simple-based program. (Filename: ch4_simple_output.txt)
Configuration Information
User: mark
Password: mark's password

Server #1
Hostname: rocket
IP Address: 192.168.1.10
DB Driver: MySQL
Port: 3306

Server #2
Hostname: scooter
IP Address: 192.168.1.50
DB Driver: ODBC
Port: 3379
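The difference between a child element (which needs an index) and an attribute (which does not) can be checked directly with ref(). The following sketch assumes XML::Simple is installed and uses a cut-down, one-server document. It wraps the faulty access in eval so that the error message can be inspected instead of ending the program.

```perl
use strict;
use warnings;
use XML::Simple;

# Cut-down document with one attribute-bearing element and one child.
my $xml = <<'END_XML';
<db_config_information>
  <db_server hostname="rocket" os="Linux">
    <server_ip>192.168.1.10</server_ip>
  </db_server>
</db_config_information>
END_XML

my $root   = XMLin($xml, forcearray => 1);
my $server = $root->{db_server}->[0];

# Child element: forcearray wraps its text in an array reference.
my $ip_kind = ref $server->{server_ip};

# Attribute: stored as a plain string, so ref() returns an empty string.
my $host_kind = ref $server->{hostname};

# Indexing the attribute dies under "use strict", because the string
# "rocket" would have to be used as a symbolic array reference.
my $ok = eval { my $unused = $server->{hostname}->[0]; 1 };
my $error = $ok ? '' : $@;

print "server_ip is stored as: ", ($ip_kind   || 'a plain string'), "\n";
print "hostname is stored as:  ", ($host_kind || 'a plain string'), "\n";
print "Indexing the attribute failed: $error" unless $ok;
```

Checking ref() in this way is also a convenient debugging aid when you are unsure whether a given key in the parsed structure needs an index.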
The XML::Simple module is ideal to use in situations such as this when you need to parse small- to
Note
For additional XML::Simple module information (including all the possible options),
XML::Twig Perl Module
The XML::Twig Perl module by Michel Rodriguez is similar to the XML::Simple Perl module in that it is built on top of the XML::Parser module. However, that is where the similarities end. XML::Twig is a tree-based parser, but it gives you more flexibility than other tree-based parsers by providing two important characteristics found in other kinds of parsers. First, the XML::Twig Perl module is similar to a SAX-based module because it can be configured to have a small memory requirement. This is important when working with large XML documents, especially if you're only interested in a small subset of the elements. Second, the XML::Twig Perl module provides an API that is similar to XPath for retrieving elements and attributes. This is important because it means you don't have to learn a module-specific, proprietary scheme for accessing elements and attributes. When it comes to parsing XML data, the XML::Twig Perl module supports three modes of operation.
Because I just demonstrated how the XML::Simple module parses small, simple documents, the next example focuses on how to use the XML::Twig Perl module to parse a subset of a larger, more complex XML document.
XML::Twig Example
One of the most useful capabilities of the XML::Twig module is the capability to parse only a portion of an XML document. When would you use this capability? Let's say for this example that you work for a large online retailer and that your catalog is stored in XML. Your task is to generate a report for a particular portion of the catalog. For a large catalog, it isn't
First, before we get into the program that
Listing 4.6 DTD that describes the online retailer catalog. (Filename: ch4_twig_retail.dtd)
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT catalog (department*)>
<!ELEMENT department (category*)>
<!ATTLIST department name CDATA #REQUIRED>
<!ELEMENT category (product*)>
<!ATTLIST category name CDATA #REQUIRED>
<!ELEMENT product (name, sku, description, price)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT sku (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT price (#PCDATA)>
Listing 4.7 shows the XML schema that describes the format of our XML catalog.
Listing 4.7 XML schema that describes the online retailer catalog. (Filename: ch4_twig_retail.xsd)
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified">
<xs:element name="catalog">
<xs:complexType>
<xs:sequence>
<xs:element ref="department" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="category">
<xs:complexType>
<xs:sequence>
<xs:element ref="product" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="name" type="xs:string" use="required"/>
</xs:complexType>
</xs:element>
<xs:element name="department">
<xs:complexType>
<xs:sequence>
<xs:element ref="category" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="name" type="xs:string" use="required"/>
</xs:complexType>
</xs:element>
<xs:element name="description" type="xs:string"/>
<xs:element name="name" type="xs:string"/>
<xs:element name="price" type="xs:string"/>
<xs:element name="product">
<xs:complexType>
<xs:sequence>
<xs:element ref="name"/>
<xs:element ref="sku"/>
<xs:element ref="description"/>
<xs:element ref="price"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="sku" type="xs:string"/>
</xs:schema>
Now that you understand the structure of the XML document based on the DTD and XML schema, let's take a look at the sample XML document shown in Listing 4.8. As you can see, we have two departments (electronics and print) represented in the XML document, and each of the departments has at least one category represented. Our task for this example is to generate a report that shows all products, regardless of department or category. As mentioned earlier, this may seem like a big task, especially if the file is several hundred megabytes (or even several gigabytes) in size. However, you'll soon see that this is a simple task with the XML::Twig Perl module.
Listing 4.8 Online retailer catalog in XML. (Filename: ch4_twig_retail.xml)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE catalog SYSTEM "ch4_twig_retail.dtd">
<catalog>
<department name="electronics">
<category name="personal computers">
<product>
<name>Computer hard drive</name>
<sku>1112819</sku>
<description>120 GB hard drive</description>
<price>0.00</price>
</product>
<product>
<name>Rewritable CDs</name>
<sku>11291119</sku>
<description>Box of 10</description>
<price>.00</price>
</product>
</category>
<category name="software">
<product>
<name>Microsoft Windows</name>
<sku>11299938</sku>
<description>Operating system</description>
<price>.99</price>
</product>
</category>
</department>
<department name="print">
<category name="books">
<product>
<name>XML and Perl</name>
<sku>9987298</sku>
<description>The XML and Perl book.</description>
<price>.99</price>
</product>
</category>
</department>
</catalog>
There are several ways to solve this problem, but one of the
Listing 4.9 XML::Twig-based program to parse a portion of an XML document. (Filename: ch4_twig_app.pl)
1. use strict;
2. use XML::Twig;
3.
4. # Instantiate the parser object and set the
5. # subroutine to be called.
6. my $twig = XML::Twig->new(twig_handlers => {product => \&print_products});
7. $twig->parsefile("ch4_twig_retail.xml");
8.
9. # Handler subroutine
10. sub print_products {
11. my($t, $elt)= @_;
12.
13. # Retrieve the element contents here.
14. my $name = $elt->first_child('name')->text;
15. my $sku = $elt->first_child('sku')->text;
16. my $description = $elt->first_child('description')->text;
17. my $price = $elt->first_child('price')->text;
18.
19. # Print the results here.
20. print "------------------------------------\n";
21. print "Name = $name\n";
22. print "SKU = $sku\n";
23. print "Description = $description\n";
24. print "Price = $price\n\n";
25.
26. # Free the memory
27. $t->purge;
28. }
Standard Pragmas 1-2
As you can see, this is a very short example that illustrates the power of the XML::Twig Perl module. The first section of the program includes the standard pragma statement that we see at the top of our programs (use strict). We also include the use XML::Twig statement to load the XML::Twig module.
1. use strict;
2. use XML::Twig;
Parsing and Extracting Information 4-28
This section of the program does all the work for us. First, we call new and create a new XML::Twig object. The twig_handlers argument to the constructor is a reference to a hash that consists of key=>value pairs. In this case, the key is the name of the element that we want to process, and the value is a reference to a subroutine that acts like an event handler in a SAX-based parser (that is, it is called whenever we encounter one of the specified elements). So, we've created a subroutine named print_products that is called whenever we encounter a <product> element. After setting up the initial handler, we call the parsefile method to parse the XML document.
As the document is parsed, the subroutine print_products is called each time XML::Twig encounters a <product> element. The subroutine receives the twig object and the element as arguments. As you can see, we can easily extract the element text by using the first_child method together with the text method. Don't forget to call the purge method to free up any memory associated with the element that was just parsed;
4. # Instantiate the parser object and set the
5. # subroutine to be called.
6. my $twig = XML::Twig->new(twig_handlers => {product => \&print_products});
7. $twig->parsefile("ch4_twig_retail.xml");
8.
9. # Handler subroutine
10. sub print_products {
11. my($t, $elt)= @_;
12.
13. # Retrieve the element contents here.
14. my $name = $elt->first_child('name')->text;
15. my $sku = $elt->first_child('sku')->text;
16. my $description = $elt->first_child('description')->text;
17. my $price = $elt->first_child('price')->text;
18.
19. # Print the results here.
20. print "------------------------------------\n";
21. print "Name = $name\n";
22. print "SKU = $sku\n";
23. print "Description = $description\n";
24. print "Price = $price\n\n";
25.
26. # Free the memory
27. $t->purge;
28. }
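The handler can also look upward in the tree, because the enclosing <category> and <department> elements are still open (and therefore still in memory) when a <product> handler runs. The sketch below is a variation on the listing, not the chapter's own program: it parses an embedded one-product catalog with the parse method, collects rows in an @rows array (a name introduced here for illustration), and uses first_child_text, which returns an empty string when the requested child is missing instead of failing.

```perl
use strict;
use warnings;
use XML::Twig;

# One-product catalog, embedded so the sketch is self-contained.
# The <price> element is left out on purpose to show the fallback.
my $xml = <<'END_XML';
<catalog>
  <department name="print">
    <category name="books">
      <product>
        <name>XML and Perl</name>
        <sku>9987298</sku>
        <description>The XML and Perl book.</description>
      </product>
    </category>
  </department>
</catalog>
END_XML

my @rows;    # one hash reference per product

my $twig = XML::Twig->new(
    twig_handlers => { product => \&collect_product },
);
$twig->parse($xml);    # parse() accepts a string; parsefile() takes a filename

sub collect_product {
    my ($t, $elt) = @_;

    # The ancestors are not yet closed, so they are still available.
    my $category   = $elt->parent;
    my $department = $category->parent;

    push @rows, {
        department => $department->att('name'),
        category   => $category->att('name'),
        name       => $elt->first_child_text('name'),
        sku        => $elt->first_child_text('sku'),
        price      => $elt->first_child_text('price'),   # '' when absent
    };

    # Release the elements that have been completely parsed so far.
    $t->purge;
}

print "$_->{department} / $_->{category}: $_->{name} (SKU $_->{sku})\n"
    for @rows;
```

Copying the values into plain Perl data before calling purge matters: once purge has run, the element objects themselves are gone.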
The output of the XML::Twig-based application is shown in Listing 4.10. As you can see, I've retrieved all the <product> elements.
Listing 4.10 Output report from the XML::Twig application. (Filename: ch4_twig_report.txt)
------------------------------------
Name = Computer hard drive
SKU = 1112819
Description = 120 GB hard drive
Price = 0.00

------------------------------------
Name = Rewritable CDs
SKU = 11291119
Description = Box of 10
Price = .00

------------------------------------
Name = Microsoft Windows
SKU = 11299938
Description = Operating system
Price = .99

------------------------------------
Name = XML and Perl
SKU = 9987298
Description = The XML and Perl book.
Price = .99
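For very large catalogs, XML::Twig also offers the twig_roots constructor option, which builds the tree only for the listed elements and skips everything else. The following is a hedged sketch of that option rather than a rewrite of the chapter's program: the two-product catalog is embedded as a string, and the @skus array is a name introduced here for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Two products in different categories, embedded for self-containment.
my $xml = <<'END_XML';
<catalog>
  <department name="electronics">
    <category name="personal computers">
      <product>
        <name>Computer hard drive</name>
        <sku>1112819</sku>
      </product>
    </category>
    <category name="software">
      <product>
        <name>Microsoft Windows</name>
        <sku>11299938</sku>
      </product>
    </category>
  </department>
</catalog>
END_XML

my @skus;

# twig_roots builds subtrees only for <product> elements; the code
# reference is called for each one, just like a twig_handlers entry.
my $twig = XML::Twig->new(
    twig_roots => {
        product => sub {
            my ($t, $elt) = @_;
            push @skus, $elt->first_child_text('sku');
            $t->purge;    # discard the subtree once it has been used
        },
    },
);
$twig->parse($xml);

print "SKUs: @skus\n";
```

Because elements outside the twig roots are never built, a handler used this way should not rely on the <category> or <department> ancestors being present. Use twig_handlers, as in the listing, when that context is needed.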
Note For additional XML::Twig Perl module information (including all the possible options), please take a look at the available online documentation by using perldoc XML::Twig or see http://www.xmltwig.com. The XML::Twig Perl module happens to be one of the best documented modules (that is, the perldoc pages contain lots of examples).