Tree-based XML parsing is one of the two most popular ways of parsing XML, along with stream-based parsing, which was discussed in Chapter 3. The Document Object Model (DOM) standard defines a standard way to parse and represent XML data in a tree-based parser. Tree-based parsers are also called in-memory parsers because, unlike stream-based parsers, which iterate over each piece of data and keep very little state information, tree-based parsers iterate over the XML document and create an in-memory representation of it for later use. You can then use Perl's standard facilities to access the information in the data structure. The next few sections discuss non-DOM-compliant tree-based XML parsers.

XML::Simple Perl Module

Back in Chapter 2, "Now Let's Get Our Hands Dirty," you saw a small example using the XML::Simple Perl module to parse a document. In this section, we'll discuss another example using the XML::Simple Perl module and explain the XML::Simple API and some of the more useful options. The XML::Simple Perl module was written by Grant McLean. As an XML parser module, it is (as you may have guessed from the name) very simple to use. XML::Simple provides an easy-to-use API and is built on the XML::Parser module discussed in Chapter 3. It is ideal for small XML documents, such as configuration files, and for any other situation in which you don't need additional features, just a simple, easy-to-use XML parser. Let's take a look at an example of how to use the XML::Simple Perl module.
XML::Simple Perl Module Example

Some of our work involves working with databases, usually through the Perl Database Interface (DBI). The Perl DBI provides a consistent, database-independent interface to an application, regardless of the database that is being used. This means that a database application using the Perl DBI can use a consistent interface for a number of databases. This is a powerful capability that enables you to develop a very flexible application. The same application (with literally a one-line change) can access an Oracle database, a Microsoft SQL Server database, or the open-source MySQL database. Using the Perl DBI and an XML parser or generator module provides a strong foundation for a number of applications.

Note: We will be discussing the Perl DBI in Chapter 6, "Generating XML Documents from Databases." If you can't wait until Chapter 6 for this discussion of the Perl DBI, take a look at the official Perl DBI page at http://dbi.perl.org.

The reason that I mentioned the Perl DBI is that for it to operate properly, you need to pass in some initial configuration information. I usually store this information in an external XML document, and the XML::Simple Perl module is ideally suited for these types of situations. For example, to open a connection to a database using the Perl DBI, you need the following information:
It's usually a good idea to store this type of configuration information outside your application. Storing it externally eliminates the need for a user to look through source code and change hard-coded values. Now that you know what information we need in the configuration file, let's take a look at the DTD for the configuration file, shown in Listing 4.1.

Listing 4.1 Configuration file DTD. (Filename: ch4_simple_config.dtd)

<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT db_config_information (db_user,db_server+)>
<!ELEMENT db_user (username,password)>
<!ELEMENT username (#PCDATA)>
<!ELEMENT password (#PCDATA)>
<!ELEMENT db_server (server_ip,db_driver,port)>
<!ATTLIST db_server
    hostname CDATA #IMPLIED
    os CDATA #IMPLIED>
<!ELEMENT server_ip (#PCDATA)>
<!ELEMENT db_driver (#PCDATA)>
<!ELEMENT port (#PCDATA)>

As you can see, this is a very simple DTD that describes the configuration information. Note that the DTD supports multiple database servers for each user. If this were a longer configuration file (for example, a few hundred lines), using XML for the configuration file format would clearly be an advantage. By storing the configuration information in XML, you can easily verify that the configuration file is well-formed and valid. This assures you that all the required information is present and reduces the amount of error checking that is required by the application. Listing 4.2 shows the XML schema for the configuration file.

Listing 4.2 XML schema that describes the application configuration file.
(Filename: ch4_simple_config.xsd)

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="db_config_information">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="db_user"/>
        <xs:element ref="db_server" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="db_driver" type="xs:string"/>
  <xs:element name="db_server">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="server_ip"/>
        <xs:element ref="db_driver"/>
        <xs:element ref="port"/>
      </xs:sequence>
      <xs:attribute name="hostname" type="xs:string"/>
      <xs:attribute name="os" type="xs:string"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="db_user">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="username"/>
        <xs:element ref="password"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="password" type="xs:string"/>
  <xs:element name="port" type="xs:positiveInteger"/>
  <xs:element name="server_ip" type="xs:string"/>
  <xs:element name="username" type="xs:string"/>
</xs:schema>

We've shown the DTD and the XML schema; now let's take a look at the XML configuration file shown in Listing 4.3. It contains the user and database server information required to establish Perl DBI connections. As you can see, this configuration file contains information for the username "mark" and shows my connection information for two servers named "rocket" and "scooter". Associated with each server is an IP address (<server_ip>), a driver name (<db_driver>), and a port number (<port>).

Listing 4.3 XML configuration file.
(Filename: ch4_simple_config.xml)

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE db_config_information SYSTEM "ch4_simple_config.dtd">
<db_config_information>
  <db_user>
    <username>mark</username>
    <password>mark's password</password>
  </db_user>
  <db_server hostname="rocket" os="Linux">
    <server_ip>192.168.1.10</server_ip>
    <db_driver>MySQL</db_driver>
    <port>3306</port>
  </db_server>
  <db_server hostname="scooter" os="Microsoft Windows">
    <server_ip>192.168.1.50</server_ip>
    <db_driver>ODBC</db_driver>
    <port>3379</port>
  </db_server>
</db_config_information>

Tree-based parsers are easier to understand when you can see how the data is stored after it has been parsed. This makes it much easier to access the particular elements that you are interested in retrieving. Figure 4.2 shows a graphical representation of the XML configuration file as it is stored in memory after parsing by the XML::Simple Perl module. As you should expect, we have one root element (<db_config_information>) that has three child elements (<db_user>, <db_server[0]>, and <db_server[1]>). Note the index next to each <db_server> element. This is used to indicate multiple occurrences of elements with the same name.

Figure 4.2. Tree representation of the XML configuration file.
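To make the tree concrete, the following is a hand-written sketch of roughly what the parsed structure looks like as nested Perl hashes and arrays, assuming the file is read with the forcearray option turned on. It is typed by hand for illustration rather than dumped from the module, so treat the exact layout as an approximation; the variable names are simply illustrative choices.

```perl
use strict;
use warnings;

# Hand-built approximation of the parsed configuration (not module output).
my $root = {
    db_user => [
        {
            username => ['mark'],
            password => ["mark's password"],
        },
    ],
    db_server => [
        {
            hostname  => 'rocket',
            os        => 'Linux',
            server_ip => ['192.168.1.10'],
            db_driver => ['MySQL'],
            port      => ['3306'],
        },
        {
            hostname  => 'scooter',
            os        => 'Microsoft Windows',
            server_ip => ['192.168.1.50'],
            db_driver => ['ODBC'],
            port      => ['3379'],
        },
    ],
};

# Child elements are stored as arrays (hence the index);
# attributes are stored as plain strings.
my $first_user   = $root->{db_user}->[0]->{username}->[0];
my $second_host  = $root->{db_server}->[1]->{hostname};
my $server_count = scalar @{ $root->{db_server} };

print "User: $first_user\n";
print "Second server: $second_host\n";
print "Servers: $server_count\n";
```

Seen this way, each level of the tree is either a hash lookup by element or attribute name, or an array lookup by occurrence number.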
Now let's take a look at the Perl program in Listing 4.4. This program was built with the XML::Simple module and parses the XML configuration file that was shown earlier in Listing 4.3.

Listing 4.4 Configuration file parser built with the XML::Simple Perl module. (Filename: ch4_simple_app.pl)

1. use strict;
2. use XML::Simple;
3.
4. # Parse the input XML document
5. my $root = XMLin("./ch4_simple_config.xml", forcearray=>1);
6.
7. # Print config info.
8. print "Configuration Information\n";
9. print "User: $root->{db_user}->[0]->{username}->[0]\n";
10. print "Password: $root->{db_user}->[0]->{password}->[0]\n\n";
11.
12. # Print info about the first server (note indexes = 0).
13. print "Server #1\n";
14. print "Hostname: $root->{db_server}->[0]->{hostname}\n";
15. print "OS: $root->{db_server}->[0]->{os}\n";
16. print "IP Address: $root->{db_server}->[0]->{server_ip}->[0]\n";
17. print "DB Driver: $root->{db_server}->[0]->{db_driver}->[0]\n";
18. print "Port: $root->{db_server}->[0]->{port}->[0]\n\n";
19.
20. # Print info about the second server (note indexes = 1).
21. print "Server #2\n";
22. print "Hostname: $root->{db_server}->[1]->{hostname}\n";
23. print "OS: $root->{db_server}->[1]->{os}\n";
24. print "IP Address: $root->{db_server}->[1]->{server_ip}->[0]\n";
25. print "DB Driver: $root->{db_server}->[1]->{db_driver}->[0]\n";
26. print "Port: $root->{db_server}->[1]->{port}->[0]\n";

Initialization and Parsing 1–5

The first section of this program contains the usual use strict pragma that we include in all our programs. Because we're using the XML::Simple Perl module, we need to load it with a use XML::Simple statement. As mentioned earlier, XML::Simple is built on top of the XML::Parser Perl module that we discussed in Chapter 3. We don't need a use XML::Parser statement because this is already handled for us inside the XML::Simple Perl module.
The XML::Simple Perl module exports two functions: XMLin() and XMLout(). As you may have guessed from the names, the XMLin() function is used to parse XML data, while the XMLout() function is used to generate XML data. Because we're concerned with parsing, we'll be using the XMLin() function.

The XMLin() function parses XML data and returns a reference to a data structure that contains the information stored in the XML document. This function can accept an input filename, an undefined filename, or just a string containing XML data. In our example, we have provided the name of an input XML document named ch4_simple_config.xml. If undef is provided as the filename, the XML::Simple module will look for an XML document in the current directory with the same name as the program. For example, if your parsing program was named foo.pl and you didn't provide an input filename or string containing XML data, then the XML::Simple module would look for a file named foo.xml and try to parse it. XML::Simple will also accept a scalar containing an XML string as an input. This is helpful if XML data is being dynamically generated, saving you the trouble of writing the XML data to a file and then opening the file.

In this example, $root is a reference to a data structure that contains all the information stored in the XML configuration file. Now that the XML document has been parsed, the information is stored for us in this data structure, and we just need to go and retrieve it.

1. use strict;
2. use XML::Simple;
3.
4. # Parse the input XML document
5. my $root = XMLin("./ch4_simple_config.xml", forcearray=>1);

Retrieving the Parsed Information 7–26

How do we retrieve the parsed information that is stored in the data structure? I'll show you in this section of the example. Because this is a relatively short and flat XML document (that is, not a lot of nested elements), we're going to explicitly extract all the information.
If this XML configuration were longer or contained repeated elements (for example, multiple <db_user> elements), we would need to loop through each occurrence of the same element. Note that the DTD or the XML schema would need to be changed to support multiple <db_user> elements. In our program, the scalar $root is a reference to a data structure similar to the one shown in Figure 4.1. As you can see, we need to walk up and down the tree and retrieve our data. It may seem a bit confusing at first, but after you're comfortable with Perl references and nested data structures, it will seem like second nature.

7. # Print config info.
8. print "Configuration Information\n";
9. print "User: $root->{db_user}->[0]->{username}->[0]\n";
10. print "Password: $root->{db_user}->[0]->{password}->[0]\n\n";
11.
12. # Print info about the first server (note indexes = 0).
13. print "Server #1\n";
14. print "Hostname: $root->{db_server}->[0]->{hostname}\n";
15. print "OS: $root->{db_server}->[0]->{os}\n";
16. print "IP Address: $root->{db_server}->[0]->{server_ip}->[0]\n";
17. print "DB Driver: $root->{db_server}->[0]->{db_driver}->[0]\n";
18. print "Port: $root->{db_server}->[0]->{port}->[0]\n\n";
19.
20. # Print info about the second server (note indexes = 1).
21. print "Server #2\n";
22. print "Hostname: $root->{db_server}->[1]->{hostname}\n";
23. print "OS: $root->{db_server}->[1]->{os}\n";
24. print "IP Address: $root->{db_server}->[1]->{server_ip}->[0]\n";
25. print "DB Driver: $root->{db_server}->[1]->{db_driver}->[0]\n";
26. print "Port: $root->{db_server}->[1]->{port}->[0]\n";

Note: For additional information about Perl references and nested data structures, see the perldoc perlref page.

The index in each reference indicates a particular occurrence of an element.
For example, the statement $root->{db_user}->[0]->{username}->[0] refers to the following portion of the input XML document:

<db_user>
  <username>mark</username>

This statement refers to the first (and, in this case, only) <username> element that is a child of the <db_user> element. There is one subtle item to notice about a statement that refers to an element attribute. The statement $root->{db_server}->[0]->{hostname} refers to the following portion of the XML document:

<db_server hostname="rocket" os="Linux">

Did you notice how we don't need an index after hostname? Because hostname is an attribute of the db_server element, it can only appear once. If you try to use an index, you will cause an error because Perl will incorrectly try to interpret the attribute value as a reference to an array. Therefore, you don't need the index to extract the attribute value.

XML::Simple Perl Program Output

Listing 4.5 shows the output that is generated by the XML::Simple Perl program. As desired, we've extracted the contents of the XML document and printed the information that was contained in each element. In a real application, instead of printing out the contents of the input XML document, you could create a hash to store the key-value pairs for further processing. For example, $hashname{username} would contain "mark", and $hashname{password} would contain "mark's password".

Listing 4.5 Output generated by the XML::Simple-based program. (Filename: ch4_simple_output.txt)

Configuration Information
User: mark
Password: mark's password

Server #1
Hostname: rocket
OS: Linux
IP Address: 192.168.1.10
DB Driver: MySQL
Port: 3306

Server #2
Hostname: scooter
OS: Microsoft Windows
IP Address: 192.168.1.50
DB Driver: ODBC
Port: 3379

The XML::Simple module is ideal to use in situations such as this, when you need to parse small to medium-sized XML documents. As I have demonstrated, the API for the XML::Simple module is very easy to use for tasks similar to our example.
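The idea of looping over repeated elements and collecting the values in a hash can be sketched as follows. This is only one possible arrangement: the %config hash and its key names are illustrative choices, the XML is passed to XMLin() as a string so the sketch is self-contained, and KeyAttr => [] is added on the assumption that you want repeated elements left as plain arrays.

```perl
use strict;
use warnings;
use XML::Simple;

# Inline copy of the configuration data so the sketch is self-contained.
my $xml = <<'END_XML';
<db_config_information>
  <db_user>
    <username>mark</username>
    <password>mark's password</password>
  </db_user>
  <db_server hostname="rocket" os="Linux">
    <server_ip>192.168.1.10</server_ip>
    <db_driver>MySQL</db_driver>
    <port>3306</port>
  </db_server>
  <db_server hostname="scooter" os="Microsoft Windows">
    <server_ip>192.168.1.50</server_ip>
    <db_driver>ODBC</db_driver>
    <port>3379</port>
  </db_server>
</db_config_information>
END_XML

# ForceArray keeps child elements as arrays; KeyAttr => [] turns off
# XML::Simple's folding of arrays into hashes keyed on an attribute.
my $root = XMLin($xml, ForceArray => 1, KeyAttr => []);

# %config and its key names are illustrative, not part of the module's API.
my %config = (
    username => $root->{db_user}->[0]->{username}->[0],
    password => $root->{db_user}->[0]->{password}->[0],
);

# Loop over every <db_server> instead of hard-coding indexes 0 and 1.
for my $server (@{ $root->{db_server} }) {
    $config{servers}{ $server->{hostname} } = {
        os     => $server->{os},
        ip     => $server->{server_ip}->[0],
        driver => $server->{db_driver}->[0],
        port   => $server->{port}->[0],
    };
}

for my $host (sort keys %{ $config{servers} }) {
    my $s = $config{servers}{$host};
    printf "%s (%s): %s:%s via %s\n",
        $host, $s->{os}, $s->{ip}, $s->{port}, $s->{driver};
}
```

Keying the servers by hostname means the rest of an application can look up a server by name rather than by its position in the file.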
There are cases, though, when another module is the better choice, especially when you need to deal with larger XML documents. We'll discuss one of those cases in the next section.

Note: For additional XML::Simple module information (including all the possible options), please take a look at the available online documentation by using perldoc XML::Simple.

XML::Twig Perl Module

The XML::Twig Perl module by Michel Rodriguez is similar to the XML::Simple Perl module in that it is built on top of the XML::Parser module. However, that is where the similarities end. XML::Twig is a tree-based parser, but it gives you more flexibility than other tree-based parsers by providing two important characteristics found in other parsers. First, the XML::Twig Perl module is similar to a SAX-based module because it can be configured to have a small memory requirement. This is important when working with large XML documents, especially if you're only interested in a small subset of the elements. Second, the XML::Twig Perl module provides an API that is similar to XPath for retrieving elements and attributes. This is important because it means you aren't learning a proprietary, module-specific scheme for accessing elements and attributes. When it comes to parsing XML data, the XML::Twig Perl module supports three modes of operation.
Because I just demonstrated how the XML::Simple module parses small, simple documents, the next example focuses on how to use the XML::Twig Perl module to parse a subset of a larger, more complex XML document.

XML::Twig Example

One of the most useful capabilities of the XML::Twig module is the capability to parse only a portion of an XML document. When would you use this capability? Let's say for this example that you work for a large online retailer and that your catalog is stored in XML. Your task is to generate a report for a particular portion of the catalog. For a large catalog, it isn't unrealistic that the XML document could be several hundred megabytes (MB) in size. While it is physically possible to use a standard tree-based parser on a document this size, it isn't recommended. Remember, an XML document in memory will be many times the size it occupies on disk. You could use a SAX-based parser; however, for some applications it may be more work than you need to do (or, in some situations, have time to do). One possible option is to use the XML::Twig module to process only a portion of an XML document.

First, before we get into the program that performs the task, let's start with the format of the XML document. Remember, this is a catalog in XML format for a large online retailer. As you can imagine, in addition to being very large, an XML document that supports these requirements would be fairly complex. This document would have a large number of nested elements. Listing 4.6 shows the DTD that is being used by the online retailer. As you can see, the DTD is fairly simple; however, there are several levels of nesting, similar to what you might expect if you were visiting an online retailer.

Listing 4.6 DTD that describes the online retailer catalog.
(Filename: ch4_twig_retail.dtd)

<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT catalog (department*)>
<!ELEMENT department (category*)>
<!ATTLIST department name CDATA #REQUIRED>
<!ELEMENT category (product*)>
<!ATTLIST category name CDATA #REQUIRED>
<!ELEMENT product (name, sku, description, price)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT sku (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT price (#PCDATA)>

Listing 4.7 shows the XML schema that describes the format of our XML catalog.

Listing 4.7 XML schema that describes the online retailer catalog. (Filename: ch4_twig_retail.xsd)

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="catalog">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="department" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="category">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="product" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="name" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="department">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="category" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="name" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="description" type="xs:string"/>
  <xs:element name="name" type="xs:string"/>
  <xs:element name="price" type="xs:string"/>
  <xs:element name="product">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="name"/>
        <xs:element ref="sku"/>
        <xs:element ref="description"/>
        <xs:element ref="price"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="sku" type="xs:string"/>
</xs:schema>

Now that you understand the structure of the XML document based on the DTD and XML schema, let's take a look at the sample XML document shown in Listing 4.8.
As you can see, we have two departments (electronics and print) represented in the XML document, and each of the departments has at least one category represented. Our task for this example is to generate a report that shows all products, regardless of department or category. As mentioned earlier, this may seem like a big task, especially if the file is several hundred megabytes (or even several gigabytes) in size. However, you'll soon see that this is a simple task with the XML::Twig Perl module.

Listing 4.8 Online retailer catalog in XML. (Filename: ch4_twig_retail.xml)

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE catalog SYSTEM "xml-twig.dtd">
<catalog>
  <department name="electronics">
    <category name="personal computers">
      <product>
        <name>Computer hard drive</name>
        <sku>1112819</sku>
        <description>120 GB hard drive</description>
        <price>0.00</price>
      </product>
      <product>
        <name>Rewritable CDs</name>
        <sku>11291119</sku>
        <description>Box of 10</description>
        <price>.00</price>
      </product>
    </category>
    <category name="software">
      <product>
        <name>Microsoft Windows</name>
        <sku>11299938</sku>
        <description>Operating system</description>
        <price>.99</price>
      </product>
    </category>
  </department>
  <department name="print">
    <category name="books">
      <product>
        <name>XML and Perl</name>
        <sku>9987298</sku>
        <description>The XML and Perl book.</description>
        <price>.99</price>
      </product>
    </category>
  </department>
</catalog>

There are several ways to solve this problem, but one of the easiest solutions is to use XML::Twig to parse only a portion of the XML document. For this example, we're only interested in the product elements, and the program shown in Listing 4.9 does just that. Let's walk through the program and see how it works.

Listing 4.9 XML::Twig-based program to parse a portion of an XML document. (Filename: ch4_twig_app.pl)

1. use strict;
2. use XML::Twig;
3.
4. # Instantiate the parser object and set the
5. # subroutine to be called.
6. my $twig = new XML::Twig(twig_handlers => {product => \&print_products});
7. $twig->parsefile("ch4_twig_retail.xml");
8.
9. # Handler subroutine
10. sub print_products {
11.   my($t, $elt) = @_;
12.
13.   # Retrieve the element contents here.
14.   my $name = $elt->first_child('name')->text;
15.   my $sku = $elt->first_child('sku')->text;
16.   my $description = $elt->first_child('description')->text;
17.   my $price = $elt->first_child('price')->text;
18.
19.   # Print the results here.
20.   print "------------------------------------\n";
21.   print "Name = $name\n";
22.   print "SKU = $sku\n";
23.   print "Description = $description\n";
24.   print "Price = $price\n\n";
25.
26.   # Free the memory
27.   $t->purge;
28. }

Standard Pragmas 1–2

As you can see, this is a very short example that illustrates the power of the XML::Twig Perl module. The first section of the program includes the standard use strict pragma that we see at the top of our programs. We also include a use XML::Twig statement to load the XML::Twig module.

1. use strict;
2. use XML::Twig;

Parsing and Extracting Information 4–28

This section of the program does all the work for us. First, we call new and create a new XML::Twig object. The twig_handlers argument to the constructor is a hash that consists of key=>value pairs. In this case, the key is the name of the element that we want to process, and the value is a reference to a subroutine that acts like an event handler in a SAX-based parser (that is, it is called whenever we encounter one of the specified elements). So, we've created a subroutine named print_products that is called whenever we encounter a <product> element.

After setting up the initial handler, we call the parsefile method to parse the XML document. As the document is parsed, the subroutine print_products is called each time XML::Twig encounters a <product> element. The subroutine receives the twig object and the element as arguments.
As you can see, we can easily extract the element text by using the first_child method. Don't forget to call the purge method to free up any memory associated with the element that was just parsed; otherwise, the parsed elements accumulate in memory. This may not be an issue when parsing a small XML document such as our example, but it would cause a problem if you were parsing a large XML document.

4. # Instantiate the parser object and set the
5. # subroutine to be called.
6. my $twig = new XML::Twig(twig_handlers => {product => \&print_products});
7. $twig->parsefile("ch4_twig_retail.xml");
8.
9. # Handler subroutine
10. sub print_products {
11.   my($t, $elt) = @_;
12.
13.   # Retrieve the element contents here.
14.   my $name = $elt->first_child('name')->text;
15.   my $sku = $elt->first_child('sku')->text;
16.   my $description = $elt->first_child('description')->text;
17.   my $price = $elt->first_child('price')->text;
18.
19.   # Print the results here.
20.   print "------------------------------------\n";
21.   print "Name = $name\n";
22.   print "SKU = $sku\n";
23.   print "Description = $description\n";
24.   print "Price = $price\n\n";
25.
26.   # Free the memory
27.   $t->purge;
28. }

The output of the XML::Twig-based application is shown in Listing 4.10. As you can see, I've retrieved all the <product> elements.

Listing 4.10 Output report from the XML::Twig application. (Filename: ch4_twig_report.txt)

------------------------------------
Name = Computer hard drive
SKU = 1112819
Description = 120 GB hard drive
Price = 0.00

------------------------------------
Name = Rewritable CDs
SKU = 11291119
Description = Box of 10
Price = .00

------------------------------------
Name = Microsoft Windows
SKU = 11299938
Description = Operating system
Price = .99

------------------------------------
Name = XML and Perl
SKU = 9987298
Description = The XML and Perl book.
Price = .99
Note: For additional XML::Twig Perl module information (including all the possible options), please take a look at the available online documentation by using perldoc XML::Twig or see http://www.xmltwig.com. The XML::Twig Perl module happens to be one of the best-documented modules (that is, the perldoc pages contain lots of examples).
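As a variation on the partial-parsing idea, the sketch below uses XML::Twig's twig_roots option, which is not used in the chapter's listing, so that the tree is built only for <product> elements. The inline catalog is a shortened, hypothetical sample in the same shape as the retailer catalog (prices are left out), and the names collect_product and @products are illustrative.

```perl
use strict;
use warnings;
use XML::Twig;

# Small hypothetical catalog, shaped like the retailer catalog.
my $xml = <<'END_XML';
<catalog>
  <department name="electronics">
    <category name="software">
      <product>
        <name>Microsoft Windows</name>
        <sku>11299938</sku>
        <description>Operating system</description>
      </product>
    </category>
  </department>
  <department name="print">
    <category name="books">
      <product>
        <name>XML and Perl</name>
        <sku>9987298</sku>
        <description>The XML and Perl book.</description>
      </product>
    </category>
  </department>
</catalog>
END_XML

my @products;

# twig_roots: build the tree only for <product> elements and skip the rest.
my $twig = XML::Twig->new(
    twig_roots => { product => \&collect_product },
);
$twig->parse($xml);    # parse() takes a string; parsefile() takes a filename

# Called once for each complete <product> element.
sub collect_product {
    my ($t, $elt) = @_;

    # first_child_text returns the text of the named child element.
    push @products, {
        name => $elt->first_child_text('name'),
        sku  => $elt->first_child_text('sku'),
    };

    # Release the memory used by what has been parsed so far.
    $t->purge;
}

printf "%-20s %s\n", $_->{name}, $_->{sku} for @products;
```

Collecting the values into a list instead of printing them inside the handler keeps the parsing step separate from the reporting step, which may be convenient if the report needs sorting or totals.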