Perl | Professional XML (Programmer to Programmer)

Perl is one of the oldest and most powerful scripting languages. Larry Wall created it in 1987 and wanted to name it Pearl, after the Parable of the Pearl. However, that name was already taken by another programming language, so he shortened it to Perl. The name was later said to be an acronym (Practical Extraction and Report Language), but that expansion is a fairly contrived example of creating an acronym after the fact.

Perl is available for most, if not all, platforms; at the time of writing, the current version is 5.8. Its forte is text processing, and it is typically used where you need to search through large amounts of text to find the information you need, such as when processing log files. This section focuses on the most basic XML parsers available in Perl, showing you techniques for reading and writing XML using the commonly available libraries. Note, however, that Perl supports a great many more libraries and methods of working with XML. For more details on Perl itself, see Programming Perl (ISBN 0-596-00027-8, also known as "The Camel Book"). The samples in this section were created using ActivePerl 5.8.8 (Build 819), but they should work with any Perl installation with few or no changes.

Reading and Writing XML

Perl supports three main types of parsers for working with XML: tree-based, object-based, and stream-based. Each is useful in different situations. Object-based parsers convert the XML into Perl objects, enabling you to work with XML without keeping track of angle brackets. Tree-based parsers enable you to work with the XML in memory, moving forward and backward as needed to process it. Stream-based parsers move rapidly through the XML document, raising events that your code can handle to process the XML.

Object-based parsers generally "feel" the most natural to those used to working in a programming language: rather than treating XML as a separate format, developers use the techniques they already know. Tree-based parsers are generally based on the XML DOM, and thus are the easiest to port between languages. They build a common in-memory model of the XML as a tree with a single root and branches reaching out to terminal leaf nodes. Stream-based parsers are generally the fastest if you need forward-only access to the XML. They also usually have the lowest memory requirements, because they hold only a portion of the document in memory at any time. These parsers are best if you need only a small part of the XML file, or if you need to process the file only once, in order.

Reading XML

The simplest library for processing XML with Perl is named, aptly enough, XML::Simple. This library was originally created for processing configuration files, but it can be used with many XML files. It is an object-based parser that converts the XML into Perl data structures, such as hashrefs and arrays. XML::Simple has two main functions: XMLin loads a block of XML and converts it into a mixture of arrays and associative arrays (hashes), whereas XMLout does the opposite. In Perl, arrays are zero-based lists of items, and associative arrays are collections of name-value pairs. Listing 17-1 shows how this library can be used to process the XML that is shown in Listing 17-2.

Listing 17-1: Using XML::Simple to read XML with Perl

      use strict;
      use warnings;
      use XML::Simple;

      my $file = 'customers.xml';

      # default behaviour
      print "Default behaviour\n";
      my $doc = XMLin($file);
      print XMLout($doc->{customer}->{ALFKI});
      print "\n============================\n";

      # Coerces structure into arrays (outputs as elements)
      print "Output as elements\n";
      $doc = XMLin($file, ForceArray => 1);
      print XMLout($doc->{customer}->{ALFKI});
      print "\n============================\n";

      # Does not use id as key, creates array of customers
      print "Display 0th customer\n";
      $doc = XMLin($file, KeyAttr => []);
      print XMLout($doc->{customer}->[0]);
      print "\n============================\n";

      # Return selected elements
      print "Return selected elements\n";
      $doc = XMLin($file);
      print $doc->{customer}->{AROUT}->{contact}->{phone}, "\n";

Listing 17-2: Sample XML used in reading samples

      <customers>
        <customer id="ALFKI">
          <company>Alfreds Futterkiste</company>
          <address>
            <street>Obere Str. 57</street>
            <city>Berlin</city>
            <zip>12209</zip>
            <country>Germany</country>
          </address>
          <contact>
            <name>Maria Anders</name>
            <title>Sales Representative</title>
            <phone>030-0074321</phone>
            <fax>030-0076545</fax>
          </contact>
        </customer>
        <customer id="ANATR">
          <company>Ana Trujillo Emparedados y helados</company>
          <address>
            <street>Avda. de la Constitución 2222</street>
            <city>Mexico D.F.</city>
            <zip>05021</zip>
            <country>Mexico</country>
          </address>
          <contact>
            <name>Ana Trujillo</name>
            <title>Owner</title>
            <phone>(5) 555-4729</phone>
            <fax>(5) 555-3745</fax>
          </contact>
        </customer>
        <customer id="ANTON">
          <company>Antonio Moreno Taqueria</company>
          <address>
            <street>Mataderos  2312</street>
            <city>Mexico D.F.</city>
            <zip>05023</zip>
            <country>Mexico</country>
          </address>
          <contact>
            <name>Antonio Moreno</name>
            <title>Owner</title>
            <phone>(5) 555-3932</phone>
          </contact>
        </customer>
        .
        .
        .
      </customers>

In the previous code, the directive use XML::Simple; loads the module into your script. The script uses XMLin to import the XML and XMLout to print it to the system console. Each run loads the same file but passes different parameters to change the in-memory representation. Listing 17-3 shows the output of the code in Listing 17-1.

Installing Perl modules

To use XML::Simple in your Perl scripts, first ensure that you have it as part of your distribution. It is included with the ActivePerl distribution by default. If you do not have this library installed, you can install it from the Comprehensive Perl Archive Network (CPAN). CPAN is a Web site (cpan.org) that provides a common location for finding and downloading Perl libraries. As of this writing, almost 11,000 modules are available. These range from modules for specific operating systems, through image and text processing, to, of course, XML handling. The XML::Simple page on CPAN is at http://www.search.cpan.org/~grantm/XML-Simple-2.16/lib/XML/Simple.pm.

Perl distributions generally include tools, such as the cpan shell, that can automatically download and compile modules from CPAN. This means that you generally do not need to navigate manually through the CPAN site, find the module you need, download it, and compile it. If you do not have the module installed, or if a more recent version of the module is available, the code is downloaded and Perl attempts to compile the module. This compilation generally means you need a make program (such as nmake.exe or dmake.exe) available on your computer and on the system path. Once downloaded and compiled, the module is available to your applications.

Listing 17-3: Reading XML with XML::Simple

      Default behaviour
      <opt company="Alfreds Futterkiste">
        <address city="Berlin" country="Germany" street="Obere Str. 57" zip="12209" />
        <contact name="Maria Anders" fax="030-0076545" phone="030-0074321"
          title="Sales Representative" />
      </opt>
      ============================
      Output as elements
      <opt>
        <address>
          <city>Berlin</city>
          <country>Germany</country>
          <street>Obere Str. 57</street>
          <zip>12209</zip>
        </address>
        <company>Alfreds Futterkiste</company>
        <contact>
          <name>Maria Anders</name>
          <fax>030-0076545</fax>
          <phone>030-0074321</phone>
          <title>Sales Representative</title>
        </contact>
      </opt>
      ============================
      Display 0th customer
      <opt id="ALFKI" company="Alfreds Futterkiste">
        <address city="Berlin" country="Germany" street="Obere Str. 57" zip="12209" />
        <contact name="Maria Anders" fax="030-0076545" phone="030-0074321"
          title="Sales Representative" />
      </opt>
      ============================
      Return selected elements
      (171) 555-7788

By default, XML::Simple converts the document into a hashref (a reference to an associative array). Therefore, the first customer appears in memory as shown in Listing 17-4, which uses the output format of Data::Dumper. Each of the elements in the original XML file is now represented as a name-value pair.

Listing 17-4: Structure of the customer in memory

      $VAR1 = {
                'address' => {
                               'country' => 'Germany',
                               'zip' => '12209',
                               'city' => 'Berlin',
                               'street' => 'Obere Str. 57'
                             },
                'contact' => {
                               'fax' => '030-0076545',
                               'name' => 'Maria Anders',
                               'title' => 'Sales Representative',
                               'phone' => '030-0074321'
                             },
                'company' => 'Alfreds Futterkiste'
              };
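As a rough sketch of how such a structure can be navigated in code, the following script walks every customer in a small inline document. The inline XML is a cut-down stand-in for customers.xml, and the explicit KeyAttr and ForceArray settings are an assumption made here so that the result is keyed by id even when only one customer is present; they are not part of Listing 17-1.

```perl
use strict;
use warnings;
use XML::Simple;

# Cut-down, illustrative stand-in for customers.xml
my $xml = <<'XML';
<customers>
  <customer id="ALFKI">
    <company>Alfreds Futterkiste</company>
    <address><city>Berlin</city><country>Germany</country></address>
  </customer>
  <customer id="ANATR">
    <company>Ana Trujillo Emparedados y helados</company>
    <address><city>Mexico D.F.</city><country>Mexico</country></address>
  </customer>
</customers>
XML

# XMLin also accepts a string of XML instead of a file name.
# Fold the customers into a hash keyed by their id attribute.
my $doc = XMLin($xml,
                KeyAttr    => { customer => 'id' },
                ForceArray => ['customer']);

# Each customer is a nested hashref, as in Listing 17-4
for my $id (sort keys %{ $doc->{customer} }) {
    my $c = $doc->{customer}{$id};
    print "$id: $c->{company} ($c->{address}{city})\n";
}
```

Sorting the keys is a deliberate choice: Perl hashes have no defined order, so output that must be repeatable should not rely on the order returned by keys.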

You have a number of options for adjusting the resulting structure. For example, the ForceArray parameter of XMLin forces every nested element to be represented as an array, even when it occurs only once. Listing 17-5 shows the resulting in-memory structure.

Listing 17-5: Structure of the customer in memory with ForceArray

      $VAR1 = {
                'address' => [
                               {
                                 'country' => ['Germany'],
                                 'zip' => ['12209'],
                                 'city' => ['Berlin'],
                                 'street' => ['Obere Str. 57']
                               }
                             ],
                'contact' => [
                               {
                                 'fax' => ['030-0076545'],
                                 'name' => ['Maria Anders'],
                                 'title' => ['Sales Representative'],
                                 'phone' => ['030-0074321']
                               }
                             ],
                'company' => ['Alfreds Futterkiste']
              };
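The practical effect of ForceArray is on how you reach a value: every level gains an array index. The sketch below, which assumes a one-customer inline document invented for illustration, reads the same city with and without the option.

```perl
use strict;
use warnings;
use XML::Simple;

# Illustrative one-customer document (not the full customers.xml)
my $xml = '<customer><company>Alfreds Futterkiste</company>'
        . '<address><city>Berlin</city></address></customer>';

my $plain  = XMLin($xml);                   # nested hashes and scalars
my $forced = XMLin($xml, ForceArray => 1);  # every element becomes an array

# Without ForceArray, a single child is reached directly
print $plain->{address}{city}, "\n";

# With ForceArray, each level needs an index, even for one item
print $forced->{address}[0]{city}[0], "\n";
```

The extra indexes look verbose, but the shape stays the same whether an element occurs once or many times, so code that walks the structure does not need a special case for single items.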

In addition to the XML::Simple module, Perl supports a number of other XML processing modules. Stream-based parsing is available from the XML::Parser module. In fact, this module forms the basis of many of the other XML parsers for Perl, including XML::Simple. When using XML::Parser in streaming mode, you supply handlers for the events you care about; the three most common are called for the start, the end, and the text content of each element. Listing 17-6 shows a script to count the occurrences of cities in the customer file.

Listing 17-6: Counting cities with stream-based parsing

      use strict;
      use warnings;
      use XML::Parser;

      my $file   = 'customers.xml';
      my $parser = XML::Parser->new();
      my %cities;      # city name => number of occurrences
      my $flag = 0;    # true while inside a city element

      # Called at the start of each element
      sub start_handler {
        my ($p, $elem) = @_;
        if ($elem eq 'city') {
          $flag = 1;
        }
      }

      # Called at the end of each element; prints the counts
      # once the closing customers tag is reached
      sub end_handler {
        my ($p, $elem) = @_;
        if ($elem eq 'customers') {
          foreach my $city (keys %cities) {
            print $city, ": ", $cities{$city}, "\n";
          }
        }
      }

      # Called for text content; counts the city and clears the flag
      sub char_handler {
        if ($flag) {
          my ($p, $data) = @_;
          $cities{$data}++;
          $flag = 0;
        }
      }

      $parser->setHandlers(Start => \&start_handler,
                           End   => \&end_handler,
                           Char  => \&char_handler);
      $parser->parsefile($file);

Three handlers are defined and assigned to the parser. Of the three, only char_handler may need some explanation; it is called for each run of character data (text) in the XML.

The code creates a hash table. The key for each element in the hash table is a city name, and the value is the count for that city. Because the XML needs to be read only once, using a streaming parser such as XML::Parser means that the code should run faster than it might with another form of parser, as the parser itself does not need to create any additional memory structures. In start_handler, which is called at the beginning of each element, the code determines whether it is in the city element, setting a flag if so. If not, it continues. Similar code could handle multiple elements. If the flag is set, the char_handler routine increments the count for that city in the hash and turns off the flag. Finally, when end_handler sees the closing customers tag, which marks the end of the document, the count of each city is dumped to the output. Listing 17-7 shows a portion of the output of this script.
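One caveat with this flag-based approach is that XML::Parser may deliver a single run of text in more than one call to the Char handler. The variant below is a sketch, not the listing's method: it collects the text while inside a city element and counts it when the element ends. The inline XML and the variable names $in_city and $text are illustrative assumptions.

```perl
use strict;
use warnings;
use XML::Parser;

# Illustrative stand-in for customers.xml
my $xml = <<'XML';
<customers>
  <customer><address><city>Berlin</city></address></customer>
  <customer><address><city>Mexico D.F.</city></address></customer>
  <customer><address><city>Mexico D.F.</city></address></customer>
</customers>
XML

my %cities;          # city name => number of occurrences
my $in_city = 0;     # true while inside a city element
my $text    = '';    # text collected for the current city

my $parser = XML::Parser->new(Handlers => {
    # Start collecting when a city element opens
    Start => sub {
        my ($p, $elem) = @_;
        if ($elem eq 'city') { $in_city = 1; $text = ''; }
    },
    # Append every piece of text delivered inside the element
    Char => sub {
        my ($p, $data) = @_;
        $text .= $data if $in_city;
    },
    # Count the complete city name when the element closes
    End => sub {
        my ($p, $elem) = @_;
        if ($elem eq 'city') { $cities{$text}++; $in_city = 0; }
    },
});

$parser->parse($xml);    # parse a string rather than a file
print "$_: $cities{$_}\n" for sort keys %cities;
```

Counting in the End handler costs one extra variable, but the result no longer depends on how the parser happens to split the text.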

Listing 17-7: Output of the Perl stream-based parsing

      Reims: 1
      Barquisimeto: 1
      Mexico D.F.: 5
      Strasbourg: 1
      Graz: 1
      Lille: 1
      Leipzig: 1
      Charleroi: 1
      Bruxelles: 1
      Resende: 1
      San Francisco: 1
      Eugene: 1
      Warszawa: 1
      Elgin: 1

Writing XML

Writing XML with Perl and XML::Simple is a matter of building up the correct structure in memory and using XMLout to write the resulting XML. By default, scalar values in a hashref are written as attributes, whereas nested hashrefs and the items in an array are written as child elements. The code in Listing 17-8 shows how to create a simple in-memory structure. Note that the formatting is for clarity; the definition of the structure could fit on one line.

Listing 17-8: Writing XML with XML::Simple

      use XML::Simple;

      my $cfg = {
        'version' => '1.0',
        'section' => {
          'name'    => 'Section 1',
          'setting' => [
            {
              'name'  => 'Setting#1',
              'value' => 'Value#1'
            },
            {
              'name'  => 'Setting#2',
              'value' => 'Value#2'
            },
            {
              'name'  => 'Setting#3',
              'value' => 'Value#3'
            }
          ]
        }
      };

      # write out Perl variable
      print XMLout($cfg, RootName=>'configuration', XMLDecl=>1);

The XMLout function takes the in-memory structure and writes the XML version to the console. The $cfg variable defines no root node, only the version value and the section, so the name of the root element is supplied through the RootName parameter in the call to XMLout. In addition, the XMLDecl parameter causes the standard XML declaration to be included. If you were creating this XML to be part of a larger structure, you would likely omit the declaration. Listing 17-9 shows the output of this script.

Listing 17-9: Output of writing XML with XML::Simple

      <?xml version='1.0' standalone='yes' ?>
      <configuration version="1.0">
        <section name="Section 1">
          <setting name="Setting#1" value="Value#1" />
          <setting name="Setting#2" value="Value#2" />
          <setting name="Setting#3" value="Value#3" />
        </section>
      </configuration>
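If you would rather have the values written as child elements than as attributes, XML::Simple provides the NoAttr option. The following is a minimal sketch using a trimmed version of the configuration structure; the structure is shortened here for illustration only.

```perl
use strict;
use warnings;
use XML::Simple;

# Trimmed, illustrative version of the structure in Listing 17-8
my $cfg = {
    version => '1.0',
    section => { name => 'Section 1' },
};

# NoAttr => 1 writes scalar values as child elements, not attributes
my $xml = XMLout($cfg, RootName => 'configuration', NoAttr => 1);
print $xml;
```

Element-only output is often easier for other tools to consume, at the cost of a more verbose document.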

Support for Other XML Formats

Beyond XML::Simple and XML::Parser, many other modules exist for working with XML and Perl. CPAN (see the Resources section later in this chapter) lists over 3,700 current modules; they include everything from simple processing, through specific XML formats such as Atom or DocBook, to XSLT and XSL-FO processors. Some of the most notable modules include:

* XML::Parser::PerlSAX: A stream-based parser with full SAX support.
* XML::Twig: A tree-based parser, optimized for working with extremely large documents. Documents can be loaded entirely in memory or chunked to conserve memory.
* XML::DOM: A tree-based parser with W3C DOM support. Good for porting DOM code, but not very Perl-like.
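To give a feel for the chunked style of processing that XML::Twig offers, here is a brief sketch that counts cities in the same way as the stream-based example in this section. It assumes XML::Twig is installed from CPAN, and the inline XML is an illustrative stand-in for customers.xml.

```perl
use strict;
use warnings;
use XML::Twig;

# Illustrative stand-in for customers.xml
my $xml = <<'XML';
<customers>
  <customer><address><city>Berlin</city></address></customer>
  <customer><address><city>Mexico D.F.</city></address></customer>
  <customer><address><city>Mexico D.F.</city></address></customer>
</customers>
XML

my %cities;    # city name => number of occurrences

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete city element has been parsed
        city => sub {
            my ($t, $city) = @_;
            $cities{ $city->text }++;
            $t->purge;    # release the parts of the tree already handled
        },
    },
);

$twig->parse($xml);
print "$_: $cities{$_}\n" for sort keys %cities;
```

Each handler receives a complete element, so there are no flags to maintain, and purging after each one keeps memory use low on large files.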