Section 2.3. Other Parsing Techniques | Advanced Perl Programming

2.3. Other Parsing Techniques

Of course, we don't want to always be writing our own parsers for most of the data we come across, as there's a good chance someone else has come across that sort of data before. The best examples are HTML and XML: there's a vast amount of code out there that deals with these file formats, and most of the hard work has been put into CPAN modules. We'll look at a few of these modules in this section.

2.3.1. HTML::Parser

I'll start by saying something that is anathema to a lot of advanced Perl programmers: in certain circumstances, it is acceptable to use regular expressions to extract the data you want from HTML. I've written a bunch of screen-scraping programs to automate access to various web sites and applications, and because I knew the pages were machine-generated and unlikely to change, I had no qualms about using regular expressions to get what I wanted.

In general, though, you should do things properly. The way to parse HTML properly is to use the HTML::Parser module.

HTML::Parser is incredibly flexible. It supports several methods of operation: you can use OO inheritance, you can use callbacks, you can determine what data gets sent to callbacks and when the callbacks are called, and so on. We'll only look here at the simplest way of using it: by subclassing the module.
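For contrast with subclassing, here is a minimal sketch of the callback style mentioned above. It assumes the version 3 API of HTML::Parser, where each handler is paired with an "argspec" string naming the values it should receive. The sample HTML and the @hrefs variable are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample input, used here only for illustration.
my $html = '<p>See <a href="http://www.perl.com/">perl.com</a> and '
         . '<a href="http://www.cpan.org/">CPAN</a>.</p>';

my @hrefs;

# With api_version => 3, each handler is a code reference paired with an
# argspec string listing what the parser should pass to it.
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tag, $attr) = @_;
            push @hrefs, $attr->{href}
                if $tag eq 'a' && defined $attr->{href};
        },
        'tagname, attr',
    ],
);

$parser->parse($html);
$parser->eof;    # signal end of input so any buffered text is flushed

print "$_\n" for @hrefs;
```

The callback form keeps everything in one script, at the cost of holding state in closed-over variables rather than in the parser object.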

Let's begin by examining a way to dump out the URL and link text for every hyperlink in a document. Because we're inheriting from HTML::Parser, we need to say something like this:

     package DumpLinks;
     use strict;
     use base 'HTML::Parser';

Next, we specify what happens when we see a start tag: if it's not an <a> tag, we ignore it. If it is, we make a note of its href attribute and remember that we're currently inside an <a> tag.

     sub start {
         my ($self, $tag, $attr) = @_;
         return unless $tag eq "a";
         $self->{_this_url} = $attr->{href};
         $self->{_in_link} = 1;
     }

Notice that our method is called with the literal name of the current tag, plus a hash of the attributes given in the tag. It's actually called with a few more parameters, but these two are by far the most important; take a look at the HTML::Parser documentation for the rest.
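As a rough illustration of those extra parameters: when HTML::Parser is subclassed this way, the start method also receives an array reference giving the attribute names in their original order, and the original text of the tag. The class name ShowStartArgs and the _seen key are made up for this sketch.

```perl
package ShowStartArgs;    # hypothetical class name, for illustration only
use strict;
use warnings;
use base 'HTML::Parser';

# When subclassing, start is called as start($tag, $attr, $attrseq, $origtext).
sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    push @{ $self->{_seen} }, {
        tag   => $tag,
        # $attrseq preserves the order in which the attributes appeared
        attrs => [ map { "$_=$attr->{$_}" } @$attrseq ],
        orig  => $origtext,
    };
}

package main;

my $parser = ShowStartArgs->new;
$parser->parse('<a href="http://www.perl.com/" title="Perl">Perl</a>');
$parser->eof;

for my $seen (@{ $parser->{_seen} }) {
    print "tag: $seen->{tag}\n";
    print "attributes in order: @{ $seen->{attrs} }\n";
    print "original text: $seen->{orig}\n";
}
```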

Now let's add a text handler: this is called for any ordinary text that isn't a tag. This needs to store away any text it finds while we're inside a link and do nothing otherwise.

     sub text {
         my ($self, $text) = @_;
         return unless $self->{_in_link};
         $self->{_urls}->{$self->{_this_url}} .= $text;
     }

Note that we have to use concatenation so that the following comes out correctly:

      <a href="http://www.perl.com/">The <code>Perl</code> home page</a>

The text handler will be called three times for this chunk: once for The, once for Perl, and once for home page. We want all three of these pieces of text, so we concatenate them together.
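To watch the chunks arrive, a small sketch such as the following records every call to the text handler for that fragment. The class name LinkTextChunks and the _chunks key are invented for this example, and the exact chunk boundaries are up to the parser.

```perl
package LinkTextChunks;    # hypothetical class name, for illustration only
use strict;
use warnings;
use base 'HTML::Parser';

# Record each piece of text separately instead of concatenating it.
sub text {
    my ($self, $text) = @_;
    push @{ $self->{_chunks} }, $text;
}

package main;

my $parser = LinkTextChunks->new;
$parser->parse('<a href="http://www.perl.com/">The <code>Perl</code> home page</a>');
$parser->eof;

print "chunk: [$_]\n" for @{ $parser->{_chunks} };
print "joined: ", join('', @{ $parser->{_chunks} }), "\n";
```

Joining the chunks recovers the full link text, which is exactly what the .= in the text handler does incrementally.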

Finally, we need an end tag handler to take us out of _in_link mode, like so:

     sub end {
         my ($self, $tag) = @_;
         $self->{_in_link} = 0 if $tag eq "a";
     }

Let's look at our complete parser package again before we use it:

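     package DumpLinks;
     use strict;
     use base 'HTML::Parser';

     # Remember the URL and note that we are inside a link.
     sub start {
         my ($self, $tag, $attr) = @_;
         return unless $tag eq "a";
         $self->{_this_url} = $attr->{href};
         $self->{_in_link} = 1;
     }

     # Accumulate the link text, which may arrive in several pieces.
     sub text {
         my ($self, $text) = @_;
         return unless $self->{_in_link};
         $self->{_urls}->{$self->{_this_url}} .= $text;
     }

     # Leave link mode at the closing </a>.
     sub end {
         my ($self, $tag) = @_;
         $self->{_in_link} = 0 if $tag eq "a";
     }

     1;  # needed if this is saved as DumpLinks.pm and loaded with "use"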

Using it couldn't be simpler: we instantiate a DumpLinks object, call its parse_file method on the HTML file of our choice, and we'll have a handy hash reference in $parser->{_urls} that we can inspect.

     use DumpLinks;

     my $parser = DumpLinks->new(  );
     $parser->parse_file("index.html");
     for (keys %{$parser->{_urls}}) {
         print qq{Link to $_ (Link text: "}. $parser->{_urls}->{$_}. qq{")\n};
     }

Running this on the front page of this week's www.perl.com edition produces something like this:

     Link to /cs/user/query/q/6?id_topic=42 (Link text: "Files")
     Link to /pub/a/universal/pcb/solution.html (Link text: "Do it now.")
     Link to http://www.oreillynet.com/python/ (Link text: "Python")
     Link to http://training.perl.com/ (Link text: "Training")
     Link to /cs/user/query/q/6?id_topic=68 (Link text: "Sound and Audio")
     Link to /cs/user/query/q/6?id_topic=62 (Link text: "User Groups")
     Link to http://search.cpan.org/author/DARNOLD/DBD-Chart-0.74 (Link text: "DBD-Chart-0.74")
     Link to http://www.oreilly.com/catalog/perlxml/ (Link text: "Perl & XML")
     Link to http://www.oreilly.com/catalog/regex2/ (Link text: "Mastering Regular Expressions, 2nd Edition")
     Link to http://www.openp2p.com/ (Link text: "openp2p.com")
     ...

As if that wasn't easy enough, there are a few other modules you might consider when dealing with HTML text. For doing something like the above, if you don't care about the link text, HTML::LinkExtor can do the job in seconds:

     use HTML::LinkExtor;

     my $parser = HTML::LinkExtor->new(  );
     $parser->parse_file("index.html");
     for ($parser->links) {
         my ($tag, %attrs) = @$_;
         print $attrs{href},"\n";
     }

If you're not interested in writing callbacks, another module worth looking into is HTML::TokeParser, which parses an HTML file one token at a time. Another favorite is HTML::TreeBuilder, which allows you to navigate the document's structure as a tree of Perl objects.
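As a rough sketch of the token-at-a-time style, the following rebuilds the link dumper with HTML::TokeParser and no callbacks at all. The sample HTML is invented for illustration. Passing a reference to a string, rather than a filename, makes the parser read from that string.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample input, used here only for illustration.
my $html = '<p><a href="http://www.perl.com/">The <code>Perl</code> home page</a> and '
         . '<a href="http://www.cpan.org/">CPAN</a></p>';

my $stream = HTML::TokeParser->new(\$html)
    or die "Cannot create token parser";

my %urls;

# get_tag("a") skips ahead to the next <a> start tag and returns
# undef once the input is exhausted.
while (my $tag = $stream->get_tag('a')) {
    my $href = $tag->[1]{href};    # element 1 is the attribute hash
    next unless defined $href;
    # Collect the text up to the closing </a>, skipping nested tags.
    $urls{$href} = $stream->get_trimmed_text('/a');
}

print qq{Link to $_ (Link text: "$urls{$_}")\n} for sort keys %urls;
```

Here the program pulls tokens when it wants them, so the "are we inside a link?" flag from the subclassing version is no longer needed.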

For more on HTML parsing with Perl modules, you should check out Sean Burke's Perl and LWP (O'Reilly).

2.3.2. XML Parsing

Of course, nowadays HTML is old hat, and everything is being written in the much more right-on XML. The principles are the same, only the module name changes: instead of using HTML::Parser, there's an XML::Parser module.

This works in the same way as HTML::Parser: you set callbacks for start tags, end tags, and the stuff in between. Of course, for 99% of the things you need to do with XML, this method is complete overkill. Just like with so many other things in Perl, if you want the flexibility, you can have it, but if you want things to be simple, you can have that, too. Simple is good, and a good module for handling XML simply is called, simply, XML::Simple.
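For reference, the callback style with XML::Parser might look like the following minimal sketch. The XML document and the %description hash are invented for illustration. The Start, End, and Char handlers each receive the underlying parser object as their first argument.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample document, used here only for illustration.
my $xml = <<'XML';
<modules>
  <module name="HTML::Parser">Parses HTML</module>
  <module name="XML::Parser">Parses XML</module>
</modules>
XML

my %description;
my $current;    # name of the <module> we are inside, if any

my $parser = XML::Parser->new(
    Handlers => {
        # Attributes arrive as a flat list of name/value pairs.
        Start => sub {
            my ($expat, $element, %attr) = @_;
            $current = $attr{name} if $element eq 'module';
        },
        # Character data may arrive in several pieces, so concatenate.
        Char => sub {
            my ($expat, $text) = @_;
            $description{$current} .= $text if defined $current;
        },
        End => sub {
            my ($expat, $element) = @_;
            undef $current if $element eq 'module';
        },
    },
);

$parser->parse($xml);

print "$_: $description{$_}\n" for sort keys %description;
```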

The job of XML::Simple is to turn some XML into a Perl data structure or vice versa. It exports two subroutines: XMLin and XMLout. Let's see how it copes with a real-life XML file. This is a description of the opcodes in Microsoft's Common Intermediate Language, as implemented by the Mono project (http://www.mono-project.com^[*]):

^[*] If you've got Mono installed, you can probably find this file as /usr/local/share/mono/cil/cil-opcodes.xml.

     <opdesc>
     <opcode name="nop" input="Pop0" output="Push0" args="InlineNone" o1="0xFF" o2="0x00"
     flow="next"/>
     <opcode name="break" input="Pop0" output="Push0" args="InlineNone" o1="0xFF"
     o2="0x01" flow="break"/>
     <opcode name="ldarg.0" input="Pop0" output="Push1" args="InlineNone" o1="0xFF"
     o2="0x02" flow="next"/>
     <opcode name="ldarg.1" input="Pop0" output="Push1" args="InlineNone" o1="0xFF"
     o2="0x03" flow="next"/>
     ...
     </opdesc>

For instance, this tells us that the ldarg.0 operator takes no arguments from the stack, returns one value to the stack, has no arguments inline, is represented by the assembly code FF 02, and passes control flow to the next operation.

We'll use XML::Simple to read the file and Data::Dumper to take a look at the resulting data structure:

     % perl -MData::Dumper -MXML::Simple -e \
         'print Dumper XMLin("/usr/local/share/mono/cil/cil-opcodes.xml")'
     $VAR1 = {
               'opcode' => {
                             'stloc.2' => {
                                            'args' => 'InlineNone',
                                            'input' => 'Pop1',
                                            'o1' => '0xFF',
                                            'o2' => '0x0C',
                                            'output' => 'Push0',
                                            'flow' => 'next'
                                          },
                             'stloc.3' => {
                                            'args' => 'InlineNone',
                                            'input' => 'Pop1',
                                            'o1' => '0xFF',
                                            'o2' => '0x0D',
                                            'output' => 'Push0',
                                            'flow' => 'next'
                                          },
                             ...
                           }
             };

As you can see, this is pretty much exactly what we could have hoped for. So if we want to see how the shl operator takes its arguments, we can ask:

     use XML::Simple;

     my $opcodes = XMLin("/usr/local/share/mono/cil/cil-opcodes.xml");
     my $shl = $opcodes->{opcode}->{shl};
     print "shl takes its input by doing a ".$shl->{input}."\n";
     print "And it returns the result by doing a ".$shl->{output}."\n";

The other function that XML::Simple exports is XMLout, which, as you might be able to guess, turns a Perl data structure into XML. For instance, we could introduce a new opcode:

     $opcodes->{opcode}->{hcf} = {
                                  'args' => 'InlineNone',
                                  'input' => 'Pop0',
                                  'o1' => '0xFF',
                                  'o2' => '0xFF',
                                  'output' => 'Push0',
                                  'flow' => 'explode'
                                 };
     print XMLout($opcodes);

And now we'd find another item in that list of XML:

     <opcode args="InlineNone" input="Pop0" o1="0xFF" o2="0xFF" output="Push0"
     flow="explode" name="hcf" />

XML::Simple is particularly handy for dealing with configuration files: simply state that the config should be given in XML, and use XML::Simple to read and write it. The XMLin and XMLout functions do all the rest.
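A minimal sketch of that configuration round trip might look like this. The configuration itself (element names, hosts, ports) is invented for illustration and is inlined as a string, where a real program would pass XMLin a filename. The ForceArray and KeyAttr options are spelled out so that the resulting structure does not depend on how many <server> elements happen to be present.

```perl
use strict;
use warnings;
use XML::Simple;

# Hypothetical configuration, inlined here only for illustration.
my $xml = <<'XML';
<config logfile="/tmp/app.log">
  <server name="www" host="www.example.com" port="80" />
  <server name="db" host="db.example.com" port="5432" />
</config>
XML

my %opts = (
    ForceArray => ['server'],            # always treat <server> as a list...
    KeyAttr    => { server => 'name' },  # ...then fold it into a hash keyed on "name"
);

# Read: the XML becomes a nested Perl data structure.
my $config = XMLin($xml, %opts);
print "db host: $config->{server}{db}{host}\n";

# Modify, then write back out as XML under the same root element name.
$config->{server}{db}{port} = 5433;
my $out = XMLout($config, RootName => 'config', KeyAttr => { server => 'name' });
print $out;
```

Passing the same KeyAttr setting to XMLout lets it unfold the hash back into <server name="..."> elements, so the file keeps its original shape.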

If you need to do anything more sophisticated with XML parsing, take a look at Perl and XML.

2.3.3. And Everything Else...

While we're on the subject of configuration files, there are plenty of other file formats out there that the Perl programmer will need to throw around during her programming life, and config files make up a good number of them. The rest of this chapter suggests a few other techniques for dealing with standard file formats.

First of all, I have a personal favorite, but that's only because I wrote it. Config::Auto parses a variety of file formats, if necessary sniffing out what the file format is likely to be. Here's the Samba configuration from a default install of Mac OS X 10.2:

     % perl -MData::Dumper -MConfig::Auto -e \
         'print Dumper Config::Auto::parse("/etc/smb.conf")'
     $VAR1 = {
               'global' => {
                             'guest account' => 'unknown',
                             'client code page' => 437,
                             'encrypt passwords' => 'yes',
                             'coding system' => 'utf8'
                           },
               'homes' => {
                            'read only' => 'no',
                            'browseable' => 'no',
                            'comment' => 'User Home Directories',
                            'create mode' => '0750'
                          }
             };

Other modules worth looking out for are AppConfig, Parse::Syslog (which provides access to Unix system logs), SQL::Statement, and Mac::PropertyList.