Hack 23 In Praise of Regular Expressions

You don't always need to use a module like HTML::TokeParser or HTML::TreeBuilder in order to parse HTML. Sometimes, a few simple regular expressions can save you the effort.

Every so often, someone asks a question about extracting data from the thickets of HTML tag soup. They might have a piece of text like:

 <p>This is a paragraph</p> <p>And this is <i>another</i> paragraph</p>

and they wonder why they get such strange results when they try to attack it with something like /<p>(.*)<\/p>/. The standard Perlmonk's reply is to point people to HTML::Parser, HTML::TableExtract, or even HTML::TreeBuilder, depending on the context. The main thrust of the argument is usually that regular expressions lead to fragile code. This is what I term The Correct Answer, but alas, in Real Life, things are never so simple, as a recent experience just showed me. With a little care and effort, you can get perfect results with regular expressions, and much better performance.
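A minimal sketch of where the strange results come from, using the fragment above: `.*` is greedy, so one match runs from the first <p> to the last </p>. The non-greedy `.*?` with /g is shown only for contrast, and the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = '<p>This is a paragraph</p> <p>And this is <i>another</i> paragraph</p>';

# Greedy: .* grabs as much as it can, so the capture spans both paragraphs.
my ($greedy) = $html =~ /<p>(.*)<\/p>/;

# Non-greedy with /g: each capture stops at the nearest closing tag.
my @paragraphs = $html =~ /<p>(.*?)<\/p>/g;

print "greedy: $greedy\n";
print "lazy:   $_\n" for @paragraphs;
```

Even the non-greedy form only works here because the paragraphs are not nested and the tags carry no attributes, which is the kind of fragility the usual advice warns about.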

Using Modules to Parse HTML

I've used HTML::Parser in the past to build the Perlmonk Snippets Index (http://grinder.perlmonk.org/pmsi/); the main reason is that I wanted to walk down all the pages, in case an old node was reaped. Doing that, I learned it's a real bear to ferry information from one callback to another. I hacked it by using global variables to keep track of state. Later on, someone else told me that The Right Way to use HTML::Parser is to subclass it and extend the internal hash object to track state that way. Fair enough, but this approach, while theoretically correct, is not a trivial undertaking for a casual user who just wants to chop up some HTML.

More recently, I used HTML::TreeBuilder to parse some HTML output from a webified Domino database. Because of the way the HTML was structured in this particular case, it was a snap to just look_down('_tag', 'foo') and get exactly what I wanted. It was easy to write, and the code was straightforward.

Watching the Printers: Score One for Regular Expressions

Then, last week, I got tired of keeping an eye on our farm of HP 4600 color printers to see how their supplies were lasting (they use four cartridges, C, M, Y, and K, and two kits, the transfer and the fuser). It turns out that this model has an embedded web server. Point your browser at it, and it will produce a status page that shows you how many pages can be printed, based on what's left in the consumables.

So, I brought HTML::TreeBuilder to bear on the task. It wasn't quite as easy. It was no simple matter to find a reliable part in the tree from whence to direct my search. The HTML contains deeply nested tables, with a high degree of repetition for each kit and cartridge. The various pieces of information were scattered in different elements, and collecting and collating it made for some pretty ugly code.

After I'd managed to wrestle the data I wanted out of the web page, I set about stepping through the code in the debugger, to better understand the data structures and see what shortcuts I could figure out by way of method chaining and array slicing in an attempt to tidy up the code. To my surprise, I saw that just building the HTML::TreeBuilder object (by calling parse() with the HTML in a scalar) required about a second to execute, and this on some fairly recent high-end hardware.

Until then, I wasn't really concerned about performance, because I figured the parse time would be dwarfed by the time it took to get the request's results back. In the master plan, I intended to use LWP::Parallel::UserAgent (http://search.cpan.org/author/MARCLANG/ParallelUserAgent/) to probe all of the printers in parallel, rather than loop through them one at a time, and factor out much of the waiting. In a perfect world, it would be as fast as the single slowest printer.

Given the less than stellar performance of the code at this point, however, it was clear that the cost of parsing the HTML would consume the bulk of the overall runtime. I might have been able to traverse a partially fetched page, but at that point the architecture would start to become unwieldy. Madness!

The Code

So, after trying the orthodox approach, I started again. I broke the rule about parsing HTML with regular expressions and wrote the following code:

 #!/usr/bin/perl -w
 use strict;

 # Assumes $_ already holds the HTML of the printer's status page.
 my (@s) = m{
     >                 # close of previous tag
     ([^<]+)           # text (name of part, e.g., q/BLACK CARTRIDGE/)
     <br>
     ([^<]+)           # part number (e.g., q/HP Part Number: HP C9724A/+)
     (?:<[^>]+>\s*){4} # separated by four tags
     (\d+)             # percent remaining
     |                 # --or--
     (?:               # different text values
         (?:
             Pages\sRemaining
           | Low\sReached
           | Serial\sNumber
           | Pages\sprinted\swith\sthis\ssupply
         )
         : (?:\s*<[^>]+>){6}\s* # colon, separated by six tags
         # or just this, within the current element
       | Based\son\shistorical\s\S+\spage\scoverage\sof\s
     )
     (\w+)             # and the value we want
 }gx;

A single regular expression (albeit with a /g modifier for global matching) pulls out all I want. Actually, it's not quite perfect, since the resulting array also fills up with a pile of undefs: for every match, the capturing parentheses on the side of the alternation that did not match are left unfilled. This is easily handled by adding a simple next unless defined $index to any foreach loop on @s (testing with defined, rather than for truth, keeps a legitimate value of 0).
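The undef behavior is easy to reproduce on a small scale. The following sketch uses an invented input string and a deliberately simplified two-branch pattern, not the printer's real HTML, and the names $page and @kept are illustrative. Each match returns one slot per capture group in the whole pattern, so the groups on the branch that did not match come back as undef.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented sample text; the real status page is far messier.
my $page = 'BLACK CARTRIDGE: 42%; Serial Number: 12345; CYAN CARTRIDGE: 0%';

my @s = $page =~ m{
    ([A-Z]+)\sCARTRIDGE:\s(\d+)%   # branch 1: part name and percent remaining
    |                              # --or--
    Serial\sNumber:\s(\w+)         # branch 2: a single value
}gx;

# Three matches times three capture groups gives nine slots, four of them undef.
my @kept;
foreach my $index (@s) {
    next unless defined $index;    # skip the unfilled captures
    push @kept, $index;
}

print scalar(@s), " slots, ", scalar(@kept), " values: @kept\n";
```

Note that a plain truth test (next unless $index) would also throw away the "0" reported for the empty cartridge in this sample, which is why the sketch tests with defined.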

Is the code fragile? Not really. The HTML has errors in it, such as <td valign= op">, which can trip up some modules that expect perfectly formed HTML, but HTML::TreeBuilder coped just fine with this too.

Not Fragile, but Probably Not Permanent Either

The generated HTML is stored in the printer's onboard firmware, so unless I upgrade the BIOS, the HTML isn't going to change; it's written in stone, bugs and all.

Here's the main point: when the HP 4650 or 4700 model is released, it will probably have completely different HTML anyway, perhaps with stylesheets instead of tables. Either way, the HTML will have to be inspected anew, in order to tweak the regular expression or to pull something else out of TreeBuilder's parse tree.

Neither approach, regular expression nor module, is maintenance free. But the regular expression is far less code and 17 times faster. Now, the extraction cost is negligible compared to the page fetch, as it should be. And, as a final bonus, the regular expression approach requires no noncore modules, saving me installation time. Case closed.

David Landgren