Section 8.4. Using Regular Expressions | Developing Feeds with RSS and Atom

8.4. Using Regular Expressions

Using regular expressions to parse feeds may seem a little brutish, but it does have two advantages. First, it sidesteps the differences between the feed standards: a pattern that looks for elements common to every version, such as title and link, matches the raw text no matter which version the feed follows. Second, it is much easier to install: it requires no XML parsing modules or any of their dependencies.
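As a rough illustration of both points, the following sketch applies one pattern to two hand-written feed fragments, one RSS 2.0 and one RSS 1.0, using nothing beyond core Perl. The sample feeds and the %feeds and %result names are assumptions made up for this example; the pattern itself is the feed-level one used in Example 8-7.

```perl
use strict;
use warnings;

# Two hypothetical, hand-written feed fragments (not real feeds).
my %feeds = (
    'RSS 2.0' => <<'XML',
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>http://example.org/</link>
  </channel>
</rss>
XML
    'RSS 1.0' => <<'XML',
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://example.org/index.rdf">
    <title>Example Feed</title>
    <link>http://example.org/</link>
  </channel>
</rdf:RDF>
XML
);

# The same pattern is applied to both formats; no XML module is loaded.
my %result;
for my $format (sort keys %feeds) {
    my ($title, $link) =
        $feeds{$format} =~ m#<title>(.*?)</title>.*?<link>(.*?)</link>#ms;
    $result{$format} = [$title, $link];
    print "$format: $title ($link)\n";
}
```

Both fragments yield the same title and link, because the pattern never looks at the root element or the namespaces that distinguish the two formats.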

Regular expressions, however, aren't pretty. Consider Example 8-7, which is a section from Rael Dornfest's lightweight RSS aggregator, Blagg.

Example 8-7. A section of code from Blagg

    # Feed's title and link
    my($f_title, $f_link) = ($rss =~ m#<title>(.*?)</title>.*?<link>(.*?)</link>#ms);

    # RSS items' title, link, and description
    while ( $rss =~ m{<item(?!s).*?>.*?(?:<title>(.*?)</title>.*?)?(?:<link>(.*?)</link>.*?)?(?:<description>(.*?)</description>.*?)?</item>}mgis ) {
        my($i_title, $i_link, $i_desc, $i_fn) = ($1||'', $2||'', $3||'', undef);

        # Unescape &amp; &lt; &gt; to produce useful HTML
        my %unescape = ('&lt;'=>'<', '&gt;'=>'>', '&amp;'=>'&', '&quot;'=>'"');
        my $unescape_re = join '|' => keys %unescape;
        $i_title && $i_title =~ s/($unescape_re)/$unescape{$1}/g;
        $i_desc && $i_desc =~ s/($unescape_re)/$unescape{$1}/g;

        # If no title, use the first 50 non-markup characters of the description
        unless ($i_title) {
            $i_title = $i_desc;
            $i_title =~ s/<.*?>//msg;
            $i_title = substr($i_title, 0, 50);
        }
        next unless $i_title;

While this looks pretty nasty, it is actually an efficient way of stripping the data out of the RSS file, even if it is potentially much harder to extend. If you are really into regular expressions and don't mind having a very specialized system, their simplicity may be for you. They certainly have their place.
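To try the technique in isolation, here is a minimal, self-contained sketch built around the item pattern from Example 8-7. It is not Blagg's actual code: the parse_items function, the sample feed, and the hash keys are assumptions made up for this illustration. Note the (?!s) lookahead in the pattern, which keeps it from matching an RSS 1.0 <items> element as though it were an <item>.

```perl
use strict;
use warnings;

# Hypothetical sample feed; the second item deliberately has no <title>.
my $rss = <<'XML';
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>http://example.org/</link>
    <item>
      <title>First &amp; foremost</title>
      <link>http://example.org/1</link>
      <description>Plain text</description>
    </item>
    <item>
      <link>http://example.org/2</link>
      <description>&lt;p&gt;Untitled &lt;b&gt;entry&lt;/b&gt;&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
XML

# parse_items is a made-up name for this sketch. It returns a list of
# hash references, one per item that ends up with a non-empty title.
sub parse_items {
    my ($feed) = @_;
    my %unescape = ('&lt;' => '<', '&gt;' => '>', '&amp;' => '&', '&quot;' => '"');
    my $unescape_re = join '|', keys %unescape;
    my @items;

    # Same item pattern as Example 8-7: title, link, and description are
    # each optional, and must appear in that order.
    while ( $feed =~ m{<item(?!s).*?>.*?(?:<title>(.*?)</title>.*?)?(?:<link>(.*?)</link>.*?)?(?:<description>(.*?)</description>.*?)?</item>}mgis ) {
        my ($title, $link, $desc) = ($1 || '', $2 || '', $3 || '');

        # Unescape the four basic entities in the title and description.
        s/($unescape_re)/$unescape{$1}/g for $title, $desc;

        # No title: fall back to the first 50 non-markup characters
        # of the description.
        if ($title eq '') {
            ($title = $desc) =~ s/<.*?>//gs;
            $title = substr($title, 0, 50);
        }
        next if $title eq '';

        push @items, { title => $title, link => $link, desc => $desc };
    }
    return @items;
}

my @items = parse_items($rss);
printf "%s <%s>\n", $_->{title}, $_->{link} for @items;
```

With this sample feed, the first item keeps its own title with the entity unescaped ("First & foremost"), while the second takes its title from the description with the markup stripped ("Untitled entry"). The sketch also shows the cost mentioned above: the pattern assumes a fixed element order and tags without attributes, so supporting anything else means rewriting the expression.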