Section 8.4. Using Regular Expressions

Using regular expressions to parse feeds may seem a little brutish, but it has two advantages. First, it sidesteps the differences between the feed standards entirely. Second, it is much easier to install: it requires no XML parsing modules or any of their dependencies.
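To make the first advantage concrete, here is a minimal sketch of one pattern pulling item titles out of both an RSS 2.0 and an RSS 1.0 document. The two sample feeds and the `item_titles` helper are hypothetical and written only for illustration (they are not taken from Blagg), and the pattern assumes that every item carries a title.

```perl
use strict;
use warnings;

# Hypothetical sample feeds, trimmed to the bare minimum for illustration.
my $rss20 = <<'XML';
<rss version="2.0"><channel>
<title>Example 2.0 Feed</title>
<item><title>First post</title><link>http://example.org/1</link></item>
<item><title>Second post</title><link>http://example.org/2</link></item>
</channel></rss>
XML

my $rss10 = <<'XML';
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/">
<channel rdf:about="http://example.org/">
<title>Example 1.0 Feed</title>
<items><rdf:Seq><rdf:li rdf:resource="http://example.org/a"/></rdf:Seq></items>
</channel>
<item rdf:about="http://example.org/a"><title>RDF post</title><link>http://example.org/a</link></item>
</rdf:RDF>
XML

# item_titles() is a hypothetical helper, not part of Blagg.
# Assumption: every <item> contains a <title>; an item without one
# would pick up the title of the item that follows it.
sub item_titles {
    my ($xml) = @_;
    my @titles;
    # (?!s) keeps RSS 1.0's <items> container from being mistaken for an <item>.
    while ($xml =~ m{<item(?!s)[^>]*>.*?<title>(.*?)</title>.*?</item>}gis) {
        push @titles, $1;
    }
    return @titles;
}

print join(', ', item_titles($rss20)), "\n";
print join(', ', item_titles($rss10)), "\n";
```

The same subroutine handles both documents without knowing which version it was given, and it needs nothing beyond core Perl.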

Regular expressions, however, aren't pretty. Consider Example 8-7, which is a section from Rael Dornfest's lightweight RSS aggregator, Blagg.

Example 8-7. A section of code from Blagg
    # Feed's title and link
    my($f_title, $f_link) = ($rss =~ m#<title>(.*?)</title>.*?<link>(.*?)</link>#ms);

    # RSS items' title, link, and description
    while ( $rss =~ m{<item(?!s).*?>.*?(?:<title>(.*?)</title>.*?)?(?:<link>(.*?)</link>.*?)?(?:<description>(.*?)</description>.*?)?</item>}mgis ) {
        my($i_title, $i_link, $i_desc, $i_fn) = ($1||'', $2||'', $3||'', undef);

        # Unescape &amp; &lt; &gt; to produce useful HTML
        my %unescape = ('&lt;'=>'<', '&gt;'=>'>', '&amp;'=>'&', '&quot;'=>'"');
        my $unescape_re = join '|' => keys %unescape;
        $i_title && $i_title =~ s/($unescape_re)/$unescape{$1}/g;
        $i_desc && $i_desc =~ s/($unescape_re)/$unescape{$1}/g;

        # If no title, use the first 50 non-markup characters of the description
        unless ($i_title) {
            $i_title = $i_desc;
            $i_title =~ s/<.*?>//msg;
            $i_title = substr($i_title, 0, 50);
        }
        next unless $i_title;

While this looks pretty nasty, it is actually an efficient way of stripping the data out of the RSS file, even if it is potentially much harder to extend. If you are really into regular expressions and don't mind having a very specialized, hard-to-extend system, their simplicity may be for you. They certainly have their place.
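One way to see why such a system is specialized is that the item pattern expects title, link, and description in that order. The sketch below reuses the item pattern from Example 8-7 on two hypothetical one-item fragments; the `parse_item` helper and the sample data are assumptions made for illustration and are not part of Blagg.

```perl
use strict;
use warnings;

# The item pattern from Example 8-7, compiled once for reuse.
my $item_re = qr{<item(?!s).*?>.*?(?:<title>(.*?)</title>.*?)?(?:<link>(.*?)</link>.*?)?(?:<description>(.*?)</description>.*?)?</item>}is;

# parse_item() is a hypothetical helper: it returns a hash reference
# with the three captured fields, or nothing if no item is found.
sub parse_item {
    my ($xml) = @_;
    return unless $xml =~ $item_re;
    my %item = (title => $1||'', link => $2||'', desc => $3||'');
    return \%item;
}

# Hypothetical fragments: the second one lists <link> before <title>.
my $in_order  = '<item><title>T</title><link>http://example.org/1</link><description>D</description></item>';
my $reordered = '<item><link>http://example.org/1</link><title>T</title><description>D</description></item>';

for my $xml ($in_order, $reordered) {
    my $item = parse_item($xml);
    printf "title=[%s] link=[%s] desc=[%s]\n", @{$item}{qw(title link desc)};
}
```

For the reordered fragment the title comes back empty, because the optional title group has already been skipped by the time the link matches. Supporting another element, or another ordering, means reworking the whole pattern rather than adding one line.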

    Developing Feeds with RSS and Atom
    ISBN: 0596008813
    Year: 2003