8.4. Using Regular Expressions

Using regular expressions to parse feeds may seem a little brutish, but it does have two advantages. First, it sidesteps the differences between the standards entirely. Second, installation is much easier: it requires no XML parsing modules or any dependencies thereof.

Regular expressions, however, aren't pretty. Consider Example 8-7, which is a section from Rael Dornfest's lightweight RSS aggregator, Blagg.

Example 8-7. A section of code from Blagg
# Feed's title and link
my($f_title, $f_link) = ($rss =~ m#<title>(.*?)</title>.*?<link>(.*?)</link>#ms);

   
# RSS items' title, link, and description
   
while ( $rss =~ m{<item(?!s).*?>.*?(?:<title>(.*?)</title>.*?)?(?:<link>(.*?)</link>.*?)?(?:<description>(.*?)</description>.*?)?</item>}mgis ) {
     my($i_title, $i_link, $i_desc, $i_fn) = ($1||'', $2||'', $3||'', undef);
   
     # Unescape &amp; &lt; &gt; to produce useful HTML
     my %unescape = ('&lt;'=>'<', '&gt;'=>'>', '&amp;'=>'&', '&quot;'=>'"');

     my $unescape_re = join '|' => keys %unescape;
     $i_title && $i_title =~ s/($unescape_re)/$unescape{$1}/g;
     $i_desc && $i_desc =~ s/($unescape_re)/$unescape{$1}/g;
   
     # If no title, use the first 50 non-markup characters of the description
     unless ($i_title) {
          $i_title = $i_desc;
          $i_title =~ s/<.*?>//msg;
          $i_title = substr($i_title, 0, 50);
     }
     next unless $i_title;

While this looks pretty nasty, it is actually an efficient way of stripping the data out of the RSS file, even if it is potentially much harder to extend. If you are really into regular expressions and don't mind having a very specialized, hard-to-extend system, their simplicity may be for you. They certainly have their place.
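
To make the approach easier to experiment with, here is a minimal sketch of the same technique in Python rather than Blagg's Perl. The patterns are carried over from Example 8-7; the sample feed and the names `SAMPLE_RSS`, `unescape`, and `parse_feed` are invented for illustration and are not part of Blagg.

```python
import re

# Hypothetical sample feed, invented for illustration only.
SAMPLE_RSS = """<rss version="0.91">
<channel>
<title>Example Feed</title>
<link>http://example.org/</link>
<item>
<title>First &amp; foremost</title>
<link>http://example.org/1</link>
<description>&lt;p&gt;Hello, world&lt;/p&gt;</description>
</item>
<item>
<link>http://example.org/2</link>
<description>&lt;b&gt;Untitled&lt;/b&gt; entry text</description>
</item>
</channel>
</rss>"""

# Same patterns as Example 8-7; re.DOTALL plays the role of Perl's /s flag.
FEED_RE = re.compile(r"<title>(.*?)</title>.*?<link>(.*?)</link>", re.DOTALL)
ITEM_RE = re.compile(
    r"<item(?!s).*?>.*?"
    r"(?:<title>(.*?)</title>.*?)?"
    r"(?:<link>(.*?)</link>.*?)?"
    r"(?:<description>(.*?)</description>.*?)?"
    r"</item>",
    re.IGNORECASE | re.DOTALL,
)

UNESCAPE = {"&lt;": "<", "&gt;": ">", "&amp;": "&", "&quot;": '"'}
UNESCAPE_RE = re.compile("|".join(map(re.escape, UNESCAPE)))


def unescape(text):
    # A single pass over the text, so "&amp;lt;" becomes "&lt;" and not "<".
    return UNESCAPE_RE.sub(lambda m: UNESCAPE[m.group(0)], text)


def parse_feed(rss):
    """Return (feed_title, feed_link, items) using regular expressions only."""
    head = FEED_RE.search(rss)
    feed_title, feed_link = head.groups() if head else ("", "")
    items = []
    for match in ITEM_RE.finditer(rss):
        # Optional groups that did not take part in the match are None.
        title, link, desc = (group or "" for group in match.groups())
        title, desc = unescape(title), unescape(desc)
        if not title:
            # No title: use the first 50 non-markup characters of the description.
            title = re.sub(r"<.*?>", "", desc, flags=re.DOTALL)[:50]
        if not title:
            continue
        items.append({"title": title, "link": link, "description": desc})
    return feed_title, feed_link, items
```

Calling `parse_feed(SAMPLE_RSS)` returns the feed's title and link plus one dictionary per item. Because the item pattern tolerates attributes on the opening tag and skips `<items>`, the same function should also pick up items from an RSS 1.0 document, where each `<item>` carries an `rdf:about` attribute. The fragility is easy to provoke, though: the pattern expects title, link, and description in that order, and a `<title>` with attributes will not match at all.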


    8.5. Using XSLT

    The transformation of RSS into another form of XML, using XSLT, isn't very common at the moment, but it may soon have its time in the sun. This is because RSS (especially RSS 1.0, with its complicated relationships and masses of metadata) can be reproduced in many useful ways.

    While the examples in this book are text-based and mostly XHTML, there is no reason you can't render RSS into an SVG graphic, a PDF (via the Apache FOP tool), an MMS-SMIL message for new-generation mobile phones, or any of the hundreds of other XML-based systems. XSLT and the arcane art of writing XSLT stylesheets to take care of all of this is a subject too large for this book to cover in detail; for that, check out O'Reilly's XSLT.

    Nevertheless, I will show you some nifty stuff. Example 8-8 is an XSLT stylesheet that transforms an RSS 1.0 feed into the XHTML produced in Example 8-7.

    Example 8-8. Transforming RSS 1.0 into XHTML fragments
    <?xml version="1.0"?>
       
    <xsl:stylesheet version = '1.0'
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rss="http://purl.org/rss/1.0/"
    exclude-result-prefixes="rss rdf"
    >
    <xsl:output method="html"/>
       
    <xsl:template match="/">
     <div >
      <a href="{rdf:RDF/rss:channel/rss:link}">
       <xsl:value-of select="rdf:RDF/rss:channel/rss:title"/>
      </a>
     </div>
     <div >
      <ul>
       <xsl:apply-templates select="rdf:RDF/*"/>
      </ul>
     </div>
    </xsl:template>
       
    <xsl:template match="rss:channel|rss:item">
     <li>
      <a href="{rss:link}">
       <xsl:value-of select="rss:title"/>
      </a>
     </li>
    </xsl:template>
    </xsl:stylesheet>
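
To see which nodes the XPath expressions in Example 8-8 select, the following sketch walks the same namespace-qualified paths with Python's standard `xml.etree.ElementTree` module. It is not an XSLT processor and only approximates the stylesheet's output; the sample feed and the names `SAMPLE`, `anchor`, and `render` are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# The same prefixes the stylesheet declares.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
}
RSS = "{http://purl.org/rss/1.0/}"

# Hypothetical RSS 1.0 feed, invented for illustration only.
SAMPLE = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://example.org/rss">
    <title>Example Feed</title>
    <link>http://example.org/</link>
    <description>A hypothetical feed</description>
    <items><rdf:Seq><rdf:li rdf:resource="http://example.org/1"/></rdf:Seq></items>
  </channel>
  <item rdf:about="http://example.org/1">
    <title>First item</title>
    <link>http://example.org/1</link>
  </item>
</rdf:RDF>"""


def anchor(node):
    # Equivalent of <a href="{rss:link}"><xsl:value-of select="rss:title"/></a>.
    href = node.findtext("rss:link", default="", namespaces=NS)
    title = node.findtext("rss:title", default="", namespaces=NS)
    return "<a href=%s>%s</a>" % (quoteattr(href), escape(title))


def render(feed_xml):
    root = ET.fromstring(feed_xml)          # the rdf:RDF element
    channel = root.find("rss:channel", NS)  # rdf:RDF/rss:channel
    # rdf:RDF/* visits every child; only channel and item have a template.
    rows = [
        "<li>%s</li>" % anchor(child)
        for child in root
        if child.tag in (RSS + "channel", RSS + "item")
    ]
    return "<div>%s</div>\n<div><ul>%s</ul></div>" % (anchor(channel), "".join(rows))
```

Running `render(SAMPLE)` makes one detail of the stylesheet visible: because `rdf:RDF/*` also selects the channel, and the second template matches `rss:channel` as well as `rss:item`, the channel's link appears once as the heading and again as the first list entry.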

    Again, just like the parsing code in Example 8-5, it is easy to extend this stylesheet to take the modules into account. Example 8-9 extends Example 8-8 to look for the description, dc:creator, and dc:date elements. The changes are the dc namespace declaration, its addition to exclude-result-prefixes, and the three new blocks inside the item template.

    Example 8-9. Making the XSLT stylesheet more useful
    <?xml version="1.0"?>
       
    <xsl:stylesheet version = '1.0'
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rss="http://purl.org/rss/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="rss rdf dc"
    >
    <xsl:output method="html"/>
       
    <xsl:template match="/">
     <div >
      <a href="{rdf:RDF/rss:channel/rss:link}">
       <xsl:value-of select="rdf:RDF/rss:channel/rss:title"/>
      </a>
     </div>
     <div >
      <ul>
       <xsl:apply-templates select="rdf:RDF/*"/>
      </ul>
     </div>
    </xsl:template>
       
    <xsl:template match="rss:channel|rss:item">
     <li>
      <a href="{rss:link}"><xsl:value-of select="rss:title"/></a>
       <ol>
        <xsl:value-of select="rss:description"/>
       </ol>
       <ol>
        <xsl:text>Written by: </xsl:text>
        <xsl:value-of select="dc:creator"/>
       </ol>
       <ol>
        <xsl:text>Written on: </xsl:text>
        <xsl:value-of select="dc:date"/>
       </ol>
     </li>
    </xsl:template>
    </xsl:stylesheet>
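
One consequence of Example 8-9 is worth spelling out: in XSLT 1.0, `xsl:value-of` produces an empty string when the selected element is absent, so the "Written by:" and "Written on:" labels are emitted even for items that carry no Dublin Core metadata. The Python sketch below, again using the standard `xml.etree.ElementTree` module rather than an XSLT processor, performs the same lookups on a hypothetical item; `ITEM_XML` and `module_fields` are names invented for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical item, invented for illustration; it has dc:creator but no dc:date.
ITEM_XML = """<item xmlns="http://purl.org/rss/1.0/"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>First item</title>
  <link>http://example.org/1</link>
  <dc:creator>A. Writer</dc:creator>
</item>"""


def module_fields(item):
    """Return the three values Example 8-9 prints, with '' for absent elements."""
    return {
        "description": item.findtext("rss:description", default="", namespaces=NS),
        "creator": item.findtext("dc:creator", default="", namespaces=NS),
        "date": item.findtext("dc:date", default="", namespaces=NS),
    }


fields = module_fields(ET.fromstring(ITEM_XML))
```

Here `fields["creator"]` holds the author's name, while `fields["description"]` and `fields["date"]` are empty strings. A stylesheet that should skip the labels in that case would need to wrap each block in a test such as `xsl:if`.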