Hack 84 Bargain Hunting with PHP


If you're always on the lookout for the best deals, coupons, and contests, a little bit of PHP-scraping code can help you stay up-to-date.

Scraping content is a task that can be handled by most programming languages. PHP is quickly becoming one of the most popular scripting languages, and it is particularly well-suited for scraping work. With a moderate grasp of PHP, programmers can write scrapers in a matter of minutes. In this section, we'll work through some of the basic code and concepts for scraping with PHP.

There are a handful of useful functions that most scraping tasks will need, which will make writing customized scrapers almost painless. For the sake of simplicity, we won't use regular expressions here, but the more agile programmers will quickly note where regular expressions might make these functions work better.

The first function that we want uses PHP's fopen( ) function to fetch individual pages from a web server. For more sophisticated scrapers, a direct socket connection is probably more desirable, but that's another matter. For now, we'll go the simple way:

    function getURL( $pURL ) {
       $_data = null;
       if( $_http = fopen( $pURL, "r" ) ) {
          while( !feof( $_http ) ) {
             $_data .= fgets( $_http, 1024 );
          }
          fclose( $_http );
       }
       return( $_data );
    }

Call the function like this:

 $_rawData = getURL( "http://www.example.com/" ); 

If $_rawData is null, the function wasn't able to fetch the page. If $_rawData contains a string, we're ready for the next step.
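The same null-on-failure contract can also be sketched with file_get_contents( ) and a stream context. This is a minimal sketch, not the hack's own method: the name fetchPage( ), the 10-second timeout, and the user-agent string are all assumptions made for illustration. Like fopen( ), it only fetches URLs when PHP's allow_url_fopen setting is enabled.

```php
<?php
/* Hypothetical variant of getURL( ). The function name, timeout, and
   user-agent string are illustrative assumptions. */
function fetchPage( $pURL, $pTimeout = 10 ) {
   /* HTTP options apply only to http:// URLs; other wrappers ignore them */
   $_context = stream_context_create( array(
      "http" => array(
         "timeout"    => $pTimeout,
         "user_agent" => "BargainScraper/0.1 (example)",
      ),
   ) );
   /* @ silences the warning PHP raises on failure; failure is reported
      to the caller as null, matching getURL( ) */
   $_data = @file_get_contents( $pURL, false, $_context );
   return ( $_data === false ) ? null : $_data;
}
```

A caller would test the result with `=== null` before parsing, exactly as with getURL( ).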

Because every author codes her HTML slightly differently, it's useful to normalize the raw HTML data that getURL( ) returns. We can do this with the cleanString( ) function. This function simply replaces newline, carriage return, and tab characters with spaces, and then collapses runs of spaces into one. Regular expressions could simplify this function a bit, if you are comfortable with them.

    function cleanString( $pString ) {
       /* Replace newlines, carriage returns, and tabs with spaces */
       $_data = str_replace( array( chr(10), chr(13), chr(9) ), chr(32), $pString );
       /* Collapse runs of spaces. The strict !== test matters because
          strpos( ) returns 0, not false, for a match at the very start */
       while( strpos( $_data, str_repeat( chr(32), 2 ), 0 ) !== false ) {
          $_data = str_replace( str_repeat( chr(32), 2 ), chr(32), $_data );
       }
       return( trim( $_data ) );
    }

We'll clean up the raw HTML source with the following code:

 $_rawData = cleanString( $_rawData ); 
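For readers who are comfortable with regular expressions, the normalization step might be sketched as a single preg_replace( ) call. The name cleanStringRegex( ) is hypothetical, and the behavior differs slightly from cleanString( ): the \s class also matches whitespace such as form feeds, which cleanString( ) leaves alone.

```php
<?php
/* Hypothetical regex-based counterpart to cleanString( ) */
function cleanStringRegex( $pString ) {
   /* \s+ matches any run of whitespace (spaces, tabs, newlines,
      carriage returns, and so on); each run becomes one space */
   return trim( preg_replace( '/\s+/', chr(32), $pString ) );
}
```

It can be dropped in wherever cleanString( ) is called, as long as the broader definition of whitespace is acceptable.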

Now, we have some data that is easy to parse. Two other useful functions will parse out particular pieces of the source and get data from individual HTML tags:

    function getBlock( $pStart, $pStop, $pSource, $pPrefix = true ) {
       $_data = null;
       $_lower = strtolower( $pSource );
       $_start = strpos( $_lower, strtolower( $pStart ), 0 );
       if( $_start !== false ) {
          /* Skip past the start string unless the caller wants it included */
          $_start = ( $pPrefix == false ) ? $_start + strlen( $pStart ) : $_start;
          $_stop = strpos( $_lower, strtolower( $pStop ), $_start );
          if( $_stop !== false && $_stop > $_start ) {
             $_data = trim( substr( $pSource, $_start, $_stop - $_start ) );
          }
       }
       return( $_data );
    }

    function getElement( $pElement, $pSource ) {
       $_data = null;
       $pElement = strtolower( $pElement );
       $_lower = strtolower( $pSource );
       /* Find the opening tag, then the ">" that closes it */
       $_start = strpos( $_lower, chr(60) . $pElement, 0 );
       if( $_start !== false ) {
          $_start = strpos( $pSource, chr(62), $_start );
       }
       if( $_start !== false ) {
          $_start = $_start + 1;
          $_stop = strpos( $_lower, "</" . $pElement . chr(62), $_start );
          if( $_stop !== false && $_stop > $_start ) {
             $_data = trim( substr( $pSource, $_start, $_stop - $_start ) );
          }
       }
       return( $_data );
    }

We can use each of these functions with the following code:

    $_rawData = getBlock( start_string, end_string, raw_source, include_start_string );
    $_rawData = getElement( html_tag, raw_source );

Let's assume for a moment that we have source code that contains the string "Total of 13 results", and we want just the number of results. We can use getBlock( ) to get that number with this code:

 $_count = getBlock( "Total of", "results", $_rawData, false ); 

This returns "13". If we set $pPrefix to true, $_count will be "Total of 13". Sometimes, you might want the start_string included, and other times, as in this case, you won't.

The getElement( ) function works basically the same way, but it is specifically designed for parsing HTML-style tags instead of dynamic strings. Let's say our example string is "Total of <b>13</b> results". In this case, it's easier to parse out the bold element:

 $_count = getElement( "b", $_rawData ); 

This returns "13" as well.
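As hinted earlier, a regular expression can handle both forms of the example string at once. The following is a minimal sketch under that assumption; getResultCount( ) is a hypothetical helper, not one of the hack's functions, and it is tied to the "Total of" wording used in the example.

```php
<?php
/* Hypothetical helper: returns the number that follows "Total of",
   whether or not it is wrapped in a <b> tag, or null if not found */
function getResultCount( $pSource ) {
   if( preg_match( '/Total of\s*(?:<b>)?\s*(\d+)/i', $pSource, $_match ) ) {
      return (int) $_match[1];
   }
   return null;
}
```

Unlike getBlock( ) and getElement( ), this returns an integer rather than a string, which is usually what you want from a result count.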

It's handy to put the scraping functions into an includable script, so you don't have to copy and paste them into every scraper you write. For the next example, save the four functions above into scrape_func.php, after an opening <?php tag.

Now that we have the basics covered, let's scrape a real page and see it in action. For this example, we'll scrape the latest deals list from TechDeals.net (http://www.techdeals.net).

The Code

Save the following code as bargains.php:

    <?php
    /* include the scraping functions script: */
    include( "scrape_func.php" );

    /* Next, we'll get the raw source code of
       the page using our getURL( ) function: */
    $_rawData = getURL( "http://www.techdeals.net/" );

    /* And clean up the raw source for easier parsing: */
    $_rawData = cleanString( $_rawData );

    /* The next step is a little more complex. Because we've already
       looked at the HTML source, we know that the items start and
       end with two particular strings. We'll use these strings to
       get the main data portion of the page: */
    $_rawData = getBlock( "<div class=\"NewsHeader\">",
                          "</div> <div id=\"MenuContainer\">", $_rawData );

    /* We now have the particular data that we want to parse into
       an itemized list. We do that by breaking the code into an
       array so we can loop through each item: */
    $_rawData = explode( "<div class=\"NewsHeader\">", $_rawData );

    /* Collects the output as one string (used in the mail example below) */
    $_text = "";

    /* While iterating through each value, we
       parse out the individual item portions: */
    foreach( $_rawData as $_rawBlock ) {
       $_item = array( );
       $_rawBlock = trim( $_rawBlock );
       if( strlen( $_rawBlock ) > 0 ) {
          /* The title of the item can be found in <h2> ... </h2> tags */
          $_item[ "title" ] = strip_tags( getElement( "h2", $_rawBlock ) );
          /* The link URL is found between
             http://www.techdeals.net/rd/go.php?id= and " */
          $_item[ "link" ] = getBlock( "http://www.techdeals.net/rd/go.php?id=",
                                       chr(34), $_rawBlock );
          /* Posting info is in <span> ... </span> tags */
          $_item[ "post" ] = strip_tags( getElement( "span", $_rawBlock ) );
          /* The description is found between a </div> and an <img tag */
          $_item[ "desc" ] = cleanString( strip_tags( getBlock( "</div>",
                                          "<img", $_rawBlock ) ) );
          /* Some descriptions are slightly different,
             so we need to clean them up a bit */
          if( strpos( $_item[ "desc" ], "Click here for the techdeal", 0 ) > 0 ) {
             $_marker = strpos( $_item[ "desc" ], "Click here for the techdeal", 0 );
             $_item[ "desc" ] = trim( substr( $_item[ "desc" ], 0, $_marker ) );
          }
          /* Print out the scraped data */
          print( implode( chr(10), $_item ) . chr(10) . chr(10) );
          /* Save the data as a string (used in the mail example below) */
          $_text .= implode( chr(10), $_item ) . chr(10) . chr(10);
       }
    }

Running the Hack

Invoke the script from the command line, like so:

    % php -q bargains.php
    Values on Video
    http://www.techdeals.net/rd/go.php?id=28
    Posted 08/06/03 by david
    TigerDirect has got the eVGA Geforce FX5200 Ultra 128MB video card with TV-Out & DVI for only 4.99+S/H after a  rebate.

    Potent Portable
    http://www.techdeals.net/rd/go.php?id=30
    Posted 08/06/03 by david
    Best Buy has got the VPR Matrix 220A5 2.2Ghz Notebook for just 49.99 with free shipping after 0 in rebates.

    ...etc...

Hacking the Hack

This output could be emailed easily, or you could even put it into an RSS feed. If you want to email it, you can use PHP's mail( ) function:

 mail( "me@foo.com", "Latest Tech Deals", $_text ); 

But how do you output RSS in PHP? While there are many ways to go about it, we'll use the simplest to keep everything concise. Creating an RSS 0.91 feed is a matter of three small sections of code: the channel metadata, the item block, and the closing channel tags:

    <rss version="0.91">
       <channel>
          <title><?= htmlentities( $_feedTitle ) ?></title>
          <link><?= htmlentities( $_feedLink ) ?></link>
          <description><?= htmlentities( $_feedDescription ) ?></description>
          <language>en-us</language>

          <item>
             <title><?= htmlentities( $_itemTitle ) ?></title>
             <link><?= htmlentities( $_itemLink ) ?></link>
             <description><?= htmlentities( $_itemDescription ) ?></description>
          </item>

       </channel>
    </rss>

By putting together these three simple blocks, we can quickly output a full RSS feed. For example, let's use our scraper and output RSS instead of plain text:

    <rss version="0.91">
       <channel>
          <title>TechDeals: Latest Deals</title>
          <link>http://www.techdeals.net/</link>
          <description>Latest deals from TechDeals.net (scraped)</description>
          <language>en-us</language>
    <?php
       include( "scrape_func.php" );
       $_rawData = getURL( "http://www.techdeals.net/" );
       $_rawData = cleanString( $_rawData );
       $_rawData = getBlock( "<div class=\"NewsHeader\">",
                             "</div> <div id=\"MenuContainer\">", $_rawData );
       $_rawData = explode( "<div class=\"NewsHeader\">", $_rawData );
       foreach( $_rawData as $_rawBlock ) {
          $_item = array( );
          $_rawBlock = trim( $_rawBlock );
          if( strlen( $_rawBlock ) > 0 ) {
             $_item[ "title" ] = strip_tags( getElement( "h2", $_rawBlock ) );
             $_item[ "link" ] = getBlock( "http://www.techdeals.net/rd/go.php?id=",
                                          chr(34), $_rawBlock );
             $_item[ "post" ] = strip_tags( getElement( "span", $_rawBlock ) );
             $_item[ "desc" ] = cleanString( strip_tags( getBlock( "</div>",
                                             "<img", $_rawBlock ) ) );
             if( strpos( $_item[ "desc" ], "Click for the techdeal", 0 ) > 0 ) {
                $_marker = strpos( $_item[ "desc" ], "Click for the techdeal", 0 );
                $_item[ "desc" ] = trim( substr( $_item[ "desc" ], 0, $_marker ) );
             }
    ?>
          <item>
             <title><?= $_item[ "title" ] ?></title>
             <link><?= $_item[ "link" ] ?></link>
             <description>
                <?= $_item[ "desc" ] . " (" . $_item[ "post" ] . ")" ?>
             </description>
          </item>
    <?php
          }
       }
    ?>
       </channel>
    </rss>

Keep in mind that this is the quick-and-dirty way to create RSS. If you plan on generating a lot of RSS, look into RSS 1.0 and build yourself a PHP class for the RSS-generating code.
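As a starting point for such a class, here is a minimal sketch. It assumes PHP 5 or later, and it still emits the RSS 0.91 structure shown earlier rather than RSS 1.0. The class name SimpleRssFeed and its methods are hypothetical. It also assumes the values passed in are plain text that has not already been entity-encoded, because every value is escaped once on output.

```php
<?php
/* Hypothetical RSS 0.91 builder; names are illustrative only */
class SimpleRssFeed {
   private $_title;
   private $_link;
   private $_description;
   private $_items = array();

   public function __construct( $pTitle, $pLink, $pDescription ) {
      $this->_title = $pTitle;
      $this->_link = $pLink;
      $this->_description = $pDescription;
   }

   /* Queue one item; nothing is escaped until render( ) */
   public function addItem( $pTitle, $pLink, $pDescription ) {
      $this->_items[] = array(
         "title"       => $pTitle,
         "link"        => $pLink,
         "description" => $pDescription,
      );
   }

   /* Escape &, <, >, and quotes so the value is safe inside XML */
   private function escape( $pValue ) {
      return htmlspecialchars( $pValue, ENT_QUOTES, "UTF-8" );
   }

   /* Build the whole feed as a string */
   public function render() {
      $_xml  = "<rss version=\"0.91\">\n";
      $_xml .= "   <channel>\n";
      $_xml .= "      <title>" . $this->escape( $this->_title ) . "</title>\n";
      $_xml .= "      <link>" . $this->escape( $this->_link ) . "</link>\n";
      $_xml .= "      <description>" . $this->escape( $this->_description )
             . "</description>\n";
      $_xml .= "      <language>en-us</language>\n";
      foreach( $this->_items as $_item ) {
         $_xml .= "      <item>\n";
         foreach( $_item as $_tag => $_value ) {
            $_xml .= "         <" . $_tag . ">" . $this->escape( $_value )
                   . "</" . $_tag . ">\n";
         }
         $_xml .= "      </item>\n";
      }
      $_xml .= "   </channel>\n</rss>\n";
      return $_xml;
   }
}
```

In the scraper loop, each parsed $_item would be handed to addItem( ), and a single print of render( ) after the loop would replace the interleaved PHP and XML.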

As you can see, a few simple functions and a few lines of code are all that is needed to make a usable scraper in PHP. Customizing the script and its output is a matter of personal whim. In this particular example, you could also parse out information about the comments that are included in the items, or you could merge in other bargain sites, like AbleShoppers (http://www.ableshopper.com) or Ben's Bargains (http://www.bensbargains.net).

James Linden



Spidering Hacks
ISBN: 0596005776
EAN: 2147483647
Year: 2005
Pages: 157
