Hack 54 Scraping Amazon.com Customer Advice


Screen scraping can give you access to Amazon.com community features not yet implemented through Amazon.com's public Web Services API. In this hack, we'll implement a script to scrape customer buying advice.

Customer buying advice isn't available through Amazon.com's Web Services API, so if you'd like to include this information on a remote site, you'll have to get it from Amazon.com's site through scraping. The first step to this hack is knowing where to find all the customer advice on one page. The following URL links directly to the advice page for a given ASIN (the unique ID Amazon.com displays for each product [Hack #52]):

 http://amazon.com/o/tg/detail/-/insert ASIN/?vi=advice

For example, here is the advice page for Mac OS X Hacks:

 http://amazon.com/o/tg/detail/-/0596004605/?vi=advice 
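The URL pattern above is easy to wrap in a small helper. The following is a minimal sketch, not part of the original hack: the name advice_url is hypothetical, and the check that an ASIN is ten uppercase letters or digits is an assumption added here to catch obvious typos.

```perl
use strict;
use warnings;

# Hypothetical helper: build the advice-page URL for a given ASIN,
# following the URL pattern shown above.
sub advice_url {
    my ($asin) = @_;
    $asin = '' unless defined $asin;

    # Assumption: an ASIN is 10 characters, uppercase letters or digits.
    die "Doesn't look like an ASIN: '$asin'\n"
        unless $asin =~ /\A[A-Z0-9]{10}\z/;

    return "http://amazon.com/o/tg/detail/-/$asin/?vi=advice";
}

print advice_url('0596004605'), "\n";
```

Run as is, this prints the Mac OS X Hacks advice URL shown above.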

The Code

This Perl script splits the advice page into two variables, based on the headings "in addition to" and "instead of." It then loops through those sections, using regular expressions to match each product's information. Finally, the script formats and prints the information.
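To see the split-then-match idea in isolation, here is a minimal sketch that runs against a small hand-written HTML fragment instead of a live page. The sample markup, the helper names split_advice and parse_rows, and the simplified section patterns are all assumptions made for illustration; real Amazon.com markup may differ and needs patterns tuned to it.

```perl
use strict;
use warnings;

# Hand-written sample markup (an assumption, not copied from a live page).
my $html = <<'HTML';
<b>Customers recommend in addition to this item:</b>
<table>
<tr><td width=10>1.</td>
<td width=90%><a href="/exec/obidos/ASIN/0596004508/ref=x">Mac OS X: The Missing Manual</a></td>
<td width=10% align=center>1</td></tr>
</table>
<b>Customers recommend instead of this item:</b>
<table>
<tr><td width=10>1.</td>
<td width=90%><a href="/exec/obidos/ASIN/0596004478/ref=x">Google Hacks</a></td>
<td width=10% align=center>2</td></tr>
</table>
HTML

# Cut the page into the two advice sections, keyed on their headings.
sub split_advice {
    my ($page) = @_;
    my ($in_addition) = $page =~ m!in addition to(.*?)(?:instead of|\z)!is;
    my ($instead)     = $page =~ m!instead of(.*)\z!is;
    return ($in_addition, $instead);
}

# Pull rank, ASIN, title, and recommendation count out of each table row.
sub parse_rows {
    my ($section) = @_;
    return () unless defined $section;
    my @rows;
    while ($section =~ m!<td width=10>(.*?)</td>\s*<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?<td width=10% align=center>(.*?)</td>!gis) {
        push @rows, { place => $1, asin => $2, title => $3, count => $4 };
    }
    return @rows;
}

my ($in_addition, $instead) = split_advice($html);
for my $row (parse_rows($in_addition)) {
    print "In addition to: $row->{place} $row->{title} ($row->{asin})\n";
}
for my $row (parse_rows($instead)) {
    print "Instead of: $row->{place} $row->{title} ($row->{asin})\n";
}
```

Returning each row as a hash keeps the matching separate from the printing, which makes the patterns easier to adjust when the page layout changes.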

Save the following script to a file called get_advice.pl:

 #!/usr/bin/perl -w
 # get_advice.pl
 #
 # A script to scrape Amazon to retrieve customer buying advice
 # Usage: perl get_advice.pl <asin>
 use strict;
 use LWP::Simple;

 # Take the ASIN from the command line.
 my $asin = shift @ARGV or die "Usage: perl get_advice.pl <asin>\n";

 # Assemble the URL from the passed ASIN.
 my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=advice";

 # Set up unescape-HTML rules: a lightweight stand-in for HTML::Entities.
 # The keys are joined with | so the pattern matches any one entity.
 my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
 my $unescape_re = join '|' => keys %unescape;

 # Request the URL.
 my $content = get($url);
 die "Could not retrieve $url" unless $content;

 # Get our matching data: one chunk of HTML per advice heading.
 my ($inAddition) = $content =~ m!in addition to(.*?)(instead of)?</td></tr>!mis;
 my ($instead)    = $content =~ m!recommendations instead of(.*?)</table>!mis;

 # Look for "in addition to" advice.
 if ($inAddition) { print "-- In Addition To --\n\n";
    while ($inAddition =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
        # Captures: rank, ASIN, title, number of recommendations.
        my ($place,$thisAsin,$title,$number) = ($1,$2,$3,$4);
        $title =~ s/($unescape_re)/$unescape{$1}/migs; #unescape HTML
        print "$place $title ($thisAsin)\n(Recommendations: $number)\n\n";
    }
 }

 # Look for "instead of" advice.
 if ($instead) { print "-- Instead Of --\n\n";
    while ($instead =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
        # Captures: rank, ASIN, title, number of recommendations.
        my ($place,$thisAsin,$title,$number) = ($1,$2,$3,$4);
        $title =~ s/($unescape_re)/$unescape{$1}/migs; #unescape HTML
        print "$place $title ($thisAsin)\n(Recommendations: $number)\n\n";
    }
 }
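The entity-unescaping step can be tried on its own. This is a minimal sketch, assuming the same three-entity table the script uses; the sub name unescape_html is hypothetical, and quotemeta is added here as a precaution so the keys are always treated as literal text. For full entity coverage, decode_entities from the HTML::Entities module is the more complete option.

```perl
use strict;
use warnings;

# The three entities the script handles, mapped to their plain characters.
my %unescape = ('&quot;' => '"', '&amp;' => '&', '&nbsp;' => ' ');

# Join the keys with | so the pattern matches any one entity.
my $unescape_re = join '|', map { quotemeta } keys %unescape;

# Hypothetical helper: replace each matched entity with its character.
sub unescape_html {
    my ($text) = @_;
    $text =~ s/($unescape_re)/$unescape{$1}/g;
    return $text;
}

print unescape_html('Tom &amp; Jerry: &quot;Greatest&nbsp;Hits&quot;'), "\n";
# prints: Tom & Jerry: "Greatest Hits"
```

Each entity is replaced in a single left-to-right pass, so text produced by one replacement is not scanned again.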

Running the Hack

You can run this script from the command line, passing in any ASIN. Here is the output for Mac OS X Hacks:

 % perl get_advice.pl 0596004605
 -- In Addition To --

 1. Mac OS X: The Missing Manual, Second Edition (0596004508)
 (Recommendations: 1)

 2. Mac Upgrade and Repair Bible, Third Edition (0764525948)
 (Recommendations: 1)

If the product has long lists of alternate products, send the output to a text file. This example sends all alternate product recommendations for Google Hacks to a file called advice.txt:

 % perl get_advice.pl 0596004478 > advice.txt

See Also

  • Amazon Hacks (http://oreilly.com/catalog/amazonhks/) by Paul Bausch

Paul Bausch



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
