Hack 52 Scraping Amazon.com Product Reviews

While Amazon.com has made some reviews available through their Web Services API, most are available only at the Amazon.com web site, requiring a little screen scraping to grab them.

If you've written a book called Spidering Hacks and you're interested in hearing what people are saying about it, you could run off to Amazon.com each and every day to check out the reviews. Well, you certainly could, but you wouldn't, else you'd deserve every bad comment that came your way. Here's a way to integrate Amazon.com reviews with your web site. Unlike linking or monitoring reviews for changes, this puts the entire text of Amazon.com reviews into your own pages.

The easiest and most reliable way to access customer reviews programmatically is through Amazon.com's Web Services API. Unfortunately, the API gives only a small window onto the larger number of reviews available. An API query for the book Cluetrain Manifesto, for example, includes only three user reviews. If you visit the review page for that book, though, you'll find 128 reviews. To dig deeper into the reviews available on Amazon.com and use all of them on your own web site, you'll need to spelunk a bit further into scripting.

The Code

This Perl script builds a URL to the review page for a given ASIN, uses regular expressions to find the reviews, and breaks the review into its pieces: rating, title, date, reviewer, and the text of the review.

Save the following script to a file called get_reviews.pl:

    #!/usr/bin/perl -w
    # get_reviews.pl
    #
    # A script to scrape Amazon, retrieve
    # reviews, and print them to standard output.
    # Usage: perl get_reviews.pl <asin>
    use strict;
    use LWP::Simple;

    # Take the ASIN from the command line.
    my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";

    # Assemble the URL from the passed ASIN.
    my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";

    # Set up unescape-HTML rules. Quicker than URI::Escape.
    # The keys are joined with "|" so the pattern matches any one entity.
    my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
    my $unescape_re = join '|' => keys %unescape;

    # Request the URL.
    my $content = get($url);
    die "Could not retrieve $url" unless $content;

    # Loop through the HTML, looking for matches
    while ($content =~ m!<img.*?stars-(\d)-0.gif.*?>.*?<b>(.*?)</b>, (.*?)\n.*?Reviewer:\n<b>\n(.*?)</b>.*?</table>\n(.*?)<br>\n<br>!mgis) {

        # Copy the five captures into named variables.
        my($rating,$title,$date,$reviewer,$review) =
                           ($1||'',$2||'',$3||'',$4||'',$5||'');
        $reviewer =~ s!<.+?>!!g;   # drop all HTML tags
        $reviewer =~ s!\(.+?\)!!g; # remove anything in parentheses
        $reviewer =~ s!\n!!g;      # remove newlines
        $review =~ s!<.+?>!!g;     # drop all HTML tags
        $review =~ s/($unescape_re)/$unescape{$1}/migs; # unescape.

        # Print the results
        print "$title\n" . "$date\n" . "by $reviewer\n" .
              "$rating stars.\n\n" . "$review\n\n";
    }
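The tag-stripping and entity-unescaping steps can be tried on their own, without fetching anything from Amazon.com. The following is a minimal sketch of that technique: the `clean_review` helper and the sample string are made up for illustration and are not part of the original script, and `quotemeta` is added here as a precaution that the script itself does not take.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The same three-entity lookup table the script uses.
my %unescape = ('&quot;' => '"', '&amp;' => '&', '&nbsp;' => ' ');

# Join the keys with "|" so the pattern matches any one of them.
# quotemeta is an added precaution, not something the script does.
my $unescape_re = join '|', map { quotemeta } keys %unescape;

# Hypothetical helper (not in the original script): strip tags,
# then replace each known entity with its character.
sub clean_review {
    my ($text) = @_;
    $text =~ s/<.+?>//gs;                       # drop all HTML tags
    $text =~ s/($unescape_re)/$unescape{$1}/g;  # look up each matched entity
    return $text;
}

# Made-up sample input, for illustration only.
my $sample = 'A &quot;must read&quot; for <b>spiders</b> &amp; scrapers.';
print clean_review($sample), "\n";
# prints: A "must read" for spiders & scrapers.
```

The lookup table only knows three entities; anything else passes through untouched. If reviews turn up with a wider range of entities, the `decode_entities` function from the HTML::Entities module is probably a better fit than growing the table by hand.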

Running the Hack

This script can be run from a command line, and it requires an ASIN, an Amazon.com unique ID that can be found in the Product Details of each and every product, listed as either "ISBN" or "ASIN", as shown in Figure 4-3.

Figure 4-3. Amazon.com's unique ID, listed as an ASIN or ISBN
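Because the ASIN goes straight into the URL, it may be worth checking its shape before making a request. The sketch below assumes that an ASIN or ISBN is ten letters and digits; that assumption, the `looks_like_asin` helper, and the sample values are all illustrative and do not come from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the argument looks like a
# ten-character ASIN or ISBN, 0 otherwise. The ten-alphanumeric
# rule is an assumption, not something Amazon.com documents here.
sub looks_like_asin {
    my ($candidate) = @_;
    return 0 unless defined $candidate;
    return $candidate =~ /\A[A-Za-z0-9]{10}\z/ ? 1 : 0;
}

# Made-up sample values, for illustration only.
for my $candidate ('0123456789', 'B00EXAMPLE', 'too-short', '') {
    printf "%-12s %s\n", "'$candidate'",
        looks_like_asin($candidate) ? 'looks like an ASIN' : 'rejected';
}
```

A check like this only catches typos and stray characters; it cannot tell whether the ID actually belongs to a product.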

The reviews are too long to read as they scroll past your screen, so it helps to send the information to a text file (in this case, reviews.txt), like so:

    % perl get_reviews.pl asin > reviews.txt

See Also