Hack 58. Scraping Alexa's Competitive Data with Java

Alexa tracks the browsing habits of its millions of users daily. This hack aggregates the traffic statistics of multiple web properties into a single RSS file that you can subscribe to for daily updates.

Alexa (http://www.alexa.com) recently launched a section of its web site detailing the observed traffic of its millions of users on a daily basis. Using this freely available data, you can track the traffic of your site, or your competitors' sites, over time. We'll scrape this traffic data into an RSS file [Hack #94] for your consumption.

The Code

The hack consists of five Java classes, each handling a different aspect of downloading, parsing, and presenting Alexa's traffic data. The full code can be downloaded from this book's web site (http://www.oreilly.com/catalog/spiderhks/).

The primary class of our Java application (Report) allows you to pass a URL to Alexa's web site for every domain you're interested in tracking. The appropriate Alexa page is downloaded, and its content is parsed for the key bits of daily data. Once this data is organized, we mark it up for presentation and, finally, write the presentable file to disk.
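A rough sketch of what such a driver might look like follows; the Alexa URL pattern and the helper-method names are assumptions made for illustration (the real classes are in the downloadable code), but the flow mirrors the steps just described:

import java.util.ArrayList;
import java.util.List;

public class Report {
    // Domains to check when no arguments are passed on the command line.
    private static final String[] DEFAULT_SITES = { "oreilly.com", "example.com" };

    public static void main(String[] args) throws Exception {
        String[] sites = (args.length > 0) ? args : DEFAULT_SITES;
        List<TrafficBean> results = new ArrayList<TrafficBean>();
        for (String site : sites) {
            // Alexa's traffic-details page for this domain (URL pattern assumed).
            String url = "http://www.alexa.com/data/details/traffic_details?url=" + site;
            String body = Website.fetchBody(url);    // download, keep only the <body>
            results.add(Parse.extract(site, body));  // pull out the daily numbers
        }
        RSSWriter.write(results);                    // mark up and save as RSS
    }
}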

The first step (Website) streams the page's source into your computer's memory. We eliminate everything but the body of the page, since this is where all of our data lies.
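Here's a minimal sketch of that step, assuming a fetchBody() helper on the Website class (the method name is mine, not necessarily the book's):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class Website {
    public static String fetchBody(String address) throws Exception {
        StringBuilder source = new StringBuilder();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(address).openStream()));
        String line;
        while ((line = in.readLine()) != null) {
            source.append(line).append('\n');
        }
        in.close();

        // Keep only the body of the page; all of the traffic data lives there.
        String html = source.toString();
        int start = html.indexOf("<body");
        int end = html.lastIndexOf("</body>");
        return (start >= 0 && end > start) ? html.substring(start, end) : html;
    }
}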

Now that we have the page's source stored in memory, we need to identify the key data components within the myriad lines of HTML. Alexa's pages do not conform to strict XML, so string parsing (Parse) is our best and quickest route of attack.

We navigate through the page's source code serially, pulling the data we need and leaving a marker on our trail to speed up each subsequent search. Key phrases of text are identified in close vicinity to our key data so that we can consistently pull the correct values, regardless of the size of the web property.
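The sketch below illustrates this marker-and-landmark approach; the landmark phrases are placeholders, since Alexa's actual markup (and the downloadable Parse class) will differ:

public class Parse {
    public static TrafficBean extract(String site, String body) {
        TrafficBean bean = new TrafficBean();
        bean.setSite(site);
        int marker = 0;  // advances as each value is found, speeding later searches

        // Each pair is a phrase that appears just before the value we want
        // (the exact phrases here are assumptions for illustration).
        String[][] fields = {
            { "Traffic Rank for", "rank" },
            { "Reach per million users:", "reach" },
            { "Page views per user:", "views" }
        };
        for (String[] field : fields) {
            int at = body.indexOf(field[0], marker);
            if (at < 0) continue;                      // phrase not found; skip it
            int from = at + field[0].length();
            int to = body.indexOf("<", from);          // value runs up to the next tag
            if (to < 0) to = Math.min(from + 40, body.length());
            String value = body.substring(from, to).replaceAll("[^0-9.,%]", "");
            marker = to;                               // leave a marker on our trail
            if (field[1].equals("rank")) bean.setRank(value);
            else if (field[1].equals("reach")) bean.setReach(value);
            else bean.setPageViews(value);
        }
        return bean;
    }
}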

Now that we have all our data, we need somewhere to store our findings for use across multiple classes. We create an entity bean-style data object to hold each of the key pieces of data; our code for doing so is in TrafficBean.
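A sketch of such a holder might look like this; the field names are assumptions, and the downloadable TrafficBean may carry more detail:

public class TrafficBean {
    private String site;
    private String rank;       // Alexa traffic rank
    private String reach;      // reach per million users
    private String pageViews;  // page views per user

    public String getSite()              { return site; }
    public void setSite(String site)     { this.site = site; }
    public String getRank()              { return rank; }
    public void setRank(String rank)     { this.rank = rank; }
    public String getReach()             { return reach; }
    public void setReach(String reach)   { this.reach = reach; }
    public String getPageViews()         { return pageViews; }
    public void setPageViews(String v)   { this.pageViews = v; }
}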

Finally, we present our findings to whoever might be interested through an RSS file (RSSWriter). By default, the RSS file is saved to the current user's home directory (C:\Documents and Settings\$user on Microsoft Windows platforms, or /home/$user on most versions of Unix). It is assumed that you have sufficient write permissions within your home directory to perform this action.
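Here's a simplified sketch of that output step, writing an alexa.rss file (filename assumed) to the home directory. The downloadable RSSWriter presumably builds its document with Xerces; plain strings keep this illustration short:

import java.io.File;
import java.io.FileWriter;
import java.util.List;

public class RSSWriter {
    public static void write(List<TrafficBean> beans) throws Exception {
        // Save to the current user's home directory, as described above.
        File out = new File(System.getProperty("user.home"), "alexa.rss");
        FileWriter rss = new FileWriter(out);
        rss.write("<?xml version=\"1.0\"?>\n<rss version=\"2.0\">\n<channel>\n");
        rss.write("<title>Alexa traffic report</title>\n");
        rss.write("<link>http://www.alexa.com/</link>\n");
        rss.write("<description>Daily Alexa traffic numbers</description>\n");
        for (TrafficBean bean : beans) {
            // One item per tracked web property.
            rss.write("<item>\n");
            rss.write("  <title>" + bean.getSite() + "</title>\n");
            rss.write("  <description>Rank: " + bean.getRank()
                    + ", reach: " + bean.getReach()
                    + ", page views: " + bean.getPageViews() + "</description>\n");
            rss.write("</item>\n");
        }
        rss.write("</channel>\n</rss>\n");
        rss.close();
    }
}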

Running the Hack

The only external library required is Apache's Xerces for Java (http://xml.apache.org/xerces2-j/). Web property names should be hardcoded in the Report class to allow for consistent scheduled runs. You can pass domain strings in the format site.tld at runtime; if no parameters are passed, the code iterates through a previously created string array. You might also want to set yourself up with an RSS aggregator if you do not already have one. I use FeedDemon (http://www.bradsoft.com/feeddemon/index.asp) for Windows.
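Assuming the five source files and the Xerces jar sit in the current directory, compiling and running could look something like this (the jar filename is an assumption; on Windows, use ; instead of : as the classpath separator):

javac -classpath xercesImpl.jar *.java
java -classpath .:xercesImpl.jar Report oreilly.com amazon.com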

Hacking the Hack

Possibilities abound:

  • Set up a cron script [Hack #91] on your machine to generate a new report every evening.

  • Using the percentage numbers from the returned subdomains, calculate the total reach and views for each of the domains within the web property (see the sketch after this list).

  • Hook your findings into a database for larger comparison sets over time.

  • Using the RSS file as your data source, create a time series graph [Hack #62]. Use views or reach as your y-axis and time as your x-axis. Overlay all of your sites using different colors and save for use in reports.
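For the subdomain suggestion above, a hypothetical calculation might look like the following; all of the figures and names are made up for illustration:

public class SubdomainSplit {
    public static void main(String[] args) {
        double propertyReach = 1500.0;   // reach per million for the whole property
        double propertyViews = 6.2;      // page views per user for the whole property
        String[] subdomains = { "www.example.com", "mail.example.com" };
        double[] percent    = { 70.0, 30.0 };  // per-subdomain traffic percentages
        for (int i = 0; i < subdomains.length; i++) {
            // Multiply the property-wide figures by each subdomain's share.
            double share = percent[i] / 100.0;
            System.out.println(subdomains[i]
                    + ": reach " + (propertyReach * share)
                    + ", views " + (propertyViews * share));
        }
    }
}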

Niall Kennedy


