Hack 2 Best Practices for You and Your Spider


Some rules for the road as you're writing your own well-behaved spider.

In order to make your spider as effective, polite, and useful as possible, there are some general things you'll have to keep in mind as you create it.

Be Liberal in What You Accept

To spider, you must pull information from a web site. To pull information from a web site, you must wade your way through some flavor of tag soup, be it HTML, XML, plain text, or something else entirely. This is an inexact science, to put it mildly. If even one tag or bit of file formatting changes, your spider will probably break, leaving you dataless until such time as you retool. Thankfully, most sites aren't doing huge revamps every six months like they used to, but they still change often enough that you'll have to watch out for this.

To minimize the fragility of your scraping, use as little boundary data as you can when gleaning data from the page. Boundary data is the fluff around the actual goodness you want: the tags, superfluous verbiage, spaces, newlines, and such. For example, the title of an average web page looks something like this:

 <title>This is the title</title> 

If you're after the title, the boundary data is the <title> and </title> tags.
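
To see what this looks like in practice, here's a quick sketch that anchors on nothing but those two tags. It assumes the LWP::Simple module from CPAN, and the URL is just a placeholder:

#!/usr/bin/perl -w
# A minimal sketch: grab a page title using as little boundary data
# as possible. The URL is a placeholder.
use strict;
use LWP::Simple;

my $html = get("http://www.example.com/") or die "Couldn't fetch the page\n";

# Anchor on the <title> tags only; everything between them is the data.
# /s lets the match span newlines, /i ignores case.
if ($html =~ m{<title>(.*?)</title>}is) {
    print "Title: $1\n";
}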

Monitor your spider's output on a regular basis to make sure it's working as expected [Hack #31], make the appropriate adjustments as soon as possible to avoid losing ground with your data gathering, and design your spider to be as adaptive to site redesigns [Hack #32] as possible.

Don't Limit Your Dataset

Just because you're working with the Web doesn't mean you're restricted to spidering HTML documents. If you're considering only web pages, you're potentially narrowing your dataset arbitrarily. There are images, sounds, movies, PDFs, and text files, all worthy of spidering for your collection.

Don't Reinvent the Wheel

While it's tempting to think what you're up to is unique, chances are, someone's already spidered and scraped the same or similar sites, leaving clear footprints in the form of code, raw data, or instructions.

CPAN (http://www.cpan.org), the Comprehensive Perl Archive Network, is a treasure trove of Perl modules for programming to the Internet, shuffling through text in search of data, and manipulating gleaned datasets: all the functionality you're bound to be building into your spider. And these modules are free to take, use, alter, and augment. Who knows? By the time you finish your spider, perhaps you'll end up with a module or three of your own to pass on to the next guy.

Before you even start coding, check the site to make sure you're not spending an awful lot of effort building something the site already offers. If you want a weather forecast delivered to your email inbox every morning, check your local newspaper's site or sites like weather.com (http://www.weather.com) to see if they offer such a service; they probably do. If you want the site's content as an RSS feed and they don't appear to sport that orange "XML" button, try a Google search for it (rss site:example.com (filetype:rss | filetype:xml | filetype:rdf)) or check Syndic8 (http://www.syndic8.com) for an original or scraped version.
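
If you'd rather have your code do some of that checking, a rough sketch like the following looks for the RSS or Atom autodiscovery <link> tag that many sites tuck into their page headers. It assumes LWP::Simple, and the URL is a placeholder:

#!/usr/bin/perl -w
# Rough sketch: look for a feed autodiscovery <link> tag before
# deciding to scrape. The URL is a placeholder.
use strict;
use LWP::Simple;

my $html = get("http://www.example.com/") or die "Couldn't fetch the page\n";

if ($html =~ m{<link[^>]+type=["']application/(?:rss|atom)\+xml["']}is) {
    print "This site already advertises a feed; no scraping needed.\n";
} else {
    print "No autodiscovery link found; check Syndic8 or ask the owner.\n";
}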

Then, of course, you can always contact the site owner, asking him if he has a particular service or data format available for public consumption. Your query might just be the one that convinces him that an RSS feed of or web service API to his content is a good idea.

See [Hack #100] for more pointers on scraping resources and communities.

Best Practices for You

Just as it is important to follow certain rules when programming your spider, it's important to follow certain rules when designing it as well.

Choose the most structured format available

HTML files are fairly unstructured, focusing more on presentation than on the underlying raw data. Often, sites have more than one flavor of their content available; look or ask for the XHTML or XML version, which are cleaner and more structured file formats. RSS, a simple form of XML, is everywhere.
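
To give you an idea of how much easier structured data is to work with, here's a sketch that pulls headlines from an RSS feed using the XML::RSS module from CPAN. The feed URL is a placeholder:

#!/usr/bin/perl -w
# Sketch: reading a structured RSS feed instead of scraping HTML.
# Assumes the XML::RSS and LWP::Simple modules; the feed URL is a placeholder.
use strict;
use LWP::Simple;
use XML::RSS;

my $feed = get("http://www.example.com/index.rss") or die "Couldn't fetch the feed\n";

my $rss = XML::RSS->new;
$rss->parse($feed);

# Each item arrives already broken into fields; no tag soup to wade through.
foreach my $item (@{ $rss->{items} }) {
    print "$item->{title}\n  $item->{link}\n";
}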

If you must scrape HTML, do so sparingly

If the information you want is available only embedded in an HTML page, try to find a "Text Only" or "Print this Page" variant; these usually have far less complicated HTML and a higher content-to-presentation markup quotient, and they don't tend to change all that much (by comparison) during site redesigns.

Regardless of what you eventually use as your source data, try to scrape as little HTML surrounding the information you want as possible. You want just enough HTML to uniquely identify the information you desire. The less HTML, the less fragile your spider will be. See "Anatomy of an HTML Page" [Hack #3] for more information.

Use the right tool for the job

Should you scrape the page using regular expressions? Or would a more comprehensive tool like WWW::Mechanize [Hack #22] or HTML::TokeParser [Hack #20] fit the bill better? This depends very much on the data you're after and the crafting of the page's HTML. Is it handcrafted and irregular, or is it tool-built and regular as a bran muffin? Choose the simplest and least fragile method for the job at hand, with an emphasis on the latter.
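
Just to give you a taste of how little code the more comprehensive route can take, here's a sketch of the WWW::Mechanize approach; the URL and link text are placeholders, and [Hack #22] covers the module properly:

#!/usr/bin/perl -w
# Sketch: letting WWW::Mechanize handle fetching and link-following
# instead of hand-rolled regular expressions. URL and link text are placeholders.
use strict;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get("http://www.example.com/");

# Follow a link by its visible text, then work with the resulting page.
$mech->follow_link(text => "Archives");
print $mech->title, "\n";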

Don't go where you're not wanted

Your script may be the coolest thing ever, but it doesn't matter if the site you want to spider doesn't allow it. Before you go to all that trouble, make sure that the site doesn't mind being spidered and that you're doing it in such a way that you're having the minimal possible impact on site bandwidth and resources [Hack #16]. For more information on this issue, including possible legal risks, see [Hack #6] and [Hack #17].

Choose a good identifier

When you're writing an identifier for your spider, choose one that clearly specifies what the spider does: what information it's intended to scrape and what it's used for. There's no need to write a novel; a sentence will do fine. These identifiers are called User-Agents, and you'll learn how to set them in [Hack #11].
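
The details are in [Hack #11], but the general idea looks something like this sketch, which uses LWP::UserAgent; the spider name and info URL are made up:

#!/usr/bin/perl -w
# Sketch: giving a spider a clear, honest identifier. The name and
# info URL are made up; see [Hack #11] for the full story.
use strict;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# A short, descriptive name plus a URL where people can read about the spider.
$ua->agent("WeatherGrabber/1.0 (+http://www.example.com/spider.html)");

my $response = $ua->get("http://www.example.com/forecast.html");
print $response->status_line, "\n";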

Whatever you do, do not use an identifier that impersonates an existing spider, such as Googlebot, or an identifier that's confusingly similar to an existing spider. Not only will your spider get iced, you'll also get into trouble with Google or whoever you're imitating. See [Hack #6] for the possible consequences of playing mimicry games.

Make information on your spider readily available

Put up a web page that provides information about your spider and a contact address. Be sure that it's accessible from your friendly neighborhood search engine. See [Hack #4] for some ways and places to get the word out about its existence.

Don't demand unlimited site access or support

You may have written the greatest application since Google's PageRank, but it's up to the webmaster to decide if that entitles you to more access to site content or restricted areas. Ask nicely, and don't demand. Share what you're doing; consider giving them the code! After all, you're scraping information from their web site. It's only fair that you share the program that makes use of their information.

Best Practices for Your Spider

When you write your spider, there are some good manners you should follow.

Respect robots.txt

robots.txt is a file that lives at the root of a site and tells spiders what they can and cannot access on that server. It can even tell particular spiders to leave the site entirely unseen. Many webmasters use your spider's respect, or lack thereof, for robots.txt as a benchmark; if you ignore it, you'll likely be banned. See [Hack #17] for detailed guidelines.

Secondary to the robots.txt file is the Robots META tag (http://www.robotstxt.org/wc/exclusion.html#meta), which gives indexing instructions to spiders on a page-by-page basis. The Robots META tag protocol is not nearly as universal as robots.txt, and fewer spiders respect it.
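
One easy way to respect robots.txt from Perl is the LWP::RobotUA module, which fetches and checks the file for you before each request and politely spaces out its visits. This is only a sketch; the spider name, email address, and URL are made up:

#!/usr/bin/perl -w
# Sketch: LWP::RobotUA checks robots.txt and refuses to fetch pages
# the site has declared off-limits. Names and URLs are made up.
use strict;
use LWP::RobotUA;

my $ua = LWP::RobotUA->new("WeatherGrabber/1.0", 'spider@example.com');
$ua->delay(1);    # wait at least one minute between requests to a host

my $response = $ua->get("http://www.example.com/forecast.html");
if ($response->is_success) {
    print "Fetched ", length($response->content), " bytes\n";
} else {
    # A robots.txt disallow shows up here as an error response.
    print "Not fetched: ", $response->status_line, "\n";
}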


Go light on the bandwidth

You might love a site's content and want to make the most of it for your application, but that's no reason to be greedy. If your spider tries to slurp up too much content in a short stretch of time (dozens or even hundreds of pages per second), you could hurt both the bandwidth allowances of the site you're scraping and the ability of other visitors to access the site. This is often called hammering (as in, "That stupid spider is hammering my site and the page delivery has slowed to a crawl!").

There is no firm agreement on how quickly spiders can politely request pages. One or two requests per second has been proposed by contributors to WebmasterWorld.com.

WebmasterWorld.com (http://www.webmasterworld.com) is an online gathering of search engine enthusiasts and webmasters from all over the world. Many good discussions happen there. The best part about WebmasterWorld.com is that representatives from several search engines and sites participate in the discussions.


Unfortunately, it seems that it's easier to define what's unacceptable than to figure out a proper limit. If you're patient, one or two requests a second is probably fine; beyond that, you run the risk of making somebody mad. Anywhere's walking distance if you have the time; in the same manner, if you're in no rush to retrieve the data, impart that to your spider. Refer to [Hack #16] for more information on minimizing the amount of bandwidth you consume.
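
A crude but effective throttle is simply to pause between requests, as in this sketch; it assumes LWP::Simple, and the URLs are placeholders:

#!/usr/bin/perl -w
# Sketch: sleeping between requests keeps the spider from hammering
# the site. The URLs are placeholders.
use strict;
use LWP::Simple;

my @urls = (
    "http://www.example.com/page1.html",
    "http://www.example.com/page2.html",
    "http://www.example.com/page3.html",
);

foreach my $url (@urls) {
    my $content = get($url);
    warn "Couldn't fetch $url\n" unless defined $content;

    # Do something with $content here, then wait before the next request.
    sleep 2;
}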

Take just enough, and don't take too often

Overscraping is, simply, taking more than you need and thus taking more of the site's bandwidth than necessary. If you need a page, take a page. Don't take the entire directory or (heaven help you) the entire site.

This also applies to time. Don't scrape the site any more often than is necessary. If your program will run with data scraped from the site once a day, stick with that. I wouldn't go more than once an hour, unless I absolutely had to (and had permission from the site owner).
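
One cheap way to avoid refetching what you already have is a conditional GET: LWP::Simple's mirror function sends an If-Modified-Since header and downloads the page only if it has changed since your local copy was saved. A sketch, with a placeholder URL and filename:

#!/usr/bin/perl -w
# Sketch: mirror() downloads the page only if it has changed since the
# local copy was written. URL and filename are placeholders.
use strict;
use LWP::Simple qw(mirror is_success);
use HTTP::Status qw(RC_NOT_MODIFIED);

my $status = mirror("http://www.example.com/forecast.html", "forecast.html");

if ($status == RC_NOT_MODIFIED) {
    print "Page hasn't changed; using the copy we already have.\n";
} elsif (is_success($status)) {
    print "Page changed; local copy updated.\n";
} else {
    print "Fetch failed with status $status\n";
}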


