Hack 4 Registering Your Spider


If you're programming a spider, or plan to use one even minimally, you need to make sure it can be easily identified. Even the most low-key spider can attract a lot of attention.

On the Internet, any number of "arms races" are going on at the same time. You know: spammers versus antispammers, file sharers versus non-file sharers, and so on. A lower-key arms race rages between web spiders and webmasters who don't want the attention.

Who might not want to be spidered? Unfortunately, not all spiders are as benevolent as the Googlebot, Google's own indexer. Many spiders go around searching for email addresses to spam. Still others don't abide by the rules of gentle scraping and data access [Hack #2]. Therefore, spiders have gotten to the point where they're viewed with deep suspicion by experienced webmasters.

In fact, it's gotten to the point where, when in doubt, your spider might be blocked. With that in mind, it's important to name your spider wisely, register it with online databases, and make sure it has a reasonably high profile online.

By the way, you might think that your spider is minimal or low-key enough that nobody's going to notice it. That's probably not the case. In fact, sites like Webmaster World (http://www.webmasterworld.com) have entire forums devoted to identifying and discussing spiders. Don't think that your spider is going to get ignored just because you're not using a thousand online servers and spidering millions of pages a day.

Naming Your Spider

The first thing you want to do is name your spider. Choose a name that gives some kind of indication of what your spider's about and what it does. Examplebot isn't a good name. NewsImageScraper is better. If you're planning to do a lot of development, consider including a version number (such as NewsImageScraper/1.03).

If you're running several spiders, you might want to consider giving them a common naming prefix. For example, if Kevin runs different spiders, he might consider giving them a naming convention starting with disobeycom: disobeycomNewsImageScraper, disobeycomCamSpider, disobeycomRSSfeeds, and so on. If you establish your spiders as polite and well behaved, a webmaster who sees a spider named similarly to yours might give it the benefit of the doubt. On the other hand, if you program rude, bandwidth-sucking spiders, giving them similar names makes it easier for webmasters to ban 'em all (which you deserve).
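As a minimal sketch of the naming advice above, the following Python builds a name from a shared prefix, a descriptive spider name, and a version number, then attaches it to a request as the User-Agent header. The helper spider_user_agent and the target URL http://www.example.com/ are illustrative assumptions, not part of any library.

```python
import urllib.request


def spider_user_agent(prefix, name, version):
    """Build a name such as 'disobeycomNewsImageScraper/1.03'.

    prefix, name, and version are whatever convention you settle on;
    this helper is hypothetical and only joins them consistently.
    """
    return f"{prefix}{name}/{version}"


agent = spider_user_agent("disobeycom", "NewsImageScraper", "1.03")

# Attach the name to a request. Nothing is sent over the network here;
# that would only happen if the request were passed to urlopen().
request = urllib.request.Request(
    "http://www.example.com/",  # placeholder URL
    headers={"User-Agent": agent},
)

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

Keeping the name in one helper means every spider you run reports itself the same way, which is what makes a shared prefix recognizable in a webmaster's logs.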

Considering what you're going to name your spider might give you what, at first glance, looks like a clever idea: why not just name your spider after one that already exists? After all, most corners of the web make their resources available to the Googlebot; why not just name your spider Googlebot?

As we noted earlier, this is a bad idea for many reasons. First, the owner of the spider you imitate is likely to ice your spider. There are web sites, like http://www.iplists.com, devoted to tracking IP addresses of legitimate spiders, so an impostor is easy to spot. (For example, there's a whole list associated with the legitimate Googlebot spider.) Second, though there isn't much legal precedent addressing fraudulent spiders, Google has already established that they don't take kindly to anyone misappropriating, or even just using without permission, the Google name.
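To see why a borrowed name is easy to catch, here is a rough sketch of the kind of check a webmaster could run against a published IP list. The address range below is a reserved documentation range standing in for real data, and KNOWN_RANGES and claim_is_plausible are hypothetical names; actual lists and the tools webmasters use will differ.

```python
import ipaddress

# Placeholder data: 192.0.2.0/24 is reserved for documentation and is
# NOT a real Googlebot range. A real list would come from a tracking site.
KNOWN_RANGES = {
    "Googlebot": [ipaddress.ip_network("192.0.2.0/24")],
}


def claim_is_plausible(user_agent, remote_ip):
    """Return False when a visitor uses a tracked spider's name
    but connects from an address outside that spider's listed ranges."""
    address = ipaddress.ip_address(remote_ip)
    for name, networks in KNOWN_RANGES.items():
        if name.lower() in user_agent.lower():
            return any(address in network for network in networks)
    # The name is not one we track, so there is nothing to contradict.
    return True


print(claim_is_plausible("Googlebot/2.1", "192.0.2.10"))
print(claim_is_plausible("Googlebot/2.1", "198.51.100.7"))
```

A spider with its own honest name never trips this kind of check, while an impostor fails it on the first request.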

A Web Page About Your Spider

Once you've created a spider, you'll need to register it. But I also believe you should create a web page for it, so a curious and mindful webmaster has to do no more than a quick search to find information. The page should include:

  • Its name, as it would appear in the logs (via User-Agent)

  • A brief summary of what the spider was intended for and what it does (as well as a link to the resources it provides, if they're publicly available)

  • Contact information for the spider's programmer

  • Information on what webmasters can do to block the script or, if they prefer, to make their information more available and usable to the spider
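For the blocking instructions in the last item, one option is to publish a ready-made robots.txt stanza on the page. The sketch below assumes your spider honors robots.txt and uses the example name NewsImageScraper; it builds the stanza and then uses Python's standard urllib.robotparser to confirm that it blocks your spider while leaving others alone. The bot name SomeOtherBot and the example.com URL are placeholders.

```python
from urllib.robotparser import RobotFileParser

SPIDER_NAME = "NewsImageScraper"  # the name as it appears in User-Agent

# Stanza a webmaster could paste into robots.txt to block this spider.
BLOCK_RULES = f"User-agent: {SPIDER_NAME}\nDisallow: /\n"

# Parse the stanza locally to check that it behaves as documented.
parser = RobotFileParser()
parser.parse(BLOCK_RULES.splitlines())

url = "http://www.example.com/images/"  # placeholder URL
print(parser.can_fetch("NewsImageScraper/1.03", url))  # our spider
print(parser.can_fetch("SomeOtherBot/1.0", url))       # an unrelated spider
```

Checking the stanza before publishing it avoids handing webmasters instructions that don't actually match the name your spider sends.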

Places to Register Your Spider

Even if you have a web page that describes your spider, be sure to register your spider at the online database spots. Why? Because webmasters might default to searching databases instead of doing web searches for spider names. Furthermore, webmasters might use databases as a basis for deciding which spiders they'll allow on their site. Here are some databases to get you started:


Web Robots Database (http://www.robotstxt.org/wc/active.html)

Viewable in several different formats. Adding your spider requires filling out a template and emailing it to a submission address.


Search engine robots (http://www.jafsoft.com/searchengines/webbots.html)

User-Agents and spiders organized into different categories (search engine robots, browsers, link checkers, and so on) with a list of "fakers" at the end, including some webmaster commentary.


List of User-Agents (http://www.psychedelix.com/agents.html)

Divided over several pages and updated often. There's no clear submission process, though there's an email address at the bottom of each page.


The User Agent Database (http://www.icehousedesigns.com/useragents/)

Almost 300 agents listed, searchable in several different ways. This site provides an email address to which you can submit your spider.



Spidering Hacks
ISBN: 0596005776
EAN: 2147483647
Year: 2005
Pages: 157
