Hack 5 Preempting Discovery

Rather than await discovery, introduce yourself!

No matter how gentle and polite your spider is, sooner or later you're going to be noticed. Some webmaster's going to see what your spider is up to, and they're going to want some answers. Rather than wait for that to happen, why not take the initiative and make the first contact yourself? Let's look at the ways you can preempt discovery, make the arguments for your spider, and announce it to the world.

Making Contact

If you've written a great spider, why not tell the site about it? For a small site, this is relatively easy and painless: just look for the Feedback, About, or Contact links. For larger sites, though, figuring out whom to contact is more difficult. Try the technical contacts first, and then web feedback contacts. I've found that public relations contacts are usually best left for last. It's tempting to start with them, because their addresses are usually easy to find, but PR folk like to concentrate on dealing with press people (which you're probably not) and they probably won't know enough programming to understand your request. (PR people, this isn't meant pejoratively. We still love you. Keep helping us promote O'Reilly books. Kiss, kiss.)

If you absolutely can't find anyone to reach out to, try these three steps:

  1. Many sites, especially technical ones, have employees with weblogs. See if you can find them via a Google search. For example, if you're looking for Yahoo! employees, the search "work for yahoo" (weblog blog) does nicely. Sometimes, you can contact these people and let them know what you're doing, and they can either pass your email to someone who can approve it, or give you some other feedback.

  2. 99.9% of the time, an email to webmaster@ (e.g., webmaster@example.com) will be delivered. But there's no guarantee that anyone reads that mailbox more than once a month, if at all.

  3. If you're absolutely desperate, you can't find email addresses or contact information anywhere on the site, and your emails to webmaster@ have bounced, try looking up the domain registration at http://www.whois.org or a similar domain lookup site. Most of the time, you'll find a contact email in the registration record, but again, there's no guarantee that anyone checks it, or even that it's still active. And remember, a registration covers only the domain as a whole, not individual sections of a site. In other words, you might be able to get the contact information for www.example.com but not for www.example.com/resource/.
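
Steps 2 and 3 both start from the site's hostname rather than from the full URL of the page you want to spider. Here is a minimal sketch in Python of pulling out that hostname and building the two things you'd try. The function name contact_candidates is hypothetical, and stripping a leading "www." is a naive assumption that won't give the registered domain for every site.

```python
from urllib.parse import urlsplit


def contact_candidates(url):
    """Return (webmaster_address, whois_lookup_name) for a page URL.

    Hypothetical helper for illustration only.
    """
    # The path (/resource/ and so on) is irrelevant here: mail and
    # domain registrations are tied to the hostname, not to a page.
    host = urlsplit(url).hostname or ""
    # Naive assumption: dropping a leading "www." leaves the domain
    # that was actually registered. Real sites may need more care.
    domain = host[4:] if host.startswith("www.") else host
    return f"webmaster@{domain}", domain


address, lookup_name = contact_candidates("http://www.example.com/resource/")
print(address)      # address to try in step 2
print(lookup_name)  # name to type into a domain lookup site in step 3
```

Whether anyone answers at either address is, as noted above, another matter.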

Making the Arguments for Your Spider

Now that you have a contact address, make the case for your spider. If you can clearly describe what your spider's all about, great. But it may get to the point where you have to code up an example to show to the webmaster. If the person you're talking to isn't Perl-savvy, consider packaging your script as a standalone program she can run on her own machine, using Perl2Exe (http://www.indigostar.com/perl2exe.htm) or PAR (http://search.cpan.org/author/AUTRIJUS/PAR), and sending it to her to test drive.

Offer to show her the code. Explain what it does. Give samples of the output. If she really likes it, offer to let her distribute it from her site! Remember, all the average, nonprogramming webmaster is going to hear is "Hi! I wrote this Program and it Does Stuff to your site! Mind if I use it?" Understand if she wants a complete explanation and a little reassurance.

Making Your Spider Easy to Find and Learn About

Another good way to make sure that someone knows about your spider is to include contact information in the spider's User-Agent [Hack #11]. Contact information can be an email or a web address. Whatever it is, be sure to monitor the address and make sure the web site has adequate information.
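
As a rough illustration, here is one way to put contact details into a User-Agent string, sketched in Python with the standard library's urllib.request. The spider name, version, URL, and email address are all placeholders, and the "name/version (+url; email)" layout is just one common convention, not a requirement.

```python
import urllib.request


def build_user_agent(name, version, info_url, email):
    """Compose a User-Agent string that tells a webmaster whom to contact.

    Hypothetical helper; the layout is a convention, not a standard.
    """
    return f"{name}/{version} (+{info_url}; {email})"


def make_request(url, user_agent):
    """Build (but do not send) a request that carries the User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Placeholder values: substitute a page and a mailbox you actually monitor.
ua = build_user_agent(
    "ExampleSpider", "0.1",
    "http://www.example.com/spider.html", "spider@example.com",
)
req = make_request("http://www.example.com/", ua)
print(req.get_header("User-agent"))
```

Passing req to urllib.request.urlopen would send the request with this header, so the string is what shows up in the site's access logs.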

Considering Legal Issues

Despite making contact, getting permission, and putting plenty of information about your spider on the Web, you may still have questions. Is your spider illegal? Are you going to get in trouble for using it?

There are many open issues with respect to the laws relating to the Web, and cases, experts, and scholars (not to mention members of the Web community) disagree heartily on most of them. Getting permission and operating within its limits probably reduces your risk, particularly if the site's a small one (that is, run by a person or two instead of a corporation). If you don't have permission and the site's terms of service aren't clear, risk is greater. That's probably also true if you've not asked permission and you're spidering a site that makes an API available and has very overt terms of service (like Google).

Legal issues on the Internet are constantly evolving; the medium is just too new to make sweeping statements about fair use and what's going to be okay and what's not. It's not just how your spider does its work, but also what you do with what you collect. In fact, we need to warn you that just because a hack is in the book doesn't mean that we can promise that it won't create risks or that no webmaster will ever consider the hack a violation of the relevant terms of service or some other legal rights.

Use your common sense (don't suck everything off a web site, put it on yours, and think you're okay), keep copyright laws in mind (don't take entire wire service stories and stick them on your site), and ask permission (the worst thing they can say is no, right?). If you're really worried, your best results will come from talking to an experienced lawyer.



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157