Hack 6 Keeping Your Spider Out of Sticky Situations

You see tasty data here, there, and everywhere. Before you dive in, check the site's acceptable use policies.

Because the point of Spidering Hacks is to get to data that APIs can't (or haven't been created to) reach, sometimes you might end up in a legal gray area. Here's what you can do to help make sure you don't get anywhere near a "cease and desist" letter or the threat of a lawsuit.

Perhaps, one fine day, you visit a site and find some data you'd simply love to get your hands on. Before you start hacking, it behooves you to spend a little time looking around for an Acceptable Use Policy (AUP) or Terms of Service (TOS), or occasionally a Terms of Use (TOU), and familiarize yourself with what you can and can't do with the site itself and its underlying data. Usually, you'll find a link at the bottom of the home page, often along with the site's copyright information. Yahoo! has a Terms of Service link as almost the last entry on its front page, while Google's is at the bottom of its About page. If you can't find it on the front page, look at the corporate information or any About sections. In some cases, sites (mostly smaller ones) won't have one, so you should consider contacting the webmaster (just about always webmaster@sitename.com) and asking.

So, you've found the AUP or TOS. Just what is it you're supposed to be looking for? What you're after is anything that has to do with spidering or scraping data. In the case of eBay, their position is made clear with this excerpt from their User Agreement:

You agree that you will not use any robot, spider, scraper or other automated means to access the Site for any purpose without our express written permission.

Clear enough, isn't it? But sometimes it won't be this obvious. Some usage agreements don't make any reference whatsoever to spidering or scraping. In such cases, look for a contact address for the site itself or technical issues relating to its operation, and ask.
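When a usage agreement runs to many pages, a quick keyword pass can point you at the sentences worth reading closely. What follows is only a rough sketch in Python; the function name find_automation_clauses and the keyword list are assumptions made up for illustration, not anything a site or this book prescribes. An empty result never means that spidering is permitted, only that the agreement doesn't use these particular words, in which case you should still ask.

```python
import re

# Hypothetical keyword list (an assumption); extend it for the agreements you read.
KEYWORDS = ("robot", "spider", "scraper", "scraping", "crawler", "automated")


def find_automation_clauses(text, keywords=KEYWORDS):
    """Return the sentences of `text` that mention any keyword (case-insensitive)."""
    # Naive sentence split: break after ., !, or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Match keywords as word prefixes, so "spidering" and "robots" also count.
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, keywords)) + r")", re.IGNORECASE
    )
    return [s for s in sentences if pattern.search(s)]


if __name__ == "__main__":
    excerpt = (
        "You agree that you will not use any robot, spider, scraper or other "
        "automated means to access the Site for any purpose without our "
        "express written permission."
    )
    for sentence in find_automation_clauses(excerpt):
        print(sentence)
```

Treat the output as a reading list, not a verdict: the sentence splitting is deliberately crude, and legal text often spreads a single restriction across several clauses.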

Bad Spider, No Biscuit!

Even with adherence to the terms of service and usage agreements you find on its pages, a web site might simply have a problem with how you're using its data. There are several ways in which a spider might be obeying the letter of a service agreement yet still doing something unacceptable from the perspective of the owners of the content. For example, a site might say that it doesn't want its content republished on a web site. Then, a spider comes along and turns its information into an RSS feed. An RSS feed is not, technically speaking, a web page. But the site owners might still find this use unacceptable. There is nothing stopping a disgruntled site from revising its TOS to deny a spider's access, and then sending you a "cease and desist" letter.

But let's go beyond that for a minute. Of course we don't want you to violate Terms of Service, dance with lawyers, and so on. The Terms of Service are there for a reason. Usually, they're the parameters under which a site needs to operate in order to stay in business. Whatever your spider does, it needs to do it in the spirit of keeping the site from which it draws information healthy. If you write a spider that sucks away all information from advertiser-supported sites, and they can't sell any more advertising, what happens? The site dies. You lose the site, and your program doesn't work any more.

Though it's rarely done in conjunction with spidering, framing data is a long-established legal no-no. Basically, framing data means that you're putting the content of someone else's site under a frame of your own design (in effect, branding another site's data with your own elements). The frame usually contains ads that are paying you for the privilege. Spidering another site's content and reappropriating it into your own framed pages is bad. Don't do it.

Violating Copyright

I shouldn't even have to say this, but reiteration is a must. If you're spidering for the purpose of using someone else's intellectual property on your web site, you're violating copyright law. I don't care if your spider is scrupulously obeying a site's Terms of Service and is the best-behaved spider in the world; it's still doing something illegal. In this case, you can't fix the spider; it's not the code that's at fault. Instead, you'd better fix the intent of the script you wrote. For more information about copyright and intellectual property on the Web, check out Lawrence Lessig's weblog at http://www.lessig.org/blog/ (Professor Lessig is a Professor of Law at Stanford Law School); the Electronic Frontier Foundation (http://www.eff.org); and Copyfight, the Politics of IP (http://www.copyfight.org/).

Aggregating Data

Aggregating data means gathering data from several different places and putting it together in one place. Think of a site that gathers different airline ticket prices in one place, or a site that compares prices from several different online bookstores. These are online aggregators, which represent a gray area in Internet etiquette. Some companies resent their data being aggregated and compared to the data on other sites (like comparison price shopping). Other companies don't care. Some companies actually have agreements with certain sites to have their information aggregated! You won't often find this spelled out in a site's Terms of Service, so when in doubt, ask.

Competitive Intelligence

Some sites complain because their competitors access and spider their data (data that's publicly available to any browser) and use it in their competitive activities. You might agree with them and you might not, but the fact is that such scraping has been the object of legal action in the past. Bidder's Edge was sued by eBay (http://pub.bna.com/lw/21200.htm) for such a spider.

Possible Consequences of Misbehaving Spiders

What's going to happen if you write a misbehaving spider and unleash it on the world? There are several possibilities. Often, sites will simply block your IP address. In the past, Google has blocked groups of IP addresses in an attempt to keep a single automated process from violating its TOS. Otherwise, the first course of action is usually a "cease and desist" letter, telling you to knock it off. From there, the conflict could escalate into a lawsuit, depending on your response.

Besides the damages that are assessed against people who lose lawsuits, some of the laws governing content and other web issues (copyright laws, for example) carry criminal penalties, which means fines and imprisonment in really extreme situations.

Writing a misbehaving spider is rarely going to have the police kicking down your door, unless you write something particularly egregious, like something that floods a web site with data or otherwise interferes with that site's ability to operate (referred to as denial of service). But considering lawyers' fees, the time it'll take out of your life, and the monetary penalties that might be imposed on you, a lawsuit is bad enough, and it's a good enough reason to make sure that your spiders are behaving and your intent is fair.
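One practical way to keep a spider from looking like a flood is to enforce a minimum pause between requests. The following is a minimal sketch in Python, not a prescribed technique: the Throttle class, its min_interval parameter, and the injectable clock and sleep arguments are all hypothetical names introduced here for illustration, and no particular delay is guaranteed to satisfy any given site.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive requests (illustrative only)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(min_interval=1.0)
    for page in ("first", "second", "third"):
        throttle.wait()  # call this before each fetch
        print("would fetch the", page, "page now")
```

Call wait() immediately before every fetch. If a site states its own preferred crawl rate, or tells you one when you ask, use that figure rather than a guess.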

Tracking Legal Issues

To keep an eye on ongoing legal and scraping issues, try the Blawg Search (http://blawgs.detod.com/) search engine, which indexes only weblogs that cover legal issues and events. Try a search for spider, scraper, or spider lawsuit. If you're really interested, note that Blawg Search's results are also available as an RSS feed for use in popular syndicated news aggregators. You could use one of the hacks in this book and start your own uber-RSS feed for intellectual property concerns.

Other resources for keeping on top of brewing and current legal issues include: Slashdot (http://slashdot.org/search.pl?topic=123), a popular geek hangout; the Electronic Frontier Foundation (http://www.eff.org), keeping tabs on digital rights; and the Berkman Center for Internet & Society at Harvard Law School (http://cyber.law.harvard.edu/home/), a research program studying cyberspace and its implications.



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157