Hack 1 A Crash Course in Spidering and Scraping


A few of the whys and wherefores of spidering and scraping.

There is a wide and ever-increasing variety of computer programs gathering and sifting information, aggregating resources, and comparing data. Humans are just one part of a much larger and automated equation. But despite the variety of programs out there, they all have some basic characteristics in common.

Spiders are programs that traverse the Web, gathering information. If you've ever taken a gander at your own web site's logs, you'll see them peppered with User-Agent names like Googlebot, Scooter, and MSNbot. These are all spiders, or bots, as some prefer to call them.
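In a combined-format access log, the User-Agent is the last quoted field on each line. A minimal Python sketch of picking it out (the log line here is an invented example, not real traffic):

```python
import re

# One line in a hypothetical Apache combined-format access log.
LOG_LINE = ('66.249.66.1 - - [10/Oct/2003:13:55:36 -0700] "GET / HTTP/1.0" '
            '200 2326 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"')

def user_agent(line):
    """Return the last quoted field, which holds the User-Agent string."""
    fields = re.findall(r'"([^"]*)"', line)
    return fields[-1] if fields else None

print(user_agent(LOG_LINE))
```

Run against a whole log file, a loop over lines like this is enough to tally which bots are visiting and how often.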

Throughout this book, you'll hear us referring to spiders and scrapers. What's the difference? Broadly speaking, they're both programs that go out on the Internet and grab things. For the purposes of this book, however, it's probably best for you to think of spiders as programs that grab entire pages, files, or sets of either, while scrapers grab very specific bits of information within these files. For example, one of the spiders [Hack #44] in this book grabs entire collections of Yahoo! Group messages to turn into mailbox files for use by your email application, while one of the scrapers [Hack #76] grabs train schedule information. Spiders follow links, gathering up content, while scrapers pull data from web pages. Spiders and scrapers usually work in concert; you might have a program that uses a spider to follow links but then uses a scraper to gather particular information.
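That division of labor can be sketched in a few lines of Python using only the standard library. The page content, link targets, and "Next departure" field below are all invented for illustration:

```python
import re
from html.parser import HTMLParser

# A toy page standing in for a fetched document (hypothetical content).
PAGE = """
<html><body>
  <a href="/trains/schedule">Schedule</a>
  <a href="/about">About</a>
  <p>Next departure: 09:45</p>
</body></html>
"""

class LinkSpider(HTMLParser):
    """Spider side: collect the links a crawler would follow next."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def scrape_departure(html):
    """Scraper side: pull one specific bit of data out of the page."""
    match = re.search(r"Next departure: (\d{2}:\d{2})", html)
    return match.group(1) if match else None

spider = LinkSpider()
spider.feed(PAGE)
print(spider.links)            # what a spider queues up to fetch
print(scrape_departure(PAGE))  # the single datum a scraper wants
```

A real program working in concert would feed each URL the spider finds back into a fetch-and-scrape loop.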

Why Spider?

When learning about a technology or way of using technology, it's always good to ask the big question: why? Why bother to spider? Why take the time to write a spider, make sure it works as expected, get permission from the appropriate site's owner to use it, make it available to others, and spend time maintaining it? Trust us; once you've started using spiders, you'll find no end to the ways and places they can be used to make your online life easier:


Gain automated access to resources

Sure, you can visit every site you want to keep up with in your web browser every day, but wouldn't it be easier to have a program do it for you, passing on only content that should be of interest to you? Having a spider bring you the results of a favorite Google search can save you a lot of time, energy, and repetitive effort. The more you automate, the more time you can spend having fun with and making use of the data.


Gather information and present it in an alternate format

Gather marketing research in the form of search engine results and import them into Microsoft Excel for use in presentations or tracking over time [Hack #93]. Grab a copy of your favorite Yahoo! Groups archive in a form your mail program can read just like the contents of any other mailbox [Hack #43]. Keep up with the latest on your favorite sites without actually having to pay them a visit one after another [Hack #81]. Once you have raw data at your disposal, it can be repurposed, repackaged, and reformatted to your heart's content.


Aggregate otherwise disparate data sources

No web site is an island, but you wouldn't know it, given the difficulty of manually integrating data across various sites. Spidering automates this drudgery, providing a 15,000-foot view of otherwise disparate data. Watch Google results change over time [Hack #93] or combine syndicated content [Hack #69] from multiple weblogs into one RSS feed. Spiders can be trained to aggregate data, both across sources and over time.


Combine the functionalities of sites

There might be a search engine you love that doesn't do everything you want, while another fills some of those gaps but doesn't meet the need on its own. A spider can bridge the gap between two such resources [Hack #48], querying one and providing that information to another.


Find and gather specific kinds of information

Perhaps what you seek needs to be searched for first. A spider can run web queries on your behalf, filling out forms and sifting through the results [Hack #51].
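For forms submitted via GET, "filling out a form" usually amounts to encoding the form's fields into a query string and fetching the resulting URL. A minimal Python sketch, with made-up field names (a real site's search form will use its own):

```python
from urllib.parse import urlencode

# Hypothetical form fields; inspect the target form's HTML for the real names.
params = {"q": "perl spidering", "num": 10}
query_url = "http://www.example.com/search?" + urlencode(params)
print(query_url)
# A spider would then fetch query_url and hand the HTML to a scraper to sift.
```

urlencode also takes care of escaping spaces and punctuation, so the query string stays valid no matter what the user typed.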


Perform regular webmaster functions

Let a spider take care of the drudgery of daily webmastering. Have it check your HTML to be sure it is standards-compliant and tidy (http://tidy.sourceforge.net/), that your links aren't broken, or that you're not linking to any prurient content.
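Checking a link boils down to issuing a request and noting the response code. A minimal Python sketch using only the standard library (the User-Agent string is our own invention; identify your bot however you like):

```python
import urllib.request
import urllib.error

def check_link(url, timeout=10):
    """Return (url, status), where status is an HTTP code or an error string."""
    req = urllib.request.Request(
        url, method="HEAD",
        headers={"User-Agent": "link-checker-sketch"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return url, resp.status
    except urllib.error.HTTPError as e:
        return url, e.code           # e.g. 404 for a broken link
    except urllib.error.URLError as e:
        return url, str(e.reason)    # DNS failure, refused connection, etc.

print(check_link("http://www.example.com/"))  # e.g. a 200 status when reachable
```

HEAD is polite here: it asks the server for headers only, sparing both sides the cost of transferring pages you only want to verify.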

For more detail on spiders, robots, crawlers, and scrapers, visit the Web Robot FAQ at http://www.robotstxt.org/wc/faq.html.




Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
