Hack 100 Going Beyond the Book

As much as we would have liked to deliver a 1,500-page tome, sooner or later you're going to have to think outside the confines of this book.

There are two very frustrating things about writing computer books. The first is knowing that it's unlikely anyone will revere your book as a classic and be studying your "Twenty Ways to Use Caps Lock" 50 years from now. The other frustrating thing is the fact that you can't cover as much as you'd like in the confines of one book. So, because we don't want to leave you without direction, here are some other places you can go to find scraping information, resources, and code. Onward!

Using Google and Other Search Engines

If you have a question about why your code isn't working or which part of a site to scrape, you are of course welcome to visit the O'Reilly Hacks site at http://hacks.oreilly.com and participate in one of our discussions. But if that isn't enough, you can also search Google and see what turns up.

Say you want to find out whether there's a Perl module for scraping Yahoo! Finance. This query gives you plenty of resources:

 perl module "Yahoo Finance" 

Or you might have a question about a regular expression that you can't get to work. In that case, using keywords that describe what you want can work, as in:

 "remove html" Perl (regex | "regular expressions")

Mailing Lists

A search engine is the first place I go when I have a scraping/spidering question but can't think of the right community for it. But sometimes, if you want to debate the points of using a particular solution, or if you want an in-depth discussion of a certain module, single web pages don't work as well. In that case, you'll want to check out more community-oriented solutions. We'll list a few here.

Perl4Lib (http://perl4lib.perl.org/) focuses on Perl as used by librarians and information professionals. Why focus on that here? Because online libraries have some of the most extensive and well-organized information collections available on the Web, and they're only going to add more. This page features several Perl modules that deal with information collection and unique identifying information.

Speaking of Perl.org, there's a huge list of available mailing lists at http://lists.perl.org. Other lists you should check out here include libwww (for discussing LWP), perl-xml (using Perl with XML), and www-search (a discussion group for the WWW::Search modules).

Web Sites

If we were to talk about web sites that deal with Perl and offer Perl resources, we could be here for days. Let me just focus on three that offer different things.


CPAN (http://www.cpan.org)

CPAN (the Comprehensive Perl Archive Network) contains scripts, modules, source code, and binary distributions. There are also the aforementioned mailing lists and a pretty brief link list.


Perl Monks (http://www.perlmonks.org/index.pl)

Perl Monks is a great community. On the front page, you'll see several ongoing conversations about various aspects of Perl, a "CB" channel (online chat; it takes up a very small part of the righthand column), and a list of who's currently browsing the site. There are several other areas you might find useful, including snippets (bits of Perl code used for various tasks), Questions and Answers (divided up into categories; there's a nice section on regular expressions here), and tutorials.


WebmasterWorld (http://www.webmasterworld.com)

This might seem like an odd thing to include on this list. After all, its primary audience is webmasters. But it's here for several reasons: it has a Perl forum, extensive discussions about User-Agents and spidering (and you can learn a lot from the perspective of a webmaster), and some tools that will give you information about web pages from a spider's point of view.



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
