Perl Modules

You may have been tripped up by the word modules in the previous paragraph. Don't fret, dear reader. A module is simply an encapsulated bit of Perl code, written by someone else, that you employ in your own application. Because the module author has already handled the implementation details and much of the dirty work, using a module rather than writing all the code yourself makes a complicated task far, far easier. When we say we're going to install a module, we really mean we're going to get a copy from CPAN (http://www.cpan.org), test to make sure it'll work in our environment, ensure it doesn't require other modules that we don't yet have, install it, and then prepare it for general use within our own scripts.

Sounds pretty complicated, right? Repeat ad infinitum: don't fret, dear reader, as CPAN has you covered. One of Perl's greatest accomplishments, CPAN is a large and well-categorized selection of modules created and contributed by hundreds of authors. It is mirrored worldwide, and there's a good chance your "I wish I had a . . ." wonderings have already been answered, bug-tested, and packaged for your use.

Since CPAN is such a powerful accoutrement to the Perl language, the task of installing a module and ensuring its capabilities has been made far easier than the mumbo jumbo I uttered previously. We cover exactly how to install modules in our first hack of this chapter [Hack #8].

As you browse through this book, you'll see we use a number of noncore modules, where noncore is defined as "not already part of your Perl installation." Following are a few of the more popular ones you'll be using in your day-to-day scraping. Again, worry not if you don't understand some of this stuff; we'll cover it in time:


LWP

A package of modules for web access, also known as libwww-perl


LWP::Simple

Simple functions for getting web documents (http://search.cpan.org/author/GAAS/libwww-perl/lib/LWP/Simple.pm)


LWP::UserAgent

More powerful functions for implementing a spider that looks and acts more like a specialized web browser (http://search.cpan.org/author/GAAS/libwww-perl/lib/LWP/UserAgent.pm)


HTTP::Response

A programmatic encapsulation of the response to an HTTP request (http://search.cpan.org/author/GAAS/libwww-perl/lib/HTTP/Response.pm)


HTTP::Message and HTTP::Headers

Classes that provide more methods to HTTP::Response, such as convenient access to setting and getting HTTP headers (http://search.cpan.org/author/GAAS/libwww-perl/lib/HTTP/Message.pm, http://search.cpan.org/author/GAAS/libwww-perl/lib/HTTP/Headers.pm)


URI

Methods to operate on a web address, such as getting the base of a URL, turning a relative URL into an absolute, and returning the individual path segments (http://search.cpan.org/author/GAAS/URI/URI.pm)


URI::Escape

Functions for URL-escaping and URL-unescaping strings, such as turning "this & that" to "this%20%26%20that" and vice versa (http://search.cpan.org/author/GAAS/URI/URI/Escape.pm)


HTML::Entities

Functions for HTML-escaping and HTML-unescaping strings, such as turning "C. & E. Brontë" to "C. &amp; E. Bront&euml;" and vice versa (http://search.cpan.org/author/GAAS/HTML-Parser/lib/HTML/Entities.pm)


HTML::TokeParser and HTML::TreeBuilder

Classes for parsing HTML (http://search.cpan.org/author/GAAS/HTML-Parser/, http://search.cpan.org/author/SBURKE/HTML-Tree/)


WWW::Mechanize

Automates interaction with sites, including filling out forms, traversing links, sending referrers, accessing a history of URLs visited, and so on (http://search.cpan.org/author/PETDANCE/WWW-Mechanize/)
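The string-handling modules in this list can be tried without touching the network. The following is a minimal sketch, assuming URI, URI::Escape, and HTML::Entities are already installed; the example.com URLs and the variable names are made up for illustration and do not come from the book:

```perl
use strict;
use warnings;
use URI;
use URI::Escape    qw(uri_escape uri_unescape);
use HTML::Entities qw(encode_entities decode_entities);

# URI: turn a relative link into an absolute one, using the page it
# was found on as the base. Both URLs here are hypothetical.
my $base     = 'http://www.example.com/books/index.html';
my $absolute = URI->new_abs('../authors/bronte.html', $base);
my @segments = $absolute->path_segments;   # individual pieces of the path

# URI::Escape: make a string safe for use inside a URL, and back again.
my $escaped   = uri_escape('this & that');
my $unescaped = uri_unescape($escaped);

# HTML::Entities: make a string safe for use inside HTML, and back again.
# "\x{EB}" is the character e-with-diaeresis, written as an escape so the
# script does not depend on the encoding of the source file.
my $plain   = "C. & E. Bront\x{EB}";
my $encoded = encode_entities($plain);
my $decoded = decode_entities($encoded);

print "$absolute\n";   # http://www.example.com/authors/bronte.html
print "$escaped\n";    # this%20%26%20that
print "$encoded\n";    # C. &amp; E. Bront&euml;
```

Note that URL-escaping and HTML-escaping are different operations: the first protects characters that have meaning inside a web address, the second protects characters that have meaning inside markup.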

For more information on these modules once they are installed, you can issue a man Module::Name command, where Module::Name is the name of the module you're interested in. Online documentation is also available at CPAN.
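Before reaching for the documentation, it can help to know which of these modules your Perl can already load. Here is one possible way to check, offered as a sketch rather than as the book's method; module_available is a hypothetical helper name, and the list of modules is just a sample from the ones described above:

```perl
use strict;
use warnings;

# Hypothetical helper: return 1 if this Perl can load the named module,
# 0 otherwise. It only tries to load the module; it installs nothing.
sub module_available {
    my ($name) = @_;
    (my $file = "$name.pm") =~ s{::}{/}g;   # Module::Name -> Module/Name.pm
    return eval { require $file; 1 } ? 1 : 0;
}

# Report on a few of the noncore modules used throughout the book.
for my $module (qw(LWP::Simple LWP::UserAgent URI::Escape
                   HTML::Entities WWW::Mechanize)) {
    printf "%-16s %s\n", $module,
        module_available($module) ? 'installed' : 'missing';
}
```

Anything reported as missing will need to be installed from CPAN before scripts that use it will run.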



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
