Hack 13 Authentication, Cookies, and Proxies


Access restricted resources programmatically by supplying proper authentication tokens, cookies, or proxy server information.

Accessing public resources assumes that you have the correct privileges to do so. The vast majority of sites you encounter every day on the Web are usually wide open to any visitor anxious to satisfy his browsing desires. Some sites, however, require password authentication before you're allowed in. Still others will give you a special file called a cookie, without which you'll not get any further. And sometimes, your ISP or place of work may require that you use a proxy server, a sort of handholding middleman that preprocesses everything you view. All three of these techniques will break any LWP::UserAgent [Hack #10] code we've previously written.

Authentication

Many web sites restrict access to documents by using HTTP Authentication, a mechanism whereby the web server sends the browser an HTTP code that says "You are entering a protected realm, accessible only by rerequesting it along with some special authorization headers." Your typical web browser deals with this request by presenting you with a username/password prompt, as shown in Figure 2-1, passing whatever you enter back to the web server as the appropriate authentication headers.

Figure 2-1. A typical browser authentication prompt
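For the Basic scheme (the one the server in the example that follows announces), those "special authorization headers" boil down to a single Authorization header carrying the Base64-encoded username:password pair. The following is a minimal sketch of how that value is formed, using only the core MIME::Base64 module. The helper name basic_auth_header is made up for illustration; LWP builds this header for you once it has been given credentials.

```perl
#!/usr/bin/perl -w
use strict;
use MIME::Base64 qw(encode_base64);

# Hypothetical helper (not part of LWP): build the value of the
# Authorization header for HTTP Basic authentication.
sub basic_auth_header {
    my ($username, $password) = @_;
    # The empty second argument suppresses the trailing newline
    # that encode_base64 would otherwise append.
    return 'Basic ' . encode_base64("$username:$password", '');
}

print "Authorization: ", basic_auth_header('unicode-ml', 'unicode'), "\n";
```

Note that Base64 is an encoding, not encryption, so Basic credentials can be read by anyone able to see the request.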

For example, the Unicode.org administrators stop email-harvesting bots from spidering the contents of their mailing list archives by protecting them with HTTP Authentication and then publicly stating the username and password (at http://www.unicode.org/mail-arch/): namely, username "unicode-ml" and password "unicode".

Consider this URL, part of the protected area of the web site:

 http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html 

If you access this URL with a browser, you'll be prompted to "Enter username and password for `Unicode-MailList-Archives' at server `www.unicode.org'". Attempting to access this URL via LWP without providing the proper authentication will not work. Let's give it a whirl:

 #!/usr/bin/perl -w
 use strict;
 use LWP 5.64;

 my $browser = LWP::UserAgent->new;
 my $url = 'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html';

 # No credentials supplied, so the server should refuse the request.
 my $response = $browser->get($url);
 die "Error: ", $response->header('WWW-Authenticate') || 'Error accessing',
     "\n ", $response->status_line,
     "\n at $url\n Aborting"
     unless $response->is_success;

As expected, we didn't get very far:

 % perl get_protected_resource.pl
 Error: Basic realm="Unicode-MailList-Archives"
  401 Authorization Required
  at http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html

You've not told your $browser object about the username and password needed for that realm (Unicode-MailList-Archives) at that host (www.unicode.org). The fix is to provide the proper credentials to the $browser object using the credentials method:

 $browser->credentials(
     'servername:portnumber',
     'realm-name',
     'username' => 'password'
 );

In most cases, the port number is 80, the default TCP/IP port for HTTP, and you usually call the credentials method before you make any requests. So, getting access to the Unicode mailing list archives looks like this:

 $browser->credentials(
     'www.unicode.org:80',
     'Unicode-MailList-Archives',
     'unicode-ml' => 'unicode'
 );
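Putting the pieces together, here is a sketch of how the credentials call fits into the earlier script. It assumes the URI module (installed alongside LWP) to derive the host:port string from the URL rather than hardcoding port 80, and it reads the stored pair back with get_basic_credentials, the method LWP consults when a server demands authentication. The actual request sits behind a --fetch switch, which is an invention of this example, so the script can be run without touching the network.

```perl
#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;
use URI;

my $url     = 'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html';
my $realm   = 'Unicode-MailList-Archives';
my $browser = LWP::UserAgent->new;

# host_port yields "host:port", filling in the scheme's default port.
my $netloc = URI->new($url)->host_port;
$browser->credentials($netloc, $realm, 'unicode-ml' => 'unicode');

# Ask the browser object what it would offer for this realm at this URL.
my ($user, $pass) = $browser->get_basic_credentials($realm, URI->new($url));
print "Credentials on file for $netloc: $user\n";

# "--fetch" is a switch made up for this sketch; it triggers the real request.
if (@ARGV and $ARGV[0] eq '--fetch') {
    my $response = $browser->get($url);
    die "Error: ", $response->status_line, "\n at $url\n Aborting"
        unless $response->is_success;
    print "Fetched ", length($response->content), " bytes\n";
}
```

Deriving the host:port string from the URL keeps the two in step if the site ever moves to a nonstandard port.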

Enabling Cookies

You know those cards offered by your local carwash, pizza joint, or hairdresser, the ones you pick up the first time you go in and they stamp each time you visit? While the card itself does not identify you in any manner, it does keep track of how many times you've visited and, if you're lucky, when you're owed a free carwash, slice, or haircut for 10, 20, or however many stamps. Now imagine you have one of those cards for each of the popular sites you visit on the Web. That's the idea behind so-called cookies: a cookie jar filled with magic cookies slathered with identifiers and information in icing (yes, it is a slightly silly analogy).

Cookies are wodges of text issued by sites to your browser. Your browser keeps track of these and offers them up to the appropriate site upon your next visit. Some cookies simply keep track of your current session and aren't maintained for very long. Others keep track of your preferences from visit to visit. Still others hold identifying information and authentication tokens; you'll usually find these belonging to e-commerce sites like Amazon.com, eBay, E*Trade, your online banking system, library, and so forth.

The magic in these magic cookies is that all this happens behind the scenes; your browser manages the acquisition, offering, and maintenance of all the cookies in your jar. It is careful to pass only the appropriate cookie to the right site, it watches the expiration date and throws out old cookies, and it generally allows for an all-but-seamless experience for you.

Most browsers actually allow you to take a gander at the contents of your cookie jar (Safari on Mac OS X: Safari > Preferences... > Security > Show Cookies; Mozilla on any platform: Tools > Cookie Manager > Manage Stored Cookies; Internet Explorer on Windows: depends on OS/browser version, but generally, a folder called Temporary Internet Files or Cookies in your home directory). You might even be able to delete cookies, alter your cookie preferences so that you're warned of any incoming cookies, or indeed refuse cookies altogether, the latter two robbing you of some of the seamless experience I was just talking about.

A default LWP::UserAgent object acts like a browser with its cookie support turned off. You turn cookie support on by setting the LWP::UserAgent object's cookie_jar attribute, and there are several ways to do so. A cookie jar is an object representing a little database of all the HTTP cookies that a browser can know about. It can correspond to a file on disk (the way Netscape or Mozilla uses its cookies.txt file), or it can be just an in-memory object that starts out empty and whose collection of cookies will disappear once the program is finished running.

To use an in-memory empty cookie jar, set the cookie_jar attribute, like so:

 $browser->cookie_jar({}); 
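To see the bookkeeping a jar does, you can drive one by hand. The sketch below assumes the HTTP::Cookies and HTTP::Request modules that ship with LWP; the host names and the cookie itself are invented for illustration. It stores one cookie, then asks the jar to decorate two outgoing requests, showing that the cookie is offered only to the site it belongs to.

```perl
#!/usr/bin/perl -w
use strict;
use HTTP::Cookies;
use HTTP::Request;

my $jar = HTTP::Cookies->new;   # in-memory, starts out empty

# Arguments: version, name, value, path, domain, port,
#            path_spec, secure, max-age (seconds), discard
$jar->set_cookie(0, 'session', 'abc123', '/', 'www.example.com',
                 undef, 1, 0, 3600, 0);

# Hypothetical requests: one to the cookie's site, one elsewhere.
my $same_site  = HTTP::Request->new(GET => 'http://www.example.com/index.html');
my $other_site = HTTP::Request->new(GET => 'http://www.example.org/index.html');

# Let the jar add a Cookie header wherever one of its cookies applies.
$jar->add_cookie_header($_) for $same_site, $other_site;

for my $request ($same_site, $other_site) {
    my $cookie = $request->header('Cookie');
    printf "%s => %s\n", $request->uri,
        defined $cookie ? "Cookie: $cookie" : "no cookie sent";
}
```

Once a jar is attached with cookie_jar, the browser object does this matching for you on every request.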

To use a cookie jar that is read from a file on disk, with any modifications saved back to the file when the program is finished running, set the cookie_jar attribute like this:

 use HTTP::Cookies;
 $browser->cookie_jar( HTTP::Cookies->new(
     'file' => '/some/where/cookies.lwp',  # where to read/write cookies
     'autosave' => 1,                      # save it to disk when done
 ));
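The following sketch exercises a file-backed jar without touching the network. It writes to a temporary directory (via the core File::Temp module) rather than /some/where, the cookie and host are made up, and it calls save explicitly instead of waiting for autosave to fire when the jar is destroyed. A second jar pointed at the same file then picks the cookie up, which is how state can survive between runs of a spider.

```perl
#!/usr/bin/perl -w
use strict;
use File::Temp qw(tempdir);
use HTTP::Cookies;

my $dir  = tempdir(CLEANUP => 1);   # removed when the program exits
my $file = "$dir/cookies.lwp";

# First "run": store a cookie and write the jar to disk.
my $jar = HTTP::Cookies->new(file => $file);
$jar->set_cookie(0, 'visits', '3', '/', 'www.example.com',
                 undef, 1, 0, 86400, 0);   # max-age of one day
$jar->save;

# Second "run": a fresh jar loads whatever the file holds.
my $reloaded = HTTP::Cookies->new(file => $file);
print $reloaded->as_string;
```

The cookie is given a max-age and is not marked discard because, by default, cookies flagged as discardable are left out when the jar is saved.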

That file will be in an LWP-specific format. If you want to access the cookies in your Netscape cookies file, you can use HTTP::Cookies::Netscape:

 use HTTP::Cookies; # yes, loads HTTP::Cookies::Netscape too
 $browser->cookie_jar( HTTP::Cookies::Netscape->new(
     'file' => 'c:/Program Files/Netscape/Users/DIR-NAME-HERE/cookies.txt',
 ));

You could add an 'autosave' => 1 line as we did earlier, but it's uncertain whether Netscape will respect or simply discard some of the cookies you write programmatically back to disk.

Using Proxies

In some cases, you have to use proxies to access sites or use certain protocols. This is most commonly the case when your LWP program is running (or could be running) on a machine that is behind a firewall or in a business environment. More and more businesses are requiring their employees to use proxies, either to ensure that they're not playing online games during working hours, or to help prevent the accidental display of pornographic or otherwise offensive material.

When a proxy server is installed on a network, a special environment variable that points to its location can be defined. This environment variable, HTTP_PROXY, can be automatically understood and processed by programs that know of its existence. To ensure that LWP can utilize this information, just call the env_proxy method on a user-agent object before you go making any requests on it:

 use LWP::UserAgent;
 my $browser = LWP::UserAgent->new;
 $browser->env_proxy;

For more information on proxy parameters, see the LWP::UserAgent documentation (type perldoc LWP::UserAgent on the command line), specifically the proxy, env_proxy, and no_proxy methods.
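When no environment variable is set, or you want the script to decide for itself, the proxy and no_proxy methods mentioned above do the job directly. In this sketch the proxy URL and the exempted host names are placeholders; substitute whatever your network administrator gives you.

```perl
#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Placeholder proxy address: route http and ftp requests through it.
$browser->proxy(['http', 'ftp'], 'http://proxy.example.com:8080/');

# Placeholder hosts that should be contacted directly, bypassing the proxy.
$browser->no_proxy('localhost', 'intranet.example.com');

# Called with just a scheme, proxy() reports the current setting.
print "http requests go via ", $browser->proxy('http'), "\n";
```

As with env_proxy, make these calls before issuing any requests on the object.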

Sean Burke



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
