Hack 12 Posting Form Data with LWP


Automate form submission, whether that means authenticating with a username and password, supplying your Zip Code for location-based services, or simply filling out a number of customizable fields for search engines.

Say you search Google for "three blind mice" (quotes included). Your result URL will vary depending on the preferences you've set, but it will look something like this:

 http://www.google.com/search?num=100&hl=en&q=%22three+blind+mice%22 

The query itself turns into an ungodly mess, &q=%22three+blind+mice%22, but why? Whenever you send data through a form submission, that data has to be encoded so that it can safely arrive at its destination, the server, intact. Characters like spaces and quotes (in essence, anything not alphanumeric) must be turned into their encoded equivalents, like + and %22. LWP will automatically handle most of this encoding (and decoding) for you, but you can request it at will with URI::Escape's uri_escape and uri_unescape functions.
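To see the encoding for yourself, here is a minimal sketch that runs the quoted phrase from the URL above through URI::Escape. The variable names are illustrative only. Note that uri_escape turns a space into %20 rather than +; the + form belongs to form encoding, which the URI module's query_form applies for you.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI::Escape qw(uri_escape uri_unescape);

# The phrase as typed into the search box, quotes included
my $query = '"three blind mice"';

# Encode everything that is not safe to send in a URL
my $escaped = uri_escape($query);
print "$escaped\n";    # prints %22three%20blind%20mice%22

# Decoding gives back the original string
my $decoded = uri_unescape($escaped);
print "$decoded\n";    # prints "three blind mice"
```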

Let's break down what those other bits in the URL mean.

num=100 refers to the number of search results to a page, 100 in this case. Google accepts any number from 10 to 100. Altering the value of num in the URL and reloading the page is a nice shortcut for altering the preferred size of your result set without having to meander over to the Advanced Search (http://www.google.com/advanced_search?hl=en) and rerun your query.

hl=en means that the interface language (the language in which you use Google, reflected in the home page, messages, and buttons) is English. Google's Language Tools (http://www.google.com/language_tools?hl=en) provide a list of language choices.

The three variables q, num, and hl and their associated values represent a GET form request; you can always tell when you have one by the URL in your browser's address bar, where you'll see the URL, then a question mark (?), followed by key/value pairs separated by ampersands (&). To run the same search from within LWP, you use the URI module to assemble a URL with embedded key/value pairs, which is, in turn, passed to an existing LWP $browser object. Here's a simple example:

 #!/usr/bin/perl -w
 use strict;
 use LWP 5.64;
 use URI;

 my $browser = LWP::UserAgent->new;
 my $url = URI->new( 'http://www.google.com/search' );

 # the pairs:
 $url->query_form(
     'hl'  => 'en',
     'num' => '100',
     'q'   => 'three blind mice',
 );

 my $response = $browser->get($url);
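If you want to check what query_form has assembled before sending anything over the wire, you can simply print the URI object. The following is a small offline sketch (no request is made); it assumes the quoted form of the phrase so that the result resembles the browser URL shown earlier.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

my $url = URI->new('http://www.google.com/search');
$url->query_form(
    hl  => 'en',
    num => '100',
    q   => '"three blind mice"',    # the quotes are part of the query
);

# A URI object stringifies to the full, encoded URL
print "$url\n";

# Called in list context with no arguments, query_form
# hands back the decoded key/value pairs
my %pairs = $url->query_form;
print "$_ = $pairs{$_}\n" for sort keys %pairs;
```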

Many HTML forms, however, send data to their server using an HTTP POST request, which is not viewable in the resulting URL. The only way to discern which variables and values will be included in the request is to consult the source code of the form page itself. Here's a basic HTML form example using POST as its submission type:

 <form method="POST" action="/process"> <input type="hidden" name="formkey1" value="value1"> <input type="hidden" name="formkey2" value="value2"> <input type="submit" name="go" value="Go!"> </form> 
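Reading the form source by eye works, but the fields can also be listed programmatically. Here is a rough sketch using the HTML::Form module, assuming it is installed (it ships separately from recent LWP releases). The base URL http://www.example.com/ is a placeholder needed only to resolve the relative action.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Form;

# The example form, pasted in as a string
my $html = <<'END_HTML';
<form method="POST" action="/process">
<input type="hidden" name="formkey1" value="value1">
<input type="hidden" name="formkey2" value="value2">
<input type="submit" name="go" value="Go!">
</form>
END_HTML

# parse returns every form on the page; we take the first one
my ($form) = HTML::Form->parse( $html, 'http://www.example.com/' );

print $form->method, ' ', $form->action, "\n";

# List each input's type, name, and preset value
for my $input ( $form->inputs ) {
    printf "%-8s %-10s %s\n",
        $input->type, $input->name // '', $input->value // '';
}
```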

To simulate a POST from within LWP, call the $browser object's post method, passing it the URL and a reference to an array of key/value pairs. Simulating a POST from the previous form looks like this:

 $response = $browser->post( $url,
     [
       formkey1 => 'value1',
       formkey2 => 'value2',
       go       => 'Go!',
       # ...
     ],
 );
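To peek at what such a call puts on the wire, you can build the request by hand with HTTP::Request::Common, the module LWP's post method relies on to construct its request. This is only an inspection sketch: nothing is sent, and the example.com URL is a stand-in for wherever the form's action points.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Hypothetical target; substitute the form's real action URL
my $url = 'http://www.example.com/process';

# Same key/value pairs as the form above
my $request = POST( $url,
    [
        formkey1 => 'value1',
        formkey2 => 'value2',
        go       => 'Go!',
    ],
);

# The pairs travel in the request body, not in the URL
print $request->method, ' ', $request->uri, "\n";
print 'Content-Type: ', $request->content_type, "\n";
print $request->content, "\n";
```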

Or, if you need to send HTTP headers as well, simply append them like this:

 $response = $browser->post( $url,
     [
       formkey1 => 'value1',
       formkey2 => 'value2',
       go       => 'Go!',
       # ...
     ],
     headerkey1 => 'value1',
     headerkey2 => 'value2',
 );
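As a concrete illustration, the sketch below builds (but does not send) a POST carrying two headers. The Referer and Accept-Language values are made up for the example, as is the example.com host; use whatever the target server actually expects.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Headers follow the form data as plain key/value pairs
my $request = POST( 'http://www.example.com/process',
    [ formkey1 => 'value1', formkey2 => 'value2', go => 'Go!' ],
    'Referer'         => 'http://www.example.com/form.html',  # assumed value
    'Accept-Language' => 'en',                                 # assumed value
);

# Show every header the request would carry
print $request->headers->as_string;
```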

The following program makes a search request to AltaVista (by sending some form data via an HTTP POST request) and extracts from the HTML the report of the number of matches:

 #!/usr/bin/perl -w
 use strict;
 use LWP 5.64;

 my $word = shift;
 $word or die "Usage: perl altavista_post.pl [keyword]\n";

 my $browser = LWP::UserAgent->new;
 my $url = 'http://www.altavista.com/web/results';
 my $response = $browser->post( $url,
     [ 'q' => $word,  # the Altavista query string
       'pg' => 'q', 'avkw' => 'tgz', 'kl' => 'XX',
     ]
 );

 die "$url error: ", $response->status_line
    unless $response->is_success;
 die "Weird content type at $url -- ", $response->content_type
    unless $response->content_type eq 'text/html';

 if ( $response->content =~ m{ found ([0-9,]+) results} ) {
     print "$word: $1\n";
 } else {
     print "Couldn't find the match-string in the response\n";
 }
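The extraction step can be tried on its own, without touching the network. Here is a small sketch that wraps the same regular expression in a helper; the extract_count name and the sample HTML string are invented for illustration, and stripping the commas is an optional extra in case you want a plain number.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the match count from a results page, or undef if the
# expected phrase is not there. (Hypothetical helper name.)
sub extract_count {
    my ($html) = @_;
    return unless $html =~ m{ found ([0-9,]+) results};
    ( my $count = $1 ) =~ tr/,//d;    # "80,349" becomes "80349"
    return $count;
}

# Made-up snippet standing in for a real results page
my $sample = '<b>AltaVista found 80,349 results</b>';

my $count = extract_count($sample);
print defined $count ? "matches: $count\n" : "no match-string found\n";
```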

Save this script as altavista_post.pl and invoke it on the command line, passing it a keyword (or quoted set of keywords) you wish to search AltaVista for:

 % perl altavista_post.pl tarragon
 tarragon: 80,349

Being able to program form submissions becomes especially handy when you're looking to automate "Keep trying!" contest submissions that require users to manually enter their age, state, Zip Code, or similar bit of ephemera, only to receive "Sorry, try again" repetitively and without remorse.

Sean Burke and Tara Calishain



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
