Hack 30 Utilizing the Web Scraping Proxy


With the use of a Perl proxy, you can browse web sites and have the corresponding LWP code written out automatically for you. Although not perfect, it can certainly be a time-saver.

In this hack, we're going to use something called a proxy. In essence, a proxy is a piece of middleware that sits between you and your Internet connection. When you make a request for a web page, the request goes to the proxy, which downloads the relevant data, optionally preprocesses it, then returns it to the browser as expected.
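LWP scripts can be pointed at a proxy in exactly the same way a browser can; a minimal sketch, assuming wsp.pl is listening on port 5364 (it prints its actual port at startup):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Send all HTTP traffic through the local proxy; the port here is
# an assumption -- wsp.pl prints the real one when it starts.
$ua->proxy('http', 'http://localhost:5364/');
```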

If you have ever used browsers other than the latest version of Internet Explorer, you've probably had sites complain that your browser isn't supported. When writing code, things can get even more complicated with the inclusion of JavaScript, frames, cookies, and other evil tricks of the trade.
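One of the simplest of those tricks to counter is the User-Agent check: LWP lets you replace its default identifier with a browser-like string. A sketch (the particular browser string is just an example):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Replace the default "libwww-perl/x.xx" identifier, which picky
# sites often reject, with a typical browser string.
$ua->agent('Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us)');
```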

We could use command-line utilities like tcpdump to log traffic during a session with particular web sites, and manually copy headers, cookies, and referers to mock up our code as a legitimate browser, but with the Web Scraping Proxy, we can log the data automatically and get a firm basis for our own scripts.

The Web Scraping Proxy (http://www.research.att.com/~hpk/wsp/) is a way to automatically generate the Perl code necessary to emulate a real browser within your scripts. The LWP code it writes is similar to the various LWP hacks we've covered previously in this chapter.

After downloading wsp.pl from http://www.research.att.com/~hpk/wsp/ (be sure to get Version 2), we can set our browser to proxy all requests through wsp.pl and get a record of all the transactions. Exactly how to set the proxy depends on your browser and OS, but here's a quick sample of starting wsp.pl and then requesting an Amazon.com search with Mozilla 1.3:

 % perl wsp.pl
 --- Proxy server running on disobey.local. port: 5364
 # Request: http://www.amazon.com/exec/obidos/search-handle-form/
 # Cookie: [long string here]
 # Cookie: 'session-id', '103-3421686-4199019'
 # Cookie: 'session-id-time', '1057737600'
 # Cookie: 'ubid-main', '430-5587053-7200154'
 # Cookie: 'x-main', '?r2eEc7UeYLH@lbaOUWV4wg0oCdCqHdO'
 # Referer: http://www.amazon.com/
 $req = POST "http://www.amazon.com/exec/obidos/search-handle-form/",
 [
         'url' => "index=blended",
         'field-keywords' => "amazon hacks",
         'Go.x' => "0",
         'Go.y' => "0",
 ] ;

As you can see, the last five or six lines spit out the beginnings of some code you'd be able to use in your own Perl scripts, like so:

 #!/usr/bin/perl
 use LWP;
 $ua = LWP::UserAgent->new;
 $req = $ua->post("http://www.amazon.com/exec/obidos/search-handle-form/",
 [
         'url' => "index=blended",
         'field-keywords' => "amazon hacks",
         'Go.x' => "0",
         'Go.y' => "0",
 ]);
 print $req->content;
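The `$req = POST ...` syntax in wsp.pl's output, incidentally, comes from the HTTP::Request::Common module, which you can also use directly instead of `$ua->post`. A sketch built from the same form values as the recorded session:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Build the same form submission wsp.pl recorded; POST returns an
# HTTP::Request object, ready to hand to $ua->request later.
my $req = POST "http://www.amazon.com/exec/obidos/search-handle-form/",
  [
    'url'            => "index=blended",
    'field-keywords' => "amazon hacks",
    'Go.x'           => "0",
    'Go.y'           => "0",
  ];
```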

Using this translation, it's simple to make your Perl scripts emulate form submissions without having to decode the form information yourself. By itself, though, the Web Scraping Proxy outputs only basic LWP code: enough to run the requests, but taking no HTTP headers into account. To emulate a browser fully, you could either copy each header laboriously (from a tcpdump capture, say) or use a bit of additional Perl to generate the ancillary code for you.
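Those ancillary headers go onto a request with HTTP::Request's header method; a short sketch (the header values are copied from the session above, purely by way of example):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request;

my $req = HTTP::Request->new(
    POST => 'http://www.amazon.com/exec/obidos/search-handle-form/');

# Mimic a real browser by sending the same ancillary headers it would.
$req->header('Referer'    => 'http://www.amazon.com/');
$req->header('User-Agent' => 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us)');
```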

The Code

Save the following code to a file called translate.pl:

 #!/usr/bin/perl -w
 #
 # translate.pl - translates the output of wsp.pl -v.
 #
 # This code is free software; you can redistribute it and/or
 # modify it under the same terms as Perl itself.
 #
 use strict;
 my $save_url; my $count = 1;

 # print the basics.
 print "#!/usr/bin/perl -w\n";
 print "use strict;\n";
 print "use LWP::UserAgent;\n";
 print 'my $ua = LWP::UserAgent->new;', "\n\n";

 # read through wsp's output.
 while (<>) {
     chomp; s/\x0D$//;

     # add our HTTP request headers...
     if (/^INPUT: ([a-zA-Z0-9\-\_]+): (.*)$/) {
         print '$req'.$count."->header('$1' => '$2');\n";
     }

     # what URL we're actually requesting...
     if (/^Request for URL: (.*)$/) { $save_url = $1; }

     # the HTTP 1.x request line (GET or POST).
     if (/^FIRST LINE: ([A-Z]+) \S+ (.*)$/) {
         print "\n\n### request number $count ###\n";
         print 'my $req'.$count." = HTTP::Request->new($1 => '$save_url');\n";
     }

     # the POST information sent off, if any.
     if (/^POST body: (.*)$/) { print '$req'.$count."->content('$1');\n"; }

     # and finish up our request.
     if (/^ --- Done sending./) {
         print 'print $ua->request($req'.$count.')->as_string;', "\n";
         $count++; # move on to our next request. yeedawg.
     }
 }

Running the Hack

The first order of business is to set up your browser to use wsp.pl as a proxy. Methods vary from browser to browser, but in most cases you just set the HTTP proxy to localhost and the port to whatever wsp.pl reported at startup (5364 in the earlier example); see Figure 2-3.

Figure 2-3. Configuring your proxy in the Mozilla browser
figs/sphk_0203.gif
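You don't strictly need a graphical browser for this step, either: LWP-based scripts can pick the proxy up from the conventional http_proxy environment variable. A sketch, again assuming port 5364 from wsp.pl's startup line:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Set explicitly here for the sketch; normally you'd export
# http_proxy in your shell before running the script.
$ENV{http_proxy} = 'http://localhost:5364/';

my $ua = LWP::UserAgent->new;
# Read proxy settings from http_proxy/HTTP_PROXY and friends.
$ua->env_proxy;
```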

Then, in an empty directory, run the following command:

 % perl wsp.pl -v | perl translate.pl

The output is a stronger version of the previous Amazon.com search script:

 #!/usr/bin/perl -w
 use strict;
 use LWP::UserAgent;
 my $ua = LWP::UserAgent->new;

 ### request number 1 ###
 my $req1 = HTTP::Request->new(POST =>
   'http://amazon.com/exec/obidos/search-handle-form/');
 $req1->header('Accept' => '*/*');
 $req1->header('Accept-Language' => 'en-us, ja;q=0.33, en;q=0.67');
 $req1->header('Cookie' => '[long string here]');
 $req1->header('Referer' => 'http://amazon.com/');
 $req1->header('User-Agent' => 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us)');
 $req1->header('Content-Type' => 'application/x-www-form-urlencoded');
 $req1->header('Content-Length' => '61');
 $req1->header('Connection' => 'close');
 $req1->header('Host' => 'amazon.com');
 $req1->content('url=index%3Dblended&field-keywords=amazon+hacks&Go.x=0&Go.y=0');
 print $ua->request($req1)->as_string;

If you've read through the translate.pl code, you've seen that it simply takes the output from wsp.pl and prints LWP code complete with HTTP headers, along with warnings and the strict pragma. The result, as shown in this section, is Perl code that executes the same requests your browser did, with the same settings and headers. Note that in the full output of translate.pl, you'd get another 30 or so requests, one for each image and resource on the returned Amazon.com page; I've left those out for brevity's sake.

Hacking the Hack

There are many possible improvements for this hack. If you want to use the output script to simulate real user scenarios in a load test situation, then it makes sense to add a little timer to translate.pl and let it add sleep statements to simulate user inactivity. Another improvement would be to pick up any Set-Cookie headers and reuse them for the remainder of the session.
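For the cookie improvement, LWP already has most of the machinery: give the user agent an HTTP::Cookies jar, and Set-Cookie headers are stored and replayed automatically for the remainder of the session. A sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

my $ua = LWP::UserAgent->new;

# An in-memory cookie jar: Set-Cookie headers from each response
# are saved and sent back on later requests in the same session.
$ua->cookie_jar(HTTP::Cookies->new);
```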

Mads Toftum



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157