Hack 29 Running Multiple Utilities at Once


You've got scrapers, spiders, and robots aplenty, all to run daily according to a particular schedule. Should you set up a half-dozen cron jobs, or combine them into one script?

Finding and collecting information from many sources may involve running several programs, possibly in conjunction with one another. Combining multiple utilities into one script has a few benefits: if you're running, heaven forbid, a dozen different wget [Hack #26] commands every night, you can throw them in a single shell script and worry about only 1 crontab entry [Hack #90] instead of 12. Unlike a dozen separate entries, a single shell script also allows you to check how each wget performed (either by checking the return code [Hack #34] or by collating and reporting the final results). Likewise, shell scripts can be used to hold complicated and lengthy pipes [Hack #28], saving your weary fingers from typing them out each time.
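
As a rough sketch of the "one script, many jobs" idea, the wrapper below runs each job in turn, records its return code, and prints a tally at the end. The run_job function and the job labels are hypothetical, and true and sh -c 'exit 3' merely stand in for real commands such as wget.

```shell
#!/bin/sh
# Sketch: run several jobs from one script and tally the failures.
# run_job and the labels are hypothetical; `true` and `sh -c 'exit 3'`
# stand in for real commands such as wget.

failures=0

run_job() {
    label=$1        # first argument: a name for the report
    shift           # the remaining arguments are the command to run
    if "$@"; then
        echo "ok:     $label"
    else
        status=$?   # exit status of the command that just failed
        echo "FAILED: $label (exit $status)"
        failures=$((failures + 1))
    fi
}

run_job "first fetch"  true
run_job "second fetch" sh -c 'exit 3'

echo "$failures job(s) failed"
```

In a real nightly script, you would probably finish with a non-zero exit when failures is greater than zero, so that cron can mail you the report.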

This hack explores doing this kind of combinatorial spidering.

Shell Scripts

A shell script is a series of statements that you would otherwise run on the command line, one after another. Perhaps you'd first run a simple Perl script to spit out the Amazon.com Sales Rank of your book and, if it had dropped, you'd use curl to upload a sad face to your web site. If it had risen, you'd upload a happy face, and if it was in the Top 100, you'd automatically email a prepared resignation letter to your day job. A shell script would make these decisions for you, first by checking the output of the Perl script, and then responding appropriately.
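
The decision-making part of that scenario might look something like the following sketch. The choose_action function and its action names are hypothetical, and it assumes the rank is a plain number where smaller means better. A real script would take the current value from the Perl script's output and replace each echo with the matching curl or mail command.

```shell
#!/bin/sh
# choose_action PREVIOUS CURRENT
# Hypothetical helper: prints the action to take for a rank,
# assuming a smaller number is a better rank.
choose_action() {
    previous=$1
    current=$2
    if test "$current" -le 100; then
        echo "resign"        # in the Top 100
    elif test "$current" -lt "$previous"; then
        echo "happy"         # the rank has risen
    elif test "$current" -gt "$previous"; then
        echo "sad"           # the rank has dropped
    else
        echo "unchanged"
    fi
}

choose_action 5000 4200
choose_action 4200 6100
choose_action 150 87
```

The three calls print happy, sad, and resign, in that order.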

Here's a simple shell script, which we'll dissect in a moment. Its purpose is menial: depending on the options passed to it on the command line, we'll use lynx to automatically display either an RFC (an Internet "Request for Comments" from http://www.ietf.org/rfc/) or a quote from the QDB (a database of humorous IRC quotes at http://bash.org/).

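    #!/bin/sh

    # Print a usage message on stderr and stop.
    usage_and_exit() {
        echo "usage: get (rfc|qdb) term" 1>&2
        echo "where term is an rfc number, or a quote number from bash.org" 1>&2
        exit 1
    }

    # $1 selects the site; without it we cannot do anything.
    if test -z "$1"; then
        usage_and_exit
    fi

    # $2 is the RFC or quote number to look up.
    url=""
    case "$1" in
        rfc)
            url="http://www.ietf.org/rfc/rfc$2.txt" ;;
        qdb)
            url="http://bash.org/?$2" ;;
    esac

    # An unknown site leaves url empty.
    if test -z "$url"; then
        usage_and_exit
    fi

    lynx -dump "$url"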

The first line tells the operating system that the interpreter for the rest of this program is /bin/sh. Other interpreters run programs written in other languages, for example Perl (/usr/bin/perl) and Python (/usr/bin/python).

Next, we define a function called usage_and_exit, which prints our usage message when we don't get the correct number of arguments. Once the message is printed, our shell script will exit.
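
Two details of such a function are easy to miss: 1>&2 sends the message to standard error rather than standard output, and exit inside a function ends the whole script, not just the function. The sketch below is a shortened variant of the function that exits with status 1 (an assumption, following the convention that non-zero means failure). It is called in a subshell purely so that its effect can be inspected.

```shell
#!/bin/sh
usage_and_exit() {
    echo "usage: get (rfc|qdb) term" 1>&2   # 1>&2 sends the text to stderr
    exit 1                                   # non-zero status signals an error
}

# exit would end this script too, so run the function in a subshell
# and capture the status it leaves behind.
status=0
( usage_and_exit ) 2>/dev/null || status=$?
echo "exit status: $status"
```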

We then check to see if $1 is empty. $1 is the first argument to a script, $2 is the second, $3 is the third, and so forth, up to $9 (later arguments need braces, as in ${10}). Almost all checking of variables in a script is done with test, which ships with your system as a binary and is also built into most shells. In this script, we need only its -z option, which is true when a string is empty.

Note the syntax of the if statement. A full if looks like this:

    if test -z "$some_variable"; then
        do_something
    elif test "$some_variable" = "$a_different_variable"; then
        do_something_else
    else
        do_another_thing
    fi
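
To see positional parameters and test working together, here is a small sketch. The describe_args function is hypothetical; inside a function, $1, $2, and $# refer to the function's own arguments in the same way they refer to the script's arguments at the top level.

```shell
#!/bin/sh
# describe_args is a hypothetical helper that reports on its own arguments.
describe_args() {
    echo "count: $#"                    # $# is the number of arguments
    if test -z "$1"; then               # -z: true for an empty string
        echo "first: (empty)"
    elif test "$1" = "rfc"; then        # =: string equality
        echo "first: rfc, term: $2"
    else
        echo "first: $1"
    fi
}

describe_args rfc 2616
describe_args ""
```

The second call shows why the variables are quoted: an empty argument still counts as an argument, and without the quotes test would see nothing at all where "$1" should be.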

Next, we check what the contents of $1 actually are. If it names one of the sites we know how to fetch from, we set url based on the value of the second argument, $2.

If we still haven't decided on a URL by the time we leave the case statement, we call usage_and_exit. If we have decided on a URL, then we have a whole command line to get the results in lynx, so we run it.
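
The URL-selection step can also be pulled out into a function of its own, which makes it easy to try without starting lynx. This is only a sketch: build_url is a hypothetical name, and the *) branch is an addition that catches anything the script doesn't recognize.

```shell
#!/bin/sh
# build_url SITE TERM
# Hypothetical helper: prints the URL, or returns 1 for an unknown site.
build_url() {
    case "$1" in
        rfc) echo "http://www.ietf.org/rfc/rfc$2.txt" ;;
        qdb) echo "http://bash.org/?$2" ;;
        *)   return 1 ;;    # no pattern matched
    esac
}

build_url rfc 2616
build_url qdb 4281
build_url foo 1 || echo "unknown site"
```

The first two calls print http://www.ietf.org/rfc/rfc2616.txt and http://bash.org/?4281. The third prints unknown site.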

That's our complete simple shell script, but it's hardly a large example. We still need to know how to combine programs and filter the data we want [Hack #28]. Far more information about shell programming, including complete scripts, can be found at the Linux Documentation Project's "Advanced Bash-Scripting Guide" (http://tldp.org/LDP/abs/html/).
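
As one small illustration of combining programs, the sketch below filters a page dump down to a sorted list of unique links. The sample_dump function is a hypothetical stand-in for the output of lynx -dump, and it assumes the dump ends with a numbered list of URLs under a References heading. In real use, you would pipe lynx -dump "$url" into extract_links instead.

```shell
#!/bin/sh
# sample_dump: hypothetical stand-in for the output of `lynx -dump`.
sample_dump() {
    cat <<'EOF'
   Some page text with a link[1] and another[2].

References

   1. http://example.com/b
   2. http://example.com/a
   3. http://example.com/b
EOF
}

# extract_links: keep only the numbered URL lines, strip the numbering,
# then sort and drop duplicates.
extract_links() {
    sed -n 's/^ *[0-9][0-9]*\. \(http[^ ]*\).*$/\1/p' | sort -u
}

sample_dump | extract_links
```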

Perl Equivalence

We can, of course, emulate a shell script using Perl. (Breathe easy, Windows users!) All a shell script is, really, is a series of commands that the shell runs for you. We can run external commands in Perl too, and gather their output as we go. Here's a Perl version of the previous shell script:

    #!/usr/bin/perl -w
    use strict;

    my ($db, $num) = (shift, shift);
    die "You must specify 'rfc' or 'qdb'!\n"
        unless defined $db && ($db eq "rfc" || $db eq "qdb");
    die "You must specify a numerical search!\n"
        unless defined $num && $num =~ /^\d+$/;

    my $url;
    if    ($db eq "rfc") { $url = "http://www.ietf.org/rfc/rfc$num.txt"; }
    elsif ($db eq "qdb") { $url = "http://bash.org/?$num"; }

    # The list form of system() runs lynx directly, without a shell.
    system("lynx", "-dump", $url);

If we want to capture lynx's output and operate on it within Perl, we can do this:

    # The trailing | makes Perl run the command and read its output.
    open INPUT, "lynx -dump $url |" or die $!;
    while (<INPUT>) {
        # process input from lynx.
    }
    close INPUT;

We can use a similar idea to write data to a program that wants input on STDIN:

    # The leading | makes Perl run the command and write to its STDIN.
    open OUTPUT, "| my_program" or die $!;
    foreach my $line (@lines) {
        print OUTPUT $line;
    }
    close OUTPUT;
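
For comparison, the shell expresses both patterns directly with pipes. In this sketch, process_lines is a hypothetical function, printf stands in for lynx -dump, and sort stands in for my_program.

```shell
#!/bin/sh
# Reading: handle another command's output one line at a time.
process_lines() {
    while IFS= read -r line; do
        echo "got: $line"
    done
}
printf 'first line\nsecond line\n' | process_lines

# Writing: send generated lines to a program that reads STDIN.
for item in pear apple fig; do
    echo "$item"
done | sort
```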

With the knowledge of what a shell script is, the Perl equivalences, and everything else we know about Perl, we can make a powerful set of utilities by combining existing Perl modules with results from other prewritten programs.

Richard Rose



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
