Hack 28 Using Pipes to Chain Commands

[Hack difficulty: moderate]

Chaining commands into a one-liner can make for powerful functionality.

If you want to do something only once, writing a full-blown script for it can often be overkill. Part of the design of Unix is to write small applications that do one specific thing. Combining these programs can produce spectacular and powerful results.

In this hack, we'll retrieve a list of files to save locally, then actually download them, all with existing Unix utilities. The logic goes something like this: first, we use lynx to grab the list of files; then, we use grep to filter the lynx output down to just the information we want; finally, we use wget [Hack #26] to actually retrieve the final results.

Browsing for Links with lynx

lynx is usually thought of as a console-based web browser. However, as you will see here, it has some powerful command-line uses as well. For example, rather than run an interactive browser session, lynx can be told to send its output directly to STDOUT, like so:

 %  lynx -dump "http://google.com/"  

Not only does lynx nicely format the page in glorious plain text, but it also provides a list of links, bibliography style, in the "References" section. Let's give that another spin, this time with the -image_links option:

 %  lynx -dump -image_links "http://google.com/"  

lynx now includes the URLs of all images in that web page. Thanks to -dump and -image_links, we now have a list of all the links and images related to the URL at hand.

Before moving on, lynx has a few more options you might find helpful if you're having trouble accessing a particular page. If the page restricts access using basic authentication (where a dialog box asking for a username and password appears in a visual browser), use the -auth option to pass the appropriate username/password combination to lynx:

 % lynx -dump -auth=user_name:password "http://google.com/"

Some sites check for browser type, either to provide a different view depending on the capabilities of the browser, or simply to weed out robots. Thankfully, lynx is quite the chameleon and can pretend to be any browser you might need. Use the -useragent option to change the User-Agent variable passed by lynx to the web server. For example, pretending to be Internet Explorer 6 on Windows XP with .NET looks like this:

 % lynx -dump -useragent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)" "http://google.com/"

Finally, what if the page you are looking for is the result of a form input? This will take some digging into the HTML source of the form (see [Hack #12] for some examples). Armed with the form's information, you can build a response by pairing each input name with its corresponding value and separating the pairs with ampersands, like so:

 input1=value1&input2=value2 

If the method is a GET request, the easiest thing to do is append the response value to the end of the destination URL, prefixed with a question mark:

 %  lynx -dump "http://google.com/search?q=unix+pipes"  

If the method is POST, things are a bit more complicated, as you'll need to use the -post_data option to pass your data to lynx via standard input. If the data is a short string, echo the input to lynx:

 % echo "input=value" | lynx -dump -post_data "http://www.site.com/put_action"

If the data string is long, save it to a file and cat it:

 % cat data_file | lynx -dump -post_data "http://www.site.com/put_action"
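
The contents of data_file follow the same name/value format shown earlier. As a quick sketch (input1, input2, and their values are just placeholders for whatever fields the form actually uses), you could build the file like this before running the cat pipeline above:

 % echo "input1=value1&input2=value2" > data_file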

By now, you should be able to get almost any page from any source. Next, we will move on to formatting this data to get only the links we want.

grepping for Patterns

The grep command is a powerful tool for processing piped data with regular expressions. Entire books have been written about regular expressions (e.g., Mastering Regular Expressions by Jeffrey E.F. Friedl, O'Reilly & Associates), so I will show only what we need to get the job done. First, we want to grab just the URLs and none of the surrounding prose. Since grep normally outputs the entire line that matches a search, we will want to use the -o option, which prints only the part of the line that matches what we are searching for. The following invocation searches each line passed on STDIN for http:, followed by the rest of the characters on that line:

 grep -o "http:.*" 

Now, let's hook all these commands together into a pipeline, passing a web page in one end and waiting at the other end for a list of URLs. The pipe ( | ) will serve as glue, sticking the lynx and grep commands together so that the output from lynx will be fed into grep, whose output will then be sent to STDOUT:

 % lynx -dump "http://google.com/" | grep -o "http:.*"

Let's take grep a little further, focusing only on links to images. But before we can do that, there is one small hurdle: in its basic form, grep understands only a small subset of the regular expression metacharacters, and we'll need more. There are two ways around this. You can pass the -E option to grep to force it to understand a more advanced version of regular expressions, or you can use the egrep command instead; these approaches are functionally identical. So, our new pipeline, with egrep instead of grep and a regular expression match for images, looks like this:

 % lynx -dump -image_links "http://google.com/" | egrep -o "http:.*(gif|png|jpg)"

Notice that we're not after all images, only gif, png, or jpg, the three most popular image formats on the Web. This trick can also be used to grab only Java applets ( class ), Flash files ( swf ), or any other file format you can think of. Be sure to separate each extension with the | symbol (this time meaning OR, not pipe), and put them all between the parentheses.
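
For instance, here's a rough variation that grabs only Flash files and Java applets instead; the URL is a stand-in for whatever page you're scraping, and it assumes those files are actually linked to from that page:

 % lynx -dump "http://www.site.com/" | egrep -o "http:.*(swf|class)"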

So, now we have our list of files to download, but we still haven't actually retrieved anything. wget to the rescue . . .

wgetting the Files

wget is a popular command-line download tool, usually used to download a single file. However, it can be used for many more interesting purposes. We'll take advantage of its ability to read a list of links and download them one at a time. The first wget option we will use is -i, which tells wget to read a file and download the links contained within. Combined with the - symbol (short for standard input), we can now download a list of dynamically determined links piped to wget:

 % lynx -dump "http://google.com/" | egrep -o "http:.*" | wget -i -

There is one little problem, though: all the files are downloaded to the same directory, so multiple files with the same name are suffixed with a number (e.g., imagename.jpg, imagename.jpg.1, imagename.jpg.2, and so forth). wget has yet another command-line solution for us; the -x option creates a directory for each server it downloads from, recreating the directory structure based on the URL of each file it saves:

 % lynx -dump "http://google.com/" | egrep -o "http:.*" | wget -xi -

So, instead of our initial wget, which would create numbered files like this:

 % ls -l
 -rw-r--r--  1 morbus  staff  10690 Aug 13 22:04 advanced_search?hl=en
 -rw-r--r--  1 morbus  staff   9688 Aug 13 22:04 dirhp?hl=en&tab=wd&ie=UTF-8
 -rw-r--r--  1 morbus  staff   5262 Aug 13 22:04 grphp?hl=en&tab=wg&ie=UTF-8
 -rw-r--r--  1 morbus  staff   3259 Aug 13 22:04 imghp?hl=en&tab=wi&ie=UTF-8
 -rw-r--r--  1 morbus  staff   3515 Jul  1 18:26 index.html
 -rw-r--r--  1 morbus  staff   6393 Jul 16 21:52 index.html.1
 -rw-r--r--  1 morbus  staff  13690 Jul 28 19:30 index.html.2

our new command will place those duplicate index.html files where they belong:

 drwxr-xr-x   3 morbus  staff    102 Aug 13 22:06 ads/
 -rw-r--r--   1 morbus  staff  10690 Aug 13 22:06 advanced_search?hl=en
 -rw-r--r--   1 morbus  staff   9688 Aug 13 22:06 dirhp?hl=en&tab=wd&ie=UTF-8
 -rw-r--r--   1 morbus  staff   5262 Aug 13 22:06 grphp?hl=en&tab=wg&ie=UTF-8
 -rw-r--r--   1 morbus  staff   3259 Aug 13 22:06 imghp?hl=en&tab=wi&ie=UTF-8
 -rw-r--r--   1 morbus  staff  29027 Aug 13 22:06 language_tools?hl=en
 drwxr-xr-x   3 morbus  staff    102 Aug 13 22:06 options/
 -rw-r--r--   1 morbus  staff  11682 Aug 13 22:06 preferences?hl=en
 drwxr-xr-x   3 morbus  staff    102 Aug 13 22:06 services/

Voila! With a pipeline of existing Unix command-line applications, you now have a one-liner for grabbing a series of images, Flash animations, or any other set of linked-to files you might want.

Pipes are a powerful tool for completing tasks without writing your own scripts. By combining utilities your system already has onboard, code you may have previously written, and other bits and bobs you might have come across on the Net, you can save yourself the effort of writing a brand-new script for one use only, or, indeed, turn the pipeline into its own command to be run at any time, as in the sketch that follows.
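
As a minimal sketch of that last idea (assuming a bash-style shell; grab_images is just a made-up name), you could drop the image-grabbing pipeline from this hack into a small shell function in your startup file and point it at any URL:

 grab_images () {
     lynx -dump -image_links "$1" | egrep -o "http:.*(gif|png|jpg)" | wget -xi -
 }

Invoking it then looks like this:

 % grab_images "http://google.com/"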

Hacking the Hack

lynx can even handle any cookies necessary to gain access to the content. Place the cookies in a file using the format shared by Netscape and Mozilla or, alternatively, point lynx directly at your existing browser cookie file.

To find your existing Netscape or Mozilla cookie jar, run the following on your command line: locate cookies.txt.


Here is a sample cookie:

 .google.com     TRUE    /       FALSE   2147368537      PREF    ID=268a71f72dc9b915:FF=4:LD=en:NR=10:TM=1057718034:LM=1059678022:S=JxfnsCODMtTm0Ven

The first column is the domain of the cookie; this is usually the web site of the page in question. The second column is a flag, TRUE if all subdomains (e.g., news.google.com or groups.google.com) should also be passed this cookie. The third column is a path, signifying the subsections of the site the cookie should be sent to. For example, a path of /shopping/ties indicates the cookie should be sent only for URLs at or below /shopping/ties (http://servername/shopping/ties/hideous, for instance). The fourth column should be TRUE if the cookie should be sent only over secure connections, FALSE otherwise. The fifth column is the expiration date of the cookie in Unix timestamp format; a good browser will usually delete old cookies from its cookie jar. The sixth column is the cookie's name, and the seventh and final column is the cookie's value; both are set by the site to something meaningful. Using this format, you can create a cookie jar containing any cookies you please (they should, of course, be valid) and pass it to lynx using the -cookie_file option:

 % lynx -dump -cookie_file=cookies "http://google.com/" | egrep -o "http:.*" | wget -xi -
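
If you'd rather not point at your browser's own jar, the cookies file named here can be a hand-rolled text file containing one tab-separated line per cookie in the seven-column format described above. As a rough sketch, you could write the sample Google cookie into it with printf:

 % printf '.google.com\tTRUE\t/\tFALSE\t2147368537\tPREF\tID=268a71f72dc9b915:FF=4:LD=en:NR=10:TM=1057718034:LM=1059678022:S=JxfnsCODMtTm0Ven\n' > cookies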

Adam Bregenzer


