Hack 27 More Advanced wget Techniques


wget has a huge number of features that can make downloading data from the web easier than sitting down and rolling your own Perl script. Here, we'll cover some of the more useful configuration options.

wget is capable of fetching files via HTTP, HTTPS, and FTP, and it can even mix all three protocols as needed. Fetching can be optimized for specific uses, including customized HTTP headers, SSL support, and proxy configurations:

% wget --referer=http://foo.com/ -U MyAgent/1.0 http://bar.net/

In this example, wget sends HTTP headers for Referer ( --referer ) and User-Agent ( -U or --user-agent ). This is generally considered good practice, as it lets server administrators know who and what is fetching files from their server. The --referer option is also handy for getting around some of the more basic antileech/antimirror configurations, which allow only requests carrying a certain Referer.
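
For instance, to slip past a simple same-site Referer check, you can point the Referer at the site itself; the URLs here are just placeholders:

% wget --referer=http://www.example.com/ http://www.example.com/images/photo.jpg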

It is important to control what wget does; otherwise, you could end up attempting to download half the Internet. When mirroring a site, this control starts with setting the depth of the crawl ( -l or --level ) and whether or not wget gets images and other supplementary resources along with text ( -p or --page-requisites ):

% wget -l 2 -p -r http://www.example.com/

This recursively retrieves the first two layers of a web site, including all files that the HTML requires. Further control can be attained by setting a rate limit ( --limit-rate ), setting a fetch timeout ( -T or --timeout ), or using date/time checking ( -N or --timestamping ). Date/time comparison is especially effective for scheduled mirroring, because it compares the local file's time and date with the remote file's and fetches only files that are newer than the local copy.
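
As an illustration, a scheduled mirror run might combine timestamping with a bandwidth cap and a timeout; the host, depth, and rate below are only examples:

% wget -r -l 3 -N --limit-rate=50k -T 30 http://www.example.com/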

Controlling which directories wget will recurse into is another means of keeping bandwidth usage (and administrators' tempers) to a minimum. This can be done by telling wget either which directories to look in ( -I or --include-directories ) or which to ignore ( -X or --exclude-directories ). Similarly, you can control which HTML tags wget will follow ( --follow-tags ) or ignore ( --ignore-tags ) when dealing with HTML content.
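
For example, you could restrict a recursive fetch to a couple of content directories while skipping CGI scripts entirely (the directory names here are hypothetical):

% wget -r -I /articles,/photos -X /cgi-bin http://www.example.com/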

Generally, it isn't necessary to use the HTML tag controls, unless you want to do something specific, such as grab only images, which can be done like this:

% wget -m --follow-tags=img http://www.example.com/

Many sites require HTTP Basic Authentication to view files, and wget includes options that make this easy. By sending the username ( --http-user ) and password ( --http-passwd ) in the HTTP headers of each request, wget can fetch protected content:

% wget -r --http-user=me --http-passwd=ssssh http://example.com/

One other major consideration is a local matter: keeping your own mirror directories clean and usable. When using wget, you can control where it places downloaded files on your own drive:

% wget -r -P /home/me http://www.example.com/

If a directory is not specified ( -P or --directory-prefix ), wget simply puts the spidered files into the current directory. If you specify one, it creates the spidered content in that directory, in this case /home/me . If you run cron jobs to schedule mirroring scripts [Hack #90], this option makes it simple to keep everything straight, without worrying about which directory the cron job is executing from.
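
As a rough sketch, a nightly cron entry for such a mirror might look like the following; the schedule, prefix, and host are chosen purely for illustration:

0 3 * * * wget -m -P /home/me/mirrors http://www.example.com/ > /dev/null 2>&1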

You can also control whether wget creates the remote site's directory structure on the local side ( -x or --force-directories ) or not ( -nd or --no-directories ). When using the mirror option ( -m or --mirror ), wget automatically creates the directories.
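
If, say, you want every fetched file dumped into a single flat directory rather than a tree of hostnames and paths, -nd does the trick (the URL and prefix are again placeholders):

% wget -r -nd -P /home/me/flat http://www.example.com/images/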

For more advanced spidering applications, wget will accept raw HTTP headers ( --header ) to send with each request, read a list of URLs from a file ( -i or --input-file ), save the raw server response (headers as well as content) into the local files ( -s or --save-headers ), and log its output messages to a file ( -o or --output-file ):

% wget --header="From: me@sample.com" -i ./urls.txt -s -o ~/wget.log

Each URL listed in ./urls.txt will be requested with the additional HTTP/1.1 From header, and the raw server response headers will be saved with each file. A log will also be written to ~/wget.log and will look something like this:

--20:22:39--  http://www.example.com/index.html
           => 'www.example.com/index.html'
Connecting to www.example.com[207.99.3.256]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

    0K .........                                    9.74 MB/s

20:22:39 (9.74 MB/s) - 'www.example.com/index.html' saved [10215]

FINISHED --20:22:39--
Downloaded: 10,215 bytes in 1 file

These output files are particularly useful when you want to know some specifics about wget 's results but don't want to watch it fly by on the screen. It's quite simple to write a script to parse the output file and create graphs based on file sizes, transfer speeds, file types, and status codes.
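
As a minimal sketch, assuming the log format shown above (it varies a little between wget versions), you could total the bytes saved with a quick pipeline:

% grep "saved \[" ~/wget.log | awk -F'[][]' '{ total += $2 } END { print total " bytes" }'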

While a few options are mutually exclusive, most can be combined for rather sophisticated spidering. The best way to really figure out wget is to run variations of these options against your own site.

More information about wget is available by typing man wget in your shell.

James Linden


