Hack 26 Downloading with curl and wget

[Hack rating: beginner]

There are a number of command-line utilities to download files over HTTP and FTP. We'll talk about two of the more popular choices: curl and wget.

There are hundreds of ways to download files located on the Net: FTP, HTTP, NNTP, Gnutella, Hotline, Carracho; the list of possible options goes on and on. There is, however, an odd man out in these protocols, and that's HTTP. Most web browsers are designed to view web pages (as you'd expect); they're not designed to download mass amounts of files from a public web directory. This often leaves users with a few meager choices: should they manually and slowly download each file themselves or go out and find some software that could do it for them?

Oftentimes, you'll have one or more utilities that can answer this question already installed on your machine. We'll first talk about curl (http://curl.sf.net/), which has an innocent and calming description:

 curl is a client to get documents/files from or send documents to a server, using any of the supported protocols (HTTP, HTTPS, FTP, GOPHER, DICT, TELNET, LDAP or FILE). The command is designed to work without user interaction or any kind of interactivity. 

Further reading through its manual (accessible by entering man curl as a shell command, or a slightly longer version with curl --manual) shows a wide range of features, including the ability to get SSL documents, manipulate authentication credentials, change the user agent, set cookies, and prefill form values with either GET or POST. Sadly, curl has some apparent shortcomings, and they all revolve around downloading files that don't have similar names.
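To give a feel for those flags, here's a quick sketch of a single request that changes the user agent (-A), passes a username and password (-u), sends a cookie (-b), and submits a form value as a GET query (-G -d). The host, credentials, cookie, and form field are all made up for illustration:

 %  curl -L -A "Mozilla/4.0" -u myname:mypass -b "session=abc123" -G -d "q=spidering" http://example.com/search.cgi 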

Almost immediately, the manual introduces you to curl's range feature, which lets you download a list of sequentially numbered files with a simple command:

 %  curl -LO http://example.com/file[0-100].txt 

The -L flag tells curl to follow any redirects that may be issued, and the -O flag saves the downloaded files into similarly named copies locally (./file0.txt, ./file1.txt, etc.). The limitations of the range feature show up all too clearly with date-based filenames. Say you want to download a list of files that are named in the form yymmdd.txt. You could use this innocent command:

 %  curl -LO http://example.com/[1996-2002]/[000001-999999].txt 

If you are patient enough, this will work fine. The downside is that curl will literally try to grab a million files per year (which would range from 1996 through 2002). While a patient downloader may not care, this will create an insane amount of bandwidth waste, as well as a potentially angry web host. We could split the previous command in two:

 %  curl -LO http://example.com/[1996-1999]/[96-99][01-12][01-31].txt 
 %  curl -LO http://example.com/[2000-2002]/[00-02][01-12][01-31].txt 

These will also work correctly, at the expense of being lengthy (technically, we could combine the two curl commands into one, with two URLs), but they still cause a large number of "file not found" errors for the web host (albeit not as many as the first attempt).
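If you do want to collapse the two commands into one, curl accepts multiple URLs on a single command line; each -O saves the URL that follows it under its remote name. A sketch, using the same made-up host and paths as above:

 %  curl -L -O http://example.com/[1996-1999]/[96-99][01-12][01-31].txt -O http://example.com/[2000-2002]/[00-02][01-12][01-31].txt 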

Solving this sort of problem is easy with the second of our utilities, wget (http://www.gnu.org/software/wget/wget.html):

 %  wget -m -A txt -np http://example.com/text/ 

We start off in mirror mode (-m), which allows us to run the command again at a later date and grab only content that has changed since we last downloaded it. We accept (-A) only files that end in .txt, and we don't want to grab anything from our parent directory (-np, or "no parent"); this stops wget from following links that lead out of the text directory. wget (as well as curl) will show a running progress report as it downloads files. More information about wget is available by typing man wget on the command line.
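If you're worried about hammering the web host, wget can also pause between requests and throttle its transfer rate. Here's a sketch of the same mirror command with those options added; the two-second delay and 20 KB/s cap are arbitrary values you'd tune to taste:

 %  wget -m -A txt -np --wait=2 --limit-rate=20k http://example.com/text/ 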


