Download Files Non-interactively


wget

The Net is a treasure trove of pictures, movies, and music that are available for downloading. The problem is that manually downloading every file from a collection of 200 MP3s quickly grows tedious, leading to mind rot and uncontrollable drooling. The wget command downloads files and websites without any intervention on your part: you set the command in motion, and it happily downloads whatever you specified, for hours on end.

The tricky part, of course, is setting up the command. wget is another super-powerful program that really deserves a book all to itself, so we don't have space to show you everything it can do. Instead, you're going to focus on doing two things with wget: downloading a whole mess of files, which is looked at here, and downloading entire websites, which is covered in the next section.

Here's the premise: You find a wonderful website called "The Old Time Radio Archives." On this website are a large number of vintage radio shows, available for download in MP3 format: 365 MP3s, to be exact, one for every day of the year. It would sure be nice to grab those MP3s, but the prospect of right-clicking on every MP3 hyperlink, choosing Save Link As, and then clicking OK to start the download isn't very appealing.

Examining the directory structure a bit more, you notice that the MP3s are organized in a directory structure like this:

http://www.oldtimeradioarchives.com/mp3/
    season_10/
    season_11/
    ...
    season_20/
    ...
    season_3/
    season_4/
    ...
    season_9/


Note

The directories are not sorted in numerical order, as humans would do it, but in alphabetical order, which is how computers sort numbers unless told otherwise. After all, "ten" comes before "three" alphabetically.
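
You can see the same behavior on your own machine. As a quick illustration (assuming those season_# directories existed locally, and that you have GNU ls, whose -v option sorts embedded numbers the way a human would):

$ ls
season_10  season_11  season_3  season_9
$ ls -v
season_3  season_9  season_10  season_11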


Inside each directory sit the MP3s. Some directories have just a few files in them, and some have close to 20. If you click on the link to a directory, you get a web page that lists the files in that directory like this:

[BACK] Parent Directory           19-May-2002 01:03      -
[SND]  1944-12-24_532.mp3         06-Jul-2002 13:54    6.0M
[SND]  1944-12-31_533.mp3         06-Jul-2002 14:28    6.5M
[SND]  1945-01-07_534.mp3         06-Jul-2002 20:05    6.8M


[SND] is a GIF image of musical notes that shows up in front of every file listing.

So the question is, how do you download all of these MP3s that have different filenames and exist in different directories? The answer is wget!

Start by creating a directory on your computer into which you'll download the MP3 files.

$ mkdir radio_mp3s 


Now use the cd command to get into that directory, and then run wget:

$ cd radio_mp3s
$ wget -r -l2 -np -w 5 -A.mp3 -R.html,.gif http://www.oldtimeradioarchives.com/mp3/


Let's walk through this command and its options.

wget is the command you're running, of course, and at the far end is the URL that you want wget to use: http://www.oldtimeradioarchives.com/mp3. The important stuff, though, lies in between the command and the URL.

The -r (or --recursive) option for wget follows links and goes down through directories in search of files. By telling wget that it is to act recursively, you ensure that wget will go through every season's directory, grabbing all the MP3s it finds.
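
Note that -r alone doesn't recurse forever: if you don't specify a depth, GNU wget stops at a default maximum of five levels. Shown here only as an illustration (you don't need to run it), the bare recursive form would be:

$ wget -r http://www.oldtimeradioarchives.com/mp3/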

The -l2 (or --level=[#]) option is important yet tricky. It tells wget how deep it should go in retrieving files recursively. The lowercase l stands for level and the number is the depth to which wget should descend. If you specified -l1 for level one, wget would look in the /mp3 directory only. That would result in a download of...nothing. Remember, the /mp3 directory contains other subdirectories: season_10, season_11, and so on, and those are directories that contain the MP3s you want. By specifying -l2, you're asking wget to first enter /mp3 (which would be level one), and then go into each season_# directory in turn and grab anything in it. You need to be very careful with the level you specify. If you aren't careful, you can easily fill your hard drive in very little time.
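
To make the levels concrete, here's how the two depths would play out against this site's layout (a side-by-side sketch, not commands you need to run):

$ wget -r -l1 -np http://www.oldtimeradioarchives.com/mp3/
$ wget -r -l2 -np http://www.oldtimeradioarchives.com/mp3/

The first command would fetch only the directory listings inside /mp3; the second goes one level deeper, into the season_# directories where the MP3s actually live.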

One of the ways to avoid downloading more than you expected is to use the -np (or --no-parent) option, which prevents wget from recursing into the parent directory. If you look back at the preceding list of files, you'll note that the very first link is the parent directory. In other words, when in /season_10, the parent is /mp3. The same is true for /season_11, /season_12, and so on. You don't want wget to go up, however; you want it to go down. And you certainly don't need to waste time by going up into the same directory, /mp3, every time you're in a season's directory.
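
The option matters even more if you start partway down the tree. Suppose you wanted just one season; without -np, the Parent Directory link would lead wget back up into /mp3 and from there into every other season. A minimal sketch:

$ wget -r -l1 -np -A.mp3 http://www.oldtimeradioarchives.com/mp3/season_10/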

This next option isn't required, but it would sure be polite of you to use it. The -w (or --wait=[#]) option introduces a short wait between each file download. This helps prevent overloading the server as you hammer it continuously for files. By default, the number is interpreted by wget as seconds; if you want, you can also specify minutes by appending m after the number, or hours with h, or even days with d.
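
In the command above, -w 5 means a five-second pause between files. If you wanted to be even gentler on the server, you could, for example, wait two minutes between downloads:

$ wget -r -l2 -np -w 2m -A.mp3 -R.html,.gif http://www.oldtimeradioarchives.com/mp3/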

Now it gets very interesting. The -A (or --accept) option tells wget that you want to download only files of a certain type and nothing else. The A stands for accept, and it's followed by the file suffixes that you want, separated by commas. You only want one kind of file type, MP3, so that's all you specify: -A.mp3.

On the flip side, the -R (or --reject) option tells wget what you don't want: HTML and GIF files. By refusing those, you don't get those little musical notes represented by [SND] shown previously. Separate your list of suffixes with a comma, giving you -R.html,.gif.
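
Both options take comma-separated lists. If the archive someday offered Ogg Vorbis copies of the shows as well (purely hypothetical, just to show the syntax), you could accept both audio formats while still rejecting the page furniture:

$ wget -r -l2 -np -w 5 -A.mp3,.ogg -R.html,.gif http://www.oldtimeradioarchives.com/mp3/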

Running wget with those options results in a download of 365 MP3s to your computer. If, for some reason, the transfer was interrupted (your router dies, someone trips over your Ethernet cable and yanks it out of your box, a backhoe rips up the fiber coming into your business), just repeat the command, but add the -c (or --continue) option. This tells wget to take over from where it was forced to stop. That way you don't download everything all over again.
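
In practice, the rescue is the original command with one extra flag; wget checks the partially downloaded files already on disk and resumes each one rather than starting over:

$ wget -c -r -l2 -np -w 5 -A.mp3 -R.html,.gif http://www.oldtimeradioarchives.com/mp3/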

Here's another example that uses wget to download files. A London DJ released two albums' worth of MP3s consisting of mash-ups of The Beatles and The Beastie Boys, giving us The Beastles, of course. The MP3s are listed, one after the other, on www.djbc.net/beastles.

The following command pulls the links out of that web page, writes them to a file, and then starts downloading those links using wget:

$ dog --links http://www.djbc.net/beastles/ | grep mp3 > beastles ; wget -i beastles
--12:58:12--  http://www.djbc.net/beastles/webcontent/djbc-holdittogethernow.mp3
           => 'djbc-holdittogethernow.mp3'
Resolving www.djbc.net... 216.227.209.173
Connecting to www.djbc.net|216.227.209.173|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4,533,083 (4.3M) [audio/mpeg]

100%[=========>] 4,533,083    203.20K/s    ETA 00:00

12:58:39 (166.88 KB/s) - 'djbc-holdittogethernow.mp3' saved [4533083/4533083]


In Chapter 5, "Viewing Files," you learned about cat, which outputs files to STDOUT. The "Concatenate Files and Number the Lines" section mentioned a better cat, known as dog (which is true because dogs are better than cats). If you invoke dog with the --links option and point it at a URL, the links are pulled out of the page and displayed on STDOUT. You pipe those links to grep, asking grep to filter out all but lines containing mp3, and then redirect the resulting MP3 links to a text file named beastles (piping and redirecting are covered in Chapter 4, "Building Blocks," and the grep command is covered in Chapter 9, "Finding Stuff: Easy").
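
Not every system has dog installed, though. If yours doesn't, you can approximate the link-extraction step with curl and grep's -o option, which prints only the matching text rather than the whole line. This sketch assumes the page links to its MP3s with absolute URLs, as this one does:

$ curl -s http://www.djbc.net/beastles/ | grep -o 'http[^"]*\.mp3' > beastles
$ wget -i beastles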

The semicolon (covered in the "Run Several Commands Sequentially" section in Chapter 4) ends that command and starts a new one: wget. The -i (or --input-file) option tells wget to look in a file for the URLs to download, instead of on the command line. If you have many links, put them all in a file and use the -i option with wget. In this case, you point wget to the beastles file you just created via dog and grep, and the MP3s begin to download, one after the other.
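
The input file is nothing special: just a plain text list of URLs, one per line, so you can also build one by hand. A small sketch, with a hypothetical file name and URLs:

$ cat urls.txt
http://www.example.com/shows/episode_01.mp3
http://www.example.com/shows/episode_02.mp3
$ wget -w 5 -i urls.txt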

Now really, what could be easier? Ah, the power of the Linux command line!


