Hack 35 Downloading Movies from the Library of Congress


Often, downloading from the Web is accomplished more easily with a little exploration and a command-line utility or favorite browser than with even the most accomplished programming.

When you're looking at a web site, trying to decipher how best to get at the meaty files or information within, you may already have a preconceived notion that it's going to take more than a little programming to get at it. Often, this is entirely unnecessary.

Directory Indexes

Check to see if the site shows directory indexes, automatically generated lists of the files within a directory. This is easier than it sounds. In most cases, when you create a directory to serve web pages from, you add an index.html or default.html to signify your default document for that location. Since these are special filenames, you don't have to specify them in the final URL. For example, the following two addresses are equivalent:

 http://example.com/directory/index.html
 http://example.com/directory/

The second, while being shorter and easier to type, also ensures that redesigns (such as a move from .html to .php) can happen in the future without worrying about redirecting nonexistent files to the right place.

In some cases, though, directories are holding tanks for media files, like your typical /images/, /graphics/, or /movies/. Without a user-generated index.html file, these directories can often be accessed directly via your browser:

 http://example.com/directory/images/
 http://example.com/directory/movies/

One of a few different things will happen, depending on the web server and how it has been configured. You may well get a "Directory Listing Denied" error, which is the web server's way of saying "I've been told not to automatically generate a listing of files within this directory." Sometimes, the web site owner has dropped a snarky comment into an index.html file (something to the effect of "These are not the files you seek") or just a blank page. If you're lucky, you'll be provided with a listing of everything in the directory. Figure 3-2 shows the contents of my images directory, http://disobey.com/images/, in my browser.

Figure 3-2. An example of a generated file listing
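
The three outcomes described above can also be told apart from the command line. What follows is only a rough sketch: the function names classify_dir and probe_dir are made up for illustration, and the "Index of" title check assumes an Apache-style generated listing, which other servers may not produce.

```shell
#!/bin/sh
# classify_dir STATUS BODY_FILE: guess what a directory URL returned,
# given the HTTP status code and a file holding the response body.
# ASSUMPTION: generated listings carry an Apache-style "Index of" title.
classify_dir() {
    status=$1
    body_file=$2
    case $status in
        401|403) echo "denied" ;;      # e.g. "Directory Listing Denied"
        404)     echo "missing" ;;
        200)
            if grep -qi '<title>Index of' "$body_file"; then
                echo "listing"         # the lucky case
            elif [ ! -s "$body_file" ]; then
                echo "blank"           # completely empty response
            else
                echo "page"            # an index.html, snarky or otherwise
            fi ;;
        *)       echo "unknown" ;;
    esac
}

# probe_dir URL: fetch URL with curl and classify the response.
probe_dir() {
    tmp=$(mktemp) || return 1
    status=$(curl -s -o "$tmp" -w '%{http_code}' "$1")
    classify_dir "$status" "$tmp"
    rm -f "$tmp"
}
```

Running probe_dir http://example.com/directory/images/ would then print a single word describing what came back. A page that is "blank" only because its HTML body is empty will show up as "page" here, so treat the answer as a hint and still look at the result in your browser.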

There's no need to scrape those hrefs or srcs from an HTML page; what you see is what's available to you, which can be amazingly handy. Not only can you send this simplified directory listing to a shell utility like curl or wget [Hack #26], but some web browsers have a Download Manager in which to queue files for local storage. For example, in Internet Explorer on OS X, I can Option-click these images (or any other file) to queue them automatically for downloading. (Yes, you could do this on any page, directory listing or not, but when the links are spread out over several HTML pages, it can be more trouble than it's worth.) Figure 3-3 shows Internet Explorer's Download Manager.

Figure 3-3. Internet Explorer's Download Manager
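
If you'd rather hand a directory listing to curl than click through it, something like the following may be enough. It is a sketch, not a robust HTML parser: the function name list_links is hypothetical, and it assumes double-quoted href attributes and relative filenames, as Apache-style listings typically produce.

```shell
#!/bin/sh
# list_links BASE_URL: read a directory-listing page on stdin and print
# one absolute URL for each file it links to.
# ASSUMPTION: hrefs are double-quoted and files are linked by relative name.
list_links() {
    base=$1
    grep -io 'href="[^"]*"' |
        sed -e 's/^[Hh][Rr][Ee][Ff]="//' -e 's/"$//' |
        # Drop column-sort links (?C=...), the parent directory,
        # subdirectories, and links that are already absolute URLs.
        grep -v -e '^?' -e '^/' -e '/$' -e '^[a-zA-Z][a-zA-Z0-9+.-]*:' |
        while IFS= read -r name; do
            printf '%s%s\n' "$base" "$name"
        done
}

# Possible usage (not run here), fetching each file with curl -O:
#   curl -s http://example.com/directory/movies/ |
#       list_links http://example.com/directory/movies/ |
#       xargs -n 1 curl -O
```

The filtering is deliberately conservative: anything that doesn't look like a plain file in the same directory is skipped, so you may need to loosen it for listings that link to files with absolute paths.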

An Example: Origins of American Animation

The Origins of American Animation site (http://lcweb2.loc.gov/ammem/oahtml/oahome.html) offers 21 animated films and two fragments from the years 1900-1921. If you thought American animation started with Steamboat Willie, get over it. These animations, though mostly very simple, are worth seeing. He Resolves Not to Smoke has readable dialog, and it's charming and very, very weird at the same time. It's definitely worth taking the time to archive for your own offline use.

The URL linking to He Resolves Not To Smoke is:

 http://lcweb2.loc.gov/cgi-bin/query/S?ammem/papr:@FILREQ(@field(TITLE+@od1(He+resolves+not+to+smoke++))+@FIELD(COLLID+animat))

This is nothing I want to decipher anytime soon and, keeping to a strict "need to know" basis, we shouldn't have to. Each movie in the exhibit comes in three different file formats, and if we mouse over each of the links for He Resolves Not to Smoke, we see:

 http://memory.loc.gov/mbrs/animp/4067.ram
 http://memory.loc.gov/mbrs/animp/4067.mpg
 http://memory.loc.gov/mbrs/animp/4067.mov

These are much simpler URLs to remember, type, and understand. Even better, as we examine other movie pages, we find that their files are all located in the same directory. Taking a calculated chance (calculated in the sense that I'm basing a hack on its success), load the following in your browser:

 http://memory.loc.gov/mbrs/animp/ 

We now have an automated directory listing that provides a complete list of files available to us, without having to follow link after link from page after page just to arrive at the same directory.
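
The guess that led us here, trimming a file URL back to its directory and checking that several files share it, is easy to express in the shell. The helper names parent_dir and share_dir below are invented for this sketch, and it assumes each URL ends in a filename rather than a slash.

```shell
#!/bin/sh
# parent_dir URL: strip the final path component, keeping the trailing slash.
# ASSUMPTION: URL has a path ending in a filename (not a bare host name).
parent_dir() {
    printf '%s/\n' "${1%/*}"
}

# share_dir URL...: succeed only if every URL has the same parent directory.
share_dir() {
    first=$(parent_dir "$1")
    for u in "$@"; do
        [ "$(parent_dir "$u")" = "$first" ] || return 1
    done
}
```

Given the three links for He Resolves Not to Smoke, parent_dir returns http://memory.loc.gov/mbrs/animp/ for each, and share_dir succeeds. Whether the server is actually willing to list that directory is something only a request can tell you.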

If you're using Internet Explorer on Mac OS X, you can Option-click on each of the files in the directory, adding the movies to your download queue. After just a couple of minutes' work, you can go merrily about your day, letting IE do the work of grabbing all those files for you.

Alternatively, now that we have a single listing of all available files, as opposed to multiple pages to worry about, we can use an automated utility like wget [Hack #26] to grab them all in a single command:

 % wget -m --accept=mpg http://memory.loc.gov/mbrs/animp/

This instructs wget to mirror everything under the URL, but to accept (and save) only files that end in .mpg. A single command, and that's all there is to it.
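
A slightly more cautious variation on that command is sketched below. It is my own wrapper, not part of the original hack: the function name mirror_types and the DRY_RUN variable are hypothetical. It adds --no-parent so that wget doesn't climb out of the directory via the listing's parent link, --wait=1 to pause a second between requests, and a comma-separated --accept list so more than one format can be kept.

```shell
#!/bin/sh
# mirror_types URL EXT...: mirror only files ending in the given extensions.
# Set DRY_RUN to any non-empty value to print the command instead of running it.
mirror_types() {
    url=$1
    shift
    # Join the remaining arguments with commas, e.g. "mpg,mov".
    types=$(IFS=,; echo "$*")
    ${DRY_RUN:+echo} wget --mirror --no-parent --wait=1 --accept="$types" "$url"
}

# Possible usage (not run here):
#   mirror_types http://memory.loc.gov/mbrs/animp/ mpg mov
```

Trying it first with DRY_RUN=1 shows the exact wget command that would be issued, which is a cheap way to check the options before committing to a long download.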

Another Example: America at Work, America at Leisure

The same technique can be applied to another find: the Library of Congress's America at Work, America at Leisure project (http://memory.loc.gov/ammem/awlhtml/). The site includes more than 100 films made between 1894 and 1915. There are films of firefighters, ice manufacturers, paperboys (you name it), all going about their business. There are lots of Edison films here too.

The directory listing you're after lives at http://lcweb2.loc.gov/mbrs/awal/.



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
