Recipe 2.7. Downloading All Files from a Site


Problem

You need to create a backup, mirror, or offline copy of your web site.

Solution

Use the Unix utility wget to mirror the files on the server to another location either by HTTP with this command:

 wget --mirror http://yourwebsite.com 

or by FTP:

 wget --mirror ftp://username:password@yourwebsite.com 

Alternatively, you can use GUI-based utilities on your PC. Some choices are listed in the "See Also" section of this Recipe.

Discussion

With wget, you can perform heroic feats of webmastering, whether that means copying a single file from one site to another or moving an entire site to another server.

When spidering a site over HTTP, wget will only copy files it finds links to. Unused images and old web pages still lingering on the server will be skipped. Using FTP, wget will copy everything.


Some scenarios where wget can be indispensable include:


Keeping frequently updated pages or images in sync on two sites

Say you want to display a real-time webcam image on your site, but don't want to (or can't) use an absolute URL to the site where the camera saves the image in the image tag's src attribute. (Perhaps the other site's server is slower or less reliable than yours, or outside linking to the image has been disabled, as described in Recipe 5.5.) With wget, you can specify the URL of the file, a local directory on your server where it should be copied, and the number of times to retry a flaky HTTP connection. Combined with cron (see Recipe 1.8), wget can perform its connect-and-copy task as often as you (or your system administrator) want it to.
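For example, a crontab entry along these lines (the five-minute schedule, webcam URL, and local path shown here are placeholders) would quietly refresh the image, retrying a flaky connection up to three times:

 */5 * * * * wget --quiet --tries=3 -O /usr/local/www/images/webcam.jpg http://othercamerasite.com/webcam.jpg

The -O option writes the download to the path you specify, overwriting the previous copy each time.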


Setting up a mirror version of a site

Because wget can also connect via FTP, you can use it as part of your backup strategy. When wget retrieves a file over HTTP, you get the same rendered code that you would see if you viewed source on the page in a browser. Server-side code, such as PHP scripting and include file tags, won't show up. Using FTP, which requires adding a username and password to the wget command, yields the actual files with all the "pre-rendering" code intact. If an unexpected outage or traffic spike knocks your site offline, wget can help you quickly relocate it on another server.
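For example, a nightly backup might use a command along these lines (the credentials and destination directory are placeholders):

 wget --mirror --directory-prefix=/backups/yourwebsite ftp://username:password@yourwebsite.com

The --directory-prefix option (-P for short) tells wget where on the local filesystem to store the mirrored files.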


Getting all the files needed for an offline copy of a site

Offline copies of a site can be useful in situations where connecting to the real thing is impractical or impossible. For example, the sales staff wants to demonstrate your password-protected tech support site for a prospective customer. Putting a copy of the site on a CD or laptop hard drive can prevent connection or login problems that could sink the demo. In HTTP mode, wget can negotiate HTTP authentication logins (the type that appear in a browser-based pop-up dialog box). With its cookie-handling options, wget also can load authentication information from a previously saved session cookie to access a protected site, a technique best left to power users who have studied the wget manual (see the "See Also" section of this Recipe).
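For example, assuming a reasonably recent version of wget, either of these commands might retrieve the protected support site (the username, password, cookie file, and URL are placeholders):

 wget --mirror --http-user=demo --http-password=secret http://yourwebsite.com/support/
 wget --mirror --load-cookies=cookies.txt http://yourwebsite.com/support/

The first supplies HTTP authentication credentials on the command line; the second loads a previously saved session cookie from cookies.txt.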

However, one shortcoming of wget is its limited ability to handle dynamically generated sites. With the --html-extension option enabled, wget will append a .html suffix to dynamically generated pages, but links to those pages that include the query string will not be updated.
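A command for building such an offline copy might look something like this (the starting URL is a placeholder):

 wget --mirror --html-extension http://yourwebsite.com/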

For example, a site might have several FAQs stored in a database and displayed on the site through a PHP template that retrieves the content based on a record ID. A link to the page might look like this:

 <a href="/faq.php?id=1">Question No. 1</a> 

This link's destination file, retrieved by wget, will be named faq.php?id=1.html (or faq.php?id=1 if the extension option is not enabled). But the link itself won't be changed, which detracts from the offline browsing experience.

A PC site-downloading utility (two are listed in the "See Also" section of this Recipe, and there are surely others) can take the extra step of converting a dynamic site to a static site. For each dynamic link the utility finds while crawling a site, it creates a unique file (such as faq60148.php for the hypothetical FAQ) and updates all the links to point to the new static page.

See Also

You can use cron to schedule recurring uses of wget. For more information, see Recipe 1.8. For more on using wget, see http://www.gnu.org/software/wget/wget.html. Grab-a-Site for Windows (http://www.bluesquirrel.com/products/grabasite/) and SiteSucker for Mac (http://www.sitesucker.us/) are two desktop applications that go where wget does not. Both can download dynamic sites as static pages, among other features.


