Download Websites Non-interactively


wget

If you want to back up your website or download another website, look to wget. In the previous section you used wget to grab individual files, but you can use it to obtain entire sites as well.

Note

Please be reasonable with wget. Don't download enormous sites, and keep in mind that someone created and owns each site you copy. Don't copy a site because you want to "steal" it.


Let's say you're buzzing around a site at www.neato.com and you find yourself at www.neato.com/articles/index.htm. You'd like to copy everything in the /articles section, but you don't want anything else on the site. The following command does what you'd like:

$ wget -E -r -k -p -w 5 -np http://www.neato.com/articles/index.htm 


You could have combined the options this way, as well:

$ wget -Erkp -w 5 -np http://www.neato.com/articles/index.htm 


As in the previous section, the command begins with wget and ends with the URL you want to use. You saw the -w (or --wait=[#]), -np (or --no-parent), and -r (or --recursive) options before, and they behave the same here. Let's examine the options that are new in this example.
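Spelled out with long options, the same command reads as follows (www.neato.com is the chapter's made-up example site, so this is for readability rather than for running as-is):

```shell
$ wget --html-extension --recursive --convert-links --page-requisites \
       --wait=5 --no-parent http://www.neato.com/articles/index.htm
```

The long forms are easier to read back in a shell script or a shell history six months later, which is worth the extra typing.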

When you download a site, some pages might not end with .htm or .html; instead, they might end with .asp, .php, .cfm, or something else. That becomes a problem when you try to view the downloaded site on your computer. If you're running a web server on your desktop, things might look just fine, but more than likely you aren't. Even without a web server, however, pages ending in .htm or .html will open fine in a web browser. If you use the -E (or --html-extension) option, wget renames every downloaded page so that it ends with .html, enabling you to view the site on your computer without any special software.
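The renaming that -E performs amounts to something like the following local sketch (the demo/ directory and index.php file are made up for illustration; when you pass -E, wget does this for you as it saves each page):

```shell
# Simulate -E's effect: give a .php page an .html suffix so a web
# browser on your desktop opens it. All paths here are made up.
mkdir -p demo/articles
touch demo/articles/index.php
for f in demo/articles/*.php; do
  mv "$f" "${f%.php}.html"
done
ls demo/articles   # index.html
```

The `${f%.php}` expansion strips the old extension before the new one is appended, which is exactly the kind of bookkeeping -E saves you from doing by hand.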

Downloading a site might introduce other issues, which you can fortunately get around with the right wget options. Links on the pages you download might still point at the live site, so clicking them on your computer won't take you from local page to local page. By specifying the -k (or --convert-links) option, you order wget to rewrite the links in each page so they work locally. This fixes not only links to other pages, but also links to images, Cascading Style Sheets, and other files. You'll be glad you used it.
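The rewrite -k performs looks roughly like this, illustrated here with sed on one made-up link from the example site (wget does this automatically and more thoroughly, covering every link it downloaded):

```shell
# A link as it appears on the live site (URL is the chapter's made-up example):
link='<a href="http://www.neato.com/articles/tips.html">Tips</a>'
# After -k, an absolute link to a page you also downloaded becomes relative:
echo "$link" | sed 's|http://www.neato.com/articles/|./|'
```

Once the link is relative, your browser resolves it against the local copy of the page instead of going back out to the Web.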

Speaking of Cascading Style Sheets (CSS) and images, they're why you want to use the -p (or --page-requisites) option. In order for the web page to display correctly, the web developer might have specified images, CSS, and JavaScript files to be used along with the page's HTML. The -p option requires that wget download any files needed to display the web pages that you're grabbing. With it, looking at the page after it's on your machine duplicates what you saw on the Web; without it, you might end up with an unreadable file.

The man page for wget is enormously long and detailed, and it's where you will ultimately end up if you want to use wget in a more sophisticated way. If you think wget sounds interesting to you, start reading. You'll learn a lot.



Linux Phrasebook
ISBN: 0672328380
Year: 2007
Pages: 288
