Hack 92 Mirroring Web Sites with wget and rsync


Is there a site you check frequently, or do you want a backup of your own site? Various mirroring tools can help you keep a complete duplicate on another machine.

Maybe you have a favorite picture site that you are always checking out or a music site that posts new files daily; either way, it can be cumbersome to write a script to grab new content every day for each site. And then there is the hassle of always making a backup of your own sites. Regardless, you do not always have to write some fancy LWP script or spend hours parsing HTML, trying to get just what you want. If you have the disk space, mirroring the site can be an effective, if somewhat lazy, solution.

There are two basic ways to go about mirroring sites: downloading the pages through the web server or accessing the content directly from the server it is hosted on. Obviously, if you don't have access to the server itself, you will have to go with the first option.

Mirroring via the Web

For downloading content directly from the web, the easiest tool to use is wget (http://www.gnu.org/software/wget/wget.html). With one command line, you can mirror an entire site, set it up in cron, and forget about it. The -m option to wget is most likely all you will need:

 %  wget -m "http://www.gnu.org/"  

Run this and watch your screen scroll. The first time you mirror a site will likely take quite a while; remember, you are getting every image, file, and web page on the site. (There are ways to limit how far wget goes, which we'll get to in a minute.) Once it is finished, you will notice a new directory in the current directory with the same name as the root of the site you downloaded (for example, www.gnu.org). Within that directory, you will find the entire structure of the site, which you can browse with your favorite browser as if you were on the actual site. From here, all you have to do is run the same command from the same directory you ran it in initially, and it will go back through the site and update files as necessary, taking advantage of whatever the HTTP protocol offers for determining which files need to be updated.
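To make the directory layout described above concrete, here is a minimal sketch. The helper name mirror_dir_for is hypothetical (it is not part of wget), and it assumes wget's default behavior (no -nH or -P option) and a plain URL without user information.

```shell
# Hypothetical helper: print the host part of a URL, which is the name of
# the directory wget creates by default for a recursive download.
# Assumes default wget settings (no -nH, no -P).
mirror_dir_for() {
    url=$1
    rest=${url#*://}              # strip the scheme, e.g. "http://"
    printf '%s\n' "${rest%%/*}"   # keep everything before the first "/"
}

# Usage (the wget line needs network access, so it is left as a comment):
#   cd ~/mirrors && wget -m "http://www.gnu.org/"
#   ls "$(mirror_dir_for "http://www.gnu.org/")"
mirror_dir_for "http://www.gnu.org/"
```

A later run of the same wget command from the same parent directory reuses that directory rather than creating a new one.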

When you first run a mirror, you may quickly realize that you're grabbing more content than you need. One way to help limit this is not to mirror the entire site; instead, mirror only what you really want:

 %  wget -m "http://www.gnu.org/software/wget/wget.html"  

This is still not perfect. However, here is where some more of wget's options come into play. First of all, there is the -np option. This option prevents wget from going into a higher directory than the one you made it start in. For example, the following command:

 %  wget -m -np "http://www.gnu.org/software/wget/wget.html"  

does not download any pages in http://www.gnu.org/software/, nor does it go across and download anything from different web sites, which is probably a good idea. However, should you want wget to travel to different hosts, try adding the -H option.

Here are a few more options you may find worthwhile. The -l option limits how many levels of links wget will follow. For example, passing -l1 makes wget get the page you specify and all the pages it links to, then stop. The -L option is useful if you want to follow only relative links. Not only does this prevent wget from crossing over and downloading from another site, but it also prevents it from following absolute URLs within the same site.

If you plan on browsing the HTML pages locally after mirroring, you will want to take a look at the -k and -p options. The -k option converts the links in the downloaded pages so that links to files you have mirrored point at your local copies; this keeps you from crossing over into the live site if you are browsing locally and click on a link that previously was absolute. The -p option makes sure wget downloads all the images and other files needed to display a page; files hosted on a different site are fetched only if you also allow host spanning with -H.
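The options above combine freely. As a rough sketch, the helper below assembles a command line from them and prints it so you can inspect it before running anything. The name build_mirror_cmd and its argument order are assumptions made for this illustration only.

```shell
# Hypothetical helper: print a wget command built from the options
# discussed above. Nothing is downloaded; the command is only printed.
#   $1 = start URL
#   $2 = optional maximum recursion depth (passed to -l)
build_mirror_cmd() {
    url=$1
    depth=${2:-}
    # -m   mirror: recursion plus timestamp checks
    # -np  never ascend to the parent directory
    # -k   convert links for local browsing
    # -p   also fetch images and other page requisites
    cmd="wget -m -np -k -p"
    if [ -n "$depth" ]; then
        # -m implies unlimited depth, so -l must come after it to take effect.
        cmd="$cmd -l $depth"
    fi
    printf '%s "%s"\n' "$cmd" "$url"
}

build_mirror_cmd "http://www.gnu.org/software/wget/wget.html" 1
```

Printing the command first is a cheap way to check your options before committing to a long download.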

After you determine which options are best for your particular need, simply run it once in a directory and then cron the job with a line like this:

 0 0 * * 6 (cd ~/mirrors && wget -m -np -k -p "http://www.gnu.org/" >/dev/null 2>&1) 
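If the cron line grows unwieldy, you can move the work into a small script. The following is only a sketch: the function name run_mirror, the WGET and MIRROR_DIR variables, and the mirror.log file name are all assumptions for this example, not anything wget or cron requires.

```shell
# Hypothetical cron wrapper. Set MIRROR_DIR to choose where mirrors live
# (default: ~/mirrors) and WGET to point at a different wget binary.
run_mirror() {
    url=$1
    dir=${MIRROR_DIR:-"$HOME/mirrors"}
    mkdir -p "$dir" || return 1
    (
        # Work in a subshell so the caller's current directory is untouched,
        # and stop if the mirror directory cannot be entered.
        cd "$dir" || exit 1
        # wget writes its progress report to stderr, so capture both
        # streams in a log file instead of letting cron mail them.
        "${WGET:-wget}" -m -np -k -p "$url" >mirror.log 2>&1
    )
}

# Usage (needs network access, so it is left as a comment):
#   run_mirror "http://www.gnu.org/"
```

The crontab entry then only has to call the script that contains this function, and the last run's output is available in mirror.log if something goes wrong.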

Mirroring Directly with the Server

In this case, you have access to the server. Most likely, you want to mirror your own site or perhaps some other data. For this, rsync (http://rsync.samba.org/) is the ideal tool. rsync is a versatile tool for mirroring or backing up data across computers. There are multiple ways of using rsync between machines; however, here we are going to use ssh. This is the easiest to configure and has the added advantage of providing good security for the files being transferred.

Obviously, you will need ssh installed and configured on both systems. You will also need to make sure you can log in to the system you want to mirror from. Next, you'll need to determine which directories you want to mirror and where you want them mirrored to on your system. With that in mind, you just need to run rsync, passing it the necessary options:

  rsync -a -e ssh remote.machine.com:/some/directory /local/directory  

The -a option tells rsync you want to mirror the directory; it sets a series of options that make rsync keep timestamps, permissions, user and group ownership, soft links, and so on for all the files, and it recurses through the directory. The -e option followed by ssh tells rsync to use ssh to connect to the remote server; if you are not using public key authentication, you will be prompted for the password by ssh when it connects to the server. The next argument is the server to connect to, followed by the directory to mirror from and, finally, the directory to mirror to. Make sure the last argument is a directory that already exists on your system, because rsync will create directories only inside this one.

Before getting into more options, it is a good idea to take a look at what this command just did. The mirrored directory should now appear the same as the directory on the server. Every file should be the same and should have the same timestamps, permissions, and so on. If you ran the command as root, the mirrored files will also have the same owners, assuming those users exist on this system.
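Where the files end up depends on whether the source path ends in a slash. The sketch below shows the difference using two local directories, so no server or ssh is involved; the function name demo_layout and the copy1/copy2 directory names are made up for this illustration. It assumes rsync is installed locally.

```shell
# Hypothetical demonstration of rsync's source-path rule.
#   $1 = an empty scratch directory to work in
demo_layout() {
    work=$1
    mkdir -p "$work/site/images" "$work/copy1" "$work/copy2" || return 1
    echo "hello" >"$work/site/index.html"

    # No trailing slash on the source: the directory itself is copied,
    # giving copy1/site/index.html.
    rsync -a "$work/site" "$work/copy1" || return 1

    # Trailing slash on the source: only its contents are copied,
    # giving copy2/index.html.
    rsync -a "$work/site/" "$work/copy2" || return 1

    # List the resulting files relative to the scratch directory.
    (cd "$work" && find copy1 copy2 -type f | sort)
}

# Usage:
#   demo_layout "$(mktemp -d)"
```

The command shown earlier has no trailing slash on /some/directory, so the files land in /local/directory/directory.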

Now, we'll talk a bit about how rsync works; it was designed for exactly what we are using it for. It checks each file, comparing it to see if changes were made. If changes exist, it attempts to update the local file by sending only the parts that have changed. For new files, it sends the whole file. This is great, because not only is it better at checking for changes than wget's use of the HTTP protocol, but it also tries to send only the data necessary to update the file, saving nicely on bandwidth.

Now, onto the other options. The -z option is probably one you will always want to use; it tells rsync to compress the data stream, decreasing bandwidth and most likely making the entire process go faster.

The -v option tells rsync to spit out the names of the files it is syncing; this works well when coupled with the --progress and --stats options. The former adds a progress indicator to each file as it is downloaded, and the latter details statistics about the entire mirroring operation.

The -u option tells rsync to only update files (it does not touch local files with a timestamp newer than the one on the server). This option is useful only if you modify the files locally and want to keep those changes. If you intend to keep a fully accurate mirror of the remote site, do not use this option; however, keep in mind that any changes you make to the files locally will be overwritten.

Finally, the --delete option deletes files that no longer exist on the server. If a file is deleted on the server, it will also be deleted on your backup. Again, this is very useful if you want to maintain an exact mirror of the files on the server.
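Because --delete removes local files, it is worth previewing a run before trusting it. The sketch below prints two command lines: a trial run using rsync's -n (dry run) option and the real thing. The helper name mirror_cmd and its "dry" argument are assumptions made for this example.

```shell
# Hypothetical helper: print an rsync command for an exact mirror.
#   $1 = source (host:/path)
#   $2 = local destination directory
#   $3 = "dry" to print a trial-run command instead
mirror_cmd() {
    # -a archive mode, -z compress, --delete remove local files that are
    # gone on the server, -e ssh connect over ssh
    opts="-az --delete -e ssh"
    if [ "${3:-}" = "dry" ]; then
        # -n makes no changes; -v lists what would be transferred or deleted
        opts="-n -v $opts"
    fi
    printf 'rsync %s %s %s\n' "$opts" "$1" "$2"
}

mirror_cmd remote.machine.com:/some/directory /local/directory dry
mirror_cmd remote.machine.com:/some/directory /local/directory
```

Run the first printed command, read through the list of files it reports, and only then run the second.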

Hacking the Hack

The way we use rsync here is secure and easy to set up. However, if you do not have ssh installed, you may be looking for an alternative. There are basically two other options. One is to use rsync with rsh instead of ssh. This still requires setup on the server, and although rsh is the more traditional choice, it is considerably less secure than ssh. If you use rsh, replace the -e ssh option with -e rsh (recent versions of rsync default to ssh when -e is omitted) and make sure you have rsh set up correctly on your server. Another option is to run rsync as a service on the server. This option does not have the security of ssh, but it allows you to use rsync without having ssh set up. To do this, you still need rsync installed on both machines, but you have to create an rsync configuration file on the server and make sure rsync runs as a service.

To begin, you'll want the following command run at startup on the server:

 rsync --daemon 

Then, you will want to create a configuration file for rsync , such as the following:

 [backup]
   path = /some/directory

Put this in the /etc/rsyncd.conf file and have rsync start as shown previously. Now, when you connect to the rsync server, you will want to change the options a bit. Instead of remote.machine.com:/some/directory, you will want remote.machine.com::backup. This tells rsync to connect to the backup module on the rsync server. You will also want to omit the -e ssh option. There is more you can do with the rsyncd.conf file, including restricting access based on usernames, setting read-only access, and so on. For a complete list of options, view the manpage for rsyncd.conf by typing man rsyncd.conf.
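As a sketch of the access restrictions mentioned above, the function below writes a slightly fuller configuration to a file of your choosing. The function name write_rsyncd_conf, the user name mirror, and the /etc/rsyncd.secrets path are assumptions for this example; the parameter names (comment, read only, auth users, secrets file) come from the rsyncd.conf manpage.

```shell
# Hypothetical helper: write an example rsyncd.conf to the given path.
#   $1 = file to create (on a real server this would be /etc/rsyncd.conf)
write_rsyncd_conf() {
    cat >"$1" <<'EOF'
[backup]
    path = /some/directory
    comment = Site backup
    # Clients may download from this module but not upload to it.
    read only = yes
    # Only this user may connect; the name "mirror" is an example.
    auth users = mirror
    # File of user:password lines; it must not be readable by other users.
    secrets file = /etc/rsyncd.secrets
EOF
}

# Usage on the client, which prompts for the password of the "mirror" user:
#   rsync -az mirror@remote.machine.com::backup /local/directory
```

Writing to a scratch path first lets you review the file before installing it as /etc/rsyncd.conf.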

Adam Bregenzer



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
