Chapter 9: Web and Internet Administration


If you're running a web server or are responsible for a website, simple or complex, you probably find yourself performing certain tasks with great frequency, ranging from identifying broken internal and external site links to checking for spelling errors on web pages. Using shell scripts, you can automate these tasks, as well as some common client/server tasks, such as ensuring that a remote directory of files is always completely in sync with a local copy, to great effect.

#77 Identifying Broken Internal Links

The scripts in Chapter 7 highlighted the value and capabilities of the lynx text-only web browser, but there's even more power hidden within this tremendous software application. One capability that's particularly useful for a web administrator is the traverse function (enabled with the -traversal flag), which causes lynx to try to step through all links on a site to see whether any are broken. This feature can be harnessed in a short script.
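You can try the traversal mode by hand before wrapping it in a script. A quick sketch follows (the URL is just a placeholder, and exactly which bookkeeping files appear depends on what lynx finds on the site):

 $ lynx -traversal http://www.example.com/ > /dev/null
 $ ls reject.dat traverse.dat traverse2.dat traverse.errors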

The Code

 #!/bin/sh

 # checklinks - Traverses all internal URLs on a website, reporting
 #   any errors in the "traverse.errors" file.

 lynx="/usr/local/bin/lynx"      # this might need to be tweaked

 # Remove all the lynx traversal output files upon completion:

 trap "/bin/rm -f traverse*.errors reject*.dat traverse*.dat" 0

 if [ -z "$1" ] ; then
   echo "Usage: checklinks URL" >&2 ; exit 1
 fi

 $lynx -traversal "$1" > /dev/null

 if [ -s "traverse.errors" ] ; then
   echo -n $(wc -l < traverse.errors) errors encountered.
   echo Checked $(grep '^http' traverse.dat | wc -l) pages at ${1}:
   sed "s|$1||g" < traverse.errors
 else
   echo -n "No errors encountered. "
   echo Checked $(grep '^http' traverse.dat | wc -l) pages at ${1}
   exit 0
 fi

 baseurl="$(echo $1 | cut -d/ -f3)"

 mv traverse.errors ${baseurl}.errors

 echo "(A copy of this output has been saved in ${baseurl}.errors)"

 exit 0

How It Works

The vast majority of the work in this script is done by lynx ; the script just fiddles with the resultant lynx output files to summarize and display the data attractively. The lynx output file reject.dat contains a list of links pointing to external URLs (see Script #78, Reporting Broken External Links, for how to exploit this data); traverse.errors contains a list of failed, invalid links (the gist of this script); traverse.dat contains a list of all pages checked; and traverse2.dat is identical to traverse.dat except that it also includes the title of every page visited.
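If you'd like to double-check the script's arithmetic, the traversal files are plain text and easy to inspect once a run has finished. A sketch (the exact counts naturally depend on the site you scanned):

 $ grep -c '^http' traverse.dat     # pages actually visited
 $ sort -u reject.dat | wc -l       # distinct external links (used by Script #78)
 $ cat traverse.errors              # the broken internal links, if any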

Running the Script

To run this script, simply specify a URL on the command line. Because it goes out to the network, you can traverse and check any website, but beware: Checking something like Google or Yahoo! will take forever and eat up all of your disk space in the process.

The Result

First off, let's check a tiny website that has no errors:

 $ checklinks http://www.ourecopass.org/
 No errors encountered. Checked 4 pages at http://www.ourecopass.org/

Sure enough, all is well. How about a slightly larger site?

 $ checklinks http://www.clickthrustats.com/
 1 errors encountered. Checked 9 pages at http://www.clickthrustats.com/:
 contactus.shtml         in privacy.shtml
 (A copy of this output has been saved in www.clickthrustats.com.errors)

This means that the file privacy.shtml contains a link to contactus.shtml that cannot be resolved: The file contactus.shtml does not exist. Finally, let's check my main website to see what link errors might be lurking:

 $ date ; checklinks http://www.intuitive.com/ ; date
 Tue Sep 16 21:55:39 GMT 2003
 6 errors encountered. Checked 728 pages at http://www.intuitive.com/:
 library/f8      in library/ArtofWriting.shtml
 library/f11     in library/ArtofWriting.shtml
 library/f16     in library/ArtofWriting.shtml
 library/f18     in library/ArtofWriting.shtml
 articles/cookies/       in articles/csi-chat.html
 ~taylor         in articles/aol-transcript.html
 (A copy of this output has been saved in www.intuitive.com.errors)
 Tue Sep 16 22:02:50 GMT 2003

Notice that adding a call to date before and after a long command is a lazy way to see how long the command takes. Here you can see that checking the 728-page intuitive.com site took just over seven minutes.
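If you'd rather have the elapsed time computed for you, the shell's built-in time keyword is a handy alternative to the pair of date calls (the output format varies slightly from shell to shell):

 $ time checklinks http://www.intuitive.com/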

Hacking the Script

The grep statement in this script produces a list of all files checked, which can be fed to wc -l to ascertain how many pages have been examined. The actual errors are found in the traverse.errors file:

 echo Checked $(grep '^http' traverse.dat | wc -l) pages at ${1}:
 sed "s|$1||g" < traverse.errors

To have this script report on image (img) reference errors instead, grep the traverse.errors file for gif, jpeg, or png filename suffixes before feeding the result to the sed statement (which just cleans up the output format to make it attractive).
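Here's one way that modification might look, replacing the sed line shown above (the regular expression is only a starting point, and assumes your image references end in common suffixes):

 grep -Ei '\.(gif|jpe?g|png)' traverse.errors | sed "s|$1||g"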



