#78 Reporting Broken External Links

This partner script to Script #77, Identifying Broken Internal Links, utilizes the -traversal option of lynx to generate and test a set of external links (links to other websites). When run as a traversal of a site, lynx produces a number of data files, one of which is called reject.dat. The reject.dat file contains a list of all external links, both website links and mailto: links. By iteratively trying to access each http link in reject.dat, you can quickly ascertain which sites work and which fail to resolve, which is exactly what this script does.
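If you'd like to see this raw data for yourself, you can run the traversal by hand. Here's a minimal sketch, assuming lynx is in your PATH and using a hypothetical URL:

 # Generate the traversal data files for a site, then list the
 # unique external http links that lynx collected in reject.dat.
 lynx -traversal http://www.example.com/ > /dev/null
 grep '^http:' reject.dat | sort -u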

The Code

 #!/bin/sh

 # checkexternal - Traverses all internal URLs on a website to build a
 #   list of external references, then checks each one to ascertain
 #   which might be dead or otherwise broken. The -a flag forces the
 #   script to list all matches, whether they're accessible or not: by
 #   default only unreachable links are shown.

 lynx="/usr/local/bin/lynx"      # might need to be tweaked
 listall=0; errors=0             # shortcut: two vars on one line!

 if [ "$1" = "-a" ] ; then
   listall=1; shift
 fi

 outfile="$(echo "$1" | cut -d/ -f3).external-errors"
 /bin/rm -f $outfile             # clean it for new output

 trap "/bin/rm -f traverse*.errors reject*.dat traverse*.dat" 0

 if [ -z "$1" ] ; then
   echo "Usage: $(basename $0) [-a] URL" >&2
   exit 1
 fi

 # Create the data files needed.

 $lynx -traversal "$1" > /dev/null

 if [ -s "reject.dat" ] ; then

   echo $(sort -u reject.dat | wc -l) external links encountered \
     in $(grep '^http' traverse.dat | wc -l) pages

   for URL in $(grep '^http:' reject.dat | sort -u)
   do
     if ! $lynx -dump $URL > /dev/null 2>&1 ; then
       echo "Failed : $URL" >> $outfile
       errors="$(($errors + 1))"
     elif [ $listall -eq 1 ] ; then
       echo "Success: $URL" >> $outfile
     fi
   done

   if [ -s $outfile ] ; then
     cat $outfile
     echo "(A copy of this output has been saved in ${outfile})"
   elif [ $listall -eq 0 -a $errors -eq 0 ] ; then
     echo "No problems encountered."
   fi
 else
   echo -n "No external links encountered "
   echo in $(grep '^http' traverse.dat | wc -l) pages.
 fi

 exit 0

How It Works

This is not the most elegant script in this book. It's more of a brute-force method of checking external links, because for each external link found, the lynx command tests the validity of the link by trying to grab the contents of its URL and then discarding them as soon as they've arrived, as shown in the following block of code:

 if ! $lynx -dump $URL > /dev/null 2>&1 ; then
   echo "Failed : $URL" >> $outfile
   errors="$(($errors + 1))"
 elif [ $listall -eq 1 ] ; then
   echo "Success: $URL" >> $outfile
 fi

The notation 2>&1 is worth mentioning here: it causes output device #2 to be redirected to whatever output device #1 is set to. With a shell, output #2 is stderr (for error messages) and output #1 is stdout (regular output). Used alone, 2>&1 will cause stderr to go to stdout. In this instance, however, notice that prior to this redirection, stdout is already redirected to the so-called bit bucket of /dev/null (a virtual device that can be fed an infinite amount of data without ever getting any bigger; think of a black hole, and you'll be on the right track). Therefore, this notation ensures that stderr is also redirected to /dev/null. We're throwing all of this information away because all we're really interested in is whether lynx returns a zero or nonzero return code from this command (zero indicates success; nonzero indicates an error).
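If you'd like to convince yourself that the order of the redirections matters, here's a small experiment you can run from the command line (the URL is hypothetical, and lynx is assumed to be in your PATH):

 # Both streams discarded: stdout goes to /dev/null first, then stderr
 # (2) is pointed at wherever stdout (1) now points. Only the exit
 # status survives.
 if ! lynx -dump http://www.example.com/no-such-page > /dev/null 2>&1 ; then
   echo "lynx reported a problem (nonzero exit status)"
 fi

 # Reversed order: stderr is duplicated onto the terminal *before*
 # stdout is redirected, so error messages would still appear on screen.
 lynx -dump http://www.example.com/no-such-page 2>&1 > /dev/null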

The number of internal pages traversed is calculated from the line count of the file traverse.dat, and the number of external links is found by looking at reject.dat. If the -a flag is specified, the output lists all external links, whether they're reachable or not; otherwise, only failed URLs are displayed.
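You can compute the same counts by hand, assuming you've run lynx -traversal yourself (as in the earlier sketch) so that the data files are still present; the script itself deletes them on exit via its trap:

 # Pages traversed: lines in traverse.dat that begin with http.
 grep '^http' traverse.dat | wc -l

 # Unique external links: deduplicated line count of reject.dat.
 sort -u reject.dat | wc -l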

Running the Script

To run this script, simply specify the URL of a site to check.

The Results

Let's check a simple site with a known bad link. The -a flag lists all external links, valid or not.

 $ checkexternal -a http://www.ourecopass.org/
 8 external links encountered in 4 pages
 Failed : http://www.badlink/somewhere.html
 Success: http://www.ci.boulder.co.us/goboulder/
 Success: http://www.ecopass.org/
 Success: http://www.intuitive.com/
 Success: http://www.ridearrangers.org/
 Success: http://www.rtd-denver.com/
 Success: http://www.transitalliance.org/
 Success: http://www.us36tmo.org/
 (A copy of this output has been saved in www.ourecopass.org.external-errors)

To find the bad link, we can easily use the grep command on the set of HTML source files:

 $ grep 'badlink/somewhere.html' ~ecopass/*
 ~ecopass/contact.html:<a href="http://www.badlink/somewhere.html">bad </a>

With a larger site, well, the program can run for a long, long time. The following run took almost three hours to finish:

 $ date ; checkexternal http://www.intuitive.com/ ; date
 Tue Sep 16 23:16:37 GMT 2003
 733 external links encountered in 728 pages
 Failed : http://chemgod.slip.umd.edu/~kidwell/weather.html
 Failed : http://epoch.oreilly.com/shop/cart.asp
 Failed : http://ezone.org:1080/ez/
 Failed : http://techweb.cmp.com/cw/webcommerce/
 Failed : http://tenbrooks11.lanminds.com/
 Failed : http://www.builder.cnet.com/
 Failed : http://www.buzz.builder.com/
 Failed : http://www.chem.emory.edu/html/html.html
 Failed : http://www.truste.org/
 Failed : http://www.wander-lust.com/
 Failed : http://www.websitegarage.com/
 (A copy of this output has been saved in www.intuitive.com.external-errors)
 Wed Sep 17 02:11:18 GMT 2003

Looks as though it's time for some cleanup work!



