Recipe 18.13 Program: LinkChecker | Java Cookbook, Second Edition

One of the hard parts of maintaining a large web site is ensuring that all the hypertext links, images, applets, and so forth remain valid as the site grows and changes. It's easy to make a change somewhere that breaks a link somewhere else, exposing your users to those "Doh!"-producing 404 errors. What's needed is a program to automate checking the links. This turns out to be surprisingly complex due to the variety of link types. But we can certainly make a start.

Since we already created a program that reads a web page and extracts the URL-containing tags (Recipe 18.10), we can use that here. The basic approach of our new LinkChecker program is this: given a starting URL, create a GetURLs object for it. If that succeeds, read the list of URLs and go from there. This program has the additional functionality of displaying the structure of the site using simple indentation in a graphical window, as shown in Figure 18-3.

Figure 18-3. LinkChecker in action

With the GetURLs class from Recipe 18.10 in hand, the rest is largely a matter of elaboration. Much of the code deals with the GUI (see Chapter 14). The code uses recursion: the routine checkOut( ) calls itself each time a new page or directory is started.
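The recursive shape is easier to see in isolation. The following is a minimal sketch that walks an in-memory map of links instead of fetching real pages; the class SiteWalkSketch, the method walk, and the sample page names are hypothetical and are not part of LinkChecker. It assumes Java 11 or later. A set of visited pages stops the recursion when pages link back to one another, and the nesting depth becomes the number of leading tabs.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: recursive site walk over an in-memory link map. */
class SiteWalkSketch {

    /**
     * Record the page with one tab per nesting level, then recurse
     * into each page it links to. Pages already seen are skipped.
     */
    static void walk(String page, Map<String, List<String>> links,
                     Set<String> visited, int indent, List<String> out) {
        if (!visited.add(page)) {
            return;                 // already visited: avoid infinite recursion
        }
        out.add("\t".repeat(indent) + page);
        for (String child : links.getOrDefault(page, List.of())) {
            walk(child, links, visited, indent + 1, out);   // RECURSE
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new LinkedHashMap<>();
        links.put("index.html", List.of("a.html", "b.html"));
        links.put("a.html", List.of("index.html", "c.html")); // links back to the root
        List<String> out = new ArrayList<>();
        walk("index.html", links, new HashSet<>(), 0, out);
        out.forEach(System.out::println);
    }
}
```

Unlike this sketch, LinkChecker reports every link it finds on a page and recurses only into same-site HTML pages and directories.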

Example 18-10 shows the code for the LinkChecker program.

Example 18-10. LinkChecker.java

import java.awt.*;
import java.awt.event.*;
import java.io.*;
import java.net.*;
import java.util.*;
import javax.swing.JScrollPane;

// GetURLs is the class from Recipe 18.10; it must be on the classpath.

/** A simple HTML Link Checker.
 * Need a Properties file to set depth, URLs to check. etc.
 * Responses not adequate; need to check at least for 404-type errors!
 * When all that is (said and) done, display in a Tree instead of a TextArea.
 * Then use Color coding to indicate errors.
 */
public class LinkChecker extends Frame implements Runnable {
    protected Thread t = null;
    /** The "global" activation flag: set true to halt.
     * Volatile because the GUI thread sets it and the checking thread reads it.
     */
    volatile boolean done = false;
    protected Panel p;
    /** The textfield for the starting URL.
     * Should have a Properties file and a JComboBox instead.
     */
    protected TextField textFldURL;
    protected Button checkButton;
    protected Button killButton;
    protected TextArea textWindow;
    protected int indent = 0;
    /** The URLs already visited, so that no page is checked twice. */
    protected Map hash = new HashMap( );

    public static void main(String[] args) {
        LinkChecker lc = new LinkChecker( );
        lc.setSize(500, 400);
        lc.setLocation(150, 150);
        lc.setVisible(true);
        if (args.length == 0)
            return;
        lc.textFldURL.setText(args[0]);
    }

    public void startChecking( ) {
        done = false;
        hash.clear( );    // forget pages seen in any earlier run
        checkButton.setEnabled(false);
        killButton.setEnabled(true);
        textWindow.setText("");
        doCheck( );
    }

    public void stopChecking( ) {
        done = true;
        checkButton.setEnabled(true);
        killButton.setEnabled(false);
    }

    /** Construct a LinkChecker */
    public LinkChecker( ) {
        super("LinkChecker");
        addWindowListener(new WindowAdapter( ) {
            public void windowClosing(WindowEvent e) {
                setVisible(false);
                dispose( );
                System.exit(0);
            }
        });
        setLayout(new BorderLayout( ));
        p = new Panel( );
        p.setLayout(new FlowLayout( ));
        p.add(new Label("URL"));
        p.add(textFldURL = new TextField(40));
        p.add(checkButton = new Button("Check URL"));
        // Make a single action listener for both the text field (when
        // you hit return) and the explicit "Check URL" button.
        ActionListener starter = new ActionListener( ) {
            public void actionPerformed(ActionEvent e) {
                startChecking( );
            }
        };
        textFldURL.addActionListener(starter);
        checkButton.addActionListener(starter);
        p.add(killButton = new Button("Stop"));
        killButton.setEnabled(false);    // until startChecking is called.
        killButton.addActionListener(new ActionListener( ) {
            public void actionPerformed(ActionEvent e) {
                if (t == null || !t.isAlive( ))
                    return;
                stopChecking( );
            }
        });
        // Now lay out the main GUI - URL & buttons on top, text larger
        add("North", p);
        textWindow = new TextArea(80, 40);
        add("Center", new JScrollPane(textWindow));
    }

    public void doCheck( ) {
        if (t!=null && t.isAlive( ))
            return;
        t = new Thread(this);
        t.start( );
    }

    public synchronized void run( ) {
        textWindow.setText("");
        checkOut(textFldURL.getText( ));
        textWindow.append("-- All done --");
    }

    /** Start checking, given a URL by name.
     * Calls checkLink to check each link.
     */
    public void checkOut(String rootURLString) {
        URL rootURL = null;
        GetURLs urlGetter = null;
        if (done)
            return;
        if (rootURLString == null) {
            textWindow.append("checkOut(null) isn't very useful");
            return;
        }
        // Skip pages already visited, so circular links can't recurse forever.
        if (hash.get(rootURLString) != null) {
            return;
        }
        hash.put(rootURLString, rootURLString);
        // Open the root URL for reading
        try {
            rootURL = new URL(rootURLString);
            urlGetter = new GetURLs(rootURL);
        } catch (MalformedURLException e) {
            textWindow.append("Can't parse " + rootURLString + "\n");
            return;
        } catch (FileNotFoundException e) {
            textWindow.append("Can't open file " + rootURLString + "\n");
            return;
        } catch (IOException e) {
            textWindow.append("openStream " + rootURLString + " " + e + "\n");
            return;
        }
        // If we're still here, the root URL given is OK.
        // Next we make up a "directory" URL from it.
        String rootURLdirString;
        if (rootURLString.endsWith("/") ||
            rootURLString.endsWith("\\"))
                rootURLdirString = rootURLString;
        else {
            rootURLdirString = rootURLString.substring(0,
                rootURLString.lastIndexOf('/'));    // TODO might be \
        }
        try {
            ArrayList urlTags = urlGetter.getURLs( );
            Iterator urlIterator = urlTags.iterator( );
            while (urlIterator.hasNext( )) {
                if (done)
                    return;
                String tag = (String)urlIterator.next( );
                System.out.println(tag);

                String href = extractHREF(tag);
                for (int j=0; j<indent; j++)
                    textWindow.append("\t");
                textWindow.append(href + " -- ");
                // Can't really validate these!
                // (the href itself has already been printed above)
                if (href.startsWith("mailto:")) {
                    textWindow.append("not checking\n");
                    continue;
                }
                if (href.startsWith("..") || href.startsWith("#")) {
                    textWindow.append("not checking\n");
                    // nothing doing!
                    continue;
                }
                URL hrefURL = new URL(rootURL, href);
                // TRY THE URL.
                // (don't combine previous textWindow.append with this one,
                // since this one can throw an exception)
                textWindow.append(checkLink(hrefURL));
                // There should be an option to control whether to
                // "try the url" first and then see if off-site, or
                // vice versa, for the case when checking a site you're
                // working on on your notebook on a train in the Rockies
                // with no web access available.
                // Now see if the URL is off-site.
                if (!hrefURL.getHost( ).equals(rootURL.getHost( ))) {
                    textWindow.append("-- OFFSITE -- not following");
                    textWindow.append("\n");
                    continue;
                }
                textWindow.append("\n");
                // If HTML, check it recursively. No point checking
                // PHP, CGI, JSP, etc., since these usually need forms input.
                // If a directory, assume HTML or something under it will work.
                if (href.endsWith(".htm") ||
                    href.endsWith(".html") ||
                    href.endsWith("/")) {
                        ++indent;
                        if (href.indexOf(':') != -1)
                            checkOut(href);            // RECURSE
                        else {
                            String newRef =
                                rootURLdirString + '/' + href;
                            checkOut(newRef);        // RECURSE
                        }
                        --indent;
                }
            }
            urlGetter.close( );
        } catch (IOException e) {
            System.err.println("Error " + ":(" + e +")");
        }
    }

    /** Check one link, given its URL; returns a short status string. */
    public String checkLink(URL linkURL) {
        try {
            // Open it; if the open fails we'll likely throw an exception
            URLConnection luf = linkURL.openConnection( );
            if (linkURL.getProtocol( ).equals("http")) {
                HttpURLConnection huf = (HttpURLConnection)luf;
                String s = huf.getResponseCode( ) + " " + huf.getResponseMessage( );
                if (huf.getResponseCode( ) == -1)
                    return "Server error: bad HTTP response";
                return s;
            } else if (linkURL.getProtocol( ).equals("file")) {
                InputStream is = luf.getInputStream( );
                is.close( );
                // If that didn't throw an exception, the file is probably OK
                return "(File)";
            } else
                return "(non-HTTP)";
        }
        catch (SocketException e) {
            return "DEAD: " + e.toString( );
        }
        catch (IOException e) {
            return "DEAD";
        }
    }

    /** Extract the URL from <sometag attrs HREF="http://foo/bar" attrs ...>
     * We presume that the HREF is correctly quoted!!!!!
     * TODO: Handle Applets.
     */
    public String extractHREF(String tag) throws MalformedURLException {
        String caseTag = tag.toLowerCase( ), attrib;
        int p1, p2, p3, p4;
        if (caseTag.startsWith("<a "))
            attrib = "href";        // A
        else
            attrib = "src";            // image, frame
        p1 = caseTag.indexOf(attrib);
        if (p1 < 0) {
            throw new MalformedURLException("Can't find " + attrib + " in " + tag);
        }
        p2 = tag.indexOf ("=", p1);
        p3 = tag.indexOf("\"", p2);     // TODO should also handle single-quotes here!
        p4 = tag.indexOf("\"", p3+1);
        if (p3 < 0 || p4 < 0) {
            throw new MalformedURLException("Invalid " + attrib + " in " + tag);
        }
        String href = tag.substring(p3+1, p4);
        return href;
    }
}
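The TODO comment in extractHREF( ) notes that single-quoted attribute values are not handled. One possible way to accept either quote style is a regular expression with a backreference to the opening quote. The following is a sketch under that assumption, not the book's implementation; the class HrefSketch and the method extractAttr are hypothetical names. It keeps the same rule as extractHREF( ): anchors use href, and every other tag uses src.

```java
import java.net.MalformedURLException;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: pull href/src out of a tag, accepting ' or " quotes. */
class HrefSketch {

    // Group 1: attribute name. Group 2: the opening quote character.
    // Group 3: the value, up to the same quote character (backreference \2).
    private static final Pattern ATTR = Pattern.compile(
        "\\b(href|src)\\s*=\\s*(['\"])(.*?)\\2",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String extractAttr(String tag) throws MalformedURLException {
        // As in extractHREF: anchors use href, everything else uses src.
        String wanted =
            tag.toLowerCase(Locale.ROOT).startsWith("<a ") ? "href" : "src";
        Matcher m = ATTR.matcher(tag);
        while (m.find()) {
            if (m.group(1).equalsIgnoreCase(wanted)) {
                return m.group(3);
            }
        }
        throw new MalformedURLException("Can't find " + wanted + " in " + tag);
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(extractAttr("<A HREF='index.html'>"));
        System.out.println(extractAttr("<img alt=\"logo\" src=\"/images/logo.gif\">"));
    }
}
```

Like the original, this still presumes that the attribute value is quoted; unquoted values would need a further alternative in the pattern.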

18.13.1 Downloading an Entire Web Site

It would also be useful to have a program that reads the entire contents of a web site and saves it on your local hard disk. This sounds wasteful, but disk space is quite inexpensive nowadays, and it would allow you to peruse a web site when not connected to the Internet. Of course, much of the dynamic content (Servlets, CGI scripts) would no longer be dynamic in the pages that you downloaded, but at least you could navigate around the text and view the images. The LinkChecker program contains all the seeds of such a program: you need only download the contents of each nondynamic URL (see the test for HTML and directories near the end of routine checkOut( ) and the code in Recipe 18.7), create the requisite directories (Recipe 11.9), and create and write to a file on disk (see Chapter 10). This final step is left as an exercise for the reader. Sites that use absolute references to their own pages would need those references normalized and relativized (see Recipe 18.8) during the download process.
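As a starting point for that exercise, the two file-system steps (creating the directories and writing the file) might look something like the following sketch, which uses java.nio.file. The class MirrorSketch, the methods localPathFor and save, and the rule that a directory URL is stored as index.html are all assumptions made for illustration. The sketch does no network access: the page content is passed in as bytes, and it assumes an absolute URL that has a host name.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: store downloaded content under a local mirror directory. */
class MirrorSketch {

    /** Map a URL to a file below mirrorRoot: host name first, then the URL path. */
    static Path localPathFor(Path mirrorRoot, URI url) {
        String path = url.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (path.endsWith("/")) {
            path += "index.html";   // assumption: directory URLs become index.html
        }
        Path root = mirrorRoot.toAbsolutePath().normalize();
        Path target = root.resolve(url.getHost())
                          .resolve(path.substring(1))   // drop the leading '/'
                          .normalize();
        if (!target.startsWith(root)) {
            // A path full of ".." must not escape the mirror directory.
            throw new IllegalArgumentException("Refusing to write outside " + root);
        }
        return target;
    }

    /** Create the requisite directories, then create and write the file. */
    static Path save(Path mirrorRoot, URI url, byte[] content) throws IOException {
        Path target = localPathFor(mirrorRoot, url);
        Files.createDirectories(target.getParent());
        return Files.write(target, content);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("mirror");
        Path saved = save(root, URI.create("http://www.example.com/docs/"),
                          "<html>hello</html>".getBytes(StandardCharsets.UTF_8));
        System.out.println("Saved to " + saved);
    }
}
```

Downloading the content itself and rewriting absolute links inside the saved pages are not shown here.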