Recipe 18.10 Extracting URLs from a File


Problem

You need to extract just the URLs from a file.

Solution

Use ReadTag from Recipe 18.9 and just look for tags that might contain URLs.

Discussion

The program in Example 18-8 uses ReadTag from the previous recipe and checks each tag to see whether it starts with one of the "wanted tags" defined in the array wantTags. These are the A (anchor), APPLET, IMG (image), and FRAME tags, each listed in lowercase and in uppercase. Every wanted tag is added to a list, which main then prints, so the output is the whole tag rather than the bare URL.

Example 18-8. GetURLs.java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;

public class GetURLs {
    /** The tag reader (ReadTag comes from Recipe 18.9) */
    ReadTag reader;

    public GetURLs(URL theURL) throws IOException {
        reader = new ReadTag(theURL);
    }

    public GetURLs(String theURL) throws MalformedURLException, IOException {
        reader = new ReadTag(theURL);
    }

    /* The tags we want to look at */
    public final static String[] wantTags = {
        "<a ", "<A ",
        "<applet ", "<APPLET ",
        "<img ", "<IMG ",
        "<frame ", "<FRAME ",
    };

    /** Collect every tag that starts with one of the wanted prefixes. */
    public ArrayList<String> getURLs( ) throws IOException {
        ArrayList<String> al = new ArrayList<String>( );
        String tag;
        while ((tag = reader.nextTag( )) != null) {
            for (int i=0; i<wantTags.length; i++) {
                if (tag.startsWith(wantTags[i])) {
                    al.add(tag);
                    break;        // optimization: a tag matches at most one prefix
                }
            }
        }
        return al;
    }

    public void close( ) throws IOException {
        if (reader != null)
            reader.close( );
    }

    public static void main(String[] argv) throws
            MalformedURLException, IOException {
        String theURL = argv.length == 0 ?
            "http://localhost/" : argv[0];
        GetURLs gu = new GetURLs(theURL);
        ArrayList<String> urls = gu.getURLs( );
        for (String url : urls) {
            System.out.println(url);
        }
        gu.close( );
    }
}
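Because wantTags lists each prefix only in all-lowercase and all-uppercase form, a mixed-case tag such as <Img src=...> would be skipped. One possible alternative, shown here only as a sketch, is to compare the tag name case-insensitively with String.regionMatches. The class name TagMatch, the WANTED array, and the isWanted method are hypothetical names chosen for this illustration; they are not part of the book's code.

```java
public class TagMatch {
    /** Lowercase tag names of interest (hypothetical; mirrors wantTags). */
    static final String[] WANTED = { "a", "applet", "img", "frame" };

    /**
     * Returns true if the tag starts with '<' plus a wanted name,
     * ignoring case, and the name is followed by whitespace.
     */
    static boolean isWanted(String tag) {
        for (String name : WANTED) {
            String prefix = "<" + name;
            // regionMatches(true, ...) compares the two regions ignoring case
            if (tag.regionMatches(true, 0, prefix, 0, prefix.length())
                    && tag.length() > prefix.length()
                    && Character.isWhitespace(tag.charAt(prefix.length()))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isWanted("<Img src=\"x.gif\">"));  // prints true
        System.out.println(isWanted("<address>"));            // prints false
    }
}
```

The whitespace check plays the same role as the trailing space in each wantTags entry: it keeps <a from matching <address> or <applet.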

The GetURLs program prints each tag that may contain a URL in a given web page:

darian$ java GetURLs http://daroad
<IMG src="/books/2/213/1/html/2/ian.gif">
<A  HREF="webserver/index.html">
<A HREF="quizzes/">
<A HREF="servlets/IsItWorking">
<A HREF="demo.jsp">
<A ID=LinkRemote HREF="http://java.sun.com">
<A ID=LinkRemote HREF="http://www.openbsd.org">
<A ID=LinkRemote HREF="http://www.cpg.com">
<A ID=LinkRemote HREF="http://www.ssc.com">
<A ID=LinkRemote HREF="http://www.learningtree.com">
<A ID=LinkLocal HREF="javacook.html">
<A ID=LinkRemote HREF="http://java.oreilly.com">
<A ID=LinkLocal HREF="lookup/index.htm">
<A ID=LinkLocal HREF="readings/index.html">
<A ID=LinkLocal HREF="download.html">
<IMG src="/books/2/213/1/html/2/miniduke.gif" BORDER=0>
darian$

The LinkChecker program in Recipe 18.13 extracts the HREF or SRC attributes and validates them.
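To give a rough idea of what pulling the HREF or SRC value out of a tag can look like, here is a minimal sketch based on java.util.regex. It is not the LinkChecker code from Recipe 18.13; the class name AttrExtract, the URL_ATTR pattern, and the extractUrl method are assumptions made for this illustration. The pattern accepts double-quoted, single-quoted, and unquoted values, since the sample output above contains both quoted and unquoted attributes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttrExtract {
    // HREF or SRC, then '=', then a double-quoted, single-quoted,
    // or unquoted value (one capture group per form).
    static final Pattern URL_ATTR = Pattern.compile(
        "\\b(?:href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    /** Returns the first HREF or SRC value in the tag, or null if there is none. */
    static String extractUrl(String tag) {
        Matcher m = URL_ATTR.matcher(tag);
        if (!m.find()) {
            return null;
        }
        // Exactly one of the three groups participates in a match.
        for (int g = 1; g <= 3; g++) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // prints http://java.sun.com
        System.out.println(extractUrl("<A ID=LinkRemote HREF=\"http://java.sun.com\">"));
        // prints miniduke.gif
        System.out.println(extractUrl("<IMG src=miniduke.gif BORDER=0>"));
    }
}
```

A regular expression is adequate for well-behaved tags like these, but it is not a full HTML parser; unusual markup can defeat it.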



Java Cookbook, Second Edition
ISBN: 0596007019
Year: 2003
Pages: 409
Authors: Ian F Darwin
