Problem

You need to extract just the URLs from a file.

Solution

Use ReadTag from Recipe 18.9 and just look for tags that might contain URLs.

Discussion

The program in Example 18-8 uses ReadTag from the previous recipe and checks each tag to see if it is a "wanted tag" listed in the array wantTags. These are the A (anchor), APPLET, IMG (image), and FRAME tags. If a tag is wanted, the whole tag, which contains the URL, is saved in a list and later printed.

Example 18-8. GetURLs.java

    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.Iterator;

    public class GetURLs {
        /** The tag reader */
        ReadTag reader;

        public GetURLs(URL theURL) throws IOException {
            reader = new ReadTag(theURL);
        }

        public GetURLs(String theURL) throws MalformedURLException, IOException {
            reader = new ReadTag(theURL);
        }

        /** The tags we want to look at, in lowercase and uppercase forms */
        public final static String[] wantTags = {
            "<a ", "<A ", "<applet ", "<APPLET ",
            "<img ", "<IMG ", "<frame ", "<FRAME ",
        };

        /** Collect every tag that starts with one of the wanted prefixes */
        public ArrayList getURLs( ) throws IOException {
            ArrayList al = new ArrayList( );
            String tag;
            while ((tag = reader.nextTag( )) != null) {
                for (int i=0; i<wantTags.length; i++) {
                    if (tag.startsWith(wantTags[i])) {
                        al.add(tag);
                        break;    // optimization: no need to test the remaining prefixes
                    }
                }
            }
            return al;
        }

        public void close( ) throws IOException {
            if (reader != null)
                reader.close( );
        }

        public static void main(String[] argv)
                throws MalformedURLException, IOException {
            // Default to the local web server if no URL is given
            String theURL = argv.length == 0 ? "http://localhost/" : argv[0];
            GetURLs gu = new GetURLs(theURL);
            ArrayList urls = gu.getURLs( );
            Iterator urlIterator = urls.iterator( );
            while (urlIterator.hasNext( )) {
                System.out.println(urlIterator.next( ));
            }
            gu.close( );
        }
    }

The GetURLs program prints the tags that contain URLs in a given web page:

    darian$ java GetURLs http://daroad
    <IMG src="/books/2/213/1/html/2/ian.gif">
    <A HREF="webserver/index.html">
    <A HREF="quizzes/">
    <A HREF="servlets/IsItWorking">
    <A HREF="demo.jsp">
    <A ID=LinkRemote HREF="http://java.sun.com">
    <A ID=LinkRemote HREF="http://www.openbsd.org">
    <A ID=LinkRemote HREF="http://www.cpg.com">
    <A ID=LinkRemote HREF="http://www.ssc.com">
    <A ID=LinkRemote HREF="http://www.learningtree.com">
    <A ID=LinkLocal HREF="javacook.html">
    <A ID=LinkRemote HREF="http://java.oreilly.com">
    <A ID=LinkLocal HREF="lookup/index.htm">
    <A ID=LinkLocal HREF="readings/index.html">
    <A ID=LinkLocal HREF="download.html">
    <IMG src="/books/2/213/1/html/2/miniduke.gif" BORDER=0>
    darian$

The LinkChecker program in Recipe 18.13 extracts the HREF or SRC attributes and validates them.
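
Since GetURLs returns whole tags, a caller that wants only the URL still has to pull the attribute value out of each one. The following is a minimal sketch of that step, not the LinkChecker code from Recipe 18.13. The class name TagUrlExtractor and the method extractUrl are hypothetical names chosen for illustration, and the regular expression assumes simple attributes of the kind shown in the output above (double-quoted, single-quoted, or unquoted values). It only looks for HREF and SRC, so a tag without either attribute yields null.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of the recipe): pulls the HREF or SRC value out of one tag. */
class TagUrlExtractor {

    // href= or src=, any case, followed by a double-quoted, single-quoted, or unquoted value
    private static final Pattern URL_ATTR = Pattern.compile(
        "\\b(?:href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    /** Returns the first HREF or SRC value in the tag, or null if there is none. */
    static String extractUrl(String tag) {
        Matcher m = URL_ATTR.matcher(tag);
        if (!m.find()) {
            return null;
        }
        // Exactly one of the three capture groups is set, depending on the quoting style
        for (int g = 1; g <= 3; g++) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(extractUrl("<A ID=LinkRemote HREF=\"http://java.sun.com\">")); // http://java.sun.com
        System.out.println(extractUrl("<IMG src=miniduke.gif BORDER=0>"));                // miniduke.gif
    }
}
```

Each string returned by getURLs( ) could be passed through extractUrl before printing. A regular expression is used here only for brevity; it does not handle every attribute form that real HTML allows.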