19.3 Using Regular Expressions

If you're a servlet programmer with a background in Perl-based CGI scripting and you're still smitten with Perl's regular expression capabilities, this section is for you. Here we show how to use regular expressions from within Java. For those of you who are unfamiliar with regular expressions, they are a mechanism for allowing extremely advanced string manipulation with minimal code. Regular expressions are wonderfully explained in all their glory in the book Mastering Regular Expressions by Jeffrey E. F. Friedl (O'Reilly).

With all the classes and capabilities Sun has added to the JDK over the years, one feature still absent is a regular expression engine. Ah, well, not to worry. As with most Java features, if you can't get it from Sun, a third-party vendor is probably offering what you need at a reasonable price, or, when it's something as generally useful as regular expressions, the odds are good it's available as an open source library. And in fact there's an open source regular expression engine available as part of Apache's Jakarta Project, originally developed by Jonathan Locke and available under the Apache license. The Apache license is a much more forgiving license than the GNU General Public License (GPL) because it allows developers to use the regular expression engine in creating new products without requiring them to release their own products as open source.

19.3.1 Finding Links with Regular Expressions

To demonstrate the use of regular expressions, let's use Apache's regular expression engine to write a servlet that extracts and displays in a list all the HTML <A HREF> links found on a web page. The code is shown in Example 19-4.

Example 19-4. Searching for All Links

import java.io.*;
import java.net.*;
import java.util.*;
import javax.servlet.*;
import javax.servlet.http.*;

import com.oreilly.servlet.*;

import org.apache.regexp.*;

public class Links extends HttpServlet {

  public void doGet(HttpServletRequest req, HttpServletResponse res)
                               throws ServletException, IOException {
    res.setContentType("text/html");
    PrintWriter out = res.getWriter();

    // We accept the URL to process as extra path info
    // http://localhost:8080/servlet/Links/http://www.servlets.com/
    String url = req.getPathInfo();
    if (url == null || url.length() == 0) {
      res.sendError(res.SC_BAD_REQUEST,
                    "Please pass a URL to read from as extra path info");
      return;
    }
    url = url.substring(1);  // cut off leading '/'

    String page = null;
    try {
      // Request the page
      HttpMessage msg = new HttpMessage(new URL(url));
      BufferedReader in =
        new BufferedReader(new InputStreamReader(msg.sendGetMessage()));

      // Read the entire response into a String
      StringBuffer buf = new StringBuffer(10240);
      char[] chars = new char[10240];
      int charsRead = 0;
      while ((charsRead = in.read(chars, 0, chars.length)) != -1) {
        buf.append(chars, 0, charsRead);
      }
      page = buf.toString();
    }
    catch (IOException e) {
      res.sendError(res.SC_NOT_FOUND,
                    "Link Extractor could not read from " + url + ":<BR>" +
                    ServletUtils.getStackTraceAsString(e));
      return;
    }

    out.println("<HTML><HEAD><TITLE>Link Extractor</TITLE>");

    try {
      // We need to specify a <BASE> so relative links work correctly
      // If the page already has one, we can use that
      RE re = new RE("<base[^>]*>", RE.MATCH_CASEINDEPENDENT);
      boolean hasBase = re.match(page);

      if (hasBase) {
        // Use the existing <BASE>
        out.println(re.getParen(0));
      }
      else {
        // Calculate the base from the URL, use everything up to last '/'
        re = new RE("http://.*/", RE.MATCH_CASEINDEPENDENT);
        boolean extractedBase = re.match(url);
        if (extractedBase) {
          // Success, print the calculated base
          out.println("<BASE HREF=\"" + re.getParen(0) + "\">");
        }
        else {
          // No trailing slash, add one ourselves
          out.println("<BASE HREF=\"" + url + "/" + "\">");
        }
      }

      out.println("</HEAD><BODY>");
      out.println("The links on <A HREF=\"" + url + "\">" + url + "</A>" +
                  " are: <BR>");
      out.println("<UL>");

      String search = "<a\\s+[^<]*</a\\s*>";
      re = new RE(search, RE.MATCH_CASEINDEPENDENT);

      int index = 0;
      while (re.match(page, index)) {
        String match = re.getParen(0);
        index = re.getParenEnd(0);
        out.println("<LI>" + match + "<BR>");
      }

      out.println("</UL>");
      out.println("</BODY></HTML>");
    }
    catch (RESyntaxException e) {
      // Should never happen as the search strings are hard coded
      e.printStackTrace(out);
    }
  }
}

A screen shot is shown in Figure 19-1.

Figure 19-1. Ultralow-bandwidth browsing

Let's walk through the code. First, the servlet determines the URL whose links are to be extracted by looking at its extra path info. This means that this servlet should be invoked like this: http://localhost:8080/servlet/Links/http://www.servlets.com. Then the servlet reads the contents at that URL using the HttpMessage class and stores the page as a String. For extremely large pages this approach is not efficient, but it makes for a good book example.
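The read loop can be tried on its own, away from the servlet and the network. The following is a minimal sketch, not the servlet's code: the class name ReadAllDemo and the method readAll are hypothetical, a StringReader stands in for the stream returned by HttpMessage, and StringBuilder is used in place of StringBuffer on the assumption that a more recent JDK is available.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper class, for illustration only
class ReadAllDemo {

  // Reads everything the Reader can supply and returns it as one String
  static String readAll(Reader in) throws IOException {
    StringBuilder buf = new StringBuilder(10240);
    char[] chars = new char[10240];
    int charsRead;
    // read() returns -1 once the end of the stream is reached
    while ((charsRead = in.read(chars, 0, chars.length)) != -1) {
      buf.append(chars, 0, charsRead);
    }
    return buf.toString();
  }

  public static void main(String[] args) throws IOException {
    // A StringReader stands in for the page fetched over HTTP
    String page = readAll(new StringReader("<HTML><BODY>Hello</BODY></HTML>"));
    System.out.println(page.length() + " characters read");
  }
}
```

Because the whole page is held in memory at once, the cost grows with the page size, which is the inefficiency noted above.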

The next step is to make sure the output page has a proper <BASE> tag so that any relative links in our list will be interpreted correctly by the browser. If there's a preexisting <BASE> tag on the input page we can use that, so we search for such a tag using the regular expression <base[^>]*>. We pass that string to the org.apache.regexp.RE constructor along with a case-insensitivity flag. This regexp syntax is standard and matches Perl's. It says to match the text <base, followed by any number of characters that aren't >, followed by the text >. If we have a match, we extract it using re.getParen(0), which returns the outermost match, that is, the text matched by the whole expression (matches may be nested by parentheses).
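To experiment with the <BASE> pattern outside a servlet, here is a minimal sketch. It swaps in java.util.regex, which assumes a JDK that ships that package (1.4 or later), rather than the org.apache.regexp classes used in Example 19-4. The pattern string is the same, and the class name BaseTagDemo and method findBase are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; uses java.util.regex instead of org.apache.regexp
class BaseTagDemo {

  // "<base", then any number of characters that aren't '>', then '>'
  private static final Pattern BASE =
      Pattern.compile("<base[^>]*>", Pattern.CASE_INSENSITIVE);

  // Returns the first <BASE ...> tag in the page, or null if there is none
  static String findBase(String page) {
    Matcher m = BASE.matcher(page);
    // group(0) is the text matched by the whole expression,
    // playing the role of re.getParen(0)
    return m.find() ? m.group(0) : null;
  }

  public static void main(String[] args) {
    String page =
        "<HTML><HEAD><BASE HREF=\"http://www.servlets.com/\"></HEAD></HTML>";
    System.out.println(findBase(page));
    System.out.println(findBase("<HTML><BODY>no base here</BODY></HTML>"));
  }
}
```

The first call prints the tag exactly as it appears in the page, uppercase included, because the case-insensitivity flag affects matching only and not the returned text. The second call prints null.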

If there's no <BASE> tag in the page, we need to construct one. The <BASE> should be everything in the source URL up to and including the last slash.^[1] We can extract this information using the simple regular expression http://.*/. This says to match the text http://, followed by any number of characters, followed by a /. The .* pattern reads as many characters as possible while still satisfying the rest of the regexp condition (what regexp terminology calls being greedy), so this expression returns everything up to and including the last slash. If the URL has no slash after the http:// prefix, the expression doesn't match and we simply append a slash to the URL.

^[1] Without extra logic not shown here (involving preconnecting to the server), this approach can fail for URLs like http://www.jdom.org/news that redirect to http://www.jdom.org/news/. To be safe, make explicit any trailing slash in the URL passed to this servlet.
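The greedy behavior is easy to check in isolation. The sketch below uses the same pattern string as the servlet, but it relies on java.util.regex (an assumption: a JDK of version 1.4 or later) instead of org.apache.regexp. BaseFromUrlDemo and baseFor are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; uses java.util.regex instead of org.apache.regexp
class BaseFromUrlDemo {

  // "http://", then as many characters as possible, then a final '/'
  private static final Pattern UP_TO_LAST_SLASH =
      Pattern.compile("http://.*/", Pattern.CASE_INSENSITIVE);

  // Returns the value to place in <BASE HREF="..."> for the given URL
  static String baseFor(String url) {
    Matcher m = UP_TO_LAST_SLASH.matcher(url);
    if (m.find()) {
      // Greedy .* backs off only as far as the last '/' in the URL
      return m.group(0);
    }
    // No slash after "http://", so append one ourselves
    return url + "/";
  }

  public static void main(String[] args) {
    // prints http://www.servlets.com/docs/
    System.out.println(baseFor("http://www.servlets.com/docs/index.html"));
    // prints http://www.servlets.com/
    System.out.println(baseFor("http://www.servlets.com"));
  }
}
```

Note that a URL such as http://www.jdom.org/news yields http://www.jdom.org/ here, which is the case the footnote warns about.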

Finally, we extract the <A HREF> tags from the page using the fairly complicated regular expression <a\s+[^<]*</a\s*>. (You'll notice the code has escaped the \ characters with an additional \ character.) This says to match the text <a, followed by one or more whitespace characters (\s indicates whitespace), followed by any number of characters that aren't <, followed by the text </a, followed by any number of whitespace characters, followed by >. Put together, this extracts <A HREF> tags from the beginning <A to the trailing </A>, all nicely case insensitive and whitespace forgiving, and making sure not to erroneously match tags like <APPLET>. As each match is found, it's displayed in the list, and the search continues using an index to record the next starting point.
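The match-and-advance loop can be sketched as a standalone method. As an assumption for the sake of a self-contained example, this version uses java.util.regex (JDK 1.4 or later) in place of org.apache.regexp, with find(index), group(0), and end(0) standing in for match(page, index), getParen(0), and getParenEnd(0). LinkExtractDemo and extractLinks are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; uses java.util.regex instead of org.apache.regexp
class LinkExtractDemo {

  // Same pattern as the servlet; each \ is doubled inside the Java string
  private static final Pattern LINK =
      Pattern.compile("<a\\s+[^<]*</a\\s*>", Pattern.CASE_INSENSITIVE);

  // Collects every <A ...>...</A> element matched by the pattern
  static List<String> extractLinks(String page) {
    List<String> links = new ArrayList<>();
    Matcher m = LINK.matcher(page);
    int index = 0;
    // Search from index, then move index past the match just found
    while (m.find(index)) {
      links.add(m.group(0));
      index = m.end(0);
    }
    return links;
  }

  public static void main(String[] args) {
    String page = "<P><A HREF=\"/a.html\">First</A> "
        + "<applet code=\"X\"></applet> "
        + "<a href=\"b.html\">Second</a >";
    for (String link : extractLinks(page)) {
      System.out.println(link);
    }
  }
}
```

Running this prints the two anchor elements and skips the <applet> tag, because \s+ requires whitespace directly after <a. One consequence of the [^<]* portion is that a link whose body contains another tag, such as an <IMG>, is not matched.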

For more information on what can be done with regular expressions in Java, see the documentation that comes with the library.
