Introduction


Suppose you have been on the Internet for a few years and have been very faithful about saving all your correspondence, just in case you (or your lawyers, or the prosecution) need a copy. The result is that you have a 50-megabyte disk partition dedicated to saved mail. And let's further suppose that you remember that somewhere in there is an email message from someone named Angie or Anjie. Or was it Angy? But you don't remember what you called it or where you stored it. Obviously, you have to look for it.

But while some of you go and try to open up all 15,000,000 documents in a word processor, I'll just find it with one simple command. Any system that provides regular expression support allows me to search for the pattern in several ways. The simplest to understand is:

Angie|Anjie|Angy

which you can probably guess means just to search for any of the variations. A more concise form ("more thinking, less typing") is:

An[^ dn]

to search in all the files. The syntax will become clear as we go through this chapter. Briefly, the "A" and the "n" match themselves, in effect finding words that begin with "An", while the cryptic [^ dn] requires the "An" to be followed by a character other than a space (to eliminate the very common English word "an" at the start of a sentence) or "d" (to eliminate the common word "and") or "n" (to eliminate Anne, Announcing, etc.). Has your word processor gotten past its splash screen yet? Well, it doesn't matter, because I've already found the missing file. To find the answer, I just typed the command:

grep 'An[^ dn]' *

Regular expressions, or regexes for short, provide a concise and precise specification of patterns to be matched in text.
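For readers who want to see the same search from Java, here is a minimal sketch using the java.util.regex package. The class name NameSearch, the method lineMatches, and the sample lines are invented for illustration; only the pattern itself comes from the grep command above.

```java
import java.util.regex.Pattern;

class NameSearch {
    // Same pattern as the grep example: "An" followed by any
    // character other than a space, 'd', or 'n'.
    static final Pattern NAME = Pattern.compile("An[^ dn]");

    /** Returns true if the pattern occurs anywhere in the line, as grep would report it. */
    static boolean lineMatches(String line) {
        return NAME.matcher(line).find();
    }

    public static void main(String[] args) {
        // Hypothetical lines standing in for the contents of saved mail files.
        String[] lines = {
            "From: Angie <angie@example.com>",
            "An old friend and Anne stopped by",
            "Notes from Anjie about Angy"
        };
        for (String line : lines) {
            if (lineMatches(line)) {
                System.out.println(line);   // prints the first and third lines only
            }
        }
    }
}
```

Note the use of find() rather than matches(): grep reports a line if the pattern occurs anywhere in it, whereas matches() would require the pattern to account for the entire line.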

As another example of the power of regular expressions, consider the problem of bulk-updating hundreds of files. When I started with Java, the syntax declaring array references was baseType arrayVariableName[]. For example, a method with an array argument, such as every program's main method, was commonly written as:

public static void main(String args[]) {

But as time went by, it became clear to the stewards of the Java language that it would be better to write it as baseType[] arrayVariableName, e.g.:

public static void main(String[] args) {

This is better Java style because it associates the "array-ness" of the type with the type itself, rather than with the local argument name, and the compiler now accepts both modes. I wanted to change all occurrences of main written the old way to the new way. I used the pattern 'main(String [a-z]' with the grep utility described earlier to find the names of all the files containing old-style main declarations, that is, main(String followed by a space and a name character rather than an open square bracket. I then used another regex-based Unix tool, the stream editor sed, in a little shell script to change all occurrences in those files from 'main(String \([a-z][a-z]*\)\[\]' to 'main(String[] \1' (the syntax used here is discussed later in this chapter). Again, the regex-based approach was orders of magnitude faster than doing it interactively, even using a reasonably powerful editor such as vi or emacs, let alone trying to use a graphical word processor.

Unfortunately, the syntax of regexes has changed as they have been incorporated into more tools[1] and more languages, so the syntax in the previous examples is not exactly what you'd use in Java, but it does convey the conciseness and power of the regex mechanism.
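To make the difference concrete, here is a rough Java rendering of the sed substitution, offered as a sketch rather than as the script I actually ran. The class name MainDeclFixer and the method modernize are invented for this illustration. Two differences from sed stand out: in java.util.regex a bare parenthesis starts a capture group and a literal one must be escaped (the reverse of sed's basic syntax), and the replacement refers to the captured name as $1 instead of \1. Backslashes are doubled only because they appear inside a Java string literal.

```java
import java.util.regex.Pattern;

class MainDeclFixer {
    // Old style: main(String name[]  with group 1 capturing the argument name.
    private static final Pattern OLD_MAIN =
        Pattern.compile("main\\(String ([a-z][a-z]*)\\[\\]");

    /** Rewrites an old-style main declaration; other lines come back unchanged. */
    static String modernize(String line) {
        // $1 plays the role of sed's \1.
        return OLD_MAIN.matcher(line).replaceAll("main(String[] $1");
    }

    public static void main(String[] args) {
        System.out.println(modernize("public static void main(String args[]) {"));
        // prints: public static void main(String[] args) {
    }
}
```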

[1] Non-Unix fans, fear not, for you can do this on Win32 using one of several packages. One is an open source package alternately called CygWin (after Cygnus Software) or GnuWin32 (http://sources.redhat.com/cygwin/). Another is Microsoft's own Unix Services for Windows. Or you can use my Grep program in Recipe 4.6 if you don't have grep on your system. Incidentally, the name grep comes from an ancient Unix line editor command g/RE/p, the command to find the regex globally in all lines in the edit buffer and print the lines that match, which is just what the grep program does to lines in files.

As a third example, consider parsing an Apache web server log file, where some fields are delimited with quotes, others with square brackets, and others with spaces. Writing ad-hoc code to parse this is messy in any language, but a well-crafted regex can break the line into all its constituent fields in one operation (this example is developed in Recipe 4.10).
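To give a taste of the idea before that recipe, here is a small sketch. It is not the code from Recipe 4.10. It assumes the server writes the widely used Common Log Format (host, identity, user, bracketed timestamp, quoted request, status, size), and the class name LogLineSketch and the sample line are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineSketch {
    // Assumed layout: host ident user [timestamp] "request" status bytes
    private static final Pattern LOG_LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");

    /** Returns the seven captured fields, or null if the line does not fit the assumed layout. */
    static String[] parse(String line) {
        Matcher m = LOG_LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);   // group 0 is the whole match, so skip it
        }
        return fields;
    }

    public static void main(String[] args) {
        // Illustrative input only, not taken from a real log.
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        String[] fields = parse(line);
        if (fields != null) {
            for (String field : fields) {
                System.out.println(field);
            }
        }
    }
}
```

The bracketed and quoted fields each get their own character class ([^\]]+ and [^"]*), so the spaces inside the timestamp and the request do not confuse the space-delimited fields around them.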

These same time gains can be had by Java developers. Prior to 1.4, Java did not include any facilities for describing regular expressions in text. This is mildly surprising given how powerful regular expressions are, how ubiquitous they are on the Unix operating system (where Java was first brewed), and how powerful they are in modern scripting languages like sed, awk, Python, and Perl. Table 4-1 lists about half a dozen regular expression packages for Java. I even wrote my own at one point; it worked well enough but was too slow for production use. The Apache Jakarta Regular Expressions and ORO packages are widely used.

Table 4-1. Some Java regex packages

Package | Notes | URL
JDK 1.4 API | Package java.util.regex | http://java.sun.com/
Richard Emberson's | Unknown license; not being maintained | None; posted to advanced-java@berkeley.edu in 1998
Ian Darwin's regex | Simple, but slow. Incomplete; didactic | http://www.darwinsys.com/java/
Apache Jakarta RegExp (original by Jonathan Locke) | Apache (BSD-like) license | http://jakarta.apache.org/regexp/
Apache Jakarta ORO (original by Daniel Savarese) | Apache license; more comprehensive than Jakarta RegExp | http://jakarta.apache.org/oro/
GNU Java Regexp | Lesser GNU Public License | http://www.cacas.org/java/gnu/regexp/


With JDK 1.4 and later, regular expression support is built into the standard Java runtime. The advantage of using the JDK 1.4 package is its integration with the runtime, including the standard class java.lang.String and the "new I/O" package. In addition to this integration, the JDK 1.4 package is one of the fastest Java implementations. However, code using any of the other packages still works, and you will find existing applications using some of these packages for the next few years since the syntax of each package is slightly different and it's not necessary to convert. Any new development, though, should be based on the JDK 1.4 regex package.
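As a quick illustration of the java.lang.String integration, the following sketch uses three regex-aware String methods that arrived with JDK 1.4: matches(), replaceAll(), and split(). The class name StringRegexDemo, its helper methods, and the sample strings are invented for this example.

```java
import java.util.Arrays;

class StringRegexDemo {
    /** True if the whole word starts with "An" followed by something other than space, 'd', or 'n'. */
    static boolean looksLikeAngie(String word) {
        // matches() must account for the entire string, hence the trailing .*
        return word.matches("An[^ dn].*");
    }

    /** Collapses each run of spaces into a single space. */
    static String squeezeSpaces(String s) {
        return s.replaceAll(" +", " ");
    }

    /** Splits on commas, discarding any whitespace that follows each comma. */
    static String[] splitOnCommas(String s) {
        return s.split(",\\s*");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeAngie("Angie"));                          // true
        System.out.println(looksLikeAngie("Anne"));                           // false
        System.out.println(squeezeSpaces("a  b   c"));                        // a b c
        System.out.println(Arrays.toString(splitOnCommas("one, two,three"))); // [one, two, three]
    }
}
```

These convenience methods recompile the pattern on every call, so code that applies the same regex many times will usually do better to compile a Pattern once and reuse it.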

The first edition of this book focused on the Jakarta RegExp package; this edition covers the JDK 1.4 Regular Expressions API and does not cover any other package. The syntax of regexes themselves is discussed in Recipe 4.1, and the syntax of the Java API for using regexes is described in Recipe 4.2. The remaining recipes show some applications of regex technology in JDK 1.4.

See Also

Mastering Regular Expressions by Jeffrey E. F. Friedl (O'Reilly), now in its second edition, is the definitive guide to all the details of regular expressions. Most introductory books on Unix and Perl include some discussion of regexes; Unix Power Tools (O'Reilly) devotes a chapter to them.



Java Cookbook, Second Edition
ISBN: 0596007019
Year: 2003
Pages: 409
Authors: Ian F Darwin
