Purpose of Regex

     

If you're already Mr. Big Deal Regex Guy [1] and just want to know how Java does it, skip this section. Else, continue.

[1] I would like to take this opportunity to acknowledge our female readers within the programming community. Please note that throughout this book, for each reference to "x guy", where x refers to some technology, let "guy" refer to a member of either gender. I hope that this is accepted usage here, as such use colloquially seems generally accepted. For example, "Hey, are you guys going to lunch with us?" spoken cheerfully to one's male and female coworkers, or, "Uh, when did you guys get back from Reno? I thought you weren't gunna be home til Sunday," when referring directly to one's own parents.

A regular expression is a description of a textual pattern that enables string matching. You create a string of characters with special meaning and compile them into a regular expression pattern, and then use that pattern to find strings that match it.

The most familiar example of a regex-type pattern is the * wildcard, where the asterisk matches everything. In SQL, you might type %ant% and because the % in SQL matches zero or more of any character, your expression will match "Fancy pants", "supplant", and so forth.

See Table 24-1 for a reference of metacharacter shortcuts you can use as you write regular expressions.

Table 24-1. Regex Metacharacters

    IF YOU WANT                                                 USE

    Any digit 0-9                                               \d
    Any non-digit                                               \D
    Letters, numbers, and underscores                           \w
    Any character that isn't a letter, number, or underscore    \W
    A whitespace character                                      \s
    Any non-whitespace character                                \S
    Any one character                                           .
    Allow nothing before this character                         ^
    Allow nothing after this character                          $
    Zero or one of the preceding character                      ?
    Zero or more of the preceding character                     *
    One or more of the preceding character                      +


Metacharacters either stand in for a type of character or specify how many times the preceding character may appear. You use metacharacters in combination with regular character sequences. Although that sounds straightforward enough, it can be tricky to combine them with ordinary characters in meaningful patterns of any complexity.
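To make the shortcuts a little more concrete, here is a minimal sketch that tries a few of them from Java. The class name MetacharacterSketch, the helper matchesAll, and the sample strings are all made up for illustration. Note that each backslash is doubled, because the regex lives inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and sample strings are illustrative only.
class MetacharacterSketch {

    // Returns true only when the whole input matches the regex.
    static boolean matchesAll(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \d is one digit; + means "one or more of the preceding element"
        System.out.println(matchesAll("\\d+", "2004"));            // true
        // \w covers letters, digits, and the underscore
        System.out.println(matchesAll("\\w+", "java_14"));         // true
        // \s is a single whitespace character
        System.out.println(matchesAll("\\w+\\s\\w+", "hi there")); // true
        // . stands in for any one character
        System.out.println(matchesAll("c.t", "cat"));              // true
        // \D is any non-digit, so a lone digit fails
        System.out.println(matchesAll("\\D", "7"));                // false
    }
}
```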

Maybe you feel comfortable with regex now after that short introduction. That's really about all there is to metacharacters. A moment to learn, way too long to master.

As the great blues artist Willie Dixon sang, "I ain't superstitious, but a black cat crossed my trail." So just in case you do have some reservations, let's take a look at some examples in Table 24-2.

Table 24-2. Matching Regular Expressions with Character Sequences

    THIS REGEX        MATCHES THESE STRINGS

    A*B               B, AB, AAB, AAAB, etc.
    A+B               AB, AAB, AAAAB, etc.
    A?B               B or AB only
    [XYZ]C            XC, YC, ZC only
    [A-C]B            AB, BB, or CB only
    [2-4]D            2D, 3D, 4D only
    Deal\s\d          Deal 8, and the "Deal 9" at the start of Deal 999, but not Deal3 or DealA
    (X|Z)Y            XY or ZY
    (cat\s){2}        "cat cat " (each cat must be followed by a whitespace character)
    (cat\s){1,3}      "cat ", "cat cat ", or "cat cat cat "
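If you would rather check rows like these than take them on faith, a quick sketch along the following lines will do it. The class TableCheckSketch and its helper whole are hypothetical names. Two details of Java's syntax are worth knowing here: alternation inside a group is written with a pipe, as in (X|Z)Y, and a range quantifier is written with a comma, as in {1,3}.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it only spot-checks a few sample patterns.
class TableCheckSketch {

    // True only if the entire input matches the regex.
    static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(whole("A*B", "AAB"));                // true
        System.out.println(whole("A+B", "B"));                  // false: + needs at least one A
        System.out.println(whole("A?B", "AAB"));                // false: ? allows at most one A
        System.out.println(whole("[A-C]B", "CB"));              // true
        System.out.println(whole("(X|Z)Y", "ZY"));              // true
        System.out.println(whole("Deal\\s\\d", "Deal 8"));      // true
        System.out.println(whole("Deal\\s\\d", "Deal3"));       // false: no whitespace
        System.out.println(whole("(cat\\s){2}", "cat cat "));   // true: note the trailing space
        System.out.println(whole("(cat\\s){1,3}", "cat cat"));  // false: last cat lacks whitespace
    }
}
```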


I hate saying this, but regular expressions are the subject of a number of complete books and it's probably best if I refer you to one of them at this point instead of going further with them outside the scope of their use in Java. Anyway, the preceding tables and the following sample code really should cover most of the situations you will need at first. Before you do anything crazy, check the API documentation for the java.util.regex.Pattern class. It lists a huge number of the regular expression constructs that Java supports, and if you're stuck, that might help you out. Which would be great. I have a really cool car.

Regex in Java

Ah, regex in Java. It's no big thing. Although Java can't do much to make our time-honored regular expressions any less obscure, it can make them easier to work with.

The relevant package here is java.util.regex. It contains two main classes: Pattern and Matcher. Here's how it works:

  1. Create a String containing your regular expression.

  2. Compile that String into an instance of the Pattern class. You do this using the static Pattern.compile(String regex) method.

  3. Create an instance of the Matcher class to match arbitrary character sequences against your pattern.

Those steps look like the following in code:

 

    Pattern pattern = Pattern.compile("(hi\\s?){1,3}");
    Matcher matcher = pattern.matcher("hi hi hi");
    boolean isMatch = matcher.matches(); //true

You can also do all of this in one step (very cool).

 

    boolean isMatch = Pattern.matches("(hi\\s?){1,3}", "hi hi hi"); //true
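One thing to keep in mind about the one-step form: Pattern.matches compiles the regex every time you call it. If you plan to test many strings against the same pattern, it is usually better to compile once and reuse the Pattern. Here is a minimal sketch of that idea; the class ReusePatternSketch, the constant GREETING, and the "one to three hi's" pattern are assumptions made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing a Pattern compiled once and reused.
class ReusePatternSketch {

    // Compiled a single time when the class loads.
    // Assumed pattern: "hi" plus optional whitespace, repeated one to three times.
    private static final Pattern GREETING = Pattern.compile("(hi\\s?){1,3}");

    // Each call creates a cheap Matcher from the shared Pattern.
    static boolean isGreeting(String input) {
        return GREETING.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"hi", "hi hi hi", "hi hi hi hi", "hello"};
        for (String input : inputs) {
            System.out.println(input + " -> " + isGreeting(input));
        }
    }
}
```

A Pattern is immutable and safe to share between threads; a Matcher is not, which is why the sketch creates a new one per call.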

It is easy to use the metacharacters in your regular expressions to determine whether a String matches the regex or not. For example, say you need to do some validation on data entered by users in your application. You need to make sure that a U.S. zip code is entered, which means five digits. You can do it as follows:

 

    /**
     * Returns whether or not the passed number is
     * exactly five digits, which the requirements state
     * the customer number is.
     * Matches: 01234 and 85558<br>
     * Does not match: three or 123ert or 999999 or 876
     * @param numToCheck the number you want to check.
     * @return true if the passed number is 5 digits,
     * false otherwise
     */
    private boolean validateCustomerNumber(String numToCheck) {
        //the backslash is doubled because this is a Java string literal
        return Pattern.matches("^\\d{5}$", numToCheck);
    }

Pattern and Matcher

As mentioned, there are two main classes in the java.util.regex package.

A Pattern object is a compiled instance of a String representing a regex. The Pattern object can be used to create a Matcher object to match character sequences.

A Matcher object interprets patterns to match them with character sequences. You create a Matcher by invoking the Pattern class's matcher() method; Matcher has no public constructor. There are three ways to match a pattern using Matcher.

  1. The matches() method tries to match the complete input sequence against the pattern.

  2. The lookingAt() method tries to match the input sequence against the pattern, starting at the beginning. Unlike matches(), it does not require the whole sequence to match.

  3. The find() method scans the input sequence looking for the next subsequence that matches the pattern.
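The difference between the three methods is easiest to see side by side. The following is a small sketch, not anything from the API docs; the class MatchModesSketch and its helpers describe and count are hypothetical names, and the woodchuck strings are just sample input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and inputs are illustrative only.
class MatchModesSketch {

    // Runs all three matching methods and reports each result.
    static String describe(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);
        // A fresh Matcher per call keeps the three results independent.
        boolean whole = pattern.matcher(input).matches();
        boolean prefix = pattern.matcher(input).lookingAt();
        boolean anywhere = pattern.matcher(input).find();
        return "matches=" + whole + " lookingAt=" + prefix + " find=" + anywhere;
    }

    // Counts non-overlapping occurrences by calling find() repeatedly.
    static int count(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(describe("wood", "wood"));        // matches=true lookingAt=true find=true
        System.out.println(describe("wood", "woodchuck"));   // matches=false lookingAt=true find=true
        System.out.println(describe("wood", "a woodchuck")); // matches=false lookingAt=false find=true
        System.out.println(count("wood", "wood, woodchuck, wood")); // 3
    }
}
```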

Regular expressions are commonly used to search through vast reams of digital text. They can also be used to check if a string passed in from a user matches a certain pattern. A good example of this is when you ask users to choose a password, and require that it be six characters and contain at least one number. User validation is another good example: we use it commonly to make sure that an e-mail address entered by a user is well formed.
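Here is one way the password rule might be sketched. It assumes "six characters" means at least six, which is my reading and not a requirement stated anywhere; the class PasswordCheckSketch and the method isAcceptable are hypothetical names. The length test is plain Java, and the regex only has to answer one question: is there a digit anywhere in the string?

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the rule "at least six characters" is an assumption.
class PasswordCheckSketch {

    // \d matches a single digit; find() will look for it anywhere in the input.
    private static final Pattern DIGIT = Pattern.compile("\\d");

    // Accepts candidates of six or more characters that contain at least one digit.
    static boolean isAcceptable(String candidate) {
        if (candidate == null || candidate.length() < 6) {
            return false;
        }
        return DIGIT.matcher(candidate).find();
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("abc123"));  // true
        System.out.println(isAcceptable("abcdef"));  // false: no digit
        System.out.println(isAcceptable("ab1"));     // false: too short
    }
}
```

Using find() rather than matches() matters here, because the digit can sit anywhere in the password.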

However, there is another common use for regular expressions, and that is replacing text that matches a certain pattern.

Let's look at a complete example now that demonstrates use of both features: matching strings (and returning true if they match and false if they don't) and replacing character sequences found with different character sequences of our choice.

RegexDemo.java
 

    package net.javagarage.demo.regex;

    import java.util.regex.*;

    /**
     * Demonstrates how to use the Java regular expressions
     * classes. You should be able to use
     * this code right out of the box, as it also serves
     * as a useful library that checks all of these things:
     * <ul>
     * <li>URL
     * <li>email address
     * <li>phone number
     * <li>dates
     * <li>numbers
     * <li>letters
     * <li>tags, such as &lt;body&gt;
     * <li>general characters, such as new lines,
     * whitespace, etc.
     * </ul>
     * <p>
     * All of these regular expressions must be escaped
     * because we are using them in Java. That means that
     * a regex that looks like this: <br>
     * <code>\s+</code>
     * <br>
     * we must write like this:<br>
     * <code>\\s+</code>
     * <br>
     * ...with all of those \ characters escaped.
     * <p>
     * By default, pattern matching is greedy, which
     * means that the matcher returns the
     * longest match possible.
     * <p>
     * These regexes should be just fine for general
     * business app use, but don't put this code in your
     * nuclear reactor or space shuttle (duh).
     *
     * @author eben hewitt
     * @see java.util.regex.Pattern
     * @see java.util.regex.Matcher
     **/
    public class RegexDemo {

        //Constants for regular expressions for those actions:

        //Constant matches any number 0 through 9.
        public static final String NUMERIC_EXP = "[0-9]";

        //Constant matches any letter, regardless of case.
        public static final String ALL_LETTERS = "[a-zA-Z]";

        //POSIX for all letters and numbers. Works only for US-ASCII.
        public static final String ALPHANUMERIC = "\\p{Alnum}";

        //Useful for stripping tags out of HTML
        public static final String TAG_EXP = "(<[^>]+>)";

        //matches an email address
        public static final String EMAIL_ADDRESS_EXP =
            "(\\w[-._\\w]*\\w@\\w[-._\\w]*\\w\\.\\w{2,3})";

        //matches a URL
        public static final String URL_EXP =
            "^http\\://[a-zA-Z0-9\\-\\.]+\\.[a-zA-Z]{2,3}(/\\S*)?$";

        //matches 12/14/2003 and 01/5/03 (the month must be two digits)
        public static final String DATE_EXP =
            "^((0[1-9])|(1[0-2]))\\/([1-9]|(0[1-9])|"
            + "([1-2][0-9])|3[0-1])\\/[0-9]{1,4}$";

        //phone number with area code
        public static final String PHONE_EXP =
            "((\\(\\d{3}\\) ?)|(\\d{3}-))?\\d{3}-\\d{4}";

        /**
         * Matches a line termination character sequence.
         * In this case, it is a character
         * or character pair from the set:
         * \n, \r,\r\n, \u0085,
         * \u2028, and \u2029.
         */
        public static final String LINE_TERMINATION =
            "(?m)$^|[\\r\\n]+\\z";

        /**
         * Matches more than one whitespace character
         * in a row. For example: "hi    there".
         */
        public static final String DUPLICATE_WHITESPACE = "\\s+";

        /**
         * typical characters
         */
        public static final String TAB = "[\\t]";
        public static final String NEW_LINE = "\\n";
        public static final String CARRIAGE_RETURN = "\\r";
        //four backslashes in source = a regex matching one literal backslash
        public static final String BACKSLASH = "\\\\";

        /**
         * Tells you if a certain String matches a certain
         * regex pattern. Use this method if you want to
         * define your own regex, and not use one provided
         * via the constants.
         * @param input String used as character sequence
         * @param p String regular expression you want to match
         * @return boolean indicates whether the pattern is
         * found in the String.
         * If the pattern is found in the string,
         * returns true.
         */
        public static boolean regexMatcher(String input, String p) {
            CharSequence inputStr = input;
            String patternStr = p;
            // Compile regular expression
            Pattern pattern = Pattern.compile(patternStr);
            // Look for the pattern anywhere in the input
            Matcher matcher = pattern.matcher(inputStr);
            return matcher.find();
        }

        /**
         * Finds all instances of <code>pattern</code> in
         * <i>string</i>, and replaces each with the
         * <code>replacer</code>. <br>
         * Examples:<ul>
         * <li>Input:regexReplacer("D99D",
         * RegexDemo.ALL_LETTERS, "x")
         * Result: x99x
         * <li>etc
         * </ul>
         *
         * @param input String you want to clean up.
         * @param p String defining a regex pattern that you
         * want to match the passed String against.
         * @param replacer String containing the characters
         * you want in the final product
         * in place of the pattern string characters.
         * @return String Containing the same String but with
         * replaced instances.
         */
        public static String regexReplacer(String input,
                String p, String replacer) {
            CharSequence inputStr = input;
            String patternStr = p;
            String replacementStr = replacer;
            // Compile regular expression
            Pattern pattern = Pattern.compile(patternStr);
            // Replace all occurrences of pattern in input
            Matcher matcher = pattern.matcher(inputStr);
            return matcher.replaceAll(replacementStr);
        }

        //shows the results of using the different
        //regexes for matching
        private static void demoMatchers() {
            print("MATCHING:");
            // This input is not alphanumeric
            print("Is %*& alphanumeric? " +
                regexMatcher("%*&", RegexDemo.ALPHANUMERIC));
            //good email: returns true
            print("Is VALID EMAIL: " +
                regexMatcher("dude@fake.com",
                    RegexDemo.EMAIL_ADDRESS_EXP));
            //bad email: returns false
            print("Is VALID EMAIL: " +
                regexMatcher("not @ good",
                    RegexDemo.EMAIL_ADDRESS_EXP));
            //good URL: returns true
            print("Is URL: " +
                regexMatcher("http://www.javagarage.net",
                    RegexDemo.URL_EXP));
            //bad URL: returns false
            print("Is URL: " +
                regexMatcher("java.com", RegexDemo.URL_EXP));
            //check date 30/12/2003: false
            //no, it isn't localized. you can't have everything
            print("Is VALID DATE: " +
                regexMatcher("30/12/2003", RegexDemo.DATE_EXP));
            //check date 5/12/2003: false, because
            //DATE_EXP wants a two-digit month
            print("Is VALID DATE: " +
                regexMatcher("5/12/2003", RegexDemo.DATE_EXP));
            //check a phone number
            print("Is VALID PHONE: " +
                regexMatcher("(212)555-1000", RegexDemo.PHONE_EXP));
        }

        //shows replacing characters matched
        private static void demoReplacers() {
            print("REPLACING:");
            //remove all HTML tags from this code
            //and replace with nothing
            print("Remove Tags: " +
                regexReplacer("<html><body><p>the text</p></body></html>",
                    RegexDemo.TAG_EXP, ""));
            //remove extra whitespace
            print("Extra whitespace replace: " +
                regexReplacer("far     away",
                    RegexDemo.DUPLICATE_WHITESPACE, " "));
            //remove and replace new line characters
            print("Remove New Lines: " +
                regexReplacer("1\n2\n3", RegexDemo.NEW_LINE,
                    " NEW LINE WAS HERE "));
            //replace tab characters (\t) with pipes (|)
            print("Tab replace: " +
                regexReplacer("Hello\tSweetheart",
                    RegexDemo.TAB, "|"));
        }

        //just to save typing
        private static void print(String s) {
            System.out.println(s);
        }

        //run the show
        public static void main(String[] a) {
            demoMatchers();
            demoReplacers();
        }
    }

The output of executing RegexDemo.java is

    MATCHING:
    Is %*& alphanumeric? false
    Is VALID EMAIL: true
    Is VALID EMAIL: false
    Is URL: true
    Is URL: false
    Is VALID DATE: false
    Is VALID DATE: false
    Is VALID PHONE: true
    REPLACING:
    Remove Tags: the text
    Extra whitespace replace: far away
    Remove New Lines: 1 NEW LINE WAS HERE 2 NEW LINE WAS HERE 3
    Tab replace: Hello|Sweetheart

As with most of the code in this here Garage book, the idea is that the comments are inline, and I want to encourage you to read the code. So I put all the stuff I want to say about RegexDemo.java in the class itself. You can use this code in your environment. It will come in handy maybe.

You can also match and/or replace strings in files (obviously, I think), not just in strings you pass into the static methods of the preceding class. Want to try it? Okay!

Let's say that you've got a file at E:\old.data, and you want to find all occurrences of the phrase "wood" and replace each with the character sequence "banana". You can do it with the class ReplacePatternInFile.

The following is the content of old.data:

 

 How much wood could a woodchuck chuck if a woodchuck could chuck wood. 

ReplacePatternInFile.java
 

    package net.javagarage.demo.regex;

    import java.io.*;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import javax.swing.JOptionPane;

    /**
     * <p>
     * Demos how to open an arbitrary text file and
     * replace all occurrences of the given regex pattern
     * with some other character sequence. Neat!
     * <p>
     * And Dude said let John Lee Hooker accompany the
     * writing of this class. And thus it was done.
     * And Dude saw that it was good.
     *
     * @author eben hewitt
     */
    public class ReplacePatternInFile {

        BufferedWriter writer;
        BufferedReader reader;

        public static void main(String[] args) {
            ReplacePatternInFile work = new ReplacePatternInFile();
            //backslashes in the path are doubled in a Java string
            work.replaceFile("E:\\old.data", "E:\\new.data");
            System.out.println("Replaced matches");
        }

        //this method reads in the file specified by the first
        //param, and writes it out to the file name specified
        //by the second param
        private void replaceFile(String fileIn, String fileOut) {
            File inFile = new File(fileIn);
            File outFile = new File(fileOut);
            try {
                //get an inputstream to read in the desired file
                FileInputStream inStream = new FileInputStream(inFile);
                //get an output stream so we can write the new file
                FileOutputStream outStream = new FileOutputStream(outFile);
                //the buffered reader performs efficient reading-in
                //of text from a source containing characters
                reader = new BufferedReader(
                    new InputStreamReader(inStream));
                //this will write the new data to our second file
                writer = new BufferedWriter(
                    new OutputStreamWriter(outStream));
                //this is the string we want to match for replacing
                Pattern p = Pattern.compile("wood");
                Matcher m = p.matcher("");
                String s = null;
                String result;
                //loop over each line of the original file,
                //reading it with the readLine method
                while ((s = reader.readLine()) != null) {
                    //reset with a new input sequence
                    m.reset(s);
                    //replace the matches with this string
                    result = m.replaceAll("banana");
                    //write the data to the file
                    writer.write(result);
                    //put a new line character
                    writer.newLine();
                }
                //clean up after yourself
                reader.close();
                writer.close();
            } catch (IOException ioe) {
                JOptionPane.showMessageDialog(null, ioe.getMessage());
                System.exit(1);
            }
        }
    }

After executing the code, we have a new file called new.data. The contents of that file should look like this:

 

 How much banana could a bananachuck chuck if a bananachuck could chuck banana. 
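If you are on Java 7 or later, the same idea can be sketched with try-with-resources, which closes the reader and writer even when an exception is thrown partway through. This is only a sketch under that assumption, not the listing above rewritten: the class ReplaceInFileSketch, the method replaceInFile, the choice of UTF-8, and the temporary files used in main are all made up so the example runs anywhere.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; assumes Java 7+ and UTF-8 text files.
class ReplaceInFileSketch {

    // Copies 'in' to 'out' line by line, replacing every match of regex.
    static void replaceInFile(Path in, Path out, String regex, String replacement)
            throws IOException {
        Pattern pattern = Pattern.compile(regex);
        // Both resources are closed automatically when the block exits.
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            Matcher matcher = pattern.matcher("");
            String line;
            while ((line = reader.readLine()) != null) {
                // reset() points the same Matcher at the new line
                writer.write(matcher.reset(line).replaceAll(replacement));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Temporary files stand in for old.data and new.data.
        Path in = Files.createTempFile("old", ".data");
        Path out = Files.createTempFile("new", ".data");
        Files.write(in, "How much wood could a woodchuck chuck"
                .getBytes(StandardCharsets.UTF_8));
        replaceInFile(in, out, "wood", "banana");
        System.out.println(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
    }
}
```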

Well, that's it for this topic of regular expressions. That's enough about it for all I care. I just hope that it's all you care for too. At least for right now. Now I feel uncomfortable. Why since I was in sixth grade I don't know what to do with my hands. Maybe I should take up smoking. Hm. It is possible, after all.

And on that note, let's split.



Java Garage
ISBN: 0321246233
EAN: 2147483647
Year: 2006
Pages: 228
Authors: Eben Hewitt
