4.4. Regular Expressions and RegExp

Regular expressions are arrangements of characters that form a pattern that can then be used against strings to find matches, make replacements, or locate specific substrings. Most programming languages support some form of regular expressions, and JavaScript is no exception.

Regular expressions can be created explicitly using the RegExp object, although you can also create one using a literal, as was demonstrated with the string literal in the last section. The following uses the explicit option:

var searchPattern = new RegExp('s+');

While the next line of code demonstrates the literal RegExp option:

var searchPattern = /s+/;

In both cases, the plus sign (+) in the search pattern matches one or more consecutive s's in a string. The forward slashes around the literal (/s+/) mark the object being created as a regular expression rather than some other type of object.
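
As a minimal sketch of how the two forms relate, the following builds the same pattern both ways and compares them. The variable names and the sample string are illustrative assumptions, and console.log stands in for whatever output mechanism the page uses.

```javascript
// Two ways to build the same pattern: one or more consecutive s characters.
var fromConstructor = new RegExp('s+');
var fromLiteral = /s+/;

var sample = "Mississippi"; // hypothetical test string

// Both objects carry the same pattern text and behave the same way.
var samePattern = fromConstructor.source === fromLiteral.source;
var constructorMatches = fromConstructor.test(sample);
var literalMatches = fromLiteral.test(sample);
console.log(samePattern, constructorMatches, literalMatches);

// In the constructor form the pattern is an ordinary string, so a
// backslash has to be doubled to survive string parsing.
var whitespaceFromString = new RegExp('\\s');
var sameWhitespace = whitespaceFromString.source === /\s/.source;
console.log(sameWhitespace);
```

The literal is usually easier to read; the constructor is mainly useful when the pattern has to be assembled from a string at runtime.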

4.4.1. The RegExp Methods: test and exec

The RegExp object has only two unique methods of interest: test and exec. The test method determines whether a string passed in as a parameter matches with the regular expression. In the following example, the pattern /JavaScript rules/ is tested against the string to see whether a match is found:

var re = /JavaScript rules/;
var str = "JavaScript rules";
if (re.test(str)) document.writeln("I guess it does rule");

Matches are case-sensitive: if the pattern is instead /Javascript rules/, the result is false. To instruct the pattern-matching functions to ignore case, follow the second forward slash of the regular expression with the letter i:

var re = /Javascript rules/i;

The other flags are g for a global match and m to match over many lines. If using RegExp to generate the regular expression, pass these to the constructor as a second parameter:

var searchPattern = new RegExp('s+', 'g');
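
The effect of the i flag, and of passing flags to the constructor, can be sketched as follows. The variable names are assumptions made for illustration; the pattern and string come from the text above.

```javascript
var str = "JavaScript rules";

// Lowercase s in the pattern, uppercase S in the string.
var caseSensitive = /Javascript rules/;
var caseInsensitive = /Javascript rules/i;

// The same flag supplied as the constructor's second parameter.
var viaConstructor = new RegExp('Javascript rules', 'i');

var strictResult = caseSensitive.test(str);      // false: case differs
var relaxedResult = caseInsensitive.test(str);   // true: case is ignored
var constructorResult = viaConstructor.test(str); // true: same flag, same result

console.log(strictResult, relaxedResult, constructorResult);

// Each flag is also exposed as a boolean property on the object.
console.log(viaConstructor.ignoreCase, viaConstructor.global);
```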

In the following snippet of code, the RegExp method, exec, searches for a specific pattern, /JS*/, across the entire string (g), ignoring case (i):

var re = /JS*/ig;
var str = "cfdsJS *(&YJSjs 888JS";
var resultArray = re.exec(str);
while (resultArray) {
   document.writeln(resultArray[0]);
   resultArray = re.exec(str);
}

The pattern described in the regular expression is the letter J, followed by any number of S's. Since the i flag is used, case is ignored, so the js substring is found. As the g flag is given, the last index is set to the location where the last pattern was found on each successive call, so each call to exec finds the next pattern. In all, the four items found are printed out, and when no others are found, a null value is assigned to the array. This ends the loop.
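
To make the role of the last index more concrete, here is a sketch of the same loop that also records where each match starts. Collecting the results in an array named found and printing with console.log are choices made for this illustration.

```javascript
var re = /JS*/ig;
var str = "cfdsJS *(&YJSjs 888JS";
var found = [];

var resultArray = re.exec(str);
while (resultArray) {
  // resultArray[0] is the matched text; resultArray.index is where it starts.
  // After each call, re.lastIndex points just past the match, so the
  // next call to exec resumes the search from there.
  found.push(resultArray[0] + "@" + resultArray.index);
  resultArray = re.exec(str);
}

console.log(found.join(", ")); // JS@4, JS@11, js@13, JS@19

// Once exec returns null, lastIndex is reset to 0.
var indexAfterLoop = re.lastIndex;
console.log(indexAfterLoop);
```

Without the g flag, lastIndex is not advanced, so the same loop would keep finding the first match and never end.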

These code samples have demonstrated two of the special regular-expression characters: the plus sign and the asterisk. There are several more.

Typically, books and articles throw all such characters into a table, and then provide a couple of examples where several are used together in a long and complicated pattern, and that's the extent of the coverage. Because of this, many people have a lot of trouble putting together regular expressions and, as a consequence, their applications don't work as they originally anticipated. I think that regular expressions are important enough to at least provide several examples, from simple to complex. If you have worked with regular expressions before, you might want to skip this section, unless you need the review.

Though the RegExp methods are used in applications, regular expressions and the RegExp object are used primarily with the String object's regex methods: replace, match, and search. The rest of the examples in this section demonstrate regular expressions using these methods.
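
As a quick orientation before the examples, this sketch shows what each of the three String methods returns for a simple pattern. The string and the replacement word are assumptions chosen for illustration.

```javascript
var str = "JavaScript rules";
var re = /rules/;

// search returns the index of the first match, or -1 if there is none.
var position = str.search(re);

// match returns an array whose first entry is the matched text,
// or null if there is no match.
var matched = str.match(re);

// replace returns a new string; the original is left unchanged.
var replaced = str.replace(re, "rocks");

console.log(position);   // 11
console.log(matched[0]); // rules
console.log(replaced);   // JavaScript rocks
console.log(str);        // JavaScript rules
```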

4.4.2. Working with Regular Expressions

The first character is the backslash (\), usually called the escape character because it escapes whatever character follows. In JavaScript regular expressions, this results in two behaviors. If the character is usually treated literally, such as the letter s, it's treated as a special character when it follows the escape character: in this case, whitespace (space, tab, form feed, line feed). If the backslash is used with a special character, such as the plus sign earlier, the character is treated as a literal.

Example 4-5 searches for instances of a space that's followed by an asterisk, and replaces them with a dash. Normally, the asterisk is used to match zero or more of the preceding characters in a regular expression, but in this case, we want to treat it as a literal.

Example 4-5. Escape character in regular expressions

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>The Backslash in RegExp</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<script type="text/javascript">
//<![CDATA[
var regExp = /\s\*/g;
var str = "This *is *a *test *string";
var resultString = str.replace(regExp,'-');
document.writeln(resultString);
//]]>
</script>
</body>
</html>

The result of applying the regular expression against the string is the following line:

This-is-a-test-string

This is a very handy expression to keep in mind. If you want to replace all occurrences of spaces in a string with dashes, regardless of what follows the spaces, use the pattern /\s/g in the replace method, passing in the hyphen as the replacement character.
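
A minimal sketch of that replacement, with and without the g flag, follows. The sample string is an assumption for illustration.

```javascript
var str = "This is a test string";

// With the g flag, every whitespace character is replaced.
var allReplaced = str.replace(/\s/g, '-');

// Without it, replace stops after the first match.
var firstReplaced = str.replace(/\s/, '-');

console.log(allReplaced);   // This-is-a-test-string
console.log(firstReplaced); // This-is a test string
```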

Four of the regular-expression characters are used to match specific occurrences of characters: the asterisk (*) matches the character preceding it zero or more times, the plus sign (+) matches the character preceding it one or more times, and the question mark (?) matches the preceding character zero or one time. The dot (.) matches exactly one character of any kind, other than a line break.
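
The four characters can be compared side by side with the test method. The patterns and words below are assumptions picked to show the boundary between a match and a miss.

```javascript
// * : zero or more of the preceding character, so no o at all is fine.
var starZero = /go*d/.test("gd");        // true
var starMany = /go*d/.test("good");      // true

// + : one or more, so at least one o is required.
var plusZero = /go+d/.test("gd");        // false
var plusMany = /go+d/.test("good");      // true

// ? : zero or one, which makes the u optional.
var optionalAbsent = /colou?r/.test("color");   // true
var optionalPresent = /colou?r/.test("colour"); // true

// . : exactly one character of any kind between c and t.
var dotOne = /c.t/.test("cat");          // true
var dotNone = /c.t/.test("ct");          // false

console.log(starZero, starMany, plusZero, plusMany);
console.log(optionalAbsent, optionalPresent, dotOne, dotNone);
```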

Two patterns of interest are the greedy match (.*) and the lazy star (.*?). In the first, since a period can represent any character, the asterisk matches until the last occurrence of a pattern, rather than the first. If you're looking for anything within quotes, you might think of using /".*"/. If you use this with a string, such as:

test="one" or this is also a "test"

The match begins with the first double-quote and continues until the last one, not the second:

"one" or this is also a "test"

The lazy star forces the match to end on the second occurrence of the double quote, rather than the last:

"one" 

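The difference between the two patterns can be sketched by running both against the string from the text. The variable names are assumptions for illustration.

```javascript
var str = 'test="one" or this is also a "test"';

// Greedy: .* grabs as much as it can, so the match runs to the last quote.
var greedy = str.match(/".*"/)[0];

// Lazy: .*? grabs as little as it can, so the match stops at the next quote.
var lazy = str.match(/".*?"/)[0];

console.log(greedy); // "one" or this is also a "test"
console.log(lazy);   // "one"
```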

In Example 4-6, the String match method looks for a date in the format of month name followed by a space, day of month, and then year. The date begins after a colon.

Example 4-6. Patterns of repeating characters

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>Find Date</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<script type="text/javascript">
//<![CDATA[
var regExp = /:\D*\s\d+\s\d+/;
var str = "This is a date: March 12 2005";
var resultString = str.match(regExp);
document.writeln("Date" + resultString);
//]]>
</script>
</body>
</html>

Looking more closely at the regular expression, the first character in the pattern is the colon, followed by the backslash with a capital letter D: \D. This sequence is one way of looking for any nondigit character; the asterisk following means that any number of nondigit characters will match. The next part in the regular expression is a whitespace character, \s, followed by another new pattern: \d. Unlike the earlier sequence, \D, the lowercase letter means to match digits only. The plus sign following it means one or more digits. Another whitespace character, \s, follows in the pattern and then another sequence of digits, \d+.

When matched against the string using the String match method, the date preceded by the colon is found, returned, and printed out:

Date: March 12 2005

In the example, \D matches any nondigit character. Another way to create this particular match is to use the square brackets with a number range, preceded by the caret character (^). If you want to match any character but digits, use the following:

[^0-9]

The same holds true for \d, except now you want numbers, so leave off the caret:

[0-9]

If you wish to match on more than one character type, you can list each range of characters within the brackets. The following matches on any upper- or lowercase letters:

[A-Za-z]

Using these, the regular expression in Example 4-6 could also be given as:

var regExp = /:[^0-9]*\s[0-9]+\s[0-9]+/;
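
As a rough check that the two spellings are interchangeable here, this sketch runs both patterns against the string from Example 4-6 and compares the results. The variable names are assumptions for illustration.

```javascript
var str = "This is a date: March 12 2005";

// Shorthand sequences, as in Example 4-6.
var shorthand = /:\D*\s\d+\s\d+/;

// The same pattern spelled out with bracketed ranges.
var bracketed = /:[^0-9]*\s[0-9]+\s[0-9]+/;

var shorthandMatch = str.match(shorthand)[0];
var bracketedMatch = str.match(bracketed)[0];

console.log(shorthandMatch);                    // : March 12 2005
console.log(shorthandMatch === bracketedMatch); // true
```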

The caret is used in another pattern: it and the dollar sign are used to capture specific patterns relative to the beginning and end of a line. The caret, outside of brackets, matches any sequence beginning a line; the dollar sign matches any ending a line.

In the following code snippet, the match is not successful because the pattern does not occur at the beginning of the line:

var regExp = /^The/i; var str = "This is the JavaScript example";

However, the following would be successful:

var regExp = /^The/i; var str = "The example";

If the multiple line flag is given (m), the caret matches on the first character after the line break:

var regExp = /^The/im; var str = "This is\nthe end";

The same positional pattern matching holds true for the end-of-line character. The following doesn't match:

var regExp = /end$/; var str = "The end is near";

But this does:

var regExp = /end$/; var str = "The end";

If the multiple line flag is used, it matches at the end of the string and just before the line break:

var regExp = /The$/im; var str = "This is really the\nend";
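
The positional cases above can be collected into one sketch that runs each pattern with the test method. Storing the results in named variables is a choice made for this illustration.

```javascript
// Caret: the pattern must sit at the start of the string.
var startMiss = /^The/i.test("This is the JavaScript example"); // false
var startHit = /^The/i.test("The example");                     // true

// With m, the position right after a line break also counts as a start.
var startNoFlag = /^The/i.test("This is\nthe end");             // false
var startMultiline = /^The/im.test("This is\nthe end");         // true

// Dollar sign: the pattern must sit at the end of the string.
var endMiss = /end$/.test("The end is near");                   // false
var endHit = /end$/.test("The end");                            // true

// With m, the position just before a line break also counts as an end.
var endMultiline = /The$/im.test("This is really the\nend");    // true

console.log(startMiss, startHit, startNoFlag, startMultiline);
console.log(endMiss, endHit, endMultiline);
```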

The use of parentheses is significant in regular-expression pattern matching. Parentheses match and then remember the match. The remembered values are stored in the result array:

var rgExp = /(^\D*[0-9])/;
var str = "This is fun 01 stuff";
var resultArray = str.match(rgExp);
document.writeln(resultArray);

With this example, the array prints out This is fun 0 twice, separated by a comma indicating two array entries. The first result is the match; the second, the stored value from the parentheses. If, instead of surrounding the entire pattern, you surround only a portion, such as /(^\D*)[0-9]/, this results:

This is fun 0,This is fun

Only the surrounded matched string is stored.
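
The result array can also be read entry by entry, which makes the difference between the full match and the remembered value easier to see. This is a minimal sketch using the partial-parentheses pattern; console.log stands in for document.writeln.

```javascript
var rgExp = /(^\D*)[0-9]/;
var str = "This is fun 01 stuff";
var resultArray = str.match(rgExp);

// Entry 0 is always the full match.
var fullMatch = resultArray[0];

// Entry 1 is what the first pair of parentheses remembered:
// the nondigit run only, including its trailing space.
var remembered = resultArray[1];

console.log("[" + fullMatch + "]");  // [This is fun 0]
console.log("[" + remembered + "]"); // [This is fun ]
console.log(resultArray.length);     // 2
```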

Parentheses can also help switch material around in a string. RegExp has special characters, labeled $1, $2, and so on to $9, that store substrings discovered through the use of the capturing parentheses. Example 4-7 finds pairs of strings separated by one or more dashes and switches the order of the strings.

Example 4-7. Swapping Strings using regular expressions

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>Regular Expression Switch</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<script type="text/javascript">
//<![CDATA[
var rgExp = /(\w*)-*(\w*)/;
var str = "Java--Script";
var resultStrng = str.replace(rgExp,"$2-$1");
document.writeln(resultStrng);
//]]>
</script>
</body>
</html>

Here's the end result of this JavaScript:

Script-Java

Notice that the number of dashes is also stripped down to just one dash. This example also introduces another very popular pattern-matching character sequence, \w. This sequence matches any alphanumeric character, including the underscore (underline). It's equivalent to [A-Za-z0-9_]. Its converse is \W, which matches any nonword character.
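
Because \w includes the underscore, the same swap works on identifiers that contain one. The input string below is a made-up example, not taken from the book.

```javascript
var rgExp = /(\w*)-*(\w*)/;

// Hypothetical input: two underscore-joined words separated by three dashes.
var str = "first_name---last_name";

// $1 and $2 refer to the first and second parenthesized matches.
var swapped = str.replace(rgExp, "$2-$1");
console.log(swapped); // last_name-first_name

// The underscore counts as a word character; the dash does not.
var underscoreIsWord = /\w/.test("_");
var dashIsNonWord = /\W/.test("-");
console.log(underscoreIsWord, dashIsNonWord); // true true
```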

The last regular-expression characters we'll examine in detail are the vertical bar (|) and curly braces. The vertical bar separates alternatives, any one of which can match. For instance, the following matches either the letter a or the letter b:

a|b

You can use more than one character with vertical bars to provide more options:

a|b|c

The curly braces indicate repetition of the preceding character a set number of times. In the following, the pattern searched is two s characters together:

s{2}
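
Both characters can be tried out with the test method. The words used here are assumptions chosen so that each pattern has one hit and one miss.

```javascript
// Vertical bar: either alternative is enough for a match.
var eitherHit = /a|b/.test("club");   // true: contains b
var eitherMiss = /a|b/.test("once");  // false: neither a nor b

// More alternatives simply widen the choice.
var threeWayHit = /a|b|c/.test("once"); // true: contains c

// Curly braces: exactly two s characters in a row.
var doubleHit = /s{2}/.test("mission"); // true: contains ss
var doubleMiss = /s{2}/.test("mist");   // false: only a single s

console.log(eitherHit, eitherMiss, threeWayHit);
console.log(doubleHit, doubleMiss);
```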

Regular expressions are extremely useful when validating form contents, as demonstrated in Chapter 7.

Getting Regular with Expressions

I barely touched on regular-expression use in this chapter, just enough to introduce some key elements and several of the characters. If you're working with forms or other input data from web page readers, or with Ajax, I recommend the book Mastering Regular Expressions by Jeffrey E.F. Friedl (O'Reilly).

There are numerous tools for working with regular expressions, and if you want to use regular expressions, I suggest taking some time to check out at least a few. If you work in Unix or Mac OS X, the utility grep is popular for finding strings within a file. Luckily, there's a Windows-based version of the tool, PowerGrep.

There are also tools that help you test regular expressions. Since I do most of my work on a Mac, I use CocoaRegex, a free and downloadable utility (shown in Figure 4-2). There are also several for Linux and Windows (search for "javascript regular expression tools"). Searching for "javascript regular expression" or just plain "regular expression" returns several sites devoted to regular expressions, including popular patterns and tutorials.


Figure 4-2. The regular expression tool CocoaRegex





Learning JavaScript
Learning JavaScript, 2nd Edition
ISBN: 0596521871
EAN: 2147483647
Year: 2006
Pages: 151
