Recipe 13.1. Understanding Regular Expression Patterns

Problem

You want to understand the basic building blocks of regular expressions.

Solution

Regular expressions are built by combining characters with special meaning. Start by learning the basic patterns, and then use this knowledge to put together more complex patterns.

Discussion

A regular expression is a pattern constructed using the regular expression syntax and is typically used during text processing and pattern matching. The syntax consists of characters, metacharacters, and metasequences. Characters are interpreted literally, whereas metacharacters and metasequences have special meaning in the regular expression context. For example, the regular expression built from the characters hello matches the string "hello," whereas the regular expression consisting only of the . metacharacter means "any character" and matches "a", "b", "1", etc. Additionally, the regular expression built from the \d metasequence matches any digit, such as "1" or "9".

Before getting too in-depth with the regular expression syntax, let's start by discussing how regular expressions are created in ActionScript 3.0. Regular expressions are built with the RegExp class and can be constructed from either a string describing the pattern or from a regular expression literal. A regular expression literal is a forward slash, followed by the regular expression pattern, followed by another forward slash, such as /pattern/. The following code demonstrates how to create a regular expression for the pattern hello by using both a string and the RegExp constructor, as well as a regular expression literal:

// Create a pattern for hello using the RegExp class constructor
// passing in a string describing the pattern
var example1:RegExp = new RegExp( "hello" );

// Create the same hello pattern using a regular expression literal
var example2:RegExp = /hello/;

Both the example1 and example2 regular expressions match the same pattern, namely the string "hello." In general, the pattern is the same regardless of which method you use to create the regular expression. However, when a backslash (\) is part of the regular expression pattern, using a string and the RegExp constructor gets tricky.

Because the RegExp object is created by passing a string to the constructor, all references to \ within the string must be escaped as \\. Since \ is also a special character in RegExp patterns, to search for a backslash in a regular expression built from a string, you must escape it like this: \\\\.

Backslashes mark the beginning of an escape sequence inside a string (see Recipe 12.3) and lose their meaning in the regular expression context. That is, the backslash is interpreted as a special string character before being interpreted in the regex. Therefore, if you want to match a pattern with a backslash, you have to use a double backslash in the string approach. The regular expression literal does not have the same problem:

// Create a regular expression to match a digit (note the double
// backslash)
var example1:RegExp = new RegExp( "\\d" );

// Create a regular expression to match a digit
var example2:RegExp = /\d/;

// Create a regular expression that matches a backslash
var example3:RegExp = new RegExp( "\\\\" );

// Create a regular expression to match a backslash
var example4:RegExp = /\\/;

The preferred way to create regular expressions is by using regular expression literals, and this convention is used throughout the rest of this chapter.
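
As a quick check of the ideas so far, the following sketch uses the test() method of the RegExp class, which returns true when the pattern is found anywhere in the string. The variable names and sample strings are made up for illustration:

```actionscript
// Literal characters match themselves
var literal:RegExp = /hello/;
trace( literal.test( "say hello" ) ); // true
trace( literal.test( "help" ) );      // false

// The . metacharacter matches any single character
var anyChar:RegExp = /h.t/;
trace( anyChar.test( "hat" ) ); // true
trace( anyChar.test( "h1t" ) ); // true
trace( anyChar.test( "ht" ) );  // false

// The string form and the literal form of \d behave the same way
var digitFromString:RegExp = new RegExp( "\\d" );
var digitFromLiteral:RegExp = /\d/;
trace( digitFromString.test( "room 9" ) );  // true
trace( digitFromLiteral.test( "room 9" ) ); // true
trace( digitFromLiteral.test( "room" ) );   // false
```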

By now you know that characters in a regular expression pattern are interpreted literally. By combining metacharacters and metasequences with regular characters, you can create powerful combinations useful for matching many pattern types. Let's take a look at the metacharacters, what they mean, and how they might be used.

Table 13-1 summarizes the regular expression metacharacters. Any time you want to use one of these metacharacters literally, it must be preceded by a backslash. For example, to match an open curly brace, use the regular expression \{.

Table 13-1. Regular expression metacharacters
Expression	Meaning	Example
?	Matches the preceding character zero or one time (i.e., preceding character is optional)	`ta?k` matches tak or tk but not tik or taak
*	Matches the preceding character zero or more times	`wo*k` matches wok, wk, or woook, but not wak
+	Matches the preceding character one or more times	`craw+l` matches crawl or crawwwl but not cral
`.` (period)	Matches any one character except newline (unless the dotall flag is set)	`c.ow` matches crow or clow but not cow
^	Matches the start of the string (also matches the start of a line when the multiline flag is set)	`^wap` matches wap but not swap
$	Matches the end of the string (also matches the position before a newline "`\n`" when the multiline flag is set)	`ow$` matches ow but not owl
|	Matches either the left or right side of the pipe	`one|two` matches one or two but not ten
\	Escapes the special meaning of the metacharacter following the backslash	`\.` matches a period, instead of "any one character" like the metacharacter `.` would
`(` and `)`	Creates groups within the regular expression to: define the scope of `|`; define the scope of `{` and `}`; use back references, where `\1` refers to whatever is matched in the first group, etc.	`l(o|a)g` matches log or lag but not lug; `a(b){1,2}` matches ab or abb but not a; `(a|b)\1` matches aa or bb but not ab or ba
`[` and `]`	Defines character classes that represent matches for a single character. Presence of a `-` indicates a range of characters; a caret (`^`) at the beginning negates the character class (everything except what is defined by the class matches); metacharacters do not need to be escaped with a backslash (but a dash and beginning caret do)	`l[oa]g` matches log or lag but not lug; `[a-z]` matches any lowercase character such as a or h but not 1, 2, or F; `l[^oa]g` matches lug but not lag or log; `[+\-]` matches + or -
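
Several of the rows in Table 13-1 can be tried out directly. The following is only a sketch that calls test() on regular expression literals; test() reports whether the pattern occurs anywhere in the string, which for these short sample words gives the same answers as the table:

```actionscript
// Quantifiers: ?, *, and +
trace( /ta?k/.test( "tk" ) );     // true
trace( /ta?k/.test( "taak" ) );   // false
trace( /wo*k/.test( "woook" ) );  // true
trace( /craw+l/.test( "cral" ) ); // false

// Anchors and alternation
trace( /^wap/.test( "swap" ) );   // false
trace( /ow$/.test( "owl" ) );     // false
trace( /one|two/.test( "two" ) ); // true

// Groups, back references, and character classes
trace( /l(o|a)g/.test( "lag" ) ); // true
trace( /(a|b)\1/.test( "bb" ) );  // true
trace( /(a|b)\1/.test( "ab" ) );  // false
trace( /l[^oa]g/.test( "lug" ) ); // true
trace( /[+\-]/.test( "3-2" ) );   // true
```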

Table 13-2 describes the metasequences in the same way, listing what each expression matches along with an example.

Table 13-2. Regular expression metasequences
Expression	Matches	Example
{`n`}	Exactly `n` occurrences of the preceding character or group	`cre{2}l` matches creel but not crel or creeel
{`n`,}	At least `n` occurrences of the preceding character or group	`cre{2,}l` matches creel or creeeel but not crel
{`n`,`m`}	At least `n` but no more than `m` instances of preceding character or group	`cre{2,3}l` matches creel or creeel but not crel or creeeel
\A	At the start of the string; similar to (`^`)	`\Awap` matches wap but not swap
\b	Word boundary	`\b7\b` matches 7 but not 71 or 573
\B	Non-word boundary	`\B7\B` matches 573 but not 71 or 7 or 37
\d	Any numeric digit; same as `[0-9]`	`a\d` matches a1 and a8 but not ab or ad
\D	Any non-digit character; same as `[^0-9]`	`a\D` matches aB and ak, but not a8 or a1
\n	The newline character	`a\nb` matches `"a\nb"`
\r	The return character	`a\rb` matches `"a\rb"`
\s	Single whitespace character (space, tab, line feed, or form feed)	`King\sTut` matches King Tut and `"King\tTut"`
\S	Single nonwhitespace character	`\STut` matches gTut but not Tut
\t	The tab character	`a\tb` matches `"a\tb"`
\u`nnnn`	The Unicode character specified by the hex digits `nnnn`	`\u000a` matches `"\n"`
\w	Any word character; same as `[A-Za-z0-9_]`	`a\wm` matches arm and a8m, but not a m or aém
\W	Any non-word character; same as `[^A-Za-z0-9_]`	`a\Wm` matches a m or aém, but not a7m or aim
\x`nn`	The ASCII character specified by the hex digits `nn`	`\x0a` matches `"\n"`
\Z	The end of the string; matches before the line break if the string ends in one	`ab\Z` matches `"ab\n"` and ab, but not `"ab\nc"`
\z	The end of the string; matches after the line break if the string ends in one	`ab\z` matches ab, but not `"ab\n"` or `"ab\nc"`
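
A few of the metasequences in Table 13-2 can be checked the same way. This is a minimal sketch using test() with sample strings chosen for illustration; it leaves out \A, \Z, and \z:

```actionscript
// Quantifier braces
trace( /cre{2}l/.test( "creel" ) );    // true
trace( /cre{2}l/.test( "creeel" ) );   // false
trace( /cre{2,3}l/.test( "creeel" ) ); // true

// Word boundaries
trace( /\b7\b/.test( "7" ) );   // true
trace( /\b7\b/.test( "71" ) );  // false
trace( /\B7\B/.test( "573" ) ); // true

// Character-type shorthands
trace( /a\d/.test( "a8" ) );              // true
trace( /a\D/.test( "a8" ) );              // false
trace( /King\sTut/.test( "King\tTut" ) ); // true
trace( /a\wm/.test( "a m" ) );            // false
trace( /a\Wm/.test( "a m" ) );            // true
```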

Table 13-1 and Table 13-2 describe the basic syntax rules that make up regular expressions. By combining characters, metacharacters, and metasequences, you can match a wide variety of patterns. There is more to the story, however.

Regular expressions can also include certain flags that indicate if any special processing should be done with the pattern. There are five flags that can be accessed as properties of a RegExp object: global, ignoreCase, multiline, dotall, and extended.

The flags must be set when the expression is created; trying to modify a flag on a RegExp instance results in a compile-time error:

// Generates a compile-time error in strict mode:
// Property is read-only
example.global = true;

There are two ways to set flags, depending on which method is used to create the regex. When using the RegExp constructor, you can pass a second string parameter that lists the flags for the regex. When using a regular expression literal, the flags should follow the trailing forward slash that ends the expression:

// Create a regular expression with the global and ignoreCase flags
var example1:RegExp = new RegExp( "hello", "gi" );

// Create a regular expression with the global and ignoreCase flags
var example2:RegExp = /hello/gi;
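
Although the flags cannot be changed after creation, they can be read back through the properties named earlier. The following sketch uses two hypothetical variables, plain and flagged, to show the values you would expect to see:

```actionscript
var plain:RegExp = /hello/;
var flagged:RegExp = new RegExp( "hello", "gi" );

// A regex created without flags reports false for each one
trace( plain.global );     // false
trace( plain.ignoreCase ); // false

// Flags named at creation time read back as true
trace( flagged.global );     // true
trace( flagged.ignoreCase ); // true
trace( flagged.multiline );  // false
```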

By default, all the flags are set to false unless they are explicitly declared when the regex is created. Table 13-3 lists the various flags and their meaning.

Table 13-3. Regular expression flags
Flag	Meaning	Example
g (global)	Finds every match in the string instead of stopping at the first	`/the/g` matches every occurrence of the
i (ignoreCase)	Performs a case-insensitive match for `[a-z]` and `[A-Z]` (and not special characters like é)	`/a/i` matches a and A
m (multiline)	Allows `^` to match the beginning of a line; allows `$` to match the end of a line	`/^a/m` matches both \na and a
s (dotall)	Allows `.` to match the newline character `\n`	`/a./s` matches both a\n and ab
x (extended)	Allows spaces in the regex that are ignored by the pattern, allowing regex to be written more clearly	`/a \d/x` matches a2 but not a 2 (with a space between the characters)
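
The effect of the global, ignoreCase, multiline, and dotall flags can be seen by running the same pattern with and without each flag. The following is a sketch only; it relies on the match() method of String, which returns an array of matches, and the sample strings are invented for illustration:

```actionscript
var text:String = "The cat and the dog";

// Without g, match() returns only the first match; with g, every match
trace( text.match( /the/ ).length );   // 1
trace( text.match( /the/g ).length );  // 1 (only the lowercase "the")
trace( text.match( /the/gi ).length ); // 2 ("The" and "the")

// With m, ^ also matches at the start of the second line
var lines:String = "b\na";
trace( /^a/.test( lines ) );  // false
trace( /^a/m.test( lines ) ); // true

// With s, the . metacharacter also matches a newline
trace( /a./.test( "a\n" ) );  // false
trace( /a./s.test( "a\n" ) ); // true
```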

The most commonly used flags are ignoreCase and global, but specifying the extended flag can help in understanding regexes. With the extended flag set, you can insert extra whitespace to highlight the different parts that make up the expression; for example:

var example1:RegExp = /(a(b)*){2,}/;

// Use the extended flag for slightly more readability
var example2:RegExp = /(a (b)* ){2,}/x;

The preceding code creates a regular expression for "a, followed by b any number of times, with the whole expression repeated at least 2 times" and matches "abba" and "abbbabbbbbbbb," but not "abbb."
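
To confirm that the compact and extended forms behave alike, the following sketch runs both against the three sample strings just mentioned. The names compact, spaced, and samples are hypothetical:

```actionscript
var compact:RegExp = /(a(b)*){2,}/;
var spaced:RegExp = /(a (b)* ){2,}/x;
var samples:Array = [ "abba", "abbbabbbbbbbb", "abbb" ];

// Print each sample followed by the result from each regex
for each ( var sample:String in samples ) {
    trace( sample, compact.test( sample ), spaced.test( sample ) );
}
// abba true true
// abbbabbbbbbbb true true
// abbb false false
```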

A key point to remember is that every regex can be reduced to these fundamental building blocks. Understanding this and learning how to break down complex regex patterns will help avoid some of the frustration associated with learning regular expressions. It's worth your time to learn regular expressions, and once you've got them down, they'll prove to be a valuable tool to have on your belt.

