Crafting Your Own Regular Expressions

Up to this point, this chapter has introduced you to the reFind() and reReplace() functions (and their case-insensitive counterparts). Along the way, you learned about a number of RegEx concepts, such as subexpressions and backreferences. You've also seen some decent examples of actual RegEx criteria syntax (that is, the various wildcards you can use in regular expressions), but you haven't been formally introduced to what each of the wildcards does.

The remainder of this chapter will focus on the regular expressions themselves.

Understanding Literals and Metacharacters

Every regular expression you write includes two types of characters: literals and metacharacters.

Literals, or literal characters, are normal text characters that represent themselves literally. In other words, literals are all the characters in a RegEx that aren't wildcards of one form or another. In the email RegEx that's been used several times in this chapter (see Listing 13.1), the only literal character is the @ sign. If your search involves the word dog, your RegEx will likely contain the literal d, o, and g characters.

Metacharacters are the various special characters (what I've been calling wildcards up to this point) that have special meaning to the regular expression engine. You've already seen a few of the most common metacharacters, such as the [, ], {, }, and + characters. You'll learn about all the rest in the pages to come.

NOTE

Up to this point, I've been using the term wildcard as an approximate synonym for metacharacter. Wildcard is less technical and perhaps a bit less precise, but it rolls off the tongue a lot more easily and is more intuitively understood. I imagine you've understood what I've meant by wildcard all along, whereas metacharacter might have tripped us up a bit. I'll continue to use wildcard during the less formal parts of the remaining discussion.

Including Metacharacters Literally

Sometimes, you need to include one of the metacharacters as a literal. To do so, you escape the metacharacter by preceding it with a backslash. You saw this demonstrated in Listing 13.2, where the sequences \( and \) were used to denote literal parentheses characters (that is, parentheses that should actually be searched for, rather than having their usual special meaning of indicating a subexpression).

NOTE

If you need to search for a literal backslash, escape the backslash with another backslash. Just use two backslashes together, as in \\.

NOTE

Remember that the backslash serves another purpose, too: indicating a backreference. For details, see the earlier section "Altering Text with Backreferences."
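To see escaping in action outside ColdFusion, here is a minimal sketch that uses Python's re module as a stand-in engine (an assumption made only so the snippet runs on its own; the sample strings are made up). The escaping rules shown are the same ones described above.

```python
import re

# Hypothetical sample text.
text = "Call (800)555-1212 today"

# Escaped parentheses are literal characters that must appear in the text.
literal_parens = re.search(r"\([0-9]{3}\)", text)

# Unescaped parentheses only group a subexpression; they match no characters.
grouping_parens = re.search(r"([0-9]{3})", text)

# A literal backslash is written as two backslashes in the pattern.
backslash = re.search(r"\\", r"C:\temp")

print(literal_parens.group(0))   # (800)
print(grouping_parens.group(0))  # 800
print(backslash.start())         # 2
```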

Introducing the Cast of Metacharacters

The RegEx implementation in ColdFusion supports a lot of metacharacters, which can be broken into the conceptual groups shown in Table 13.6.

Table 13.6. Metacharacter Types
TYPE	DESCRIPTION
Character classes	Character classes define a set of characters that will match. They are defined with square brackets: `[aeiou]` matches any single vowel; `[0-9]` matches any single number, and `[^0-9]` matches any single character except numbers. There are also special shortcuts for often-used sets of characters, such as `\w` for any letter or number, or `\s` for any whitespace character. Finally, there's the dot character (`.`), which matches any character at all.
Quantifiers	These metacharacters allow you to specify how many times a certain item can appear to still be considered a match. Quantifiers include `?` for optional matches, `+` for one or more matches, and `*` for any number of matches (including none). There are also the interval quantifiers: `{num}` for `num` number of matches; `{num,max}` for `num` to `max` number of matches; and `{num,}` for `num` or more matches.
Alternation	You can establish OR conditions in your regular expressions with the `|` character. Parentheses constrain how far the `|` reaches, so `(you|we)` matches `you` or `we`.
String anchors	String anchors let you specify that a match must occur at a particular location in a chunk of text. Anchors include `^` for matches at the beginning of the text (or line) and `$` for matches at the end. There are also the `\A` and `\Z` anchors, which are similar, except that they are not affected by multiline mode.
Escape sequences	Escape sequences are mostly for matching certain unprintable characters; for example, `\t` to match tabs or `\n` to match newlines.
Modifiers	Modifiers allow you to turn on different types of RegEx behavior for use in special cases. Modifiers include `(?m)` for line-by-line matching and `(?=)` for lookahead matching.

The next few sections present a kind of crash course in metacharacters. I've titled these sections Metacharacters 101, Metacharacters 102, and so on. By the end of this little course, you'll have a pretty good understanding of regular expression syntax. Aren't you glad you didn't actually have a course like this in school?

Metacharacters 101: Character Classes

Of all of the metacharacters available in regular expressions, character classes are probably the most important. Character classes are a way of specifying a set of characters, any one of which can be considered a match. You can specify your own classes or use any number of predefined classes supported by RegEx.

Specifying Character Classes with `[ ]`

You can specify any set of characters as a character class with the square bracket characters [ and ]. The class [aeiouAEIOU] will match any vowel; [12345] will match a 1, 2, 3, 4, or 5 character. For instance, perhaps your last name is Andersen and people often misspell it as Anderson or forget to capitalize the first letter. You could find any of the various spellings using [Aa]nders[eo]n as the regular expression.

The hyphen character has special meaning when it is between a set of square brackets: It indicates a range of acceptable characters. For instance, [1-5] is easier to type than [12345] and will still match a 1, 2, 3, 4, or 5 character. Very common character classes are [A-Za-z] for matching any letter, and [0-9] for matching any single number character. If your company uses an ID number composed of two letters followed by a dash and then three numbers, you could use this as the regular expression:

 [A-Z][A-Z]-[0-9][0-9][0-9]

As you'll learn in Metacharacters 102, you could use quantifiers as an easier way of specifying the part consisting of three numbers at the end.
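The two patterns from this section can be tried out with a short sketch. Python's re module is used here as a stand-in for ColdFusion's engine (an assumption for the sake of a runnable example), and the sample names and ID values are invented.

```python
import re

# [Aa]nders[eo]n accepts either capitalization and either spelling.
name_pattern = re.compile(r"[Aa]nders[eo]n")
names = ["Andersen", "anderson", "Anderson", "Andersson"]
name_hits = [n for n in names if name_pattern.search(n)]

# Two uppercase letters, a dash, then three digits.
id_pattern = re.compile(r"[A-Z][A-Z]-[0-9][0-9][0-9]")
ids = ["AB-123", "ab-123", "AB-12"]
id_hits = [i for i in ids if id_pattern.search(i)]

print(name_hits)  # ['Andersen', 'anderson', 'Anderson']
print(id_hits)    # ['AB-123']
```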

Negating a Character Class with `^`

If the square bracket contents for a character class start with a caret character, the character class is negated, meaning that the class will match any character that isn't in the class. For example, [^A-Za-z0-9] matches anything other than a number or letter, and [^aeiouAEIOU] matches anything other than a vowel.

NOTE

Keep in mind that there are lots of characters other than letters and numbers, including unprintable characters such as tabs and newlines. So, while you may think at first glance that [^aeiouAEIOU] would simply match all consonants, that's not all it will match. It will also match unprintable characters, and all other characters, too, including punctuation characters (commas, periods, and the like).
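A quick way to convince yourself of this is to list everything a negated class matches in a small sample. The sketch below assumes Python's re module as a stand-in engine; the sample string is made up.

```python
import re

sample = "Hi, you!\n"

# Every character that is NOT a vowel, including punctuation and whitespace.
non_vowels = re.findall(r"[^aeiouAEIOU]", sample)

# Consonants only: take the letters, then drop the vowels.
consonants = [c for c in re.findall(r"[A-Za-z]", sample) if c not in "aeiouAEIOU"]

print(non_vowels)  # ['H', ',', ' ', 'y', '!', '\n']
print(consonants)  # ['H', 'y']
```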

Common Character Classes

Because certain character classes are called for frequently (such as [A-Za-z] for matching any letter, and [0-9] for matching any digit), ColdFusion supports a number of shortcuts for the most commonly needed character classes. Different regular expression tools support slightly different ways of specifying these shortcuts, but most adhere to the shortcuts supported by Perl or by POSIX. ColdFusion's RegEx implementation supports both. The Perl shortcuts, in particular, are really easy to type.

Table 13.7 shows common character classes you might need to use in your regular expressions, with the Perl-style and POSIX-style shortcuts for each. The Normal column shows how to write the character class using the normal square bracket syntax. The Perl Shortcut and POSIX Shortcut columns show the shortcuts for each class; for some of the classes, there is a POSIX shortcut but no corresponding Perl shortcut, in which case the Perl Shortcut column is left blank. A few shortcuts shown at the bottom of the table would be virtually impossible to type using the manual [ ] syntax, so the Normal column is left blank.

Table 13.7. Common Character Classes and Their Shortcuts
NORMAL	PERL SHORTCUT	POSIX SHORTCUT	MATCHES
`[A-Z]`		`[[:upper:]]`	Any uppercase letter.
`[a-z]`		`[[:lower:]]`	Any lowercase letter.
`[A-Za-z]`		`[[:alpha:]]`	Any letter, regardless of case.
`[0-9]`	`\d`	`[[:digit:]]`	Any number character (digit).
`[^0-9]`	`\D`	`[^[:digit:]]`	Any character other than a number.
`[0-9A-Za-z]`	`\w`	`[[:alnum:]]`	Any letter or number character. (In Perl-compatible engines, `\w` typically matches the underscore as well.)
`[^0-9A-Za-z]`	`\W`	`[^[:alnum:]]`	Any character other than a number or letter.
`[ \t]`		`[[:blank:]]`	A space or a tab.
`[ \t\n\r\f]`	`\s`	`[[:space:]]`	Any whitespace character, which means any spaces, tabs, or any of the end-of-line indicators (newlines, form feeds, and carriage returns).
`[^ \t\n\r\f]`	`\S`	`[[:graph:]]`	Any nonwhitespace character.
	`.` (dot)		Any character at all. It's important to understand that in ColdFusion, the dot character always matches newlines, which is not always the case with Perl.

NOTE

As noted in this table, ColdFusion's dot metacharacter always matches any character, including newlines. In other words, the behavior is consistent with Perl behavior when Perl's /s switch is in effect.

In the preceding section, we discussed a regular expression for matching an ID number that comprised two letters, a dash, and three numbers. The RegEx looked like this:

 [A-Z][A-Z]-[0-9][0-9][0-9]

You can use Perl-style shortcuts to make the RegEx easier to type and look at, like this:

 [A-Z][A-Z]-\d\d\d

Or, you can use POSIX-style shortcuts, like so:

 [[:upper:]][[:upper:]]-[[:digit:]][[:digit:]][[:digit:]]

Feel free to mix and match the two types of shortcuts, like so:

 [[:upper:]][[:upper:]]-\d\d\d

NOTE

The POSIX shortcuts can be negated with the ^ character, as shown in the POSIX Shortcut column for the [^0-9] class in Table 13.7.

NOTE

You might be wondering why you would use [[:upper:]] instead of [A-Z], because it doesn't seem to be much of a shortcut at all (there's actually more to type). The main benefit is that the POSIX shortcuts attempt to understand uppercase and lowercase letters for each language, whereas something like [A-Z] will work only for English and other roman-style character sets.

Metacharacters 102: Quantifiers

As you learned in Table 13.6, quantifiers allow you to specify how many times certain parts of a RegEx can match for the overall regular expression to be considered a match. You will learn about each of the quantifiers in detail as we work through this section.

Regardless of which quantifier you're using, you always place it right after the item that you want to affect. That item might be a single character, a character class, or the set of parentheses that sets off a subexpression. If character classes are the foundation of what regular expressions are about, quantifiers give the technology its muscle; without them, it would be hard to solve anything but simple problems with RegEx.

Table 13.8 lists the quantifier metacharacters available for your use.

Table 13.8. RegEx Quantifiers
QUANTIFIER	DESCRIPTION
`?`	Means that the preceding item is optional. In more technical terms, `?` matches the preceding item zero or one times. The preceding item might be a single character, a character class, or a subexpression.
`+`	Means that the preceding item appears at least once; that is, + matches the preceding item one or more times. Again, the preceding item might be a single character, a character class, or a subexpression.
`*`	Means that the preceding item is optional, but also may appear any number of times. Conceptually, it's like combining `?` and `+`. That is, it matches the preceding item zero or more times.
`{num}`	Matches the preceding item exactly `num` times; for instance, either `[0-9]{3}` or `\d{3}` will match three numbers (digits).
`{num,max}`	Matches the preceding item between `num` and `max` times. So, if you were looking for all words between 5 and 10 letters in length, you could use `[A-Za-z]{5,10}` or `[[:alpha:]]{5,10}`.
`{num,}`	Matches the preceding item at least `num` times, without a maximum, so `[A-Za-z]{10,}` could be used to find long words (10 letters or longer). If you think about it, the + character (above in this table) could be considered a shortcut for `{1,}`.

Using Quantifiers

Let's look at a few examples of using character classes and quantifiers. Say you need to create a regular expression that will match a U.S. ZIP code. Let's start off with the simple five-digit version of a ZIP code. Using the character class skills you learned in Metacharacters 101, you know you could use this:

 [0-9][0-9][0-9][0-9][0-9]

or this:

 \d\d\d\d\d

You can use the {num} quantifier from Table 13.8 to avoid having to type a separate class for each digit, like so:

 [0-9]{5}

or like so:

 \d{5}

Now let's say you want to match the nine-digit version of a ZIP code. Just add another class and quantifier sequence, like so:

 \d{5}-\d{4}

NOTE

For those of you who aren't from the U.S., sorry to use such a culturally myopic example. It's just a natural one to start off with. Anyway, U.S. ZIP codes are just the postal codes used in mailing addresses. ZIP codes come in two forms. For a long time, they were simply five-digit numbers. Later, the postal service introduced a nine-digit version, in the form 99999-9999. Both forms are used in practice today.

Making Certain Portions Be Optional with `?`

Okay, what if you wanted to accept either five- or nine-digit ZIP codes? You can use the ? quantifier to say that the second portion of the code is optional, as in the following (Figure 13.13):

 \d{5}(-\d{4})?

Figure 13.13. The `?` operator handles items that don't necessarily need to be present.

Note that the ? quantifier respects parentheses, so in this example everything within the parentheses is modified by the ?.
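The following sketch shows the optional group at work on both forms of ZIP code. It assumes Python's re module in place of ColdFusion's engine so that it can be run standalone, and the addresses are invented.

```python
import re

# Five digits, optionally followed by a hyphen and four more digits.
zip_pattern = re.compile(r"\d{5}(-\d{4})?")

five = zip_pattern.search("Anytown, MA 01201")
nine = zip_pattern.search("Anytown, MA 01201-9809")

print(five.group(0))  # 01201
print(nine.group(0))  # 01201-9809

# The parenthesized group is empty (None) when the optional part is absent.
print(five.group(1))  # None
print(nine.group(1))  # -9809
```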

Including One or More Matches with `+`

Another cool quantifier is the + metacharacter. Because it matches one or more times, + is essential for matching substrings that will vary in length. That turns out to describe the majority of regular expression problems, so you'll be using + a lot.

The following matches one or more digits:

 [0-9]+

Like ? and all the other quantifiers, the + character respects parentheses. When it follows a parenthesized group, + matches the entire group one or more times. You can also nest these sets of parentheses within one another, an approach that forms the basis of the email address RegEx you have seen throughout this chapter:

 [\w._]+@[\w_]+(\.[\w_]+)+

That looks complex at first, but it's not so bad if you concentrate on each portion separately. The first portion is in charge of matching the username part of the email address (the part before the @ sign). I came up with [\w._]+ for this part, which matches one or more letters, numbers, dots, or underscores. After the @ sign, the next portion is [\w_]+, which is almost the same except that it doesn't match dots. Next comes a parenthesized group. Inside the parentheses, the expression reads \.[\w_]+, which means a dot, then one or more letters, numbers, or underscores. The fact that there's a + after the parentheses means that this pattern (a dot, then other stuff) must appear at least once and can be repeated any number of times.

In plain English, then, the expression reads "any number of normal characters, then an @ sign, then any number of groups, where the groups each have dots at the beginning," which is a fair description of a validly formed email address.
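Here is a small sketch of the email expression applied to some made-up text. Python's re module is assumed as a stand-in engine; the addresses are hypothetical.

```python
import re

email_pattern = re.compile(r"[\w._]+@[\w_]+(\.[\w_]+)+")

text = "Write to first.last@mail.example.org or nobody@localhost today."

# finditer() is used so that group(0), the whole match, can be collected.
found = [m.group(0) for m in email_pattern.finditer(text)]

# nobody@localhost is skipped: the trailing group requires at least one dot.
print(found)  # ['first.last@mail.example.org']
```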

NOTE

Some of the examples in this chapter add a few additional sets of parentheses to this regular expression so that it contains subexpressions for each part of the email address (see Figure 13.4). Those parentheses don't have anything to do with the + sign, and don't affect which addresses actually match. They just make it possible to capture each portion of the match separately.

Matching Any Number of Matches with `*`

The * metacharacter is similar to + in that it will match one, two, or any other number of whatever preceded it. The difference is that it will also match zero times: It matches even if the preceding item isn't present at all. I like to think of this quantifier as meaning "any amount of the preceding, but let it be optional."

For instance, it could be used to find <b> (boldface) tags in a chunk of HTML:

 <b>.*</b>

In plain English, this means to match a <b>, then any amount of anything, then </b>. This seems sensible enough. If you try it against this text:

 The <b>Bear</b> walked alone

you will find that the <b>Bear</b> part is what matches, which is what you would expect. However, if you try it against this text:

 The <b>Bear</b> and the <b>Fox</b> walked hand in hand.

it will match the <b>Bear</b> and the <b>Fox</b> part of the text. That is, the RegEx engine finds the first <b>, then matches everything up to the last </b>. What's going on? Although it might seem counterintuitive at first, it's important to understand that the .* part really does mean "any number of any characters." There's nothing in the .* expression that says that the .* part isn't supposed to match the characters in the </b> part. It's a crucial concept to understand when crafting regular expressions.

By default, regular expressions are "greedy," which means that the processor always settles on the least rigorous interpretation of your RegEx that it can. Or, to put it another way, the engine will always assume that you want the longest possible match. The ColdFusion documentation refers to this as maximal matching, but most regular expression references call it greedy matching.

One way to fix the boldface-text example is to replace the .* with [^<]*, like so:

 <b>[^<]*</b>

See the difference? In plain English, this now means "match <b>, then match any number of anything that isn't a <, then match </b>."

When used against the previous text sample, this version of the RegEx will correctly match <b>Bear</b> and <b>Fox</b>, making it a pretty good solution to the problem. However, it will fail if the text contains any < characters between the <b> and </b>, like this:

 The <b><i>Bear</i></b> and the <b>Fox</b> walked hand in hand.

Using this text, the [^<]* expression will only match <b>Fox</b>. Bummer. All is not lost, though. You can tell the RegEx engine not to use greedy matching, which brings us to our next topic.
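The greedy behavior is easy to reproduce. The sketch below uses Python's re module as a stand-in engine (an assumption; none of the samples contain newlines, so the difference in how the dot treats newlines does not come into play).

```python
import re

text = "The <b>Bear</b> and the <b>Fox</b> walked hand in hand."
nested = "The <b><i>Bear</i></b> and the <b>Fox</b> walked hand in hand."

# Greedy .* runs from the first <b> to the last </b>.
greedy = re.findall(r"<b>.*</b>", text)

# [^<]* stops at the first < character, so each tag pair matches separately.
no_angle = re.findall(r"<b>[^<]*</b>", text)

# With a nested <i> tag, [^<]* can no longer match the first tag pair.
no_angle_nested = re.findall(r"<b>[^<]*</b>", nested)

print(greedy)           # ['<b>Bear</b> and the <b>Fox</b>']
print(no_angle)         # ['<b>Bear</b>', '<b>Fox</b>']
print(no_angle_nested)  # ['<b>Fox</b>']
```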

Using Minimal Matching (Non-Greedy) Quantifiers

As you have seen, the fact that regular expressions match the longest possible substring by default (maximal, or greedy, matching) can sometimes be a problem. In such situations, you can use slightly different quantifiers to tell the RegEx engine to match the shortest possible substring instead. The ColdFusion documentation refers to this as minimal matching (as opposed to maximal matching), but most RegEx texts call it non-greedy matching.

There is a non-greedy version of each of the quantifiers shown in Table 13.8. To indicate that you want to use the non-greedy version, follow the quantifier with a ? character, as shown in Table 13.9.

Table 13.9. Minimal Matching (Non-Greedy) Quantifiers
QUANTIFIER	DESCRIPTION
`??`	Non-greedy version of `?`, which means that the preceding item is optional. The difference in the non-greedy version is that the RegEx engine will first try to match based on the item's absence. In other words, the item will only be included in the match if it is not possible to get a match without the item.
`+?`	Non-greedy version of `+`, which means that the preceding item will match at least once, but as few times as possible.
`*?`	Non-greedy version of `*`, which means that the preceding item can appear any number of times (including none at all), but the shortest possible string will always be found.
`{num,max}?`	Non-greedy version of `{num,max}`, which means that the preceding item will match between `num` and `max` times, but as few times as actually possible.
`{num,}?`	Non-greedy version of `{num,}`, which means that the preceding item will match at least `num` times, but as few times as actually possible.

Using your newfound knowledge of non-greedy quantifiers, the boldfaced-text problem becomes easy to solve:

 <b>(.*?)</b>

If you wanted to ensure that there was at least one character between the <b> and </b> tags, you could use the non-greedy version of + instead of *, like so:

 <b>(.+?)</b>

This expression will match all bold text, but not empty <b></b> tags.

NOTE

Non-greedy matching is sometimes called lazy matching, meaning that the RegEx engine is "lazily" trying to match as little text as possible.
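As a rough illustration of the two non-greedy expressions, the following sketch runs them against a made-up sample that includes a nested tag and an empty tag pair. Python's re module is assumed as a stand-in engine.

```python
import re

text = "The <b><i>Bear</i></b> and the <b>Fox</b> and <b></b>."

# .*? takes as little as possible, so each tag pair is matched on its own;
# the empty pair yields an empty string.
lazy_star = re.findall(r"<b>(.*?)</b>", text)

# .+? needs at least one character, so the empty pair is not reported here.
lazy_plus = re.findall(r"<b>(.+?)</b>", text)

print(lazy_star)  # ['<i>Bear</i>', 'Fox', '']
print(lazy_plus)  # ['<i>Bear</i>', 'Fox']
```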

Metacharacters 201: Alternation

Sometimes you might need to find matches that contain one string or pattern, or another string or pattern. That is, sometimes you need the conceptual equivalent of what would be called an "or" in normal programming languages, or the OR part of a SQL query.

To perform "or" matches with regular expressions, use the | character (usually called the pipe character). Each pipe represents the idea of "or." Just as in normal programming, the | character's effect can be constrained with parentheses, so Number (1|2) is different from Number 1|2. The first would match the string Number 1 or Number 2, whereas the second would match Number 1 or just the number 2.

The following RegEx would match the phrase My Red Fox, My Brown Fox, or My Beige Fox. It would also match My 1 Fox, My 2 Foxes, My 3 Foxes, or any other number of foxes:

 My ((Red|Brown|Beige|1) Fox|[0-9]+ Foxes)\b
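The effect of parentheses on the pipe can be checked with a short sketch. It assumes Python's re module as a stand-in engine and uses fullmatch(), a Python-specific call that succeeds only when the entire string matches.

```python
import re

grouped = re.compile(r"Number (1|2)")  # "Number 1" or "Number 2"
ungrouped = re.compile(r"Number 1|2")  # "Number 1" or just "2"

grouped_number_2 = grouped.fullmatch("Number 2") is not None
grouped_bare_2 = grouped.fullmatch("2") is not None
ungrouped_number_2 = ungrouped.fullmatch("Number 2") is not None
ungrouped_bare_2 = ungrouped.fullmatch("2") is not None

print(grouped_number_2, grouped_bare_2)      # True False
print(ungrouped_number_2, ungrouped_bare_2)  # False True
```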

Metacharacters 202: Word Boundaries

Often, you will need to write regular expressions that are aware of word boundaries. ColdFusion supports the Perl-style \b and \B boundary sequences, as described in Table 13.10.

Table 13.10. Perl-Style Boundary Sequences
SEQUENCE	MEANING
`\b`	Matches what can generally be described in plain English as a word boundary. Technically, a boundary is defined as the transition between an alphanumeric character and a nonalphanumeric character.
`\B`	The opposite of `\b`, matching any position that is not a word boundary. Generally less useful than `\b` in most scenarios.

The \b boundary sequence is particularly handy for making sure that your regular expression matches only whole words. For instance, the regular expression \b[Cc]at\b would match cat or Cat, but not Cats, Catsup, or Scat.
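A minimal sketch of whole-word matching follows, again assuming Python's re module as a stand-in engine and an invented sentence.

```python
import re

pattern = re.compile(r"\b[Cc]at\b")

words = ["cat", "Cat", "Cats", "Catsup", "Scat"]
whole_words = [w for w in words if pattern.search(w)]

# Punctuation and spaces count as boundaries; neighboring letters do not.
in_sentence = pattern.findall("The cat sat; Scat! Cats? A Cat.")

print(whole_words)  # ['cat', 'Cat']
print(in_sentence)  # ['cat', 'Cat']
```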

Metacharacters 203: String Anchors

String anchors are conceptually similar to boundary sequences (see the preceding section), because they are another way of making sure that your regular expression doesn't find undesired "partial matches." Whereas boundaries are about making sure the match "bumps up" against the beginning or end of a word, string anchors are about making sure the match "bumps up" against the beginning or end of the entire chunk of text being searched.

The RegEx string anchors are listed in Table 13.11.

Table 13.11. String Anchors
ANCHOR	DESCRIPTION
`^`	Matches the beginning of the chunk of text being searched. Or, in multiline mode, matches the beginning of a line (multiline mode is discussed next).
`$`	Matches the end of the text being searched. Or, in multiline mode, matches the end of a line.
`\A`	Always matches the beginning of the chunk of text being searched, regardless of whether multiline mode is being used.
`\Z`	Always matches the end of the text being searched, regardless of multiline mode.

For instance, perhaps you have a form field called zipCodePlus4, which you want to validate to make sure it contains a properly formatted U.S. postal ZIP code (the nine-digit "+4" variety). If you didn't know about string anchors, you might decide to use \d{5}-\d{4} as the regular expression, like so:

 <cfif reFind("\d{5}-\d{4}", FORM.zipCodePlus4)>
   Okay
 <cfelse>
   Not Valid
 </cfif>

This regular expression seems to do the job. It displays "Okay" if the user enters something like 01201-9809, and "Not Valid" if the user enters 01201-98 or just 01201.

However, it will also display "Okay" if the user types Foo 01201-9809 or 01201-9809Bar, because there is nothing about the regular expression that says the ZIP code must be the only thing the user enters. The solution is to anchor the regular expression to the beginning and end of the string using ^ and $, like so:

 <cfif reFind("^\d{5}-\d{4}$", FORM.zipCodePlus4)>
   Okay
 <cfelse>
   Not Valid
 </cfif>

Alternatively, you could use the \A and \Z sequences, like so:

 <cfif reFind("\A\d{5}-\d{4}\Z", FORM.zipCodePlus4)>
   Okay
 <cfelse>
   Not Valid
 </cfif>

These two snippets will perform the same way, because ^ is synonymous with \A (and $ is synonymous with \Z) unless the regular expression uses multiline mode.
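The difference the anchors make can be sketched as follows. Python's re module stands in for ColdFusion's engine here (an assumption), and is_zip_plus4 is a hypothetical helper name. One Python-specific caveat: its $ also matches just before a single trailing newline.

```python
import re

def is_zip_plus4(value):
    # Hypothetical helper: True only when the whole value is a ZIP+4 code.
    return re.search(r"^\d{5}-\d{4}$", value) is not None

# Without anchors, extra text around the ZIP code still produces a match.
unanchored = re.search(r"\d{5}-\d{4}", "Foo 01201-9809") is not None

print(unanchored)                      # True
print(is_zip_plus4("01201-9809"))      # True
print(is_zip_plus4("Foo 01201-9809"))  # False
print(is_zip_plus4("01201-9809Bar"))   # False
```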

Understanding Multiline Mode

If you start your regular expression with the special sequence (?m), the regular expression is processed in what the ColdFusion and Perl engines call multiline mode. Multiline mode means that the ^ and $ characters match the beginning and end of a line within the chunk of text being searched, rather than the beginning and end of the entire chunk of text (Figure 13.14).

Figure 13.14. Multiline mode anchors matches to lines in the text being searched.

Let's say you were going to search the following chunk of text:

 1 frog a leaping
 2 foxes jumping
 100 programs crashing
 5 golden rings

The following regular expression would get only the first line; because multiline mode is not in effect, ^ will match only the very beginning of the text:

 ^\d+[[:print:]]+

This one matches all four lines; because multiline mode is on, ^ matches the beginning of a line:

 (?m)^\d+[[:print:]]+

This next one matches the first three lines (because they all end with ing), but not the last line (Figure 13.14):

  (?m)^\d+[[:print:]]+ing$
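The three expressions can be approximated with the sketch below, which assumes Python's re module as a stand-in engine. Python does not support the POSIX [[:print:]] class, so [^\n] (any character except a linefeed) is swapped in; that substitution is mine, not part of the original expressions. The sample is the four-line text, with linefeeds between the lines.

```python
import re

text = "1 frog a leaping\n2 foxes jumping\n100 programs crashing\n5 golden rings"

# No multiline mode: ^ matches only at the very start of the text.
single = re.findall(r"^\d+[^\n]+", text)

# Multiline mode: ^ matches at the start of every line.
multi = re.findall(r"(?m)^\d+[^\n]+", text)

# Multiline mode with $: only lines that end with "ing".
ing = re.findall(r"(?m)^\d+[^\n]+ing$", text)

print(single)      # ['1 frog a leaping']
print(len(multi))  # 4
print(ing)         # ['1 frog a leaping', '2 foxes jumping', '100 programs crashing']
```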

All this said, it is very important to understand what the definition of a line is for the purposes of multiline mode processing. When you use (?m) with ColdFusion, each linefeed character (that's ASCII character 10) is considered to start a new line; this is the Unix method of indicating new lines. Carriage return characters (ASCII code 13) are not considered the start of new lines, which means that:

Multiline mode processing won't work correctly with chunks of text that originate on Macintosh computers, because the text might contain only carriage return characters and no linefeeds.
Chunks of text that originate on Windows/MS-DOS machines probably contain CRLF sequences (a carriage return followed by a linefeed) to separate the lines. As far as multiline mode processing is concerned, a carriage return character sits at the very end of every line, which means that $ will not work as expected, because it recognizes only linefeeds, not carriage returns, as line endings.
Chunks of text that originate on Unix machines will work fine (but if the chunks of text are coming from the public, it's unlikely that they are using Unix browsers).

Therefore, if you are going to use multiline mode, I recommend that you use ColdFusion's normal replace() method to massage the chunk of text that you're going to search. First, replace each CRLF with a linefeed (that should take care of the Windows text), and then replace any remaining carriage returns with linefeeds (to deal with the Mac text). Assuming that the chunk of text you will be searching is in a string variable called str, the following two lines will do the job:

 <cfset str = replace(str, Chr(13)&Chr(10), Chr(10), "ALL")>
 <cfset str = replace(str, Chr(13), Chr(10), "ALL")>

Another option would be to use the adjustNewlinesToLinefeeds() function included in the RegExFunctions.cfm UDF library (Table 13.5), like so:

 <cfset str = adjustNewlinesToLinefeeds(str)>
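The same two-step normalization can be sketched in a few lines of Python (used here only as a runnable stand-in). The function name adjust_newlines is hypothetical and is not the UDF from the library; the order of the two replacements matters, because handling lone carriage returns first would turn each CRLF into two linefeeds.

```python
def adjust_newlines(text):
    # Step 1: CRLF (Windows) becomes a single linefeed.
    # Step 2: any remaining lone carriage return becomes a linefeed.
    return text.replace("\r\n", "\n").replace("\r", "\n")

mixed = "one\r\ntwo\rthree\nfour"
print(adjust_newlines(mixed).split("\n"))  # ['one', 'two', 'three', 'four']
```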

Metacharacters 301: Match Modifiers

Perl 5 introduced a number of special modifiers that begin with the sequence (?, as listed in Table 13.12. Most of these modifiers are discussed elsewhere in this chapter, as indicated.

NOTE

The ColdFusion documentation implies that you can use only (?x) or (?m) or (?i) at the very beginning of a regular expression. Actually, you can use them anywhere in the expression, but they always affect the whole expression, ignoring parentheses. There is no way to say that you only want part of the expression to be affected by (?i), for instance. This is consistent with Perl's behavior. Just the same, I recommend putting these match modifiers at the beginning of the expression, because that's the documented usage.

As an example of using the (?x) modifier, consider the simple phone number RegEx that has been used elsewhere in this chapter. When used in a reFind(), it can look a bit unwieldy and somewhat inscrutable:

 <cfset match = reFind("(\([0-9]{3}\))([0-9]{3}-[0-9]{4})", text, 1, true)>

Using (?x), you can spread the regular expression over as many lines as you want, using whatever indention you want. You can also use the # sign to add comments, like this:

 <cfset match = reFind("(?x)
   (                ## (begin capturing area code with subexpression)
     \([0-9]{3}\)   ## Area Code portion, surrounded by literal parentheses
   )                ## (end capturing of area code)
   (                ## (begin capturing actual phone number)
     [0-9]{3}       ## 'Exchange' portion of phone number,
     -              ## then a hyphen,
     [0-9]{4}       ## then the last four digits of phone number
   )                ## (end capturing of phone number)
 ", text, 1, True)>

Anything from a ## to the end of the line is considered to be a comment.

NOTE

Actually, the RegEx comment indicator is a single #, not ##, but because # has special meaning to ColdFusion, you need to use two pound signs together in order to get the # character into the RegEx string. This is the case anytime you need to embed # within a quoted string in CFML.

TIP

If you need to match a space character while using (?x), escape the space character by typing a \ followed by a space. That tells the processor to consider the space as an actual part of the match criteria, rather than part of the indention and other decorative whitespace.
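For comparison, here is a sketch of the same commented phone number expression in Python, whose re module also honors (?x). Python is an assumption chosen for a runnable example; note that a single # starts a comment there, since the doubled ## is only needed inside CFML strings. The sample text is invented.

```python
import re

phone = re.compile(r"""(?x)
    (               # begin capturing area code
      \([0-9]{3}\)  # area code, surrounded by literal parentheses
    )               # end capturing of area code
    (               # begin capturing the phone number itself
      [0-9]{3}      # exchange portion,
      -             # then a hyphen,
      [0-9]{4}      # then the last four digits
    )               # end capturing of phone number
""")

match = phone.search("Call (212)555-1212 now")
print(match.group(1))  # (212)
print(match.group(2))  # 555-1212
```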

Metacharacters 302: Lookahead Matching

As noted in Table 13.12, you can use the positive lookahead modifier at the beginning of any parenthesized set of items. Positive lookahead means that you want to test that a pattern exists, but without it actually being considered part of the match. For instance, consider the following regular expression:

 \bBelinda (?=Foxile)

This expression will match Belinda in a chunk of text, but only if it is followed by Foxile. Belinda followed by Carlisle will not match.

Negative lookahead, conversely, means that you want to test that a pattern does not exist. Conceptually, it's kind of like being able to say "this but not that." The following expression will match any Belinda, as long as it's not Belinda Carlisle:

 \bBelinda (?!Carlisle)

Here's another example of using lookahead. Say you are using a simple regular expression such as the following to match telephone numbers in the form (999)999-9999:

 (\([0-9]{3}\))([0-9]{3}-[0-9]{4})

The following variation adds negative lookahead to match only the phone numbers that are not in the 212 area code (see Figure 13.15):

 (\((?!212)[0-9]{3}\))([0-9]{3}-[0-9]{4})

Figure 13.15. Lookahead matching allows for "this but not that" matches.

This last variation combines negative lookahead with positive lookahead in the same regular expression to match only the phone numbers that are not in the 212 area code, but where the phrase (new listing) appears after the number:

 (\((?!212)[0-9]{3}\))([0-9]{3}-[0-9]{4})\s+(?=\(new listing\))
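Both phone number variations can be tried against a made-up listing with the sketch below, which assumes Python's re module as a stand-in engine. The two captured groups are joined so that the whitespace consumed by \s+ is left out of the results.

```python
import re

text = "(212)555-1212 (new listing), (718)555-3434 (new listing), (646)555-9999"

# Negative lookahead: skip numbers whose area code is 212.
not_212 = re.compile(r"(\((?!212)[0-9]{3}\))([0-9]{3}-[0-9]{4})")

# Positive lookahead added: the phrase must follow, but it is tested only,
# not consumed as part of the match.
new_only = re.compile(
    r"(\((?!212)[0-9]{3}\))([0-9]{3}-[0-9]{4})\s+(?=\(new listing\))"
)

not_212_hits = [m.group(1) + m.group(2) for m in not_212.finditer(text)]
new_only_hits = [m.group(1) + m.group(2) for m in new_only.finditer(text)]

print(not_212_hits)   # ['(718)555-3434', '(646)555-9999']
print(new_only_hits)  # ['(718)555-3434']
```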

NOTE

ColdFusion does not support lookbehind processing (Perl's (?<=) and (?<!) sequences).

Metacharacters 303: Backreferences Redux

Earlier in this chapter, you learned about using backreferences such as \1 and \2 in the replacement string when using reReplace(), which allowed you to perform replacements that were far more intelligent than with static replacement strings. You can also use backreferences within the regular expression itself: Each backreference is like a variable that holds the value of the corresponding subexpression.

For instance, let's look at our telephone number RegEx again. Here's the normal version of the expression:

 (\([0-9]{3}\))([0-9]{3}-[0-9]{4})

The following variation matches only those phone numbers where the last four digits are the same:

 (\([0-9]{3}\))([0-9]{3}-(\d)\3\3\3)

This variation adds negative lookahead (discussed in the preceding section) to match only phone numbers in which the last four digits are not the same:

 (\([0-9]{3}\))([0-9]{3}-(?!(\d)\3\3\3)[0-9]{4})
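A sketch of backreferences inside the expression itself follows, with Python's re module assumed as a stand-in engine and invented phone numbers. In the second pattern, [0-9]{4} comes after the lookahead so that the last four digits are actually consumed as part of the match; the lookahead on its own only tests them.

```python
import re

# \3 refers back to whatever digit the third subexpression, (\d), captured.
same_last_four = re.compile(r"(\([0-9]{3}\))([0-9]{3}-(\d)\3\3\3)")

# Negative lookahead rejects four identical digits, then [0-9]{4} consumes them.
mixed_last_four = re.compile(r"(\([0-9]{3}\))([0-9]{3}-(?!(\d)\3\3\3)[0-9]{4})")

numbers = ["(212)555-7777", "(212)555-1234"]
same = [n for n in numbers if same_last_four.fullmatch(n)]
mixed = [n for n in numbers if mixed_last_four.fullmatch(n)]

print(same)   # ['(212)555-7777']
print(mixed)  # ['(212)555-1234']
```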

Metacharacters 304: Escape Sequences

ColdFusion supports the use of normal Perl escape sequences in regular expressions, as shown in Table 13.13. Previously, you needed to add these special characters to your RegEx string using the Chr() function. You can still do so, but these escape sequences are more standard and easier to type and read.

Table 13.13. RegEx Escape Sequences
ESCAPE SEQUENCE	DESCRIPTION
`\n`	Newline.
`\t`	Tab.
`\f`	Form feed.
`\r`	Carriage return.
`\x00`	Allows you to specify any character, using a two-digit hexadecimal number. For instance, the ASCII code for an exclamation point is 33 using normal (decimal) numbers; this is 21 in hexadecimal, so you could use `\x21` to specify an exclamation point in a RegEx. (Clearly, there would be more point to this if it were a character that's not on your keyboard, but you get the idea.)
`\000`	Allows you to specify any character, using a three-digit octal number. The octal version of 33 is 41, so you could also use `\041` to specify an exclamation point.

It's worth noting that these escape sequences can be used in character classes, so [\x00-\xC8] would match any character whose code is between 0 and 200 (C8 is hexadecimal for what we humans call 200).
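These escape sequences can be exercised with a brief sketch. Python's re module is assumed as a stand-in engine (it happens to accept the same \x and three-digit octal forms), and the sample strings are made up.

```python
import re

text = "Hi!\tBye!"

hex_hits = re.findall(r"\x21", text)    # \x21 is hexadecimal for "!"
octal_hits = re.findall(r"\041", text)  # \041 is octal for "!"
fields = re.split(r"\t", text)          # \t matches the tab character

# Inside a character class: codes 0 through 200 match, code 255 does not.
low_range = re.findall(r"[\x00-\xC8]", "A\u00C8\u00FF")

print(hex_hits)        # ['!', '!']
print(octal_hits)      # ['!', '!']
print(fields)          # ['Hi!', 'Bye!']
print(len(low_range))  # 2
```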
