Using Regular Expressions in ColdFusion

The next two portions of this chapter will teach you about two concepts:

How to use CFML's RegEx functions (reFind() and the others listed in Table 13.2) to actually perform regular expression operations within your ColdFusion pages.
How to craft the regular expression for a particular task, using the various RegEx wildcards available to you.

This is a kind of chicken-and-egg scenario for me. How can I explain how to incorporate regular expressions like ([\w._]+)\@([\w_]+(\.[\w_]+)+) in your CFML code if you don't yet understand what all those wildcards mean? On the other hand, wouldn't it be pretty boring to learn about all the wildcards before knowing how to put them to use?

To put it another way, it's hard for me to guess what kind of learner you are, or how much you already know about regular expressions. If you don't know anything at all about them, you might want to learn about the various wildcards first. If you've already used them in other tools, you probably just want to know how to use them in ColdFusion. So feel free to skip ahead to the "Crafting Your Own Regular Expressions" section if you don't like looking at all these wildcards without understanding what they mean.

Finding Matches with `reFind()`

Assuming you have already crafted the wildcard-laden RegEx criteria you want, you can use the reFind() function to tell ColdFusion to search a chunk of text with the criteria, like this:

 reFind(regex, string [, start] [, returnSubExpressions] )

Table 13.3 describes each of the reFind() arguments.

Table 13.3. `reFind()` Function Syntax
ARGUMENT	DESCRIPTION
`regex`	Required. The regular expression that describes the text that you want to find.
`string`	Required. The text that you want to search.
`start`	Optional. The starting position for the search. The default is `1`, meaning that the entire string is searched. If you provide a start value of `50`, then only the portion of the string after the first 49 characters is searched.
`returnSubExpressions`	Optional. A Boolean value indicating whether you want to obtain information about the position and length of the actual text that was found by the various portions of the regular expression. The default is False. You will learn more about this topic in the section "Getting the Matched Text Using returnSubExpressions" later in this chapter.
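
As a minimal sketch of how the `start` argument changes the result, the following snippet searches the same string twice. The sample string and the variable names (`sample`, `firstPos`, `laterPos`) are made up for illustration and do not come from the chapter's listings.

```cfml
<!--- Illustrative sample text; digits begin at positions 5 and 13 --->
<cfset sample = "abc 123 def 456">

<!--- No start argument: the search begins at position 1 --->
<cfset firstPos = reFind("[0-9]+", sample)>

<!--- start = 8: the first 7 characters are skipped --->
<cfset laterPos = reFind("[0-9]+", sample, 8)>

<cfoutput>
<p>Without start: #firstPos#</p>   <!--- displays 5 --->
<p>With start of 8: #laterPos#</p> <!--- displays 13 --->
</cfoutput>
```

Note that the position returned is still counted from the beginning of the whole string, not from the `start` position.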

The reFind() function returns one of two things, depending on whether the returnSubExpressions argument is True or False:

Assuming that returnSubExpressions is False (the default), the function returns the character position of the text that's found (that is, the first substring that matches the search criteria). If no match is found in the text, the function returns 0 (zero). This behavior is consistent with the ordinary, non-RegEx find() function.
If returnSubExpressions is True, the function returns a CFML structure composed of two arrays called pos and len. These arrays contain the position and length of the first substring that matches the search criteria. The first value in each array (that is, pos[1] and len[1]) corresponds to the match as a whole. The remaining values in the arrays correspond to any subexpressions defined by the regular expression.

The bit about the subexpressions might be confusing at this point, since you haven't learned what subexpressions actually are. Don't worry about it for the moment. Just think of the returnSubExpressions argument as something you should set to True if you need to get the actual text that was found.
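
To make the two return shapes concrete, here is a small sketch that calls reFind() both ways on the same text. The sample strings and the variable names (`sample`, `position`, `match`, `noMatch`) are assumptions chosen for the example.

```cfml
<cfset sample = "abc 123 def 456">

<!--- Default form: the result is a simple number (0 would mean no match) --->
<cfset position = reFind("[0-9]+", sample)>

<!--- Fourth argument True: the result is a structure with pos and len arrays --->
<cfset match = reFind("[0-9]+", sample, 1, true)>

<!--- When nothing matches, pos[1] is 0 --->
<cfset noMatch = reFind("[0-9]+", "no digits here", 1, true)>

<cfoutput>
<p>Position only: #position#</p>                                 <!--- 5 --->
<p>Position and length: #match.pos[1]#, #match.len[1]#</p>        <!--- 5, 3 --->
<p>Matched text: #mid(sample, match.pos[1], match.len[1])#</p>    <!--- 123 --->
<p>Position when nothing matches: #noMatch.pos[1]#</p>            <!--- 0 --->
</cfoutput>
```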

A Simple Example

For the moment, accept it on faith that the following regular expression will find a sensibly formed Internet email address (such as nate@nateweiss.com or nate@nateweiss.co.uk):

 ([\w._]+)\@([\w_]+(\.[\w_]+)+)

Listing 13.1 shows how to use this regular expression to find an email address within a chunk of text.

Listing 13.1. `RegExFindEmail1.cfm`: A Simple Regular Expression Example

 <!---
  Filename: RegExFindEmail1.cfm
  Author: Nate Weiss (NMW)
  Purpose: Demonstrates basic use of reFind()
 --->
 <html>
 <head><title>Using a Regular Expression</title></head>
 <body>
 <!--- The text to search --->
 <cfset text = "My email address is nate@nateweiss.com. Write to me anytime.">
 <!--- Attempt to find a match --->
 <cfset foundPos = reFind("([\w._]+)@([\w_]+(\.[\w_]+)+)", text)>
 <!--- Display the result --->
 <cfif foundPos gt 0>
  <cfoutput>
  <p>A match was found at position #foundPos#.</p>
  </cfoutput>
 <cfelse>
  <p>No matches were found.</p>
 </cfif>
 </body>
 </html>

If you visit this page with your browser, the character position of the email address is displayed (Figure 13.1). If you change the text variable so that it no longer contains an Internet-style email address, the listing displays "No matches were found."

Figure 13.1. Regular expressions can search for email addresses, phone numbers, and the like.

Ignoring Capitalization with `reFindNoCase()`

Internet email addresses aren't generally considered to be case-sensitive, so you might want to tell ColdFusion to perform the match without respect to case. To do so, use reFindNoCase() instead of reFind(). Both functions take the same arguments and are used in exactly the same way, so there's no need to provide a separate example listing for reFindNoCase().

In short, anywhere you see reFind() in this chapter, you could use reFindNoCase() instead, and vice versa. Just use the one that's appropriate for the task at hand. Also note that the (?i) modifier (see Table 13.12) makes a regular expression itself case-insensitive, so reFindNoCase() is never strictly necessary.
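
The following sketch compares the three options side by side. The sample text and variable names are invented for the example; the `(?i)` modifier is the one described in Table 13.12.

```cfml
<!--- Illustrative sample text; "ColdFusion" starts at position 7 --->
<cfset sample = "I use ColdFusion">

<!--- Case-sensitive: lowercase "coldfusion" does not appear in the text --->
<cfset caseSensitive = reFind("coldfusion", sample)>

<!--- Case-insensitive via the NoCase function --->
<cfset viaNoCase = reFindNoCase("coldfusion", sample)>

<!--- Case-insensitive via the (?i) modifier inside the expression --->
<cfset viaModifier = reFind("(?i)coldfusion", sample)>

<cfoutput>
<p>reFind(): #caseSensitive#</p>            <!--- 0 --->
<p>reFindNoCase(): #viaNoCase#</p>          <!--- 7 --->
<p>reFind() with (?i): #viaModifier#</p>    <!--- 7 --->
</cfoutput>
```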

Getting the Matched Text Using the Found Position

Sometimes you just want to find out whether a match exists within a chunk of text. In such a case, you would use the reFind() function as it was used in Listing 13.1.

You can also use that form of reFind() if the nature of the RegEx is such that the actual match will always have the same length. For instance, if you were searching specifically for a U.S. telephone number in the form (999)999-9999 (where each of the 9s represents a number), you could use the following regular expression:

 \([0-9]{3}\)[0-9]{3}-[0-9]{4}

Because the length of a matched phone number will always be the same due to the nature of phone numbers, it's a simple matter to extract the actual phone number that was found. You use ColdFusion's built-in mid() function, feeding it the position returned by the reFind() function (as shown in Figure 13.1) as the start position, and the number 13 as the length.

Listing 13.2 puts these concepts together, displaying the actual phone number found in text (Figure 13.2).

Listing 13.2. `RegExFindPhone1.cfm`: Using `mid()` to Extract the Matched Text

 <!---
  Filename: RegExFindPhone1.cfm
  Author: Nate Weiss (NMW)
  Purpose: Demonstrates basic use of reFind()
 --->
 <html>
 <head><title>Using a Regular Expression</title></head>
 <body>
 <!--- The text to search --->
 <cfset text = "My phone number is (718)555-1212. Call me anytime.">
 <!--- Attempt to find a match --->
 <cfset matchPos = reFind("(\([0-9]{3}\))([0-9]{3}-[0-9]{4})", text)>
 <!--- Display the result --->
 <cfif matchPos gt 0>
  <cfset foundString = mid(text, matchPos, 13)>
  <cfoutput>
  <p>A match was found at position #matchPos#.</p>
  <p>The actual match is: #foundString#</p>
  </cfoutput>
 <cfelse>
  <p>No matches were found.</p>
 </cfif>
 </body>
 </html>

Figure 13.2. If you know its length ahead of time, it's easy to display the matched text.

Getting the Matched Text Using `returnSubExpressions`

If you want to adjust the email address example in Listing 13.1 so that it displays the actual email address found, the task is a bit more complicated because not all email addresses are the same length. What would you supply to the third argument of the mid() function? You can't use a constant number in the manner shown in Listing 13.2. Clearly, you need some way of telling reFind() to return the length, in addition to the position, of the match.

This is when the returnSubExpressions argument comes into play. If you set this argument to True when you use reFind(), the function will return a structure that contains the position and length of the match. (The structure also includes the position and length that correspond to any subexpressions in the regular expression, but don't worry about that right now.)

Listing 13.3 shows how to use this parameter of the reFind() function. It uses the first element in the pos and len arrays to determine the position and length of the matched text and then displays the match (Figure 13.3).

Listing 13.3. `RegExFindEmail2.cfm`: Using `reFind()`'s `returnSubExpressions` Argument

 <!---
  Filename: RegExFindEmail2.cfm
  Author: Nate Weiss (NMW)
  Purpose: Demonstrates basic use of reFind()
 --->
 <html>
 <head><title>Using a Regular Expression</title></head>
 <body>
 <!--- The text to search --->
 <cfset text = "My email address is nate@nateweiss.com. Write to me anytime.">
 <!--- Attempt to find a match --->
 <cfset matchStruct = reFind("([\w._]+)\@([\w_]+(\.[\w_]+)+)", text, 1, True)>
 <!--- Display the result --->
 <cfif matchStruct.pos[1] gt 0>
  <cfset foundString = mid(text, matchStruct.pos[1], matchStruct.len[1])>
  <cfoutput>
  <p>A match was found at position #matchStruct.pos[1]#.</p>
  <p>The actual match is: #foundString#</p>
  </cfoutput>
 <cfelse>
  <p>No matches were found.</p>
 </cfif>
 </body>
 </html>

Figure 13.3. It's easy to display a matched substring, even if its length will vary at run time.

Working with Subexpressions

As exhibited by the last example, the first values in the pos and len arrays correspond to the position and length of the match found by the reFind() function. Those values (pos[1] and len[1]) will always exist. So why are pos and len implemented as arrays if the first value in each is the only interesting value? What other information do they hold?

The answer is this: If your regular expression contains any subexpressions, the pos and len arrays hold one additional value for each subexpression, giving the position and length of the text that the subexpression actually matched. If your regular expression has two subexpressions, pos[2] and len[2] are the position and length of the first subexpression's match, and pos[3] and len[3] are the position and length for the second subexpression.

So, what's a subexpression? When you are using regular expressions to solve specific problems (such as finding email addresses or phone numbers in a chunk of text), you are often looking for several different patterns of text, one after another. That is, the nature of the problem is often such that the regular expression is made up of several parts ("look for this, followed by that"), where all of the parts must be found in order for the whole regular expression to be satisfied. If you place parentheses around each of the parts, the parts become subexpressions.

Subexpressions do two things:

They make the overall RegEx criteria more flexible, because you can use many regular expression wildcards on each subexpression. This capability allows you to say that some subexpressions must be found while others are optional, or that a particular subexpression can be repeated multiple times, and so on. To put it another way, the parentheses allow you to work with the enclosed characters or wildcards as an isolated group. This isn't so different conceptually from the way parentheses work in <cfif> statements or SQL criteria.
The match for each subexpression is included in the len and pos arrays, so you can easily find out what specific text was actually matched by each part of your RegEx criteria. You get position and length information not only for the match as a whole, but for each of its constituent parts.

TIP

If you don't want a particular set of parentheses, or subexpressions, to be included in the len and pos arrays (that is, if you are only interested in the grouping properties of the parentheses and not in their returning-the-match properties), you can put a ?: right after the opening parenthesis. See Table 13.12 near the end of this chapter for details.

Table 13.12. Match Modifiers Supported in ColdFusion
MODIFIER	DESCRIPTION
`(?x)`	Allows you to write the rest of the expression with indentation, whitespace, and comments. A nice alternative to writing a very complex expression all on one long line (see example after this table).
`(?m)`	Tells the engine to use multiline mode for purposes of matching `^` and `$` (discussed in the preceding section, "Understanding Multiline Mode").
`(?i)`	Tells the engine to perform case-insensitive matches, regardless of whether you are using `reFind()` or `reFindNoCase()` (or, for that matter, `reReplace()` versus `reReplaceNoCase()`).
`?:`	Used at the beginning of a set of parentheses, tells the engine not to consider the value as a subexpression. That is, `(?:)` means that the parentheses will not add an item to the `len` and `pos` arrays (see Listing 13.4). The parentheses still behave normally in all other respects (for instance, a quantifier after a set of the parentheses still applies to everything within the set).
`?=`	Used at the beginning of a set of parentheses, tells the engine to match whatever is inside the parentheses using positive lookahead. This means you want to make sure that the pattern exists but that you don't need it to be part of the actual match.
`?!`	Used at the beginning of a set of parentheses, tells the engine to match whatever is inside the parentheses using negative lookahead, which means that you want to make sure that the pattern does not exist.
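
As a rough illustration of the `?:` modifier, the sketch below runs the chapter's email expression twice, once as written and once with the innermost parentheses marked as non-capturing, and compares the size of the resulting pos arrays. The shortened sample text and the variable names `withCapture` and `withoutCapture` are assumptions made for the example.

```cfml
<cfset text = "My email address is nate@nateweiss.com.">

<!--- Three capturing subexpressions --->
<cfset withCapture = reFind("([\w._]+)@([\w_]+(\.[\w_]+)+)", text, 1, true)>

<!--- Same pattern, but the innermost group uses ?: and is not reported --->
<cfset withoutCapture = reFind("([\w._]+)@([\w_]+(?:\.[\w_]+)+)", text, 1, true)>

<cfoutput>
<!--- 4 elements: the whole match plus three subexpressions --->
<p>Capturing inner group: #arrayLen(withCapture.pos)# elements</p>
<!--- 3 elements: the whole match plus two subexpressions --->
<p>Non-capturing inner group: #arrayLen(withoutCapture.pos)# elements</p>
</cfoutput>
```

Both expressions match the same text; only the amount of detail reported in the pos and len arrays differs.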

In real-world use, most regular expressions contain subexpressions; it's the nature of the beast. In fact, each of the regular expressions in the example listings shown so far has included subexpressions because the problems they are trying to solve (finding email addresses and phone numbers) require that they look for strings that consist of a few different parts.

Take a look at the regular expression used in Listing 13.3, which matches email addresses:

 ([\w._]+)@([\w_]+(\.[\w_]+)+)

I know you haven't learned what all the wildcards mean yet; for now, just concentrate on the parentheses. It may help you to keep in mind that the plain-English meaning of each of the [\w_]+ sequences is "match one or more letters, numbers, or underscores."

By concentrating on the parentheses, you can easily recognize the three subexpressions in this RegEx. The first is at the beginning and matches the portion of the email address up to the @ sign. The second subexpression begins after the @ sign and continues to the end of the RegEx; it matches the "domain name" portion of the email address. Within this second subexpression is a third one, which says that the domain name portion of the email address can contain any number of subparts (but at least one), where each subpart is made up of a dot and some letters (such as .com or .uk).

Now take a look at the RegEx from Listing 13.2, which matches phone numbers:

 (\([0-9]{3}\))([0-9]{3}-[0-9]{4})

This one has two subexpressions. You might have thought it has three because there appear to be three sets of parentheses. But the parentheses characters that are preceded by backslash characters don't count, because the backslash is a special escape character that tells the RegEx engine to treat the next character literally. Here, the backslashes tell ColdFusion to look for actual parentheses in the text, rather than treating those parentheses as delimiters for subexpressions.

So the phone number example includes just two subexpressions. The first subexpression starts at the very beginning and ends just after the \) characters, and it matches the area code portion of the phone number. The second subexpression contains the remainder of the phone number (three numbers followed by a hyphen, then four more numbers). Listing 13.4 returns to the email example and extracts the text matched by each of its subexpressions.

Listing 13.4. `RegExFindEmail3.cfm`: Getting the Matched Text for Each Subexpression

 <!---
  Filename: RegExFindEmail3.cfm
  Author: Nate Weiss (NMW)
  Purpose: Demonstrates basic use of reFind()
 --->
 <html>
 <head><title>Using a Regular Expression</title></head>
 <body>
 <!--- The text to search --->
 <cfset text = "My email address is nate@nateweiss.com. Write to me anytime.">
 <!--- Attempt to find a match --->
 <cfset matchStruct = reFind("([\w._]+)@([\w_]+(\.[\w_]+)+)", text, 1, True)>
 <!--- Display the result --->
 <cfif matchStruct.pos[1] gt 0>
  <!--- The first elements of the arrays represent the overall match --->
  <cfset foundString = mid(text, matchStruct.pos[1], matchStruct.len[1])>
  <!--- The subsequent elements represent each of the subexpressions --->
  <cfset userNamePart = mid(text, matchStruct.pos[2], matchStruct.len[2])>
  <cfset domainPart = mid(text, matchStruct.pos[3], matchStruct.len[3])>
  <cfset suffixPart = mid(text, matchStruct.pos[4], matchStruct.len[4])>
  <cfoutput>
  <p>A match was found at position #matchStruct.pos[1]#.<br>
  The actual email address is: <b>#foundString#</b><br>
  The username part of the address is: #userNamePart#<br>
  The domain part of the address is: #domainPart#<br>
  The suffix part of the address is: #suffixPart#<br>
  </p>
  </cfoutput>
 <cfelse>
  <p>No matches were found.</p>
 </cfif>
 </body>
 </html>

This listing is similar to the preceding one (Listing 13.3), except that instead of working with only the first values in the pos and len arrays, Listing 13.4 also works with the second, third, and fourth values. It displays the username, domain name, and domain suffix portions of the match, respectively (Figure 13.4).

Figure 13.4. Subexpressions are handy for matching portions of a RegEx.
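
The same technique applies to the two-subexpression phone number RegEx from Listing 13.2. The following is only a sketch under that assumption; the variable names `phoneMatch`, `areaCode`, and `localNumber` are invented for the example.

```cfml
<cfset text = "My phone number is (718)555-1212. Call me anytime.">

<!--- Same expression as Listing 13.2, but asking for subexpression details --->
<cfset phoneMatch = reFind("(\([0-9]{3}\))([0-9]{3}-[0-9]{4})", text, 1, true)>

<cfif phoneMatch.pos[1] gt 0>
  <!--- Element 2 belongs to the first subexpression (the area code) --->
  <cfset areaCode = mid(text, phoneMatch.pos[2], phoneMatch.len[2])>
  <!--- Element 3 belongs to the second subexpression (the rest of the number) --->
  <cfset localNumber = mid(text, phoneMatch.pos[3], phoneMatch.len[3])>
  <cfoutput>
  <p>Area code: #areaCode#</p>        <!--- (718) --->
  <p>Local number: #localNumber#</p>  <!--- 555-1212 --->
  </cfoutput>
<cfelse>
  <p>No matches were found.</p>
</cfif>
```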

TIP

If you need to know the number of subexpressions in a RegEx, you can use arrayLen() with either the pos or len array and then subtract 1 from the result (because the first value of each array is for the match as a whole). In Listing 13.4, you could output the value of arrayLen(matchStruct.pos)-1 to find the number of subexpressions in the email RegEx (the answer would be 3).

Working with Multiple Matches

So far, this chapter's listings have shown you how to find the first match in a given chunk of text. Often, that's all you need to do. There are times, however, when you might need to match multiple phone numbers, email addresses, or something else.

The reFind() and reFindNoCase() functions don't specifically provide any means to find multiple matches at once, but you can use the start argument mentioned in Table 13.3 to achieve the same result. Listing 13.5 shows how.

Listing 13.5. `RegExFindEmail4.cfm`: Finding Multiple Matches with a `<cfloop>` Block

 <!---
  Filename: RegExFindEmail4.cfm
  Author: Nate Weiss (NMW)
  Purpose: Demonstrates basic use of reFind()
 --->
 <html>
 <head><title>Using a Regular Expression</title></head>
 <body>
 <!--- The text to search --->
 <cfset text = "My email address is nate@nateweiss.com. Write to me anytime. "
  & "You can also use nate@nateweiss.co.uk or Weiss_Nate@nateweiss.com.">
 <!--- Start at the beginning of the text --->
 <cfset startPos = 1>
 <!--- Continue looping indefinitely (until a <cfbreak> is encountered) --->
 <cfloop condition="True">
  <!--- Attempt to find a match --->
  <cfset matchStruct =
   reFind("([\w._]+)@([\w_]+(\.[\w_]+)+)", text, startPos, True)>
  <!--- Break out of the loop if no match was found --->
  <cfif matchStruct.pos[1] eq 0>
   <cfbreak>
  <!--- Otherwise, display the match --->
  <cfelse>
   <!--- Advance the startPos so the next iteration finds the next match --->
   <cfset startPos = matchStruct.pos[1] + matchStruct.len[1]>
   <!--- The first elements of the arrays represent the overall match --->
   <cfset foundString = mid(text, matchStruct.pos[1], matchStruct.len[1])>
   <!--- The subsequent elements represent each of the subexpressions --->
   <cfset userNamePart = mid(text, matchStruct.pos[2], matchStruct.len[2])>
   <cfset domainPart = mid(text, matchStruct.pos[3], matchStruct.len[3])>
   <cfset suffixPart = mid(text, matchStruct.pos[4], matchStruct.len[4])>
   <cfoutput>
   <p>A match was found at position #matchStruct.pos[1]#.<br>
   The actual email address is: <b>#foundString#</b><br>
   The username part of the address is: #userNamePart#<br>
   The domain part of the address is: #domainPart#<br>
   The suffix part of the address is: #suffixPart#<br>
   </p>
   </cfoutput>
  </cfif>
 </cfloop>
 </body>
 </html>

The key difference between this listing and the preceding one is the addition of the startPos variable and the <cfloop> tags that now surround most of the code. (The loop uses a condition="True" attribute that causes the block to loop forever unless a <cfbreak> tag is encountered.)

At the beginning, startPos is set to 1. Then, within the loop, startPos is fed to the reFind() function, meaning that the first iteration of the loop will find matches starting from the beginning of the text.

If no match is found, <cfbreak> is used to break out of the loop. Otherwise, the pos[1] and len[1] values are combined to set startPos to the character position immediately following the match.

So, if the first match is found at position 50 and is 15 characters long, the next iteration of the loop will use a startPos of 65, thereby finding the next match (if any) in the text. The process will repeat until no match is found after startPos, at which point the <cfbreak> kicks in to end the loop. The result is a simple page that finds and displays multiple email addresses (Figure 13.5).

Figure 13.5. Using simple loops, you can easily find multiple matches.

CAUTION

Be careful when you use <cfloop> tags that use condition="True" in this manner. If your code doesn't include a <cfbreak> that is guaranteed to execute at some point, your loop will go on forever, occupying more and more of ColdFusion's time and resources. You would probably need to restart the server as a result.
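
One way to reduce that risk is to give the loop a real condition instead of condition="True". The sketch below wraps the looping technique from Listing 13.5 in a user-defined function; the function name `findAllMatches` and its argument names are hypothetical, not part of ColdFusion or of this chapter's listings. It stops when startPos passes the end of the text and always advances by at least one character, so an expression that can match an empty string cannot stall it.

```cfml
<!--- Hypothetical helper: returns an array of every substring that matches regex --->
<cffunction name="findAllMatches" returntype="array" output="false">
  <cfargument name="regex" type="string" required="true">
  <cfargument name="text" type="string" required="true">
  <cfset var matches = arrayNew(1)>
  <cfset var startPos = 1>
  <cfset var m = "">
  <!--- The loop ends on its own once startPos passes the end of the text --->
  <cfloop condition="startPos lte len(arguments.text)">
    <cfset m = reFind(arguments.regex, arguments.text, startPos, true)>
    <!--- Stop as soon as no further match exists --->
    <cfif m.pos[1] eq 0>
      <cfbreak>
    </cfif>
    <!--- Keep only non-empty matches --->
    <cfif m.len[1] gt 0>
      <cfset arrayAppend(matches, mid(arguments.text, m.pos[1], m.len[1]))>
    </cfif>
    <!--- Move past the match, advancing at least one character --->
    <cfset startPos = m.pos[1] + max(m.len[1], 1)>
  </cfloop>
  <cfreturn matches>
</cffunction>

<!--- Usage with the chapter's email expression and made-up sample text --->
<cfset emails = findAllMatches("([\w._]+)@([\w_]+(\.[\w_]+)+)",
  "Write to nate@nateweiss.com or nate@nateweiss.co.uk.")>
<!--- Displays: nate@nateweiss.com,nate@nateweiss.co.uk --->
<cfoutput>#arrayToList(emails)#</cfoutput>
```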

Replacing Text Using `reReplace()`

As you learned from Table 13.2, ColdFusion provides reReplace() and reReplaceNoCase() functions in addition to the reFind() and reFindNoCase() functions you've seen so far.

The reReplace() and reReplaceNoCase() functions each take three required arguments and one optional argument, as follows:

 reReplace(string, regex, substring [, scope ])

The meaning of each argument is explained in Table 13.4.

Table 13.4. `reReplace()` Function Syntax
ARGUMENT	DESCRIPTION
`string`	Required. The string in which you want to find matches.
`regex`	Required. The regular expression criteria you want to use to find matches.
`substring`	Required. The string that you want each match to be replaced with. You can use backreferences in the string to include pieces of the original match in the replacement.
`scope`	Optional. The default is `ONE`, which means that only the first match is replaced. You can also set this argument to `ALL`, which will cause all matches to be replaced.

The function returns the altered version of the string (the original string is not modified). Think of it as being like the replace() function on steroids, since the text you're looking for can be expressed using RegEx wildcards instead of a literal substring.
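
A short sketch of the `scope` argument follows, using a made-up sample string and variable names. It also shows that the original variable is left untouched.

```cfml
<cfset sample = "a1b22c333">

<!--- Default scope (ONE): only the first run of digits is replaced --->
<cfset firstOnly = reReplace(sample, "[0-9]+", "N")>

<!--- Scope ALL: every run of digits is replaced --->
<cfset everyMatch = reReplace(sample, "[0-9]+", "N", "ALL")>

<cfoutput>
<p>ONE: #firstOnly#</p>      <!--- aNb22c333 --->
<p>ALL: #everyMatch#</p>     <!--- aNbNcN --->
<p>Original: #sample#</p>    <!--- a1b22c333 (unchanged) --->
</cfoutput>
```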

NOTE

The syntax for both reReplace() and reReplaceNoCase() is the same. Anywhere you see one, you could use the other. Just use the function that's appropriate for the task, depending on how you want the replacement operation to behave in regard to capitalization. Remember, though, that the (?i) modifier makes any regular expression case-insensitive, so you never strictly need reReplaceNoCase().

Using `reReplace()` to Filter Posted Content

The next few examples will implement an editable home page for the fictitious Orange Whip Studios company. The basic idea is for the application to maintain a text message in the APPLICATION scope; this message appears on the home page. An edit link allows the user to type a new message in a simple form (Figure 13.6). When the form is submitted, the new message is displayed on the home page from that point forward (Figure 13.7). Listing 13.6 shows the simple logic for this example.

Listing 13.6. `EditableHomePage1.cfm`: Removing Text Based on a Regular Expression

 <!---
  Filename: EditableHomePage1.cfm
  Author: Nate Weiss (NMW)
  Purpose: Example of altering text with regular expressions
 --->
 <!--- Enable application variables --->
 <cfapplication name="OrangeWhipIntranet">
 <!--- Declare the HomePage variables and give them initial values --->
 <cfparam name="APPLICATION.homePage" default="#structNew()#">
 <cfparam name="APPLICATION.homePage.messageAsPosted" type="string" default="">
 <cfparam name="APPLICATION.homePage.messageToDisplay" type="string" default="">
 <!--- If the user is submitting an edited message --->
 <cfif isDefined("FORM.messageText")>
  <!--- First of all, remove all tags from the posted message --->
  <cfset messageWithoutTags = reReplace(FORM.messageText,
   "<[^>]*>", <!--- (matches tags) --->
   "", <!--- (replace with empty string) --->
   "ALL")>
  <!--- Save the "before" version of the new message --->
  <cfset APPLICATION.homePage.messageAsPosted = messageWithoutTags>
  <!---
   (other code will be added here in following examples)
  --->
  <!--- Save the "after" version of the new message --->
  <cfset APPLICATION.homePage.messageToDisplay = messageWithoutTags>
 </cfif>
 <!--- This include file takes care of displaying the actual page --->
 <!--- (including the message) or the form for editing the message --->
 <cfinclude template="EditableHomePageDisplay.cfm">

Figure 13.6. Users can edit the home page message with this simple form.

Figure 13.7. Regular expressions can be used to filter what gets displayed on the home page.

At the top of this listing, three application variables called homePage, homePage.messageAsPosted, and homePage.messageToDisplay are established. If the user is currently posting a new message, the <cfif> block executes. This block is responsible for saving the edited message. Inside the <cfif> block, the reReplace() function is used to find all HTML tags (or XML, CFML, or any other type of tag) and replace them with an empty string. In other words, all tags are removed from the user's message in order to prevent users from entering HTML that might look bad or generally mess things up.

NOTE

Once again, you have to take it on faith that the <[^>]*> regular expression used in this example is an appropriate one to use for removing tags from a chunk of text. For details, see the section "Crafting Your Own Regular Expressions" in this chapter.
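
To see the effect of that expression in isolation, here is a small sketch with an invented sample message. The variable names `posted` and `clean` are assumptions; the doubled quotation marks are simply how a quotation mark is written inside a CFML string.

```cfml
<!--- Illustrative message containing two pairs of tags --->
<cfset posted = "<b>Big</b> news: <a href=""news.cfm"">read more</a>">

<!--- Remove everything from a < up to the next > --->
<cfset clean = reReplace(posted, "<[^>]*>", "", "ALL")>

<!--- Displays: Big news: read more --->
<cfoutput>#htmlEditFormat(clean)#</cfoutput>
```

Keep in mind that the expression treats any text between a < and the next > as a tag, so a message such as "1 < 2 and 3 > 2" would also lose the text between those two characters.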

Once the tags have been removed, the resulting text is saved to the homePage.messageAsPosted and homePage.messageToDisplay variables, which will be displayed by the next listing. For now, the two variables will always hold the same value, but you will see a few different versions of this listing that save slightly different values in each.

Finally, a <cfinclude> tag is used to include the EditableHomePageDisplay.cfm template, shown in Listing 13.7. This code is responsible for displaying the message on the home page (as shown in Figure 13.7) or displaying the edit form (as shown in Figure 13.6) if the user clicks the edit link.

Listing 13.7. `EditableHomePageDisplay.cfm`: Form and Display Portion of Editable Home Page

 <!---
  Filename: EditableHomePageDisplay.cfm
  Author: Nate Weiss (NMW)
  Please Note: Included by the EditableHomePage.cfm examples
 --->
 <html>
 <head><title>Orange Whip Studios Home Page</title></head>
 <body>
 <cfoutput>
  <!--- Orange Whip Studios logo and page title --->
  <img src="/books/2/449/1/html/2/logo_c.gif" width="101" height="101" alt="" align="absmiddle">
  <b>Orange Whip Studios Home Page</b><br clear="all">
  <!--- Assuming that the user is not trying to edit the page --->
  <cfif not isDefined("URL.edit")>
   <!--- Display the home page message --->
   <p>#paragraphFormat(APPLICATION.homePage.messageToDisplay)#
   <!--- Provide a link to edit the message --->
   <p>[<a href="#CGI.script_name#?edit=Yes">edit message</a>]</p>
  <!--- If the user wants to edit the page --->
  <cfelse>
   <!--- Simple form to edit the home page message --->
   <form action="#CGI.script_name#" method="post">
   <!--- Text area for typing the new message --->
   <textarea
    name="messageText"
    cols="60"
    rows="10">#htmlEditFormat(APPLICATION.homePage.messageAsPosted)#</textarea><br>
   <!--- Submit button to save the message --->
   <input
    type="submit"
    value="Save Text">
   </form>
  </cfif>
 </cfoutput>
 </body>
 </html>

There is nothing particularly interesting about this listing. It's a simple file that either displays the home page or an edit form, as appropriate. Note that the homePage.messageToDisplay is what is normally displayed on the home page, whereas homePage.messageAsPosted is what appears in the edit form. Right now, these two values are always the same, but subsequent versions of Listing 13.6 will change that.

Clearly, you aren't limited to only removing the tags; you can replace them with any string you want. If you wanted the user to get a visual cue about the removal of any tags from the message, you could change the third argument of the reReplace() function so that the tags are replaced with a message such as [tags removed]. And in the next section, you'll learn how to use the RegEx backreference wildcard so that the actual match can be incorporated into the replacement string dynamically.
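
As a sketch of that variation, the snippet below changes only the third argument. The sample message and variable names are made up for illustration.

```cfml
<cfset posted = "Call <b>now</b>!">

<!--- Replace each tag with a visible marker instead of an empty string --->
<cfset flagged = reReplace(posted, "<[^>]*>", "[tags removed]", "ALL")>

<!--- Displays: Call [tags removed]now[tags removed]! --->
<cfoutput>#flagged#</cfoutput>
```

Because the opening and closing tags are separate matches, the marker appears once for each of them.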

NOTE

Of course, in a real application you wouldn't allow just anyone to edit the message on the home page. At a minimum, you would require a username and password to make sure that only the proper people had access to the edit form.

Altering Text with Backreferences

Listing 13.6 showed you how to use reReplace() to replace any matches for a regular expression with a replacement string (in that example, the replacement was an empty string). Using a simple replacement string is fine when you want to remove matches from a chunk of text, or to replace all matches with the same replacement string.

But what if you want the replacements to be more flexible, so that the replaced text is based somehow on the actual match? The reReplace() function supports backreferences, which allow you to do just that. A backreference is a special RegEx wildcard that can be used in the replacement string to represent the actual value of a subexpression. Backreferences are commonly used to alter or reformat the substrings matched by a regular expression.

In ColdFusion, you include backreferences in your replacement strings using \1, \2, \3, and so on, where the number after the backslash indicates the number of a subexpression. If your replacement string contains a \1, the actual value matched by the first subexpression (that is, the first parenthesized part of the RegEx) will appear in place of the \1. If the replacement includes \2, the result will have the value of the second subexpression in place of the \2, and so on.

TIP

Think of backreferences as a special kind of variable. For each actual match, these special variables are filled with the values of each subexpression that contributed to the match. The replacement is then made using the values of the special variables. The process is repeated for each match.
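
Here is a minimal sketch of the idea, using an invented example that is not one of the chapter's listings: two subexpressions capture a last name and a first name, and the replacement string puts them back in the opposite order.

```cfml
<cfset original = "Weiss, Nate">

<!--- \1 holds the first parenthesized match (Weiss), \2 the second (Nate) --->
<cfset swapped = reReplace(original, "(\w+), (\w+)", "\2 \1")>

<!--- Displays: Nate Weiss --->
<cfoutput>#swapped#</cfoutput>
```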

The next example listing is a new version of the earlier code (Listing 13.6) for tweaking the home page message submitted by users. This version uses backreferences to make two additional changes to the message posted by the user:

"Malformed" phone numbers are rearranged so that the area code appears in parentheses, in the form (999)999-9999. If the user enters a phone number as 800/555-1212 or 800 555 1212, it will be rearranged to read (800)555-1212.
Any email addresses in the text will be surrounded by "mailto" hyperlinks that activate the user's email client when clicked. If bfoxile@orangewhipstudios.com is found in the text, it will be changed to an <a> link that includes an href="mailto:bfoxile@orangewhipstudios.com" attribute.
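
The two alterations can be tried on their own before they are wired into the home page. The sketch below uses short sample strings and variable names chosen for illustration. The email replacement writes the href value in quotation marks, which have to be doubled inside a CFML string.

```cfml
<!--- Three groups of digits separated by a dash, slash, or space --->
<cfset phoneRegex = "([0-9]{3})[-/ ]([0-9]{3})[- ]([0-9]{4})">
<cfset phoneA = reReplace("Call 800/555-1212 today", phoneRegex, "(\1)\2-\3", "ALL")>
<cfset phoneB = reReplace("Call 800 555 1212 today", phoneRegex, "(\1)\2-\3", "ALL")>

<!--- The outermost parentheses make the whole address available as \1 --->
<cfset emailRegex = "(([\w._]+)@([\w_]+(\.[\w_]+)+))">
<cfset linked = reReplace("Write to bfoxile@orangewhipstudios.com",
  emailRegex, "<a href=""mailto:\1"">\1</a>", "ALL")>

<cfoutput>
<p>#phoneA#</p>  <!--- Call (800)555-1212 today --->
<p>#phoneB#</p>  <!--- Call (800)555-1212 today --->
<!--- htmlEditFormat() shows the generated markup instead of rendering the link --->
<p>#htmlEditFormat(linked)#</p>
</cfoutput>
```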

The user can type a message that contains phone numbers and email addresses (Figure 13.8); the home page will display a version of the message that has been altered in a reasonably intelligent and consistent fashion (Figure 13.9). Listing 13.8 shows the code for this new version of the home page example.

Listing 13.8. `EditableHomePage2.cfm`: Using Backreferences to Make Intelligent Alterations

 <!---
  Filename: EditableHomePage2.cfm
  Author: Nate Weiss (NMW)
  Purpose: Example of altering text with regular expressions
 --->

 <!--- Enable application variables --->
 <cfapplication name="OrangeWhipIntranet">

 <!--- Declare the HomePage variables and give them initial values --->
 <cfparam name="APPLICATION.homePage" default="#structNew()#">
 <cfparam name="APPLICATION.homePage.messageAsPosted" type="string" default="">
 <cfparam name="APPLICATION.homePage.messageToDisplay" type="string" default="">

 <!--- If the user is submitting an edited message --->
 <cfif isDefined("FORM.messageText")>

   <!--- First of all, remove all tags from the posted message --->
   <cfset FORM.messageText = reReplace(FORM.messageText,
     "<[^>]*>", <!--- (matches tags) --->
     "", <!--- (replace with empty string) --->
     "ALL")>

   <!--- Save the "before" version of the new message --->
   <cfset APPLICATION.homePage.messageAsPosted = FORM.messageText>

   <!--- Format any lazily-typed phone numbers in (999)999-9999 format --->
   <cfset FORM.messageText = reReplaceNoCase(FORM.messageText,
     "([0-9]{3})[-/ ]([0-9]{3})[- ]([0-9]{4})", <!--- (matches phone) --->
     "(\1)\2-\3", <!--- (phone format) --->
     "ALL")>

   <!--- Surround all email addresses with "mailto" links --->
   <cfset FORM.messageText = reReplaceNoCase(FORM.messageText,
     "(([\w._]+)@([\w_]+(\.[\w_]+)+))", <!--- (matches email addresses) --->
     "<a href=mailto:\1>\1</a>", <!--- (email address in link) --->
     "ALL")>

   <!--- Save the "after" version of the new message --->
   <cfset APPLICATION.homePage.messageToDisplay = FORM.messageText>
 </cfif>

 <!--- This include file takes care of displaying the actual page --->
 <!--- (including the message) or the form for editing the message --->
 <cfinclude template="EditableHomePageDisplay.cfm">

Figure 13.8. Regular expressions are used to scan for phone numbers and email addresses.

Figure 13.9. The phone numbers and email addresses are reformatted using RegEx backreferences.

Much of this listing is unchanged from the version in Listing 13.6. The difference is the addition of the second and third replacements, which use reReplaceNoCase() (the first reReplace() was in the previous version).

The second replacement is the one that reformats the phone numbers. Its regular expression contains three parenthesized subexpressions (which correspond to the area code, exchange, and last four digits of the phone number, respectively). Therefore, the \1 in the replacement string will contain the area code when an actual match is encountered, the \2 will contain the exchange portion of the phone number, and the \3 will contain the last four digits.
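As a quick worked example, the following sketch applies the same phone-number pattern and replacement string from Listing 13.8 to a hard-coded sentence. The sample text and the variable names (sample, formatted) are assumptions for illustration, not part of the listing.

```cfml
<!--- Hypothetical input containing the two "malformed" styles described earlier --->
<cfset sample = "Call 800/555-1212 or 800 555 1212.">

<!--- \1 = area code, \2 = exchange, \3 = last four digits --->
<cfset formatted = reReplace(sample,
  "([0-9]{3})[-/ ]([0-9]{3})[- ]([0-9]{4})",
  "(\1)\2-\3",
  "ALL")>

<cfoutput>#formatted#</cfoutput>
```

Both numbers should come out as (800)555-1212, because the separators sit outside the parentheses and are therefore discarded, while the three digit groups are carried into the replacement.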

The final replacement does something similar for email addresses. It is interested only in the match as a whole, so an additional set of parentheses has been added around the entire regular expression, making the entire RegEx a subexpression. Therefore, the entire match will appear in place of the \1 in the replacement string when this code executes. This is different from the behavior of reFind() and reFindNoCase() when returnSubExpressions is true: those functions return the entire match automatically, as the first element of the pos and len arrays. An alternative is to omit the extra set of parentheses and refer to each part of the email address separately in the replacement string, like so:

 <!--- Surround all email addresses with "mailto" links --->
 <!--- (\1 is the user name; \2 is the whole domain, which already includes the nested \3) --->
 <cfset FORM.messageText = reReplaceNoCase(FORM.messageText,
   "([\w._]+)@([\w_]+(\.[\w_]+)+)", <!--- (matches email addresses) --->
   "<a href=mailto:\1@\2>\1@\2</a>", <!--- (email address in link) --->
   "ALL")>
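If you are unsure which backreference number belongs to which part of a pattern, one way to check is to run the same RegEx through reFind() with returnSubExpressions turned on and print each piece. The sketch below does that for the email pattern; the variable names (address, matchInfo, groupNumber) are assumptions for illustration.

```cfml
<!--- Sample address taken from the discussion of the mailto links --->
<cfset address = "bfoxile@orangewhipstudios.com">
<cfset matchInfo = reFind("([\w._]+)@([\w_]+(\.[\w_]+)+)", address, 1, true)>

<cfoutput>
  <!--- Element 1 is the overall match; elements 2 and up are the subexpressions --->
  <cfloop from="1" to="#arrayLen(matchInfo.pos)#" index="i">
    <cfset groupNumber = i - 1>
    #groupNumber#: #mid(address, matchInfo.pos[i], matchInfo.len[i])#<br>
  </cfloop>
</cfoutput>
```

Subexpressions are numbered by the order of their opening parentheses, so for this address number 1 should be bfoxile, number 2 should be orangewhipstudios.com, and number 3 should be just .com, since it is nested inside number 2.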

NOTE

In Perl, you use $1, $2, and so on, rather than \1 and \2, because the $ is special to Perl.

NOTE

You can also use backreferences in the regular expression itself, often to match repeating patterns. For details, see the "Metacharacters 303: Backreferences Redux" section near the end of this chapter.

Altering Text Using a Loop

Sometimes you might want to make changes that are too complex to be made with a reReplace(), even using backreferences. In such a situation, you can use reFind() with its returnSubExpressions argument to loop over the matches (as in Listing 13.5), altering the original chunk of text as you go.

Listing 13.9 shows another distillation of the editable home-page logic. This code is similar to the last version (Listing 13.8), except that it now performs the replacement in a more manual fashion using CFML's removeChars() and insert() functions (at the end of the <cfloop> block).

Listing 13.9. `EditableHomePage3.cfm`: Making Changes Based on `reFind()` Results

 <!---
  Filename: EditableHomePage3.cfm
  Author: Nate Weiss (NMW)
  Purpose: Example of altering text with regular expressions
 --->

 <!--- Enable application variables --->
 <cfapplication name="OrangeWhipIntranet">

 <!--- Declare the HomePage variables and give them initial values --->
 <cfparam name="APPLICATION.homePage" default="#structNew()#">
 <cfparam name="APPLICATION.homePage.messageAsPosted" type="string" default="">
 <cfparam name="APPLICATION.homePage.messageToDisplay" type="string" default="">

 <!--- If the user is submitting an edited message --->
 <cfif isDefined("FORM.messageText")>

   <!--- First of all, remove all tags from the posted message --->
   <cfset FORM.messageText = reReplace(FORM.messageText,
     "<[^>]*>", <!--- (matches tags) --->
     "", <!--- (replace with empty string) --->
     "ALL")>

   <!--- Save the "before" version of the new message --->
   <cfset APPLICATION.homePage.messageAsPosted = FORM.messageText>

   <!--- Now work on any email addresses within the text --->
   <!--- Start at the beginning of the text --->
   <cfset startPos = 1>

   <!--- Continue looping indefinitely (until a <cfbreak> is encountered) --->
   <cfloop condition="True">

     <!--- Find email addresses --->
     <cfset matchStruct = reFindNoCase("([\w._]+)@([\w_]+(\.[\w_]+)+)",
       FORM.messageText, startPos, True)>

     <!--- Break out of the loop if no match was found --->
     <cfif matchStruct.pos[1] eq 0>
       <cfbreak>

     <!--- Otherwise, process this match --->
     <cfelse>
       <!--- The first elements of the arrays represent the overall match --->
       <cfset foundString =
         mid(FORM.messageText, matchStruct.pos[1], matchStruct.len[1])>

       <!--- Try to find email address in the database --->
       <cfquery name="emailQuery" datasource="ows">
         SELECT FirstName, LastName
         FROM Contacts
         WHERE EMail = '#foundString#'
       </cfquery>

       <!--- If the email address was found in the database --->
       <cfif emailQuery.recordCount eq 1>
         <cfset linkText = '<a href="mailto:#foundString#">'
           & "#emailQuery.FirstName# #emailQuery.LastName#</a>">
       <!--- If it was not found --->
       <cfelse>
         <cfset linkText =
           '<a href="mailto:#foundString#">#foundString#</a>'>
       </cfif>

       <!--- Remove the matched email address from the message --->
       <cfset FORM.messageText =
         removeChars(FORM.messageText, matchStruct.pos[1], matchStruct.len[1])>

       <!--- Insert the email link in its place --->
       <cfset FORM.messageText =
         insert(linkText, FORM.messageText, matchStruct.pos[1]-1)>

       <!--- Advance the startPos so the next iteration finds the next match --->
       <cfset startPos = matchStruct.pos[1] + len(linkText)>
     </cfif>
   </cfloop>

   <!--- Save the "after" version of the new message --->
   <cfset APPLICATION.homePage.messageToDisplay = FORM.messageText>
 </cfif>

 <!--- This include file takes care of displaying the actual page --->
 <!--- (including the message) or the form for editing the message --->
 <cfinclude template="EditableHomePageDisplay.cfm">

NOTE

The value of the startPos variable is now advanced based on the length of the replacement string, rather than the length of the original match. This is necessary because the replacement operations may change the overall length of the chunk of text as the loop does its work.
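The same bookkeeping can be seen in isolation in the stripped-down sketch below, which wraps every run of digits in square brackets. The sample sentence and the variable names (message, matchInfo, found, replacement) are assumptions for illustration; the loop structure mirrors the reFind(), removeChars(), and insert() calls described above.

```cfml
<!--- Hypothetical sample text --->
<cfset message = "Rooms 12 and 345 are open">
<cfset startPos = 1>

<cfloop condition="true">
  <!--- Look for the next run of digits, starting at startPos --->
  <cfset matchInfo = reFind("[0-9]+", message, startPos, true)>
  <cfif matchInfo.pos[1] eq 0>
    <cfbreak>
  </cfif>

  <!--- Build a replacement that is longer than the original match --->
  <cfset found = mid(message, matchInfo.pos[1], matchInfo.len[1])>
  <cfset replacement = "[" & found & "]">

  <!--- Swap the match for the replacement --->
  <cfset message = removeChars(message, matchInfo.pos[1], matchInfo.len[1])>
  <cfset message = insert(replacement, message, matchInfo.pos[1] - 1)>

  <!--- Resume just past the inserted text, not past the original match --->
  <cfset startPos = matchInfo.pos[1] + len(replacement)>
</cfloop>

<cfoutput>#message#</cfoutput>
```

The result should be Rooms [12] and [345] are open. If startPos were advanced by the length of the original match instead, the next search would begin inside the text that was just inserted and would find the digits again, so the loop would keep wrapping the same number.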

Within the loop, this version of the code checks each email address to see if it's in the Contacts table of the OWS example database. If so, the portion of the mailto link between the <a> tags will show the person's first and last names, rather than just the email address (Figure 13.10). If you want to test this feature, be sure to use ben@forta.com in your text.

Figure 13.10. Ben's email address is in the database, but Raymond's isn't.