Using Regular Expressions for Checking Input

For simple data validation, you can use code like the code I showed earlier, which relied on simple string comparisons. For complex data, however, you need higher-level constructs, such as regular expressions. The following C# code shows how to use regular expressions to replace the C++ extension-checking code. This code uses the System.Text.RegularExpressions namespace in the .NET Framework:

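    using System.Text.RegularExpressions;
    ...
    static bool IsOKExtension(string Filename) {
        // Group the alternatives so the period and $ apply to every extension.
        Regex r = new Regex(@"\.(?:txt|rtf|gif|jpg|bmp)$", RegexOptions.IgnoreCase);
        return r.Match(Filename).Success;
    }

The same code in Perl looks like this:

    sub isOkExtension($) {
        $_ = shift;
        # Returns -1 (true) for an allowed extension, 0 otherwise.
        return /\.(?:txt|rtf|gif|jpg|bmp)$/i ? -1 : 0;
    }

I'll go into language specifics later in this chapter. For now, let me explain how this works. The core of the expression is the string \.(?:txt|rtf|gif|jpg|bmp)$. The components are described in Table 10-1.

Table 10-1. Some Simple Regular Expression Elements
Element	Comments
\.	Matches a literal period.
(?:pattern)	Groups the enclosed pattern without capturing it.
xxx|yyy	Matches either xxx or yyy.
$	Matches the input end.

If the search string matches a period, then one of the file extensions, and then the end of the filename, the expression returns true. The grouping matters: | has the lowest precedence of the regular expression operators, so without the parentheses the $ would apply only to bmp, and a name such as txtfile.exe would pass the check. Also note that the C# code sets the RegexOptions.IgnoreCase option, because filenames in Microsoft Windows are case-insensitive.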
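To see how alternation and the end anchor interact, here is a minimal sketch in Python, whose re module stands in for the .NET and Perl engines. The function names are hypothetical, and I am assuming the intent is "a period, one of the listed extensions, then the end of the filename." Because | binds more loosely than anything else in a pattern, an ungrouped pattern such as txt|rtf|gif|jpg|bmp$ anchors only bmp; the sketch compares it with a grouped form.

```python
import re

# Ungrouped: "$" applies only to the last alternative, "bmp".
UNGROUPED = re.compile(r"txt|rtf|gif|jpg|bmp$", re.IGNORECASE)

# Grouped: the period and "$" apply to every alternative.
GROUPED = re.compile(r"\.(?:txt|rtf|gif|jpg|bmp)$", re.IGNORECASE)


def is_ok_extension(filename: str) -> bool:
    """Return True if filename ends in one of the allowed extensions."""
    return GROUPED.search(filename) is not None


def is_ok_extension_ungrouped(filename: str) -> bool:
    """Same check with the ungrouped pattern, for comparison only."""
    return UNGROUPED.search(filename) is not None


if __name__ == "__main__":
    for name in ("report.TXT", "photo.jpg", "txtfile.exe", "notes.txt.exe"):
        print(name, is_ok_extension(name), is_ok_extension_ungrouped(name))
```

The ungrouped check accepts txtfile.exe and notes.txt.exe because txt appears somewhere in each name; the grouped check rejects both.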

Table 10-2 offers a more complete list of regular expression elements. Note that some of these elements are implemented in some programming languages and not in others.

Table 10-2. Common Regular Expression Elements
Element	Comments
^	Matches the start of the string.
$	Matches the end of the string.
*	Matches the preceding pattern zero or more times. Same as {0,}.
+	Matches the preceding pattern one or more times. Same as {1,}.
?	Matches the preceding pattern zero times or one time. Same as {0,1}.
{n}	Matches the preceding pattern exactly n times.
{n,}	Matches the preceding pattern n or more times.
{,m}	Matches the preceding pattern no more than m times. Not every engine accepts this form; {0,m} is the portable equivalent.
{n,m}	Matches the preceding pattern between n and m times.
.	Matches any single character, except \n.
(pattern)	Matches and stores (captures) the resulting data in a variable. The variable used to store the captured data differs depending on the programming language. Can also be used as a group; for example, (xx)+ will find one or more instances of the pattern inside the parentheses. If you wish only to group, you can use the noncapture parentheses syntax (?:xx) to instruct the regular expression engine not to capture the data.
aa|bb	Matches aa or bb.
[abc]	Matches any one of the enclosed characters: a, b or c.
[^abc]	Matches any character not in the enclosed list.
[a-z]	A range of characters or values. Matches any character from a to z.
\	The escape character. Some escapes are special characters (\n and \/), and others represent predefined character sequences (\d). It can also be used as a reference to previously captured data (\1).
\b	Matches a word boundary, such as the position between a word and a space.
\B	Matches a nonword boundary.
\d	Matches a digit, same as [0-9].
\D	Matches a nondigit, same as [^0-9].
\n, \r, \f, \t, \v	Special formatting characters: new line (line feed), carriage return, form feed, tab, and vertical tab.
\p{category}	Matches a Unicode category; this is covered in detail later in this chapter.
\s	Matches a white-space character; same as [ \f\n\r\t\v].
\S	Matches a non-white-space character; same as [^ \f\n\r\t\v].
\w	Matches a word character; same as [a-zA-Z0-9_].
\W	Matches a nonword character; same as [^a-zA-Z0-9_].
\xnn or \x{nn}	Matches a character represented by two hexadecimal digits, nn.
\unnnn or \x{nnnn}	Matches a Unicode code point, represented by four hexadecimal digits, nnnn. I use the term code point because of surrogate characters. Not every code point is a character; surrogates use two code points to represent a character. Refer to Chapter 14, "Internationalization Issues," for more information about surrogates.
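A few of the elements in Table 10-2 are easier to grasp when you watch them run. The following is a small sketch in Python; the variable names are mine, and other engines can differ in detail (for instance, in how captured data is exposed), so treat it as an illustration, not a reference.

```python
import re

# Quantifiers: a{2,3} accepts two or three a's and nothing else.
two_or_three = [s for s in ("a", "aa", "aaa", "aaaa")
                if re.fullmatch(r"a{2,3}", s)]

# Capturing vs. noncapturing groups: only (\d+) is stored; (?:\d+) just groups.
captured = re.fullmatch(r"(\d+)-(?:\d+)", "123-456").groups()

# Backreference: \1 must repeat exactly what group 1 captured.
doubled = re.fullmatch(r"(\w+) \1", "hello hello") is not None
not_doubled = re.fullmatch(r"(\w+) \1", "hello world") is not None

# Character classes: [aeiou] matches one listed character,
# [^0-9] matches one character that is not listed.
vowels_removed = re.sub(r"[aeiou]", "", "regular expression")
non_digit_runs = re.findall(r"[^0-9]+", "ab12cd")

# \b marks a word boundary, so "cat" is found only as a whole word.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

if __name__ == "__main__":
    print(two_or_three, captured, doubled, not_doubled)
    print(vowels_removed, non_digit_runs, whole_words)
```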

Let's look at some examples in Table 10-3 to make this a little more concrete.

Table 10-3. Regular Expression Examples
Pattern	Comments
[a-fA-F0-9]+	Match one or more hexadecimal digits.
<(.*)>.*<\/\1>	Match an HTML tag. Note the first tag is captured (.*) and used to check the closing tag using \1. So if (.*) is form, then \1 is also form.
\d{5}(-\d{4})?	U.S. ZIP Code.
^\w{1,32}(?:\.\w{0,4})?$	A valid but restrictive filename. 1-32 word characters, followed by an optional period and 0-4 character extension. The opening and closing parentheses, ( and ), group the period and extension, but the extension is not captured because the ?: is used. Note: I have used the ^ and $ characters to define the start and end of the input. There's an explanation of why later in this chapter.
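The patterns in Table 10-3 can be exercised with a short sketch like the one below. It is written in Python under a couple of assumptions of mine: re.ASCII is set so that \w and \d mean [a-zA-Z0-9_] and [0-9] as the tables describe, and the constant names are hypothetical. Because the hexadecimal and ZIP Code patterns carry no anchors of their own, the sketch uses fullmatch when it wants to validate a whole string.

```python
import re

HEX = re.compile(r"[a-fA-F0-9]+")
TAG = re.compile(r"<(.*)>.*<\/\1>")
ZIP = re.compile(r"\d{5}(-\d{4})?", re.ASCII)
FILENAME = re.compile(r"^\w{1,32}(?:\.\w{0,4})?$", re.ASCII)

# Hexadecimal digits: the whole string must consist of hex digits.
hex_ok = HEX.fullmatch("DEADbeef01") is not None
hex_bad = HEX.fullmatch("xyz") is not None

# HTML tag: group 1 holds the tag name, and \1 forces the closing tag to agree.
tag_name = TAG.fullmatch("<form>x</form>").group(1)
mismatched = TAG.search("<b>x</i>")

# ZIP Code: five digits, optionally followed by a hyphen and four more.
zip_results = [ZIP.fullmatch(z) is not None
               for z in ("98052", "98052-6399", "9805")]

# Restrictive filename: anchored, so stray characters cause a rejection.
name_results = [FILENAME.search(n) is not None
                for n in ("readme.txt", "c:\\boot.ini", "a.toolong")]

if __name__ == "__main__":
    print(hex_ok, hex_bad, tag_name, mismatched)
    print(zip_results, name_results)
```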

Be Careful of What You Find: Did You Mean to Validate?

Regular expressions serve two main purposes. The first is to find data; the second, and the one we're mainly interested in, is to validate data. When someone enters a filename, I don't want to find the filename in the request; I want to validate that the request is for a valid filename. Allow me to explain. Look at this pseudocode that determines whether a filename is valid or not:

    RegExp r = [a-z]{1,8}\.[a-z]{1,3};
    if (r.Match(strFilename).Success) {
        // Cool! Allow access to strFilename; it's valid.
    } else {
        // Tut! tut! Trying to access an invalid file.
    }

This code will allow a request only for filenames composed of 1-8 lowercase letters, followed by a period, followed by 1-3 lowercase letters (the file extension). Or will it? Can you spot the flaw in the regular expression? What if a user makes a request for the c:\boot.ini file? Will it pass the regular expression check? Yes, it will. The reason is that the expression looks for any instance in the filename request that matches the expression. In this case, the expression will find the series of letters boot.ini within c:\boot.ini. However, the request is clearly invalid.
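You can watch the unanchored expression "find" instead of "validate" with a few lines of Python. This is only a sketch: found_fragment is a hypothetical helper of mine, and Python's search plays the role of Match in the pseudocode. It returns whatever substring the pattern latched onto.

```python
import re

# The unanchored pattern from the pseudocode.
UNANCHORED = re.compile(r"[a-z]{1,8}\.[a-z]{1,3}")


def found_fragment(request: str):
    """Return the substring the unanchored pattern matched, or None."""
    m = UNANCHORED.search(request)
    return m.group(0) if m else None


if __name__ == "__main__":
    for request in (r"c:\boot.ini", "../../etc/passwd.txt", "BOOT"):
        print(repr(request), "->", found_fragment(request))
```

For c:\boot.ini the helper returns boot.ini: the engine skipped the drive letter, colon, and backslash and reported success on what was left.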

The solution is to create an expression that parses the entire filename to look for a valid request. To do that, we need to change the expression to read as follows:

^[a-z]{1,8}\.[a-z]{1,3}$

The ^ means start of the input, and $ means end of the input. You can best think about the new expression as "from the beginning to the end of the request, allow only 1-8 lowercase letters, followed by a period, followed by 1-3 lowercase letters, and nothing more." Obviously, c:\boot.ini is invalid because the : and \ characters are invalid and do not comply with the regular expression.
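Here is the anchored check as a minimal Python sketch; is_valid_filename is a hypothetical name. One engine-specific detail is worth knowing: in Python's re module, $ also matches just before a newline at the very end of the string, so the sketch uses fullmatch for the strictest reading of "and nothing more." Other engines have their own rules for $, so check the documentation for the one you use.

```python
import re

ANCHORED = re.compile(r"^[a-z]{1,8}\.[a-z]{1,3}$")
BODY = re.compile(r"[a-z]{1,8}\.[a-z]{1,3}")


def is_valid_filename(request: str) -> bool:
    """Accept the request only if the whole string fits the pattern."""
    return BODY.fullmatch(request) is not None


# The anchored pattern rejects the path that fooled the unanchored one.
path_rejected = ANCHORED.search("c:\\boot.ini") is None

# In Python, "$" tolerates one trailing newline; fullmatch does not.
dollar_accepts_newline = ANCHORED.search("boot.ini\n") is not None
fullmatch_accepts_newline = is_valid_filename("boot.ini\n")

if __name__ == "__main__":
    print(is_valid_filename("boot.ini"), path_rejected)
    print(dollar_accepts_newline, fullmatch_accepts_newline)
```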