Regular Expressions


Advanced

A regular expression is used to define a pattern of characters. These patterns can be used for many purposes, including string manipulation, user input validation, and searching and replacing. Typically, the pattern represented by the regular expression is matched against a string. In the simplest example, the regular expression is a literal string of characters, for example, moth. This simple regular expression pattern matches against a string that contains it, for example, mothra. The regular expression fails to match a string that doesn’t contain it, such as godzilla.

Although this example seems simple enough, you shouldn’t underestimate the importance of regular expressions, which provide access to powerful algorithms that are the natural and easy answer to many programming problems. It’s also a great deal of fun to learn to program with regular expressions!

In JavaScript, regular expressions are deployed using methods of the String object and the regular expression object (the regular expression object is called RegExp). In this section, we’ll only look at using regular expressions with the String object. I won’t explain the methods associated with the RegExp object.

Regular expressions are literals enclosed within a pair of slash characters, for example:

 /harold/ 

In other words, regular expressions are literals delimited by slashes in the same way that strings are delimited by quotation marks. So /moth/ is to regular expressions as "moth" is to strings.

To Create a Regular Expression:

  1. Assign a pattern to a variable. For example:

     var pattern = /mothra/; 

Within regular expressions, alphabetic and numerical characters represent themselves. Here are some examples of this simple kind of regular expression:

 /mothra/  /m/  /1234/  /Ma65/ 
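As a minimal sketch of the moth example from the start of this section, the search method of the String object (listed in Table 9-3) returns the position of the first match, or -1 when there is no match. The variable names are only for illustration.

```javascript
// A literal pattern matches any string that contains it.
var pattern = /moth/;

var found = "mothra".search(pattern);      // 0: the match starts at the first character
var missing = "godzilla".search(pattern);  // -1: the pattern doesn't occur in the string
```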

In addition, there are many special characters that have meaning as part of the grammar (or rules) of regular expressions. These special characters and the rules for creating regular expressions are fairly intricate, and I won’t go into them in great detail in this section. Instead, I’ll show you a few simple examples involving regular expressions so that you can get a feel for them and begin to have some appreciation for what they can do for you and your programs.

In other words, the whole topic of regular expressions is pretty involved. I hesitated to even bring it up in Learn How to Program. This is a tough topic, and I don’t want you to get confused. But ultimately, I decided you should at least be exposed to the concept of a regular expression and get some feeling for what it can do.

If you look at the rules for creating regular expressions presented later in this section and decide that they’re too complicated to be fun, fine! You can skip the material. However, you should at least get a feeling for what regular expressions can do so that if you encounter a programming problem that cries out for their use, you can then figure out how to use them.

On the other hand, if you like regular expressions, you can go ahead and learn them in depth!

You should also know that when you’ve seen the regular expression engine in one language, you’ve pretty much seen the regular expression engine in all computer languages. Regular expressions work in much the same way in JavaScript, Java, Perl, Visual Basic, and C#, although each language has its own small differences in syntax and features.

start sidebar
The Man Who Invented Regular Expressions

The concept of the regular expression was invented in the 1950s by Stephen Kleene, a lanky mathematician and logician. Kleene, who was a remarkably tall man and enjoyed mountain climbing in his spare time, was inspired in his work by Gödel and Turing.

The first use of regular expressions in computers was as part of compilers. A compiler is a program that converts code written in a high-level computer language to instructions that the computer can understand. Regular expressions were (and are) useful in compilers because they help the compiler to recognize the elements that it’s processing and to make sure they’re syntactically correct, also called well-formed.

end sidebar

String Regular Expression Methods

The String object has four methods that are used with regular expressions. Table 9-3 describes these methods.

Table 9-3: String Object Regular Expression Methods

Method       Description
match()      Performs pattern matching with a string.
replace()    Searches and replaces within a string using a regular expression.
search()     Searches within a string using a regular expression.
split()      Splits a string into an array using a delimiter (as explained earlier in this chapter). The delimiter can be a regular expression, which is very cool.
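Here is a short sketch of the four methods in Table 9-3 at work. The sample strings and variable names are mine, and the \d+ in the last pattern (one or more digits) uses syntax that is explained later in this section.

```javascript
var text = "Mothra met Godzilla";

// match() returns an array describing the match, or null if there is none.
var matched = text.match(/Godzilla/);           // matched[0] is "Godzilla"

// replace() returns a new string; the original string is unchanged.
var replaced = text.replace(/met/, "fought");   // "Mothra fought Godzilla"

// search() returns the position of the first match, or -1.
var position = text.search(/met/);              // 7

// split() accepts a regular expression as the delimiter.
var parts = "a1b22c333d".split(/\d+/);          // ["a", "b", "c", "d"]
```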

Basic Regular Expression Syntax

An alphanumeric character within a regular expression matches itself as in the moth examples at the beginning of this section. In another example, the regular expression v matches the v in the string love.

This basic character matching, called a literal character match, is at the core of the concept of a regular expression. But let’s kick it up several notches! Things are about to get more complicated.

Besides alphanumeric characters, you can match many nonalphanumeric characters using escape sequences.

Table 9-4 shows regular expression literal character matches for both alphanumeric characters and nonalphanumeric characters.

Table 9-4: Regular Expression Characters and Character Sequences, and Their Matches

Character/Character Sequence   Matches
Alphabetic (a-z and A-Z)       Itself
Numeric (0-9)                  Itself
\b                             Backspace within a [] character class (character classes are discussed shortly); outside a character class but within a regular expression, it means a word boundary
\f                             Form feed
\n                             New line
\r                             Carriage return
\t                             Tab
\/                             Slash (literal /)
\\                             Backslash (literal \)
\.                             .
\*                             *
\+                             +
\?                             ?
\|                             |
\(                             (
\)                             )
\[                             [
\]                             ]
\{                             {
\}                             }
\xxx                           The character specified by the octal number xxx
\xnn                           The character specified by the hexadecimal number nn

Attributes

There’s an exception to the rule that regular expression patterns appear within the forward slash delimiters. Two regular expression attributes, which may be combined, can be placed after the final forward slash. These are as follows:

  • i means perform a case-insensitive match.

  • g means find all occurrences of the pattern match, not just the first. This is termed a global match.

As an example:

 /mOTh/i 

matches mothra because the regular expression is case insensitive.
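The following sketch shows both attributes in use. The sample strings are made up for illustration; the replace method is used for the g attribute because it makes the difference between the first match and every match easy to see.

```javascript
// i: ignore case when matching.
var caseInsensitive = "mothra".search(/mOTh/i);   // 0
var caseSensitive = "mothra".search(/mOTh/);      // -1

// g: act on every occurrence, not just the first.
var firstOnly = "moth moth moth".replace(/moth/, "bat");    // "bat moth moth"
var everyOne = "moth moth moth".replace(/moth/g, "bat");    // "bat bat bat"

// The two attributes can be combined.
var combined = "Moth MOTH moth".replace(/moth/gi, "bat");   // "bat bat bat"
```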

Character Classes

Individual literal characters can be combined into character classes in regular expressions. Character classes are contained within square brackets. A character class matches any one of the characters it contains, so a match occurs when at least one character in the comparison string is a member of the class.

To Use a Character Class:

Place the characters that are the members of the class within square brackets to create a regular expression. For example:

 /[1234]/ 

The characters within a class can be specified using ranges, rather than by specific enumeration. A hyphen is used to indicate a range. Here are some examples:

 /[a-z]/            // means all lowercase characters from a to z
 /[a-zA-L]/         // means all lowercase characters from a to z and all
                    // uppercase characters between A and L
 /[a-zA-Z0-9]/      // means all lowercase and uppercase letters, and all
                    // numerals

Thus, /[a-zA-L]/ wouldn’t produce a match with the string XYZ. But it would match the string xyz. By the way, you may have noticed that you could use the case attribute instead of separately listing uppercase and lowercase ranges. /[a-z]/i is the equivalent of /[a-zA-Z]/.

Negating a Character Class

A character class can be negated. A negated class matches any character except those defined within brackets.

To Negate a Character Class:

Place a caret (^) as the first character inside the left bracket of the class. For example:

 /[^a-zA-Z]/ 

In this example, the regular expression /[^a-zA-Z]/ will match if and only if the comparison string contains at least one nonalphabetic character. abcABC123 is a match, but abcABC is not.
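To check the character class examples for yourself, a sketch along these lines should do. It uses the search method, which returns -1 when nothing in the string matches; the variable names are only for illustration.

```javascript
// A class with two ranges: lowercase a to z, uppercase A to L.
var lowerOrAtoL = /[a-zA-L]/;
var upperXYZ = "XYZ".search(lowerOrAtoL);     // -1: X, Y, and Z are outside both ranges
var lowerXYZ = "xyz".search(lowerOrAtoL);     // 0: x is in the range a to z

// A negated class: any character that is not a letter.
var nonAlpha = /[^a-zA-Z]/;
var withDigits = "abcABC123".search(nonAlpha);   // 6: the position of the 1
var lettersOnly = "abcABC".search(nonAlpha);     // -1: every character is a letter
```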

Common Character Class Representations

Because some character classes are frequently used, JavaScript regular expression syntax provides special sequences that are shorthand representations of these classes. Square brackets aren’t used with most of these special character “abbreviations.”

Table 9-5 shows the sequences that can be used for character classes.

Table 9-5: Character Class Sequences and Their Meanings

Character Sequence   Matches
[...]                Any one character between the square brackets.
[^...]               Any one character not between the brackets.
.                    Any one character other than new line. Equivalent to [^\n].
\w                   Any one letter, number, or underscore. Equivalent to [a-zA-Z0-9_].
\W                   Any one character other than a letter, number, or underscore. Equivalent to [^a-zA-Z0-9_].
\s                   Any one space character or other white space character. Equivalent to [ \t\n\r\f\v].
\S                   Any one character other than a space or other white space character. Equivalent to [^ \t\n\r\f\v].
\d                   Any one digit. Equivalent to [0-9].
\D                   Any one character that is not a digit. Equivalent to [^0-9].

For example, the pattern /\W/ matches a string containing a hyphen (-), but it fails against a string containing only letters (such as abc).

In another example, /\s/ matches a string containing a space, such as mothra and godzilla. But /\s/ fails against strings that don’t contain white space characters, such as antidisestablishmentarianism.
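These shorthand sequences can be tried out with a sketch like the following. The string well-known and the string room 101 are my own examples; the others come from the text above.

```javascript
// \W: any character that is not a letter, number, or underscore.
var hyphenAt = "well-known".search(/\W/);    // 4: the position of the hyphen
var noNonWord = "abc".search(/\W/);          // -1

// \s: any white space character.
var spaceAt = "mothra and godzilla".search(/\s/);            // 6
var noSpace = "antidisestablishmentarianism".search(/\s/);   // -1

// \d: any digit.
var digitAt = "room 101".search(/\d/);       // 5
```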

Repeating Elements

So far, if you wanted to match a multiple number of characters, the only way to achieve this using a regular expression would be to enumerate each character. For example, /\d\d/ would match any two-digit number. And /\w\w\w\w/ would match any four-letter alphanumeric string such as love or 1234.

This isn’t good enough. In addition to being cumbersome, it doesn’t allow complex pattern matches involving varied numbers of characters. For example, you might want to match a number between two and six digits in length or a pair of letters followed by a number of any length.

This kind of “wildcard” pattern is specified in JavaScript regular expressions using curly braces ({}). The curly braces follow the pattern element that’s to be repeated and specify the number of times the pattern element is to be repeated.

In addition, there are some special characters that are used to specify common types of repetition.

Table 9-6 shows both the curly brace syntax and the special repetition characters.

Table 9-6: Syntax for Repeating Pattern Elements

Repetition Syntax   Meaning
{n,m}               Match the preceding element at least n times but no more than m times.
{n,}                Match the preceding element n or more times.
{n}                 Match the preceding element exactly n times.
?                   Match the preceding element zero or one times. In other words, the element is optional. Equivalent to {0,1}.
+                   Match one or more occurrences of the preceding element. Equivalent to {1,}.
*                   Match zero or more occurrences of the preceding element. In other words, the element is optional but can also appear multiple times. Equivalent to {0,}.
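The two problems mentioned earlier (a number between two and six digits long, and a pair of letters followed by a number of any length) can be sketched with the syntax in Table 9-6. The \b word boundaries are my addition, to keep a run of digits from matching as part of a longer number, and the sample strings are made up.

```javascript
// {n,m}: between two and six digits.
var twoToSixDigits = /\b\d{2,6}\b/;
var tooShort = "7".search(twoToSixDigits);         // -1
var justRight = "1234".search(twoToSixDigits);     // 0
var tooLong = "1234567".search(twoToSixDigits);    // -1

// {n} and +: a pair of letters followed by a number of any length.
var lettersThenNumber = /[a-zA-Z]{2}\d+/;
var code = "AB12345".match(lettersThenNumber)[0];  // "AB12345"

// ?: the u is optional.
var american = "color".search(/colou?r/);          // 0
var british = "colour".search(/colou?r/);          // 0

// * allows zero occurrences; + requires at least one.
var starZero = "ggle".search(/go*gle/);            // 0
var plusZero = "ggle".search(/go+gle/);            // -1
```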

Organizing Patterns

The JavaScript regular expression syntax provides special characters that allow you to organize patterns. These characters are shown in Table 9-7 and are explained in a little more detail following the table.

Table 9-7: Alternation, Grouping, and Reference Characters

Character   Meaning
|           Alternation. This matches the character or subexpression to the left or right of the | character.
(...)       Groups several items into a unit, or subexpression, that can be used with repetition syntax and referred to later in an expression.
\n          Where n is a number, matches the same characters that were matched by subexpression number n (for example, \1 refers to the first subexpression).

Alternation

The pipe character (|) is used to indicate an alternative. For example, the regular expression /jaws|that|bite/ matches the three strings jaws, that, or bite. In another example, /\d{2}|[A-Z]{4}/ matches either two digits or four capital letters.

Grouping

Parentheses are used to group elements in a regular expression. Once items have been grouped into subexpressions, they can be treated as a single element using repetition syntax. For example, /visual(basic)?/ matches visual followed by the optional basic.
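Here is a brief sketch of alternation and grouping using the patterns just described. The sample strings are mine; match returns an array whose zero element is the text that matched.

```javascript
// Alternation: the first alternative found in the string is the match.
var monster = "the jaws that bite".match(/jaws|that|bite/)[0];   // "jaws"

var digitsOrCaps = /\d{2}|[A-Z]{4}/;
var twoDigits = "12".search(digitsOrCaps);      // 0
var fourCaps = "ABCD".search(digitsOrCaps);     // 0
var neither = "abc".search(digitsOrCaps);       // -1

// Grouping: the ? applies to the whole group (basic), not just one character.
var vb = /visual(basic)?/;
var withBasic = "visualbasic".match(vb)[0];         // "visualbasic"
var withoutBasic = "visual studio".match(vb)[0];    // "visual"
```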

Referring to Subexpressions

Parentheses are also used to refer back to a subexpression that’s part of a regular expression. Each subexpression that has been grouped in parentheses is internally assigned an identification number. The subexpressions are numbered from left to right, using the position of the left parenthesis to determine order. Nesting of subexpressions is allowed.

Subexpressions are referred to using a backslash followed by a number. So \1 means the first subexpression, \2 the second, and so on.

A reference to a subexpression matches the same characters that were originally matched by the subexpression.

For example, the regular expression:

 /['"][^'"]*['"]/ 

matches a string that starts with a single or double quote and ends with a single or double quote. (The middle element, [^'"]*, matches any number of characters, provided they’re not single or double quotes.)

This expression doesn’t distinguish between the two kinds of quotes. A comparison string that started with a double quote and ended with a single quote would match this expression. For example:

 "Ohana means family. Nobody gets left behind or forgotten.' 

This, which starts with a double quote and ends with a single quote, matches the regular expression pattern I just showed you, even though it isn’t symmetrical in respect to the kinds of quotation marks used.

Try This at Home Note

An improved result would be to have a match depend on the kind of quote with which the match began. If it begins with a double quote, it should end with a double quote; likewise, if it starts with a single quote, it should end with a single quote. As an exercise, if you’re having fun with regular expressions, go ahead and implement this!
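If you’d like to compare notes, here is one possible approach to this exercise, offered as a sketch rather than the only answer. It puts the opening quote in parentheses so that it becomes subexpression 1, and then uses \1 to require the same character at the end. The name sameQuote is mine.

```javascript
// (['"]) captures the opening quote; \1 must match that same character again.
var sameQuote = /(['"])[^'"]*\1/;

var doubleOk = "\"Ohana\"".search(sameQuote);             // 0
var singleOk = "'Ohana'".search(sameQuote);               // 0
var mixed = "\"Ohana means family.'".search(sameQuote);   // -1: opens with " but closes with '
```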

If you’ve worked your way through the material I’ve just shown you, you now know everything you always wanted to know about regular expressions but were afraid to ask!

It’s time to hit the pedal to the metal and the rubber to the road. Let’s work through a few examples that show how regular expressions can be used.

Matching a Date

Suppose you have a program that asks the user to input a date. The problem is that you need to be sure that the user has actually input a date in the format that your program requires. The answer is to use a regular expression to make sure the text string entered by the user matches the format your program needs.

Note

In the real world, there are a number of other ways to handle this problem. The best solution might be to only allow visual inputting of dates via a calendar interface. That way, not only could you make sure that the format was right, you could also make sure that the data entered was actually a date.

To Match a Date in mm/dd/yyyy Format:

  1. Use \/ to represent a slash within the pattern (see Table 9-4 for an explanation of this literal character sequence).

  2. Use parentheses to group the digits that represent month, days, and year:

     (\d{2})  (\d{2})  (\d{4}) 

  3. Combine the slash sequence with the month, days, and year:

     (\d{2})\/(\d{2})\/(\d{4}) 

  4. Add a \b at the beginning and end of the regular expression so that the date starts at a word boundary and the fourth digit of the year ends at one:

     /\b(\d{2})\/(\d{2})\/(\d{4})\b/ 

  5. The regular expression is now complete. To use it, first create a function to check the date format against a text string passed to the function:

     function checkDate(testStr) {  } 

  6. Within the function, assign the regular expression to a variable named pattern:

     var pattern = /\b(\d{2})\/(\d{2})\/(\d{4})\b/; 

  7. Use the match method of the string passed to the function, with the regular expression, to check to see if there’s a match and display a message accordingly:

     var result = testStr.match(pattern);
     if (result != null)
        return "Well done. This looks like a date in the specified format!";
     else
        return "For shame! You didn't input a date in the specified pattern.";

  8. Set up an HTML form with a text box for the user to input a string to test for “dateness” and a button with an onClick event handler to launch the checking function as shown in Listing 9-8.

    Listing 9-8: Using a Regular Expression to Match a Date Format

    start example
     <HTML>
     <HEAD>
     <TITLE>Can I have a date, please?</TITLE>
     <SCRIPT>
     function checkDate(testStr) {
        var pattern = /\b(\d{2})\/(\d{2})\/(\d{4})\b/;
        var result = testStr.match(pattern);
        if (result != null)
           return "Well done. This looks like a date in the specified format!";
        else
           return "For shame! You didn't input a date in the specified pattern.";
     }
     </SCRIPT>
     </HEAD>
     <BODY>
     <H1>
     Check a date format today!
     </H1>
     <FORM name="theForm">
     <TABLE>
     <tr>
     <td colspan=4>
     Enter a date in mm/dd/yyyy format:
     </td>
     </tr>
     <tr>
     <td colspan = 4>
     <INPUT type=text name=testStr size=20 maxlength=10>
     </td>
     </tr>
     <tr>
     <td colspan=4>
     <INPUT type=button name="theButton" value="Verify the Format"
        onClick="alert(checkDate(document.theForm.testStr.value))">
     </td>
     </tr>
     </TABLE>
     </FORM>
     </BODY>
     </HTML>
    end example

  9. Open the HTML page that includes the regular expression and user interface, shown completely in Listing 9-8, in a Web browser.

  10. Enter something that is manifestly not a date, and click Verify the Format. The appropriate message will display (see Figure 9-8).

    Figure 9-8: Using the regular expression pattern shows that the user input isn’t in date format.

  11. Enter a string correctly formatted as a mm/dd/yyyy date, and click Verify the Format. This time, the input will be recognized as being in “date” format (see Figure 9-9).

    Figure 9-9: It’s easy to use regular expressions to make sure that user input is in the proper format.
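The pattern only checks the format, so 99/99/9999 passes. Because the month, day, and year are grouped in parentheses, the array returned by match also holds each of them (in elements 1, 2, and 3), which makes a rough range check possible. The function below is a hypothetical sketch of that idea, not part of Listing 9-8, and it doesn’t know about the number of days in each month or about leap years.

```javascript
// Hypothetical helper: checks the format and then the rough ranges.
function checkDateRanges(testStr) {
   var pattern = /\b(\d{2})\/(\d{2})\/(\d{4})\b/;
   var result = testStr.match(pattern);
   if (result == null)
      return false;
   // result[1] and result[2] hold the text matched by the first two groups.
   var month = parseInt(result[1], 10);
   var day = parseInt(result[2], 10);
   return month >= 1 && month <= 12 && day >= 1 && day <= 31;
}
```

With this sketch, checkDateRanges("12/25/2003") returns true, and checkDateRanges("99/99/9999") returns false.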

Trimming a String

You might think that trimming a string is something like trimming a tree or providing a holiday meal with all the trimmings! But no, in fact, trimming a string means to remove leading and/or trailing space characters from a string. This is an operation that often takes place within programs. In fact, it’s so common that many languages have built-in trimming functions. JavaScript had none when this chapter was written, although current versions of the language provide a trim method on the String object. Either way, it’s easy to create your own trimming functions using regular expressions.

Tip

You could create trimming functions without using regular expressions, but regular expressions make it very easy!

A left trim removes leading space characters. A right trim removes trailing space characters. If you simply trim, you remove both leading and trailing spaces from a string.

I’ve organized the trim example so that the trim function calls both the right trim function and the left trim function, thus removing both leading and trailing blanks. Although the sample application is used for full trimming, you could easily break out the left trim or right trim functions if you needed to use them.

Here’s the regular expression used for a left trim (it matches the first nonspace character and then continues to the end of the string):

 var pattern = /[^\s]+.*/; 

Here’s the regular expression used for a right trim (it matches everything up to the trailing spaces):

 var pattern = /.*[\S]/; 

To understand this example, you should know that the match method of the String object, which accepts a regular expression to make the matches, returns a results array. The first element of the results array (its zero element) contains the first pattern match made with the regular expression, which is what’s used in this example to return the trimmed string.

Have fun trimming!

To Trim a String Using Regular Expressions:

  1. Create a form with two text inputs, one for an input string and the other for the trimmed version of the input string:

     <FORM name="theForm">
     <TABLE>
     <tr>
     <td colspan=4>
     Enter string for trimming:
     </td>
     </tr>
     <tr>
     <td colspan = 4>
     <INPUT type=text name=testStr size=60>
     </td>
     </tr>
     ...
     <tr>
     <td colspan=4>
     <br>
     <hr>
     Here's the trimmed string:
     <br>
     </td>
     </tr>
     <tr>
     <td colspan = 4>
     <INPUT type=text name=display size=60>
     </td>
     </tr>
     </TABLE>
     </FORM>

  2. Create an input button that invokes a function named trim using the input string as the function argument and displaying the trimmed result:

     <INPUT type=button name="theButton" value="Trim"
        onClick="document.theForm.display.value =
        trim(document.theForm.testStr.value)">

  3. Create a trim function that returns the trimmed value using a left trim (ltrim) and right trim (rtrim) function:

     function trim(testStr) {
        return rtrim(ltrim(testStr));
     }

  4. Create the scaffolding for the ltrim function:

     function ltrim(testStr) {  } 

  5. Add code to ltrim() that returns an empty string if the input was empty:

     function ltrim(testStr) {
        if (testStr == "")
           return "";
        else {
        }
     }

    Without this code, the function will generate a run-time error when the pattern match is attempted against an empty string, because match returns null when there is no match and null has no [0] element.

  6. Construct the regular expression that matches anything that’s not blank and then continues to the end of the string:

     var pattern = /[^\s]+.*/; 

  7. Use the match method of the String object to match the regular expression against the input string and obtain a result array:

     function ltrim(testStr) {
        if (testStr == "")
           return "";
        else {
           var pattern = /[^\s]+.*/;
           var result = testStr.match(pattern);
        }
     }

  8. Return the string that matches the pattern (the [0] element of the result array):

     function ltrim(testStr) {
        if (testStr == "")
           return "";
        else {
           var pattern = /[^\s]+.*/;
           var result = testStr.match(pattern);
           // match returns null when the string is all white space
           if (result == null)
              return "";
           return result[0];
        }
     }

  9. Construct the scaffolding for the rtrim function, along with the check for empty input:

     function rtrim(testStr) {
        if (testStr == "")
           return "";
        else {
        }
     }

  10. Create a pattern that matches anything and everything up until the first of the trailing blanks:

     var pattern = /.*[\S]/; 

  11. Use the match method to obtain a regular expression match and return the first element (result[0]) of the result array:

     function rtrim(testStr) {
        if (testStr == "")
           return "";
        else {
           var pattern = /.*[\S]/;
           var result = testStr.match(pattern);
           // match returns null when the string is all white space
           if (result == null)
              return "";
           return result[0];
        }
     }

  12. Save the page and open it in a browser (the complete code is shown in Listing 9-9).

    Listing 9-9: Trimming a String Using Regular Expressions

    start example
     <HTML>
     <HEAD>
     <TITLE>Strim a tring</TITLE>
     <SCRIPT>
     function ltrim(testStr) {
        if (testStr == "")
           return "";
        else {
           var pattern = /[^\s]+.*/;
           var result = testStr.match(pattern);
           // match returns null when the string is all white space
           if (result == null)
              return "";
           return result[0];
        }
     }
     function rtrim(testStr) {
        if (testStr == "")
           return "";
        else {
           var pattern = /.*[\S]/;
           var result = testStr.match(pattern);
           // match returns null when the string is all white space
           if (result == null)
              return "";
           return result[0];
        }
     }
     function trim(testStr) {
        return rtrim(ltrim(testStr));
     }
     </SCRIPT>
     </HEAD>
     <BODY>
     <H1>
     Trim a string today!
     </H1>
     <FORM name="theForm">
     <TABLE>
     <tr>
     <td colspan=4>
     Enter string for trimming:
     </td>
     </tr>
     <tr>
     <td colspan = 4>
     <INPUT type=text name=testStr size=60>
     </td>
     </tr>
     <tr>
     <td colspan=3>
     <INPUT type=button name="theButton" value="Trim"
        onClick="document.theForm.display.value =
        trim(document.theForm.testStr.value)">
     </td>
     <td>
     <INPUT type=button name="theButton" value="Clear"
        onClick='document.theForm.testStr.value="";
        document.theForm.display.value=""'>
     </td>
     </tr>
     <tr>
     <td colspan=4>
     <br>
     <hr>
     Here's the trimmed string:
     <br>
     </td>
     </tr>
     <tr>
     <td colspan = 4>
     <INPUT type=text name=display size=60>
     </td>
     </tr>
     </TABLE>
     </FORM>
     </BODY>
     </HTML>
    end example

  13. Enter a text string for trimming.

  14. Click Trim. The results of trimming will appear as shown in Figure 9-10.

    Figure 9-10: Regular expressions can be used to trim a string.
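As the Tip above says, there’s more than one way to trim. The sketch below is a different technique from the one in Listing 9-9: instead of matching the part to keep, it uses the replace method to delete the white space. It relies on two anchor characters that aren’t covered in the tables in this section: ^ matches at the start of the string and $ matches at the end. The function names are hypothetical.

```javascript
// Remove leading white space by replacing it with an empty string.
function ltrimReplace(testStr) {
   return testStr.replace(/^\s+/, "");
}

// Remove trailing white space the same way.
function rtrimReplace(testStr) {
   return testStr.replace(/\s+$/, "");
}

function trimReplace(testStr) {
   return rtrimReplace(ltrimReplace(testStr));
}
```

Because replace simply returns the original string when nothing matches, this version needs no special check for empty input.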

Many Other Uses for Regular Expressions!

There are many, many other uses for regular expressions.

Here are a couple of examples off the top of my head.

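The following regular expression will match many simple URLs that include a protocol, for example, http://www.apress.com or http://www.bearhome.com (the path after the domain name is grouped and made optional so that URLs without a trailing slash also match):

 /\w+:\/\/[\w.]+(\/\S*)?/ 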

But suppose you need to parse out the parts of a URL, for example, to determine a domain name from a URL. This kind of chore is often done to provide information about Web traffic, to analyze Web server logs, to create Web crawlers, or to perform automated navigation on behalf of a user.

How would you write a regular expression that could be used to parse out the component parts of a URL?
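One way to approach this, sketched here under some simplifying assumptions, is to group each part of the URL in parentheses and read the parts back from the array that match returns. The function name parseUrl and the property names are hypothetical, and the pattern doesn’t handle port numbers, hyphens in domain names, or other details of real URLs.

```javascript
// Hypothetical helper: splits a simple URL into protocol, host, and path.
function parseUrl(url) {
   var pattern = /(\w+):\/\/([\w.]+)(\/\S*)?/;
   var result = url.match(pattern);
   if (result == null)
      return null;
   return {
      protocol: result[1],
      host: result[2],
      // The path group is optional, so it may not have matched anything.
      path: result[3] || ""
   };
}

var parsed = parseUrl("http://www.apress.com/book/catalog");
// parsed.protocol is "http", parsed.host is "www.apress.com",
// and parsed.path is "/book/catalog"
```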

Another place that regular expressions are useful is in parsing words out of a text string. In the Playing with strings example earlier in this chapter, I showed you how to parse the words out of a text string using a space character as the delimiter between words. I noted that this was a pretty crude way to determine whether something was or wasn’t a word and that using a space as the only delimiter failed to recognize some “words” correctly. You can use a regular expression as a better way to match words in a string.

To Match a Word:

Use the regular expression:

 /\b[A-Za-z'-]+\b/ 

You still have to think carefully about what exactly a “word” is. This regular expression defines a word to include apostrophes and hyphens but not digits or underscores (so 42nd would fail the match).
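As a sketch of how this pattern might be used, adding the g attribute makes the match method return every word it finds rather than just the first. The sample sentence is made up.

```javascript
// With the g attribute, match() returns an array of all the matches.
var wordPattern = /\b[A-Za-z'-]+\b/g;
var words = "Mothra's well-known rival is Godzilla".match(wordPattern);
// words is ["Mothra's", "well-known", "rival", "is", "Godzilla"]

// Digits aren't part of this definition of a word.
var ordinal = "42nd".search(/\b[A-Za-z'-]+\b/);   // -1
```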

Regular expressions are also powerful tools to use when you need to search for text. Many text editors and command-line tools, such as grep, accept a regular expression as the search pattern.




Learn How to Program Using Any Web Browser
ISBN: 1590591135
EAN: 2147483647
Year: 2006
Pages: 115
Authors: Harold Davis
