Advanced Regular Expressions

   

Practical Programming in Tcl & Tk, Third Edition
By Brent B. Welch

Chapter 11.  Regular Expressions


The syntax added by advanced regular expressions is mostly just shorthand notation for constructs you can make with the basic syntax already described. There are also some new features that add power: nongreedy quantifiers, back references, look-ahead patterns, and named character classes. If you are just starting out with regular expressions, you can ignore most of this section, except for the one about backslash sequences. Once you master the basics, or if you are already familiar with regular expressions in Tcl (or the UNIX vi editor or grep utility), then you may be interested in the new features of advanced regular expressions.

Compatibility with Patterns in Tcl 8.0

Advanced regular expressions add syntax in an upward compatible way. Old patterns continue to work with the new matcher, but advanced regular expressions will raise errors if given to old versions of Tcl. For example, the question mark is used in many of the new constructs, and it is artfully placed in locations that would not be legal in older versions of regular expressions. The added syntax is summarized in Table 11-2 on page 145.

If you have unbraced patterns from older code, they are very likely to be correct in Tcl 8.1 and later versions. For example, the following pattern picks out everything up to the next newline. The pattern is unbraced, so Tcl substitutes the newline character for each occurrence of \n. The square brackets are quoted so that Tcl does not think they delimit a nested command:

 regexp "(\[^\n\]+)\n" $input 

The above command behaves identically when using advanced regular expressions, although you can now also write it like this:

 regexp {([^\n]+)\n} $input 

The curly braces hide the brackets from the Tcl parser, so they do not need to be escaped with backslash. This saves us two characters and looks a bit cleaner.
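
As a quick check, the following sketch runs both forms against the same input. The sample text and the variable names (match1, line1, and so on) are made up for illustration:

```tcl
# Hypothetical sample input: two lines, each ending in a newline.
set input "first line\nsecond line\n"

# Quoted pattern: Tcl substitutes \n and the escaped brackets before regexp runs.
regexp "(\[^\n\]+)\n" $input match1 line1

# Braced pattern: the regular expression engine interprets \n itself.
regexp {([^\n]+)\n} $input match2 line2

# Both forms should capture the same first line.
puts [list $line1 $line2]
```

Either way the first subexpression should hold the text of the first line without its newline.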

Backslash Escape Sequences

The most significant change in advanced regular expression syntax is backslash substitutions. In Tcl 8.0 and earlier, a backslash was only used to turn off special characters such as: . + * ? [ ]. Otherwise it was ignored. For example, \n was simply n to the Tcl 8.0 regular expression engine. This was a source of confusion, and it meant you could not always quote patterns in braces to hide their special characters from Tcl's parser. In advanced regular expressions, \n now means the newline character to the regular expression engine, so you should never need to let Tcl do backslash processing.

Again, always group your pattern with curly braces to avoid confusion.

Advanced regular expressions add a lot of new backslash sequences. They are listed in Table 11-4 on page 146. Some of the more useful ones include \s, which matches space-like characters, \w, which matches letters, digits, and the underscore, \y, which matches the beginning or end of a word, and \B, which matches a backslash.
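
A small sketch of \w, \s, and \y in use follows. The input strings and variable names are hypothetical:

```tcl
# Hypothetical input line with irregular spacing around the equal sign.
set text "set  count_1 = 42"

# \w+ picks up a run of letters, digits, and underscores;
# \s* skips any white space on either side of the equal sign.
regexp {(\w+)\s*=\s*(\w+)} $text match name value
puts "$name -> $value"

# \y matches only at a word boundary, so the "cat" inside
# "concat" and "category" is not counted.
set count [regexp -all {\ycat\y} "cat concat cat's category"]
puts $count
```

Because the pattern is in braces, each backslash sequence reaches the regular expression engine unchanged.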

Character Classes

Character classes are names for sets of characters. The named character class syntax is valid only inside a bracketed character set. The syntax is

 [:identifier:] 

For example, alpha is the name for the set of uppercase and lowercase letters. The following two patterns are almost the same:

 [A-Za-z] [[:alpha:]] 

The difference is that the alpha character class also includes accented characters like è. If you match data that contains non-ASCII characters, the named character classes are more general than trying to name the characters explicitly.
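
The difference shows up as soon as the data contains an accented letter. This is a minimal sketch with a made-up sample word:

```tcl
# "caf\u00e9" is the word "cafe" with an accented final e.
# Double quotes let Tcl substitute the Unicode escape in the data.
set word "caf\u00e9"

# The explicit ASCII range does not cover the accented letter.
set ascii [regexp {^[A-Za-z]+$} $word]

# The named class accepts any character that Tcl classifies as a letter.
set named [regexp {^[[:alpha:]]+$} $word]

puts [list $ascii $named]
```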

There are also backslash sequences that are shorthand for some of the named character classes. The following patterns to match digits are equivalent:

 [0-9] [[:digit:]] \d 

The following patterns match space-like characters including backspace, form feed, newline, carriage return, tab, and vertical tab:

 [ \b\f\n\r\t\v] [[:space:]] \s 

The named character classes and the associated backslash sequence are listed in Table 11-3 on page 146.

You can use character classes in combination with other characters or character classes inside a character set definition. The following patterns match letters, digits, and underscore:

 [[:digit:][:alpha:]_] [\d[:alpha:]_] [[:alnum:]_] \w 

Note that \d, \s and \w can be used either inside or outside character sets. When used outside a bracketed expression, they form their own character set. There are also \D, \S, and \W, which are the complement of \d, \s, and \w. These escapes (i.e., \D for not-a-digit) cannot be used inside a bracketed character set.

There are two special character classes, [[:<:]] and [[:>:]], that match the beginning and end of a word, respectively. A word is defined as one or more characters that match \w.
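
The following sketch tries the combined character sets and the word-boundary classes on some made-up strings. The variable names are illustrative only:

```tcl
# Hypothetical identifier made of letters, digits, and an underscore.
set id "item_42"

# Four spellings of "letters, digits, and underscore", each anchored
# so that the whole string must match.
set results {}
foreach pattern {
    {^[[:digit:][:alpha:]_]+$}
    {^[\d[:alpha:]_]+$}
    {^[[:alnum:]_]+$}
    {^\w+$}
} {
    lappend results [regexp $pattern $id]
}
puts $results

# A hyphen is not a word character, so this test fails.
set hyphen [regexp {^\w+$} "item-42"]

# [[:<:]] and [[:>:]] constrain the match to whole words.
set words [regexp -all {[[:<:]]cat[[:>:]]} "cat concat cat"]
puts [list $hyphen $words]
```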

Nongreedy Quantifiers

The *, +, and ? characters are quantifiers that specify repetition. By default these match as many characters as possible, which is called greedy matching. A nongreedy match will match as few characters as possible. You can specify nongreedy matching by putting a question mark after these quantifiers. Consider the pattern to match "one or more of not-a-newline followed by a newline." The not-a-newline must be explicit with the greedy quantifier, as in:

 [^\n]+\n 

Otherwise, if the pattern were just

 .+\n 

then the "." could well match newlines, so the pattern would greedily consume everything until the very last newline in the input. A nongreedy match would be satisfied with the very first newline instead:

 .+?\n 

By using the nongreedy quantifier we've cut the pattern from eight characters to five. Another example that is shorter with a nongreedy quantifier is the HTML example from page 138. The following pattern also matches everything between <td> and </td>:

 <td>(.*?)</td> 

Even ? can be made nongreedy, ??, which means it prefers to match zero instead of one. This only makes sense inside the context of a larger pattern. Send me e-mail if you have a compelling example for it!
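
To see greedy and nongreedy matching side by side, consider this sketch. The input strings and variable names are invented for the comparison:

```tcl
# Hypothetical three-line input.
set input "one\ntwo\nthree\n"

# Greedy: . also matches newlines, so .+ runs to the last newline.
regexp {.+\n} $input greedy

# Nongreedy: the match stops at the first newline.
regexp {.+?\n} $input lazy

puts [list [string length $greedy] [string length $lazy]]

# The same contrast with two table cells on one line.
set html "<td>a</td><td>b</td>"
regexp {<td>(.*)</td>} $html -> greedyCell
regexp {<td>(.*?)</td>} $html -> lazyCell
puts [list $greedyCell $lazyCell]
```

The greedy cell pattern swallows the markup between the two cells, while the nongreedy one stops at the first closing tag.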

Bound Quantifiers

The {m,n} syntax is a quantifier that means match at least m and at most n of the previous matching item. There are two variations on this syntax. A simple {m} means match exactly m of the previous matching item. A {m,} means match m or more of the previous matching item. All of these can be made nongreedy by adding a ? after them.
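
A few bound quantifiers in action, as a sketch with made-up inputs and variable names:

```tcl
# {m}: exactly three digits, a dash, then exactly four digits.
set ok    [regexp {^\d{3}-\d{4}$} "555-1234"]
set short [regexp {^\d{3}-\d{4}$} "55-1234"]

# {m,}: two or more digits in a row; the lone "1" does not qualify.
regexp {\d{2,}} "a1b2345" run

# {m,n}: between two and four digits, greedy and then nongreedy.
regexp {\d{2,4}} "123456" most
regexp {\d{2,4}?} "123456" least

puts [list $ok $short $run $most $least]
```

Note that the braces of the quantifier nest inside the braces that quote the pattern, so no extra escaping is needed.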

Back References

A back reference is a feature you cannot easily get with basic regular expressions. A back reference matches the value of a subpattern captured with parentheses. If you have several sets of parentheses you can refer back to different captured expressions with \1, \2, and so on. You count by left parentheses to determine the reference.

For example, suppose you want to match a quoted string, where you can use either single or double quotes. You need to use an alternation of two patterns to match strings that are enclosed in double quotes or in single quotes:

 ("[^"]*"|'[^']*') 

With a back reference, \1, the pattern becomes simpler:

 ('|").*?\1 

The first set of parentheses matches the leading quote, and then the \1 refers back to that particular quote character. The nongreedy quantifier ensures that the pattern matches up to the first occurrence of the matching quote.
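
Both versions of the quoted-string pattern can be tried on a string that mixes the two kinds of quotes. The sample text and variable names here are hypothetical:

```tcl
# A double-quoted string that contains a single quote.
set text {say "it's here" now}

# Alternation: one branch per kind of quote.
regexp {("[^"]*"|'[^']*')} $text viaAlternation

# Back reference: \1 must repeat whichever quote opened the string,
# so the single quote inside the string does not end the match.
regexp {('|").*?\1} $text viaBackref quote

puts [list $viaAlternation $viaBackref $quote]
```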

Look-ahead

Look-ahead patterns are subexpressions that are matched but do not consume any of the input. They act like constraints on the rest of the pattern, and they typically occur at the end of your pattern. A positive look-ahead causes the pattern to match only if the look-ahead subexpression also matches at that point. A negative look-ahead causes the pattern to match only if the subexpression would not match. These constraints make more sense in the context of matching variables and in regular expression substitutions done with the regsub command. For example, the following pattern matches a filename that begins with A and ends with .txt:

 ^A.*\.txt$ 

The next version of the pattern adds parentheses to group the file name suffix.

 ^A.*(\.txt)$ 

The parentheses are not strictly necessary, but they are introduced so that we can compare the pattern to one that uses look-ahead. In the look-ahead version, the $ anchor moves inside the look-ahead subexpression, because no position can be both followed by .txt and at the end of the input:

 ^A.*(?=\.txt$) 

The pattern with the look-ahead constraint matches only the part of the filename before the .txt, but only if the .txt is present. In other words, the .txt is not consumed by the match. This is visible in the value of the matching variables used with the regexp command. It would also affect the substitutions done in the regsub command.

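There is negative look-ahead too. The following pattern matches a filename that begins with A and does not end with .txt. The constraint sits right after the A so that it can inspect the whole rest of the name; a (?!\.txt) placed just before $ would always succeed, because nothing follows that position:

 ^A(?!.*\.txt$).*$ 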

Writing this pattern without negative look-ahead is awkward.
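
The effect on the match variables can be seen in a short sketch. The file names and variable names are hypothetical, and the patterns here place the $ anchor inside the look-ahead subexpression so that the constraint and the end-of-input test apply at the same position:

```tcl
# Ordinary grouping: the suffix is consumed and is part of the match.
regexp {^A.*(\.txt)$} "Alpha.txt" whole suffix

# Positive look-ahead: .txt must be present at the end,
# but it is not part of the matched text.
regexp {^A.*(?=\.txt$)} "Alpha.txt" base

# Negative look-ahead: accept names that start with A
# unless the rest of the name ends in .txt.
set notTxt [regexp {^A(?!.*\.txt$).*$} "Alpha.tcl"]
set isTxt  [regexp {^A(?!.*\.txt$).*$} "Alpha.txt"]

puts [list $whole $suffix $base $notTxt $isTxt]
```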

Character Codes

The \nn and \mmm syntax, where n and m are digits, can also mean an 8-bit character code corresponding to the octal value nn or mmm. This has priority over a back reference. However, I just wouldn't use this notation for character codes. Instead, use the Unicode escape sequence, \unnnn, which specifies a 16-bit value. The \xnn sequence also specifies an 8-bit character code. Unfortunately, the \x escape consumes all hex digits after it (not just two!) and then truncates the hexadecimal value down to 8 bits. This misfeature of \x is not considered a bug and will probably not change even in future versions of Tcl.

The \Uyyyyyyyy syntax is reserved for 32-bit Unicode, but I don't expect to see that implemented anytime soon.
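
Here is a minimal sketch of the \unnnn escape inside a braced pattern. The sample strings are made up, and the \x form is deliberately left out because of the behavior described above:

```tcl
# \u0041 and \u0042 are the Unicode escapes for "A" and "B".
# The braces leave them for the regular expression engine to interpret.
regexp {\u0041\u0042} "xABy" match
puts $match

# An accented letter named by its code in the pattern. Double quotes
# let Tcl substitute the same escape in the data.
set accented [regexp {^caf\u00e9$} "caf\u00e9"]
puts $accented
```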

Collating Elements

Collating elements are characters or long names for characters that you can use inside character sets. Currently, Tcl only has some long names for various ASCII punctuation characters. Potentially, it could support names for every Unicode character, but it doesn't because the mapping tables would be huge. This section will briefly mention the syntax so that you can understand it if you see it. But its usefulness is still limited.

Within a bracketed expression, the following syntax is used to specify a collating element:

 [.identifier.] 

The identifier can be a character or a long name. The supported long names can be found in the generic/regc_locale.c file in the Tcl source code distribution. A few examples are shown below:

 [.c.] [.#.] [.number-sign.] 

Equivalence Classes

An equivalence class is all characters that sort to the same position. This is another feature that has limited usefulness in the current version of Tcl. In Tcl, characters sort by their Unicode character value, so there are no equivalence classes that contain more than one character! However, you could imagine a character class for 'o', 'ò', and other accented versions of the letter o. The syntax for equivalence classes within bracketed expressions is:

 [=char=] 

where char is any one of the characters in the character class. This syntax is valid only inside a character class definition.

Newline Sensitive Matching

By default, the newline character is just an ordinary character to the matching engine. You can make the newline character special with two options: lineanchor and linestop. You can set these options with flags to the regexp and regsub Tcl commands, or you can use the embedded options described later in Table 11-5 on page 147.

The lineanchor option makes the ^ and $ anchors work relative to newlines. The ^ matches immediately after a newline, and $ matches immediately before a newline. These anchors continue to match the very beginning and end of the input, too. With or without the lineanchor option, you can use \A and \Z to match the beginning and end of the string.

The linestop option prevents . (i.e., period) and character sets that begin with ^ from matching a newline character. In other words, unless you explicitly include \n in your pattern, it will not match across newlines.
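
The two options can be tried separately with the -lineanchor and -linestop flags of the regexp command. This sketch uses a made-up three-line input and illustrative variable names:

```tcl
# Hypothetical input: three words separated by newlines.
set input "alpha\nbeta\ngamma"

# Default: ^ and $ match only at the ends of the whole input,
# and \w does not match a newline, so nothing is found.
set whole [regexp -all -inline {^\w+$} $input]

# With -lineanchor, ^ and $ also match around each newline.
set perLine [regexp -all -inline -lineanchor {^\w+$} $input]

# Default: . matches newlines, so .* runs to the end of the input.
regexp {a.*} $input across

# With -linestop, . stops at the first newline.
regexp -linestop {a.*} $input stopped

puts [list $whole $perLine $stopped]
```

The regexp command also accepts a -line flag, which turns on both behaviors at once.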

Embedded Options

You can start a pattern with embedded options to turn on or off case sensitivity, newline sensitivity, and expanded syntax, which is explained in the next section. You can also switch from advanced regular expressions to a literal string, or to older forms of regular expressions. The syntax is a leading:

 (?chars) 

where chars is any number of option characters. The option characters are listed in Table 11-5 on page 147.
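
As an illustration, the sketch below uses two option characters from Tcl's regular expression syntax: i for case-insensitive matching and n for newline-sensitive matching. The inputs and variable names are made up:

```tcl
# Without options the match is case sensitive.
set plain [regexp {^hello} "HELLO world"]

# A leading (?i) makes the rest of the pattern case insensitive.
set nocase [regexp {(?i)^hello} "HELLO world"]

# A leading (?n) turns on newline-sensitive matching,
# so ^ and $ apply to each line of the input.
set lines [regexp -all -inline {(?n)^\w+$} "alpha\nbeta"]

puts [list $plain $nocase $lines]
```

Embedding the options keeps the pattern self-describing, which is handy when the pattern is stored in a variable and used far from the regexp call.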

Expanded Syntax

Expanded syntax lets you include comments and extra white space in your patterns. This can greatly improve the readability of complex patterns. Expanded syntax is turned on with a regexp command option or an embedded option.

Comments start with a # and run until the end of line. Extra white space and comments can occur anywhere except inside bracketed expressions (i.e., character sets) or within multicharacter syntax elements like (?=. When you are in expanded mode, you can turn off the comment character or include an explicit space by preceding them with a backslash. Example 11-1 shows a pattern to match URLs. The leading (?x) turns on expanded syntax. The whole pattern is grouped in curly braces to hide it from Tcl. This example is considered again in more detail in Example 11-3 on page 150:

Example 11-1 Expanded regular expressions allow comments.
 regexp {(?x)              # A pattern to match URLS
        ([^:]+):           # The protocol before the initial colon
        //([^:/]+)         # The server name
        (:([0-9]+))?       # The optional port number
        (/.*)              # The trailing pathname
 } $input
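
To see what the subexpressions capture, the pattern can be run with match variables. This is only a sketch: the URLs are made-up examples, and the variable names (protocol, server, x, port, path) are assumptions chosen for readability:

```tcl
# Hypothetical URL with an explicit port number.
set input "http://www.example.com:8080/index.html"

regexp {(?x)              # A pattern to match URLs
       ([^:]+):           # The protocol before the initial colon
       //([^:/]+)         # The server name
       (:([0-9]+))?       # The optional port number
       (/.*)              # The trailing pathname
} $input match protocol server x port path

puts [list $protocol $server $port $path]

# Without a port, the optional group does not participate
# and its variables are set to the empty string.
regexp {(?x) ([^:]+): //([^:/]+) (:([0-9]+))? (/.*)} \
    "http://www.example.com/index.html" match2 protocol2 server2 x2 port2 path2

puts [list $server2 $port2 $path2]
```

The variable x absorbs the outer group that includes the colon, so that port receives only the digits.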

       