Section 5.2. A Few Short Examples | Mastering Regular Expressions

5.2. A Few Short Examples

5.2.1. Continuing with Continuation Lines

With the continuation-line example from the previous chapter (˜178), we found that ^\w+= .* (\\\n.*)* applied with a Traditional NFA doesnt properly match both lines of:

 SRC=array.c builtin.c eval.c field.c gawkmisc.c io.c main.c \ missing.c msg.c node.c re.c version.c

The problem is that the first .* matches past the backslash, pulling it out from under the (\n.*)* that we want it to be matched by. Well, heres the first lesson of the chapter: if we don't want to match past the backslash, we should say that in the regex. We can do this by changing each dot to [^\n\\] . (Notice how Ive made sure to include \n in the negated class? You'll remember that one of the assumptions of the original regex was that dot didn't match a newline, and we don't want its replacement to match a newline either ˜119.)

Making that change, we get:

  ^\w+ =

 * (\\\n

*)*

This now works, properly matching continuation lines, but in solving one problem, we have perhaps introduced another: we've now disallowed backslashes other than those at the end of lines. This is a problem if the data to which it will be applied could possibly have other backslashes. We'll assume it could, so we need to accommodate the regex to handle them.

So far, our approaches have been along the lines of "match the line, then try to match a continuation line if there." Let's change that approach to one that I find often works in general: concentrate on what is really allowed to match at any particular point. As we match the line, we want either normal (non-backslash, non-newline) characters , or a backslash-anything combination. If we use \\. for the backslash-anything combination, and apply it in a dot-matches-all mode, it also can match the backslash-newline combination.

So, the expression becomes ^\w+= ( [^\n\\]\\. )* in a dot-matches-all mode. Due to the leading ^ , an enhanced line anchor match mode (˜112) may be useful as well, depending on how this expression is used.

But, we're not quite done with this example yetwe'll pick it up again in the next chapter where we work on its efficiency (˜270).

5.2.2. Matching an IP Address

As another example that we'll take much further, let's match an IP (Internet Protocol) address: four numbers separated by periods, such as 1.2.3.4. Often, the numbers are padded to three digits, as in 001.002.003.004 . If you want to check a string for one of these, you could use [0-9]*\.[0-9]*\.[0-9]*\.[0-9]* , but thats so vague that it even matches ' and then ..? '. Look at the regex: it doesn't even require any numbersits only requirements are three periods (with nothing but digits, if anything , between).

To fix this regex, we first change the star to a plus, since we know that each number must have at least one digit. To ensure that the entire string is only the IP address, we wrap the regex with ^‹$ . This gives us:

  ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$

Using \d instead of [0-9] , it becomes ^\d+\.\d+\.\d+\.\d+$ , which you may find to be more easily readable, ^{[ ]} but it still matches things that arent IP addresses, like ' 1234.5678.9101112.131415 '. (IP addresses have each number in the range of 0255.) As a start, you can enforce that each number be three digits long, with ^\d\d\d\.\d\d\d\.\d\d\d\.\d\d\d$ . but now we are too specific. We still need to allow one- and two-digit numbers (as in 1.234.5.67 ). If the flavor supports { min,max } , you can use ^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$ . If not, you can always use \d\d?\d? or \d(\d\d?)? for each part. These allow one to three digits, each in a slightly different way.

^{[ ]} Or maybe not it depends on what you are used to. In a complex regex, I find \d more readable than [0-9] , but note that on some systems, the two might not be exactly the same. Systems that support Unicode, for example, may have their \d match non-ASCII digits as well (˜120).

Depending on your needs, you might be happy with some of the various degrees of vagueness in the expressions so far. If you really want to be strict, you have to worry that \d{1,3} can match 999 , which is above 255, and thus an invalid component of an IP address.

Several approaches would ensure that only numbers from 0 to 255 appear. One silly approach is 0123‹253254255 . Actually, this doesnt allow the zeropadding that is allowed, so you really need 000000101001‹ , whose length becomes even more ridiculous. For a DFA engine, it is ridiculous only in that its so long and verbose it still matches just as fast as any regex describing the same text. For an NFA, however, all the alternation kills efficiency.

A realistic approach concentrates on which digits are allowed in a number, and where. If a number is only one or two digits long, there is no worry as to whether the value is within range, so \d\d\d takes care of it. Theres also no worry about the value for a three-digit number beginning with a or 1 , since such a number is in the range 000199 and is perfectly valid. This lets us add [01]\d\d , leaving us with \d\d\d[01]\d\d . You might recognize this as being similar to the time example in Chapter 1 (˜28), and date example of the previous chapter (˜177).

Continuing with our regular expression, a three-digit number beginning with a 2 is allowed if the number is 255 or less, so a second digit less than 5 means the number is valid. If the second digit is 5 , the third must be less than 6 . This can all be expressed as 2[0-4]\d25[0-5] .

This may seem confusing at first, but the approach should make sense upon reflection. The result is \d\d\d[01]\d\d2[0-4]\d25[0-5] . Actually, we can combine the first three alternatives to yield 2[0-4]\d25[0-5] . Doing so is more efficient for an NFA, since any alternative that fails results in a backtrack. Note that using \d\d ? in the first alternative, rather than \d ? \d , allows an NFA to fail just a bit more quickly when there is no digit at all. Ill leave the analysis to youwalking through a simple test case with both should illustrate the difference. We could do other things to make this part of the expression more efficient, but I'll leave that for the next chapter.

Now that we have a subexpression to match a single number from 0 through 255, we can wrap it in parentheses and insert it in place of each \d{1,3} in the earlier regex. This gives us (broken across lines to fit the width of the page):

  ^  (  [01]?\d\d?   2[0-4]\d   25[0-5]  )  \.  (  [01]?\d\d?   2[0-4]\d   25[0-5]  )  \.  (  [01]?\d\d?   2[0-4]\d   25[0-5]  )  \.  (  [01]?\d\d?   2[0-4]\d   25[0-5]  )  $

Quite a mouthful! Was it worth the trouble? You have to decide for yourself based upon your own needs. It matches only syntactically correct IP addresses, but it can still match semantically incorrect ones, such as 0.0.0.0 (invalid because all the digits are zero). With lookahead (˜133), you can disallow that specific case by putting (?!0+\.0+\.0+\.0+$) after ^ , but at some point, you have to decide when being too specific causes the cost/benefit ratio to suffer from diminishing returns. Sometimes its better to take some of the work out of the regex. For example, if you go back to ^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$ and wrap each component in parentheses to stuff the numbers into the programs version of $1, $2, $3 , and $4 , you can then validate them by other programming constructs.

5.2.2.1. Know your context

It's important to realize that the two anchors, ^ and $ , are required to make this regex work. Without them, it can match ip=72 3 , or for a Traditional NFA, even ip= 3 .

In that second case, the expression does not even fully match the final 223 that should have been allowed. Well, it is allowed , but there's nothing (such as a separating period, or the trailing anchor) to force that match. The final group 's first alternative, [01]?\d\d? , matched the first two digits, and without the trailing $ , thats the end of the regex. As with the date-matching problem in the previous chapter (˜176), we can arrange the order of the alternatives to achieve the desired result. In this case, we would put the alternatives matching three digits first, so any proper three-digit number is matched in full before the two-digit-okay alternative is given a chance. (DFAs and POSIX NFAs don't require the reordering , of course, since they choose the longest match, regardless.)

Rearranged or not, that first mistaken match is still a problem. "Ah!" you might think, "I can use word boundary anchors to solve this problem." Unfortunately, that's probably not enough, since such a regex could still match .5.6. To disallow embedded matches, you must ensure that the surrounding context has at least no alphanumerics or periods. If you have lookaround, you can wrap the regex in (?<! [\w.] )‹(?! [\w.] ) to disallow matches that follow just after (or end just before) where [\w.] can match. If you dont have lookaround, simply wrapping it in (^)‹($) might be satisfactory for some situations.

5.2.3. Working with Filenames

Working with file and path names , like /usr/local/bin/perl on Unix, or perhaps something like \Program Files\Yahoo!\Messenger on Windows, can provide many good regular-expression examples. Since "using" is more interesting than "reading," I'll sprinkle in a few examples coded in Perl, PHP (preg routines), Java, and VB.NET. If you're not interested in these particular languages, feel free to skip the code snippetsit's the regex concepts used in them that are important.

5.2.3.1. Removing the leading path from a filename

As a first example, let's remove the leading path from a filename, turning /usr/local/bin/gcc , for instance, into gcc . Stating problems in a way that makes solutions amenable is half of the battle. In this case, we want to remove anything up to (and including) the final slash (backslash for Windows pathnames). If there is no slash, it's fine as is, and nothing needs to be done. I've said a number of times that .* is often overused , but its greediness is desired here. With ^.*/ , the .* consumes the whole line, but then backs off (that is, backtracks) to the last slash to achieve the match.

Here's code to do it in our four test languages, ensuring that a filename in the variable f has no leading path. First, for Unix filenames:

Language	Code Snippet
Perl	$f =~ s{ }{} ;
PHP	$f = preg_replace( '{ }', '', $f ) ;
java.util.regex	f = f. replaceFirst (" ", "");
VB.NET	f = Regex.Replace (f, " ", "")

The regular expression (or string to be interpreted as a regular expression) is underlined , and regex components are bold.

For comparison, here are versions for Windows filenames, which use a backward slash rather than a forward slash, making the regex ^.*\\ . The need to double the backward slash in the regex to match a single backward-slash character in the target is a regular-expression issue, but as the middle two examples show, some languages require additional string- related backslashes as well:

Language	Code Snippet
Perl	$f =~ s/ // ;
PHP	$f = preg_replace( '/ /', '', $f ) ;
java.util.regex	f = f. replaceFirst (" ", "");
VB.NET	f = Regex.Replace (f, " ", "")

It's interesting to compare each language's difference between the examples, particularly the quadruple backslashes needed in Java (˜101).

Please keep in mind this key point: always consider what will happen when there is no match. In this case, no match means no slash in the string, so no substitution and the string is left unchanged. That's just what we want, in this case.

For efficiency's sake, it's important to remember how an NFA-based regex engine goes about its work. Let's consider what happens if we omit the leading caret (something that's easy to forget) and match against a string that doesn't happen to have a slash. As always, the regex engine starts the search at the beginning of the string. The .* races to the end of the string, but must back off to find a match for the slash or backslash. It eventually backs off everything that .* had gobbled up, yet theres still no match. So, the regex engine decides that there is no possible match when starting from the beginning of the string , but it's not done yet!

The transmission kicks in and retries the whole regex from the second character position. In fact, it needs (in theory) to go through the whole scan-and-backtrack routine for each possible starting position in the string. Filenames tend to be short, so it's probably not such a big deal in this case, but the principle applies to many situations. Were the string long, there's a potential for a lot of backtracking. (A DFA has no such problem, of course.)

In practice, a reasonably optimized transmission realizes that almost any regex starting with .* that fails at the beginning of the string can never match when started from anywhere else, so it can shift gears and attempt the regex only the one time, at the start of the string (˜246). Still, its smarter to write that into our regex in the first place, as we originally did.

5.2.3.2. Accessing the filename from a path

Another approach is to bypass the path and simply match the trailing filename part without the path. The final filename is everything at the end that's not a slash: [^/]*$ . This time, the anchor is not just an optimization; we really do need dollar at the end. We can now do something like this, shown with Perl:

 $WholePath =~ m{([^/]*)$}; #  Check variable $WholePath with regex  . $FileName = $1; #  Note text matched

You'll notice that I don't check to see whether the regex actually matches, because I know it matches every time. The only requirement of that expression is that the string has an end to match dollar, and even an empty string has an end. Thus, when I use $1 to reference the text matched within the parenthetical subexpression, I'm assured it will have some value (although that value will be empty when the filename ends with a slash).

Another comment on efficiency: with an NFA, [^/]*$ is very inefficient. Carefully run through how the NFA engine attempts the match and you see that it can involve a lot of backtracking. Even the short sample ' /usr/local/bin/perl ' backtracks over 40 times before finally matching. Consider the attempt that starts at ‹ local/ ‹ . Once [^/]* matches through to the second l and fails on the slash, the $ is tried (and fails) for each l , a , c , o , l saved state. If thats not enough, most of it is repeated with the attempt that starts at ‹ l ocal/ ‹, and then again ‹ lo cal/ ‹, and so on.

It shouldn't concern us too much with this particular example, as filenames tend to be short. (And 40 backtracks is nothing 40 million is when they really matter!) Again, it's important to be aware of the issues so the general lessons here can be applied to your specific needs.

This is a good time to point out that even in a book about regular expressions, regular expressions aren't always The Best Answer. For example, most programming languages provide non-regex routines for dealing with filenames. But, for the sake of discussion, I'll forge ahead.

5.2.3.3. Both leading path and filename

The next logical step is to pick apart a full path into both its leading path and filename component. There are many ways to do this, depending on what we want. Initially, you might want to use ^(.*)/(.*)$ to fill $1 and $2 with the requisite parts . It looks like a nicely balanced regular expression, but knowing how greediness works, we are guaranteed that the first .* does what we want, never leaving anything with a slash for $2 . The only reason the first .* leaves anything at all is due to the backtracking done in trying to match the slash that follows . This leaves only that "backtracked part for the later .* . Thus, $1 is the full leading path and $2 the trailing filename.

One thing to note: we are relying on the initial (.*)/ to ensure that the second (.*) does not capture any slash. We understand greediness, so this is OK. Still I like to be specific when I can, so Id rather use [^/]* for the filename part. That gives us ^(.*)/ ( [^/]* ) $ . Since it shows exactly what we want, it acts as documentation as well.

One big problem is that this regex requires at least one slash in the string, so if we try it on something like file.txt , there's no match, and thus no information. This can be a feature if we deal with it properly:

 if ($WholePath =~ m!^(.*)/([^/]*)$!) { #  Have a match -- $1 and $2 are valid  $LeadingPath = $1; $FileName = $2; } else { #  No match, so there's no '/' in the filename  $LeadingPath = "."; #  so "file.txt" looks like ". / file.txt" ("." is the current directory)  $FileName = $WholePath; }

5.2.4. Matching Balanced Sets of Parentheses

Matching balanced sets of parentheses, brackets, and the like presents a special difficulty. Wanting to match balanced parentheses is quite common when parsing many kinds of configuration files, programs, and such. Imagine, for example, that you want to do some processing on all of a function's arguments when parsing a language like C. Function arguments are wrapped in parentheses following the function name , and may themselves contain parentheses resulting from nested function calls or math grouping. At first, ignoring that they may be nested, you might be tempted to use \bfoo , but it wont work.

In hallowed C tradition, I use foo as the example function name. The marked part of the expression is ostensibly meant to match the function's arguments. With examples such as foo and foo , it works as expected. Unfortunately, it also matches foo , 3.7 ), which is not as we want. This calls for something a bit "smarter" than [^)]* .

To match the parenthesized expression part, you might consider the following regular expressions, among others:

1 . $.*$ literal parentheses with anything in between
2 . $ [ ^) ] *$ from an opening parenthesis to the next closing parenthesis
3 . $ [ ^() ] *$ from an opening parenthesis to the next closing parenthesis, but no other opening parentheses allowed in between

Figure 5-1 illustrates where these match against a sample line of code.

Figure 5-1. Match locations of our sample regexes

We see that regex #1 matches too much, ^{[ ]} and regex #2 matches too little. Regex #3 doesnt even match successfully. In isolation, #3 would match '( this )', but because it must come immediately after the foo , it fails. So, none of these work.

^{[ ]} The use of .* should set off warning alarms. Always pay particular attention to decide whether dot is really what you want to apply star to. Sometimes that is exactly what you need, but .* is often used inappropriately.

The real problem is that on the vast majority of systems, you simply can't match arbitrarily nested constructs with regular expressions . For a long time, this was universally true, but now Perl, .NET, and PCRE/PHP all offer constructs that make it possible. (See pages 328, 436, and 475, respectively.) But, even without these special constructs, you can still build a regex to match things nested to a certain depth , but not to an arbitrary level of nesting. Just one level of nesting requires

  \([^()]*  (  \([^()]*\)[^()]*  )*  \)

so the thought of having to worry about further levels of nesting is frightening. But, here's a little Perl snippet that, given a $depth , creates a regex to match up to that many levels of parentheses beyond the first. It uses Perl's " string x count " operator, which replicates string by count times:

 $regex = '\(' . '(?:[^()]\(' x $depth . '[^()]*' . '\))*' x $depth . '\)';

I'll leave the analysis for your free time.

5.2.5. Watching Out for Unwanted Matches

It's easy to forget what happens if the text is not formed just as you expect. Let's say you are writing a filter to convert a text file to HTML, and you want to replace a line of hyphens by <HR> , which represent a horizontal rule (a line across the page). If you used a s/-*/<HR>/ search-and-replace command, it would replace the sequences you wanted, but only when they're at the beginning of the line. Surprised? In fact, s/-*/<HR>/ adds <HR> to the beginning of every line, whether they begin with a sequence of hyphens or not!

Remember, anything that isn't required is always considered successful. The first time -* is attempted at the start of the string, it matches any hyphens that might be there. However, if there arent any, it is still happy to successfully match nothing. That's what star is all about.

Let's look at a similar example I once saw in a book by a respected author, in which he describes a regular expression to match a number, either integer or floating-point. As his expression is constructed , such a number has an optional leading minus sign, any number of digits, an optional decimal point, and any number of digits that follow. His regex is -?[0-9]*\.?[0-9]* .

Indeed, this matches such examples as 1, -272.37, 129238843., .191919 , and even something like -.0 . This is all good, and as expected.

However, how do you think it matches in a string like ' this has no number ', ' nothing here ', or even an empty string? Look at the regex closely everything is optional. If a number is there, and if it is at the beginning of the string, it is matched, but nothing is required . This regex can match all three non-number examples, matching the nothingness at the beginning of the string each time. In fact, it even matches nothingness at the beginning of an example like ' num 123 ', since that nothingness matches earlier than the number would.

So, it's important to say what you really mean. A floating-point number must have at least one digit in it, or it's not a number(!). To construct our regex, let's first assume there is at least one digit before the decimal point. (We'll remove this requirement later.) If so, we need to use plus for those digits: -?[0-9]+ .

Writing the subexpression to match an optional decimal point (and subsequent digits) hinges on the realization that any numbers after the decimal point are contingent upon there being a decimal point in the first place. If we use something na ve like \.?[0-9]* , the [0-9]* gets a chance to match regardless of the decimal points presence.

The solution is, again, to say what we mean. A decimal point (and subsequent digits, if any) is optional: ( \.[0-9]* ) ? . Here, the question mark no longer quantifies (that is, governs or controls) only the decimal point, but instead the entire combination of the decimal point plus any following digits. Within that combination, the decimal point is required; if it is not there, [0-9]* is not even reached.

Putting this all together, we have -?[0-9]+ ( \.[0-9]* ) ? . This still doesnt allow something like ' .007 ', since our regex requires at least one digit before the decimal point. If we change that part to allow zero digits, we have to change the other so it doesn't, since we can't allow all digits to be optional (the problem we are trying to correct in the first place).

The solution is to add an alternative that allows for the uncovered situation: -?[0-9]+(\.[0-9]*)? . This now also allows just a decimal point followed by (this time not optional) digits. Details, details. Did you notice that I allowed for the optional leading minus in the second alternative as well? Thats easy to forget. Of course, you could instead bring the -? out of the alternation, as in -?( [0-9]+(\.[0-9]*)? \.[0-9]+ ) .

Although this is an improvement on the original, it's still more than happy to match at ' .12 '. Knowing the context in which a regex is intended to be used is an important part of striking the balance between matching what you want, and not matching what you don't want. Our regex for floating-point numbers requires that it be constrained somehow by being part of a larger regex, such as being wrapped by ^‹$ , or perhaps num\s*=\s*‹$ .

5.2.6. Matching Delimited Text

Matching a double-quoted string and matching an IP address are just two examples of a whole class of matching problem that often arises: the desire to match text delimited (or perhaps separated) by some other text. Other examples include:

Matching a C comment, which is surrounded by ' /* ' and ' */ '.
Matching an HTML tag, which is text wrapped by <‹> , such as <CODE> .
Extracting items between HTML tags, such as the ' super exciting ' of the HTML ' a <I>super exciting</I> offer! '
Matching a line in a .mailrc file. This file gives email aliases, where each line is in the form of
```
 alias  shorthand full-address  
```
such as ' alias jeff jfriedl@regex. info '. (Here, the delimiters are the whitespace between each item, as well as the ends of the line.)
Matching a quoted string, but allowing it to contain quotes if they are escaped, as in 'a passport needs a of the holder .'
Parsing CSV (comma-separated values) files.

In general, the requirements for such tasks can be phrased along the lines of:

1 . Match the opening delimiter .
2 . Match the main text (which is really "match anything that is not the closing delimiter").
3 . Match the closing delimiter.

As I mentioned earlier, satisfying these requirements can become complicated when the closing delimiter has more than one character, or when it may appear within the main text.

5.2.6.1. Allowing escaped quotes in double-quoted strings

Let's look at the 2\"x3\" example, where the closing delimiter is a quote, yet can appear within the main part if escaped. It's easy enough to match the opening and closing quotes; the trick is to match the main text without overshooting the closing quote.

Thinking carefully about which items are allowed in the main text, we know that if a character is not a double quote, that is, if it matches [^"] , its certainly OK. However, if it is a double quote, it is OK if preceded by a backslash. Translating that literally, using lookbehind (˜133) for the "if preceded" part, it becomes " ( [^"] (?<=\\)" ) *" , which indeed properly matches our 2\"x3\" example.

This is a perfect example to show how unintended matches can sneak into a seemingly proper regex, because as much as it seems to be correct, it doesn't always work. We want it to match the marked part of this silly example:

 Darth Symbol:

 or "[^-^]"

but it actually matches:

 Darth Symbol:

 [^-^]"

This is because the final quote of the first string indeed has a backslash before it. That backslash is itself escaped, so it doesn't escape the quote that follows (which means the quote that follows does end the string). Our lookbehind doesn't recognize that the preceding backslash has been itself escaped, and considering that there may be any number of preceding '\\' sequences, it's a can of worms to try to solve this with lookbehind. The real problem is that a backslash that escapes a quote is not being recognized as an escaping backslash when we first process it, so let's try a different approach that tackles it from that angle.

Concentrating again at what kinds of things we want to match between the opening and closing delimiter, we know that something escaped is OK ( \\. ), as well as anything else other than the closing quote ( [^"] ). This yields " ( \\. [^"] ) *" . Wonderful, weve solved the problem! Unfortunately, not yet. Unwanted matches can still creep in, such as with this example for which we expect no match because the closing quote has been forgotten:

 photo.

Why does it match? Recall the lessons from "Greediness and Laziness Always Favor a Match" (˜167). Even though our regex initially matches past that last quote, as we want, it still backtracks after it finds that there is no ending quote, to:

at ' ‹2x\"3 \" ‹ '

matching (\\. [^"])

From that point, the [^"] matches the backslash, leaving us at what the regex can consider an ending quote.

An important lesson to take from this example is:

When backtracking can cause undesired matches in relation to alternation, it's likely a sign that any success is just a happenstance due to the ordering of the alternatives.

In fact, had our original regex had its alternatives reversed , it would match incorrectly in every string containing an escaped double quote. The problem is that one alternative can match something that is supposed to be handled by the other.

So, how can we fix it? Well, just as in the continuation-lines example on page 186, we must make sure that there's no other way for that backslash to be matched, which means changing [^"] , to [^\\"] ,. This recognizes that both a double quote and a backslash are "special in this context, and must be handled accordingly . The result is " ( \\. [^\\"] ) *" , which works just fine. (Although this regex now works, it can still be improved so that it is much more efficient for NFA engines; well see this example quite a bit in the next chapter ˜222.)

This example shows a particularly important moral:

Always consider the "odd" cases in which you don't want a regex to match, such as with "bad" data.

Our fix is the right one, but it's interesting to note that if you have possessive quantifiers (˜142) or atomic grouping (˜139), this regex can be written as " ( \\. [^"] )*+ " and " (?> (\\. [^"])* ) " respectively. They dont really fix the problem so much as hide it, disallowing the engine from backtracking back to where the problem could show itself. Either way, they get the job done well.

Understanding how possessive quantifiers and atomic grouping help in this situation is extremely valuable , but I would still go ahead and make the previous fix anyway, as it is more descriptive to the reader. Actually, in this case, I would want to use possessive quantifiers or atomic grouping as well not to solve the previous problem, but for efficiency, so that a failure fails more quickly.

5.2.7. Knowing Your Data and Making Assumptions

This is an opportune time to highlight a general point about constructing and using regular expressions that I've briefly mentioned a few times. It is important to be aware of the assumptions made about the kind of data with which, and situations in which, a regular expression will be used. Even something as simple as a assumes that the target data is in the same character encoding (˜105) as the author intends. This is pretty much common sense, which is why I havent been too picky about saying these things.

However, many assumptions that might seem obvious to one person are not necessarily obvious to another. For example, the solution in the previous section assumes that escaped newlines shouldn't be matched, or that it will be applied in a dot-matches-all mode (˜111). If we really want to ensure that dot can match a newline, we should write that by using (?s:.) , if supported by the flavor.

Another assumption made in the previous section is the type of data to which the regex will be applied, as it makes no provisions for any other uses of double quotes in the data. If you apply it to source code from almost any programming language, for example, you'll find that it breaks because there can be double quotes within comments.

There is nothing wrong with making assumptions about your data, or how you intend a regex to be used. The problems, if any, usually lie in overly optimistic assumptions and in misunderstandings between the author's intentions and how the regex is eventually used. Documenting the assumptions can help.

5.2.8. Stripping Leading and Trailing Whitespace

Removing leading and trailing whitespace from a line is not a challenging problem, but it's one that seems to come up often. By far the best all-around solution is the simple use of two substitutions:

 s/^\s+//; s/\s+$//;

As a measure of efficiency, these use + instead of * , since theres no benefit to doing the substitution unless there is actually whitespace to remove.

For some reason, it seems to be popular to try to find a way to do it all in one expression, so I'll offer a few methods for comparison. I don't recommend them, but it's educational to understand why they work, and why they're not desirable.

s/\s*(.*?)\s*$/$1/s

This used to be given as a great example of lazy quantifiers, but not any more, because people now realize that it's so much slower than the simple approach. (In Perl, it's about 5x slower). The lack of speed is due to the need to check \s*$ before each application of the lazy- quantified dot. That requires a lot of backtracking.

s/^\s* ( (?:.*\S)? ) \s*$/$1/s

This one looks more complex than the previous example, but its matching is more straightforward, and is only twice as slow as the simple approach. After the initial ^\s* has bypassed any leading whitespace, the .* in the middle matches all the way to the end of the text. The \S that follows forces it to backtrack to the last non-whitespace in the text, thereby leaving the trailing whitespace matched by the final \s*$ , outside of the capturing parentheses.

The question mark is needed so that this expression works properly on a line that has only whitespace. Without it, it would fail to match, leaving the whitespace-filled line unchanged.

s/^\s+\s+$//g

This is a commonly thought-up solution that, while not incorrect (none of these are incorrect), it has top-level alternation that removes many optimizations (covered in the next chapter) that might otherwise be possible.

The /g modifier is required to allow each alternative to match, to remove both leading and trailing whitespace. It seems a waste to use /g when we know we intend at most two matches, and each with a different subexpression. This is about 4x slower than the simple approach.

I've mentioned the relative speeds as I tested them, but in practice, the actual relative speeds are dependent upon the tool and the data. For example, if the target text is very, very long, but has relatively little whitespace on either end, the middle approach can be somewhat faster than the simple approach. Still, in my programs, I use the language's equivalent of

 s/^\s+//; s/\s+$//;

because it's almost always fastest , and is certainly the easiest to understand.