Section 1.5. Expanding the Foundation | Mastering Regular Expressions

1.5. Expanding the Foundation

I hope the examples and explanations so far have helped to establish the basis for a solid understanding of regular expressions, but please realize that what I've provided so far lacks depth. There's so much more out there.

1.5.1. Linguistic Diversification

I mentioned a number of regular expression features that most versions of egrep support. There are other features, some of which are not supported by all versions, which I'll leave for later chapters.

Unfortunately , the regular expression language is no different from any other in that it has various dialects and accents. It seems each new program employing regular expressions devises its own "improvements." The state of the art continually moves forward, but changes over the years have resulted in a wide variety of regular expression "flavors." We'll see many examples in the following chapters.

1.5.2. The Goal of a Regular Expression

From the broadest top-down view, a regular expression either matches within a lump of text (with egrep , each line) or it doesn't. When crafting a regular expression, you must consider the ongoing tug-of-war between having your expression match the lines you want, yet still not matching lines you don't want.

Also, while egrep doesn't care where in the line the match occurs, this concern is important for many other regular-expression uses. If your text is something such as

 ...zip is 44272. If you write, send .95 to cover postage and...

and you merely want to find lines matching [0-9]+ , you dont care which numbers are matched. However, if your intent is to do something with the number (such as save to a file, add, replace, and suchwe will see examples of this kind of processing in the next chapter), you'll care very much exactly which numbers are matched.

1.5.3. A Few More Examples

As with any language, experience is a very good thing , so I'm including a few more examples of regular expressions to match some common constructs.

Half the battle when writing regular expressions is getting successful matches when and where you want them. The other half is to not match when and where you don't want. In practice, both are important, but for the moment, I would like to concentrate on the "getting successful matches" aspect. Even though I don't take these examples to their fullest depths, they still provide useful insight.

1.5.3.1. Variable names

Many programming languages have identifiers (variable names and such) that are allowed to contain only alphanumeric characters and underscores, but which may not begin with a digit. They are matched by [a-zA-Z_][a-zA-Z_0-9]* . The first character class matches what the first character can be, the second (with its accompanying star) allows the rest of the identifier. If there is a limit on the length of an identifier, say 32 characters, you might replace the star with {0,31} if the {min,max} notation is supported. (This construct, the interval quantifier, was briefly mentioned on page 20.)

1.5.3.2. A string within double quotes

A simple solution to matching a string within double quotes might be: "[^"]*"

The double quotes at either end are to match the opening and closing double quotes of the string. Between them, we can have anything... except another double quote! So, we use [^"] to match all characters except a double quote, and apply using a star to indicate we can have any number of such non double-quote characters.

A more useful (but more complex) definition of a double-quoted string allows double quotes within the string if they are escaped with a backslash, such as in " nail the 2\"x4\" plank" . We'll see this example several times in future chapters while covering the many details of how a match is actually carried out.

1.5.3.3. Dollar amount (with optional cents )

One approach to matching a dollar amount is: \$[0-9]+(\.[0-9][0-9])?

From a top-level perspective, this is a simple regular expression with three parts : \$ and ‹+ and (‹)? , which might be loosely paraphrased as "a literal dollar sign, a bunch of one thing, and finally perhaps another thing." In this case, the "one thing is a digit (with a bunch of them being a number), and "another thing" is the combination of a decimal point followed by two digits.

This example is a bit na ve for several reasons. For example, it considers dollar amounts like $1000 , but not $1,000 . It does allow for optional cents, but frankly, that's not really very useful when applied with egrep . egrep never cares exactly how much is matched, but merely whether there is a match. Allowing something optional at the end never changes whether there's an overall match to begin with.

But, if you need to find lines that contain just a price, and nothing else, you can wrap the expression with ^‹$ . In this case, the optional cents part becomes important since it might or might not come between the dollar amount and the end of the line, and allowing or disallowing it makes the difference in achieving an overall match.

One type of value our expression doesn't match is ' $.49 '. To solve this, you might be tempted to change the plus to a star, but that doesn't work. As to why, I'll leave it as a teaser until we look at a similar example in Chapter 5 (˜194).

1.5.3.4. An HTTP/HTML URL

The format of web URLs can be complex, so constructing a regular expression to match any possible URL can be equally complex. However, relaxing your standards slightly can allow you to match most common URLs with a fairly simple expression. One common reason I might do this, for example, would be to search my email archive for a URL that I vaguely remember having received, but which I think I might recognize when I see it.

The general form of a common HTTP/HTML URL is along the lines of

  http:  //  hostname  /  path.html

although ending with .htm is common as well.

The rules about what can and can't be a hostname (computer name , such as www.yahoo.com) are complex, but for our needs we can realize that if it follows ' http:// ', it's probably a hostname, so we can make do with something simple, such as [-a-z0-9_.]+ . The path part can be even more varied, so we'll use [-a-z0-9_:@&?=+,.!/~*%$]* for that. Notice that these classes have the dash first, to ensure that its taken as a literal character and included in the list, as opposed to part of a range (˜9).

Putting these all together, we might use as our first attempt something like:

 % egrep      -i '\<  http://  [-a-z0-9_.:]+/[-a-z0-9_:@&?=+,.!/~*%$]*\  .html?  \>'  files

Again, since we've taken liberties and relaxed what we'll match, we could well match something such as ' http:// .... /foo.html ', which is certainly not a valid URL. Do we care about this? It all depends on what you're trying to do. For my scan of my email archive, it doesn't really matter if I get a few false matches. Heck, I could probably get away with even something as simple as:

 % egrep -i '\<http://[^ ]*\.html?\>'  files  ...

As we'll learn when getting deeper into how to craft an expression, knowing the data you'll be searching is an important aspect of finding the balance between complexity and completeness. We'll visit this example again, in more detail, in the next chapter.

1.5.3.5. An HTML tag

With a tool like egrep , it doesn't seem particularly common or useful to simply match lines with HTML tags. But, exploring a regular expression that matches HTML tags exactly can be quite fruitful, especially when we delve into more advanced tools in the next chapter.

Looking at simple cases like ' <TITLE> ' and ' <HR> ', we might think to try <.*> . This simplistic approach is a frequent first thought, but its certainly incorrect. Converting <.*> into English reads "match a ' < ', followed by as much of anything as can be matched, followed by ' > '." Well, when phrased that way, it shouldnt be surprising that it can match more than just one tag, such as the marked portion of ' this example '.

This might have been a bit surprising, but we're still in the first chapter, and our understanding at this point is only superficial. I have this example here to highlight that regular expressions are not a difficult subject, but they can be tricky if you don't truly understand them. Over the next few chapters, we'll look at all the details required to understand and solve this problem.

1.5.3.6. Time of day, such as "9:17 am" or "12:30 pm"

Matching a time can be taken to varying levels of strictness. Something such as

  [0-9]?[0-9]:[0-9][0-9](ampm)

picks up both 9:17 am and 12:30 pm , but also allows something nonsensical like 99:99 pm .

Looking at the hour , we realize that if it is a two-digit number, the first digit must be a one. But, 1?[0-9] still allows an hour of 19 (and also an hour of 0), so maybe it is better to break the hour part into two possibilities: 1[012] for two-digit hours and [1-9] for single-digit hours. The result is (1[012][1-9]) .

The minute part is easier. The first digit should be [0-5] . For the second, we can stick with the current [0-9] . This gives (1[012];[1-9]):[0-5][0-9] (ampm) when we put it all together.

Using the same logic, can you extend this to handle 24-hour time with hours from tHRough 23? As a challenge, allow for a leading zero, at least through to 09:59 . Try building your solution, and then turn the page to check mine.

1.5.4. Regular Expression Nomenclature

1.5.4.1. Regex

As you might guess, using the full phrase "regular expression" can get a bit tiring, particularly in writing. Instead, I normally use "regex." It just rolls right off the tongue (it rhymes with "FedEx," with a hard g sound like "regular" and not a soft one like in "Regina") and it is amenable to a variety of uses like "when you regex...," "budding regexers," and even "regexification." ^{[ ]} I use the phrase "regex engine to refer to the part of a program that actually does the work of carrying out a match attempt.

^{[ ]} You might also come across the decidedly unsightly "regexp." Im not sure how one would pronounce that, but those with a lisp might find it a bit easier.

1.5.4.2. Matching

When I say a regex "matches" a string, I really mean that it matches in a string. Technically, the regex a doesnt match cat , but matches the a in . It's not something that people tend to confuse, but it's still worthy of mention.

1.5.4.3. Metacharacter

Whether a character is a metacharacter (or "metasequence"I use the words interchangeably) depends on exactly where in the regex it's used. For example, * is a metacharacter, but only when its not within a character class and when not escaped. "Escaped" means that it has a backslash in front of itusually. The star is escaped in \* , but not in \\* (where the first backslash escapes the second), although the star "has a backslash in front of it in both examples.

Depending upon the regex flavor, there are various situations when certain characters are and aren't metacharacters. Chapter 3 discusses this in more detail.

1.5.4.4. Flavor

As I've hinted, different tools use regular expressions for many different things, and the set of metacharacters and other features that each support can differ . Let's look at word boundaries again as an example. Some versions of egrep support the \<‹\> notation we've seen. However, some do not support the separate word-start and word-end, but one catch-all \b metacharacter (which we havent seen yetwe'll see it in the next chapter). Still others support both, and many others support neither .

I use the term "flavor" to describe the sum total of all these little implementation decisions. In the language analogy, it's the same as a dialect of an individual speaker. Superficially, this concept refers to which metacharacters are and aren't supported, but there's much more to it. Even if two programs both support \<‹\> , they might disagree on exactly what they do and dont consider to be a word. This concern is important when you use the tool.

Extending the Time Regex to Handle a 24-Hour Clock

Answer to the question on page 26 .

There are various solutions, but we can use similar logic as before. This time, I'll break the task into three groups: one for the morning (hours 00 through 09 , with the leading zero being optional), one for the daytime (hours 10 through 19 ), and one for the evening (hours 20 through 23 ). This can be rendered in a pretty straightforward way: 0?[0-9]1[0-9]2[0-3] .

Actually, we can combine the first two alternatives, resulting in the shorter [01]?[0-9]<2[0-3] . You might need to think about it a bit to convince yourself that theyll really match exactly the same text, but they do. The figure below might help, and it shows another approach as well. The shaded groups represent numbers that can be matched by a single alternative.

Don't confuse "flavor" with "tool." Just as two people can speak the same dialect, two completely different programs can support exactly the same regex flavor. Also, two programs with the same name (and built to do the same task) often have slightly (and sometimes not-so-slightly) different flavors. Among the various programs called egrep , there is a wide variety of regex flavors supported.

In the late 1990s, the particularly expressive flavor offered by the Perl programming language was widely recognized for its power, and soon other languages were offering Perl-inspired regular expressions (many even acknowledging the inspirational source by labeling themselves "Perl-compatible"). The adopters include PHP, Python, many Java regex packages, Microsoft's .NET Framework, Tcl, and a variety of C libraries, to name a few. Yet, all are different in important respects. On top of this, Perl's regular expressions themselves are evolving and growing (sometimes, now, in response to advances seen with other tools). As always, the overall landscape continues to become more varied and confusing.

1.5.4.5. Subexpression

The term "subexpression" simply refers to part of a larger expression, although it often refers to some part of an expression within parentheses, or to an alternative of . For example, with ^(SubjectDate): , the SubjectDate is usually referred to as a subexpression. Within that, the alternatives Subject and Date are each referred to as subexpressions as well. But technically, S is a subexpression, as is u , and b , and j ,...

Something such as 1-6 isn't considered a subexpression of H[1-6] * , since the ' 1-6 ' is part of an unbreakable "unit," the character class. But, H , [1-6] , and * are all subexpressions of H[1-6] * .

Unlike alternation , quantifiers (star, plus, and question mark) always work with the smallest immediately- preceding subexpression. This is why with mis+pell , the + governs the s , not the mis or is . Of course, when what immediately precedes a quantifier is a parenthesized subexpression, the entire subexpression (no matter how complex) is taken as one unit.

1.5.4.6. Character

The word "character" can be a loaded term in computing. The character that a byte represents is merely a matter of interpretation. A byte with such-and-such a value has that same value in any context in which you might wish to consider it, but which character that value represents depends on the encoding in which it's viewed . As a concrete example, two bytes with decimal values 64 and 53 represent the characters " @ " and " 5 " respectively, if considered in the ASCII encoding, yet on the other hand are completely different if considered in the EBCDIC encoding (they are a space and some kind of a control character).

On the third hand, if those two bytes are considered in one of the popular encodings for Japanese characters, together they represent the single character . Yet, to represent this same character in another of the Japanese encodings requires two completely different bytes. Those two different bytes, by the way, yield the two characters " ¼" in the popular Latin-1 encoding, but yield the one Korean character in one of the Unicode encodings. ^{[ ]} The point is this: how bytes are to be interpreted is a matter of perspective (called an encoding ), and to be successful, you've got to make sure that your perspective agrees with the perspective taken by the tool you're using.

^{[ ]} The definitive book on multiple-byte encodings is Ken Lundes CJKV Information Processing , also published by O'Reilly. The CJKV stands for Chinese , Japanese , Korean , and Vietnamese , which are languages that tend to require multiple-byte encodings. Ken and Adobe kindly provided many of the special fonts used in this book.

Until recently, text-processing tools generally treated their data as a bunch of ASCII bytes, without regard to the encoding you might be intending. Recently, however, more and more systems are using some form of Unicode to process data internally (Chapter 3 includes an introduction to Unicode ˜105). On such systems, if the regular-expression subsystem has been implemented properly, the user doesn't normally have to pay much attention to these issues. That's a big "if," which is why Chapter 3 looks at this issue in depth.

1.5.5. Improving on the Status Quo

When it comes down to it, regular expressions are not difficult. But, if you talk to the average user of a program or language that supports them, you will likely find someone that understands them "a bit," but does not feel secure enough to really use them for anything complex or with any tool but those they use most often.

Traditionally, regular expression documentation tends to be limited to a short and incomplete description of one or two metacharacters, followed by a table of the rest. Examples often use meaningless regular expressions like a*((ab)*b*) , and text like ' a xxx ce xxxxxx ci xxx d '. They also tend to completely ignore subtle but important points, and often claim that their flavor is the same as some other well-known tool, almost always forgetting to mention the exceptions where they inevitably differ. The state of regex documentation needs help.

Now, I don't mean to imply that this chapter fills the gap for all regular expressions, or even for egrep regular expressions. Rather, this chapter merely provides the foundation upon which the rest of this book is built. It may be ambitious, but I hope this book does fill the gaps for you. I received many gratifying responses to the first edition, and have worked very hard to make this one even better, both in breadth and in depth.

Perhaps because regular-expression documentation has traditionally been so lacking, I feel the need to make the extra effort to make things particularly clear. Because I want to make sure you can use regular expressions to their fullest potential, I want to make sure you really, really understand them.

This is both good and bad.

It is good because you will learn how to think regular expressions. You will learn which differences and peculiarities to watch out for when faced with a new tool with a different flavor. You will know how to express yourself even with a weak, stripped-down regular expression flavor. You will understand what makes one expression more efficient than another, and will be able to balance tradeoffs among complexity, efficiency, and match results. When faced with a particularly complex task, you will know how to work through an expression the way the program would, constructing it as you go. In short, you will be comfortable using regular expressions to their fullest.

The problem is that the learning curve of this method can be rather steep, with three separate issues to tackle:

How regular expressions are used Most programs use regular expressions in ways that are more complex than egrep . Before we can discuss in detail how to write a really useful expression, we need to look at the ways regular expressions can be used. We start in the next chapter.
Regular expression features Selecting the proper tool to use when faced with a problem seems to be half the battle, so I don't want to limit myself to only using one utility throughout this book. Different programs, and often even different versions of the same program, provide different features and metacharacters. We must survey the field before getting into the details of using them. This is the subject of Chapter 3.
How regular expressions really work Before we can learn from useful (but often complex) examples, we need to "look under the hood" to understand just how a regular expression search is conducted . As we'll see, the order in which certain metacharacters are checked can be very important. In fact, regular expression engines can be implemented in different ways, so different programs sometimes do different things with the same expression. We examine this meaty subject in Chapters 4, 5, and 6.

This last point is the most important and the most difficult to address. The discussion is unfortunately sometimes a bit dry, with the reader chomping at the bit to get to the fun parttackling real problems. However, understanding how the regex engine really works is the key to really understanding .

You might argue that you don't want to be taught how a car works when you simply want to know how to drive. But, learning to drive a car is a poor analogy for learning about regular expressions. My goal is to teach you how to solve problems with regular expressions, and that means constructing regular expressions. The better analogy is not how to drive a car, but how to build one. Before you can build a car, you have to know how it works.

Chapter 2 gives more experience with driving. Chapter 3 takes a short look at the history of driving, and a detailed look at the bodywork of a regex flavor. Chapter 4 looks at the all-important engine of a regex flavor. Chapter 5 shows some extended examples, Chapter 6 shows you how to tune up certain kinds of engines, and the chapters after that examine some specific makes and models. Particularly in Chapters 4, 5, and 6, we'll spend a lot of time under the hood, so make sure to have your coveralls and shop rags handy.

1.5.6. Summary

Table 1-3 summarizes the egrep metacharacters we've looked at in this chapter.

Table 1-3. Egrep Metacharacter Summary

Items to Match a Single Character
Metacharacter		Matches
`.`	dot	Matches any one character
`[‹]`	character class	Matches any one character listed
`[^‹]`	negated character class	Matches any one character not listed
`\char`	escaped character	When char is a metacharacter, or the escaped combination is not otherwise special, matches the literal char
Items Appended to Provide "Counting": The Quantifiers
`?`	question	One allowed, but it is optional
`*`	star	Any number allowed, but all are optional
`+`	plus	At least one required; additional are optional
{ min,max }	specified range	Min required, max allowed
Items That Match a Position
`^`	caret	Matches the position at the start of the line
`$`	dollar	Matches the position at the end of the line
`\<`	word boundary	Matches the position at the start of a word
`\>`	word boundary	Matches the position at the end of a word
Other
	alternation	Matches either expression it separates
`(‹)`	parentheses	Limits scope of alternation, provides grouping for the quantifiers, and "captures" for backreferences
`\1, \2, ...`	backreference	Matches text previously matched within first, second, etc., set of parentheses.
^{[ ]}

^{[ ]} not supported by all versions of egrep

In addition, be sure that you understand the following points:

Not all egrep programs are the same. The metacharacters supported, as well as their exact meanings, are often different see your local documentation (˜23).
Three reasons for using parentheses are constraining alternation (˜13), grouping (˜14), and capturing (˜21).
Character classes are special, and have their own set of metacharacters totally distinct from the "main" regex language (˜10).
Alternation and character classes are fundamentally different, providing unrelated services that appear, in only one limited situation, to overlap (˜13).
A negated character class is still a "positive assertion"even negated, a character class must match a character to be successful. Because the listing of characters to match is negated, the matched character must be one of those not listed in the class (˜12).
The useful -i option discounts capitalization during a match (˜15).
There are three types of escaped items:
- 1 . The pairing of \ and a metacharacter is a metasequence to match the literal character (for example, \* matches a literal asterisk).
- 2 . The pairing of \ and selected non-metacharacters becomes a metasequence with an implementation-defined meaning (for example, \< often means "start of word).
- 3 . The pairing of \ and any other character defaults to simply matching the character (that is, the backslash is ignored).
  
  Remember, though, that a backslash within a character class is not special at all with most versions of egrep , so it provides no "escape services" in such a situation.
Items governed by a question mark or star don't need to actually match any characters to "match successfully." They are always successful, even if they don't match anything (˜17).