1.4. Egrep Metacharacters

Let's start to explore some of the egrep metacharacters that supply its regular-expression power. I'll go over them quickly with a few examples, leaving the detailed examples and descriptions for later chapters.

Typographical Conventions: Before we begin, please make sure to review the typographical conventions explained in the preface, on page xxi. This book forges a bit of new ground in the area of typesetting, so some of my notations may be unfamiliar at first.

1.4.1. Start and End of the Line

Probably the easiest metacharacters to understand are ^ (caret) and $ (dollar), which represent the start and end, respectively, of the line of text as it is being checked. As we've seen, the regular expression cat finds c·a·t anywhere on the line, but ^cat matches only if the c·a·t is at the beginning of the line; the ^ is used to effectively anchor the match (of the rest of the regular expression) to the start of the line. Similarly, cat$ finds c·a·t only at the end of the line, such as a line ending with scat.

It's best to get into the habit of interpreting regular expressions in a rather literal way. For example, don't think

^cat matches a line with cat at the beginning

but rather:

^cat matches if you have the beginning of a line, followed immediately by c , followed immediately by a , followed immediately by t .

They both end up meaning the same thing, but reading it the more literal way allows you to intrinsically understand a new expression when you see it. How would egrep interpret ^cat$, ^$, or even simply ^ alone? Turn the page to check your interpretations.

The caret and dollar are special in that they match a position in the line rather than any actual text characters themselves. Of course, there are various ways to actually match real text. Besides providing literal characters like cat in your regular expression, you can also use some of the items discussed in the next few sections.
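To make the anchors concrete, here is a minimal sketch that uses Python's re module as a stand-in for egrep. The helper name egrep and the sample lines are my own assumptions for illustration; they are not part of any real tool.

```python
import re

def egrep(pattern, lines):
    """Hypothetical stand-in for egrep: keep the lines where pattern matches."""
    return [line for line in lines if re.search(pattern, line)]

# Sample lines, already stripped of their newlines (as egrep does).
lines = ["cat", "scat", "catalog", "a cat nap"]

anywhere = egrep(r"cat", lines)    # c, a, t anywhere on the line
at_start = egrep(r"^cat", lines)   # start of line, then c, then a, then t
at_end = egrep(r"cat$", lines)     # c, then a, then t, then end of line

print(anywhere)   # ['cat', 'scat', 'catalog', 'a cat nap']
print(at_start)   # ['cat', 'catalog']
print(at_end)     # ['cat', 'scat']
```

Reading each pattern literally, as suggested above, predicts every result: catalog has a beginning-of-line followed immediately by c·a·t, while scat does not.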

1.4.2. Character Classes

1.4.2.1. Matching any one of several characters

Let's say you want to search for "grey," but also want to find it if it were spelled "gray." The regular-expression construct [...], usually called a character class, lets you list the characters you want to allow at that point in the match. While e matches just an e, and a matches just an a, the regular expression [ea] matches either. So, then, consider gr[ea]y: this means to find "g, followed by r, followed by either an e or an a, all followed by y." Because I'm a really poor speller, I'm always using regular expressions like this against a huge list of English words to figure out proper spellings. One I use often is sep[ea]r[ea]te, because I can never remember whether the word is spelled "seperate," "separate," "separete," or what. The one that pops up in the list is the proper spelling; regular expressions to the rescue.

Notice how outside of a class, literal characters (like the g and r of gr[ea]y) have an implied "and then" between them: "match g and then match r ..." It's completely opposite inside a character class. The contents of a class is a list of characters that can match at that point, so the implication is "or."

As another example, maybe you want to allow capitalization of a word's first letter, such as with [Ss]mith. Remember that this still matches lines that contain smith (or Smith) embedded within another word, such as with blacksmith. I don't want to harp on this throughout the overview, but this issue does seem to be the source of problems among some new users. I'll touch on some ways to handle this embedded-word problem after we examine a few more metacharacters.

You can list in the class as many characters as you like. For example, [123456] matches any of the listed digits. This particular class might be useful as part of <H[123456]>, which matches <H1>, <H2>, <H3>, etc., and so comes in handy when searching for HTML headers.

Within a character class, the character-class metacharacter '-' (dash) indicates a range of characters: <H[1-6]> is identical to the previous example. [0-9] and [a-z] are common shorthands for classes to match digits and English lowercase letters, respectively. Multiple ranges are fine, so [0123456789abcdefABCDEF] can be written as [0-9a-fA-F] (or, perhaps, [A-Fa-f0-9], since the order in which ranges are given doesn't matter). These last three examples can be useful when processing hexadecimal numbers. You can freely combine ranges with literal characters: [0-9A-Z_!.?] matches a digit, uppercase letter, underscore, exclamation point, period, or a question mark.

Note that a dash is a metacharacter only within a character class; otherwise it matches the normal dash character. In fact, it is not even always a metacharacter within a character class. If it is the first character listed in the class, it can't possibly indicate a range, so it is not considered a metacharacter. Along the same lines, the question mark and period at the end of the class are usually regular-expression metacharacters, but only when not within a class (so, to be clear, the only special characters within the class in [0-9A-Z_!.?] are the two dashes).
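The class rules above can be checked with a short sketch. It assumes Python's re module, whose handling of these simple classes agrees with what is described here; the sample strings are made up for illustration.

```python
import re

# gr[ea]y: g, then r, then either e or a, then y.
spellings = ["grey", "gray", "gruy", "greay"]
grey_hits = [w for w in spellings if re.search(r"gr[ea]y", w)]

# <H[1-6]>: the dash builds a range, so this is the same as <H[123456]>.
tags = ["<H1>", "<H6>", "<H7>", "<HR>"]
header_hits = [t for t in tags if re.search(r"<H[1-6]>", t)]

# Inside a class, '.' and '?' are ordinary characters, not metacharacters.
chars = ["7", "Q", "_", ".", "?", "a", "-"]
class_hits = [c for c in chars if re.search(r"[0-9A-Z_!.?]", c)]

print(grey_hits)    # ['grey', 'gray']
print(header_hits)  # ['<H1>', '<H6>']
print(class_hits)   # ['7', 'Q', '_', '.', '?']
```

Note that the lone '-' is rejected by the last class: its two dashes are range builders, not members of the list.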

Reading ^cat$, ^$, and ^

Answers to the questions on page 8.



^cat$

Literally means: matches if the line has a beginning-of-line (which, of course, all lines have), followed immediately by c·a·t , and then followed immediately by the end of the line.

Effectively means: a line that consists of only cat: no extra words, spaces, punctuation... just 'cat'.



^$

Literally means: matches if the line has a beginning-of-line, followed immediately by the end of the line.

Effectively means: an empty line (with nothing in it, not even spaces).



^

Literally means: matches if the line has a beginning-of-line.

Effectively meaningless! Since every line has a beginning, every line will match, even lines that are empty!


Consider character classes as their own mini language. The rules regarding which metacharacters are supported (and what they do) are completely different inside and outside of character classes.

We'll see more examples of this shortly.

1.4.2.2. Negated character classes

If you use [^...] instead of [...], the class matches any character that isn't listed. For example, [^1-6] matches a character that's not 1 through 6. The leading ^ in the class "negates" the list, so rather than listing the characters you want to include in the class, you list the characters you don't want to be included.

You might have noticed that the ^ used here is the same as the start-of-line caret introduced on page 8. The character is the same, but the meaning is completely different. Just as the English word "wind" can mean different things depending on the context (sometimes a strong breeze, sometimes what you do to a clock), so can a metacharacter. We've already seen one example, the range-building dash. It is valid only inside a character class (and at that, only when not first inside the class). ^ is a line anchor outside a class, but a class metacharacter inside a class (but only when it is immediately after the class's opening bracket; otherwise, it's not special inside a class). Don't fear: these are the most complex special cases; others we'll see later aren't so bad.

As another example, let's search that list of English words for odd words that have q followed by something other than u. Translating that into a regular expression, it becomes q[^u]. I tried it on the list I have, and there certainly weren't many. I did find a few, including a number of words that I didn't even know were English.

Here's what happened. (What I typed follows the % prompt.)

    % egrep 'q[^u]' word.list
    Iraqi
    Iraqian
    miqra
    qasida
    qintar
    qoph
    zaqqum
    %

Two notable words not listed are "Qantas", the Australian airline, and "Iraq". Although both words are in the word.list file, neither was displayed by my egrep command. Why? Think about it for a bit, and then turn the page to check your reasoning.

Remember, a negated character class means "match a character that's not listed" and not "don't match what is listed." These might seem the same, but the Iraq example shows the subtle difference. A convenient way to view a negated class is that it is simply a shorthand for a normal class that includes all possible characters except those that are listed.
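The same search can be reproduced in a small sketch, assuming Python's re module in place of egrep. The word list here is a hypothetical stand-in for my word.list file: it holds the words mentioned above plus "quit", which I added so that one entry has q followed by u.

```python
import re

# Hypothetical miniature word list; each entry is one line with no newline.
word_list = ["Iraq", "Iraqi", "Iraqian", "miqra", "Qantas",
             "qasida", "qintar", "qoph", "quit", "zaqqum"]

# q[^u]: a lowercase q, followed by some character that is not u.
hits = [w for w in word_list if re.search(r"q[^u]", w)]
print(hits)
# ['Iraqi', 'Iraqian', 'miqra', 'qasida', 'qintar', 'qoph', 'zaqqum']

# The negated class must still match a character, so a q that is the
# last thing on the line cannot satisfy it, but a q followed by a space can.
iraq_alone = bool(re.search(r"q[^u]", "Iraq"))
iraq_then_text = bool(re.search(r"q[^u]", "Iraq is"))
print(iraq_alone, iraq_then_text)   # False True
```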

1.4.3. Matching Any Character with Dot

The metacharacter . (usually called dot or point ) is a shorthand for a character class that matches any character. It can be convenient when you want to have an "any character here" placeholder in your expression. For example, if you want to search for a date such as 03/19/76 , 03-19-76 , or even 03.19.76 , you could go to the trouble to construct a regular expression that uses character classes to explicitly allow ' / ', ' - ', or ' . ' between each number, such as 03[-./]19[-./]76 . However, you might also try simply using 03.19.76 .

Quite a few things are going on with this example that might be unclear at first. In 03[-./]19[-./]76 , the dots are not metacharacters because they are within a character class. (Remember, the list of metacharacters and their meanings are different inside and outside of character classes.) The dashes are also not class metacharacters in this case because each is the first thing after [ or [^ . Had they not been first, as with [.-/] , they would be the class range metacharacter, which would be a mistake in this situation.

Quiz Answer

Answer to the question on page 11 .

Why doesn't q[^u] match 'Qantas' or 'Iraq'?

Qantas didn't match because the regular expression called for a lowercase q, whereas the Q in Qantas is uppercase. Had we used Q[^u] instead, we would have found it, but not the others, since they don't have an uppercase Q. The expression [Qq][^u] would have found them all.

The Iraq example is somewhat of a trick question. The regular expression calls for q followed by a character that's not u, which precludes matching q at the end of the line. Lines generally have newline characters at the very end, but a little fact I neglected to mention (sorry!) is that egrep strips those before checking with the regular expression, so after a line-ending q, there's no non-u to be matched.

Don't feel too bad because of the trick question. [†] Let me assure you that had egrep not automatically stripped the newlines (many other tools don't strip them), or had Iraq been followed by spaces or other words or whatnot, the line would have matched. It is important to eventually understand the little details of each tool, but at this point what I'd like you to come away with from this exercise is that a character class, even negated, still requires a character to match.


[†] Once, in fourth grade, I was leading the spelling bee when I was asked to spell "miss." My answer was "m·i·s·s." Miss Smith relished in telling me that no, it was "M·i·s·s." with a capital M, that I should have asked for an example sentence, and that I was out. It was a traumatic moment in a young boy's life. After that, I never liked Miss Smith, and have since been a very poor speler.

With 03.19.76, the dots are metacharacters: ones that match any character (including the dash, period, and slash that we are expecting). However, it is important to know that each dot can match any character at all, so the expression can also match unintended text, such as the 03319 76 in the middle of a line of lottery numbers.

So, 03[-./]19[-./]76 is more precise, but it's more difficult to read and write. 03.19.76 is easy to understand, but vague. Which should we use? It all depends upon what you know about the data being searched, and just how specific you feel you need to be. One important, recurring issue has to do with balancing your knowledge of the text being searched against the need to always be exact when writing an expression. For example, if you know that with your data it would be highly unlikely for 03.19.76 to match in an unwanted place, it would certainly be reasonable to use it. Knowing the target text well is an important part of wielding regular expressions effectively.
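The trade-off between the two date expressions shows up clearly when both are run over the same text. This is a sketch assuming Python's re module; the last two sample strings are invented to show the dot matching something unwanted.

```python
import re

precise = r"03[-./]19[-./]76"   # only '-', '.', or '/' between the numbers
loose = r"03.19.76"             # any character at all between the numbers

samples = ["03/19/76", "03-19-76", "03.19.76", "03319 76", "03x19y76"]

precise_hits = [s for s in samples if re.search(precise, s)]
loose_hits = [s for s in samples if re.search(loose, s)]

print(precise_hits)  # ['03/19/76', '03-19-76', '03.19.76']
print(loose_hits)    # ['03/19/76', '03-19-76', '03.19.76', '03319 76', '03x19y76']
```

Whether the two extra matches matter depends entirely on whether such text can appear in the data being searched.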

1.4.4. Alternation

1.4.4.1. Matching any one of several subexpressions

A very convenient metacharacter is |, which means "or." It allows you to combine multiple expressions into a single expression that matches any of the individual ones. For example, Bob and Robert are separate expressions, but Bob|Robert is one expression that matches either. When combined this way, the subexpressions are called alternatives.

Looking back to our gr[ea]y example, it is interesting to realize that it can be written as grey|gray, and even gr(a|e)y. The latter case uses parentheses to constrain the alternation. (For the record, parentheses are metacharacters too.) Note that something like gr[a|e]y is not what we want: within a class, the '|' character is just a normal character, like a and e.

With gr(a|e)y, the parentheses are required because without them, gra|ey means "gra or ey," which is not what we want here. Alternation reaches far, but not beyond parentheses. Another example is (First|1st)•[Ss]treet. [†] Actually, since both First and 1st end with st, the combination can be shortened to (Fir|1)st•[Ss]treet. That's not necessarily quite as easy to read, but be sure to understand that (first|1st) and (fir|1)st effectively mean the same thing.

[†] Recall from the typographical conventions on page xxii that "•" is how I sometimes show a space character so it can be seen easily.

Here's an example involving an alternate spelling of my name. Compare and contrast the following three expressions, which are all effectively the same:

Jeffrey|Jeffery

Jeff(rey|ery)

Jeff(re|er)y

To have them match the British spellings as well, they could be:

(Geoff|Jeff)(rey|ery)

(Geo|Je)ff(rey|ery)

(Geo|Je)ff(re|er)y

Finally, note that these three match effectively the same as the longer (but simpler) Jeffrey|Geoffery|Jeffery|Geoffrey. They're all different ways to specify the same desired matches.

Although the gr[ea]y versus gr(ae)y examples might blur the distinction, be careful not to confuse the concept of alternation with that of a character class. A character class can match just a single character in the target text. With alternation, since each alternative can be a full-fledged regular expression in and of itself, each alternative can match an arbitrary amount of text. Character classes are almost like their own special mini-language (with their own ideas about metacharacters, for example), while alternation is part of the "main" regular expression language. You'll find both to be extremely useful.
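One way to convince yourself that the spelling variants really are equivalent is to run them over the same names. The sketch below assumes Python's re module; fullmatch is used so that each name must be matched in its entirety, and the list of names is my own choice for illustration.

```python
import re

variants = [r"Jeffrey|Jeffery", r"Jeff(rey|ery)", r"Jeff(re|er)y"]
names = ["Jeffrey", "Jeffery", "Jeffry", "Geoffrey"]

# One row per variant: does it accept each name in turn?
results = [[bool(re.fullmatch(p, n)) for n in names] for p in variants]
for pattern, row in zip(variants, results):
    print(pattern, row)   # every row is [True, True, False, False]

# Each alternative can be longer than one character, which a class cannot do.
british = r"(Geo|Je)ff(re|er)y"
british_hits = [n for n in names if re.fullmatch(british, n)]
print(british_hits)       # ['Jeffrey', 'Jeffery', 'Geoffrey']
```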

Also, take care when using caret or dollar in an expression that has alternation. Compare ^From|Subject|Date:• with ^(From|Subject|Date):•. Both appear similar to our earlier email example, but what each matches (and therefore how useful it is) differs greatly. The first is composed of three alternatives, so it matches "^From or Subject or Date:•," which is not particularly useful. We want the leading caret and trailing :• to apply to each alternative. We can accomplish this by using parentheses to "constrain" the alternation:

^(From|Subject|Date):•

The alternation is constrained by the parentheses, so literally, this regex means "match the start of the line, then one of From, Subject, or Date, and then match :•." Effectively, it matches:

  • 1) start-of-line, followed by F·r·o·m , followed by ' : '

  • or 2) start-of-line, followed by S·u·b·j·e·c·t , followed by ' : '

  • or 3) start-of-line, followed by D·a·t·e , followed by ' : '

Putting it less literally, it matches lines beginning with ' From: ', ' Subject: ', or ' Date: ', which is quite useful for listing the messages in an email file.

Here's an example:

    % egrep '^(From|Subject|Date): ' mailbox
    From: elvis@tabloid.org (The King)
    Subject: be seein' ya around
    Date: Mon, 23 Oct 2006 11:04:13
    From: The Prez <president@whitehouse.gov>
    Date: Wed, 25 Oct 2006 8:36:24
    Subject: now, about your vote...
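The difference between the constrained and unconstrained forms is easy to see side by side. This sketch assumes Python's re module in place of egrep; the first three lines come from the mailbox example, and the last two are hypothetical lines I added because only the unconstrained form picks them up.

```python
import re

mailbox = [
    "From: elvis@tabloid.org (The King)",
    "Subject: be seein' ya around",
    "Date: Mon, 23 Oct 2006 11:04:13",
    "Re: the Subject line you asked about",   # hypothetical body text
    "X-Date: unknown",                        # hypothetical header
]

unconstrained = r"^From|Subject|Date: "    # "^From" or "Subject" or "Date: "
constrained = r"^(From|Subject|Date): "    # caret and ": " apply to each alternative

loose_hits = [line for line in mailbox if re.search(unconstrained, line)]
strict_hits = [line for line in mailbox if re.search(constrained, line)]

print(len(loose_hits))   # 5
print(strict_hits)       # only the three real header lines
```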

1.4.5. Ignoring Differences in Capitalization

This email header example provides a good opportunity to introduce the concept of a case-insensitive match. The field types in an email header usually appear with leading capitalization, such as "Subject" and "From," but the email standard actually allows mixed capitalization, so things like "DATE" and "from" are also allowed. Unfortunately, the regular expression in the previous section doesn't match those.

One approach is to replace From with [Ff][Rr][Oo][Mm] to match any form of "from," but this is quite cumbersome, to say the least. Fortunately, there is a way to tell egrep to ignore case when doing comparisons, i.e., to perform the match in a case insensitive manner in which capitalization differences are simply ignored. It is not a part of the regular-expression language, but is a related useful feature many tools provide. egrep 's command-line option " -i " tells it to do a case-insensitive match. Place -i on the command line before the regular expression:

    % egrep -i '^(From|Subject|Date): ' mailbox

This brings up all the lines we matched before, but also includes lines such as:

 SUBJECT: MAKE MONEY FAST 

I find myself using the -i option quite frequently (perhaps related to the footnote on page 12!) so I recommend keeping it in mind. We'll see other convenient support features like this in later chapters.
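Many tools offer an equivalent of -i. As a sketch, Python's re module takes the flag re.IGNORECASE; the header lines below are made up for illustration, apart from the SUBJECT line shown above.

```python
import re

headers = [
    "Subject: hello",
    "SUBJECT: MAKE MONEY FAST",
    "from: someone",        # hypothetical lowercase header
    "Received: by mail",    # hypothetical header we do not want
]
pattern = r"^(From|Subject|Date): "

case_sensitive = [h for h in headers if re.search(pattern, h)]
case_insensitive = [h for h in headers if re.search(pattern, h, re.IGNORECASE)]

print(case_sensitive)    # ['Subject: hello']
print(case_insensitive)  # ['Subject: hello', 'SUBJECT: MAKE MONEY FAST', 'from: someone']
```

As with egrep, the flag sits outside the regular expression itself: the pattern is unchanged, and only the way comparisons are made differs.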

1.4.6. Word Boundaries

A common problem is that a regular expression that matches the word you want can often also match where the "word" is embedded within a larger word. I mentioned this briefly in the cat , gray , and Smith examples. It turns out, though, that some versions of egrep offer limited support for word recognition: namely the ability to match the boundary of a word (where a word begins or ends).

You can use the (perhaps odd looking) metasequences \< and \> if your version happens to support them (not all versions of egrep do). You can think of them as word-based versions of ^ and $ that match the position at the start and end of a word, respectively. Like the line anchors caret and dollar, they anchor other parts of the regular expression but don't actually consume any characters during a match. The expression \<cat\> literally means "match if we can find a start-of-word position, followed immediately by c·a·t , followed immediately by an end-of-word position." More naturally, it means "find the word cat ." If you wanted, you could use \<cat or cat\> to find words starting and ending with cat .

Note that < and > alone are not metacharacters; when combined with a backslash, the sequences become special. This is why I called them "metasequences." It's their special interpretation that's important, not the number of characters, so for the most part I use these two meta-words interchangeably.

Remember, not all versions of egrep support these word-boundary metacharacters, and those that do don't magically understand the English language. The "start of a word" is simply the position where a sequence of alphanumeric characters begins; "end of word" is where such a sequence ends. Figure 1-2 on the next page shows a sample line with these positions marked .

Figure 1-2. Start and end of "word" positions

The word-starts (as egrep recognizes them) are marked with up arrows, the word-ends with down arrows. As you can see, "start and end of word" is better phrased as "start and end of an alphanumeric sequence," but perhaps that's too much of a mouthful.
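Tools differ in how they spell word boundaries. As an illustration only, Python's re module does not offer \< and \>; it has a single \b that matches at either a word start or a word end, and its idea of a "word" character includes the underscore as well as letters and digits. With that assumption stated, the sketch below plays the role of \<cat\>, \<cat, and cat\> on some made-up lines.

```python
import re

lines = ["cat", "the cat sat", "concatenate", "scat", "catalog", "cat's cradle"]

whole_word = [l for l in lines if re.search(r"\bcat\b", l)]   # like \<cat\>
word_start = [l for l in lines if re.search(r"\bcat", l)]     # like \<cat
word_end = [l for l in lines if re.search(r"cat\b", l)]       # like cat\>

print(whole_word)  # ['cat', 'the cat sat', "cat's cradle"]
print(word_start)  # ['cat', 'the cat sat', 'catalog', "cat's cradle"]
print(word_end)    # ['cat', 'the cat sat', 'scat', "cat's cradle"]
```

The apostrophe in "cat's" is not alphanumeric, so a word end sits right after the t, exactly the kind of non-English "word" logic described above.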

1.4.7. In a Nutshell

Table 1-1 summarizes the metacharacters we have seen so far.

Table 1-1. Summary of Metacharacters Seen So Far.

Metacharacter   Name                      Matches
.               dot                       any one character
[...]           character class           any character listed
[^...]          negated character class   any character not listed
^               caret                     the position at the start of the line
$               dollar                    the position at the end of the line
\<              backslash less-than       the position at the start of a word (not supported by all versions of egrep)
\>              backslash greater-than    the position at the end of a word (not supported by all versions of egrep)
|               or; bar                   matches either expression it separates
(...)           parentheses               used to limit scope of |, plus additional uses yet to be discussed


In addition to the table, important points to remember include:

  • The rules about which characters are and aren't metacharacters (and exactly what they mean) are different inside a character class. For example, dot is a metacharacter outside of a class, but not within one. Conversely, a dash is a metacharacter within a class (usually), but not outside. Moreover, a caret has one meaning outside, another if specified inside a class immediately after the opening [ , and a third if given elsewhere in the class.

  • Don't confuse alternation with a character class. The class [abc] and the alternation (a|b|c) effectively mean the same thing, but the similarity in this example does not extend to the general case. A character class can match exactly one character, and that's true no matter how long or short the specified list of acceptable characters might be.

    Alternation, on the other hand, can have arbitrarily long alternatives, each textually unrelated to the other: \<(1,000,000|million|thousand•thou)\>. However, alternation can't be negated like a character class.

  • A negated character class is simply a notational convenience for a normal character class that matches everything not listed. Thus, [^x] doesn't mean "match unless there is an x," but rather "match if there is something that is not x." The difference is subtle, but important. The first concept matches a blank line, for example, while [^x] does not.

  • The useful -i option discounts capitalization during a match (☞15). [†]

    [†] Recall from the typographical conventions (page xxii) that something like "☞15" is a shorthand for a reference to another page of this book.

What we have seen so far can be quite useful, but the real power comes from optional and counting elements, which we'll look at next.

1.4.8. Optional Items

Let's look at matching color or colour. Since they are the same except that one has a u and the other doesn't, we can use colou?r to match either. The metacharacter ? (question mark) means optional. It is placed after the character that is allowed to appear at that point in the expression, but whose existence isn't actually required to still be considered a successful match.

Unlike other metacharacters we have seen so far, the question mark attaches only to the immediately-preceding item. Thus, colou?r is interpreted as "c then o then l then o then u? then r."

The u? part is always successful: sometimes it matches a u in the text, while other times it doesn't. The whole point of the ?-optional part is that it's successful either way. This isn't to say that any regular expression that contains ? is always successful. For example, against 'semicolon', both colo and u? are successful (matching colo and nothing, respectively). However, the final r fails, and that's what disallows semicolon, in the end, from being matched by colou?r.

As another example, consider matching a date that represents July fourth, with the "July" part being either July or Jul, and the "fourth" part being fourth, 4th, or simply 4. Of course, we could just use (July|Jul)•(fourth|4th|4), but let's explore other ways to express the same thing.

First, we can shorten the (July|Jul) to (July?). Do you see how they are effectively the same? The removal of the | means that the parentheses are no longer really needed. Leaving the parentheses doesn't hurt, but with them removed, July? is a bit less cluttered. This leaves us with July?•(fourth|4th|4).

Moving now to the second half, we can simplify the 4th|4 to 4(th)?. As you can see, ? can attach to a parenthesized expression. Inside the parentheses can be as complex a subexpression as you like, but "from the outside" it is considered a single unit. Grouping for ? (and other similar metacharacters which I'll introduce momentarily) is one of the main uses of parentheses.

Our expression now looks like July?•(fourth|4(th)?). Although there are a fair number of metacharacters, and even nested parentheses, it is not that difficult to decipher and understand. This discussion of two essentially simple examples has been rather long, but in the meantime we have covered tangential topics that add a lot, if perhaps only subconsciously, to our understanding of regular expressions. Also, it's given us some experience in taking different approaches toward the same goal. As we advance through this book (and through to a better understanding), you'll find many opportunities for creative juices to flow while trying to find the optimal way to solve a complex problem. Far from being some stuffy science, writing regular expressions is closer to an art.
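To check that the shortened date expression still accepts exactly what the long form does, both can be run over the same candidates. This is a sketch assuming Python's re module; the spaces in the patterns are real space characters, and the candidate strings are my own.

```python
import re

long_form = r"(July|Jul) (fourth|4th|4)"
short_form = r"July? (fourth|4(th)?)"

dates = ["July fourth", "Jul 4th", "July 4", "Jul fourth", "June 4th", "Jul  4"]

long_hits = [d for d in dates if re.fullmatch(long_form, d)]
short_hits = [d for d in dates if re.fullmatch(short_form, d)]
print(short_hits)               # ['July fourth', 'Jul 4th', 'July 4', 'Jul fourth']
print(long_hits == short_hits)  # True

# u? succeeds whether or not a u is present, but the final r is still required.
words = ["color", "colour", "semicolon", "colouur"]
color_hits = [w for w in words if re.search(r"colou?r", w)]
print(color_hits)               # ['color', 'colour']
```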

1.4.9. Other Quantifiers: Repetition

Similar to the question mark are + (plus) and * (an asterisk, but as a regular-expression metacharacter, I prefer the term star). The metacharacter + means "one or more of the immediately-preceding item," and * means "any number, including none, of the item." Phrased differently, ...* means "try to match it as many times as possible, but it's OK to settle for nothing if need be." The construct with plus, ...+, is similar in that it also tries to match as many times as possible, but different in that it fails if it can't match at least once. These three metacharacters, question mark, plus, and star, are called quantifiers because they influence the quantity of what they govern.

Like ...?, the ...* part of a regular expression always succeeds, with the only issue being what text (if any) is matched. Contrast this to ...+, which fails unless the item matches at least once.
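A tiny experiment makes the three quantifiers easy to compare. The sketch assumes Python's re module, and the patterns and test strings are invented for illustration: each pattern is an a, then a quantified b, then a c.

```python
import re

patterns = ["ab?c", "ab*c", "ab+c"]
texts = ["ac", "abc", "abbc"]    # zero, one, and two b's

# For each pattern, record whether it matches each whole text.
table = {p: [bool(re.fullmatch(p, t)) for t in texts] for p in patterns}
for pattern, row in table.items():
    print(pattern, row)
# ab?c [True, True, False]   one b allowed, none required
# ab*c [True, True, True]    any number of b's, none required
# ab+c [False, True, True]   any number of b's, at least one required
```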

For example, •? allows a single optional space, but •* allows any number of optional spaces. We can use this to make page 9's <H[1-6]> example flexible. The HTML specification [†] says that spaces are allowed immediately before the closing >, such as with <H3•> and <H4•••>. Inserting •* into our regular expression where we want to allow (but not require) spaces, we get <H[1-6]•*>. This still matches <H1>, as no spaces are required, but it also flexibly picks up the other versions.

[†] If you are not familiar with HTML, never fear. I use these as real-world examples, but I provide all the details needed to understand the points being made. Those familiar with parsing HTML tags will likely recognize important considerations I don't address at this point in the book.

Exploring further, let's search for an HTML tag such as <HR•SIZE=14>, which indicates that a line (a Horizontal Rule) 14 pixels thick should be drawn across the screen. Like the <H3> example, optional spaces are allowed before the closing angle bracket. Additionally, they are allowed on either side of the equal sign. Finally, one space is required between the HR and SIZE, although more are allowed. To allow more, we could just add •* to the • already there, but instead let's change it to •+. The plus allows extra spaces while still requiring at least one, so it's effectively the same as ••*, but more concise. All these changes leave us with <HR•+SIZE•*=•*14•*>.

Although flexible with respect to spaces, our expression is still inflexible with respect to the size given in the tag. Rather than find tags with only one particular size such as 14, we want to find them all. To accomplish this, we replace the 14 with an expression to find a general number. Well, in this case, a "number" is one or more digits. A digit is [0-9], and "one or more" adds a plus, so we end up replacing 14 by [0-9]+. (A character class is one "unit," so can be subject directly to plus, question mark, and so on, without the need for parentheses.)

This leaves us with <HR•+SIZE•*=•*[0-9]+•*>, which is certainly a mouthful even though I'm using the "visible space" symbol '•' for clarity. (Luckily, egrep has the -i case-insensitive option, ☞15, which means I don't have to use [Hh][Rr] instead of HR.) The unadorned regular expression <HR +SIZE *= *[0-9]+ *> likely appears even more confusing. This example looks particularly odd because the subjects of most of the stars and pluses are space characters, and our eye has always been trained to treat spaces specially. That's a habit you will have to break when reading regular expressions, because the space character is a normal character, no different from, say, j or 4. (In later chapters, we'll see that some other tools support a special mode in which white-space is ignored, but egrep has no such mode.)

Continuing to exploit a good example, let's consider that the size attribute is optional, so you can simply use <HR> if the default size is wanted. (Extra spaces are allowed before the > , as always.) How can we modify our regular expression so that it matches either type? The key is realizing that the size part is optional (that's a hint). Turn the page to check your answer.

Take a good look at our latest expression (in the answer box) to appreciate the differences among the question mark, star, and plus, and what they really mean in practice. Table 1-2 on the next page summarizes their meanings.

Note that each quantifier has some minimum number of matches required to succeed, and a maximum number of matches that it will ever attempt. With some, the minimum number is zero; with some, the maximum number is unlimited.

Making a Subexpression Optional

Answer to the question on page 19 .

In this case, "optional" means that it is allowed once, but is not required. That means using ?. Since the thing that's optional is larger than one character, we must use parentheses: (...)?. Inserting into our expression, we get:

<HR(•+SIZE•*=•*[0-9]+)?•*>

Note that the ending •* is kept outside of the (...)?. This still allows something such as <HR•>. Had we included it within the parentheses, ending spaces would have been allowed only when the size component was present.

Similarly, notice that the •+ before SIZE is included within the parentheses. Were it left outside them, a space would have been required after the HR, even when the SIZE part wasn't there. This would cause '<HR>' to not match.
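The answer can be exercised with a short sketch, assuming Python's re module and its re.IGNORECASE flag standing in for egrep -i. The tags are sample strings I made up; the last three are deliberately malformed.

```python
import re

# Optional " SIZE = number" part, then optional spaces before the closing >.
hr = re.compile(r"<HR( +SIZE *= *[0-9]+)? *>", re.IGNORECASE)

tags = ["<HR>", "<HR >", "<HR SIZE=14>", "<hr  size = 3 >",
        "<HRSIZE=14>", "<HR SIZE=>", "<HR SIZE=14"]

hr_hits = [t for t in tags if hr.search(t)]
print(hr_hits)   # ['<HR>', '<HR >', '<HR SIZE=14>', '<hr  size = 3 >']
```

The rejected tags fail for the reasons discussed above: no space before SIZE, no digits after the equal sign, and no closing angle bracket.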


Table 1-2. Summary of Quantifier "Repetition Metacharacters"

 

Quantifier   Minimum Required   Maximum to Try   Meaning
?            none               1                one allowed; none required ("one optional")
*            none               no limit         unlimited allowed; none required ("any amount OK")
+            1                  no limit         unlimited allowed; one required ("at least one")


1.4.9.1. Defined range of matches: intervals

Some versions of egrep support a metasequence for providing your own minimum and maximum: ...{min,max}. This is called the interval quantifier. For example, ...{3,12} matches up to 12 times if possible, but settles for three. One might use [a-zA-Z]{1,5} to match a US stock ticker (from one to five letters). Using this notation, {0,1} is the same as a question mark.

Not many versions of egrep support this notation yet, but many other tools do, so it's covered in Chapter 3 when we look in detail at the broad spectrum of metacharacters in common use today.
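Python's re module is one of the tools that does support intervals, so it can serve for a quick sketch. The ticker symbols and the run of x characters are invented test data.

```python
import re

# From one to five letters, nothing else.
symbols = ["F", "IBM", "GOOGL", "TOOLONG", ""]
ticker_hits = [s for s in symbols if re.fullmatch(r"[a-zA-Z]{1,5}", s)]
print(ticker_hits)    # ['F', 'IBM', 'GOOGL']

# {3,12} takes up to 12 when they are available, and fails below three.
longest = re.match(r"x{3,12}", "x" * 20).group()
too_few = re.match(r"x{3,12}", "xx")
print(len(longest), too_few)   # 12 None

# {0,1} behaves like the question mark.
optional_u = [bool(re.fullmatch(r"colou{0,1}r", w)) for w in ["color", "colour"]]
print(optional_u)     # [True, True]
```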

1.4.10. Parentheses and Backreferences

So far, we have seen two uses for parentheses: to limit the scope of alternation, |, and to group multiple characters into larger units to which you can apply quantifiers like question mark and star. I'd like to discuss another specialized use that's not common in egrep (although GNU's popular version does support it), but which is commonly found in many other tools.

In many regular-expression flavors, parentheses can "remember" text matched by the subexpression they enclose. We'll use this in a partial solution to the doubled-word problem at the beginning of this chapter. If you knew the the specific doubled word to find (such as "the" earlier in this sentence; did you catch it?), you could search for it explicitly, such as with the•the. In this case, you would also find items such as the•theft, but you could easily get around that problem if your egrep supports the word-boundary metasequences \<...\> mentioned on page 15: \<the•the\>. We could use •+ for the space for even more flexibility.

However, having to check for every possible pair of words would be an impossible task. Wouldn't it be nice if we could match one generic word, and then say "now match the same thing again"? If your egrep supports backreferencing , you can. Backreferencing is a regular-expression feature that allows you to match new text that is the same as some text matched earlier in the expression.

We start with \<the +the\> and replace the initial the with a regular expression to match a general word, say [A-Za-z]+ . Then, for reasons that will become clear in the next paragraph, let's put parentheses around it. Finally, we replace the second ' the ' by the special metasequence \1 . This yields \<([A-Za-z]+) +\1\> .

With tools that support backreferencing, parentheses "remember" the text that the subexpression inside them matches, and the special metasequence \1 represents that text later in the regular expression, whatever it happens to be at the time.

Of course, you can have more than one set of parentheses in a regular expression. Use \1 , \2 , \3 , etc., to refer to the first, second, third, etc. sets. Pairs of parentheses are numbered by counting opening parentheses from the left, so with ([a-z])([0-9])\1\2 , the \1 refers to the text matched by [a-z] , and \2 refers to the text matched by [0-9] .
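
To see the numbering in action, here is a minimal sketch using Python's re module (assumed here as a stand-in for a backreference-capable egrep). The input strings are invented for the example.

```python
import re

# Two numbered groups, each repeated by its own backreference.
pattern = re.compile(r"([a-z])([0-9])\1\2")

hit = pattern.search("id: x7x7")    # 'x' then '7', then the same 'x' and '7' again
miss = pattern.search("id: x7y8")   # the second pair differs, so \1 and \2 fail

first = hit.group(1)    # text remembered by the first set of parentheses
second = hit.group(2)   # text remembered by the second set of parentheses

print(hit.group(0), first, second, miss)
```

The backreferences demand the same text that the groups actually matched, not merely any text the subexpressions could match, which is why "x7y8" is rejected.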

With our ' the the ' example, [A-Za-z]+ matches the first ' the '. It is within the first set of parentheses, so the ' the ' matched becomes available via \1 . If the following ' +' matches, the subsequent \1 will require another ' the '. If \1 is successful, then \> makes sure that we are now at an end-of-word boundary (which we wouldn't be were the text ' the theft '). If successful, we've found a repeated word. It's not always the case that that is an error (such as with "that" in this sentence), but that's for you to decide once the suspect lines are shown.
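
The walk-through above can be sketched in Python's re module. This is an illustration under assumptions, not the egrep command itself: Python has no \< and \> , so the sketch substitutes \b , which matches at either kind of word boundary. The test strings are made up.

```python
import re

# \b stands in for egrep's \< and \> word-boundary metasequences.
doubled = re.compile(r"\b([A-Za-z]+) +\1\b")

found = doubled.search("find the the mistake")
remembered = found.group(1)   # the word captured by the parentheses

# ' the theft ': \1 matches the 'the' in 'theft', but the trailing
# word boundary fails because 'f' follows, so there is no match.
theft = doubled.search("blame the theft")

# Without the leading boundary, the 'is' inside 'this' would pair with 'is'.
no_leading_boundary = re.search(r"([A-Za-z]+) +\1\b", "this is fine")
with_leading_boundary = doubled.search("this is fine")

print(found.group(0), remembered, theft, with_leading_boundary)
```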

When I decided to include this example, I actually tried it on what I had written so far. (I used a version of egrep that supports both \<…\> and backreferencing.) To make it more useful, so that ' The the ' would also be found, I used the case-insensitive -i option mentioned on page 15. [*]

[*] Be aware that some versions of egrep , including older versions of the popular GNU offering, have a bug with the -i option such that it doesn't apply to backreferences. Thus, it finds "the the" but not "The the."

Here's the command I ran:

 % egrep -i '\<([a-z]+) +\1\>' files…

I was surprised to find fourteen sets of mistakenly ' doubled doubled ' words! I corrected them, and since then have built this type of regular-expression check into the tools that I use to produce the final output of this book, to ensure none creep back in.

As useful as this regular expression is, it is important to understand its limitations. Since egrep considers each line in isolation, it isn't able to find when the ending word of one line is repeated at the beginning of the next. For this, a more flexible tool is needed, and we will see some examples in the next chapter.
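
The line-at-a-time limitation is easy to reproduce. The sketch below mimics the per-line check in Python's re module (an assumption for illustration, with \b standing in for \< and \> and re.IGNORECASE standing in for -i ). The function name suspect_lines and the sample text are hypothetical.

```python
import re

# Case-insensitive doubled-word check, applied to one line at a time.
doubled = re.compile(r"\b([a-z]+) +\1\b", re.IGNORECASE)

def suspect_lines(text):
    """Return the lines that contain a doubled word, checking each line in isolation."""
    return [line for line in text.splitlines() if doubled.search(line)]

sample = (
    "The the cat sat.\n"
    "It chased the\n"        # 'the' ends this line...
    "the mouse.\n"           # ...and starts the next one
    "Nothing wrong here."
)

print(suspect_lines(sample))
```

Only the first line is reported. The 'the' that ends the second line and the 'the' that begins the third are never seen together, so that pair slips through.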

1.4.11. The Great Escape

One important thing I haven't mentioned yet is how to actually match a character that a regular expression would normally interpret as a metacharacter. For example, if I searched for the Internet hostname ega.att.com using ega.att.com , it could end up matching something like ' megawatt computing '. Remember, . is a metacharacter that matches any character, including a space.

The metasequence to match an actual period is a period preceded by a backslash: ega\.att\.com . The sequence \. is described as an escaped period or escaped dot , and you can do this with all the normal metacharacters, except in a character class. [*]

[*] Most programming languages and tools allow you to escape characters within a character class as well, but most versions of egrep do not, instead treating ' \ ' within a class as a literal backslash to be included in the list of characters.

A backslash used in this way is called an "escape": when a metacharacter is escaped, it loses its special meaning and becomes a literal character. If you like, you can consider the sequence to be a special metasequence to match the literal character. It's all the same.

As another example, you could use \([a-zA-Z]+\) to match a word within parentheses, such as ' (very) '. The backslashes in the \( and \) sequences remove the special interpretation of the parentheses, leaving them as literals to match parentheses in the text.
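
Here is a small sketch of both escapes, again using Python's re module as an assumed stand-in for egrep. The sample strings, including "megawatt computing", are illustrative choices.

```python
import re

unescaped = r"ega.att.com"    # each . matches any character
escaped = r"ega\.att\.com"    # each \. matches only a literal period

# The unescaped dots happily match 'w' and a space.
false_hit = re.search(unescaped, "megawatt computing") is not None
no_false_hit = re.search(escaped, "megawatt computing") is None
real_hit = re.search(escaped, "mail ega.att.com today") is not None

# Escaped parentheses match literal parentheses instead of grouping.
word_in_parens = re.search(r"\([a-zA-Z]+\)", "a (very) short note").group(0)

print(false_hit, no_false_hit, real_hit, word_in_parens)
```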

When used before a non-metacharacter, a backslash can have different meanings depending upon the version of the program. For example, we have already seen how some versions treat \< , \> , \1 , etc. as metasequences. We will see many more examples in later chapters.


