Project 78. Be Clever with Regular Expressions | Mac OS X Unix 101 Byte-Sized Projects

Project 78. Be Clever with Regular Expressions

"How do I search for text that matches a specific pattern?"

This project shows you how to write advanced regular expressions. A regular expression is formed to match a particular text pattern. Project 77 introduces regular expressions.

If you're not familiar with regular expressions, read Project 77, on which this project builds. This project introduces advanced techniques such as:

Repeaters with bounds to state more precisely how many times a preceding atom must repeat
Subexpressions to turn regular expressions into atoms, thereby making them subject to repeaters
Branches to form choices more complex than the simple character alternatives offered by bracket expressions

Repeaters with Bounds

Project 77 introduced regular expressions and showed you how to use an atom followed by a simple repeater to match multiple occurrences of the specified atom. But the alternatives offered by the simple repeaters *, +, and ? are not always adequate. We can match one or more letters by using the expression

'[[:alpha:]]+'

but not exactly nine letters or between five and nine letters, inclusive.

To specify a precise number of matches, use a bounded repeater, which has the syntax {n,m}. You can use a bounded repeater wherever you'd otherwise use a simple repeater. We'll demonstrate the use of bounded repeaters by matching words of a particular length and words that fall within a particular length range. First, let's use egrep and a regular expression to match words of exactly nine letters. The input file contains a list of words, one per line.

Tip

Attempting to specify a repeater such as {,9} to mean 9 or fewer is not legal syntax. Instead, use either {1,9} or {0,9} as appropriate.

To match all nine-letter words, we employ a bounded repeater in a regular expression such as

$ egrep '^[[:alpha:]]{9}$' /usr/share/dict/web2
...
pinealism
pinealoma
pineapple
pinedrops
pinewoods
pinheaded
...

(The file /usr/share/dict/web2 contains a handy word list.)

The syntax element {9} is a bounded repeater that matches exactly nine occurrences of the preceding atom: a letter. Note that we've used a caret symbol and a dollar symbol to ensure that the expression matches a complete line; otherwise, the expression would also match a portion of all words more than nine characters in length.
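To see the effect of the anchors, here is a minimal sketch that runs the expression with and without them. It assumes a small made-up word list in place of /usr/share/dict/web2, and it uses grep -E, which is equivalent to egrep; the variable names are illustrative only.

```shell
# Hypothetical sample list standing in for /usr/share/dict/web2.
words=$(printf '%s\n' apple pineapple pineapples pinedrops)

# Anchored: only lines that consist of exactly nine letters.
anchored=$(printf '%s\n' "$words" | grep -E '^[[:alpha:]]{9}$')

# Unanchored: any line that contains nine consecutive letters.
unanchored=$(printf '%s\n' "$words" | grep -E '[[:alpha:]]{9}')

printf 'anchored:\n%s\n' "$anchored"
printf 'unanchored:\n%s\n' "$unanchored"
```

With this list, the unanchored form also reports the ten-letter pineapples, because its first nine letters satisfy the expression.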

To extract all words five to nine characters in length, we supply two comma-separated bounds.

'^[[:alpha:]]{5,9}$'

Whereas the first example matched words like pineapple, this example matches from apple through dappled to pineapple.
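A quick way to check the two bounds is to feed the expression a handful of words on either side of the range. The sketch below assumes a made-up word list, and uses grep -E as the equivalent of egrep.

```shell
# Hypothetical words of 4, 5, 7, 9, and 10 letters.
five_to_nine=$(printf '%s\n' pear apple dappled pineapple pineapples |
  grep -E '^[[:alpha:]]{5,9}$')

# Only the 5-, 7-, and 9-letter words should survive.
printf '%s\n' "$five_to_nine"
```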

To search for nine or more occurrences, supply only the lower bound. The next example matches space-separated numbers of nine or more digits.

' [[:digit:]]{9,} '
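The open-ended bound can be tried in the same way. This sketch assumes a few made-up input lines and uses grep -E as the equivalent of egrep; note that a number at the very start of a line is not reported, because the expression demands a space on both sides.

```shell
# Hypothetical input: numbers of 8, 9, and 13 digits, plus a
# 9-digit number that has no leading space.
long_numbers=$(printf '%s\n' \
  'id 12345678 end' \
  'id 123456789 end' \
  'id 1234567890123 end' \
  '123456789 end' |
  grep -E ' [[:digit:]]{9,} ')

printf '%s\n' "$long_numbers"
```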

Subexpressions

By enclosing a regular expression in parentheses, we turn it into an atom (see Project 77). Such an expression is termed a subexpression. A subexpression is seen as a single entity and, therefore, can be made the subject of a repeater.

Here's an example in which we check for valid IP addresses, which look like 10.0.2.120 or 217.155.168.147. We first construct a regular expression that matches one to three digits, followed by a dot.

'[[:digit:]]{1,3}\.'

Then we turn the regular expression into a subexpression, which allows us to repeat the whole expression three times with a repeater.

'([[:digit:]]{1,3}\.){3}'
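To watch the repeated subexpression at work, we can ask grep to print only the text that matched. This is a sketch under assumptions: the input lines are made up, and the -o option (print only the matching part) is a grep feature not covered in this project.

```shell
# The subexpression "digits then dot" must repeat three times,
# so only the first three dotted groups of the address are printed.
prefix=$(printf '%s\n' 'host 10.0.2.120 up' 'version 1.2' |
  grep -Eo '([[:digit:]]{1,3}\.){3}')

printf '%s\n' "$prefix"
```

The line containing 1.2 produces nothing, because it has only one group of digits followed by a dot.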

Tip

Any regular expression enclosed in parentheses becomes a subexpression. A subexpression is an atom and can be treated just like a simple character, which may be incorporated into a new regular expression. The new expression may be enclosed in parentheses and reduced in its turn to an atom. There is no effective limit to this process, at least not until your head starts to hurt!

Finally, we add the original expression to the end, but without the trailing dot. For good measure, we also assume an IP address to be surrounded by nondigit characters. This prevents matching an invalid address such as 1111111.2.3.4444444. Here's the final regular expression.

'[^[:digit:]]([[:digit:]]{1,3}\.){3}[[:digit:]]{1,3}[^[:digit:]]'

Note

We must extend the definition of an atom given in Project 77 to include a subexpression.

If you try this expression, you'll notice that it fails on IP addresses that fall at the start or end of a line. We need the address to be preceded by either the start of the line or a nondigit, and followed by either a nondigit or the end of the line. We can achieve this by using branches, introduced in the next section.
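The limitation is easy to reproduce. The following sketch assumes four made-up input lines and uses grep -E as the equivalent of egrep; only the address that has a nondigit on both sides is reported.

```shell
re='[^[:digit:]]([[:digit:]]{1,3}\.){3}[[:digit:]]{1,3}[^[:digit:]]'

# Hypothetical input: an address in mid-line, one at the start of a
# line, one at the end of a line, and an invalid over-long "address".
matched=$(printf '%s\n' \
  'gateway 10.0.2.120 ok' \
  '10.0.2.120 is first' \
  'last is 217.155.168.147' \
  'bad 1111111.2.3.4444444 x' |
  grep -E "$re")

printf '%s\n' "$matched"
```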

Branches

Branches define sets of alternative matches. A regular expression may specify one or more branches separated by vertical-bar (|) symbols and will match anything that matches one of the branches. Each branch is itself a regular expression.

Here's a regular expression with seven branches that matches any one of the days of the week.

'monday|tuesday|wednesday|thursday|friday|saturday|sunday'

Regular Expressions in a Nutshell

Technically, a regular expression is one or more branches separated by |. A branch is one or more pieces concatenated, and a piece is an atom optionally followed by a simple or bounded repeater. A subexpression encloses a regular expression in parentheses and makes it an atom.

This alone is limited, and an attempt to match a full date will not work. The following regular expression, for example, doesn't do what we probably intended.

'saturday|sunday jan|feb [[:digit:]]{1,2}'

It actually specifies a line that matches any of the three alternatives.

'saturday' OR 'sunday jan' OR 'feb [[:digit:]]{1,2}'.
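A short experiment makes the misreading visible. This sketch assumes a made-up activities list and uses grep -E as the equivalent of egrep; lines that are clearly not weekend dates in January or February still match one of the three branches.

```shell
re='saturday|sunday jan|feb [[:digit:]]{1,2}'

# Hypothetical activities list.
loose=$(printf '%s\n' \
  'saturday dec 25' \
  'sunday jan 12' \
  'monday feb 3' \
  'sunday mar 1' |
  grep -E "$re")

# 'saturday dec 25' matches the first branch and 'monday feb 3'
# matches the third, although neither is what we intended.
printf '%s\n' "$loose"
```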

Tip

Don't get confused by the two meanings of the caret symbol. Outside a bracket expression, it's a start-of-line anchor, and as the first character inside a bracket expression, it negates the sense of the match.

To get around this problem, we employ subexpressions. Combining multiple branches as subexpressions within larger regular expressions enables complex and highly useful matches. We might use the following to pull out weekend events for January and February from an activities list.

'(saturday|sunday) (jan|feb) ([[:digit:]]{1,2})'
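Here is a sketch of the grouped expression in use, assuming a made-up activities list and grep -E as the equivalent of egrep. Each set of parentheses confines its alternatives, so a line must now contain a weekend day, then one of the two months, then a day number.

```shell
re='(saturday|sunday) (jan|feb) ([[:digit:]]{1,2})'

# Hypothetical activities list.
weekend=$(printf '%s\n' \
  'saturday dec 25' \
  'sunday jan 12' \
  'monday feb 3' \
  'saturday feb 28 picnic' |
  grep -E "$re")

printf '%s\n' "$weekend"
```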

We might match days of the week by using the shorter regular expression

'(mon|tues|wednes|thurs|fri|satur|sun)day'
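As a sanity check, the long and short forms can be compared on the same input. The sketch below assumes a made-up list of the seven day names plus two decoys, and uses grep -E as the equivalent of egrep.

```shell
days=$(printf '%s\n' monday tuesday wednesday thursday friday \
  saturday sunday someday today)

# Seven-branch form.
long=$(printf '%s\n' "$days" |
  grep -E 'monday|tuesday|wednesday|thursday|friday|saturday|sunday')

# Shorter form: a seven-branch subexpression followed by "day".
short=$(printf '%s\n' "$days" |
  grep -E '(mon|tues|wednes|thurs|fri|satur|sun)day')

# -c counts matching lines instead of printing them.
count=$(printf '%s\n' "$days" |
  grep -Ec '(mon|tues|wednes|thurs|fri|satur|sun)day')

printf 'matched %s lines\n' "$count"
```

Both forms select the same seven lines here; someday and today are rejected because "day" is not preceded by one of the seven prefixes.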

We'll conclude our look at branches by completing the IP address-matching example started in the preceding section. Recall that we wanted an IP address to be preceded by either the start of the line or a nondigit, and followed by either a nondigit or the end of the line. We specify the former by using a two-branch subexpression such as

'(^|[^[:digit:]])'

Here's the full regular expression, split across three lines for clarity. It should be entered in Terminal on a single line and, obviously, as part of a command.

'(^|[^[:digit:]])
([[:digit:]]{1,3}\.){3}[[:digit:]]{1,3}
([^[:digit:]]|$)'
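Entered on a single line as part of a command, the expression might be exercised as follows. This is a sketch that assumes four made-up input lines and uses grep -E as the equivalent of egrep.

```shell
re='(^|[^[:digit:]])([[:digit:]]{1,3}\.){3}[[:digit:]]{1,3}([^[:digit:]]|$)'

# Hypothetical input: addresses at the start, middle, and end of a
# line, plus an invalid over-long "address".
found=$(printf '%s\n' \
  '10.0.2.120 is first' \
  'gateway 10.0.2.120 ok' \
  'last is 217.155.168.147' \
  'bad 1111111.2.3.4444444 x' |
  grep -E "$re")

printf '%s\n' "$found"
```

The first three lines are reported and the last is not. The expression checks only the shape of an address, so a string such as 999.999.999.999 would also match.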

Capture Patterns

Suppose that we need to match a particular pattern and that that pattern must occur twice. That's easy to do; we use the repeater {2}. However, if our requirement is for the text that matched the first time to be repeated verbatim the second time, that's not so easy. (Imagine a search that'd match Monty Monty and Sugar Sugar but not Monty Python or Sugar Babes.)

Tip

When subexpressions are nested, capture and playback gets a bit confusing and is best avoided.

To pull off such a trick, we use capture and playback. Whenever a subexpression is matched, the matched string is captured in a buffer. The first string to be captured is held in buffer 1; the second, in buffer 2; and so on. This happens automatically. To replay a buffer, simply specify \1 or \2, and so on.

Here's an example in which we capture the entire expression and replay it.

'(b[aeiou]g)\1'

This expression will match bigbig and bagbag, but not bigbag. Remember that a pattern is captured only when it's a subexpression, that is, when it's enclosed in parentheses.
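The sketch below tries capture and playback on two made-up inputs. It assumes a grep whose extended expressions support playback with \1, as described above, and uses grep -E as the equivalent of egrep. The second expression is one possible answer to the Monty Monty puzzle; its anchors are an addition that stops a partial word from being replayed.

```shell
# The captured b?g must be repeated verbatim.
doubled=$(printf '%s\n' bigbig bagbag bigbag |
  grep -E '(b[aeiou]g)\1')

# A whole word, a space, then the same word again.
twice=$(printf '%s\n' 'Monty Monty' 'Monty Python' \
  'Sugar Sugar' 'Sugar Babes' |
  grep -E '^([[:alpha:]]+) \1$')

printf '%s\n' "$doubled" "$twice"
```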

Search and Replace

Capture patterns play an important role in search and replace. Editing tools such as sed support the capture-and-playback technique, allowing a pattern captured from the search string to be played back into the replacement string.

Here's an example in which we process a file that contains information about books. The entry for a book occupies one line in the file (shown split into three shorter lines in this book) and has the following format.

Level: Beginning/Intermediate/Advanced, "101 Projects",
CBS Category: Macintosh/Unix, Covers: Mac OS X 10.4 Tiger,
Price: $34.99, Author: Mayo.

Our mission, should we choose to accept it, is not so impossible. We must extract the quoted title and price, and report them in the following format.

Cost $34.99 Title "101 Projects"

Learn More

Projects 59 and 61 cover the sed text editor.

To realize this, we match an entire line, capturing the title and price, and replace the line with Cost <price> Title <title>.

Let's build the regular expression piece by piece. Start with .* to match everything up to the title. Match the title with ".*", and capture it with (".*"). Then match intervening information with .*, and match and capture the price with (\$[0-9]{1,3}\.[0-9]{2}). Note that we escape $ and . because they are special characters. Finally, match the remainder of the line with .*.

The sed command's syntax for search and replace is

s/search-pattern/replace-pattern/

Our replace pattern is Cost \2 Title \1.

Basic Regular Expressions

Basic regular expressions differ in that branches are not supported.

Also, () and {} are not special characters and represent themselves. For subexpressions and bounds, we must type \( \) and \{ \}. This reverses the meaning of () versus \(\) and {} versus \{\} that we see in extended regular expressions. One day, it'll catch you out!
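The reversal can be seen by running the same text through grep with and without option -E. This is a sketch with made-up input; plain grep is assumed to use basic regular expressions and grep -E extended ones.

```shell
# Basic syntax: unescaped braces are ordinary characters.
literal=$(printf '%s\n' 'a{2}' aa | grep 'a{2}')

# Basic syntax: escaped braces form a bound.
basic=$(printf '%s\n' 'a{2}' aa | grep 'a\{2\}')

# Extended syntax: unescaped braces form a bound.
extended=$(printf '%s\n' 'a{2}' aa | grep -E 'a{2}')

# Basic syntax: escaped parentheses form a subexpression
# that can be replayed with \1.
replayed=$(printf '%s\n' bigbig bigbag | grep '\(b[aeiou]g\)\1')

printf '%s\n' "$literal" "$basic" "$extended" "$replayed"
```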

Putting this together, we get the following command.

$ sed -E 's/.*(".*").*(\$[0-9]{1,3}\.[0-9]{2}).*/Cost \2 Title \1/'

Option -E to sed tells it to switch on extended regular expressions. Let's try this command, adding a little extra sophistication to display only matching lines with option -n (don't display input lines) and flag p (display matching lines) placed at the end of the substitute function.

$ sed -En 's/.*(".*").*(\$[0-9]{1,3}\.[0-9]{2}).*/Cost \2 Title \1/p'
Level: Beginning/Intermediate/Advanced, "101 Projects", CBS Category: Macintosh/Unix, Covers: Mac OS X 10.4 Tiger, Price: $34.99, Author: Mayo.
Cost $34.99 Title "101 Projects"
TEST"TITLE"TEST$111.22TEST
Cost $111.22 Title "TITLE"
<Control-d>
$
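The quotation marks and the dollar sign appear in the output because they sit inside the capturing parentheses. If a bare title and price are wanted, one possible variation is to move the quotes and the \$ outside the parentheses so that they are matched but not captured. The sketch below feeds the sample book entry to sed from a shell variable rather than typing it at the keyboard; the variable names are illustrative.

```shell
line='Level: Beginning/Intermediate/Advanced, "101 Projects", CBS Category: Macintosh/Unix, Covers: Mac OS X 10.4 Tiger, Price: $34.99, Author: Mayo.'

# Quotes and \$ are now outside the parentheses, so buffers 1 and 2
# hold only the title text and the digits of the price.
bare=$(printf '%s\n' "$line" |
  sed -En 's/.*"(.*)".*\$([0-9]{1,3}\.[0-9]{2}).*/Cost \2 Title \1/p')

printf '%s\n' "$bare"
```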