Project77.Learn Regular Expressions | Mac OS X Unix 101 Byte-Sized Projects

Project 77. Learn Regular Expressions

"How do I search for text that matches a specific pattern?"

This project shows you how to write regular expressions. A regular expression is formed to match a particular text pattern. Project 78 covers advanced use of regular expressions.

Note

Regular expressions are not the same as globbing (covered in Projects 11 and 12). Globbing is implemented by the shell and by commands such as find, and matches a pattern against a list of filenames (usually, the files in the current directory). Regular expressions are more powerful and are used by text-processing commands to match against lines of text, usually to search for and replace text.

The Match Game

Regular expressions are widely used in Unix, and most text-processing tools support them. The most common uses include:

Searching a text file for lines containing particular text
Filtering the output from other commands for relevant lines
Performing search and replace in text editors such as nano and TextWrangler, and in text-editing tools such as sed and awk
Performing text manipulation in a programming language such as Perl or PHP

The simplest regular expressions are plain text sequences (such as index.html) that match other instances of themselves. More often, regular expressions contain a mix of wildcards, repetitions, and alternatives.

Unix supports three types of regular expressions, which unfortunately don't share a compatible syntax. The three forms are modern (also termed extended); obsolete (also termed basic); and Perl regular expressions (introduced by the Perl programming language). This project focuses on extended regular expressions, but a section at the end highlights how extended expressions differ from basic expressions. Perl regular expressions, the most powerful of all, are not generally supported by the Unix tools covered in this book.

Basic regular expressions are supported by the grep and sed commands. Extended regular expressions are supported by the awk command and by the extended variants of grep and sed, namely egrep (or grep -E) and sed -E.

Learn More

Refer to Project 23 for examples of using the grep command.

Regular expressions are employed in many of the projects in this book. Read this project to brush up on the theory, and you'll be ready to apply it in a more practical way to other projects.

Learn More

Refer to Projects 59 to 62 for more information on the sed and awk commands.

Basic Rules

Depending on context, regular-expression matching is performed on a string (a sequence of characters) or a line of text. Matched text cannot span lines but must be wholly contained within one line. Matching is normally done in a case-sensitive manner, but most tools let you specify that matching should be case insensitive.

Tip

Remember that the escaping character \ is a special character itself. To use it literally, escape it by typing \\.

Regular expressions are greedy: Given a choice of several possible matches, they always choose the longest one. Consider the text

backup.user.sh

A regular-expression match against "anything followed by dot" will return backup.user. but the shorter match backup. will not be returned.
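Greedy matching can be observed directly. The sketch below assumes your grep supports the -o option (print only the matched text), which this project does not otherwise cover; the variable name longest is purely illustrative.

```shell
# grep -o prints only the part of the line that matched, not the whole line.
# The pattern '.*\.' means "any characters, followed by a literal dot".
longest=$(printf '%s\n' 'backup.user.sh' | grep -oE '.*\.')
echo "$longest"   # prints: backup.user.
```

The shorter candidate backup. is never reported, because the match is extended as far as it can go.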

Regular-Expression Syntax

A regular expression consists of a sequence of atoms and repeaters.

An atom is any of the following:

A character (most characters match themselves)
. (matches any single character)
^ (matches the start of a line or string)
$ (matches the end of a line or string)
[...] (called a bracketed expression; represents exactly one instance from a group of possible characters and is explained more fully later in this project)

A repeater is any of the following:

* (matches zero or more occurrences of the preceding atom)
+ (matches one or more occurrences of the preceding atom)
? (matches zero or one occurrence of the preceding atom)

The syntax is explained by examples in the rest of the project. Project 78 covers advanced regular expressions, extending the syntax shown here.

Tip

When you enter a regular expression on the command line, remember that characters such as star have a special meaning to the shell and must be escaped from it. It's good practice always to surround regular expressions with single quotes.

To match a character such as star (*), which normally has a special meaning, you must escape its special meaning by preceding it with a backslash (\). The special characters that must be escaped in extended regular expressions are

. ^ $ * ? + \ [ { () |
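As a small sketch of escaping in practice, the following compares an escaped dot and star with an unescaped dot. The sample lines and variable names are hypothetical, chosen only to show the difference.

```shell
# Hypothetical sample data, one entry per line.
sample='cost: 5*3
cost: 553
a.b
axb'

# \* matches a literal star, so only the first line matches.
star=$(printf '%s\n' "$sample" | grep -E '5\*3')

# \. matches a literal dot; an unescaped dot matches any single character.
literal_dot=$(printf '%s\n' "$sample" | grep -E 'a\.b')
any_char=$(printf '%s\n' "$sample" | grep -E 'a.b')

echo "$star"
echo "$literal_dot"
echo "$any_char"
```

Here the escaped form 'a\.b' selects only a.b, whereas 'a.b' selects both a.b and axb.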

Simple Regular Expressions

Let's form a very simple regular expression that we might use to match an incomplete crossword entry: a p blank l blank. In regular-expression language, a single-character blank is represented by a dot, so here's our regular expression.

'ap.l.'

When applied to a list of words, one per line, this expression will match lines that contain apple, apply, and aptly. It will also match lines that contain words such as appliance, pineapple, and inapplicable.

When applied to lines (or long strings) of text, the regular expression 'ap.l.' will match lines such as an apple a day and clap loudly because those lines contain matches. It's not necessary to match the entire line or string.
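To see this behavior for yourself, you might feed a short list to grep -E. The list below is hypothetical sample data, and the variable names are illustrative only.

```shell
# Hypothetical sample data: words and phrases, one per line.
words='apple
apply
aptly
appliance
pineapple
maple
an apple a day
clap loudly
grape'

# Keep every line that contains a, p, any character, l, any character.
matches=$(printf '%s\n' "$words" | grep -E 'ap.l.')
echo "$matches"
```

The lines maple and grape are dropped: maple has no character between ap and l, and grape has no l at all. Every other line contains a match somewhere within it.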

Tip

A simple method of dry-running a regular expression uses the command egrep (or grep for basic regular expressions). Type

$ egrep 'the-regular-expression'

but give no filename. You can now experiment by typing lines of text, which egrep will read from standard input. Lines that match the regular expression will be echoed back when you press Return; those that don't, won't. Press Control-d when you're finished.

Anchors

The special symbol caret (^) matches the start of a line or string; it matches a position rather than a character. Repeating our example from the previous section, we find that the regular expression

'^ap.l.'

matches lines that start with ap.l. and won't match pineapple, inapplicable, or clap loudly.

Tip

To match empty lines or strings, use the regular expression '^$'.

Similarly, the special symbol dollar ($) matches the end of a line or string, so the regular expression

'ap.l.$'

matches words that end with ap.l. and won't match appliance or inapplicable.

It's important to realize that anchoring applies to the whole line (or string), not to individual words. The line red apple does not match ^apple, because the caret anchors to the start of the line; the line apple mac does match. Similarly, apple$ matches red apple but not apple mac.

Tip

Pass the -w option to grep to tell it to match only whole words. With -w, the expression 'apple' matches the line "an apple a day" but not the line "a pineapple a day".
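A quick way to check the whole-word behavior is sketched here, using two hypothetical input lines and an illustrative variable name.

```shell
# With -w, a match must be bounded by non-word characters (or line ends).
whole_word=$(printf '%s\n' 'an apple a day' 'a pineapple a day' | grep -w 'apple')
echo "$whole_word"   # prints: an apple a day
```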

Finally, we match an entire line or string by applying both anchors. To match only apple, apply, and aptly, use the regular expression

'^ap.l.$'
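The three anchored forms can be compared side by side. This is a minimal sketch with hypothetical sample lines; the variable names are illustrative.

```shell
# Hypothetical sample data, one entry per line.
lines='apple
apply
aptly
appliance
pineapple
red apple
apple mac'

# Anchored at the start, at the end, and at both ends of the line.
at_start=$(printf '%s\n' "$lines" | grep -E '^ap.l.')
at_end=$(printf '%s\n' "$lines" | grep -E 'ap.l.$')
whole_line=$(printf '%s\n' "$lines" | grep -E '^ap.l.$')

echo "start: $(echo $at_start)"
echo "end:   $(echo $at_end)"
echo "whole: $(echo $whole_line)"
```

Only apple, apply, and aptly survive when both anchors are applied, because the five-character pattern must then account for the entire line.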

Repeaters

To search for fixed patterns of text separated by arbitrary text, we must specify any number of any character. We do this by combining the atom dot (.) to mean any character and the repeater star (*) to mean zero or more repetitions thereof. Here are some examples that use a text file, paren.

$ cat paren
Here is (some text) in parentheses.
Here we have () empty parentheses.
Here we have (a) letter in parentheses.
Here we have no parentheses.

Let's search for lines that contain anything, including nothing, enclosed in parentheses. To do so, we create a regular expression that means (, followed by anything or nothing, followed by ). We must escape the parentheses (and braces, too) because they are special characters (a topic discussed at greater length in Project 78).

Tip

You may employ any number of repeaters in a regular expression.

$ egrep '\(.*\)' paren
Here is (some text) in parentheses.
Here we have () empty parentheses.
Here we have (a) letter in parentheses.

To exclude the empty parentheses, we specify one or more repetitions of any character by using the special character plus (+) instead of star.

Learn More

Project 78 shows you how to apply finer control to repeaters and how to repeat constructs that are more complex than a single character.

$ egrep '\(.+\)' paren
Here is (some text) in parentheses.
Here we have (a) letter in parentheses.

To specify zero or one repetitions, we use the special character query (?).

$ egrep '\(.?\)' paren
Here we have () empty parentheses.
Here we have (a) letter in parentheses.

Repeaters can be applied to specific characters as well as to special characters like dot. Here are two regular expressions, the first matching two or more consecutive dashes (-); the second matching star, then one or two dots, and then star.

$ egrep -- '--+' test.txt
$ egrep '\*\.\.?\*' test.txt

The first example uses a trick to prevent the egrep command from thinking the regular expression is an option because it begins with a dash. A double-dash option preceding the regular expression signifies that no more options follow. The second example uses the special character \ to escape the star and dot characters.
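The contents of test.txt are not shown, so the sketch below substitutes hypothetical sample lines piped in on standard input. It uses grep -E, which is equivalent to egrep; the variable names are illustrative.

```shell
# Hypothetical stand-in for the contents of test.txt.
sample='one - dash
two -- dashes
three --- dashes
*.* one dot
*..* two dots
*...* three dots'

# A dash followed by one or more dashes: two or more in a row.
# The bare -- tells grep that no more options follow.
dashes=$(printf '%s\n' "$sample" | grep -E -- '--+')

# A star, a dot, an optional second dot, and a star.
stars=$(printf '%s\n' "$sample" | grep -E '\*\.\.?\*')

echo "$dashes"
echo "$stars"
```

The single dash and the three-dot line are both rejected, which shows the lower bound of + and the upper bound of ? at work.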

Repeaters are summarized in "Regular-Expression Syntax" earlier in this project.

Bracket Expressions

To match any digit 0 to 9, or perhaps any letter, we list the alternative characters and have the text match exactly one of those characters. Regular expressions provide bracket expressions for just such a purpose, whereby we list the alternative characters in square brackets. For example, the regular expression

'b[aeiou]g'

matches bag, beg, big, bog, and bug. It does not match byg or boog.
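You could confirm this with a short word list, as sketched here; the variable name is illustrative.

```shell
# A bracket expression stands for exactly one character from the list.
vowel_words=$(printf '%s\n' bag beg big bog bug byg boog | grep -E 'b[aeiou]g')
echo $vowel_words   # prints: bag beg big bog bug
```

The word boog fails because the brackets stand for a single character, and two vowels sit between the b and the g.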

Learn More

Project 78 shows you how to choose alternatives that are more complex than a single character.

The following regular expression will match any line that starts with a, b, or c (uppercase or lowercase) immediately followed by a two-digit number.

'^[aAbBcC][0123456789][0123456789]'

To match all characters except a particular set, enclose the characters to be excluded in brackets, preceded by a caret (^) symbol. To match any character except a digit, specify the regular expression

'[^0123456789]'
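Both expressions above can be tried against a few hypothetical lines. Note that a negated list still has to match one character, so '[^0123456789]' selects lines that contain at least one non-digit, not lines that are free of digits. The sample data and variable names are illustrative.

```shell
# Hypothetical sample data, one entry per line.
codes='a42 first
B07 second
c5 third
d99 fourth
12345'

# Lines starting with a, b, or c (either case) and then two digits.
labelled=$(printf '%s\n' "$codes" | grep -E '^[aAbBcC][0123456789][0123456789]')

# Lines containing at least one character that is not a digit.
has_nondigit=$(printf '%s\n' "$codes" | grep -E '[^0123456789]')

echo "$labelled"
echo "$has_nondigit"
```

Only the all-digit line 12345 is excluded by the second expression.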

Tip

All special characters lose their meaning inside bracketed expressions, where they should not (and in fact cannot) be escaped.

Character Ranges

A character range is a bracketed expression with a start point and an end point separated by a dash. Here are some simple examples to illustrate this.

All digits is '[0-9]' and equivalent to '[0123456789]'.
All letters is '[a-zA-Z]'.
All letters plus [ ] ^ and - is '[][a-zA-Z^-]'. To clarify, we specify the character set ][a-zA-Z^- enclosed in square brackets.

In the last example, we employed a few tricks to include the special characters ], ^, and - in the list. To include a ] character, make it first in the bracketed list (or the second when you're negating the list with a caret symbol). A caret must not be the first in the list, and a dash character should be the last in the list.
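The placement rules can be checked with a few hypothetical lines, each containing a different character surrounded by digits. The variable names are illustrative.

```shell
# Hypothetical sample data: one interesting character per line.
chars='1[2
3]4
5^6
7-8
9x0
12345
6+7'

# ] comes first, ^ is not first, and - comes last, so all three are literal.
listed=$(printf '%s\n' "$chars" | grep -E '[][a-zA-Z^-]')
echo "$listed"
```

The lines 12345 and 6+7 are not selected, because neither a digit nor a plus sign is in the list.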

Character Classes

Regular expressions provide special character classes to prevent the need to list many characters in bracketed expressions. To match all letters and digits, for example, we specify the class alnum (alphanumeric). A class name should be surrounded by [: :] and enclosed in brackets.

Tip

The sequence [[:alpha:]][[:digit:]] differs from [[:alpha:][:digit:]]. The former specifies a letter followed by a digit; the latter specifies either a letter or a digit.

Let's pose a matching problem and solve it by using character classes. We want to match lines starting with one or more digits, followed by one or more letters, followed by a colon, followed by anything. The line may optionally start with white space. Here's an example.

        42HHGG: Life, the universe, and everything.

We might describe our matching criteria by using a regular expression such as

'^[[:space:]]*[[:digit:]]+[[:alpha:]]+:'

The regular expression uses the character classes space (any white space, including tab), digit (0-9), and alpha (a-z, A-Z). The rest of the expression is formed with the now-familiar repeaters and anchors.
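As a rough check of this expression, the sketch below runs it against the example line plus a few hypothetical lines that break one rule each. The variable names are illustrative.

```shell
# The first line is the example from the text; the rest are hypothetical.
entries='        42HHGG: Life, the universe, and everything.
7a: short entry
HHGG42: letters before digits
42: digits only
42HHGG no colon'

# Optional white space, digits, letters, then a colon.
matched=$(printf '%s\n' "$entries" | grep -E '^[[:space:]]*[[:digit:]]+[[:alpha:]]+:')
echo "$matched"
```

Only the first two lines are selected. The others fail because the letters precede the digits, the letters are missing, or the colon is missing.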

The following character classes are defined.

alnum alpha blank cntrl digit graph lower print punct space upper xdigit

Tip

To discover exactly which characters are included in a particular class, read the Section 3 man page for the corresponding library function. The library function is named like the class but starts with is. To read about character class [:space:], for example, look at the man page for isspace by typing

$ man 3 isspace

Basic Regular Expressions

Basic regular expressions do not support the repeaters ? and +. The expression 'a+' is equivalent to 'aa*', however, and 'a?' has an equivalent that uses bounds (see Project 78). Also, () and {} are not special characters in basic regular expressions and need not be escaped.
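These differences can be sketched with plain grep, which uses basic regular expressions. The bound syntax \{0,1\} is the standard basic form but is not introduced in this project, so treat it as a preview of Project 78; the sample data and variable names are hypothetical.

```shell
# Hypothetical sample data, one entry per line.
sample='b
ab
aab'

# 'aa*' in a basic expression does the job of 'a+' in an extended one.
basic_plus=$(printf '%s\n' "$sample" | grep 'aa*b')
extended_plus=$(printf '%s\n' "$sample" | grep -E 'a+b')

# A bound of zero to one stands in for '?' (assumed syntax, see lead-in).
basic_optional=$(printf '%s\n' "$sample" | grep '^a\{0,1\}b$')

# Unescaped parentheses are ordinary characters in a basic expression.
literal_parens=$(printf '%s\n' 'f(x)' 'fx' | grep '(x)')

echo "$basic_plus"
echo "$basic_optional"
echo "$literal_parens"
```

The first two commands select the same lines, ab and aab, which is the equivalence the text describes.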