Project 77. Learn Regular Expressions

"How do I search for text that matches a specific pattern?"

This project shows you how to write regular expressions. A regular expression is a pattern written to match particular text. Project 78 covers advanced use of regular expressions.
The Match Game

Regular expressions are widely used in Unix, and most text-processing tools support them. The most common uses include:
The simplest regular expressions are plain text sequences (such as index.html) that match other instances of themselves. More often, regular expressions contain a mix of wildcards, repetitions, and alternatives.

Unix supports three types of regular expressions, which unfortunately don't share a compatible syntax. The three forms are modern (also termed extended), obsolete (also termed basic), and Perl regular expressions (introduced by the Perl programming language). This project focuses on extended regular expressions, but a section at the end highlights how extended expressions differ from basic expressions. Perl regular expressions, the most powerful of all, are not generally supported by the Unix tools covered in this book.

Basic regular expressions are supported by the grep and sed commands. Extended regular expressions are supported by the awk command and by the extended variants of grep and sed, namely egrep (or grep -E) and sed -E.
Regular expressions are employed in many of the projects in this book. Read this project to brush up on the theory, and you'll be ready to apply it in a more practical way to other projects.
Basic Rules

Depending on context, regular-expression matching is performed on a string (a sequence of characters) or a line of text. Matched text cannot span lines but must be wholly contained within one line. Matching is normally case sensitive, but most tools let you specify that matching should be case insensitive.
Regular expressions are greedy: Given a choice of several possible matches, they always choose the longest one. Consider the text

    backup.user.sh

A regular-expression match against "anything followed by dot" will return backup.user. but the shorter match backup. will not be returned.

Regular-Expression Syntax

A regular expression consists of a sequence of atoms and repeaters. An atom is any of the following:
A repeater is any of the following:
The syntax is explained by examples in the rest of the project. Project 78 covers advanced regular expressions, extending the syntax shown here.
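The greedy rule described under Basic Rules can be checked from the command line. This is a minimal sketch; it assumes a grep that supports the -o option (print only the matched text), which this project does not otherwise use.

```shell
# Greedy matching sketch. The -o option is an assumption: it prints only
# the part of each line that matched, not the whole line.
text='backup.user.sh'

# '.*\.' means "anything, followed by a literal dot".
greedy=$(printf '%s\n' "$text" | grep -oE '.*\.')
echo "$greedy"
```

The match should extend to the last dot on the line rather than stopping at the first.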
To match a character such as star (*), which normally has a special meaning, you must escape its special meaning by preceding it with a backslash (\). The special characters that must be escaped in extended regular expressions are

    . ^ $ * ? + \ [ { ( ) |

Simple Regular Expressions

Let's form a very simple regular expression that we might use to match an incomplete crossword entry: a p blank l blank. In regular-expression language, a single-character blank is represented by a dot, so here's our regular expression.

    'ap.l.'

When applied to a list of words, one per line, this expression will match lines that contain apple, apply, and aptly. It will also match lines that contain words such as appliance, pineapple, and inapplicable.

When applied to lines (or long strings) of text, the regular expression 'ap.l.' will match lines such as an apple a day and clap loudly because those lines contain matches. It's not necessary to match the entire line or string.
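As a quick check of the crossword pattern, the sketch below feeds a small word list to grep -E. The word list is made up for illustration, and maple and ample are included as words that should not match.

```shell
# Hypothetical word list, one word per line (sample data, not from the project).
words='apple
apply
aptly
appliance
pineapple
maple
ample'

# 'ap.l.' is a, p, any character, l, any character, anywhere in the line.
matches=$(printf '%s\n' "$words" | grep -E 'ap.l.')
echo "$matches"
```

maple fails because only one character separates the p from the e, and ample contains no ap at all.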
Anchors

The special symbol caret (^) matches the start of a line or string; it matches a position rather than a character. Repeating our example from the previous section, we find that the regular expression

    '^ap.l.'

matches lines that start with ap.l. and won't match pineapple, inapplicable, or clap loudly.
Similarly, the special symbol dollar ($) matches the end of a line or string, so the regular expression

    'ap.l.$'

matches lines that end with ap.l. and won't match appliance or inapplicable.

It's important to realize that anchoring applies to the whole line (or string), not to individual words. The line red apple does not match ^apple, because the caret anchors to the start of the line, but the line apple mac does match. Similarly, apple$ matches red apple but not apple mac.
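The line-level behavior of the two anchors can be sketched as follows. The three sample lines are assumptions chosen to mirror the red apple and apple mac discussion.

```shell
# Sample lines (illustrative only).
lines='red apple
apple mac
pineapple'

# Caret anchors to the start of the line, not to the start of a word.
starts=$(printf '%s\n' "$lines" | grep -E '^apple')
echo "starts with apple: $starts"

# Dollar anchors to the end of the line.
ends=$(printf '%s\n' "$lines" | grep -E 'apple$')
echo "ends with apple:"
echo "$ends"
```

Only apple mac starts with apple, while both red apple and pineapple end with it.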
Finally, we match an entire line or string by applying both anchors. To match only apple, apply, and aptly, use the regular expression

    '^ap.l.$'

Repeaters

To search for fixed patterns of text separated by arbitrary text, we must specify any number of any character. We do this by combining the atom dot (.) to mean any character and the repeater star (*) to mean zero or more repetitions thereof. Here are some examples that use a text file, paren.

    $ cat paren
    Here is (some text) in parentheses.
    Here we have () empty parentheses.
    Here we have (a) letter in parentheses.
    Here we have no parentheses.

Let's search for lines that contain anything, including nothing, enclosed in parentheses. To do so, we create a regular expression that means (, followed by anything or nothing, followed by ). We must escape the parentheses (and braces, too) because they are special characters (a topic discussed at greater length in Project 78).
    $ egrep '\(.*\)' paren
    Here is (some text) in parentheses.
    Here we have () empty parentheses.
    Here we have (a) letter in parentheses.

To exclude the empty parentheses, we specify one or more repetitions of any character by using the special character plus (+) instead of star.
    $ egrep '\(.+\)' paren
    Here is (some text) in parentheses.
    Here we have (a) letter in parentheses.

To specify zero or one repetitions, we use the special character query (?).

    $ egrep '\(.?\)' paren
    Here we have () empty parentheses.
    Here we have (a) letter in parentheses.

Repeaters can be applied to specific characters as well as to special characters like dot. Here are two regular expressions, the first matching two or more consecutive dashes (-); the second matching star, then one or two dots, and then star.

    $ egrep -- '--+' test.txt
    $ egrep '\*\.\.?\*' test.txt

The first example uses a trick to prevent the egrep command from thinking the regular expression is an option because it begins with a dash. A double-dash option preceding the regular expression signifies that no more options follow. The second example uses the special character \ to escape the star and dot characters. Repeaters are summarized in "Regular-Expression Syntax" earlier in this project.

Bracket Expressions

To match any digit 0 to 9, or perhaps any letter, we list the alternative characters and have the text match exactly one of those characters. Regular expressions provide bracket expressions for just such a purpose, whereby we list the alternative characters in square brackets. For example, the regular expression

    'b[aeiou]g'

matches bag, beg, big, bog, and bug. It does not match byg or boog.
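The dash, star, and bracket patterns above can be tried without creating any files. In this sketch the sample lines stand in for test.txt and are purely hypothetical, and grep -E is used as the equivalent of egrep.

```shell
# Hypothetical sample lines standing in for test.txt (not from the project).
sample='a - b
a -- b
a --- b
*.*
*..*
*...*'

# Two or more consecutive dashes; the "--" argument ends option parsing.
dashes=$(printf '%s\n' "$sample" | grep -E -- '--+')
echo "$dashes"

# A star, then one or two dots, then a star.
stars=$(printf '%s\n' "$sample" | grep -E '\*\.\.?\*')
echo "$stars"

# Bracket expression: exactly one vowel between b and g.
words='bag
beg
byg
boog
bug'
vowel=$(printf '%s\n' "$words" | grep -E 'b[aeiou]g')
echo "$vowel"
```

The single dash, the three-dot line, byg, and boog should all be filtered out.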
The following regular expression will match any line that starts with a, b, or c (uppercase or lowercase) immediately followed by two digits.

    '^[aAbBcC][0123456789][0123456789]'

To match all characters except a particular set, enclose the characters to be excluded in brackets, preceded by a caret (^) symbol. To match any character except a digit, specify the regular expression

    '[^0123456789]'
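Both bracket expressions can be exercised on a few made-up lines. The sample data below is an assumption, and LC_ALL=C is set only to keep character matching predictable across locales.

```shell
# Illustrative input lines (not from the project).
codes='a42 first
B07 second
c5 third
d99 fourth
12345'

# Lines starting with a, b, or c (either case) followed by two digits.
labelled=$(printf '%s\n' "$codes" | LC_ALL=C grep -E '^[aAbBcC][0123456789][0123456789]')
echo "$labelled"

# Lines containing at least one character that is not a digit.
nondigit=$(printf '%s\n' "$codes" | LC_ALL=C grep -E '[^0123456789]')
echo "$nondigit"
```

Note that the negated expression rejects only the all-digit line: a line matches as soon as it contains any single non-digit character.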
Character Ranges

A character range is a bracketed expression with a start point and an end point separated by a dash. Here are some simple examples to illustrate this.
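As a rough sketch of how ranges behave with grep -E, the patterns and sample lines below are illustrative assumptions. LC_ALL=C is set because ranges such as a-z can match differently in other locales. The last pattern places ] first and - last in the list so that both are taken literally.

```shell
# Made-up sample lines for illustration.
sample='x7
Q
hello
a]b
---'

# Any line containing a digit in the range 0 to 9.
digits=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '[0-9]')
# Any line containing an uppercase letter.
upper=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '[A-Z]')
# Lines made up entirely of lowercase letters.
lower_only=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '^[a-z]+$')
# Lines containing a literal ], ^, or -.
special=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '[]^-]')

echo "digits: $digits"
echo "upper: $upper"
echo "lowercase only: $lower_only"
echo "special:"
echo "$special"
```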
In the last example, we employed a few tricks to include the special characters [, -, and ^ in the list. To include a ] character, make it first in the bracketed list (or second when you're negating the list with a caret symbol). A literal caret must not be first in the list, and a literal dash should be last in the list.

Character Classes

Regular expressions provide special character classes so that you don't have to list many characters in bracketed expressions. To match all letters and digits, for example, we specify the class alnum (alphanumeric). A class name should be surrounded by [: :] and enclosed in brackets.
Let's pose a matching problem and solve it by using character classes. We want to match lines starting with one or more digits, followed by one or more letters, followed by a colon, followed by anything. The line may optionally start with white space. Here's an example.

    42HHGG: Life, the universe, and everything.

We might describe our matching criteria by using a regular expression such as

    '^[[:space:]]*[[:digit:]]+[[:alpha:]]+:'

The regular expression uses the character classes space (any white space, including tab), digit (0-9), and alpha (a-z, A-Z). The rest of the expression is formed with the now-familiar repeaters and anchors.

The following character classes are defined.

    alnum alpha blank cntrl digit graph lower print punct space upper xdigit
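The character-class expression can be tested against the example line plus a few variations. Only the first entry comes from the text; the other three lines are hypothetical and were added to show what the pattern rejects.

```shell
# First line is the project's example; the rest are made-up test cases.
entries='42HHGG: Life, the universe, and everything.
   7abc: indented entry
HHGG42: letters before digits
42: digits but no letters'

pattern='^[[:space:]]*[[:digit:]]+[[:alpha:]]+:'

# Keep only lines with optional leading white space, digits, letters, colon.
matched=$(printf '%s\n' "$entries" | grep -E "$pattern")
echo "$matched"
```

The indented entry still matches because [[:space:]]* absorbs the leading blanks, while the last two lines fail on the order and presence of the digit and letter runs.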