Specifying Patterns | UNIX: The Complete Reference, Second Edition (Complete Reference Series)

Because pattern matching is such a fundamental part of awk, the awk language provides a rich set of operators for specifying patterns. You can use these operators to specify patterns that match a particular word, a phrase, a group of words that have some letters in common (such as all words starting with A), or a number within a certain range. You can also use special operators to combine simple patterns into more complex patterns. These are the basic pattern types in awk:

Regular expressions are sequences of letters, numbers, and special characters that specify strings to be matched. awk accepts the same regular expressions as the egrep command, discussed in Chapter 19.
Comparison patterns are patterns in which you compare two elements using operators such as == (equal to), != (not equal to), > (greater than), and < (less than).
Compound patterns are built up from other patterns, using the logical operators and (&&), or (||), and not (!).
Range patterns have a starting pattern and an ending pattern. They search for the starting pattern and then match every line until they find a line that matches the ending pattern.
BEGIN and END are special built-in patterns that send instructions to your awk program to perform certain actions before or after the main processing loop.

Regular Expressions

You can search for lines that match a regular expression by enclosing it in a pair of slashes (/…/). The simplest kind of regular expression is just a word or string. For example, to match lines containing the phrase “boxing wizards” anywhere in the line, you can use the pattern

 /boxing wizards/

Expressions can also include escape sequences. The most common are \t for TAB and \n for newline.

Table 21–1 shows the special symbols that you can use to form more complex regular expressions.

Table 21–1: awk Regular Expressions
Symbol	Definition	Example	Matches
.	Matches any single character.	th.nk	think, thank, thunk, etc.
\	Quotes the following character.	\*\*\*	***
*	Matches zero or more repetitions of the previous item.	ap*le	ale, apple, etc.
+	Matches one or more repetitions of the previous item.	.+	any non-empty line
?	Matches the previous item zero or one times.	index\.html?	index.htm, index.html
^	Matches the beginning of a line.	^If	any line beginning with If
$	Matches the end of a line.	\.$	any line ending in a period
[]	Matches any one of the characters inside.	[QqXx]	Q, q, X, or x
[a-z]	Matches any one of the characters in the range.	[0-9]*	any number: 0110, 27, 9876, etc.
[^ ]	Matches any character not inside.	[^\n]	any character but newline
()	Group a portion of the pattern.	script(\.sh)?	script, script.sh
|	Matches either the value before or after the |.	(E|e)xit	Exit, exit
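
A quick way to check a few of the entries in Table 21–1 is to feed awk some sample words and see which ones come back. The following is a minimal sketch, assuming a POSIX-compatible awk; the sample words and the shell variable names are made up for illustration.

```shell
# Each printf writes one word per line; awk prints the lines that match.

# "." stands for any single character, so "thing" (n-g, not n-k) is rejected.
dot=$(printf 'think\nthank\nthing\n' | awk '/th.nk/')

# "(\.sh)?" makes the whole group optional; ^ and $ anchor the match
# so that "scripts" is rejected.
optional=$(printf 'script\nscript.sh\nscripts\n' | awk '/^script(\.sh)?$/')

# "(E|e)" accepts either letter in the first position.
either=$(printf 'Exit\nexit\nquit\n' | awk '/^(E|e)xit$/')

printf '%s\n' "$dot" "$optional" "$either"
```

A pattern with no action prints every matching line, which is why none of these commands needs an explicit print.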

To illustrate how you can use regular expressions, consider a file containing the inventory of items in a stationery store. The file inventory includes a one-line record for each item. Each record contains the item name, how many are on hand, how much each costs, and how much each sells for:

 pencils  108  .11  .15
 markers   50  .45  .75
 pens      24  .53  .75
 notebooks 15  .75 1.00
 erasers  200  .12  .15
 books     10 1.00 1.50

If you want to search for the price of markers, but you cannot remember whether you called them “marker” or “markers,” you could use the regular expression

 /markers?/

as the pattern.

To find out how many books you have on hand, you could use the pattern

 /^books/

to find entries that contain “books” only at the beginning of a line. This would match the record for books, but not the one for notebooks.
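
To try both patterns from the command line, you could run something like the following sketch. The shell function inventory is a stand-in for the inventory file (it simply prints the same six records), and the variable names are illustrative.

```shell
# Stand-in for the inventory file: prints the six sample records.
inventory() {
    cat <<'EOF'
pencils  108  .11  .15
markers   50  .45  .75
pens      24  .53  .75
notebooks 15  .75 1.00
erasers  200  .12  .15
books     10 1.00 1.50
EOF
}

# "s?" makes the final s optional, so "marker" or "markers" would match.
markers=$(inventory | awk '/markers?/')

# "^" anchors the match to the start of the line, so "notebooks" is skipped.
# The action prints only field 2, the quantity on hand.
books=$(inventory | awk '/^books/ { print $2 }')

printf '%s\n' "$markers"
printf 'books on hand: %s\n' "$books"
```

Without the ^ anchor, /books/ would also select the notebooks record.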

Case Sensitivity

In awk, string patterns are case sensitive. For example, the pattern /student/ wouldn’t match the string “Student”. In gawk, you can set the built-in variable IGNORECASE to a nonzero value (for example, in a BEGIN action) if you want to make matching case-insensitive.

Alternatively, you can use tr to convert all of your input to lowercase before running awk, like this:

 cat inputfiles | tr '[A-Z]' '[a-z]' | awk -f programfile

Some versions of awk have the functions tolower and toupper to help you control the case of strings (see the later section “Working with Strings”).
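
If your awk provides tolower (it is part of POSIX awk, so most current versions do), you can match without regard to case and still print the original line. This is a small sketch with made-up sample input:

```shell
# tolower() returns a lowercase copy of its argument, so the test ignores
# case while the printed line keeps its original capitalization.
matches=$(printf 'Student Smith\nstaff Jones\nSTUDENT Lee\n' |
    awk 'tolower($0) ~ /student/')

printf '%s\n' "$matches"
```

Unlike the tr approach, this leaves the rest of your program free to work with the input exactly as it was typed.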

Comparison Operators

The preceding section dealt with string matches where the target string may occur anywhere in a line. Sometimes, though, you want to compare a string or pattern with a specific string. For example, suppose you want to find all the items in the earlier example that sell for 75 cents. You want to match .75, but only when it is in the fourth field (selling price).

You use the tilde (~) operator to test whether a string matches a regular expression. For example,

 $4 ~ /^\.75/

checks whether the string $4 contains a match for the expression /^\.75/. That is to say, it checks whether field 4 begins with .75 (the backslash is necessary to prevent the . from being interpreted as a special character). This pattern will match strings such as “.75”, “.7552”, and “.75potatoes”. If you wish to test whether field 4 contains precisely the string .75 and nothing else, you could use

 $4 ~ /^\.75$/

You can test for nonmatching strings with !~. This is similar to ~, but it matches when the string on the left does not contain a match for the regular expression on the right.
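
Here is a sketch of both operators applied to single fields of the inventory data. The shell function inventory is a stand-in that prints the sample records; it and the variable names are illustrative only.

```shell
# Stand-in for the inventory file: prints the six sample records.
inventory() {
    cat <<'EOF'
pencils  108  .11  .15
markers   50  .45  .75
pens      24  .53  .75
notebooks 15  .75 1.00
erasers  200  .12  .15
books     10 1.00 1.50
EOF
}

# Field 4 must be exactly ".75". The notebooks record has .75 in field 3,
# not field 4, so it is not selected.
exact=$(inventory | awk '$4 ~ /^\.75$/ { print $1 }')

# Field 1 must NOT contain "book" anywhere.
others=$(inventory | awk '$1 !~ /book/ { print $1 }')

printf 'sell for .75: %s\n' $exact
printf 'not books:    %s\n' $others
```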

The == operator checks whether two strings are identical. For example,

 $1==$3

checks to see whether the value of field 1 is equal to the value of field 3.

Do not confuse == with =. The former (==) tests whether two strings are identical. The single equal sign (=) assigns a value to a variable. For example,

 $1="hello"

sets the value of field 1 equal to “hello”. It would be used as part of an action statement. On the other hand,

 $1=="hello"

compares the value of field 1 to the string “hello”. It could be a pattern statement.

The != operator tests whether the values of two expressions are not equal. For example,

 $1 != "pencils"

is a pattern that matches any line where the first field is not “pencils.”
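
The difference between the pattern operators and assignment is easiest to see in a complete rule. In this sketch the inventory function and the replacement name “ballpoints” are made up for illustration.

```shell
# Stand-in for the inventory file: prints the six sample records.
inventory() {
    cat <<'EOF'
pencils  108  .11  .15
markers   50  .45  .75
pens      24  .53  .75
notebooks 15  .75 1.00
erasers  200  .12  .15
books     10 1.00 1.50
EOF
}

# "==" in the pattern selects the record; "=" in the action changes field 1.
renamed=$(inventory | awk '$1 == "pens" { $1 = "ballpoints"; print }')

# "!=" selects every record whose first field is not "pencils".
count=$(inventory | awk '$1 != "pencils" { n++ } END { print n + 0 }')

printf '%s\n' "$renamed"
printf 'records other than pencils: %s\n' "$count"
```

Assigning to a field makes awk rebuild the line with single spaces between fields, so the renamed record loses its original column alignment.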

Comparing Order

The comparison operators <, >, <=, and >= can compare two numbers or two strings. With numbers, they work just as you would expect. For example,

 $1 <= 10

matches any line in which field 1 is a number less than or equal to 10.

When used with strings, these operators compare the strings character by character according to the standard ASCII order. For example, the following comparison is true:

 "vanished" < "vorpal"

Remember that in the ASCII character code, all uppercase letters precede all lowercase letters, so the following comparison is also true:

 "Horse" < "cart"
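
The following sketch tries one numeric comparison and the two string comparisons. It sets LC_ALL=C for the string test on the assumption that your awk follows the locale's collating order; the inventory function and variable names are illustrative.

```shell
# Stand-in for the inventory file: prints the six sample records.
inventory() {
    cat <<'EOF'
pencils  108  .11  .15
markers   50  .45  .75
pens      24  .53  .75
notebooks 15  .75 1.00
erasers  200  .12  .15
books     10 1.00 1.50
EOF
}

# Numeric comparison: items with 24 or fewer on hand.
low=$(inventory | awk '$2 <= 24 { print $1 }')

# String comparison: LC_ALL=C requests plain ASCII ordering.
order=$(LC_ALL=C awk 'BEGIN {
    if ("vanished" < "vorpal" && "Horse" < "cart")
        print "both comparisons are true"
    else
        print "at least one comparison is false"
}')

printf 'low stock: %s\n' $low
printf '%s\n' "$order"
```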

Compound Patterns

Compound patterns are combinations of patterns, joined with the logical operators && (and), || (or), and ! (not). You can create very complex compound patterns.

For example, here is a small but useful program that works on a text file formatted with HTML. It checks whether each starting tag is followed by exactly one ending tag:

 # Test for errors first, so that a rule that changes bold
 # cannot trigger the error rule on the same line.
 /<B>/ && bold==1 { print "Missing </B> before line " NR }
 /<B>/ && bold==0 { bold=1 }
 /<\/B>/ && bold==0 { print "Extra </B> at line " NR }
 /<\/B>/ && bold==1 { bold=0 }

This program looks for the HTML tag <B>. If it finds one, it marks the start of bold text (with the variable bold). If bold was already set, it prints an error message. The program also searches for the ending tag </B>. If it finds one, it changes the variable bold to show that the text is no longer bold. If the text wasn’t bold, it prints an error message. You could easily extend this program to test for other tags.

Compound patterns are useful for numeric variables as well as strings. For example,

 $1 < 10 && $2 >= 30

matches a line if field 1 is less than 10 and field 2 is greater than or equal to 30.
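
Here is a sketch of all three logical operators applied to the inventory data. The thresholds are arbitrary, and the inventory function is a stand-in that prints the sample records.

```shell
# Stand-in for the inventory file: prints the six sample records.
inventory() {
    cat <<'EOF'
pencils  108  .11  .15
markers   50  .45  .75
pens      24  .53  .75
notebooks 15  .75 1.00
erasers  200  .12  .15
books     10 1.00 1.50
EOF
}

# and: fewer than 30 on hand AND a selling price under 1.00
both=$(inventory | awk '$2 < 30 && $4 < 1 { print $1 }')

# or: at least 200 on hand OR a selling price over 1.00
any=$(inventory | awk '$2 >= 200 || $4 > 1 { print $1 }')

# not: everything for which "more than 20 on hand" is false
few=$(inventory | awk '!($2 > 20) { print $1 }')

printf 'and: %s\n' "$both"
printf 'or:  %s\n' $any
printf 'not: %s\n' $few
```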

Range Patterns

The syntax for a range pattern is

 startPattern, endPattern

This causes awk to compare each line of input to startPattern. When it finds a line that matches startPattern, that line and every line following it will match the range. awk will continue to match every line until it encounters one that matches endPattern. After that line, the range will no longer match lines of input (until another copy of startPattern appears).

In other words, a range pattern matches all the lines from a starting pattern to an ending pattern. If you have a table in which at least one of the fields is sorted, you can use a range to pull out a section of data. For example, if you have a table in which each line is numbered, you could use this program to print lines 100 to 199:

 $ awk '/100/, /199/ {print}' datafile
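
Keep in mind that /100/ matches any line that contains the characters 100, so in a longer file the range can start again at a line such as 1000. If the line number is a field of its own, comparing that field is a stricter test. The sketch below assumes the number is in field 1 and generates its own 1200-line sample; the function name numbers is made up.

```shell
# Sample data: 1200 lines, each holding just its own line number.
numbers() {
    awk 'BEGIN { for (i = 1; i <= 1200; i++) print i }'
}

# Regular-expression range: matches 100-199, then starts again at 1000
# (which contains "100") and runs until 1199 (which contains "199").
by_regex=$(numbers | awk '/100/, /199/ { n++ } END { print n + 0 }')

# Comparison range on field 1: matches only lines 100 through 199.
by_value=$(numbers | awk '$1 == 100, $1 == 199 { n++ } END { print n + 0 }')

printf 'regex range matched %s lines\n' "$by_regex"   # 300
printf 'value range matched %s lines\n' "$by_value"   # 100
```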

BEGIN and END

BEGIN and END are special patterns that separate parts of your awk program from the normal awk loop that examines each line of input. The BEGIN pattern applies before any lines are read. It causes the action following it to be performed before any input is processed.

This allows you to set a variable or print a heading before the main loop of the awk program. For example, suppose you are writing a program that will generate a table. You could use a BEGIN statement to print a header at the top:

 BEGIN {print "Artist     Album     SongTitle     TrackNum"}

The END pattern is similar to BEGIN, but it applies after the last line of input has been read. Suppose you need to count the number of lines in a file. You could use

 { numline = numline + 1 }
 END { print "There were " numline " lines of input." }

This awk program counts each line of input and then prints the total when all the input has been processed. A shorter way to write this program is

 END { print "There were " NR " lines of input." }

which uses a built-in awk variable to automatically count the lines.
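
BEGIN, a main rule, and END often appear together in one program. The following sketch prints a heading, one line per inventory item, and a total at the end. The report layout, the idea of valuing stock at its selling price, and the inventory function are all assumptions made for illustration.

```shell
# Stand-in for the inventory file: prints the six sample records.
inventory() {
    cat <<'EOF'
pencils  108  .11  .15
markers   50  .45  .75
pens      24  .53  .75
notebooks 15  .75 1.00
erasers  200  .12  .15
books     10 1.00 1.50
EOF
}

report=$(inventory | awk '
# Runs once, before any input is read.
BEGIN { print "Item        Value" }

# Runs for every record: quantity on hand times selling price.
{
    value = $2 * $4
    total += value
    printf "%-10s %6.2f\n", $1, value
}

# Runs once, after the last record has been read.
END { printf "%-10s %6.2f\n", "TOTAL", total }
')

printf '%s\n' "$report"
```

The variable total needs no initialization in BEGIN, because awk treats an unset variable as zero in arithmetic.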