6.8. Regular Expressions


To awk, a regular expression is a pattern that consists of characters enclosed in forward slashes. Awk supports the use of regular expression metacharacters (the same as egrep) to modify the regular expression in some way. If a string in the input line is matched by the regular expression, the resulting condition is true, and any actions associated with the expression are executed. If no action is specified and an input line is matched by the regular expression, the record is printed. See Table 6.5.

Example 6.22.
 % nawk '/Mary/' employees
 Mary Adams     5346     11/4/63     28765

Table 6.5. awk Regular Expression Metacharacters

Metacharacter   What It Does

^               Matches at the beginning of the string
$               Matches at the end of the string
.               Matches any single character
*               Matches zero or more of the preceding character
+               Matches one or more of the preceding character
?               Matches zero or one of the preceding character
[ABC]           Matches any one character in the set of characters A, B, or C
[^ABC]          Matches any one character not in the set of characters A, B, or C
[A-Z]           Matches any one character in the range from A to Z
A|B             Matches either A or B
(AB)+           Matches one or more sets of AB; e.g., AB, ABAB, ABABAB
\*              Matches a literal asterisk
&               Used in the replacement string to represent what was found in the search string


EXPLANATION

All lines in the employees file containing the regular expression pattern Mary are displayed.

Example 6.23.
 % nawk '/Mary/{print $1, $2}' employees
 Mary Adams

EXPLANATION

The first and second fields of all lines in the employees file containing the regular expression pattern Mary are displayed.
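A few of the metacharacters from Table 6.5 can be tried out with the short sketch below. It is an illustration only: it assumes a POSIX-style awk is installed under the name awk (substitute nawk or gawk as needed), rebuilds the employees file with single spaces between fields, and uses the hypothetical shell variables either, zeros, and wrapped to hold the results.

```shell
# Recreate the sample employees file (single-space separators are an assumption).
employees=$(mktemp)
cat > "$employees" <<'EOF'
Tom Jones 4424 5/12/66 543354
Mary Adams 5346 11/4/63 28765
Sally Chang 1654 7/22/54 650000
Billy Black 1683 9/23/44 336500
EOF

# A|B: first field of every line containing either Tom or Mary.
either=$(awk '/Tom|Mary/ {print $1}' "$employees")

# + and $: lines that end in a 0 followed by one or more 0s.
zeros=$(awk '/00+$/ {print $1}' "$employees")

# &: in the replacement string of sub(), & stands for the matched text.
wrapped=$(awk '/Mary/ {sub(/Mary/, "[&]"); print $1, $2}' "$employees")

printf '%s\n' "$either" "$zeros" "$wrapped"
rm -f "$employees"
```

Because sub() is called without a third argument, it edits the whole record, and awk resplits the fields before the print statement runs.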

The metacharacters listed in Table 6.6 are supported by most versions of grep and sed, but are generally not supported by awk. (Gawk is a partial exception: it accepts the word anchors \< and \> as a GNU extension.)

Table 6.6. Metacharacters NOT supported

Metacharacter   Function

\< \>           Word anchors
\( \)           Backreferencing
\{ \}           Repetition
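Where word anchors or repetition counts are needed, they can often be approximated with the metacharacters awk does provide. The following is a sketch of one such workaround, not a general replacement. It assumes a POSIX-style awk installed as awk and a space-separated copy of the employees file; the variable names substring, whole_word, and four_digits are hypothetical.

```shell
# Recreate the sample employees file (single-space separators are an assumption).
employees=$(mktemp)
cat > "$employees" <<'EOF'
Tom Jones 4424 5/12/66 543354
Mary Adams 5346 11/4/63 28765
Sally Chang 1654 7/22/54 650000
Billy Black 1683 9/23/44 336500
EOF

# Plain /Bill/ also matches inside Billy.
substring=$(awk '/Bill/ {n++} END {print n+0}' "$employees")

# Approximating \<Bill\>: require a non-letter (or a line boundary) on both sides.
whole_word=$(awk '/(^|[^A-Za-z])Bill([^A-Za-z]|$)/ {n++} END {print n+0}' "$employees")

# Approximating [0-9]\{4\}: write the bracket expression out four times.
four_digits=$(awk '/ [0-9][0-9][0-9][0-9] / {n++} END {print n+0}' "$employees")

echo "substring=$substring whole_word=$whole_word four_digits=$four_digits"
rm -f "$employees"
```

Each command counts matching lines, so the first two results show the difference between a substring match and a whole-word match on the same file.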


6.8.1 Matching on an Entire Line

A stand-alone regular expression is tested against the entire line; if no action is given, every line in which a match occurs is printed in full. The regular expression can be anchored to the beginning of the line with the ^ metacharacter.

Example 6.24.
 % nawk '/^Mary/' employees
 Mary Adams     5346     11/4/63     28765

EXPLANATION

All lines in the employees file that start with the regular expression Mary are displayed.

Example 6.25.
 % nawk '/^[A-Z][a-z]+ /' employees
 Tom Jones        4424     5/12/66     543354
 Mary Adams       5346     11/4/63     28765
 Sally Chang      1654     7/22/54     650000
 Billy Black      1683     9/23/44     336500

EXPLANATION

All lines in the employees file beginning with an uppercase letter, followed by one or more lowercase letters, followed by a space, are displayed.
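The $ metacharacter from Table 6.5 anchors a stand-alone expression to the end of the line in the same way, and the two anchors can be combined. The sketch below assumes a POSIX-style awk installed as awk and a space-separated copy of the employees file with no trailing blanks; the variable names ends_00 and both_ends are hypothetical.

```shell
# Recreate the sample employees file (single-space separators are an assumption).
employees=$(mktemp)
cat > "$employees" <<'EOF'
Tom Jones 4424 5/12/66 543354
Mary Adams 5346 11/4/63 28765
Sally Chang 1654 7/22/54 650000
Billy Black 1683 9/23/44 336500
EOF

# $ anchors the match to the end of the line: salaries ending in 00.
ends_00=$(awk '/00$/ {print $1}' "$employees")

# ^ and $ together: the line must start with Tom and end with 54.
both_ends=$(awk '/^Tom.*54$/ {print $1}' "$employees")

printf '%s\n' "$ends_00" "$both_ends"
rm -f "$employees"
```

Note that Sally Chang's record contains 54 in the date field, but it is not selected by the second command because the line neither starts with Tom nor ends with 54.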

6.8.2 The match Operator

The match operator, the tilde (~), is used to match an expression within a record or field.

Example 6.26.
 % cat employees
 Tom Jones       4424     5/12/66     543354
 Mary Adams      5346     11/4/63     28765
 Sally Chang     1654     7/22/54     650000
 Billy Black     1683     9/23/44     336500

 % nawk '$1 ~ /[Bb]ill/' employees
 Billy Black     1683     9/23/44     336500

EXPLANATION

Any lines matching Bill or bill in the first field are displayed.

Example 6.27.
 % nawk '$1 !~ /ly$/' employees
 Tom Jones       4424     5/12/66     543354
 Mary Adams      5346     11/4/63     28765

EXPLANATION

Any lines in which the first field does not end in ly are displayed.
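The match operator works the same way on any field, and the right-hand side may also be a variable holding the pattern. The sketch below illustrates both under a few assumptions: a POSIX-style awk installed as awk, a space-separated copy of the employees file, and hypothetical names (a_to_c, not_sixties, from_var, and the awk variable pat).

```shell
# Recreate the sample employees file (single-space separators are an assumption).
employees=$(mktemp)
cat > "$employees" <<'EOF'
Tom Jones 4424 5/12/66 543354
Mary Adams 5346 11/4/63 28765
Sally Chang 1654 7/22/54 650000
Billy Black 1683 9/23/44 336500
EOF

# ~ on the second field: last names starting with A, B, or C.
a_to_c=$(awk '$2 ~ /^[A-C]/ {print $1, $2}' "$employees")

# !~ on the fourth field: dates whose year is not in the 60s.
not_sixties=$(awk '$4 !~ /\/6[0-9]$/ {print $1}' "$employees")

# The pattern can also come from a variable set with -v.
from_var=$(awk -v pat='^[A-C]' '$2 ~ pat {print $2}' "$employees")

printf '%s\n' "$a_to_c" "$not_sixties" "$from_var"
rm -f "$employees"
```

Inside the slash-delimited form, a literal slash has to be escaped as \/, which is why the date pattern is written that way.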

The POSIX Character Class

POSIX (the Portable Operating System Interface) is an industry standard intended to ensure that programs are portable across operating systems. To be portable, POSIX recognizes that countries or locales may differ in how characters are encoded, in their alphabets, in the symbols used to represent currency, and in how times and dates are represented. To handle different types of characters, POSIX added the bracketed character classes shown in Table 6.7 to both basic and extended regular expressions. Gawk supports these character classes, whereas older versions of awk and nawk do not.

Table 6.7. Bracketed Character Class Added by POSIX

Bracket Class   Meaning

[:alnum:]       Alphanumeric characters
[:alpha:]       Alphabetic characters
[:cntrl:]       Control characters
[:digit:]       Numeric characters
[:graph:]       Nonblank characters (not spaces, control characters, etc.)
[:lower:]       Lowercase letters
[:print:]       Like [:graph:], but includes the space character
[:punct:]       Punctuation characters
[:space:]       All whitespace characters (newlines, spaces, tabs)
[:upper:]       Uppercase letters
[:xdigit:]      Allows digits in a hexadecimal number (0-9a-fA-F)


The class [:alnum:] is another way of saying A-Za-z0-9. To be recognized as a regular expression, the class must be enclosed in another set of brackets. For example, A-Za-z0-9, by itself, is not a regular expression, but [A-Za-z0-9] is. Likewise, [:alnum:] should be written [[:alnum:]]. The difference between the first form, [A-Za-z0-9], and the bracketed form, [[:alnum:]], is that the first form depends on ASCII character encoding, whereas the second form allows characters from other languages to be represented in the class, such as Swedish rings and German umlauts.

Example 6.28.
 % gawk '/[[:lower:]]+g[[:space:]]+[[:digit:]]/' employees
 Sally Chang 1654 7/22/54 650000

EXPLANATION

Gawk searches for one or more lowercase letters, followed by a g, followed by one or more spaces, followed by a digit. (If you are a Linux user, awk is linked to gawk, making both awk and gawk valid commands.)
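The bracketed classes from Table 6.7 can also be combined with the match operator. The following sketch assumes that gawk, or another awk that understands POSIX character classes, is installed; it picks gawk when available and falls back to awk otherwise. The file layout and the variable names awk_cmd, numeric_ids, and names are assumptions made for the illustration.

```shell
# Recreate the sample employees file (single-space separators are an assumption).
employees=$(mktemp)
cat > "$employees" <<'EOF'
Tom Jones 4424 5/12/66 543354
Mary Adams 5346 11/4/63 28765
Sally Chang 1654 7/22/54 650000
Billy Black 1683 9/23/44 336500
EOF

# Prefer gawk; fall back to awk (assumed here to support POSIX classes).
awk_cmd=$(command -v gawk || command -v awk)

# Count records whose third field consists only of digits.
numeric_ids=$("$awk_cmd" '$3 ~ /^[[:digit:]]+$/ {n++} END {print n+0}' "$employees")

# Locale-aware form of /^[A-Z][a-z]+ /: uppercase, lowercase letters, whitespace.
names=$("$awk_cmd" '/^[[:upper:]][[:lower:]]+[[:space:]]/ {print $1}' "$employees")

printf '%s\n' "$numeric_ids" "$names"
rm -f "$employees"
```

The second command is the character-class counterpart of Example 6.25, with [[:space:]] also accepting a tab where the original pattern required a space.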



UNIX Shells by Example (4th Edition)
ISBN: 013147572X
Year: 2004
Pages: 454
Authors: Ellie Quigley
