Regular Expressions | UNIX: The Complete Reference, Second Edition (Complete Reference Series)

A regular expression is a string used for pattern matching. Expressions can be used to search for strings that match a certain pattern, and sometimes to manipulate those strings. Many UNIX System commands (including grep, vi, emacs, sed, and awk) use regular expressions for searching and for text manipulation. Perl has taken the best features for pattern matching from these commands and made them even more powerful.

Pattern Matching

Here’s an example of using a pattern to match a string:

 my @emails = ('derek@rsc.org', 'johng@elsinore.dk',
               'kcb@rsc.org', 'olivier@elsinore.dk');
 foreach my $addr (@emails) {
     if ($addr =~ /elsinore/) {
         print "$addr matches elsinore.\n";
     }
 }

This example will produce output for johng@elsinore.dk and olivier@elsinore.dk, but not for the other two strings. (By the way, note that single quotes are used around the e-mail addresses. This prevents Perl from trying to interpret the @ signs as indicating an array.)

As you can see, a string is compared to a regular expression pattern with the =~ operator. The pattern itself is enclosed in a pair of forward slashes. The string is considered a match for the pattern if any part of the string matches the pattern.

If no other string is specified, the pattern is compared to $_. So

 foreach (@emails) {
     if (/$pattern/i) {
         print "$_ matches $pattern\n";
     }
 }

will look for elements in @emails that match $pattern. The i after /$pattern/ causes Perl to ignore case when matching, so that /elsinore/i will match “Elsinore”.
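As a minimal sketch of the default-variable form, the following collects the case-insensitive matches from a list. The addresses and the value of $pattern are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data, not taken from the examples above.
my @emails  = ('JohnG@Elsinore.dk', 'derek@rsc.org', 'olivier@elsinore.dk');
my $pattern = 'elsinore';

my @matched;
foreach (@emails) {
    # With no =~ operator, the match is applied to $_; /i ignores case.
    push @matched, $_ if /$pattern/i;
}

print "$_ matches $pattern\n" foreach @matched;
```

Here the first and third addresses match, even though the first one is capitalized differently from the pattern.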

Constructing Patterns

As you have seen, a string by itself is a regular expression. It matches any string that contains it. For example, elsinore matches “in elsinore castle”. However, you can create far more interesting regular expressions.

Certain characters have special meanings in regular expressions. Table 22–2 lists these characters, with examples of how they might be used.

Table 22–2: Perl Regular Expressions
Char	Definition	Example	Matches
.	Matches any single character.	th.nk	think, thank, thunk, etc.
\	Quotes the following character.	script\.pl	script.pl
*	Previous item may occur zero or more times in a row.	.*	any string, including the empty string
+	Previous item occurs at least once, and maybe more.	\*+	*, **, ***, etc.
?	Previous item may or may not occur.	web\.html?	web.htm, web.html
{n,m}	Previous item must occur at least n times but no more than m times.	\*{3,5}	***, ****, *****
( )	Group a portion of the pattern.	script(\.pl)?	script, script.pl
|	Matches either the value before or after the |.	(R|r)af	Raf, raf
[ ]	Matches any one of the characters inside. Frequently used with ranges.	[0-9]*	0110, 27, 9876, etc.
[^ ]	Matches any character not inside the brackets.	[^A-Za-z]	any nonalphabetic character, such as 2
\s	Matches any white-space character.	\s	space, tab, newline
\S	Matches any non-white space.	the\S	then, they, etc. (but not the)
\d	Matches any digit.	\d*	same as [0-9]*
\D	Matches anything that’s not a digit.	\D+	same as [^0-9]+
\w	Matches any letter, digit, or underscore.	\w+	Q, Oph3L1A, R_and_G, etc.
\W	Matches anything that \w doesn’t match.	\W+	&#$%,* etc.
^	Anchor the pattern to the beginning of a string.	^Words	any string beginning with Words
$	Anchor the pattern to the end of the string.	\.$	any string ending in a period
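The short script below is one way to try out a few of the table's patterns. The test strings and the extra ^ and $ anchors are assumptions added for illustration, so that each pattern must match the whole string rather than any part of it.

```perl
use strict;
use warnings;

# Each pair is [pattern, test string]; the strings are illustrative only.
my @tests = (
    [ 'th.nk',        'thank'     ],
    [ 'script\.pl',   'scriptXpl' ],  # the escaped . must be a literal period
    [ '^web\.html?$', 'web.htm'   ],
    [ '^\*{3,5}$',    '**'        ],  # only two asterisks: too few
    [ '^(R|r)af$',    'raf'       ],
    [ '^[^A-Za-z]+$', '2+2'       ],
    [ '^\w+$',        'R_and_G'   ],
);

my @results;
foreach my $t (@tests) {
    my ($pattern, $string) = @$t;
    my $ok = ($string =~ /$pattern/) ? 1 : 0;
    push @results, $ok;
    printf "%-14s %-10s %s\n", $pattern, $string, $ok ? 'match' : 'no match';
}
```

Every pair matches except the second and fourth, which fail because of the quoted period and the {3,5} minimum.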

Saving Matches

One use of regular expressions is to parse strings by saving the portions of the string that match your pattern. To save part of a string, put parentheses around the corresponding part of the pattern. The matches to the portions in parentheses are saved in the variables $1, $2, and so on. For example, suppose you have an e-mail address, and you want to get just the username part of the address:

 my $email = 'derek@rsc.org';
 if ($email =~ /(\w+)@/) {
     print "Username: $1\n";    # $1 is "derek", which matched (\w+)
 }

You can assign the matches to your own variables, as well. Another way to parse the address is

 my ($username, $domain) = ($email =~ /(.*)@(.*)/);
 print "Username: $username\nDomain: $domain\n";
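A slightly more defensive variant is sketched below. It assumes that a usable address contains exactly one @ sign, which is a simplification and not a full e-mail address check, and it tests whether the match succeeded before using the captured values.

```perl
use strict;
use warnings;

my $email = 'derek@rsc.org';

# In list context a successful match returns the captured groups;
# a failed match returns an empty list, leaving both variables undefined.
my ($username, $domain) = $email =~ /^([^\@]+)\@([^\@]+)$/;

if (defined $username) {
    print "Username: $username\nDomain: $domain\n";
} else {
    print "'$email' does not look like user\@host\n";
}
```

Checking the result matters because $1 and $2 keep their values from the last successful match, so using them after a failed match can silently produce stale data.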

Substitutions

Regular expressions can also be used to modify strings, by substituting text for the part of the string matched by the pattern. The general form for this is

 $string =~ s/$pattern/$replacement/;

In this example, the string “Hello, world” is transformed into “Hello, sailor”:

 my $hello = "Hello, world";
 $hello =~ s/world/sailor/;

The flag g causes all occurrences of the pattern to be replaced. For example,

 chomp(my @input = <STDIN>);
 foreach my $line (@input) {
     $line =~ s/\d/X/g;
 }

will replace all the digits in the input with the letter X.

You can include the variables $1, $2, etc., in the replacement, meaning that

 foreach (@input) {
     s/(.*)/\L$1/;    # same as $_ =~ s/(.*)/\L$1/;
 }

will convert all the input to lowercase.

The flag e will cause the replacement string to be evaluated as an expression before the replacement occurs. You could double all the integers in $mathline with the statement

 $mathline =~ s/(\d+)/2*$1/ge;

To double all the numbers, including decimals, is just a little more complicated. Here’s one way to do it:

 $mathline =~ s/(\d+(\.\d+)?)/2*$1/ge;
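To see why the second pattern is needed, the sketch below applies both substitutions to copies of the same line. The sample text in $mathline is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical input line containing integers and decimals.
my $mathline = "3 apples cost 1.25 and 10 pears cost 4.5";

# Integer-only pattern: each run of digits is doubled separately,
# so "1.25" is treated as the two numbers 1 and 25.
(my $ints_only = $mathline) =~ s/(\d+)/2*$1/ge;

# Decimal-aware pattern: an optional fractional part is included in $1.
(my $doubled = $mathline) =~ s/(\d+(\.\d+)?)/2*$1/ge;

print "$ints_only\n";   # 6 apples cost 2.50 and 20 pears cost 8.10
print "$doubled\n";     # 6 apples cost 2.5 and 20 pears cost 9
```

Copying the string first with (my $copy = $original) =~ s/.../.../ leaves $mathline itself unchanged.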

Translations

The translation operator is similar to the substitution operator. Its purpose is to translate one set of characters into another set. For example, you could use it to switch uppercase and lowercase letters, as shown:

 $switchchars =~ tr/A-Za-z/a-zA-Z/;

Or you could convert letters into their scrabble values (with a value of 10 represented by 0):

 $scrabbleword =~ tr/a-z/13321424185131130111144840/;

This would convert “vanquisher” into “4110111411”.
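Both translations can be tried in a few lines. The sketch below also totals the letter values, counting 0 as 10 as described above. The variable names and the word 'Hello' are chosen here for illustration.

```perl
use strict;
use warnings;

# Swap uppercase and lowercase in a copy of the string.
(my $swapped = 'Hello') =~ tr/A-Za-z/a-zA-Z/;

# Convert each letter to its value, working on a copy of the word.
my $word = 'vanquisher';
(my $values = $word) =~ tr/a-z/13321424185131130111144840/;

# Sum the per-letter values, treating the digit 0 as 10.
my $score = 0;
$score += ($_ == 0 ? 10 : $_) foreach split //, $values;

print "$swapped\n";                            # hELLO
print "$word -> $values (score $score)\n";     # vanquisher -> 4110111411 (score 25)
```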

More examples of the translation operator can be found at http://perldoc.perl.org/perlop.html.

More Uses for Regular Expressions

Regular expressions can be used with several functions for working with strings and lists. This section covers split and grep, which take patterns, and join, which reverses the effect of split.

The split Function

The split function breaks a string at each occurrence of a certain pattern and returns the pieces as a list. It takes a regular expression and a string. (If the string is omitted, split operates on $_.)

Consider the following line from the file /etc/passwd:

 kcb:x:3943:100:Kenneth Branagh:/home/kcb:/bin/bash

We can use split to turn the fields from this line into a list:

 @passwd = split(/:/, $line);
 # @passwd = ("kcb", "x", 3943, 100, "Kenneth Branagh", "/home/kcb", "/bin/bash")

Better yet, we can assign a variable name to each field:

 ($login, $passwd, $uid, $gid, $gcos, $home, $shell) = split(/:/, $line);
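Two further uses of split are sketched below: an optional third argument that limits the number of fields, and a pattern that is more than a single character. The variable names $rest and @names are introduced here for illustration.

```perl
use strict;
use warnings;

my $line = 'kcb:x:3943:100:Kenneth Branagh:/home/kcb:/bin/bash';

# A third argument limits the number of fields; the last field keeps the rest.
my ($login, $rest) = split(/:/, $line, 2);

# The separator can be any regular expression, such as a run of white space.
my @names = split(/\s+/, 'Kenneth   Branagh');

print "$login | $rest\n";
print scalar(@names), " names\n";    # 2 names
```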

The join Function

The join function concatenates a series of strings into one string separated by a given separator. It takes a string (not a regular expression) to use as the separator and a list of values to combine.

Given the previously defined array @passwd, we can recreate $line with the following statement:

 $line = join(':', @passwd);

We can also join individual scalar values together:

 $line = join("\n", $login, $gcos);

Here, $line will contain the user name and full name separated by a newline.
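A common pattern is to split a record, change one field, and join it back together. The sketch below does this with the sample passwd line. The replacement shell /bin/ksh is an arbitrary value chosen for illustration.

```perl
use strict;
use warnings;

my $line = 'kcb:x:3943:100:Kenneth Branagh:/home/kcb:/bin/bash';

my @passwd = split(/:/, $line);
$passwd[6] = '/bin/ksh';              # hypothetical new login shell
my $updated = join(':', @passwd);

print "$updated\n";    # kcb:x:3943:100:Kenneth Branagh:/home/kcb:/bin/ksh
```

Note that split discards trailing empty fields by default, so a record whose last fields are empty will not survive this round trip unchanged unless split is given a negative limit, as in split(/:/, $line, -1).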

The grep Function

The grep function is used to extract elements of a list that match a given pattern. This function works in much the same way as the UNIX System grep family of commands. However, Perl’s grep function operates on a list rather than on files, and it can test each element with any Perl expression, not just a pattern.

To extract all the elements of @data that contain digits, you can write

 @data = ("sjf8", "rlf", "ehb3", "pippin", "13");
 @numeric = grep(/[0-9]/, @data);
 # Same as @numeric = ("sjf8", "ehb3", "13")

The grep function sets $_ to each value in the list as it searches, so we can give it an expression containing $_ to evaluate. For example, we can search an array for numbers less than 50 with the line:

 @numbers = (1..100);
 @numbers = grep(($_ < 50), @numbers);
 # Same as @numbers = (1..49)

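Because $_ is an alias for each element rather than a copy, an expression that modifies $_ changes the list itself. We could double each value in an array by saying

 @numbers = (1, 2, 3, 4);
 @numbers = grep(($_ *= 2), @numbers);
 # Same as @numbers = (2, 4, 6, 8)

This works only when the elements are variables rather than literal constants, and grep still drops any element for which the expression is false (such as 0). For transforming a list, the map function is the more usual choice.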
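The sketch below shows the block form of grep, its behavior in scalar context, and map as an alternative for transforming a list without modifying it. The variable names $all_digits and @doubled are introduced here for illustration.

```perl
use strict;
use warnings;

my @data = ("sjf8", "rlf", "ehb3", "pippin", "13");

# Block form: the block is evaluated with $_ set to each element in turn.
my @numeric = grep { /[0-9]/ } @data;

# In scalar context, grep returns the number of elements that matched.
my $all_digits = grep { /^\d+$/ } @data;

# map returns the value of the expression for every element,
# leaving the original list untouched.
my @doubled = map { $_ * 2 } (1, 2, 3, 4);

print "@numeric\n";       # sjf8 ehb3 13
print "$all_digits\n";    # 1
print "@doubled\n";       # 2 4 6 8
```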

A Sample Program

This program demonstrates the uses of regular expressions, hashes, and some of the other Perl language features that you’ve learned about. It counts the frequency of each word in the input. The words are saved as the keys in a hash; the number of times the words appear are the values.

 #!/usr/local/bin/perl -w
 use strict;
 my (%count, $totalwords);
 while (<>) {
    my @line = split(/\s/, $_);
    foreach my $word (@line) {
       $count{$word}++;
       $totalwords++;
    }
 }
 print "$count{$_} $_\n" foreach (sort keys (%count));
 print "$totalwords total words found.\n";

The tricky part here is how to split the input lines to find words. The current program uses the regular expression escape sequence “\s”, which splits each line at every white-space character. Take a look at the results with the following test input:

 $ cat raven.input
 Once upon a midnight dreary, while I pondered, weak and weary,
 Over many a quaint and curious volume of forgotten lore;
 While I nodded, nearly napping, suddenly there came a tapping,
 As of someone gently rapping, rapping at my chamber door.
 "'Tis some visitor", I muttered, "tapping at my chamber door;
                  Only this and nothing more."
 $ wordcount.pl raven.input
 17
 1 "'Tis
 1 "tapping
 1 As
 3 I
 1 Once
 1 Only
 1 Over
 1 While
 3 a
 3 and
 2 at
 1 came
 2 chamber
 1 curious
 1 door.
 1 door;
   .
   .
   .
 1 volume
 1 weak
 1 weary,
 1 while
 73 total words found.

The output of this program shows a few flaws in its design. First of all, it is counting 17 of something that doesn’t seem to be a word. These are empty strings: because \s matches a single white-space character, split returns an empty field for each of the spaces that indent the last line of the input. Second, words are not stripped of punctuation, so “door.” and “door;” are counted as two separate words. Finally, words that are capitalized differently are also counted separately, as in “While” and “while”. To get an accurate count of how often a word occurs in a document, we should arrange for all forms of the same word to be counted as one word.
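The effect of leading white space on split can be seen in isolation. The sketch below uses a hypothetical line indented by only three spaces and compares three separators: a single white-space character, a run of white space, and the special single-space string, which skips leading white space entirely.

```perl
use strict;
use warnings;

# A line indented with three spaces (shorter than the poem's last line).
my $indented = "   Only this";

my @single = split(/\s/,  $indented);   # one split per white-space character
my @runs   = split(/\s+/, $indented);   # one split per run of white space
my @awk    = split(' ',   $indented);   # special case: leading white space is skipped

# Count the empty strings each approach produces.
my $empty_single = grep { $_ eq '' } @single;
my $empty_runs   = grep { $_ eq '' } @runs;
my $empty_awk    = grep { $_ eq '' } @awk;

print "$empty_single $empty_runs $empty_awk\n";    # 3 1 0
```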

Let’s try this version of the word frequency program:

 #!/usr/local/bin/perl -w
 use strict;
 my (%count, $totalwords);
 while (<>) {
    tr/A-Z/a-z/;
    s/^\W*//;
    my @line = split(/\W*\s+\W*/, $_);
    foreach my $word (@line) {
       $count{$word}++;
       $totalwords++;
    }
 }
 print "$count{$_} $_\n" foreach (sort keys (%count));
 print "$totalwords total words found.\n";

The translation operator is used to convert everything to lowercase. The substitution operator then removes leading punctuation for each line. The split pattern is a little more complicated now. It looks for patterns of at least one white-space character, with optional nonword characters on either side. This enables us to correctly count words with punctuation around them.

Here is the new output:

 $ wordcount2.pl raven.input
 3 a
 3 and
 1 as
 2 at
 1 came
 2 chamber
 1 curious
 2 door
   .
   .
   .
 1 volume
 1 weak
 1 weary
 2 while
 56 total words found.
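To follow what the second version does to a single line, the sketch below applies the same three steps to the fifth line of the test input and collects the resulting words. The variable name @words is introduced here for illustration.

```perl
use strict;
use warnings;

# The fifth line of the test input, including its trailing newline.
$_ = qq{"'Tis some visitor", I muttered, "tapping at my chamber door;\n};

tr/A-Z/a-z/;                            # convert to lowercase
s/^\W*//;                               # strip the leading "' before 'tis
my @words = split(/\W*\s+\W*/, $_);     # split on white space plus adjacent punctuation

print join('|', @words), "\n";
# tis|some|visitor|i|muttered|tapping|at|my|chamber|door
```

The trailing semicolon and newline are consumed by the last separator match, and split discards the empty field that would follow it.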