9.5. More Powerful Regular Expressions

After reading (almost) three chapters about regular expressions, you know they're a powerful feature in the core of Perl. The Perl developers have added more features, and you'll see some of the most important ones in this section. At the same time, you'll see a little more about the internal operation of the regular expression engine.

9.5.1. Non-Greedy Quantifiers

The four quantifiers you've already seen (in Chapters 7 and 8) are all greedy. That means they match as much as they can, reluctantly giving some back if that's necessary to allow the overall pattern to succeed. Here's an example: Suppose you're using the pattern /fred.+barney/ on the string fred and barney went bowling last night. We know that the regular expression will match that string, but let's see how it goes about it.^[*] First, the subpattern fred matches the identical literal string. The next part of the pattern is the .+, which matches any character except newline, at least one time. But the plus quantifier is greedy; it prefers to match as much as possible. So, it immediately matches all of the rest of the string, including the word night. (This may surprise you, but the story isn't over yet.)

^[*] The regular expression engine makes a few optimizations that make the true story different than we tell it here, and those optimizations change from one release of Perl to the next. You shouldn't be able to tell from the functionality that it's not doing as we say. If you want to know how it works, you should read the latest source code. Be sure to submit patches for any bugs you find.

Now the subpattern barney would like to match, but it can't because we're at the end of the string. Since the .+ could be successful even if it matched one fewer character, it reluctantly gives back the letter t at the end of the string. (It's greedy, but it wants the whole pattern to succeed more than it wants to match everything all by itself.)

The subpattern barney tries again to match, and still can't. So, the .+ gives back the letter h and lets it try again. One character after another, the .+ gives back what it matched until it gives up all of the letters of barney. Finally, the subpattern barney can match, and the overall match succeeds.

Regular expression engines do a lot of backtracking like that, trying every different way of fitting the pattern to the string until one of them succeeds or until none of them has.^[] As this example shows, that can involve a lot of backtracking, as the quantifier gobbles up too much of the string and the regular expression engine forces it to return some of it.

^[] Some regular expression engines try every different way, continuing on [...]

For each greedy quantifier, a non-greedy alternative is available. Instead of the plus (+), we can use the non-greedy quantifier +?, which matches one or more times (as the plus does), except that it prefers to match as few times as possible, rather than as many as possible. Let's see how that new quantifier works when the pattern is rewritten as /fred.+?barney/.

Once again, fred matches right at the start. This time, the next part of the pattern is .+?, which prefers to match no more than one character, so it matches the space after fred. The next subpattern is barney, but that won't match here (since the string at the current position begins with and barney...). The .+? reluctantly matches the a and lets the rest of the pattern try again. Again, barney can't match, so the .+? accepts the letter n and so on. Once the .+? has matched five characters, barney can match, and the pattern is a success.

There was still some backtracking, but since the engine only had to go back and try again a few times, it should be a big improvement in speed. Well, it's an improvement if you generally find barney near fred. If your data often had fred near the start of the string and barney only at the end, the greedy quantifier might be a faster choice. In the end, the speed of the regular expression depends upon the data.
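
To see the two quantifiers disagree, it helps to capture what each one consumed. The following is only a sketch: the second barney in the sample string is our own addition (the string discussed above has just one), put there so the two patterns end up matching different amounts.

```perl
use strict;
use warnings;

# Hypothetical sample: the string from the text with a second "barney"
# added, so greedy and non-greedy matching give different results.
my $text = "fred and barney went bowling with barney last night";

# Greedy: .+ grabs the rest of the string, then gives characters back
# until barney can match. That ends up being the LAST barney.
my ($greedy) = $text =~ /fred(.+)barney/;

# Non-greedy: .+? grows one character at a time, so the match stops
# at the FIRST barney.
my ($lazy) = $text =~ /fred(.+?)barney/;

print "greedy:     '$greedy'\n";
print "non-greedy: '$lazy'\n";
```

Both patterns match the string; only the amount of text between fred and barney differs.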

Non-greedy quantifiers aren't just about efficiency. Though they'll always match (or fail to match) the same strings as their greedy counterparts, they may match different amounts of the strings. For example, suppose you had some HTML-like^[*] text and you want to remove all of the tags <BOLD> and </BOLD>, leaving their contents intact. Here's the text:
^[*] Once again, we aren't using real HTML because you can't correctly parse HTML with simple regular expressions. If you need to work with HTML or a similar markup language, use a module that's made to handle the complexities.
     I'm talking about the cartoon with Fred and <BOLD>Wilma</BOLD>!
And here's a substitution to remove those tags. But what's wrong with it?
     s#<BOLD>(.*)</BOLD>#$1#g;
The problem is that the star is greedy.^[] What if the text had said this instead?
^[] There's another possible problem: we should have used the /s modifier as well since the end tag may be on a different line than the start tag. It's a good thing that this is just an example; if we were writing something like this for real, we would have taken our own advice and used a well-written module.
     I thought you said Fred and <BOLD>Velma</BOLD>, not <BOLD>Wilma</BOLD>
In that case, the pattern would match from the first <BOLD> to the last </BOLD>, leaving intact the ones in the middle of the line. Oops! Instead, we want a non-greedy quantifier. The non-greedy form of star is *?, so the substitution now looks like this:
     s#<BOLD>(.*?)</BOLD>#$1#g;
And it does the right thing.
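
Here is a small sketch that runs both substitutions on the second sample line so you can compare the results. The variable names are ours, and each substitution works on its own copy of the text.

```perl
use strict;
use warnings;

my $text = "I thought you said Fred and <BOLD>Velma</BOLD>, not <BOLD>Wilma</BOLD>";

# Copy the text, then substitute in the copy. The greedy star matches
# from the first <BOLD> all the way to the last </BOLD>.
(my $greedy = $text) =~ s#<BOLD>(.*)</BOLD>#$1#g;

# The non-greedy star stops at the nearest </BOLD>, so /g finds and
# fixes each tagged word separately.
(my $lazy = $text) =~ s#<BOLD>(.*?)</BOLD>#$1#g;

print "greedy:     $greedy\n";
print "non-greedy: $lazy\n";
```

The greedy version leaves the inner </BOLD> and <BOLD> in place, while the non-greedy version removes all four tags.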

Since the non-greedy form of the plus was +? and the non-greedy form of the star was *?, you've probably realized the other two quantifiers look similar. The non-greedy form of any curly-brace quantifier looks the same, but with a question mark after the closing brace, like {5,10}? or {8,}?.^[*] Even the question-mark quantifier has a non-greedy form: ??. That matches once or not at all, but it prefers not to match anything.
^[*] In theory, there's a non-greedy quantifier form that specifies an exact number, like {3}?. Since that says to match exactly three of the preceding items, it has no flexibility to be greedy or non-greedy.
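
As a quick illustration of the remaining forms, this sketch captures what each quantifier matched. The sample strings are made up for the purpose.

```perl
use strict;
use warnings;

my $digits = "1234567890123";    # hypothetical sample: 13 digits

# Greedy curly braces take as many as allowed; non-greedy take as few.
my ($most)  = $digits =~ /(\d{5,10})/;     # ten digits
my ($least) = $digits =~ /(\d{5,10}?)/;    # five digits

# An exact count has no room to be greedy or not.
my ($exact) = $digits =~ /(\d{3}?)/;       # three digits either way

# The question-mark quantifier: ? prefers one, ?? prefers none.
my ($with_s)    = "freds" =~ /(freds?)/;   # takes the s
my ($without_s) = "freds" =~ /(freds??)/;  # leaves the s alone

print "$most $least $exact $with_s $without_s\n";
```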

9.5.2. Matching Multiline Text

Classic regular expressions were used to match single lines of text. Since Perl can work with strings of any length, Perl's patterns can match multiple lines of text as easily as single lines. Of course, you need a string that holds more than one line of text. Here's a string that's four lines long:
     $_ = "I'm much better\nthan Barney is\nat bowling,\nWilma.\n";
The anchors ^ and $ are normally anchors for the start and end of the whole string (see Chapter 8). But the /m regular expression option lets them match at internal newlines as well (think m for multiple lines). This makes them anchors for the start and end of each line, rather than the whole string. So, this pattern can match:
     print "Found 'wilma' at start of line\n" if /^wilma\b/im;
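
The effect of /m is easiest to see by trying the same pattern with and without it. This sketch reuses the four-line string from above; the variable names are ours.

```perl
use strict;
use warnings;

$_ = "I'm much better\nthan Barney is\nat bowling,\nWilma.\n";

# Without /m, ^ anchors only at the start of the whole string.
my $without_m = /^wilma\b/i  ? "match" : "no match";

# With /m, ^ also anchors just after each internal newline.
my $with_m    = /^wilma\b/im ? "match" : "no match";

# With /g and /m together, we can collect the first word of every line.
my @line_starts = /^(\w+)/gm;

print "without /m: $without_m\n";
print "with /m:    $with_m\n";
print "line starts: @line_starts\n";
```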
Similarly, you could do a substitution on each line in a multiline string. Here, we read an entire file into one variable,^[] then add the file's name as a prefix at the start of each line:
^[] Hope it's a small one. The file, that is, not the variable.
     open FILE, $filename
       or die "Can't open '$filename': $!";
     my $lines = join '', <FILE>;
     $lines =~ s/^/$filename: /gm;
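
To try the substitution without a file on disk, here is a sketch that starts from a string already in memory. The filename bowling.txt is hypothetical and is used only as the prefix; nothing is opened.

```perl
use strict;
use warnings;

my $filename = "bowling.txt";    # hypothetical name, used only as a label
my $lines    = "I'm much better\nthan Barney is\nat bowling,\nWilma.\n";

# With /m, ^ matches at the start of every line; with /g, the
# substitution is applied at each of those positions.
$lines =~ s/^/$filename: /gm;

print $lines;
```

Each of the four lines gets the prefix. Nothing is added after the final newline, since no line starts there.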
9.5.3. Updating Many Files

The most common way of programmatically updating a text file is by writing a new file that looks similar to the old one, but making whatever changes we need as we go along. As you'll see, this technique gives nearly the same result as updating the file, but it has some beneficial side effects as well.

In this example, we've got hundreds of files with a similar format. One of them is fred03.dat, and it's full of lines like these:
     Program name: granite
     Author: Gilbert Bates
     Company: RockSoft
     Department: R&D
     Phone: +1 503 555-0095
     Date: Tues March 9, 2004
     Version: 2.1
     Size: 21k
     Status: Final beta
We need to fix this file so that it has some different information. Here's roughly what this one should look like when we're done:
     Program name: granite
     Author: Randal L. Schwartz
     Company: RockSoft
     Department: R&D
     Date: June 12, 2008 6:38 pm
     Version: 2.1
     Size: 21k
     Status: Final beta
In short, we need to make three changes. The name of the Author should be changed, the Date should be updated to today's date, and the Phone should be removed completely. We have to make these changes in hundreds of similar files as well.

Perl supports a way of in-place editing of files with help from the diamond operator (<>). Here's a program to do what we want, though it may not be obvious how it works at first. This program's only new feature is the special variable $^I; ignore that for now, and we'll come back to it:
     #!/usr/bin/perl -w
     use strict;

     chomp(my $date = `date`);
     $^I = ".bak";

     while (<>) {
       s/^Author:.*/Author: Randal L. Schwartz/;
       s/^Phone:.*\n//;
       s/^Date:.*/Date: $date/;
       print;
     }
Since we need today's date, the program starts by using the system date command. A better way to get the date (in a slightly different format) would be to use Perl's own localtime function in a scalar context:
     my $date = localtime;
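
As a brief sketch of the difference that context makes, the same function returns a ready-made string in scalar context and a list of numeric fields in list context. The sample string in the comment only shows the shape of the result.

```perl
use strict;
use warnings;

# Scalar context: one string, shaped like "Thu Jun 12 18:38:00 2008".
my $date = localtime;

# List context: nine numeric fields (seconds, minutes, hours, and so on).
my @fields = localtime;

print "Date: $date\n";
print "Number of fields in list context: ", scalar(@fields), "\n";
```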
The next line of the program sets $^I, but keep ignoring that for the moment.

The list of files for the diamond operator here comes from the command line. The main loop reads, updates, and prints one line at a time. (With what you know so far, that means all of the files' newly modified contents will be dumped to your terminal, scrolling furiously past your eyes, without the files being changed at all. But stick with us.) The second substitution can replace the entire line containing the phone number with an empty string, leaving not even a newline. When that's printed, nothing comes out, and it's as if the Phone never existed. Most input lines won't match any of the three patterns, and those will be unchanged in the output.
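
To watch the three substitutions at work without touching any files, here is a sketch that applies them to a few sample lines held in an array. The fixed $date value and the shortened list of lines are stand-ins of ours, chosen so the result is predictable.

```perl
use strict;
use warnings;

my $date = "June 12, 2008 6:38 pm";    # stand-in for today's date

# A few lines in the style of fred03.dat, each with its newline.
my @lines = (
    "Program name: granite\n",
    "Author: Gilbert Bates\n",
    "Phone: +1 503 555-0095\n",
    "Date: Tues March 9, 2004\n",
    "Version: 2.1\n",
);

my $output = '';
for my $line (@lines) {
    my $copy = $line;    # leave the original array untouched
    $copy =~ s/^Author:.*/Author: Randal L. Schwartz/;
    $copy =~ s/^Phone:.*\n//;          # removes the newline too
    $copy =~ s/^Date:.*/Date: $date/;
    $output .= $copy;    # an emptied Phone line adds nothing
}

print $output;
```

Since the dot doesn't match a newline, the Author and Date lines keep their line endings, while the Phone line disappears entirely.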

This result is close to what we want, except that we haven't shown you how the updated information gets back out onto the disk. The answer is in the variable $^I. By default it's undef, and everything is normal. But when it's set to some string, it makes the diamond operator (<>) more magical than usual.

We know about much of the diamond's magic: it will automatically open and close a series of files for you or read from the standard-input stream if there aren't any filenames given. But when there's a string in $^I, that string is used as a backup filename's extension. Let's see that in action.

Let's say it's time for the diamond to open our file fred03.dat. It opens it like before but renames it, calling it fred03.dat.bak.^[*] We have the same file open, but it has a different name on the disk. Next, the diamond creates a new file and gives it the name fred03.dat. That's okay because we weren't using that name anymore. Now the diamond selects the new file as the default for output, so anything we print will go into that file.^[] The while loop will read a line from the old file, update that, and print it out to the new file. This program can update hundreds of files in a few seconds on a typical machine. Pretty powerful, huh?
^[] The diamond also tries to duplicate the original file's permission and ownership settings as much as possible; for example, if the old one was world-readable, the new one should be, as well.

Once the program has finished, what does the user see? The user says, "Ah, I see what happened. Perl edited my file fred03.dat, making the changes I needed, and saved a copy of the original in the backup file fred03.dat.bak just to be helpful." But we know the truth: Perl didn't really edit any file. It made a modified copy, said "Abracadabra!" and switched the files around while we were watching sparks come out of the magic wand. Tricky.

Some folks use a tilde (~) as the value for $^I since that resembles what emacs does for backup files. Another possible value for $^I is the empty string. This enables in-place editing but doesn't save the original data in a backup file. Since a small typo in your pattern could wipe out all of the old data, only use the empty string if you want to find out how good your backup tapes are. It's easy enough to delete the backup files when you're done. When something goes wrong and you need to rename the backup files to their original names, you'll be glad you know how to use Perl to do that. (See the multiple-file rename example in Chapter 13.)
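
If you'd like to try this safely, here is a sketch that builds a throwaway file in a temporary directory, edits it in place, and then looks at both the new file and the backup. The file contents and the slurp helper are our own inventions for the demonstration. Setting $^I and @ARGV with local inside a block is one common way to confine the in-place magic to a single loop.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# A scratch directory that is removed when the program exits.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/fred03.dat";

open my $fh, '>', $file or die "Can't create '$file': $!";
print $fh "Author: Gilbert Bates\nPhone: +1 503 555-0095\nVersion: 2.1\n";
close $fh or die "Can't close '$file': $!";

{
    # Within this block, <> reads $file and print writes its replacement.
    local ($^I, @ARGV) = ('.bak', $file);
    while (<>) {
        s/^Author:.*/Author: Randal L. Schwartz/;
        s/^Phone:.*\n//;
        print;
    }
}

# Hypothetical helper: read a whole file into one string.
sub slurp {
    my ($path) = @_;
    open my $in, '<', $path or die "Can't open '$path': $!";
    local $/;    # no record separator, so one read gets everything
    return scalar <$in>;
}

my $new = slurp($file);
my $old = slurp("$file.bak");

print "New file:\n$new";
print "Backup:\n$old";
```

The backup holds the original three lines, and the file with the original name holds the edited version.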

9.5.4. In-Place Editing from the Command Line

A program like the example from the previous section is fairly easy to write. But Larry decided it wasn't easy enough.

Imagine you need to update hundreds of files with the misspelling Randall instead of the one-l name Randal. You could write a program like the one in the previous section. Or you could do it all with a one-line program on the command line:
     $ perl -p -i.bak -w -e 's/Randall/Randal/g' fred*.dat
Perl has a whole slew of command-line options that can be used to build a complete program in a few keystrokes.^[*] Let's see what these few do.
^[*] See the perlrun manpage for the complete list.

Starting the command with perl does something like putting #!/usr/bin/perl at the top of a file does: it says to use the program perl to process what follows.

The -p option tells Perl to write a program for you. It's not much of a program, though; it looks something like this:^[]
^[] The print occurs in a continue block. See the perlsyn and perlrun manpages for more information.
     while (<>) {
       print;
     }
If you want less, you could use -n instead; that leaves out the automatic print statement, so you can print only what you wish. (Fans of awk will recognize -p and -n.) Again, it's not much of a program, but it's pretty good for the price of a few keystrokes.
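
For the curious, here is a sketch of the fuller shape of the -p loop, with the print in a continue block as the perlrun manpage describes it. The temporary input file, the in-memory output handle, and the substitution in the loop body are scaffolding of ours so the sketch can run on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical input file, created only so the sketch is self-contained.
my ($tmp, $name) = tempfile(UNLINK => 1);
print $tmp "Randall wrote this.\nSo did Randall.\n";
close $tmp or die "Can't close '$name': $!";

# Collect the output in a string instead of sending it to the terminal.
my $captured = '';
open my $out, '>', \$captured or die "Can't open in-memory handle: $!";
my $old_handle = select $out;

@ARGV = ($name);
LINE: while (<>) {
    s/Randall/Randal/g;    # code given with -e would go here
} continue {
    # With -n, this continue block is left out.
    print or die "-p destination: $!\n";
}

select $old_handle;
close $out or die "Can't close in-memory handle: $!";

print $captured;
```

Because the print sits in the continue block, it still runs for a line even if the loop body skips ahead with next.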

The next option is -i.bak, which sets $^I to ".bak" before the program starts. If you don't want a backup file, you can use -i alone with no extension. If you don't want a spare parachute, you can leave the airplane with just one.

We've seen -w before: it turns on warnings.

The -e option says "executable code follows." That means the s/Randall/Randal/g string is treated as Perl code. Since we've got a while loop (from the -p option), this code is put inside the loop before the print. For technical reasons, the last semicolon in the -e code is optional. If you have more than one -e and so more than one chunk of code, only the semicolon at the end of the last one may safely be omitted.

The last command-line parameter is fred*.dat, which says that @ARGV should hold the list of filenames that match that filename pattern. Put the pieces all together, and it's as if we had written a program like this, and put it to work on all of those fred*.dat files:
     #!/usr/bin/perl -w

     $^I = ".bak";

     while (<>) {
       s/Randall/Randal/g;
       print;
     }
Compare this program to the one we used in the previous section. It's pretty similar. These command-line options are pretty handy, aren't they?

9.5.5. Non-Capturing Parentheses

So far, you've seen parentheses that capture parts of a matched string and store them in the memory variables, but what if you just want to use the parentheses to group things? Consider a regular expression where we want to make part of it optional but only capture another part of it. In this example, we want "bronto" to be optional, but to make it optional, we have to group that sequence of characters with parentheses. Later, the pattern uses an alternation to get "steak" or "burger," and captures the one it finds.
     if (/(bronto)?saurus (steak|burger)/)
             {
             print "Fred wants a $2\n";
             }
Even if "bronto" is absent, its part of the pattern goes into $1. Perl counts the order of the opening parentheses to decide what the memory variables will be. The part we want to remember ends up in $2. In more complicated patterns, this situation can become confusing.

Fortunately, Perl's regular expressions have a way to use parentheses to group things but not trigger the memory variables. We call these non-capturing parentheses, and we write them with a special sequence. We add a question mark and a colon after the opening parenthesis, (?:),^[*] and that tells Perl we only use these parentheses for grouping.
^[*] This is the fourth type of ? you've seen in regular expressions: a literal question mark (escaped), the 0 or 1 quantifier, the non-greedy modifier, and now the start of an extended pattern.

We change our regular expression to use non-capturing parentheses around "bronto," and the part that we want to remember appears in $1.
     if (/(?:bronto)?saurus (steak|burger)/)
             {
             print "Fred wants a $1\n";
             }
Later, when we change our regular expression, perhaps to include a possible barbecue version of the brontosaurus burger, we can make the added "BBQ" (with a space) optional and non-capturing so the part we want to remember still shows up in $1. Otherwise, we'd potentially have to shift all of our memory variable names every time we add grouping parentheses to our regular expression.
     if (/(?:bronto)?saurus (?:BBQ )?(steak|burger)/)
             {
             print "Fred wants a $1\n";
             }
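
To compare the two kinds of parentheses side by side, this sketch runs a capturing and a non-capturing version of the pattern over a few made-up orders and records where the food ends up. The order strings and the @report array are ours.

```perl
use strict;
use warnings;

my @report;
for my $order ("brontosaurus steak", "saurus burger", "brontosaurus BBQ burger") {

    # Capturing parentheses around "bronto": the food lands in $2, and
    # $1 is undef whenever the optional "bronto" is absent.
    if ($order =~ /(bronto)?saurus (?:BBQ )?(steak|burger)/) {
        my $first = defined $1 ? "'$1'" : "undef";
        push @report, "capturing:     \$1 is $first, \$2 is '$2'";
    }

    # Non-capturing parentheses: the food is always in $1.
    if ($order =~ /(?:bronto)?saurus (?:BBQ )?(steak|burger)/) {
        push @report, "non-capturing: \$1 is '$1'";
    }
}

print "$_\n" for @report;
```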
Perl's regular expressions have several other special parentheses sequences that do fancy and complicated things such as look-ahead, look-behind, embedded comments, or even run code right in the middle of a pattern. You'll have to check out the perlre manpage for the details.