Section 7.7. Perl Efficiency Issues | Mastering Regular Expressions

7.7. Perl Efficiency Issues

For the most part, efficiency with Perl regular expressions is achieved in the same way as with any tool that uses a Traditional NFA. The techniques discussed in Chapter 6 (the internal optimizations, the unrolling methods, and the "Think" section) all apply to Perl.

There are, of course, Perl-specific issues as well, and in this section, we'll look at the following topics:

There's More Than One Way To Do It: Perl is a toolbox offering many approaches to a solution. Knowing which problems are nails comes with understanding The Perl Way, and knowing which hammer to use for any particular nail goes a long way toward making more efficient and more understandable programs. Sometimes efficiency and understandability seem to be mutually exclusive, but a better understanding allows you to make better choices.
Regex Compilation, qr/…/, the /o Modifier, and Efficiency: The interpolation and compilation of regex operands are fertile ground for saving time. The /o modifier, which I haven't discussed much yet, along with regex objects (qr/…/), gives you some control over when the costly recompilation takes place.
The $& Penalty: The three match side-effect variables, $`, $&, and $', can be convenient, but there's a hidden efficiency gotcha waiting in store for any script that uses them, even once, anywhere. Heck, you don't even have to use them: the entire script is penalized if one of these variables even appears in the script.
The Study Function: Since ages past, Perl has provided the study(…) function. Using it supposedly makes regexes faster, but it seems that no one really understands if it does, or why. We'll see whether we can figure it out.
Benchmarking: When it comes down to it, the fastest program is the one that finishes first. (You can quote me on that.) Whether a small routine, a major function, or a whole program working with live data, benchmarking is the final word on speed. Benchmarking is easy and painless with Perl, although there are various ways to go about it. I'll show you the way I do it, a simple method that has served me well for the hundreds of benchmarks I've done while preparing this book.
Perl's Regex Debugging: Perl's regex-debug flag can tell you about some of the optimizations the regex engine and transmission do, or don't do, with your regexes. We'll look at how to do this and see what secrets Perl gives up.
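
As a small taste of the regex-object topic listed above, here is a minimal sketch of compiling patterns once with qr/…/ and reusing them in a loop. The patterns, the sample lines, and the count_hits helper are all hypothetical, made up for illustration; how much time this saves in practice depends on the Perl version and the patterns involved.

```perl
use strict;
use warnings;

# Hypothetical patterns and input lines, chosen only for illustration.
my @patterns = ('\bperl\b', '\bregex\b');
my @lines    = ('I like perl', 'regex engines vary', 'nothing here');

# Compile each pattern once, up front, into a regex object.
my @compiled = map { qr/$_/i } @patterns;

# count_hits (hypothetical helper): counts the lines matched by at
# least one of the precompiled regex objects.
sub count_hits {
    my ($regexes, $text_lines) = @_;
    my $hits = 0;
    LINE: for my $line (@$text_lines) {
        for my $re (@$regexes) {
            # The regex object is used directly as the whole pattern.
            if ($line =~ $re) {
                $hits++;
                next LINE;
            }
        }
    }
    return $hits;
}

print count_hits(\@compiled, \@lines), "\n";
```

The point of the design is that the interpolation of each pattern string happens once, in the map, rather than every time the inner loop tests a line.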

7.7.1. "There's More Than One Way to Do It"

There are often many ways to go about solving any particular problem, so there's no substitute for really knowing all that Perl has to offer when balancing efficiency and readability. Let's look at the simple problem of padding an IP address like '18.181.0.24' such that each of the four parts becomes exactly three digits: '018.181.000.024'. One simple and readable solution is:

 $ip = sprintf("%03d.%03d.%03d.%03d", split(/\./, $ip));

This is a fine solution, but there are certainly other ways to do the job. In the interest of comparison, Table 7-6 examines various ways to achieve the same goal, and their relative efficiency (they're listed from the most efficient to the least). This example's goal is simple and not very interesting in and of itself, yet it represents a common text-handling task, so I encourage you to spend some time understanding the various approaches. You may even see some Perl techniques that are new to you.

Each approach produces the same result when given a correct IP address, but fails in different ways if given something else. If there is any chance that the data will be malformed, you'll need more care than any of these solutions provides. That aside, the practical differences lie in efficiency and readability. As for readability, #1 and #13 seem the most straightforward (although it's interesting to see the wide gap in efficiency). Also straightforward are #3 and #4 (similar to #1) and #8 (similar to #13). The rest all suffer from varying degrees of crypticness.
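
To illustrate the kind of extra care that possibly malformed data calls for, here is a minimal sketch that validates the address before padding it. The pad_ip helper is hypothetical and is not one of the approaches in Table 7-6; it assumes that "correct" means four dot-separated decimal numbers, each in the range 0 through 255.

```perl
use strict;
use warnings;

# pad_ip (hypothetical helper): returns the zero-padded address, or
# undef when the input is not four dot-separated numbers in 0..255.
sub pad_ip {
    my ($ip) = @_;
    return unless defined $ip;

    # Anchor at both ends so that stray leading or trailing text is rejected.
    return unless $ip =~ /\A([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\z/;
    my @parts = ($1, $2, $3, $4);

    # Reject components that are numeric but out of range.
    return if grep { $_ > 255 } @parts;

    return sprintf("%03d.%03d.%03d.%03d", @parts);
}

for my $candidate ('18.181.0.24', '18.181.0', '300.1.1.1') {
    my $padded = pad_ip($candidate);
    print "$candidate => ", (defined $padded ? $padded : 'rejected'), "\n";
}
```

The validation costs an extra match, so this trades some speed for predictable behavior on bad input.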

So, what about efficiency? Why are some less efficient than others? It's the interactions among how an NFA works (Chapter 4), Perl's many regex optimizations (Chapter 6), and the speed of other Perl constructs (such as sprintf, and the mechanics of the substitution operator). The substitution operator's /e modifier, while indispensable at times, does seem to be mostly at the bottom of the list.
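
Relative timings like these are easy to check on your own system. The following is a minimal sketch that uses the standard Benchmark module's cmpthese function to compare two ways of padding an address. The two subroutine names are labels invented for this sketch, the substitution-based variant is only assumed to resemble the /e approaches discussed here, and this is not necessarily the benchmarking method used for the table. Results vary with the Perl version and the machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $input = '18.181.0.24';

# Approach A: split on dots, then format all four parts at once.
sub via_sprintf_split {
    my ($ip) = @_;
    return sprintf("%03d.%03d.%03d.%03d", split(/\./, $ip));
}

# Approach B: substitute each run of digits, evaluating the
# replacement as code (/e) for every match (/g).
sub via_subst_e {
    my ($ip) = @_;
    $ip =~ s/(\d+)/sprintf("%03d", $1)/eg;
    return $ip;
}

# A negative count asks Benchmark to run each entry for at least
# that many CPU seconds, then print a comparison chart.
cmpthese(-1, {
    sprintf_split => sub { via_sprintf_split($input) },
    subst_e       => sub { via_subst_e($input) },
});
```

Each subroutine works on its own copy of the address, so repeated runs always start from the same unpadded input.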

It's interesting to compare two pairs, #3/#4 and #8/#14. The two regexes of each pair differ only in their use of parentheses; the one without the parentheses is just a bit faster than the one with. But #8's use of $& as a way to avoid parentheses comes at a high cost not shown by these benchmarks (see page 355).
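
To make the parentheses-versus-$& distinction concrete, here is a sketch of a pair of substitutions that differ only in that respect. They are written for illustration and are assumed, not confirmed, to resemble the table's entries; the subroutine names are hypothetical.

```perl
use strict;
use warnings;

# With capturing parentheses: the matched digits are read from $1.
sub pad_with_capture {
    my ($ip) = @_;
    $ip =~ s/(\d+)/sprintf("%03d", $1)/eg;
    return $ip;
}

# Without parentheses: the matched digits are read from $& instead.
# The mere appearance of $& here is what can penalize the whole script.
sub pad_with_ampersand {
    my ($ip) = @_;
    $ip =~ s/\d+/sprintf("%03d", $&)/eg;
    return $ip;
}

print pad_with_capture('18.181.0.24'), "\n";
print pad_with_ampersand('18.181.0.24'), "\n";
```

Both calls print 018.181.000.024. The difference lies only in what the regex engine must do behind the scenes to make $1 or $& available.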