Section 6.4. Benchmarking | Mastering Regular Expressions

6.4. Benchmarking

Because this chapter talks a lot about speed and efficiency, and I often mention benchmarks I've done, I'd like to mention a few principles of benchmarking. I'll also show simple ways to benchmark in a few languages.

Basic benchmarking is simply timing how long it takes to do some work. To do the timing, get the system time, do the work, get the system time again, and report the difference between the times as the time it took to do the work. As an example, let's compare ^(abcdefg)+$ with ^[a-g]+$ . Well first look at benchmarking in Perl, but will see it in other languages in a bit. Here's a simple (but as we'll see, somewhat lacking) Perl script:

 use Time::HiRes   'time'; #  So time()   gives a  high-resolution value  .     $StartTime = time();     "abababdedfg" =~ m/^(abcdefg)+$/;     $EndTime = time();     printf("Alternation takes %.3f seconds.\n", $EndTime - $StartTime);     $StartTime = time();     "abababdedfg" =~ m/^[a-g]+$/;     $EndTime = time();     printf("Character class takes %.3f seconds.\n", $EndTime - $StartTime);

It looks (and is) simple, but there are some important points to keep in mind while constructing a test for benchmarking:

Time only "interesting" work . Time as much of the "work" as possible, but as little "non-work" as possible. If there is some initialization or other setup that must be done, do it before the starting time is taken. If there's cleanup, do it after the ending time is taken.
Do "enough" work . Often, the time it takes to do what you want to test is very short, and a computer's clock doesn't have enough granularity to give meaning to the timing.

When I run the simple Perl test on my system, I get
```
 Alternation takes 0.000 seconds.     Character class takes 0.000 seconds. 
```
which really doesn't tell me anything other than both are faster than the shortest time that can be measured. So, if something is fast, do it twice, or 10 times, or even 10,000,000 timeswhatever is required to make "enough" work. What is "enough" depends on the granularity of the system clock, but most systems now have clocks accurate down to 1/100 ^th of a second, and in such cases, timing even half a second of work may be sufficient for meaningful results.
Do the "right" work . Doing a very fast operation ten million times involves the overhead of ten million updates of a counter variable in the block being timed. If possible, it's best to increase the amount of real work being done in a way that doesn't increase the overhead work. In our Perl example, the regular expressions are being applied to fairly short strings: if applied to much longer strings, they'd do more "real" work each time.

So, taking these into account, here's another version:

 use Time::HiRes 'time'; #  So time() gives a high-resolution value  .     $TimesToDo = 1000;                     #  Simple setup  $TestString = "abababdedfg" x 1000;    #  Makes a huge string  $Count = $TimesToDo;     $StartTime = time();     while ($Count-- > 0) {         $TestString =~ m/^(abcdefg)+$/;     }     $EndTime = time();     printf("Alternation takes %.3f seconds.\n", $EndTime - $StartTime);     $Count = $TimesToDo;     $StartTime = time();     while ($Count-- > 0) {         $TestString =~ m/^[a-g]+$/;     }     $EndTime = time();     printf("Character class takes %.3f seconds.\n", $EndTime - $StartTime);

Notice how the $TestString and $Count are initialized before the timing starts? ($TestString is initialized with Perl's convenient x operator, which replicates the string on its left as many times as the number on its right.) On my system, with Perl 5.8, this prints:

 Alternation takes 7.276 seconds.     Character class takes 0.333 seconds.

So, with this test case, one is about 22x faster than the other. The benchmark should be executed a few times, with the fastest times taken, to lessen the impact of sporadic background system activity.

6.4.1. Know What You're Measuring

It might be interesting to see what happens when the initialization is changed to:

 $TimesToDo = 1000000;     $TestString = "abababdedfg";

Now, the test string is 1,000x shorter, but the test is done 1,000x more times. The total number of characters tested and matched by each regex remains the same, so conceptually, one might think that the amount of "work" should also remain the same. However, the results are quite different:

 Alternation takes 18.167 seconds.     Character class takes 5.231 seconds.

Both are now much slower than before. This is due to all the extra "non-work" overheadthe update and testing of $Count , and the setup of the regex engine, now each happen 1,000x more than before.

The extra overhead adds almost 5 seconds to the faster test, but more than 10 seconds to the alternation test. Why is the latter affected so much more? It's mostly due to the extra overhead of the capturing parentheses (which require their own extra processing before and after each test, and doing that 1,000x more adds up).

In any case, the point of this change is to illustrate that the results are strongly influenced by how much real work versus non-work overtime is part of the timing.

6.4.2. Benchmarking with PHP

Here's the benchmark example in PHP, using the preg engine:

 $TimesToDo = 1000;     /*  Prepare the test string  */     $TestString = "";     for ($i = 0; $i < 1000; $i++)         $TestString .= "abababdedfg";     /*  Do the first test  */     $start = gettimeofday();      for ($i = 0; $i < $TimesToDo; $i++)          preg_match('/^(abcefg)+$/', $TestString);     $final = gettimeofday();     $sec = ($final['sec'] + $final['usec']/1000000) -            ($start['sec'] + $start['usec']/1000000);     printf("Alternation takes %.3f seconds\n", $sec);     /*  And now the second test  */     $start = gettimeofday();     for ($i = 0; $i < $TimesToDo; $i++)          preg_match('/^[a-g]+$/', $TestString);     $final = gettimeofday();     $sec = ($final['sec'] + $final['usec']/1000000) -            ($start['sec'] + $start['usec']/1000000);     printf("Character class takes %.3f seconds\n", $sec);

On my system, it produces:

 Alternation takes 27.404 seconds     Character class takes 0.288 seconds

If, while benchmarking, you get a PHP error about it "not being safe to rely on the system's timezone settings," add the following to the top of your test program:

 if (phpversion() >= 5)         date_default_timezone_set("GMT");

6.4.3. Benchmarking with Java

Benchmarking Java can be a slippery science, for a number of reasons. Let's first look at a somewhat na ve example, and then look at why it's na ve, and at what can be done to make it less so:

 import java.util.regex.*;     public class JavaBenchmark {      public static void main(String [] args)      {        Matcher regex1 = Pattern.compile("^(abcdefg)+$").matcher("");        Matcher regex2 = Pattern.compile("^[a-g]+$").matcher("");        long timesToDo = 1000;        StringBuffer temp = new StringBuffer();        for (int i = 1000; i > 0; i--)                temp.append("abababdedfg");        String testString = temp.toString();        //  Time first one  ...        long count = timesToDo;        long startTime = System.currentTimeMillis();        while (--count > 0)              regex1.reset(testString).find();        double seconds = (System.currentTimeMillis() - startTime)/1000.0;        System.out.println("Alternation takes " + seconds + " seconds");        //  Time second one  ...       count = timesToDo;       startTime = System.currentTimeMillis();       while (--count > 0)             regex2.reset(testString).find();       seconds = (System.currentTimeMillis() - startTime)/1000.0;       System.out.println("Character class takes " + seconds + " seconds");     }   }

Notice how the regular expressions are compiled in the initialization part of the program? We want to benchmark the matching speed, not the compile speed.

Speed is dependent upon which virtual machine (VM) is used. Sun standard JRE comes with two virtual machines, a client VM optimized for fast startup, and a server VM optimized for heavy-duty long-haul work.

On my system, running the benchmark on the client VM produces:

 Alternation takes 19.318 seconds     Character class takes 1.685 seconds

while the server VM yields:

 Alternation takes 12.106 seconds     Character class takes 0.657 seconds

What makes benchmarking slippery, and this example somewhat na ve, is that the timing can be highly dependent on how well the automatic pre-execution compiler works, or how the run-time compiler interacts with the code being tested. Some VM have a JIT (Just-In-Time compiler), which compiles code on the fly, just before it's needed.

Java has what I call a BLTN (Better-Late-Than-Never) compiler, which kicks in during execution, compiling and optimizing heavily-used code on the fly. The nature of a BLTN is that it doesn't "kick in" until it senses that some code is "hot" (being used a lot). A VM that's been running for a while, such as in a server environment, will be "warmed up," while our simple test ensures a "cold" server (nothing yet optimized by the BLTN).

One way to see "warmed up" times is to run the benchmarked parts in a loop:

 //  Time first one  ...     for (int i = 4; i > 0; i--)     {         long count = timesToDo;         long startTime = System.currentTimeMillis();            while (--count > 0)               regex1.reset(testString).find();         double seconds = (System.currentTimeMillis() - startTime)/1000.0;         System.out.println("Alternation takes " + seconds + " seconds");     }

If the extra loop runs enough times (say, for 10 seconds), the BLTN will have optimized the hot code, leaving the last times reported as representative of a warmed up system. Testing again with the server VM, these times are indeed a bit faster by about 8 percent and 25 percent:

 Alternation takes 11.151 seconds     Character class takes 0.483 seconds

Another issue that makes Java benchmarking slippery is the unpredictable nature of thread scheduling and garbage collection. Again, running the test long enough helps amortize their unpredictable influence.

6.4.4. Benchmarking with VB.NET

Here's the benchmark example in VB.NET:

 Option Explicit On     Option Strict On     Imports System.Text.RegularExpressions     Module Benchmark     Sub Main()       Dim Regex1 as Regex = New Regex("^(abcdefg)+$")       Dim Regex2 as Regex = New Regex("^[a-g]+$")       Dim TimesToDo as Integer = 1000       Dim TestString as String = ""       Dim I as Integer       For I = 1 to 1000          TestString = TestString & "abababdedfg"       Next       Dim StartTime as Double = Timer()        For I = 1 to TimesToDo          Regex1.Match(TestString)       Next       Dim Seconds as Double = Math.Round(Timer() - StartTime, 3)       Console.WriteLine("Alternation takes " & Seconds & " seconds")       StartTime = Timer()       For I = 1 to TimesToDo          Regex2.Match(TestString)       Next       Seconds = Math.Round(Timer() - StartTime, 3)       Console.WriteLine("Character class takes " & Seconds & " seconds")     End Sub     End Module

On my system, it produces:

 Alternation takes 13.311 seconds     Character class takes 1.680 seconds

The .NET Framework allows a regex to be compiled to an even more efficient form, by providing RegexOptions.Compiled as a second argument to each Regex constructor (˜ 410). Doing that results in:

 Alternation takes 5.499 seconds     Character class takes 1.157 seconds

Both tests are faster using the Compiled option, but alternation sees a greater relative benefit (its almost 3x faster when Compiled , but the class version is only about 1.5x faster), so it seems that alternation benefits from the more efficient compilation relatively more than a character class does.

6.4.5. Benchmarking with Ruby

Here's the benchmark example in Ruby:

 TimesToDo=1000     testString=""     for i in 1..1000         testString += "abababdedfg"     end     Regex1 = Regexp::new("^(abcdefg)+$");     Regex2 = Regexp::new("^[a-g]+$");     startTime = Time.new.to_f      for i in 1..TimesToDo         Regex1.match(testString)     end     print "Alternation takes %.3f seconds\n" % (Time.new.to_f - startTime);     startTime = Time.new.to_f     for i in 1..TimesToDo         Regex2.match(testString)     end     print "Character class takes %.3f seconds\n" % (Time.new.to_f - startTime);

On my system, it produces:

 Alternation takes 16.311 seconds     Character class takes 3.479 seconds

6.4.6. Benchmarking with Python

Here's the benchmark example in Python:

 import re     import time     import fpformat     Regex1 = re.compile("^(abcdefg)+$")     Regex2 = re.compile("^[a-g]+$")     TimesToDo = 1250;     TestString = ""     for i in range(800):         TestString += "abababdedfg"     StartTime = time.time()     for i in range(TimesToDo):         Regex1.search(TestString)     Seconds = time.time() - StartTime     print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"     StartTime = time.time()     for i in range(TimesToDo):         Regex2.search(TestString)     Seconds = time.time() - StartTime     print "Character class takes " + fpformat.fix(Seconds,3) + " seconds"

For Python's regex engine, I had to cut the size of the string a bit because the original causes an internal error ("maximum recursion limit exceeded") within the regex engine. This is like a pressure release valve, stopping what seems like a neverending match from running too wild.

To compensate, I increased the number of times the test is done by a proportional amount. On my system, the benchmark produces:

 Alternation takes 10.357 seconds     Character class takes 0.769 seconds

6.4.7. Benchmarking with Tcl

Here's the benchmark example in Tcl:

 set TimesToDo 1000     set TestString ""     for {set i 1000} {$i > 0} {incr i -1} {         append TestString "abababdedfg"     }     set Count $TimesToDo     set StartTime [clock clicks   -milliseconds]     for {} {$Count > 0} {incr Count -1} {         regexp {^(abcdefg)+$} $TestString     }     set EndTime [clock clicks -milliseconds]     set Seconds [expr ($EndTime - $StartTime)/1000.0]     puts [format "Alternation takes %.3f seconds" $Seconds]     set Count $TimesToDo     set StartTime [clock clicks -milliseconds]     for {} {$Count > 0} {incr Count -1} {         regexp {^[a-g]+$} $TestString     }     set EndTime [clock clicks -milliseconds]     set Seconds [expr ($EndTime - $StartTime)/1000.0]     puts [format "Character class takes %.3f seconds" $Seconds]

On my system, this benchmark produces:

 Alternation takes 0.362 seconds      Character class takes 0.352 seconds

Wow, they're both about the same speed! Well, recall from the table on page 145 that Tcl has a hybrid NFA/DFA engine, and these regular expressions are exactly the same to a DFA engine. Most of what this chapter talks about simplydoes not apply to Tcl. See the sidebar on page 243 for more.