Section 3.14. A Few Sample Regular Expressions | The Ruby Way, Second Edition: Solutions and Techniques in Ruby Programming (2nd Edition)

3.13. Ruby and Oniguruma

Ruby's new regular expression engine is code-named Oniguruma, a Japanese name meaning roughly ghost wheel or demon wheel. (It is commonly misspelled by non-Japanese; remember you can't spell Oniguruma without "guru.")

The new engine offers several benefits over the old one. Notably, it handles internationalized strings better, and it adds some powerful features to Ruby's regular expressions. Additionally, it is offered under a less restrictive license, comparable to the rest of Ruby. As this book is being written, Oniguruma is not yet fully integrated into the standard distribution.

The next section deals with detecting the presence of the Oniguruma engine. The section after that outlines how to build it if you don't have it built in.

3.13.1. Testing the Presence of Oniguruma

If you're concerned with Oniguruma, the first step is to find out whether you are already using it. If you are running Ruby 1.8.4 or earlier, you probably don't have the new engine. It is standard in 1.9.

Here is a simple method that uses a three-pronged approach to determine whether the new engine is in place. First, as I said, it's standard in 1.9 and later. In recent versions of both engines, a Regexp::ENGINE is defined; if this string contains the substring Oniguruma, this is the new engine. Finally, we use a trick. If we still haven't determined which engine we have, we will try to evaluate a regex with "new" syntax. If it raises a SyntaxError, we have the old engine; otherwise, the new one.

def oniguruma?
  return true if RUBY_VERSION >= "1.9.0"
  if defined?(Regexp::ENGINE)  # Is ENGINE defined?
    if Regexp::ENGINE.include?('Oniguruma')
      return true              # Some version of Oniguruma
    else
      return false             # Pre-Oniguruma engine
    end
  end
  eval("/(?<!a)b/")            # Newer syntax
  return true                  # It worked: New engine.
rescue SyntaxError             # It failed: We're using the
  return false                 #   old engine.
end
puts oniguruma?

3.13.2. Building Oniguruma

If you don't have Oniguruma, you can compile Ruby yourself and link it in. The current instructions are shown below. These should work with versions as far back as 1.6.8 (though these are fairly old).

You should be able to obtain the Oniguruma archive from the RAA (http://raa.ruby-lang.org/) or other sources. The Ruby source itself, of course, is always available from the main Ruby site.

If you are on a UNIX-like platform (including Mac OS X or a Cygwin environment on Windows), you can follow the procedure shown here:

1.	`gunzip oniguruma.tar.gz`
2.	`tar xvf oniguruma.tar`
3.	`cd oniguruma`
4.	`./configure --with-rubydir=<ruby-source-dir>`
5.	One of: `make 16` (for Ruby 1.6.8) or `make 18` (for Ruby 1.8.0/1.8.1)
6.	`cd ruby-source-dir`

7.	`./configure`
8.	`make clean`
9.	`make`
10.	`make test # Simple test of Ruby interpreter`
11.	`cd ../oniguruma # adjust path as needed`
12.	`make rtest`

Or:

make rtest RUBYDIR=ruby-install-dir

If you are on a pure Win32 platform such as Windows XP, you will need both Visual C++ and a copy of the patch.exe executable. Then perform the following steps:

1.	Unzip the archive with whatever software you use.
2.	`copy win32\Makefile Makefile`
3.	One of: `nmake 16 RUBYDIR=ruby-source-dir` (for Ruby 1.6.8) or `nmake 18 RUBYDIR=ruby-source-dir` (for Ruby 1.8.0/1.8.1)
4.	Follow the directions in `ruby-source-dir\win32\README.win32`.

If there are problems, use the mailing list or newsgroup as a resource.

3.13.3. A Few New Features of Oniguruma

Oniguruma adds many new features to regular expressions in Ruby. Among the simplest of these is an additional character class escape sequence. Just as \d and \D match digits and nondigits, respectively (for decimal numbers), \h and \H do the same for hexadecimal digits:

"abc" =~ /\h+/   # 0
"DEF" =~ /\h+/   # 0
"abc" =~ /\H+/   # nil

Character classes in brackets get a little more power. The && operator takes the intersection of two character classes, which may be nested. Here is a regex that matches any lowercase letter except the vowels a, e, i, o, and u:

reg1 = /[a-z&&[^aeiou]]/     # Set intersection

Here is an example matching the entire alphabet but "masking off" m through p:

reg2 = /[a-z&&[^m-p]]/

Because this can be confusing, I recommend using this feature sparingly.
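
To see what the intersection buys us, here is a minimal sketch comparing it with the same set written out by hand. The variable names are made up for this illustration, and an Oniguruma-enabled Ruby (1.9 or later) is assumed.

```ruby
consonant = /[a-z&&[^aeiou]]/       # intersection form from the text
explicit  = /[b-df-hj-np-tv-z]/     # the same set, spelled out by hand

p "oniguruma".scan(consonant)       # ["n", "g", "r", "m"]

# Both classes pick out the same 21 letters of the alphabet.
alphabet = ("a".."z").to_a.join
p alphabet.scan(consonant) == alphabet.scan(explicit)   # true
```

The hand-written ranges are easy to get wrong, which is arguably the best case for the && form.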

Other Oniguruma features such as lookbehind and named matches are covered in the rest of section 3.13. Features related to internationalization are deferred until Chapter 4.

3.13.4. Positive and Negative Lookbehind

If lookahead isn't enough for you, Oniguruma offers lookbehind, which detects whether the current location is preceded by a given pattern.

Like many areas of regular expressions, this can be difficult to understand and motivate. Thanks go to Andrew Johnson for the following example.

Imagine that we are analyzing some genetic sequence. (The DNA molecule consists of four "base" molecules, abbreviated A, C, G, and T.) Suppose that we are scanning for all nonoverlapping nucleotide sequences (of length 4) that follow a T. We couldn't just try to match a T and four characters because the T may have been the last character of the previous match.

gene = 'GATTACAAACTGCCTGACATACGAA'
seqs = gene.scan(/T(\w{4})/)
# seqs is: [["TACA"], ["GCCT"], ["ACGA"]]

But the preceding code misses the GACA sequence that follows GCCT. Using a positive lookbehind (as follows), we catch them all:

gene = 'GATTACAAACTGCCTGACATACGAA'
seqs = gene.scan(/(?<=T)(\w{4})/)
# seqs is: [["TACA"], ["GCCT"], ["GACA"], ["ACGA"]]

This next example is adapted from one by K. Kosako. Suppose that we want to take a bunch of text in XML (or HTML) and shift to uppercase all the text outside the tags (that is, the cdata). Here is a way to do that using lookbehind:

text = <<-EOF
<body> <h1>This is a heading</h1>
<p> This is a paragraph with some
<i>italics</i> and some <b>boldface</b>
in it...</p>
</body>
EOF
pattern = /(?:^|           # Beginning or...
              (?<=>)       #   following a '>'
           )
           ([^<]*)         # Then all non-'<' chars (captured).
          /x
puts text.gsub(pattern) {|s| s.upcase }
# Output:
# <body> <h1>THIS IS A HEADING</h1>
# <p> THIS IS A PARAGRAPH WITH SOME
# <i>ITALICS</i> AND SOME <b>BOLDFACE</b>
# IN IT...</p>
# </body>
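
Negative lookbehind, written (?<!...), is the mirror image: it succeeds only where the given pattern does not precede the current location. The following is a small sketch of my own rather than one of the contributed examples; the sample string and variable names are invented. It picks out the numbers that are not prices.

```ruby
text = 'Order 66 cost $25 and item 7 cost $300'

# (?<![\$\d]) succeeds only where the previous character is neither
# a dollar sign nor a digit, so the tail of a price is skipped as well.
plain_numbers = text.scan(/(?<![\$\d])\d+/)
p plain_numbers    # ["66", "7"]
```

Excluding a preceding digit matters here. With (?<!\$) alone, the scan would also pick up the 5 in $25 and the 00 in $300.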

3.13.5. More on Quantifiers

We've already seen the atomic subexpression in Ruby's "regex classic" engine. This uses the notation (?>...), and it is "possessive" in the sense that it is greedy and does not allow backtracking into the subexpression.

Oniguruma allows another way of expressing possessiveness, with the postfix + quantifier. This is distinct from the + meaning "one or more" and can be combined with it. (In fact, it is a "secondary" quantifier, like the ? that gives us ??, +?, and *?.)

In essence, + applied to a repeated pattern is the same as enclosing that repeated pattern in an independent subexpression. For example:

r1 = /x*+/    # Same as:  /(?>x*)/
r2 = /x++/    # Same as:  /(?>x+)/
r3 = /x?+/    # Same as:  /(?>x?)/

For technical reasons, Ruby does not honor the {n,m}+ notation as possessive.

Obviously, this new quantifier is largely a notational convenience. It doesn't really offer any new functionality.
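
The practical effect of possessiveness shows up only when the rest of the pattern needs something the quantifier has already consumed. Here is a minimal sketch (the variable names are mine) contrasting the greedy, possessive, and atomic forms on the same string:

```ruby
greedy     = /\Ax*x\z/       # x* can give back one "x" for the final x
possessive = /\Ax*+x\z/      # x*+ keeps every "x"; nothing is left to match
atomic     = /\A(?>x*)x\z/   # the equivalent atomic subexpression

p("xxx" =~ greedy)       # 0
p("xxx" =~ possessive)   # nil
p("xxx" =~ atomic)       # nil
```

The possessive and atomic patterns fail for the same reason: once x* has swallowed all three characters, the engine is not allowed to backtrack into it.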

3.13.6. Named Matches

A special form of subexpression is the named expression. This in effect gives a name to a pattern (rather than just a number).

The syntax is simple: (?<name>expr) where name is some name starting with a letter (like a Ruby identifier). Notice how similar this is to the non-named atomic subexpression.

What can we do with a named expression? One thing is to use it as a backreference. The following example is a simple regex that matches a doubled word (see also section 3.14.6, "Detecting Doubled Words in Text"):

re1 = /\s+(\w+)\s+\1\s+/
str = "Now is the the time for all..."
re1.match(str).to_a          # [" the the ", "the"]

Note how we capture the word and then use \1 to reference it. We can use named references in much the same way. We give the name to the subexpression when we first use it, and we access the backreference by \k followed by that same name (always in angle brackets).

re2 = /\s+(?<anyword>\w+)\s+\k<anyword>\s+/

The second variant is longer but arguably more readable. (Be aware that if you use named backreferences, you cannot use numbered backreferences in the same regex.) Use this feature at your discretion.
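
Here is a quick sketch that exercises the named version and also checks the restriction on mixing the two styles. The second regex is compiled from a string (my choice, so that the error can be rescued at run time); the variable mixed_rejected is made up for the illustration.

```ruby
re2 = /\s+(?<anyword>\w+)\s+\k<anyword>\s+/
str = "Now is the the time for all..."
m = re2.match(str)
p m[0]    # " the the "
p m[1]    # "the"

# A numbered backreference alongside a named group is refused.
mixed_rejected =
  begin
    Regexp.new('(?<anyword>\w+)\s+\1')
    false
  rescue RegexpError
    true
  end
p mixed_rejected    # true
```

Written as a regex literal, the mixed pattern would instead be reported when the source file is parsed.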

Ruby has long had the capability to use backreferences in strings passed to sub and gsub; in the past, this has been limited to numbered backreferences, but in very recent versions, named matches can be used:

str = "I breathe when I sleep"
# Numbered matches...
r1  = /I (\w+) when I (\w+)/
s1  = str.sub(r1,'I \2 when I \1')
# Named matches...
r2  = /I (?<verb1>\w+) when I (?<verb2>\w+)/
s2  = str.sub(r2,'I \k<verb2> when I \k<verb1>')
puts s1     # I sleep when I breathe
puts s2     # I sleep when I breathe

Another use for named expressions is to re-invoke that expression. In this case, we use \g (rather than \k) preceding the name.

For example, let's define a spaces subpattern so that we can use it again. The last regex then becomes

re3 = /(?<spaces>\s+)(?<anyword>\w+)\g<spaces>\k<anyword>\g<spaces>/

Note how we invoke the pattern repeatedly by means of the \g marker. This feature makes more sense if the regular expression is recursive; that is the topic of the next section.

A notation such as \g<1> may also be used if there are no named subexpressions. This re-invokes a captured subexpression by referring to it by number rather than name.
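
The difference between a backreference and a re-invocation is easy to blur, so here is a small sketch with invented sample strings: \1 demands the same text again, whereas \g<1> only runs the same pattern again.

```ruby
backref = /(\d\d)-\1/       # \1 must repeat the text captured by group 1
recall  = /(\d\d)-\g<1>/    # \g<1> re-runs the pattern \d\d

p("12-12" =~ backref)   # 0
p("12-34" =~ backref)   # nil
p("12-34" =~ recall)    # 0
```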

One final note on the use of named matches. In the most recent versions of Ruby, the name can be used (as a symbol or a string) as a MatchData index. For example:

str = "My hovercraft is full of eels"
reg = /My (?<noun>\w+) is (?<predicate>.*)/
m = reg.match(str)
puts m[:noun]         # hovercraft
puts m["predicate"]   # full of eels
puts m[1]             # same as m[:noun] or m["noun"]

As shown, ordinary indices may still be used. There has also been some discussion of adding singleton methods to the MatchData object, so that each named match could be read with a method call:

puts m.noun
puts m.predicate

At the time of this writing, this has not been implemented.
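
Even without such methods, the names can be listed and looped over. This sketch assumes Ruby 1.9 or later, where MatchData#names is available; the lines array is only there to collect the output.

```ruby
str = "My hovercraft is full of eels"
reg = /My (?<noun>\w+) is (?<predicate>.*)/
m = reg.match(str)

p m.names    # ["noun", "predicate"]

# Look up each named match by its name.
lines = m.names.map { |name| "#{name}: #{m[name]}" }
puts lines
# noun: hovercraft
# predicate: full of eels
```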

3.13.7. Recursion in Regular Expressions

The ability to re-invoke a subexpression makes it possible to craft recursive regular expressions. For example, here is one that matches any properly nested parenthesized expression. (Thanks again to Andrew Johnson.)

str = "a * ((b-c)/(d-e) - f) * g"
reg = /(?<brace_pair>      # begin named expression
         \(                # match open paren
         (?:               # non-capturing group
           (?>             # possessive subexpr to match:
              \\[()]       #  either an escaped paren
            |              # OR
              [^()]        #  a non-paren character
           )               # end possessive
           |               # OR
           \g<brace_pair>  # a nested parens group (recursive call)
         )*                # repeat non-captured group zero or more
         \)                # match closing paren
       )                   # end named expression
      /x
m = reg.match(str).to_a    # ["((b-c)/(d-e) - f)", "((b-c)/(d-e) - f)"]

Note that left-recursion is not allowed. This is legal:

str = "bbbaccc"
re1 = /(?<foo>a|b\g<foo>c)/
re1.match(str).to_a        # ["bbbaccc","bbbaccc"]

But this is illegal:

re2 = /(?<foo>a|\g<foo>c)/ # Syntax error!

This example is illegal because the second alternative begins with the recursive call, so the engine would re-enter foo without consuming any input. This leads, if you think about it, to an infinite regress.
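
If you want to watch the engine refuse such a pattern without aborting the whole program, one approach is to compile it from a string. In this sketch (the variable left_recursion_rejected is made up), the error surfaces as a RegexpError that can be rescued:

```ruby
left_recursion_rejected =
  begin
    Regexp.new('(?<foo>a|\g<foo>c)')   # same pattern as re2 above
    false
  rescue RegexpError => e
    puts "Rejected: #{e.message}"
    true
  end

# The right-recursive form compiles and matches as before.
ok = Regexp.new('(?<foo>a|b\g<foo>c)')
p ok.match("bbbaccc")[0]    # "bbbaccc"
```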