### 3.5. Using Quantifiers

A big part of regular expressions is handling optional items and repetition. An item followed by a question mark is optional; it may be present or absent, and the match depends on the rest of the regex. (It makes sense to apply this only to a subpattern of nonzero width, not to an anchor.)

```ruby
pattern = /ax?b/
pat2    = /a[xy]?b/

pattern =~ "ab"     # 0
pattern =~ "acb"    # nil
pattern =~ "axb"    # 0
pat2    =~ "ayb"    # 0
pat2    =~ "acb"    # nil
```

It is common for entities to be repeated one or more times, which we can specify with the + quantifier. For example, this pattern matches any run of one or more digits, such as a positive integer:

```ruby
pattern = /[0-9]+/
pattern =~ "1"         # 0
pattern =~ "2345678"   # 0
```
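Keep in mind that `=~` returns the index where the match begins, so an unanchored pattern will also find digits in the middle of a string. The following is a minimal sketch of that behavior; the constant names and the `\A`/`\z` anchors are illustrative additions, assuming we want the whole string to consist of digits.

```ruby
DIGITS     = /[0-9]+/        # a run of digits anywhere in the string
ALL_DIGITS = /\A[0-9]+\z/    # assumption: the whole string must be digits

p(DIGITS =~ "abc123")        # 3   (the match starts at index 3)
p(DIGITS =~ "abc")           # nil (no digits at all)
p(ALL_DIGITS =~ "2345678")   # 0
p(ALL_DIGITS =~ "abc123")    # nil
```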

Another common occurrence is a pattern that occurs zero or more times. We could do this with + and ?, of course; here we match the string Huzzah followed by zero or more exclamation points:

```ruby
pattern = /Huzzah(!+)?/   # Parentheses are necessary here
pattern =~ "Huzzah"       # 0
pattern =~ "Huzzah!!!!"   # 0
```

However, there's a better way. The * quantifier describes this behavior.

```ruby
pattern = /Huzzah!*/      # * applies only to !
pattern =~ "Huzzah"       # 0
pattern =~ "Huzzah!!!!"   # 0
```
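The two forms match the same text, but they are not identical in every respect: the parenthesized version also creates a capture group. Here is a small sketch comparing them (the constant names are made up for this illustration):

```ruby
WITH_GROUP = /Huzzah(!+)?/
WITH_STAR  = /Huzzah!*/

["Huzzah", "Huzzah!", "Huzzah!!!!"].each do |s|
  # Both patterns match the same text in each case
  p WITH_GROUP.match(s)[0] == WITH_STAR.match(s)[0]   # true
end

# The parenthesized version also captures the exclamation points
p WITH_GROUP.match("Huzzah!!!!")[1]   # "!!!!"
p WITH_GROUP.match("Huzzah")[1]       # nil (the group did not participate)
```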

What if we want to match a U.S. Social Security Number? Here's a pattern for that:

```ruby
ssn = "987-65-4320"
pattern = /\d\d\d-\d\d-\d\d\d\d/
pattern =~ ssn       # 0
```

But that's a little unclear. Let's explicitly say how many digits are in each group. A number in braces is the quantifier to use here:

```ruby
pattern = /\d{3}-\d{2}-\d{4}/
```

This is not necessarily a shorter pattern, but it is more explicit and arguably more readable.

Comma-separated ranges can also be used. Imagine that an Elbonian phone number consists of a part with three to five digits and a part with three to seven digits. Here's a pattern for that:

```ruby
elbonian_phone = /\d{3,5}-\d{3,7}/
```
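To see the range limits in action, we can try a few sample numbers. This sketch wraps the pattern in `\A` and `\z` anchors, which is an assumption added here so that a part that is too long or too short causes the match to fail rather than matching a shorter substring; the sample numbers are invented.

```ruby
# Anchored variant (an illustrative addition, not the pattern above)
ELBONIAN_PHONE = /\A\d{3,5}-\d{3,7}\z/

p(ELBONIAN_PHONE =~ "123-456")         # 0
p(ELBONIAN_PHONE =~ "12345-1234567")   # 0
p(ELBONIAN_PHONE =~ "12-456")          # nil (first part too short)
p(ELBONIAN_PHONE =~ "123456-123")      # nil (first part too long)
p(ELBONIAN_PHONE =~ "123-12345678")    # nil (second part too long)
```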

The beginning and ending numbers are optional (though we must have one or the other):

```ruby
/x{5}/        # Match 5 xs
/x{5,7}/      # Match 5-7 xs
/x{,8}/       # Match up to 8 xs
/x{3,}/       # Match at least 3 xs
```

Obviously, the quantifiers ?, +, and * could be rewritten in this way:

```ruby
/x?/          # same as /x{0,1}/
/x*/          # same as /x{0,}/
/x+/          # same as /x{1,}/
```
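If you want to convince yourself of these equivalences, a quick check like the following will do. It is only a sketch: the anchors and the sample strings are chosen for illustration, and agreement on a handful of samples is of course not a proof.

```ruby
# Each pair holds a shorthand quantifier and its brace equivalent
PAIRS = [
  [/\Ax?\z/, /\Ax{0,1}\z/],
  [/\Ax*\z/, /\Ax{0,}\z/],
  [/\Ax+\z/, /\Ax{1,}\z/]
]
SAMPLES = ["", "x", "xx", "xxx", "y"]

PAIRS.each do |short, long|
  SAMPLES.each do |s|
    # Both forms should give the same result for every sample string
    raise "mismatch for #{s.inspect}" unless (short =~ s) == (long =~ s)
  end
end
puts "all pairs agree"
```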

The terminology of regular expressions is full of colorful personifying terms such as greedy, reluctant, lazy, and possessive. The greedy/non-greedy distinction is one of the most important.

Consider this piece of code. You might expect that this regex would match "Where the", but it matches the larger substring "Where the sea meets the" instead:

```ruby
str = "Where the sea meets the moon-blanch'd land,"
match = /.*the/.match(str)
p match[0]  # Display the entire match:
            # "Where the sea meets the"
```

The reason is that the * operator is greedy: in matching, it consumes as much of the string as it can for the longest match possible. We can make it non-greedy by appending a question mark:

```ruby
str = "Where the sea meets the moon-blanch'd land,"
match = /.*?the/.match(str)
p match[0]  # Display the entire match:
            # "Where the"
```

This shows us that the * operator is greedy by default unless a ? is appended. The same is true for the + and {m,n} quantifiers, and even for the ? quantifier itself.
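The behavior of the other non-greedy forms can at least be demonstrated on a toy string, even if this is not a compelling real-world use. In this sketch the string and constant name are invented for illustration:

```ruby
STR = "aaaa"

p STR[/a+/]        # "aaaa"  greedy: as many as possible
p STR[/a+?/]       # "a"     non-greedy: as few as possible (at least one)
p STR[/a{2,3}/]    # "aaa"
p STR[/a{2,3}?/]   # "aa"
p STR[/a?/]        # "a"
p STR[/a??/]       # ""      non-greedy ? prefers to match nothing
```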

I haven't been able to find good examples for the {m,n}? and ?? cases. If you know of any, please share them.

The Ruby Way, Second Edition: Solutions and Techniques in Ruby Programming (2nd Edition)
ISBN: 0672328844
Year: 2004
Pages: 269
Authors: Hal Fulton