Section 3.15. Conclusion | The Ruby Way, Second Edition: Solutions and Techniques in Ruby Programming (2nd Edition)

3.14. A Few Sample Regular Expressions

This section presents a small list of regular expressions that might be useful either in actual use or as samples to study. For simplicity, none of these patterns depends on Oniguruma.

3.14.1. Matching an IP Address

Suppose that we want to determine whether a string is a valid IPv4 address. The standard form of such an address is a dotted quad or dotted decimal string. This is, in other words, four decimal numbers separated by periods, each number ranging from 0 to 255.

The pattern given here will do the trick (with a few exceptions such as "127.1"). We break up the pattern a little just for readability. Note that the \d symbol is double-escaped so that the backslash in the string gets passed on to the regex. (We'll improve on this in a minute.)

num = "(\\d|[01]?\\d\\d|2[0-4]\\d|25[0-5])"
pat = "^(#{num}\\.){3}#{num}$"   # \\. so the regex sees an escaped, literal period
ip_pat = Regexp.new(pat)
ip1 = "9.53.97.102"
if ip1 =~ ip_pat                # Prints "yes"
  puts "yes"
else
  puts "no"
end

Note how we have an excess of backslashes when we define num in the preceding example. Let's define it as a regex instead of a string:

num = /(\d|[01]?\d\d|2[0-4]\d|25[0-5])/

When a regex is interpolated into another, to_s is called, which preserves all the information in the original regex.

num.to_s    # "(?-mix:(\\d|[01]?\\d\\d|2[0-4]\\d|25[0-5]))"

In some cases, it is more important to use a regex instead of a string for embedding. A good rule of thumb is to interpolate regexes unless there is some reason you must interpolate a string.
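As a sketch of that rule of thumb, the IPv4 pattern can be assembled by interpolating the regex form of num directly, so that nothing needs double escaping. The valid_ipv4? helper is a hypothetical name introduced only for this illustration:

```ruby
# Octet pattern written as a regex literal, so \d needs no double escaping.
num = /(\d|[01]?\d\d|2[0-4]\d|25[0-5])/

# Interpolating a regex calls to_s on it; the literal period is escaped once.
ip_pat = /^(#{num}\.){3}#{num}$/

# Hypothetical helper (not from the text): true when str looks like a dotted quad.
def valid_ipv4?(str, pattern)
  !(str =~ pattern).nil?
end

puts valid_ipv4?("9.53.97.102", ip_pat)   # true
puts valid_ipv4?("256.1.1.1", ip_pat)     # false
puts valid_ipv4?("9x53x97x102", ip_pat)   # false
```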

IPv6 addresses are not in widespread use yet, but we include them for completeness. These consist of eight colon-separated 16-bit hex numbers with zeroes suppressed.

num = /[0-9A-Fa-f]{0,4}/
pat = /^(#{num}:){7}#{num}$/
ipv6_pat = Regexp.new(pat)
v6ip = "abcd::1324:ea54::dead::beef"
if v6ip =~ ipv6_pat     # Prints "yes"
  puts "yes"
else
  puts "no"
end

3.14.2. Matching a Keyword-Value Pair

Occasionally we want to work with strings of the form "attribute=value" (as, for example, when we parse some kind of configuration file for an application).

The following code fragment extracts the keyword and the value. The assumptions are that the keyword or attribute is a single word, the value extends to the end of the line, and the equal sign may be surrounded by whitespace:

pat = /(\w+)\s*=\s*(.*?)$/
str = "color = blue"
matches = pat.match(str)
puts matches[1]        # "color"
puts matches[2]        # "blue"
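To show how this might be used on more than one line, here is a minimal sketch that collects every pair from a small block of text into a hash. The configuration text and the settings variable are made up for the example:

```ruby
pat = /(\w+)\s*=\s*(.*?)$/

# Hypothetical configuration text, used only for illustration.
config_text = <<~CONF
  color = blue
  size=10
  this line has no pair
  title = The Ruby Way
CONF

settings = {}
config_text.each_line do |line|
  m = pat.match(line.chomp)
  next unless m              # skip lines that are not keyword=value pairs
  settings[m[1]] = m[2]      # values stay strings; convert them as needed
end

puts settings["color"]   # blue
puts settings["title"]   # The Ruby Way
```

Note that every value comes back as a string, so "10" would still need a call such as to_i before it is used as a number.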

3.14.3. Matching Roman Numerals

In the following example we match against a complex pattern to determine whether a string is a valid Roman numeral (up to decimal 3999). As before, the pattern is broken up into parts for readability:

rom1 = /m{0,3}/i
rom2 = /(d?c{0,3}|c[dm])/i
rom3 = /(l?x{0,3}|x[lc])/i
rom4 = /(v?i{0,3}|i[vx])/i
roman = /^#{rom1}#{rom2}#{rom3}#{rom4}$/
year1985 = "MCMLXXXV"
if year1985 =~ roman      # Prints "yes"
  puts "yes"
else
  puts "no"
end

You might be tempted to put the i on the end of the whole expression and leave it off the smaller ones:

# This doesn't work!
rom1 = /m{0,3}/
rom2 = /(d?c{0,3}|c[dm])/
rom3 = /(l?x{0,3}|x[lc])/
rom4 = /(v?i{0,3}|i[vx])/
roman = /^#{rom1}#{rom2}#{rom3}#{rom4}$/i

Why doesn't this work? Look at this for the answer:

rom1.to_s   # "(?-mix:m{0,3})"

Notice how the to_s captures the flags for each subexpression, and these then override the flag on the big expression.
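If we really do want a single flag on the outer expression, one possible workaround is to interpolate each piece's source rather than the regex itself. Regexp#source returns only the pattern text, without the (?-mix:...) wrapper, so the outer i applies to everything. This is a sketch of that idea, not the approach the example above takes:

```ruby
rom1 = /m{0,3}/
rom2 = /(d?c{0,3}|c[dm])/
rom3 = /(l?x{0,3}|x[lc])/
rom4 = /(v?i{0,3}|i[vx])/

# source gives the bare pattern text, so no per-piece flags are carried along.
roman = /^#{rom1.source}#{rom2.source}#{rom3.source}#{rom4.source}$/i

matched = "MCMLXXXV" =~ roman
puts(matched ? "yes" : "no")   # yes
```

The trade-off is that any flags deliberately set on a subexpression are discarded as well.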

3.14.4. Matching Numeric Constants

A simple decimal integer is the easiest number to match. It has an optional sign and consists thereafter of digits (except that Ruby allows an underscore as a digit separator). Note that the first digit should not be a zero; then it would be interpreted as an octal constant:

int_pat = /^[+-]?[1-9][\d_]*$/

Integer constants in other bases are similar. Note that the hex and binary patterns have been made case-insensitive because they contain at least one letter:

hex_pat = /^[+-]?0x[\da-f_]+$/i
oct_pat = /^[+-]?0[0-7_]+$/
bin_pat = /^[+-]?0b[01_]+$/i

A normal floating point constant is a little tricky; the number sequences on each side of the decimal point are optional, but one or the other must be included:

float_pat = /^(\d[\d_]*\.[\d_]*|\.\d[\d_]*)$/   # a digit is required on at least one side of the point

Finally, scientific notation builds on the ordinary floating-point pattern:

sci_pat = /^(\d[\d_]*)?\.[\d_]*(e[+-]?)?(_*\d[\d_]*)$/i

These patterns can be useful if, for instance, you have a string and you want to verify its validity as a number before trying to convert it.
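As a rough sketch of that kind of check, the integer patterns can guard a call to Kernel#Integer, which understands the same 0x, 0b, and leading-zero prefixes. The parse_integer helper and the INT_PATTERNS constant are hypothetical names used only here:

```ruby
int_pat = /^[+-]?[1-9][\d_]*$/
hex_pat = /^[+-]?0x[\da-f_]+$/i
oct_pat = /^[+-]?0[0-7_]+$/
bin_pat = /^[+-]?0b[01_]+$/i

INT_PATTERNS = [int_pat, hex_pat, oct_pat, bin_pat]

# Hypothetical helper: returns an Integer, or nil if str is not an integer constant.
def parse_integer(str)
  return nil unless INT_PATTERNS.any? { |pat| str =~ pat }
  Integer(str)   # honors the 0x, 0b, and leading-0 (octal) prefixes
rescue ArgumentError
  nil            # e.g. "1__0" passes int_pat, but Integer() rejects doubled underscores
end

p parse_integer("0x1F")    # 31
p parse_integer("017")     # 15
p parse_integer("1_000")   # 1000
p parse_integer("12abc")   # nil
```

The rescue is there because the patterns are slightly more permissive about underscores than Ruby itself is.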

3.14.5. Matching a Date/Time String

Suppose that we want to match a date/time in the form mm/dd/yy hh:mm:ss. This pattern is a good first attempt:

datetime = /(\d\d)\/(\d\d)\/(\d\d) (\d\d):(\d\d):(\d\d)/

However, that will also match invalid date/times and miss valid ones. The following example is pickier. Note how we build it up by interpolating smaller regexes into larger ones:

mo = /(0?[1-9]|1[0-2])/          # 01 to 09 or 1 to 9 or 10-12
dd = /([0-2]?[1-9]|[1-3][01])/   # 1-9 or 01-09 or 11-19 etc.
yy = /(\d\d)/                    # 00-99
hh = /([01]?\d|2[0-3])/          # 0-9 or 00-19 or 20-23
mi = /([0-5]\d)/                 # 00-59, both digits required
ss = /([0-6]\d)?/                # allows leap seconds ;-)
date = /(#{mo}\/#{dd}\/#{yy})/
time = /(#{hh}:#{mi}:#{ss})/
datetime = /(#{date} #{time})/

Here's how we might call it using String#scan to return an array of matches:

str = "Recorded on 11/18/07 20:31:00"
str.scan(datetime)
# [["11/18/07 20:31:00", "11/18/07", "11", "18", "07",
#   "20:31:00", "20", "31", "00"]]

Of course, this could all have been done as a large extended regex:

datetime = %r{(
  (0?[1-9]|1[0-2])/        # mo: 01 to 09 or 1 to 9 or 10-12
  ([0-2]?[1-9]|[1-3][01])/ # dd: 1-9 or 01-09 or 11-19 etc.
  (\d\d) [ ]               # yy: 00-99
  ([01]?\d|2[0-3]):        # hh: 0-9 or 00-19 or 20-23
  ([0-5]\d):               # mi: 00-59, both digits required
  (([0-6]\d))?             # ss: allows leap seconds ;-)
)}x

Note the use of the %r{} notation so that we don't have to escape the slashes.
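With this many groups, counting positions gets tedious. One alternative, sketched here, is to name the groups. Named groups require the Oniguruma engine (Ruby 1.9 and later), so unlike the other patterns in this section this one does depend on it. The group names and the fields variable are our own choices, the hour group is assumed to accept 00 through 23, and the seconds are required for simplicity:

```ruby
datetime = %r{
  (?<mo>0?[1-9]|1[0-2])/          # month
  (?<dd>[0-2]?[1-9]|[1-3][01])/   # day
  (?<yy>\d\d) [ ]                 # two-digit year, then a literal space
  (?<hh>[01]?\d|2[0-3]):          # hour (assumed range 00-23)
  (?<mi>[0-5]\d):                 # minute
  (?<ss>[0-6]\d)                  # second
}x

m = datetime.match("Recorded on 11/18/07 20:31:00")

# Pair each group name with the text it captured.
fields = m.names.zip(m.captures).to_h

puts fields["mo"]   # 11
puts fields["yy"]   # 07
puts m[:hh]         # 20
```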

3.14.6. Detecting Doubled Words in Text

In this section we implement the famous double-word detector. Typing the same word twice in succession is a common typing error. The following code detects instances of that occurrence:

double_re = /\b(['A-Z]+) +\1\b/i
str = "There's there's the the pattern."
str.scan(double_re)  #  [["There's"],["the"]]

Note that the trailing i in the regex is for case-insensitive matching. There is an array for each grouping, hence the resulting array of arrays.
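The same pattern can also repair the text rather than just report on it. As a small sketch, gsub with a '\1' replacement keeps the first copy of each doubled word and drops the second (the fixed variable is just a name for this example):

```ruby
double_re = /\b(['A-Z]+) +\1\b/i
str = "There's there's the the pattern."

# '\1' (single-quoted) refers to the first captured group in the replacement.
fixed = str.gsub(double_re, '\1')

puts fixed   # There's the pattern.
```

A word typed three times in a row would be reduced to two copies, because gsub does not rescan text it has already replaced; a second pass would be needed for that case.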

3.14.7. Matching All-Caps Words

This example is simple if we assume no numerics, underscores, and so on:

allcaps = /\b[A-Z]+\b/
string = "This is ALL CAPS"
string[allcaps]                #  "ALL"

Suppose you want to extract every word in all-caps:

string.scan(allcaps)           #  ["ALL", "CAPS"]

If we wanted, we could extend this concept to include Ruby identifiers and similar items.

3.14.8. Matching Version Numbers

A common convention is to express a library or application version number by three dot-separated numbers. This regex matches that kind of string, with the package name and the individual numbers as submatches:

package = "mylib-1.8.12"
matches = package.match(/(.*)-(\d+)\.(\d+)\.(\d+)/)
name, major, minor, tiny = matches[1..-1]
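The submatches are strings, so comparing two versions usually means converting them to integers first. The following sketch does that and relies on Array#<=> to compare the numbers position by position. The parse_version helper is a hypothetical name, not something defined elsewhere in this chapter:

```ruby
# Hypothetical helper: split "name-X.Y.Z" into a name and an array of integers.
def parse_version(package)
  m = package.match(/(.*)-(\d+)\.(\d+)\.(\d+)/)
  return nil unless m
  name, *numbers = m.captures
  [name, numbers.map(&:to_i)]
end

_, old_version = parse_version("mylib-1.8.12")
_, new_version = parse_version("mylib-1.10.0")

# Array#<=> compares element by element, so 1.10.0 sorts after 1.8.12.
p(old_version <=> new_version)   # -1

# A plain string comparison gets this wrong, because "8" sorts after "1".
p("1.8.12" <=> "1.10.0")         # 1
```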

3.14.9. A Few Other Patterns

Let's end this list with a few more "odds and ends." As usual, most of these could be done in more than one way.

Suppose that we wanted to match a two-character USA state postal code. The simple way is just /[A-Z]{2}/, of course. But this matches names such as XY and ZZ that look legal but are meaningless. The following regex matches all the 51 usual codes (50 states and the District of Columbia):

state = /^(?: A[LKZR] | C[AOT] | D[EC] | FL | GA | HI | I[DLNA] |
          K[SY] | LA | M[EDAINSOT] | N[EVHJMYCD] | O[HKR] |
          PA | RI | S[CD] | T[NX] | UT | V[TA] | W[AVIY] )$/x

For clarity, I've made this an extended regex (by using the x modifier). The spaces and newlines are ignored.
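One detail worth checking in any pattern of this shape is how the anchors interact with alternation. Alternation has the lowest precedence in a regex, so ^ and $ attach only to the first and last alternatives unless the whole list is grouped. A minimal sketch, using a shortened list of codes and the made-up names loose and strict:

```ruby
# Anchors apply only to the first and last alternatives here.
loose  = /^A[LKZR] | C[AOT] | W[AVIY]$/x

# Grouping the alternatives makes both anchors apply to every one of them.
strict = /^(?: A[LKZR] | C[AOT] | W[AVIY] )$/x

p("XCAX" =~ loose)    # 1, because the middle alternative is unanchored
p("XCAX" =~ strict)   # nil
p("CA" =~ strict)     # 0
```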

In a similar vein, here is a regex to match a U.S. ZIP Code (which may be five or nine digits):

zip = /^\d{5}(-\d{4})?$/

The anchors (in this regex and others) are only to ensure that there are no extraneous characters before or after the matched string. Note that this regex will not catch all invalid codes. In that sense, it is less useful than the preceding one.

The following regex matches a phone number in the NANP format (North American Numbering Plan). It allows three common ways of writing such a phone number:

phone = /^((\(\d{3}\) |\d{3}-)\d{3}-\d{4}|\d{3}\.\d{3}\.\d{4})$/
"(512) 555-1234" =~ phone    # 0 (match)
"512.555.1234"   =~ phone    # 0 (match)
"512-555-1234"   =~ phone    # 0 (match)
"(512)-555-1234" =~ phone    # nil (no match)
"512-555.1234"   =~ phone    # nil (no match)

Matching a dollar amount with optional cents is also trivial:

dollar = /^\$\d+(\.\d\d)?$/

This one obviously requires at least one digit to the left of the decimal and disallows spaces after the dollar sign. Also note that if you only wanted to detect a dollar amount rather than validate it, the anchors would be removed and the optional cents would be unnecessary.
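To make the difference between validating and detecting concrete, here is a short sketch with the anchors removed. The sample text and variable names are invented for the example. For pulling the amounts out of the text we keep the optional cents, written as a non-capturing group so that scan returns whole matches:

```ruby
text = "Lunch was $12.50 and coffee was $3 today."

# Detection only: is there any dollar amount at all?
has_amount = !(text =~ /\$\d+/).nil?
p has_amount   # true

# Extraction: keep the optional cents so the full amount is returned.
dollar_detect = /\$\d+(?:\.\d\d)?/
p text.scan(dollar_detect)   # ["$12.50", "$3"]
```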