3.8. Using Character Classes

Character classes are simply a form of alternation (specification of alternative possibilities) where each submatch is a single character. In the simplest case, we list a set of characters inside square brackets:

/[aeiou]/    # Match any single letter a, e, i, o, u; equivalent
             # to /(a|e|i|o|u)/ except for group-capture
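The difference in group capture can be seen in a minimal sketch. The constant names, the helper first_vowel_index, and the sample string are hypothetical, chosen only for illustration:

```ruby
# Hypothetical names for illustration; not part of any library.
VOWEL_CLASS = /[aeiou]/        # character class: no capture group
VOWEL_ALT   = /(a|e|i|o|u)/    # alternation inside a capturing group

# Index of the first vowel in text, or nil if there is none.
def first_vowel_index(text)
  text =~ VOWEL_CLASS
end

word = "rhythm and blues"
p first_vowel_index(word)         # 7 (the "a" in "and")
p VOWEL_CLASS.match(word)[0]      # "a"
p VOWEL_CLASS.match(word)[1]      # nil: the class captures nothing
p VOWEL_ALT.match(word)[1]        # "a": the group captures the vowel
```

Both patterns match the same text; only the alternation form fills in a numbered group.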


Inside a character class, escape sequences such as \n are still meaningful, but metacharacters such as . and ? do not have any special meanings:

/[.\n?]/     # Match any of: period, newline, question mark
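As a rough illustration that the period and question mark are literal inside the brackets, the following sketch counts sentence breaks. The constant and the helper count_breaks are hypothetical names, not part of Ruby:

```ruby
# Hypothetical names for illustration.
PUNCT_OR_NEWLINE = /[.\n?]/

# Count every literal period, newline, or question mark in text.
def count_breaks(text)
  text.scan(PUNCT_OR_NEWLINE).length
end

p count_breaks("Ready? Go.\nabc")   # 3
p("abc" =~ PUNCT_OR_NEWLINE)        # nil: "." in a class is not a wildcard
p("abc" =~ /./)                     # 0: outside a class, "." matches "a"
```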


The caret (^) has special meaning inside a character class if used at the beginning; it negates the list of characters (or refers to their complement):

/[^aeiou]/   # Any character EXCEPT a, e, i, o, u


The hyphen, used within a character class, indicates a range of characters (a lexicographic range, that is):

/[a-mA-M]/   # Any letter in the first half of the alphabet
/[^a-mA-M]/  # Any OTHER letter, or number, or non-alphanumeric
             # character
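A range and its negation split any string into two complementary sets of characters. The sketch below assumes two hypothetical helpers and a made-up sample string:

```ruby
# Hypothetical names for illustration.
FIRST_HALF     = /[a-mA-M]/
NOT_FIRST_HALF = /[^a-mA-M]/

# Keep only the letters a-m and A-M.
def first_half_letters(text)
  text.scan(FIRST_HALF).join
end

# Keep everything else: other letters, digits, spaces, punctuation.
def other_characters(text)
  text.scan(NOT_FIRST_HALF).join
end

p first_half_letters("Camel 42!")   # "Camel"
p other_characters("Camel 42!")     # " 42!"
p first_half_letters("Ruby 1.8")    # "b"
```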


When a hyphen is used at the beginning or end of a character class, or a caret is used anywhere but the beginning, these characters lose their special meaning and represent themselves literally. A right bracket must be escaped. A left bracket was historically literal as well, but under Oniguruma (Ruby 1.9 and later) an unescaped left bracket opens a nested character class, so it is safest to escape both brackets:

/[-^\[\]]/      # Match a hyphen, caret, left bracket, or right bracket
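The following sketch shows these characters being matched literally. It escapes both brackets, a conservative choice that should be accepted regardless of engine version; the constant and helper names are hypothetical:

```ruby
# Hypothetical names for illustration. Both brackets are escaped here
# as a conservative assumption that works across regex engines.
SPECIALS = /[-^\[\]]/

# Return every hyphen, caret, or bracket found in text, in order.
def specials_in(text)
  text.scan(SPECIALS)
end

p specials_in("a[0] - b^2")   # ["[", "]", "-", "^"]

# A trailing hyphen is literal, so no range is formed:
p("b" =~ /[ac-]/)             # nil: "b" is not in the class
p("-" =~ /[ac-]/)             # 0: the hyphen itself matches
```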


Ruby regular expressions may contain references to named character classes, which are basically named patterns (of the form [[:name:]]). For example, [[:digit:]] means the same as [0-9] in a pattern when the text is plain ASCII. In many cases, the named form is shorter or at least more readable.

Some others are [[:print:]] (printable characters) and [[:alpha:]] (alphabetic characters):

s1 = "abc\007def"
/[[:print:]]*/.match(s1)
m1 = Regexp::last_match[0]               # "abc"

s2 = "1234def"
/[[:digit:]]*/.match(s2)
m2 = Regexp::last_match[0]               # "1234"
/[[:digit:]]+[[:alpha:]]/.match(s2)
m3 = Regexp::last_match[0]               # "1234d"


A caret before the character class name negates the class:

/[[:^alpha:]]/   # Any non-alpha character
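Named classes and their negations work well with gsub and with string indexing. The two helpers below are hypothetical and serve only as a sketch:

```ruby
# Hypothetical helpers for illustration.

# Remove every character that is not alphabetic.
def strip_non_alpha(text)
  text.gsub(/[[:^alpha:]]/, "")
end

# Return the run of digits at the start of text, or nil if there is none.
def leading_digits(text)
  text[/\A[[:digit:]]+/]
end

p strip_non_alpha("Wolf 359")   # "Wolf"
p leading_digits("1234def")     # "1234"
p leading_digits("def")         # nil
```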


There are also shorthand notations for many classes. The most common ones are \d (to match a digit), \w (to match any "word" character), and \s (to match any whitespace character such as a space, tab, or newline):

str1 = "Wolf 359"
/\w+/.match(str1)      # matches "Wolf" (same as /[a-zA-Z_0-9]+/)
/\w+ \d+/.match(str1)  # matches "Wolf 359"
/\w+ \w+/.match(str1)  # matches "Wolf 359"
/\s+/.match(str1)      # matches " "


The "negated" forms are typically capitalized:

/\W/                   # Any non-word character
/\D/                   # Any non-digit character
/\S/                   # Any non-whitespace character
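The shorthand classes and their negated forms cover many everyday cleanup tasks. The helper names and sample strings in this sketch are hypothetical:

```ruby
# Hypothetical helpers for illustration.

# Split text into runs of word characters.
def tokens(text)
  text.scan(/\w+/)
end

# Collapse each run of whitespace into a single space.
def squeeze_spaces(text)
  text.gsub(/\s+/, " ")
end

# Drop every non-digit character.
def digits_only(text)
  text.gsub(/\D/, "")
end

p tokens("Wolf 359")               # ["Wolf", "359"]
p squeeze_spaces("a\tb\n c")       # "a b c"
p digits_only("(555) 867-5309")    # "5558675309"
p("Wolf 359" =~ /\W/)              # 4: the space is the first non-word character
```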


For additional information specific to Oniguruma, refer to section 3.13, "Ruby and Oniguruma."




The Ruby Way, Second Edition: Solutions and Techniques in Ruby Programming (2nd Edition)
ISBN: 0672328844
EAN: 2147483647
Year: 2004
Pages: 269
Authors: Hal Fulton
