3.8. Using Character Classes

Character classes are simply a form of alternation (specification of alternative possibilities) where each submatch is a single character. In the simplest case, we list a set of characters inside square brackets:

    /[aeiou]/    # Match any single letter a, e, i, o, u; equivalent
                 # to /(a|e|i|o|u)/ except for group-capture

Inside a character class, escape sequences such as \n are still meaningful, but metacharacters such as . and ? do not have any special meanings:

    /[.\n?]/     # Match any of: period, newline, question mark

The caret (^) has special meaning inside a character class if used at the beginning; it negates the list of characters (or refers to their complement):

    /[^aeiou]/   # Any character EXCEPT a, e, i, o, u

The hyphen, used within a character class, indicates a range of characters (a lexicographic range, that is):

    /[a-mA-M]/   # Any letter in the first half of the alphabet
    /[^a-mA-M]/  # Any OTHER letter, or number, or non-alphanumeric
                 # character

When a hyphen is used at the beginning or end of a character class, or a caret is used anywhere other than the beginning, these characters lose their special meaning and represent only themselves. A right bracket must obviously be escaped. A left bracket is best escaped as well: the older regex engine treated it literally, but Oniguruma (Ruby 1.9 and later) reads an unescaped left bracket as the start of a nested character class:

    /[-^\[\]]/   # Match a hyphen, caret, left bracket, or right bracket

Ruby regular expressions may contain references to named character classes, which are basically named patterns (of the form [[:name:]]). For example, [[:digit:]] means the same as [0-9] in a pattern. In many cases, this turns out to be shorthand or is at least more readable.
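The bracket rules described so far can be checked with a short sketch. The sample strings and variable names here are illustrative assumptions, not part of the original text; String#scan is used only because it collects every match into an array. The sketch escapes both brackets, which is the conservative choice on current Ruby versions.

```ruby
# A simple set: each match is exactly one character from the list.
vowels = "education".scan(/[aeiou]/)         # ["e", "u", "a", "i", "o"]

# Inside brackets, . and ? are literal, while \n still means newline.
punct = "Really? Yes.\n".scan(/[.\n?]/)      # ["?", ".", "\n"]

# A leading caret negates the class.
consonants = "ruby".scan(/[^aeiou]/)         # ["r", "b", "y"]

# A hyphen between two characters denotes a range.
first_half  = "Matz".scan(/[a-mA-M]/)        # ["M", "a"]
second_half = "Matz".scan(/[^a-mA-M]/)       # ["t", "z"]

# Leading hyphen and non-leading caret are literal; brackets are escaped.
literals = "a-b^c[d]".scan(/[-^\[\]]/)       # ["-", "^", "[", "]"]

# A named class and the equivalent range agree on ASCII input.
named  = "Wolf 359".scan(/[[:digit:]]/)      # ["3", "5", "9"]
ranged = "Wolf 359".scan(/[0-9]/)            # ["3", "5", "9"]
```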
Other named classes include [[:print:]] (printable characters) and [[:alpha:]] (alphabetic characters):

    s1 = "abc\007def"
    /[[:print:]]*/.match(s1)
    m1 = Regexp::last_match[0]             # "abc"

    s2 = "1234def"
    /[[:digit:]]*/.match(s2)
    m2 = Regexp::last_match[0]             # "1234"

    /[[:digit:]]+[[:alpha:]]/.match(s2)
    m3 = Regexp::last_match[0]             # "1234d"

A caret before the character class name negates the class:

    /[[:^alpha:]]/     # Any non-alpha character

There are also shorthand notations for many classes. The most common ones are \d (to match a digit), \w (to match any "word" character), and \s (to match any whitespace character such as a space, tab, or newline):

    str1 = "Wolf 359"
    /\w+/.match(str1)        # matches "Wolf" (same as /[a-zA-Z_0-9]+/)
    /\w+ \d+/.match(str1)    # matches "Wolf 359"
    /\w+ \w+/.match(str1)    # matches "Wolf 359"
    /\s+/.match(str1)        # matches " "

The "negated" forms are typically capitalized:

    /\W/               # Any non-word character
    /\D/               # Any non-digit character
    /\S/               # Any non-whitespace character

For additional information specific to Oniguruma, refer to section 3.13, "Ruby and Oniguruma."
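As a rough illustration of the named classes, the shorthand notations, and their negated forms, the following sketch pulls substrings out with String#[] and String#scan. The variable names and the sample strings are assumptions made for this example; only the patterns themselves come from the discussion above.

```ruby
# "\007" is the BEL control character, which is not printable.
str = "abc\007def 1234xyz"

printable = str[/[[:print:]]+/]         # "abc" (the run stops at BEL)
digits    = str[/[[:digit:]]+/]         # "1234"
non_alpha = str.scan(/[[:^alpha:]]/)    # ["\a", " ", "1", "2", "3", "4"]

star = "Wolf 359"

word   = star[/\w+/]                    # "Wolf"
number = star[/\d+/]                    # "359"
gap    = star[/\s+/]                    # " "

# Capitalized shorthands match the complement of the lowercase ones.
non_word   = star.scan(/\W/)            # [" "]
non_digits = star[/\D+/]                # "Wolf "
chunks     = star.scan(/\S+/)           # ["Wolf", "359"]

# For this ASCII input, \w and the explicit class give the same result.
same = star.scan(/\w+/) == star.scan(/[a-zA-Z_0-9]+/)   # true
```

String#[] with a regex returns the first matching substring (or nil), which avoids the separate Regexp::last_match lookup when only the matched text is needed.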