3.8. Using Character Classes
Character classes are simply a form of alternation (specification of alternative possibilities) where each submatch is a single character. In the simplest case, we list a set of characters inside square brackets:
/[aeiou]/   # Match any single letter a, e, i, o, u; equivalent
            # to /(a|e|i|o|u)/ except for group-capture
Inside a character class, escape sequences such as \n are still meaningful, but metacharacters such as . and ? do not have any special meanings:
/[.\n?]/ # Match any of: period, newline, question mark
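As a quick check of this rule, the following sketch contrasts a period inside a class with a bare period. The sample string and variable names are assumptions chosen only for illustration.

```ruby
# Hypothetical sample text; any string with punctuation would do.
sample = "Stop. Go?\n"

# Inside the brackets, . and ? are literal; \n is still the newline escape.
in_class = sample.scan(/[.\n?]/)

# Outside a class, . is a metacharacter matching any character except newline.
bare_dot = sample.scan(/./)

p in_class          # [".", "?", "\n"]
p bare_dot.length   # 9 (every character except the trailing newline)
```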
The caret (^) has special meaning inside a character class if used at the beginning; it negates the list of characters (or refers to their complement):
/[^aeiou]/ # Any character EXCEPT a, e, i, o, u
The hyphen, used within a character class, indicates a range of characters (a lexicographic range, that is):
/[a-mA-M]/    # Any letter in the first half of the alphabet
/[^a-mA-M]/   # Any OTHER letter, or number, or non-alphanumeric
              # character
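To see lists, ranges, and negation side by side, here is a minimal sketch using String#scan. The sample string and variable names are assumptions made up for this illustration.

```ruby
# Hypothetical sample text used only for this illustration.
text = "Ruby, 2 gems!"

vowels     = text.scan(/[aeiou]/)     # lowercase vowels only
first_half = text.scan(/[a-mA-M]/)    # letters a-m in either case
the_rest   = text.scan(/[^a-mA-M]/)   # everything the previous class rejects

p vowels       # ["u", "e"]
p first_half   # ["b", "g", "e", "m"]
p the_rest     # ["R", "u", "y", ",", " ", "2", " ", "s", "!"]
```

Because a class and its negation are complements, every character of the string lands in exactly one of the last two arrays.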
When a hyphen is used at the beginning or end of a character class, or a caret is used anywhere but the beginning, these characters lose their special meaning and represent themselves literally. A right bracket must be escaped, since it would otherwise close the class. A left bracket should be escaped as well: the Oniguruma-based engine treats an unescaped left bracket inside a class as the start of a nested class.
/[-^\[\]]/ # Match a hyphen, caret, left bracket, or right bracket
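The sketch below exercises these literal characters. It escapes both brackets, which is a cautious choice that should behave the same on any Ruby version. The sample strings and variable names are assumptions for illustration.

```ruby
# Leading hyphen and non-leading caret are literal; both brackets are escaped.
specials = /[-^\[\]]/

marked  = "a-b^c[d]e"                  # hypothetical sample string
found   = marked.scan(specials)
cleaned = marked.gsub(specials, " ")

p found     # ["-", "^", "[", "]"]
p cleaned   # "a b c d e"

# A trailing hyphen is also literal, so this class holds "a", "c", and "-"
# rather than the range a-c.
trailing = /[ac-]/
p trailing =~ "b"   # nil
p trailing =~ "-"   # 0
```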
Ruby regular expressions may contain references to named character classes (POSIX bracket expressions), which are basically named patterns of the form [[:name:]]. For example, [[:digit:]] means the same as [0-9] for ASCII text. In many cases, the named form is a convenient shorthand or is at least more readable.
Some others are [[:print:]] (printable characters) and [[:alpha:]] (alphabetic characters):
s1 = "abc\007def"
/[[:print:]]*/.match(s1)
m1 = Regexp::last_match               # "abc"
s2 = "1234def"
/[[:digit:]]*/.match(s2)
m2 = Regexp::last_match               # "1234"
/[[:digit:]]+[[:alpha:]]/.match(s2)
m3 = Regexp::last_match               # "1234d"
A caret before the character class name negates the class:
/[[:^alpha:]]/ # Any non-alpha character
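The following sketch pulls the matched text out of the results, since both Regexp#match and Regexp.last_match return a MatchData object rather than a string. Variable names and the "R2-D2" sample are assumptions for illustration.

```ruby
s1 = "abc\007def"                  # \007 (bell) is not printable
m1 = /[[:print:]]*/.match(s1)      # a MatchData object
printable  = m1[0]                 # the matched text itself
via_global = Regexp.last_match(0)  # same text, fetched from the last match

s2 = "1234def"
digits     = s2[/[[:digit:]]+/]    # String#[] returns the matched substring
non_digits = s2[/[[:^digit:]]+/]   # negated named class

non_alpha = "R2-D2".scan(/[[:^alpha:]]/)

p printable    # "abc"
p via_global   # "abc"
p digits       # "1234"
p non_digits   # "def"
p non_alpha    # ["2", "-", "2"]
```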
There are also shorthand notations for many classes. The most common ones are \d (to match a digit), \w (to match any "word" character), and \s (to match any whitespace character such as a space, tab, or newline):
str1 = "Wolf 359"
/\w+/.match(str1)        # matches "Wolf" (same as /[a-zA-Z_0-9]+/)
/\w+ \d+/.match(str1)    # matches "Wolf 359"
/\w+ \w+/.match(str1)    # matches "Wolf 359"
/\s+/.match(str1)        # matches " "
The "negated" forms are typically capitalized:
/\W/   # Any non-word character
/\D/   # Any non-digit character
/\S/   # Any non-whitespace character
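As a small sketch of how the shorthand classes and their negated forms complement each other, the code below reuses the "Wolf 359" string with a few common String methods. The variable names are assumptions for illustration.

```ruby
str = "Wolf 359"

word        = str[/\w+/]          # first run of word characters
number      = str[/\d+/]          # first run of digits
separator   = str[/\W/]           # first non-word character
tokens      = str.scan(/\S+/)     # runs of non-whitespace
digits_only = str.gsub(/\D/, "")  # delete every non-digit
fields      = str.split(/\W+/)    # split on runs of non-word characters

p word          # "Wolf"
p number        # "359"
p separator     # " "
p tokens        # ["Wolf", "359"]
p digits_only   # "359"
p fields        # ["Wolf", "359"]
```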
For additional information specific to Oniguruma, refer to section 3.13, "Ruby and Oniguruma."