7.2 Building Blocks | Perl 6 and Parrot Essentials, Second Edition

Every language has a set of basic components (words or parts of words) and a set of syntax rules for combining them. The "words" in rules are literal characters (or symbols), some metacharacters (or metasymbols), and escape sequences, while the combining syntax includes other metacharacters, quantifiers, bracketing characters, and assertions.

7.2.1 Metacharacters

The "word"-like metacharacters are ., ^, ^^, $, and $$. The . matches any single character, even a newline character. Actually, what it matches by default is a Unicode grapheme, but you can change that behavior with a pragma in your code, or a modifier on the rule. (We'll discuss modifiers in Section 7.3 later in this chapter.) The ^ and $ metacharacters are zero-width matches on the beginning and end of a string. They each have doubled alternates, ^^ and $$, that match at the beginning and end of every line within a string.
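
The distinction between string anchors and line anchors can be tried out in any regex engine that offers both. The following is a minimal sketch in Python's `re` module, offered only as a rough analogue: Python spells these anchors differently from rules (`\A` and `\Z` for the string, `^` and `$` under `re.MULTILINE` for lines), its `.` skips newlines unless `re.DOTALL` is set, and it matches code points rather than graphemes. The sample strings are made up for illustration.

```python
import re

text = "first line\nsecond line"

# In rules, `.` matches any character, newline included. Python's `.`
# behaves that way only with re.DOTALL.
dot_default = re.findall(r".", "a\nb")            # newline is skipped
dot_any = re.findall(r".", "a\nb", re.DOTALL)     # newline is included

# Rule `^` and `$` anchor to the whole string; the closest Python
# spellings are \A and \Z.
string_start = re.findall(r"\A\w+", text)
string_end = re.findall(r"\w+\Z", text)

# Rule `^^` and `$$` anchor to every line; Python uses `^` and `$`
# together with re.MULTILINE for the same effect.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)

print(string_start, line_starts)  # ['first'] ['first', 'second']
print(string_end, line_ends)      # ['line'] ['line', 'line']
```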

The |, &, \, #, and := metacharacters are all syntax structure elements. The | is an alternation between two options. The & matches two patterns simultaneously (the patterns must be the same length). The \ turns literal characters into metacharacters (the escape sequences) or turns metacharacters into literal characters. The # marks a comment to the end of the line. Whitespace insensitivity (the old /x modifier) is on by default, so you can start a comment at any point on any line in a rule. Just make sure you don't comment out the symbol that terminates the rule. The := binds a hypothetical variable to the result of a subrule or grouped pattern. Hypotheticals are covered in Section 7.6 later in this chapter.

The metacharacters ( ), [ ], { }, and < > are bracketing pairs. The pairs always have to be balanced within the rule, unless they are literal characters (escaped with a \). The brackets ( ) and [ ] group patterns to match as a single atom. They're often used to capture a result, mark the boundaries of an alternation, or mark a group of patterns with a quantifier, among other things. Parentheses ( ) are capturing, and square brackets [ ] are noncapturing. The { } brackets define a section of Perl code (a closure) within a rule. These closures are always a successful zero-width match, unless the code explicitly calls the fail function. The < . . . > brackets mark assertions, which handle a variety of constructs including character classes and user-defined quantifiers. Assertions are covered in Section 7.2.4 later in this chapter.
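
The difference between capturing and noncapturing groups is easy to observe in practice. The sketch below uses Python's `re` module as a stand-in, so the spelling is an assumption that does not carry over to rules: Python writes a noncapturing group as `(?: ... )`, where a rule would use `[ ... ]`. Both patterns match the same text; only the capturing form records a result.

```python
import re

# Capturing group: comparable to ( ... ) in a rule.
captured = re.match(r"(ab)+c", "ababc")

# Noncapturing group: comparable to [ ... ] in a rule, written (?: ... )
# in Python.
uncaptured = re.match(r"(?:ab)+c", "ababc")

# The overall match is identical in both cases.
whole_capturing = captured.group(0)
whole_noncapturing = uncaptured.group(0)

# Only the capturing group stores a result (its last repetition).
captured_groups = captured.groups()
uncaptured_groups = uncaptured.groups()

print(whole_capturing, captured_groups)       # ababc ('ab',)
print(whole_noncapturing, uncaptured_groups)  # ababc ()
```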

Table 7-2 summarizes the basic set of metacharacters.

Table 7-2. Metacharacters

Symbol	Meaning
`.`	Match any single character, including a newline.
`^`	Match the beginning of a string.
`$`	Match the end of a string.
`^^`	Match the beginning of a line.
`$$`	Match the end of a line.
`|`	Match alternate patterns (OR).
`&`	Match multiple patterns (AND).
`\`	Escape a metacharacter to get a literal character, or escape a literal character to get a metacharacter.
`#`	Mark a comment (to the end of the line).
`:=`	Bind the result of a match to a hypothetical variable.
`( . . . )`	Group patterns and capture the result.
`[ . . . ]`	Group patterns without capturing.
`{ . . . }`	Execute a closure (Perl 6 code) within a rule.
`< . . . >`	Match an assertion.

7.2.2 Escape Sequences

The escape sequences are literal characters acting as metacharacters, marked with the \ escape. Some escape sequences represent single characters that are difficult to represent literally, like \t for tab, or \x[ . . . ] for a character specified by a hexadecimal number. Some represent limited character classes, like \d for digits or \w for word characters. Some represent zero-width positions in a match, like \b for a word boundary. With all the escape sequences that use brackets, ( ), { }, and < > work in place of [ ].

Note that since an ordinary variable now interpolates as a literal string by default, the \Q escape sequence is rarely needed.

Table 7-3 shows the escape sequences for rules.

Table 7-3. Escape sequences

Escape	Meaning
`\0[ . . . ]`	Match a character given in octal (brackets optional).
`\b`	Match a word boundary.
`\B`	Match when not on a word boundary.
`\c[ . . . ]`	Match a named character or control character.
`\C[ . . . ]`	Match any character except the bracketed named or control character.
`\d`	Match a digit.
`\D`	Match a nondigit.
`\e`	Match an escape character.
`\E`	Match anything but an escape character.
`\f`	Match the form feed character.
`\F`	Match anything but a form feed.
`\n`	Match a (logical) newline.
`\N`	Match anything but a (logical) newline.
`\h`	Match horizontal whitespace.
`\H`	Match anything but horizontal whitespace.
`\L[ . . . ]`	Everything within the brackets is lowercase.
`\Q[ . . . ]`	All metacharacters within the brackets match as literal characters.
`\r`	Match a return.
`\R`	Match anything but a return.
`\s`	Match any whitespace character.
`\S`	Match anything but whitespace.
`\t`	Match a tab.
`\T`	Match anything but a tab.
`\U[ . . . ]`	Everything within the brackets is uppercase.
`\v`	Match vertical whitespace.
`\V`	Match anything but vertical whitespace.
`\w`	Match a word character (Unicode alphanumeric plus "_").
`\W`	Match anything but a word character.
`\x[ . . . ]`	Match a character given in hexadecimal (brackets optional).
`\X[ . . . ]`	Match anything but the character given in hexadecimal (brackets optional).
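
Several of these escapes have close relatives in other regex dialects, which makes them easy to experiment with. The following sketch uses Python's `re` module and covers only the escapes that Python shares in spirit (`\d`, `\w`, `\t`, `\b`, and a hexadecimal character). Python writes the hexadecimal form as `\xe9` rather than with brackets, and it has no equivalents for negated forms such as `\T` or `\X[ . . . ]`. The sample string is invented for illustration.

```python
import re

# A tab after "tab", a number, and a word containing U+00E9.
sample = "tab\there, id 42, caf\u00e9 au lait"

# \d matches digits.
digits = re.findall(r"\d+", sample)

# \w matches word characters; for str patterns Python is Unicode-aware,
# so the accented letter stays inside its word.
words = re.findall(r"\w+", sample)

# \t matches a tab character.
tab_positions = [m.start() for m in re.finditer(r"\t", sample)]

# A character given by its hexadecimal code point.
hex_matches = re.findall(r"\xe9", sample)

# \b is a zero-width word boundary.
has_whole_word_id = re.search(r"\bid\b", sample) is not None

print(digits)         # ['42']
print(words)          # ['tab', 'here', 'id', '42', 'café', 'au', 'lait']
print(tab_positions)  # [3]
```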

7.2.3 Quantifiers

Quantifiers specify the number of times an atom (a single character, metacharacter, escape sequence, grouped pattern, assertion, etc.) will match.

The numeric quantifiers use assertion syntax. A single number (<3>) requires exactly that many matches. A numeric range quantifier (<3..5>) succeeds if the number of matches is between the minimum and maximum numbers, inclusive. A range with three trailing dots (<2...>) is shorthand for <2..Inf>: it requires at least the minimum number of matches and otherwise matches as many times as possible.

Each quantifier has a minimal alternate form, marked with a trailing ?, that matches the shortest possible sequence first.

Table 7-4 shows the built-in quantifiers.

Table 7-4. Quantifiers

Maximal	Minimal	Meaning
`*`	`*?`	Match 0 or more times.
`+`	`+?`	Match 1 or more times.
`?`	`??`	Match 0 or 1 times.
`<n>`	`<n>?`	Match exactly n times.
`<n..m>`	`<n..m>?`	Match at least n and no more than m times.
`<n...>`	`<n...>?`	Match at least n times.
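
The behavior of maximal and minimal quantifiers can be checked with a quick experiment. The sketch below uses Python's `re` module, whose numeric quantifiers are spelled with braces instead of assertion brackets: `{3}` stands in for `<3>`, `{3,5}` for `<3..5>`, and `{2,}` for `<2...>`. That mapping is an assumption made for illustration; the trailing `?` for the minimal form works the same way in both notations.

```python
import re

digits = "1234567"

# Exactly three matches, like <3>.
exactly_three = re.match(r"\d{3}", digits).group()

# Between three and five matches, like <3..5>; the maximal form takes five.
three_to_five = re.match(r"\d{3,5}", digits).group()

# At least two matches, like <2...>; the maximal form takes everything.
at_least_two = re.match(r"\d{2,}", digits).group()

# The minimal form of the range stops as soon as the minimum is met.
three_to_five_minimal = re.match(r"\d{3,5}?", digits).group()

# Maximal versus minimal `+` on a delimited pattern.
maximal_plus = re.match(r"<.+>", "<a><b>").group()
minimal_plus = re.match(r"<.+?>", "<a><b>").group()

print(exactly_three, three_to_five, at_least_two)  # 123 12345 1234567
print(three_to_five_minimal)                       # 123
print(maximal_plus, minimal_plus)                  # <a><b> <a>
```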

7.2.4 Assertions

Generally, an assertion simply states that some condition or state is true, and the match fails when that assertion is false. Many different constructs with many different purposes use assertion syntax.

Assertions match named and anonymous rules, arrays or hashes containing anonymous rules, and subroutines or closures that return anonymous rules. You have to enclose a variable in assertion delimiters to get it to interpolate as an anonymous rule or rules. A bare scalar in a pattern interpolates as a literal string, while a scalar variable in assertion brackets interpolates as an anonymous rule. A bare array in a pattern matches as a series of alternate literal strings, while an array in assertion brackets interpolates as a series of alternate anonymous rules. In the simplest case, a bare hash in a pattern matches a word ( \w+ ) and tries to find that word as one of its keys,^[2] while a hash in assertion brackets does the same, but then also matches the associated value as an anonymous rule.

^[2] The effect is much as if it matched the keys as a series of alternates, but you're guaranteed to match the longest possible key, instead of just the first one it hits in random order.
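
The longest-key guarantee described in the footnote can be imitated in an engine that has no hash interpolation. The following is a rough Python sketch, not a description of how rules implement it: the helper `match_hash_key` and the `commands` dictionary are hypothetical names, and the approach (sorting keys longest-first before joining them as alternates) is just one way to get the same observable result.

```python
import re

def match_hash_key(table, text, pos=0):
    """Return the longest key of `table` that matches at `pos`, or None."""
    if not table:
        return None
    # Ordered alternation takes the first alternate that matches, so put
    # the longest keys first to guarantee the longest key wins.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    found = pattern.match(text, pos)
    return found.group() if found else None

# Hypothetical data: keys that are prefixes of one another.
commands = {"go": 1, "goto": 2, "got": 3}

# A plain alternation in insertion order stops at the first hit.
naive = re.match("|".join(commands), "goto 10").group()

# The helper finds the longest key instead.
longest = match_hash_key(commands, "goto 10")

print(naive, longest)  # go goto
```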

A bare closure in a pattern always matches (unless it calls fail), but a closure in assertion brackets <{ . . . }> must return an anonymous rule, which is immediately matched.

An assertion with parentheses <( . . . )> is similar to a bare closure in a pattern in that it allows you to include straight Perl code within a rule. The difference is that <( . . . )> evaluates the return value of the closure in Boolean context. The match succeeds if the return value is true and fails if the return value is false.
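
Engines without code assertions can approximate a Boolean assertion by testing a condition against each candidate match. The sketch below does this in Python for a run of digits. It assumes that a false assertion causes ordinary backtracking, so a shorter run of digits is tried next; the helper name `match_digits_where` and the "less than 256" condition are hypothetical, chosen only to illustrate the idea.

```python
import re

def match_digits_where(text, condition):
    """Match leading digits in `text`, accepting only runs where
    `condition(run)` is true. Returns the accepted run or None."""
    longest = re.match(r"\d+", text)
    if longest is None:
        return None
    # Emulate backtracking of a maximal quantifier: try the longest run
    # first, then give back one digit at a time.
    for end in range(longest.end(), 0, -1):
        candidate = text[:end]
        if condition(candidate):
            return candidate
    return None

def is_octet(run):
    """Hypothetical condition: the number fits in one byte."""
    return int(run) < 256

print(match_digits_where("255.0.0.1", is_octet))  # 255
print(match_digits_where("300", is_octet))        # 30
print(match_digits_where("abc", is_octet))        # None
```

The second call shows why the condition is part of the match rather than a check applied afterward: when the longest candidate is rejected, the search continues with shorter ones.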

Assertions match character classes, both named and enumerated. A named rule character class is often more accurate than an enumerated character class. For example, <[a-zA-Z]> is commonly used to match alphabetic characters, but generally, what's really needed is the built-in rule <alpha>, which matches the full set of Unicode alphabetic characters.
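
The gap between an enumerated ASCII range and a Unicode-aware alphabetic class shows up as soon as the input contains an accented letter. The sketch below demonstrates this in Python, which has no `<alpha>` rule; `str.isalpha` and the character class `[^\W\d_]` are used as approximate stand-ins, and the second of these is only a common idiom for "word character that is not a digit or underscore", not an exact definition of alphabetic.

```python
import re

word = "na\u00efve"  # "naive" spelled with U+00EF

# An enumerated ASCII range rejects the accented letter.
ascii_only = re.fullmatch(r"[a-zA-Z]+", word)

# A Unicode-aware class accepts it.
unicode_letters = re.fullmatch(r"[^\W\d_]+", word)

# Python's own Unicode-aware alphabetic test agrees.
all_alphabetic = word.isalpha()

print(ascii_only is None)           # True
print(unicode_letters is not None)  # True
print(all_alphabetic)               # True
```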

Table 7-5 shows the syntax for assertions.

Table 7-5. Assertions

Syntax	Meaning
`< . . . >`	Generic assertion delimiter.
`<! . . . >`	Negate any assertion.
`<name>`	Match a named rule or character class.
`<[ . . . ]>`	Match an enumerated character class.
`<- . . . >`	Complement a character class (named or enumerated).
`<" . . . ">`	Match a literal string (interpolated at match time).
`<' . . . '>`	Match a literal string (not interpolated).
`<( . . . )>`	Boolean assertion. Execute a closure and match if it returns a true result.
`<$scalar>`	Match an anonymous rule.
`<@array>`	Match a series of anonymous rules as alternates.
`<%hash>`	Match a key from the hash, then its value (which is an anonymous rule).
`<&sub( )>`	Match an anonymous rule returned by a sub.
`<{code}>`	Match an anonymous rule returned by a closure.
`<.>`	Match any logical grapheme, including combining character sequences.