
21.3. Constraining facets

This section describes the twelve facet elements that are used when deriving by restriction. They fall into six categories: range, length, decimal digit, enumeration, white space, and pattern.

21.3.1 Range restrictions

The simplest kind of restriction available for numeric and date datatypes is a range restriction. For instance, we might want to define a user-derived datatype called "publication year" as a year that must fall between 1000 and 2100, inclusive. Example 21-5 defines a pubYear datatype with these characteristics.

Example 21-5. Defining a range-restricted year datatype

    <xsd:simpleType name="pubYear">
      <xsd:annotation>
        <xsd:documentation>A publication year</xsd:documentation>
      </xsd:annotation>
      <xsd:restriction base="xsd:gYear">
        <xsd:minInclusive value="1000"/>
        <xsd:maxInclusive value="2100"/>
      </xsd:restriction>
    </xsd:simpleType>

The minInclusive element sets the minimum value allowed for the new pubYear datatype and the maxInclusive element sets the maximum value allowed.

Alternatively, we could use the maxExclusive and minExclusive elements. They also set upper and lower bounds, but the bounds exclude the named value. In other words, a maxExclusive value of 2100 allows 2099 but not 2100.
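As a sketch of exclusive bounds, here is a decimal type that must lie strictly between 0 and 1. The type name properFraction is hypothetical, chosen only for illustration.

```xml
<!-- Hypothetical type: a decimal strictly between 0 and 1 -->
<xsd:simpleType name="properFraction">
  <xsd:restriction base="xsd:decimal">
    <xsd:minExclusive value="0"/>  <!-- 0 itself is not allowed -->
    <xsd:maxExclusive value="1"/>  <!-- 1 itself is not allowed -->
  </xsd:restriction>
</xsd:simpleType>
```

Assuming this definition, 0.5 and 0.999 would be valid, while 0, 1 and 1.5 would not.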

21.3.2 Length restrictions

There are three length constraining facets: minLength and maxLength, which work together or separately to set lower and upper bounds on a value's length, and a facet called simply length which requires a specific fixed length.

Length means slightly different things depending on the datatype that it is restricting.

list

If the datatype is a list (e.g. IDREFS) then the length facets constrain the number of items in the list.

string

If the datatype derives from string (directly or through multiple levels of derivation) then the length facets constrain the number of characters in the string.

binary

Applied to a binary datatype (hexBinary or base64Binary), the length facets constrain the number of bytes of decoded binary data.
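The three meanings of length can be sketched side by side. All three type names below are hypothetical, chosen only for illustration.

```xml
<!-- String: exactly two characters -->
<xsd:simpleType name="stateCode">
  <xsd:restriction base="xsd:string">
    <xsd:length value="2"/>
  </xsd:restriction>
</xsd:simpleType>

<!-- List: at most three IDREF items -->
<xsd:simpleType name="shortRefList">
  <xsd:restriction base="xsd:IDREFS">
    <xsd:maxLength value="3"/>
  </xsd:restriction>
</xsd:simpleType>

<!-- Binary: exactly 16 decoded bytes, written as 32 hex digits -->
<xsd:simpleType name="digest128">
  <xsd:restriction base="xsd:hexBinary">
    <xsd:length value="16"/>
  </xsd:restriction>
</xsd:simpleType>
```

Note that for digest128 the facet counts bytes, not the hexadecimal digits that appear in the document.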

21.3.3 Decimal digit restrictions

There are two facets that only apply to decimal numbers and types derived from them. These are totalDigits and fractionDigits.

The first constrains the maximum total number of digits in the decimal representation of the number, and the second constrains the maximum number of digits in the fractional part (after the decimal point). For instance, it would make sense when dealing with dollars to constrain fractionDigits to two.
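A minimal sketch of the dollar case follows. The type name dollarAmount and the limit of eight total digits are assumptions made for illustration.

```xml
<!-- Hypothetical type: up to 8 digits overall, at most 2 after the point -->
<xsd:simpleType name="dollarAmount">
  <xsd:restriction base="xsd:decimal">
    <xsd:totalDigits value="8"/>
    <xsd:fractionDigits value="2"/>
  </xsd:restriction>
</xsd:simpleType>
```

Assuming this definition, 123456.78 would be accepted, while 1.234 (three fractional digits) and 1234567.89 (nine digits in total) would be rejected.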

21.3.4 Enumeration restrictions

You can also define a datatype as a fixed set of allowable values by using several enumeration elements. In Example 21-6, we define a dayOfWeek datatype.

Example 21-6. Defining an enumerated datatype

    <xsd:simpleType name="dayOfWeek">
      <xsd:restriction base="xsd:string">
        <xsd:enumeration value="Sunday"/>
        <xsd:enumeration value="Monday"/>
        <xsd:enumeration value="Tuesday"/>
        <xsd:enumeration value="Wednesday">
          <xsd:annotation>
            <xsd:documentation>Halfway there!</xsd:documentation>
          </xsd:annotation>
        </xsd:enumeration>
        <xsd:enumeration value="Thursday"/>
        <xsd:enumeration value="Friday">
          <xsd:annotation>
            <xsd:documentation>Almost done!</xsd:documentation>
          </xsd:annotation>
        </xsd:enumeration>
        <xsd:enumeration value="Saturday"/>
      </xsd:restriction>
    </xsd:simpleType>

Note that an enumeration element, like all constraining facet elements, may have an annotation sub-element.

Enumeration elements work together to restrict the value to one of the enumeration values. Enumeration is the strictest kind of constraint: it narrows the base datatype down to an explicit set of values. Where other constraints say things like "the value must be higher than X or look like Y", enumerations say: "the value must be one of these values and not anything else".

The enumeration values must be legal values of the base datatype. Therefore, if an enumeration datatype derives from another enumeration datatype, the derived datatype may only have values that the base datatype has.
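To sketch how one enumeration can derive from another, consider the two hypothetical types below. The names trafficLight and stopOrGo are ours, and we assume a schema without a target namespace so that the base type can be referenced by its bare name.

```xml
<!-- Base enumeration: three legal values -->
<xsd:simpleType name="trafficLight">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="red"/>
    <xsd:enumeration value="amber"/>
    <xsd:enumeration value="green"/>
  </xsd:restriction>
</xsd:simpleType>

<!-- Derived enumeration: may only list values the base already allows -->
<xsd:simpleType name="stopOrGo">
  <xsd:restriction base="trafficLight">
    <xsd:enumeration value="red"/>
    <xsd:enumeration value="green"/>
  </xsd:restriction>
</xsd:simpleType>
```

Adding a value such as "blue" to stopOrGo would be an error, because "blue" is not a legal trafficLight value.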

21.3.5 The whiteSpace facet

The whiteSpace facet is a little bit different from the others. Rather than constraining the value of a datatype, it constrains the processing. The facet can have one of three values:

preserve

The datatype processor will leave the whitespace alone.

replace

Whitespace of any kind is changed into space characters.

collapse

Sequences of whitespace are collapsed to a single space and leading and trailing whitespace is discarded.

Among the built-in datatypes, string uses the value preserve and normalizedString, which derives from string, uses replace. All of the others use collapse, including token (which derives from normalizedString) and the types derived from it. As we discussed in 21.1.2.2.2, "Other XML constructs", on page 451, all of the datatypes that represent names of things (language, Name, NMTOKEN, etc.) derive from token.
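As an illustration, here is a sketch of a user-derived type that tightens the whitespace handling it inherits from string. The name tidyTitle is hypothetical.

```xml
<!-- Hypothetical type: a string whose whitespace is collapsed -->
<xsd:simpleType name="tidyTitle">
  <xsd:restriction base="xsd:string">
    <xsd:whiteSpace value="collapse"/>
  </xsd:restriction>
</xsd:simpleType>
```

Assuming this definition, a value written with leading, trailing or repeated spaces, such as "  Moby    Dick ", would be treated as "Moby Dick" before any other facets are checked.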

21.3.6 The pattern facet

The most sophisticated and powerful facet is the pattern facet. A pattern element has a value attribute whose content is a regular expression.

The regular expressions that constrain datatypes are based on those used in various programming languages. They have especially close ties to the Perl programming language, which integrates regular expressions into its core syntax.

Note

The full regular expression language is quite complicated because of deep support for Unicode. There are also shortcuts to reduce the size of regular expressions. We do not cover all of these. Instead we concentrate on those regular expression features that are used most of the time.

Example 21-7 illustrates some interesting patterns. We'll explain what goes into them in the following sections.

Example 21-7. Pattern examples

    <xsd:simpleType name="even-number">
      <xsd:restriction base="xsd:decimal">
        <xsd:pattern value="\d*[02468]"/>
      </xsd:restriction>
    </xsd:simpleType>

    <xsd:simpleType name="old-fashioned-domain-name">
      <xsd:restriction base="xsd:string">
        <xsd:pattern value="\w+\.(com|net|org|gov)"/>
      </xsd:restriction>
    </xsd:simpleType>

    <xsd:simpleType name="phone-number">
      <xsd:restriction base="xsd:string">
        <xsd:pattern value="(\d{3}-)?\d{3}-\d{4}"/>
      </xsd:restriction>
    </xsd:simpleType>

21.3.6.1 Constructing regular expressions

The simplest regular expression is just a character. a is a regular expression that matches the character "a". Because a pattern must match the entire value, there is only one string in the world that matches this expression.

If we put other letters beside the "a" then the regular expression will match them in order. abcd matches "a" and then "b" and then "c" and then "d". Once again there is only one string that matches that expression.

21.3.6.1.1 Quantifiers

A slightly more sophisticated regular expression will match one or more occurrences of the last letter "d": abcd+. The plus symbol means "one or more of this thing", where "this thing" is whatever comes just before the plus symbol, typically a character. There are also ? and * symbols available. They stand for "zero or one of the thing" and "zero or more of the thing", respectively. We call these symbols quantifiers.

There is another quantifier that uses curly braces ({}) with one or two numbers between them. The quantifier {3} means that there should be three occurrences of the thing.

So ba{3}d matches "baaad". It is also possible to express a range: ba{3,7}d matches "baaad", "baaaad", "baaaaad", "baaaaaad" and "baaaaaaad". The lower bound can go as low as zero, in which case the item is optional; {0,1} is equivalent to a question mark.

The upper bound can be omitted like this: ba{3,}d. That means that there is no upper bound. That expression is equivalent to baaaa*d which is itself equivalent to baaa+d. All three expressions match three or more occurrences of the middle letter "a".

We can put these ideas together. For instance ab{2,5}c+d{4} matches one "a" followed by two to five "b"s followed by one or more "c"s followed by exactly 4 "d"s. Example 21-8 demonstrates.

Example 21-8. Quantifiers in a regular expression

    <xsd:simpleType name="reg-exp-example">
      <xsd:restriction base="xsd:string">
        <xsd:pattern value="ab{2,5}c+d{4}"/>
      </xsd:restriction>
    </xsd:simpleType>

21.3.6.1.2 Alternatives and grouping

Regular expressions use the | symbol to represent alternatives.

Consider the regular expression yes+|no+. It matches "yes", "yess", "yesss", "no", "noo", "nooo" and so forth.

If we want to repeat the whole word yes, we can use parentheses. The regular expression (yes)+|(no)+ matches the strings "yes", "yesyes", "yesyesyes", "no", "nono", "nonono" and so forth.

We can even use parentheses to group the whole expression: (yes|no)+. This expression allows us to match multiple occurrences of "yes" and "no". For example: "yes", "no", "yesyes", "yesno", "noyes", "nono", "yesyesyes", "yesnoyes" and so forth.
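Grouping and alternatives combine naturally with quantifiers. The sketch below, whose type name answer-list is hypothetical, describes a comma-separated series of answers.

```xml
<!-- Hypothetical type: one answer, then zero or more ",answer" groups -->
<xsd:simpleType name="answer-list">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="(yes|no)(,(yes|no))*"/>
  </xsd:restriction>
</xsd:simpleType>
```

Assuming this definition, "yes", "yes,no" and "no,no,yes" would be valid, while "yes," and "maybe" would not.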

21.3.6.1.3 Special characters

There are ways to refer to common control characters:

\n represents the newline (or line-feed) character.
\r represents the return character.
\t represents the tab character.

So a\nb represents the letter "a" followed by a newline followed by "b".

There are also ways to refer to the characters that would normally be interpreted as symbols. For instance the + character might be necessary in a regular expression involving mathematics.

Convert a symbol into an ordinary character by preceding it with the symbol \. So \? is translated into the ordinary character (not special symbol) ?.
Convert a backslash symbol to a character like this: \\. Convert two of them like this: \\\\.
The symbol characters you need to handle in this way are: \ | . - ^ ? * + { } ( ) [ ]
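Here is a sketch of these escapes inside pattern facets. Both type names are hypothetical, chosen only for illustration.

```xml
<!-- The plus signs are escaped; # is an ordinary character and is not -->
<xsd:simpleType name="language-name">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="C\+\+|C#|F#"/>
  </xsd:restriction>
</xsd:simpleType>

<!-- \\ stands for one literal backslash, \. for a literal period -->
<xsd:simpleType name="readme-path">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="C:\\docs\\readme\.txt"/>
  </xsd:restriction>
</xsd:simpleType>
```

Assuming these definitions, language-name accepts "C++", "C#" and "F#", and readme-path accepts only the string C:\docs\readme.txt.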

21.3.6.2 Character classes

It is often useful to be able to refer to lists of characters without explicitly listing all of the characters. For instance, it would be annoying to need to enter (1|2|3|4|5|6|7|8|9|0) whenever you want to allow a digit.

Plus there are many Unicode characters that are considered digits that are not in this set. Examples include TIBETAN DIGIT ZERO and GUJARATI DIGIT TWO. It would not be internationally correct to ignore those!

These sets of reusable characters are known as character classes. They are represented by a backslash character followed by a letter.^[6]

^[6] Note that the backslash does two different jobs. It turns symbol (punctuation) characters into ordinary characters and ordinary letters into character classes.

21.3.6.2.1 Built-in character classes

This particular character class, digits, is represented by \d. So \d+ means one or more digits, while \d+a\d{2,5} means one or more digits, and then the letter "a", and then two to five more digits. The opposite of this character class is indicated by an upper-case \D, meaning anything that is not a digit.

Another major character class is indicated simply by a period (.). It matches any character except a newline or return character. So d.g matches "dig" and "dog" but also "d~g" and "d%g". The middle character could even be Kanji or Cherokee (both components of Unicode!).

The character class \s represents any whitespace character (space, tab, newline, carriage return). \S is its opposite. It represents any non-whitespace character.

\i represents the set of "initial name" characters: basically letters, the underscore and the colon.^[7] \I is its opposite: anything that is not an initial name character. \c represents the set of all name characters. \C is its opposite.^[8]

^[7] The class was renamed from the proper XML terminology of name start characters so that \n could have its common regular expression meaning of "newline".

^[8] See 15.1.4, "Names", on page 353 for a refresher on both of these character classes.

\w represents what you might call "word" characters: basically letters, digits, and some symbols (e.g. currency, math). That is to say, characters that are not punctuation, separators or the like. \w+ will roughly match a word. \W represents the opposite: the characters that are not considered word characters.
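The built-in classes can be sketched in pattern facets as follows. Both type names are hypothetical.

```xml
<!-- One initial name character followed by any number of name characters -->
<xsd:simpleType name="name-like">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="\i\c*"/>
  </xsd:restriction>
</xsd:simpleType>

<!-- One or more characters, none of which is whitespace -->
<xsd:simpleType name="no-whitespace">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="\S+"/>
  </xsd:restriction>
</xsd:simpleType>
```

Assuming these definitions, "_draft" and "xsd:string" are valid name-like values but "2fast" is not, because a digit is not an initial name character.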

21.3.6.2.2 Constructing a character class

You can also create character classes in your regular expressions (though you cannot give them fancy backslash-prefixed names!). The syntax to do this is called a character class expression and it uses square brackets ([]).

For instance to represent the first four characters in the alphabet you would say [abcd]. This simple example could just as easily be represented as (a|b|c|d) but the character class notation allows a couple of tricks that are difficult to emulate with the | symbol.

When you construct a character class expression you can specify ranges. For instance [a-z] represents the characters from "a" through "z" in Unicode (which are the same as in the English alphabet). You can even put multiple ranges together: [a-zA-Z0-9].

Inside square brackets, the characters *, +, (, ), {, } and ? are just characters. They have no special meaning. Backslash (\) remains special because we use it to refer to the built-in character classes (\d for digits and so forth). Here is a character class that matches all of them and also matches Unicode digit and word characters: [*+(){}?\d\w].

It is also possible to construct a "negative" character class which includes all of the characters that you do not list. You do this by starting your character class expression with a caret (^) symbol. So to match every character except "a" and "z", you would use the regular expression: [^az]. If you wanted to match everything except the characters from "a" to "z", you could do that like this: [^a-z].
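Character class expressions can be sketched in pattern facets like this. Both type names are hypothetical.

```xml
<!-- A # followed by exactly six hexadecimal digits, built from three ranges -->
<xsd:simpleType name="hex-color">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="#[0-9a-fA-F]{6}"/>
  </xsd:restriction>
</xsd:simpleType>

<!-- Negative class: any number of characters, none of them an ASCII digit -->
<xsd:simpleType name="no-ascii-digits">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="[^0-9]*"/>
  </xsd:restriction>
</xsd:simpleType>
```

Assuming these definitions, "#ff8800" is a valid hex-color but "#ff88" and "#gg0000" are not, and no-ascii-digits accepts "hello" (and the empty string) but rejects "route66".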
