Hack 72 Validate an XML Document with RELAX NG

   


Compared to the alternatives, RELAX NG schemas are easy to learn and use, and the more you use them, the more convinced you become.

RELAX NG (http://www.relaxng.org) is a powerful schema language with a simple syntax. Originally, RELAX NG was developed in a small OASIS technical committee led by James Clark. It is based on ideas from Clark's TREX (http://www.thaiopensource.com/trex/) and Murata Makoto's Relax (http://www.xml.gr.jp/relax/), and its first committee spec was published on December 3, 2001 (http://www.oasis-open.org/committees/relax-ng/spec.html). A tutorial is also available (http://www.oasis-open.org/committees/relax-ng/tutorial.html). Recently, RELAX NG became an international standard under ISO as ISO/IEC 19757-2:2004, Information technology -- Document Schema Definition Language (DSDL) -- Part 2: Regular-grammar-based validation -- RELAX NG (see http://www.y12.doe.gov/sgml/sc34/document/0458.htm).

RELAX NG schemas may be written in either XML or a compact syntax. This hack demonstrates both.

5.6.1 XML Syntax

Recall the document time.xml:

<?xml version="1.0" encoding="UTF-8"?>

<!-- a time instant -->
<time timezone="PST">
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
 <atomic signal="true"/>
</time>

Here is a RELAX NG schema for time.xml called time.rng:

<element name="time" xmlns="http://relaxng.org/ns/structure/1.0">
 <attribute name="timezone"/>
 <element name="hour"><text/></element>
 <element name="minute"><text/></element>
 <element name="second"><text/></element>
 <element name="meridiem"><text/></element>
 <element name="atomic">
  <attribute name="signal"/>
 </element>
</element>

At a glance, you can immediately tell how simple the syntax is. Each element is defined with an element element, and each attribute with an attribute element. The namespace URI for RELAX NG is http://relaxng.org/ns/structure/1.0. The document element in this schema happens to be element, but any element in RELAX NG that defines a pattern may be used as a document element (grammar may also be used, even though it doesn't define a pattern). Each of the elements and attributes defined in this schema has text content, as indicated by the text element for elements and by default for attributes; for example, <attribute name="signal"/> and <attribute name="signal"><text/></attribute> are equivalent.
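To make the structure of time.rng concrete, the following Python sketch parses the schema with the standard library and lists every element and attribute pattern it declares. The function name declared_names and the embedded TIME_RNG string are illustrative assumptions, not part of the hack; the sketch inspects the schema only and does not validate anything.

```python
import xml.etree.ElementTree as ET

# Namespace URI for RELAX NG, as stated in the text above
RNG_NS = "http://relaxng.org/ns/structure/1.0"

# A copy of time.rng, embedded so that the sketch is self-contained
TIME_RNG = """\
<element name="time" xmlns="http://relaxng.org/ns/structure/1.0">
 <attribute name="timezone"/>
 <element name="hour"><text/></element>
 <element name="minute"><text/></element>
 <element name="second"><text/></element>
 <element name="meridiem"><text/></element>
 <element name="atomic">
  <attribute name="signal"/>
 </element>
</element>
"""

def declared_names(schema_text):
    """Return (kind, name) pairs for each element and attribute pattern.

    Hypothetical helper: it walks the schema in document order and
    ignores every other RELAX NG pattern (text, choice, and so on).
    """
    root = ET.fromstring(schema_text)
    wanted = (f"{{{RNG_NS}}}element", f"{{{RNG_NS}}}attribute")
    names = []
    for node in root.iter():
        # ElementTree spells qualified names as {namespace-uri}local-name
        if node.tag in wanted:
            kind = node.tag.split("}", 1)[1]
            names.append((kind, node.get("name")))
    return names

if __name__ == "__main__":
    for kind, name in declared_names(TIME_RNG):
        print(kind, name)
```

Because the schema is itself an XML document in the RELAX NG namespace, any namespace-aware XML tool can read it this way.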

5.6.1.1 xmllint

You can validate documents with RELAX NG using xmllint [Hack #9]. To validate time.xml against time.rng, type this command in a shell:

xmllint --relaxng time.rng time.xml

The response upon success will be:

<?xml version="1.0" encoding="UTF-8"?>
<!-- a time instant -->
<time timezone="PST">
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
 <atomic signal="true"/>
</time>
time.xml validates

xmllint mirrors the well-formed document on standard output, and on the last line it reports that the document validates. You can submit one or more XML instances at the end of the command line for validation.
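If you want to drive xmllint from a script, a wrapper along the following lines may help. It is a minimal sketch, not part of the hack: the function names xmllint_command and validate are hypothetical, and it assumes that xmllint is on your PATH and exits with status 0 only when every instance validates. The --noout option is a standard xmllint switch that suppresses the mirrored copy of the document.

```python
import shutil
import subprocess

def xmllint_command(schema, instances, quiet=True):
    """Build the xmllint argument list for RELAX NG validation.

    Hypothetical helper; it only assembles the command and runs nothing.
    """
    cmd = ["xmllint"]
    if quiet:
        cmd.append("--noout")  # do not echo the document to standard output
    cmd += ["--relaxng", schema]
    cmd += list(instances)     # one or more instance documents
    return cmd

def validate(schema, instances):
    """Run xmllint and return True when it exits with status 0."""
    if shutil.which("xmllint") is None:
        raise RuntimeError("xmllint not found on PATH")
    result = subprocess.run(xmllint_command(schema, instances))
    return result.returncode == 0

if __name__ == "__main__":
    # Prints the command that would be run for the example in this hack
    print(" ".join(xmllint_command("time.rng", ["time.xml"])))
```

Calling validate("time.rng", ["time.xml"]) in a directory that holds both files runs the same check as the command shown earlier, minus the echoed document.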

5.6.1.2 Jing

You can also validate documents with RELAX NG using James Clark's Jing (http://www.thaiopensource.com/relaxng/jing.html). You can download the latest version from http://www.thaiopensource.com/download/. To validate time.xml against time.rng, use this command:

java -jar jing.jar time.rng time.xml

When Jing is silent after this command, it means that time.xml is valid with regard to time.rng. Jing, by the way, can accept one or more instance documents on the command line.

Jing also has a Windows 32 version, jing.exe, downloadable from the same location (http://www.thaiopensource.com/download/). In my tests, jing.exe runs faster than jing.jar, as you might expect.

At a Windows command prompt, run jing.exe like this:

jing time.rng time.xml

5.6.1.3 A more complex RELAX NG schema

Example 5-8 is a more complex, yet more precise, version of time.rng called precise.rng, which refines what is permitted in an instance.

Example 5-8. precise.rng
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
 datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">

<start>
 <ref name="Time"/>
</start>

<define name="Time">
<element name="time">
 <attribute name="timezone">
  <ref name="Timezones"/>
 </attribute>
 <element name="hour">
  <ref name="Hours"/>
 </element>
 <element name="minute">
  <ref name="MinutesSeconds"/>
 </element>
 <element name="second">
  <ref name="MinutesSeconds"/>
 </element>
 <element name="meridiem">
  <choice>
   <value>a.m.</value>
   <value>p.m.</value>
  </choice>
 </element>
 <element name="atomic">
  <attribute name="signal">
  <choice>
   <value>true</value>
   <value>false</value>
  </choice>
  </attribute>
 </element>
</element>
</define>

<define name="Timezones">
 <!-- http://www.timeanddate.com/library/abbreviations/timezones/ -->
  <choice>
   <value>GMT</value>
   <value>UTC</value>
   <value>ACDT</value>
   <value>ACST</value>
   <value>ADT</value>
   <value>AEDT</value>
   <value>AEST</value>
   <value>AKDT</value>
   <value>AKST</value>
   <value>AST</value>
   <value>AWST</value>
   <value>BST</value>
   <value>CDT</value>
   <value>CEST</value>
   <value>CET</value>
   <value>CST</value>
   <value>CXT</value>
   <value>EDT</value>
   <value>EEST</value>
   <value>EET</value>
   <value>EST</value>
   <value>HAA</value>
   <value>HAC</value>
   <value>HADT</value>
   <value>HAE</value>
   <value>HAP</value>
   <value>HAR</value>
   <value>HAST</value>
   <value>HAT</value>
   <value>HAY</value>
   <value>HNA</value>
   <value>HNC</value>
   <value>HNE</value>
   <value>HNP</value>
   <value>HNR</value>
   <value>HNT</value>
   <value>HNY</value>
   <value>IST</value>
   <value>MDT</value>
   <value>MESZ</value>
   <value>MEZ</value>
   <value>MST</value>
   <value>NDT</value>
   <value>NFT</value>
   <value>NST</value>
   <value>PDT</value>
   <value>PST</value>
   <value>WEST</value>
   <value>WET</value>
   <value>WST</value>
  </choice>
</define>

<define name="Hours">
 <data type="string"><param name="pattern">[0-1][0-9]|2[0-3]</param></data>
</define>

<define name="MinutesSeconds">
 <data type="integer">
  <param name="minInclusive">0</param>
  <param name="maxInclusive">59</param>
 </data>
</define>

</grammar>

This schema uses the grammar document element (line 1). RELAX NG supports the XML Schema datatype library, and so it is declared on line 2. The start element (line 4) indicates where the instances will start; i.e., what the document element of the instance will be. The ref element refers to a named definition (define), which starts on line 8. There are no name conflicts between named definitions and other named structures such as element and attribute. This means that you could have a definition named time and an element named time with no conflicts. (I use Time as the name of the definition just as a personal convention.)

The possible values for the timezone attribute (line 10) are defined in the Timezones definition (line 39). The choice element (line 41) indicates that the content of exactly one of the 50 enumerated value elements may be used as the value of timezone. This technique is also used for the content of the meridiem element (line 22) and the signal attribute (line 29).

The content of the hour, minute, and second elements is in each case specified by a reference to a definition. The hour element refers to the Hours definition (line 95). The data element points to the XML Schema type string (line 96). This string is constrained by the param element whose name is pattern (corresponding to the XML Schema facet pattern). The regular expression [0-1][0-9]|2[0-3] indicates that the content of hour must be two consecutive digits, either in the range 00 through 19 ([0-1][0-9]) or in the range 20 through 23 (2[0-3]). The elements minute and second both refer to the definition MinutesSeconds (line 99). Rather than use a regular expression, this definition takes a different approach: it uses the integer type with a minInclusive parameter of 0 (line 101) and a maxInclusive of 59 (line 102).
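The two datatype constraints can be tried out in isolation. The following Python sketch mimics what the Hours and MinutesSeconds definitions accept; the function names valid_hours and valid_minutes_seconds are hypothetical, and Python's re module is only a stand-in for XML Schema regular expressions, although the two agree for this simple pattern. An XML Schema pattern has to match the whole value, so the sketch uses fullmatch.

```python
import re

# Same expression as the pattern parameter in the Hours definition
HOURS_PATTERN = re.compile(r"[0-1][0-9]|2[0-3]")

# Approximate lexical form of xsd:integer: optional sign, then digits
INTEGER_FORM = re.compile(r"[+-]?[0-9]+")

def valid_hours(text):
    """True if text is 00 through 19 or 20 through 23."""
    # fullmatch, because a pattern facet must match the entire value
    return HOURS_PATTERN.fullmatch(text) is not None

def valid_minutes_seconds(text):
    """True if text is an integer from 0 through 59 inclusive."""
    stripped = text.strip()  # rough approximation of whitespace collapsing
    if INTEGER_FORM.fullmatch(stripped) is None:
        return False
    return 0 <= int(stripped) <= 59

if __name__ == "__main__":
    for sample in ("11", "23", "24", "7"):
        print("hour", sample, valid_hours(sample))
    for sample in ("59", "60", "5"):
        print("minute/second", sample, valid_minutes_seconds(sample))
```

The difference between the two approaches shows up with single digits: the pattern rejects an hour of 7 because it demands two digits, whereas the integer range accepts a minute of 5 because only the numeric value matters.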

Test precise.rng by validating time.xml against it with xmllint:

xmllint --relaxng precise.rng time.xml

Or with Jing:

java -jar jing.jar precise.rng time.xml

Or with jing.exe:

jing precise.rng time.xml

5.6.2 Compact Syntax

RELAX NG's non-XML compact syntax is a pleasure to use (http://www.oasis-open.org/committees/relax-ng/compact-20021121.html). A tutorial on the compact syntax is available (http://relaxng.org/compact-tutorial-20030326.html). Its syntax is similar to XQuery's computed constructor syntax (http://www.w3.org/TR/xquery/#id-computedConstructors). Following is a compact version of time.rng called time.rnc (the .rnc file suffix is conventional, representing the use of compact syntax):

element time {
  attribute timezone { text },
  element hour { text },
  element minute { text },
  element second { text },
  element meridiem { text },
  element atomic {
    attribute signal { text }
  }
}

The RELAX NG namespace is assumed though not declared explicitly. The element, attribute, and text keywords define elements, attributes, and text content, respectively. Sets of braces ({ }) hold content models.
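The correspondence between the two syntaxes is regular enough to sketch in a few lines of Python. The following is a toy converter, not a replacement for Trang: the function to_compact is a hypothetical helper that handles only the element, attribute, and text patterns used in time.rng and raises an error for anything else.

```python
import xml.etree.ElementTree as ET

# RELAX NG namespace in ElementTree's {uri}local-name notation
RNG = "{http://relaxng.org/ns/structure/1.0}"

# A copy of time.rng, embedded so that the sketch is self-contained
TIME_RNG = """\
<element name="time" xmlns="http://relaxng.org/ns/structure/1.0">
 <attribute name="timezone"/>
 <element name="hour"><text/></element>
 <element name="minute"><text/></element>
 <element name="second"><text/></element>
 <element name="meridiem"><text/></element>
 <element name="atomic">
  <attribute name="signal"/>
 </element>
</element>
"""

def to_compact(node, depth=0):
    """Render an element, attribute, or text pattern in compact syntax."""
    indent = "  " * depth
    kind = node.tag.replace(RNG, "")
    if kind == "text":
        return indent + "text"
    if kind not in ("element", "attribute"):
        raise ValueError("unsupported pattern: " + kind)
    children = list(node)
    head = f"{indent}{kind} {node.get('name')}"
    # An attribute with no child pattern defaults to text content
    only_text = [c.tag for c in children] == [RNG + "text"]
    if (kind == "attribute" and not children) or only_text:
        return head + " { text }"
    if not children:
        raise ValueError("element pattern needs content: " + node.get("name"))
    # Child patterns in sequence are separated by commas in compact syntax
    body = ",\n".join(to_compact(c, depth + 1) for c in children)
    return f"{head} {{\n{body}\n{indent}}}"

if __name__ == "__main__":
    print(to_compact(ET.fromstring(TIME_RNG)))
```

Run as a script, this prints a compact schema with the same shape as time.rnc, which shows that the braces in compact syntax play the role of element nesting in the XML syntax.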

5.6.2.1 Jing with compact syntax

You cannot validate a document with xmllint when using compact syntax. You can validate a document using Jing and the -c switch. The command looks like:

java -jar jing.jar -c time.rnc time.xml

Or with jing.exe it looks like:

jing -c time.rnc time.xml

Silence is golden with Jing. In other words, if Jing reports nothing, the document is valid.

5.6.2.2 RNV

David Tolpin has developed a validator for RELAX NG's compact syntax; it is called RNV and is written in C (http://davidashen.net/rnv.html). It is fast and is a nice piece of work. Source is available, and you can recompile it on your platform using the makefile provided or by writing your own. A Windows 32 executable version is also available. Download the latest version of either from http://ftp.davidashen.net/PreTI/RNV/.

A copy of the Windows 32 executable rnv.exe (Version 1.6.1) is available in the file archive. Validate time.xml against time.rnc using this command:

rnv -p time.rnc time.xml

The -p option writes the file to standard output, as shown here. Without it, only the name of the validated file (the first item in the output below) is displayed when validation succeeds.

time.xml
<?xml version="1.0" encoding="UTF-8"?>

<!-- a time instant -->
<time timezone="PST">
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
 <atomic signal="true"/>
</time>

A nice feature of RNV is that it can check a compact schema alone, without validating an instance. This is done with the -c option:

rnv -c time.rnc

As with Jing, the sound of silence means that the compact grammar is in good shape.

5.6.2.3 A more complex RELAX NG schema in compact syntax

Example 5-9 is a more complex yet more precise version of time.rnc called precise.rnc, which is only about 25 percent as long as its counterpart precise.rng.

Example 5-9. precise.rnc
start = Time
Time =
  element time {
    attribute timezone { Timezones },
    element hour { Hours },
    element minute { MinutesSeconds },
    element second { MinutesSeconds },
    element meridiem { "a.m." | "p.m." },
    element atomic {
      attribute signal { "true" | "false" }
    }
  }
Timezones =
  # http://www.timeanddate.com/library/abbreviations/timezones/
  "GMT" | "UTC" | "ACDT" | "ACST" | "ADT" | "AEDT" | "AEST" | "AKDT"
  | "AKST" | "AST" | "AWST" | "BST" | "CDT" | "CEST" | "CET" | "CST"
  | "CXT" | "EDT" | "EEST" | "EET" | "EST" | "HAA" | "HAC" | "HADT"
  | "HAE" | "HAP" | "HAR" | "HAST" | "HAT" | "HAY" | "HNA" | "HNC"
  | "HNE" | "HNP" | "HNR" | "HNT" | "HNY" | "IST" | "MDT" | "MESZ"
  | "MEZ" | "MST" | "NDT" | "NFT" | "NST" | "PDT" | "PST" | "WEST"
  | "WET" | "WST"
Hours = xsd:string { pattern = "[0-1][0-9]|2[0-3]" }
MinutesSeconds = xsd:integer { minInclusive = "0" maxInclusive="59"}

The compact schema precise.rnc was generated by Trang from precise.rng (http://www.thaiopensource.com/relaxng/trang.html).


Comparing precise.rnc with precise.rng should yield many insights into the compact syntax. The start symbol (line 1) indicates where the document element begins, as does the start element in XML syntax. The names of definitions (lines 2, 13, 22, and 23) are followed by equals signs (=), then by the patterns they represent. These definitions are referenced by name in the content models of elements or attributes (lines 1, 4, 5, 6, and 7). Choices of values are separated by a vertical bar (|) on lines 8, 10, and 15-21, and each of the values is quoted. Comments begin with # (line 14) instead of beginning with <!-- and ending with -->. The XML Schema datatype library is assumed, without being identified in the schema directly. Anything prefixed with xsd: is assumed to be a datatype from the XML Schema datatype library (xsd:string on line 22 and xsd:integer on line 23). The pattern keyword on line 22 is associated with a regular expression. The minInclusive and maxInclusive keywords are parameters (facets in XML Schema) that define an inclusive range of 0 through 59.

Test this compact schema by validating time.xml against it with RNV:

rnv precise.rnc time.xml

with Jing:

java -jar jing.jar -c precise.rnc time.xml

or with jing.exe:

jing -c precise.rnc time.xml

5.6.3 See Also

  • Eric van der Vlist's RELAX NG (O'Reilly) provides a complete tutorial for RELAX NG, plus a reference

  • If you run into problems, a good place to post questions is the RELAX NG user list: http://relaxng.org/mailman/listinfo/relaxng-user

  • Sun's Multi-schema validator by Kawaguchi Kohsuke: http://wwws.sun.com/software/xml/developers/multischema/

  • Tenuto, a C# validator for RELAX NG: http://sourceforge.net/projects/relaxng



XML Hacks
XML Hacks: 100 Industrial-Strength Tips and Tools
ISBN: 0596007116
EAN: 2147483647
Year: 2006
Pages: 156
