Prerequisites | Professional XML Development with Apache Tools: Xerces, Xalan, FOP, Cocoon, Axis, Xindice (Wrox Professional Guides)

You must understand a few basics about XML and related standards in order to make good use of the material in this chapter. Following is a quick review. If you need more information, XML in a Nutshell, Second Edition by Elliotte Rusty Harold and W. Scott Means is a good source for the relevant background. Let’s begin with the following simple XML file:

   1: <?xml version="1.0" encoding="UTF-8"?>
   2: <book xmlns="http://sauria.com/schemas/apache-xml-book/book"
   3:   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   4:   xsi:schemaLocation=
   5:    "http://sauria.com/schemas/apache-xml-book/book
   6:     http://www.sauria.com/schemas/apache-xml-book/book.xsd"
   7:   version="1.0">
   8:   <title>Professional XML Development with Apache Tools</title>
   9:   <author>Theodore W. Leung</author>
  10:   <isbn>0-7645-4355-5</isbn>
  11:   <month>December</month>
  12:   <year>2003</year>
  13:   <publisher>Wrox</publisher>
  14:   <address>Indianapolis, Indiana</address>
  15: </book>

Like most XML files, this file begins with an XML declaration (line 1). The XML declaration says that this is an XML file, that the version of XML being used is 1.0, and that the character encoding used for this file is UTF-8. Until recently, the version number was always 1.0, but the W3C XML Working Group is in the process of defining XML 1.1. When they have finished their work, you will be able to supply 1.1 in addition to 1.0 for the version number. If there is no encoding declaration, the document must be encoded in UTF-8 or UTF-16. If you omit the encoding declaration for a document in some other encoding, or declare an encoding that doesn’t match the actual bytes in the file, your XML parser will report a fatal error. We’ll have more to say about fatal errors later in the chapter.
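To make the encoding rule concrete, here is a minimal sketch that parses the same bytes under two different encoding declarations. It uses Python's standard xml.etree.ElementTree module rather than the parsers covered in this book, and the <name> element and the try_parse helper are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# The byte 0xE9 is "é" in ISO-8859-1, but on its own it is not valid UTF-8.
body = b"<name>Caf\xe9</name>"


def try_parse(declared_encoding: str) -> str:
    """Parse `body` under the declared encoding; return its text or 'fatal error'."""
    prolog = f'<?xml version="1.0" encoding="{declared_encoding}"?>'.encode("ascii")
    try:
        return ET.fromstring(prolog + body).text
    except ET.ParseError:
        # The parser refuses to continue: the bytes contradict the declaration.
        return "fatal error"


correct = try_parse("ISO-8859-1")  # declaration matches the bytes
wrong = try_parse("UTF-8")         # declaration contradicts the bytes

print(correct)
print(wrong)
```

The only thing that changes between the two calls is the declaration. The parser does not guess at the intended encoding when the bytes and the declaration disagree.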

Well-Formedness

The rest of the file consists of data that has been marked up with tags (such as <title> and <author>). The first rule or prerequisite for an XML document is that it must be well-formed. (An XML parser is required by the XML specification to report a fatal error if a document isn’t well-formed.) This means every start tag (like <book>) must have a matching end tag (</book>); an element with no content may instead be written as a single empty-element tag such as <title/>. The start and end tag, along with the data in between them, is called an element. Elements may not overlap; they must be nested within each other. In other words, the start and end tag of an element must be inside the start and end tag of any element that encloses it. The data between the start and end tag is also known as the content of the element; it may contain elements, characters, or a mix of elements and characters. Note that the start tag of an element may contain attributes. In our example, the book element contains an xsi:schemaLocation attribute in lines 4-6. The value of an attribute must be enclosed in either single quotes (') or double quotes ("). The type of the end quote must match the type of the beginning quote.
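The rules above can be tried out with a short sketch. It assumes Python's standard xml.etree.ElementTree parser, which raises ParseError for any document that isn't well-formed; the sample strings and the is_well_formed helper are invented for the illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text`, False if it reports an error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = {
    "nested": "<book><title>XML</title></book>",
    "overlapping": "<book><title>XML</book></title>",
    "missing end tag": "<book><title>XML</title>",
    "mismatched quotes": "<book version='1.0\"/>",
    "single quotes": "<book version='1.0'/>",
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'fatal error'}")
```

Only the properly nested sample and the consistently quoted sample are accepted; the other three are rejected outright rather than repaired.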

Namespaces

In lines 2-3 you see two namespace declarations. The first declaration, in line 2, sets the default namespace for this document to http://sauria.com/schemas/apache-xml-book/book. Namespaces are used to prevent name clashes between elements from two different grammars. You can easily imagine the element name title or author being used in another XML grammar, say one for music CDs. If you want to combine elements from those two grammars, you will run into problems trying to determine whether a title element is from the book grammar or the CD grammar.

Namespaces solve that problem by allowing you to associate each element in a grammar with a namespace. The namespace is specified by a URI, which is used to provide a unique name for the namespace. The URI is only a name: you can’t expect to be able to retrieve anything from it. When you’re using namespaces, it’s as if each element or attribute name is prefixed by the namespace URI. This is very cumbersome, so the XML Namespaces specification provides two kinds of shorthand. The first shorthand is the ability to specify the default namespace for a document, as in line 2. The other shorthand is the ability to declare an abbreviation that can be used in the document instead of the namespace URI. This abbreviation is called the namespace prefix. In line 3, the document declares a namespace prefix xsi for the namespace associated with http://www.w3.org/2001/XMLSchema-instance. To declare a prefix, you place a colon and the desired prefix after xmlns, as in xmlns:xsi.

Line 4 shows how namespace prefixes are used. The attribute schemaLocation is prefixed by xsi, and the two are separated by a colon. The combined name xsi:schemaLocation is called a qualified name (QName). The prefix is xsi, and the schemaLocation portion is referred to as the local part of the QName. (It’s important to know what all these parts are called because the XML parser APIs let you access each piece from your program.)
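As a rough illustration of how a parser API exposes those pieces, the sketch below uses the DOM implementation in Python's standard library (xml.dom.minidom). This is an assumption made for brevity; it is not one of the parser APIs discussed in this book, although the DOM property names are the standard ones.

```python
from xml.dom import minidom

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# The start tag of the example document, written as an empty element.
doc = minidom.parseString(
    '<book xmlns="http://sauria.com/schemas/apache-xml-book/book" '
    f'xmlns:xsi="{XSI}" '
    'xsi:schemaLocation="http://sauria.com/schemas/apache-xml-book/book '
    'http://www.sauria.com/schemas/apache-xml-book/book.xsd" '
    'version="1.0"/>'
)

# Look the attribute up by namespace URI and local part, not by prefix.
attr = doc.documentElement.getAttributeNodeNS(XSI, "schemaLocation")

# QName, prefix, local part, and the namespace URI the prefix is bound to.
parts = (attr.name, attr.prefix, attr.localName, attr.namespaceURI)
print(parts)
```

The lookup goes through the namespace URI and the local part, so it would keep working if the document bound a different prefix to the same URI.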

Default namespaces have a lot of gotchas. One tricky thing to remember is that a default namespace applies only to elements. An unprefixed attribute is never in the default namespace, so any attribute that is supposed to be in that namespace must carry a prefix bound to the same URI. Another tricky thing about default namespaces is that you have to define one explicitly. There is no way to get one "automatically". If you don’t define a default namespace and then write an unprefixed element, that element is in no namespace at all; an unprefixed attribute is in no namespace either way.
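The following sketch shows the attribute gotcha in practice. It assumes Python's standard xml.etree.ElementTree module, which reports namespaced names in the form {uri}local; the b prefix and the edition attribute are hypothetical additions to the example document.

```python
import xml.etree.ElementTree as ET

BOOK = "http://sauria.com/schemas/apache-xml-book/book"

# "b" is a second binding for the same URI; "edition" is a made-up attribute.
root = ET.fromstring(
    f'<book xmlns="{BOOK}" xmlns:b="{BOOK}" version="1.0" b:edition="1">'
    "<title>Professional XML Development with Apache Tools</title>"
    "</book>"
)

element_name = root.tag               # picked up the default namespace
child_name = root[0].tag              # so did the unprefixed child element
attribute_names = sorted(root.attrib)  # only the prefixed attribute has a URI

# With no default namespace declared, unprefixed names are in no namespace.
plain = ET.fromstring("<book><title/></book>")

print(element_name)
print(child_name)
print(attribute_names)
print(plain.tag)
```

The unprefixed version attribute comes back as plain version, while b:edition carries the book namespace URI.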

Namespace prefixes can be declared on any element in a document, not just the root element. The same is true of the default namespace. If you redeclare a prefix that has already been declared on an ancestor element, the new declaration applies to the element where it appears and to all of its descendants; outside that element, the ancestor’s declaration is still in effect.
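Scoping is easier to see with a small document that redeclares both the default namespace and a prefix partway down the tree. This is a sketch using Python's standard xml.etree.ElementTree module; the urn:example:... URIs and the element names are invented for the illustration.

```python
import xml.etree.ElementTree as ET

doc = """
<library xmlns="urn:example:books" xmlns:m="urn:example:meta">
  <book>
    <m:note>outer binding of m</m:note>
    <cd xmlns="urn:example:music" xmlns:m="urn:example:other">
      <title/>
      <m:note>inner binding of m</m:note>
    </cd>
    <title/>
  </book>
</library>
"""

root = ET.fromstring(doc)

# Walk the tree in document order and record each expanded name.
tags = [element.tag for element in root.iter()]
for tag in tags:
    print(tag)
```

The title inside cd lands in the music namespace, while the title that follows the closing cd tag is back in the books namespace. The same happens to the two m:note elements.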

You may declare multiple prefixes for the same namespace URI. Doing so is perfectly allowable; however, remember that namespace equality is based on the namespace URI, not the namespace prefix. Thus elements that look like they should be in the same namespace can actually be in different namespaces, and elements with different prefixes can be in the same one. It all depends on which URI the namespace prefixes have been bound to. Also note that certain prefixes are conventionally used for particular namespaces, such as the xsi prefix used in this example. Here are some of the more common prefixes:
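A short sketch of URI-based equality, again assuming Python's standard xml.etree.ElementTree module and made-up urn:example:... URIs:

```python
import xml.etree.ElementTree as ET

doc = """
<root xmlns:a="urn:example:books" xmlns:b="urn:example:books">
  <a:title/>
  <b:title/>
  <c:title xmlns:c="urn:example:music"/>
</root>
"""

# Unpack the three child elements in document order.
first, second, third = ET.fromstring(doc)

same_despite_prefix = first.tag == second.tag    # a: and b: share one URI
different_despite_name = first.tag != third.tag  # c: is bound to another URI

print(first.tag, second.tag, third.tag)
```

Code that compares prefixes instead of URIs would get both of these comparisons wrong.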

Namespace Prefix	Namespace URI	Usage
xsi	http://www.w3.org/2001/XMLSchema-instance	XML Schema Instance
xsd	http://www.w3.org/2001/XMLSchema	XML Schema
xsl	http://www.w3.org/1999/XSL/Transform	XSLT
fo	http://www.w3.org/1999/XSL/Format	XSL Formatting Objects
xlink	http://www.w3.org/1999/xlink	XLink
svg	http://www.w3.org/2000/svg	Scalable Vector Graphics
ds	http://www.w3.org/2000/09/xmldsig#	XML Signature
xenc	http://www.w3.org/2001/04/xmlenc#	XML Encryption

Validity

The second rule for XML documents is validity. It’s a little odd to say "rule" because XML documents don’t have to be valid, but there are well-defined rules that say what it means for a document to be valid. Validity is the next step up from well-formedness. Validity lets you say things like this: Every book element must have a title element followed by an author element, followed by an isbn element, and so on. Validity says that the document is valid according to the rules of some grammar. (Remember diagramming sentences in high-school English? It’s the same kind of thing we’re talking about here for valid XML documents.)

Because a document can only be valid according to the rules of a grammar, you need a way to describe the grammar the XML document must follow. At the moment, there are three major possibilities: DTDs, the W3C’s XML Schema, and OASIS’s Relax-NG.

DTDs

The XML 1.0 specification describes a grammar using a document type definition (DTD). The language for writing a DTD is taken from SGML and doesn’t look anything like XML. DTDs aren’t namespace-aware and don’t allow you to say anything about the data between a start and end tag. Suppose you have an element that looks like this:

<quantity>5</quantity>

Perhaps you’d like to be able to say that the content of a <quantity> element is a non-negative integer. Unfortunately, you can’t say this using DTDs.

XML Schema

Shortly after XML was released, the W3C started a Working Group to define a new language for describing XML grammars. Among the goals for this new schema language were the following:

Describe the grammar/schema in XML.
Support the use of XML Namespaces.
Allow rich datatypes to constrain element and attribute content.

The result of the working group’s effort is known as XML Schema. The XML Schema specification is broken into two parts:

XML Schema Part 1: Structures describes XML Schema’s facilities for specifying the rules of a grammar for an XML document. It also describes the rules for using XML Schema in conjunction with namespaces.
XML Schema Part 2: Datatypes covers XML Schema’s rich set of datatypes that enable you to specify the types of data contained in elements and attributes. There are a lot of details to be taken care of, which has made the specification very large.

If you’re unfamiliar with XML Schema, XML Schema Part 0: Primer is a good introduction.

Relax-NG

The third option for specifying the grammar for an XML document is Relax-NG. It was designed to fulfill essentially the same three goals that were used for XML Schema. The difference is that the resulting specification is much simpler. Relax-NG is the result of a merger between James Clark’s TREX and MURATA Makoto’s Relax. Unfortunately, there hasn’t been much industry support for Relax-NG, due to the W3C’s endorsement of XML Schema. Andy Clark’s Neko XML tools provide basic support for Relax-NG that can be used with Xerces. We’ll cover the Neko tools a bit later in the chapter.

Validity Example

Let’s go back to the example XML file. We’ve chosen to specify the grammar for the book.xml document using XML Schema. The xsi:schemaLocation attribute in lines 4-6 works together with the default namespace declaration in line 2 to tell the XML parser that the schema document for the namespace http://sauria.com/schemas/apache-xml-book/book is located at http://www.sauria.com/schemas/apache-xml-book/book.xsd. The attribute value is a whitespace-separated pair: the namespace URI first, then the location of the schema for that namespace. The schema is attached to the namespace, not the document. There’s a separate mechanism for associating a schema with a document that has no namespace (xsi:noNamespaceSchemaLocation). For completeness, here’s the XML Schema document that describes book.xml.

  <?xml version="1.0" encoding="UTF-8"?>
  <xs:schema
    targetNamespace="http://sauria.com/schemas/apache-xml-book/book"
    xmlns:book="http://sauria.com/schemas/apache-xml-book/book"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    elementFormDefault="qualified">
    <xs:element name="address" type="xs:string"/>
    <xs:element name="author" type="xs:string"/>
    <xs:element name="book">
      <xs:complexType>
        <xs:sequence>
          <xs:element ref="book:title"/>
          <xs:element ref="book:author"/>
          <xs:element ref="book:isbn"/>
          <xs:element ref="book:month"/>
          <xs:element ref="book:year"/>
          <xs:element ref="book:publisher"/>
          <xs:element ref="book:address"/>
        </xs:sequence>
        <xs:attribute name="version" type="xs:string" use="required"/>
      </xs:complexType>
    </xs:element>
    <xs:element name="isbn" type="xs:string"/>
    <xs:element name="month" type="xs:string"/>
    <xs:element name="publisher" type="xs:string"/>
    <xs:element name="title" type="xs:string"/>
    <xs:element name="year" type="xs:short"/>
  </xs:schema>
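To get a feel for what a validating parser checks against this schema, here is a hand-rolled sketch of three of its rules: the required version attribute, the fixed sequence of child elements, and the xs:short range for year. It is not schema validation and only approximates those rules; a real validator would read the schema itself. The check_book and make_book helpers are hypothetical, and Python's standard library is assumed only because it needs no setup.

```python
import xml.etree.ElementTree as ET

NS = "http://sauria.com/schemas/apache-xml-book/book"
EXPECTED = ["title", "author", "isbn", "month", "year", "publisher", "address"]


def check_book(xml_text):
    """Return a list of problems; an empty list means these checks passed."""
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{NS}}}book":
        return [f"unexpected root element {root.tag}"]
    problems = []
    if "version" not in root.attrib:
        problems.append("missing required attribute 'version'")
    # xs:sequence: the children must appear exactly in this order.
    if [child.tag for child in root] != [f"{{{NS}}}{name}" for name in EXPECTED]:
        problems.append("children do not match the required sequence")
    year = root.find(f"{{{NS}}}year")
    if year is not None:
        try:
            value = int((year.text or "").strip())
        except ValueError:
            problems.append("year is not an integer")
        else:
            # xs:short is a 16-bit signed integer.
            if not -32768 <= value <= 32767:
                problems.append("year is outside the xs:short range")
    return problems


def make_book(fields):
    """Build a book document from (element name, text) pairs."""
    body = "".join(f"<{name}>{text}</{name}>" for name, text in fields)
    return f'<book xmlns="{NS}" version="1.0">{body}</book>'


fields = [
    ("title", "Professional XML Development with Apache Tools"),
    ("author", "Theodore W. Leung"),
    ("isbn", "0-7645-4355-5"),
    ("month", "December"),
    ("year", "2003"),
    ("publisher", "Wrox"),
    ("address", "Indianapolis, Indiana"),
]

valid = check_book(make_book(fields))
reordered = check_book(make_book([fields[1], fields[0]] + fields[2:]))
big_year = check_book(make_book(fields[:4] + [("year", "40000")] + fields[5:]))

print(valid, reordered, big_year)
```

All three sample documents are well-formed; only the first would also be valid. That is the distinction the Validity section draws.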

Entities

The example document is a single file; in XML terminology, it’s a single entity. Entities correspond to units of storage for XML documents or portions of XML documents, like the DTD. Not only is an XML document a tree of elements, it can be a tree of entities as well. It’s important to keep this in mind because entity expansion and retrieval of remote entities can be the source of unexpected performance problems. Network fetches of DTDs or a common library of entity definitions can cause intermittent performance problems. Using entities to represent large blocks of data can lead to documents that look reasonable in size but that blow up when the entities are expanded. Keep these issues in mind if you’re going to use entities in your documents.
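The size blow-up is easy to reproduce on a small scale. The sketch below assumes Python's standard xml.etree.ElementTree module, which expands entities declared in the internal DTD subset; the entity names a, b, and c and the data element are made up. Each entity refers ten times to the one before it, so the expanded text grows tenfold per level.

```python
import xml.etree.ElementTree as ET

# Three levels of entities: a is 10 characters, b is 10 x a, c is 10 x b.
doc = (
    '<?xml version="1.0"?>\n'
    "<!DOCTYPE data [\n"
    '  <!ENTITY a "0123456789">\n'
    f'  <!ENTITY b "{"&a;" * 10}">\n'
    f'  <!ENTITY c "{"&b;" * 10}">\n'
    "]>\n"
    "<data>&c;</data>"
)

root = ET.fromstring(doc)
expanded = root.text  # the parser has already replaced &c; with its full text

print(f"source document: {len(doc)} characters")
print(f"expanded content: {len(expanded)} characters")
```

A few more levels of nesting would turn a document of similar size into megabytes of text, which is why entity expansion deserves attention when documents come from sources you don't control.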