XML Syntax | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

This is not an introductory book about XML. I certainly expect that you already have some experience with XML documents. Nonetheless, when writing programs to process XML it's very important that you are crystal clear about the exact terminology used when discussing XML. Therefore I'd like to take a few pages to briefly review the proper terminology for discussing XML, as well as to clarify a few points that are often confused or misunderstood.

XML Documents

The precise meaning of "XML document" is defined by the XML 1.0 specification [http://www.w3.org/TR/REC-xml] published by the W3C. This specification provides a detailed Backus-Naur Form (BNF) grammar defining exactly what is and is not an XML document. Anything that satisfies the document production in that BNF grammar and adheres to the 15 well-formedness constraints is an XML document. ^[2] Anything that does not is not an XML document.

^[2] The well-formedness constraints specify requirements that are difficult or impossible to express in BNF form; for example, that "The Name in an element's end-tag must match the element type in the start-tag."

Well-formedness is the minimum requirement for an XML document. A document that is not well-formed is not an XML document. Parsers cannot read it. A parser is not allowed to fix a malformed document. It cannot take a best guess at what the document author intended. When a parser encounters a malformed document, it stops parsing and reports the error. It will not read any further into the document. ^[3] Depending on the API through which you're accessing the parser, you may or may not have already received some information from the parts of the document before the error. However, under no circumstances will the parser give you any data following the first well-formedness error in the document.
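To make the stop-at-first-error behavior concrete, here is a minimal sketch that assumes the SAX parser bundled with the JDK. The class name FirstErrorDemo and its fields are hypothetical, chosen only for illustration. It records the element names reported before the parser gives up and keeps the parser's error message.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: shows what a SAX client sees before a well-formedness error.
final class FirstErrorDemo {
    /** Names of the elements whose start-tags were reported before parsing stopped. */
    final List<String> seen = new ArrayList<>();
    /** Null if the document was well-formed; otherwise the parser's complaint. */
    String error;

    static FirstErrorDemo parse(String xml) {
        FirstErrorDemo result = new FirstErrorDemo();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                result.seen.add(qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXParseException e) {
            // DefaultHandler rethrows fatal errors, so malformed input ends up here.
            result.error = "line " + e.getLineNumber() + ": " + e.getMessage();
        } catch (Exception e) {
            throw new IllegalStateException("unexpected parser failure", e);
        }
        return result;
    }
}
```

Calling FirstErrorDemo.parse on a document whose last end-tag is misspelled should leave a non-null error while seen still holds the elements that came before the mistake. A tree-based API would more likely hand you nothing at all.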

^[3] A few parsers continue reading so they can report further errors after the first one. However, they report errors only, not content.

The detailed rules an XML document must follow aren't so important here because the parser will check them for you. Very roughly, an XML document must have a single root element. All start-tags must be matched by end-tags. All attribute values must be quoted. And only the Unicode characters that are legal in XML may be used in the document. (Almost all Unicode characters are legal in XML documents. The only ones really ruled out are the C0 controls such as null, bell, and form feed.)

There's another way to look at XML documents besides simply as a sequence of characters that adheres to certain rules, and it's one that sometimes makes sense, especially when writing programs that process XML documents. An XML document is a tree. It has a root node that contains various child nodes. Some of these child nodes have children of their own. Others are leaf nodes that have no children.

Note

Occasionally developers ask how they can parse a document that is almost, but not quite, a well-formed XML document. For example, it may end with a form feed inserted by some Unix text editor to separate documents. Or it may be part of an infinite stream of elements, the last of which is never seen, so there's no end-tag for the root element. Imagine, for example, weather observations or stock quotes being pushed across the Internet as XML elements.

The short answer is that you can't parse these things because they are not XML documents, even if they use a lot of tags and attributes and other XML-like markup. The long answer is that you may be able to write a non-XML-aware program to preprocess the streams, fix up any well-formedness mistakes, and only then pass the fixed documents to the XML parser. However, the XML parser must receive a complete well-formed document. It cannot work with anything less.

There are roughly five different kinds of nodes in an XML tree:

Root

Also known as the document node, this is the abstract node that contains the entire XML document. Its children include comments, processing instructions, and the root element of the document.

Element

An XML element with a name, a set of attributes, a set of in-scope namespaces, and a list of children.

Text

The parsed character data between two tags (or any other kind of nontext node).

Comment

An XML comment such as <!-- Please make sure this order goes out ASAP! -->. The contents of the comment are its data. A comment does not have any children.

Processing Instruction

A processing instruction such as <?xml-stylesheet type="text/css" href="order.css"?>. A processing instruction has a target and a value. It does not have any children.

Depending on the context, some details of this tree structure can be interpreted differently. For example, some tree models consider parsed entities or CDATA sections to be additional kinds of nodes. Others simply merge them into the tree structure as elements and text nodes. Some models allow one text node to follow another. Others require each text node to be the maximum contiguous run of text not interrupted by some other kind of node. Some models include the document type declaration and/or the XML declaration as a node. Others ignore them. Probably the most hotly debated point is how to handle attributes and namespaces. I chose not to consider them as nodes in the tree in their own right, treating them instead as properties of elements. Generally even tree models such as XPath that do treat them as separate nodes still don't make them children of the element to which they belong. For now the details aren't too important. The broad outline is the same for most of the tree models.
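One way to see the tree is to walk it. The following sketch assumes the DOM parser bundled with the JDK, so it reflects DOM's particular choices, not those of XPath or any other model. The class name TreeOutline is hypothetical. It labels each node with one of the five kinds listed earlier and lumps everything else (CDATA sections, document type declarations, and so on) under "Other".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: produces an indented outline of a document's node tree.
final class TreeOutline {
    static List<String> outline(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            List<String> lines = new ArrayList<>();
            walk(root, 0, lines);
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    private static void walk(Node node, int depth, List<String> lines) {
        String label;
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: label = "Root"; break;
            case Node.ELEMENT_NODE: label = "Element " + node.getNodeName(); break;
            case Node.TEXT_NODE: label = "Text \"" + node.getNodeValue() + "\""; break;
            case Node.COMMENT_NODE: label = "Comment"; break;
            case Node.PROCESSING_INSTRUCTION_NODE:
                label = "ProcessingInstruction " + node.getNodeName(); break;
            default: label = "Other";
        }
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) line.append("  ");
        lines.add(line.append(label).toString());
        // Attributes are not children in DOM, so they never show up in this walk.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, depth + 1, lines);
        }
    }
}
```

Note that the attributes of an element never appear in the outline. DOM, like the model used in this chapter, treats them as properties of the element and not as its children.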

Caution

There is some argument about whether it makes sense to talk about an XML document as having any existence independent of the text that makes up the document. After all, the XML 1.0 specification defines concepts such as document and element only in terms of text strings. Later W3C specifications, such as the XML Information Set [http://www.w3.org/TR/xml-infoset/] (Infoset) and the Document Object Model (DOM) do suggest a more abstract understanding of the components of an XML document. However, these specifications are much more controversial than XML 1.0 itself, and not as broadly implemented or accepted. For the purpose of writing programs that process XML, I do find it useful to consider XML documents more abstractly, and I will do so in this book. However, even here there is a split depending on which API you choose. DOM is a very abstract model of XML documents that defines classes representing elements, attributes, comments, and more. The Simple API for XML (SAX), on the other hand, defines almost no such classes. It presents the content of an XML document almost exclusively as strings and arrays of characters.

XML Applications

An XML application is a specific XML vocabulary that contains particular elements and attributes. It is not a software program that somehow uses XML, such as the EditML Pro XML editor or the Mozilla web browser. XML applications limit the very flexible rules of XML to a finite set of elements of certain types. For example, DocBook is an XML application designed for producing technical manuscripts such as this book. Elements it defines include book, chapter, para, sect1, sect2, programlisting, and several hundred others. When writing a DocBook document, you have to use these elements, and you have to use them in certain ways. For example, a sect2 element can be a child of a sect1 but not a child of a sect3 or a chapter. Scalable Vector Graphics (SVG) is an XML application for line art. Elements it defines include line, circle, ellipse, polygon, polyline, and so forth. All SVG documents are XML documents, but not all XML documents are SVG documents.

An XML application can have a schema that defines what is a legal document for that application and what is not. Schemas can be written in a variety of languages, including DTDs, the W3C XML Schema Language, RELAX NG, Schematron, and numerous others. Depending on the power of the schema language used, it may be necessary to specify additional rules for the application in less formal prose. For example, the XHTML 1.1 specification includes the requirement that "There must be a DOCTYPE declaration in the document prior to the root element. If present, the public identifier included in the DOCTYPE declaration must reference the DTD found in Appendix C using its Formal Public Identifier." None of the common schema languages allow you to require anything about the DOCTYPE declaration.
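As a rough illustration of a schema constraining a vocabulary, the sketch below validates documents against a three-line toy DTD. The DTD is an assumption made up for this example and is far simpler than the real DocBook DTD; it merely borrows the chapter, sect1, and sect2 names. The class name ToyValidation is likewise hypothetical, and the parser is the one bundled with the JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper: checks a document body against a made-up toy DTD.
final class ToyValidation {
    // Invented for illustration; NOT the DocBook DTD.
    static final String DTD =
        "<!DOCTYPE chapter ["
        + "<!ELEMENT chapter (sect1+)>"
        + "<!ELEMENT sect1 (sect2*)>"
        + "<!ELEMENT sect2 (#PCDATA)>"
        + "]>";

    static boolean isValid(String body) {
        List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                // Validity errors are recoverable, so the parser reports them here.
                @Override public void error(SAXParseException e) { problems.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
        } catch (SAXException e) {
            problems.add(e.getMessage()); // not even well-formed
        } catch (Exception e) {
            throw new IllegalStateException("unexpected parser failure", e);
        }
        return problems.isEmpty();
    }
}
```

With this toy DTD, a sect2 nested inside a sect1 passes, whereas a sect2 placed directly inside chapter is well-formed but invalid. The second document is still an XML document; it is just not an instance of this particular application.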

An instance document is an instance of an XML application, whether formally defined or not. That is, it is an XML document with a root element and whatever other content it possesses that satisfies all the rules of some XML application. There are many possible instance documents for any one XML application, just as many programs can be written in any one programming language.

Elements and Tags

The fundamental unit of XML is the element. You can write good XML documents without using any other XML construct. If for some reason you have a grudge against comments, processing instructions, attributes, or namespaces, you can pretend they don't exist and still write well-formed XML documents. However, you must use elements. Every XML document has at least one element, and you cannot write XML documents without using them.

Logically every element has four key pieces:

A name
The attributes of the element
The namespaces in scope on the element
The content of the element

In addition, once schemas become more prevalent and parsers and APIs are revised to support them, it may also make sense to talk about the element's type. For now, though, there's not a lot of practical help to be gained by considering the type. Furthermore, DOM and XPath have mutually incompatible concepts of the value of an element. However, in both cases, the value is derived purely from the element content, so it's not really a separate thing.

Syntactically in the text form of an XML document, elements are delimited by tags. Start-tags begin with a < immediately followed by the element name. End-tags begin with a </ immediately followed by the element name. Both start- and end-tags terminate with >. Everything in between the two tags is the content of the element. For example, following is a Quantity element with the content "12":

 <Quantity>12</Quantity>

Tags and elements are closely related, but they are not the same thing. Be wary of books that confuse them. An element is the whole sandwich, including bread, meat, cheese, pickles, and mayonnaise, whereas the tags are just the bread. An element is composed of a start-tag, followed by content, followed by an end-tag.

It is possible that an element may have no content. In this case it is called an empty element. For example, following is an empty Quantity element:

 <Quantity></Quantity>

The start-tag butts right up against the end-tag; there is not even a single space character between them. By contrast, this next element is not empty because it does contain some white space, though nothing else:

 <Quantity> </Quantity>

Besides start-tags and end-tags, there is one other kind of tag, the empty-element tag. An empty-element tag begins with a < followed by an element name, as does a start-tag. However, it ends with a />. For example, following is an empty Quantity tag:

 <Quantity/>

This tag both starts and ends a Quantity element. The content of this element is nothing, just like the content of <Quantity></Quantity>. Indeed <Quantity/> is just syntax sugar for <Quantity></Quantity>. They mean exactly the same thing. No application should treat these two constructs as different in any way. Indeed, most XML parsers and APIs won't even tell you which form the element took in the source document. In both cases, what's reported is an empty element with the name "Quantity." How that element was represented is not important.
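A quick way to convince yourself of this is to count child nodes. This is a minimal sketch assuming the DOM parser bundled with the JDK; the class name EmptyElements is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: counts the child nodes of a document's root element.
final class EmptyElements {
    static int childCount(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }
}
```

Both <Quantity/> and <Quantity></Quantity> yield a count of zero, and nothing in the resulting tree records which form was used. The variant containing a single space yields one child, a text node holding that space.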

As well as text, an element can also contain one or more child elements. These elements are completely contained between the element's start-tag and end-tag, and are not contained in any other element also contained in the parent element. For example, this ShipTo element has four child elements: Street, City, State, and Zip.

 <ShipTo>
   <Street>135 Airline Highway</Street>
   <City>Narragansett</City> <State>RI</State> <Zip>02882</Zip>
 </ShipTo>

In addition to the four child elements, this ShipTo element contains some white space; for example, the single space character between </City> and <State> . These spaces form text nodes that are also counted among the element's children. Text nodes like these that are composed of nothing but white space are sometimes called ignorable white space. This is an unfortunate turn of phrase, because sometimes you can ignore these nodes, but most of the time you can't. The more proper term is white space in element content. ^[4]

^[4] Technically, whether or not white-space-only nodes are considered to be white space in element content depends on the content specification for the element given by the DTD. A white-space-only text node is only white space in element content when the content specification for the parent element in the DTD indicates that the parent element can contain child elements only but not mixed content. Because Example 1.2 doesn't have a DTD, this can't possibly be white space in element content.
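The white-space-only text nodes are easy to overlook until code starts iterating over children. The sketch below assumes the JDK's DOM parser in its default configuration, with no DTD involved. The class name ShipToChildren and the constant SHIP_TO are hypothetical, and the line breaks inside SHIP_TO are my assumption about how the ShipTo example is laid out.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: classifies the children of a document's root element.
final class ShipToChildren {
    static final String SHIP_TO =
        "<ShipTo>\n"
        + "  <Street>135 Airline Highway</Street>\n"
        + "  <City>Narragansett</City> <State>RI</State> <Zip>02882</Zip>\n"
        + "</ShipTo>";

    /** Returns { child elements, white-space-only text nodes, all other children }. */
    static int[] countChildren(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            int elements = 0, whiteSpace = 0, other = 0;
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    elements++;
                } else if (child.getNodeType() == Node.TEXT_NODE
                        && child.getNodeValue().trim().isEmpty()) {
                    whiteSpace++;
                } else {
                    other++;
                }
            }
            return new int[] { elements, whiteSpace, other };
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }
}
```

Code that assumes getChildNodes() returns only the four child elements will trip over the text nodes between them, so filtering by node type, as done here, is usually the safer habit.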

All of the elements contained in an element are called the element's descendants, and only the highest level are the children. The descendants include not only the children, but also the children of the children, the children of the children's children, and so forth. If you look at Example 1.2 again, you'll see that the Order element has 15 descendant elements.

An element can also have mixed content, which is when an element contains both child elements and text nodes containing non-white-space characters. For example, the following variant ShipTo element has the child elements you saw before as well as text nodes containing the strings "Chez Fred" and "Apt. 17D":

 <ShipTo>
   Chez Fred
   <Street>135 Airline Highway</Street>
   Apt. 17D
   <City>Narragansett</City> <State>RI</State> <Zip>02882</Zip>
 </ShipTo>

Mixed content is very useful, indeed almost essential, for XML applications that contain narratives, such as books and stories. Such applications include XHTML, DocBook, TEI, and XSL Formatting Objects. Mixed content is much less useful and much more cumbersome for data-oriented applications. XML documents that are intended for computers to read, as opposed to those intended for humans to read, should use mixed content sparingly, if at all.

Text

XML documents are text. Each XML document is a sequence of characters taken from the Unicode character set. ^[5] However, you can write XML documents in any character set that your XML parser knows how to convert to Unicode, provided that it is properly specified in the document's encoding declaration in the XML declaration.

^[5] Unicode is a character set with room for more than 1 million different characters, although currently (Unicode 3.2) a few less than 100,000 are defined. Scripts covered by Unicode include Latin, Cyrillic, Greek, Hebrew, Arabic, Devanagari, the Han ideographs, and many more.

Caution

Many developers have decided that they can make XML more efficient by defining a binary version. This tends to result from some vague notion that binary formats are inherently smaller or faster than text formats. These developers rarely have any evidence to back up this claim, which is not surprising because it isn't true. XML documents are routinely smaller and faster to read than the equivalent binary files in standard applications such as Oracle, Microsoft Word, Microsoft Excel, and so forth. The fact is modern binary file formats are quite bloated, but disks have gotten so large that almost no one's noticed or cared. Nonetheless, there seems to be a large pool of programmers who mistakenly believe the following:

File size matters.
They can compress better than gzip.
Human legible/human editable data doesn't matter.

All three beliefs have been empirically proven false time and time again. Nonetheless, about once a month some developer somewhere announces that he or she has come up with yet another special-purpose binary compression format for XML. These have proven completely pointless in practice. There is no actual benefit to such a format, and no one needs one. Worse yet, these formats substantially eliminate many of the existing benefits of XML.

Caution

Contrary to what you may have heard, Unicode is not a two-byte character set and really never has been. Because there are more than 1 million different spaces for characters in Unicode, an arbitrary Unicode character cannot be represented by a single two-byte unsigned integer such as Java's char data type. Prior to Unicode 3.1, all defined Unicode characters had code points of less than 65,536, which fooled some developers into thinking they could get away with using two-byte characters. However, it has long been known that more than 65,536 characters are actually used on Earth today and that Unicode would have to assign characters outside the Basic Multilingual Plane (the first 65,536 code points) to accommodate them.

Although characters were not actually assigned code points greater than 65,535 until Unicode 3.1, the space for them was reserved. XML was designed by forward-thinkers who saw the problems ahead and prepared for the eventual expansion of Unicode. Consequently, XML documents can use the full range of all million-plus characters available in Unicode. Java's designers weren't as prescient, however, and restricted the char data type to two bytes. Consequently, Java programmers need to go through some pretty nasty gyrations to handle Unicode documents (including XML documents) adequately.
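The mismatch between Java chars and Unicode characters is easy to demonstrate. This small sketch relies on the code point methods that java.lang.String and java.lang.Character have offered since Java 5, which is an assumption about the Java version you are running. The class name BeyondTheBmp is hypothetical.

```java
// Hypothetical helper: contrasts Java chars with Unicode characters.
final class BeyondTheBmp {
    /** U+1D11E MUSICAL SYMBOL G CLEF, a character outside the Basic Multilingual Plane. */
    static final String G_CLEF = new String(Character.toChars(0x1D11E));

    /** Number of two-byte Java chars (UTF-16 code units) in the string. */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of actual Unicode characters (code points) in the string. */
    static int characterCount(String s) {
        return s.codePointCount(0, s.length());
    }
}
```

For G_CLEF, charCount returns 2 while characterCount returns 1: the single character is stored as a surrogate pair. Code that indexes or truncates strings by char can therefore split a character in half.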

With very few exceptions, you can use any character defined in Unicode in the text content of an element or the value of an attribute. In brief, the exceptions are as follows:

C0 Controls

C0 controls are the nonprinting characters such as null and formfeed, between code points 0 and 31 (decimal). The carriage return, linefeed, and horizontal tab are allowed.

Surrogate blocks

Surrogate blocks consist of two sets of 1,024 code points each, which are used to extend Unicode beyond the Basic Multilingual Plane by allowing some characters to be represented as two surrogate characters. You can include surrogate pairs in an XML document in an encoding such as UTF-16 that uses surrogate pairs. You just can't treat an individual half of a surrogate pair as a character by itself.

Byte order mark

Also known as the zero-width nonbreaking space, the byte order mark can be used at the beginning of a document to indicate the encoding and endianness of the document, but it cannot be used elsewhere in the document.

All other characters are fair game, including some you probably shouldn't be using anyway, such as characters in the private use area and compatibility characters Unicode offers purely for interoperability with existing character sets.
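If you generate XML from arbitrary strings, it can be worth checking characters before writing them out. The sketch below encodes the Char production of the XML 1.0 specification as I understand it. It covers the C0 control and surrogate exclusions only; it does not attempt to police where a byte order mark appears. The class name XmlChars and its method names are hypothetical.

```java
// Hypothetical helper: tests code points against the XML 1.0 Char production.
final class XmlChars {
    /** True if the code point may appear in XML 1.0 text. */
    static boolean isXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** Index of the first illegal character in the text, or -1 if there is none. */
    static int firstIllegal(String text) {
        int i = 0;
        while (i < text.length()) {
            // An unpaired surrogate comes back as its own value, which isXmlChar rejects.
            int codePoint = text.codePointAt(i);
            if (!isXmlChar(codePoint)) {
                return i;
            }
            i += Character.charCount(codePoint);
        }
        return -1;
    }
}
```

A properly paired surrogate is read as one code point above 0xFFFF and passes; a lone half of a pair falls in the excluded 0xD800 to 0xDFFF range and is rejected.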

The rules for characters used in the names of things (elements, attributes, entities, etc.) are a little stricter. In brief, only letters, digits, and ideographs defined in Unicode 2.0 can be used. In addition the punctuation marks -, ., _, and : are also legal. Digits, the hyphen, and the period cannot be the first character in a name. Other punctuation marks as well as new characters first defined in Unicode 3.0 or later are not allowed anywhere in a name. These are essentially the same rules used for naming variables, methods, and classes in Java. The major difference is that XML allows the hyphen, and Java doesn't; and Java allows the dollar sign, and XML doesn't. Unlike Java, XML also allows the colon, but XML reserves this for use with namespaces. The colon should not be used as an arbitrary name character.

XML parsers faithfully preserve white space. A string containing only white space is not the same as a string containing nothing at all. A string with leading and trailing white space is not the same as the equivalent string with white space trimmed. Some specific XML applications may decide that white space is not significant in certain contexts. However, in general XML, all white space is significant and must be accounted for.

Attributes

Attributes are name-value pairs associated with elements. The name of an attribute may be any legal XML name. The value may be any string of text, even potentially including characters such as < and ". The document author needs to escape such characters as &lt; and &quot;. However, the parser will resolve these references before passing the data to your application. The attribute value is enclosed in either single or double quotes, and the name is separated from the value by an equals sign. For example, this Subtotal element has a currency attribute with the value USD:

 <Subtotal currency='USD'>393.85</Subtotal>

The quote marks are not part of the attribute value. Whether single or double quotes are used or whether there is extra white space around the equals sign is not important. Most parsers don't bother to report the difference. These two elements are equivalent to the previous one:

 <Subtotal currency="USD">393.85</Subtotal>
 <Subtotal currency = "USD">393.85</Subtotal>
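The following sketch, which assumes the JDK's DOM parser, reads an attribute value back out of each form. The class name AttributeValues is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: returns one attribute value from a document's root element.
final class AttributeValues {
    static String attribute(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // The quote marks and any escaping are already gone at this point.
            return doc.getDocumentElement().getAttribute(name);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }
}
```

All three spellings of the Subtotal element give the plain string USD. Likewise, a value written with escaped characters in the source arrives with the literal < and " characters in place, so client code never needs to unescape anything itself.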

Attributes are unordered. There is no difference between these two elements:

 <Tax rate="7.0" currency="USD">27.57</Tax>
 <Tax currency="USD" rate="7.0">27.57</Tax>

When a parser tells you which attributes are attached to an element, it may or may not provide them in the same order as in the input document. Some APIs report the attributes using an unordered data structure such as a hash table. Others use an array or a list, but even in these cases there's no guarantee that the order of the attributes in the list matches the order of the attributes in the start-tag.
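In practice this means looking attributes up by name and never by position. Here is a minimal sketch, assuming the JDK's DOM parser, that copies an element's attributes into a sorted map so two elements can be compared regardless of source order. The class name AttributeSets is hypothetical.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the root element's attributes into a name-keyed map.
final class AttributeSets {
    static Map<String, String> attributes(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NamedNodeMap source = doc.getDocumentElement().getAttributes();
            Map<String, String> result = new TreeMap<>();
            // item(i) order is unspecified, so the index is used only for iteration.
            for (int i = 0; i < source.getLength(); i++) {
                Node attribute = source.item(i);
                result.put(attribute.getNodeName(), attribute.getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }
}
```

The maps built from the two Tax elements compare as equal, which is the behavior a program should rely on.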

Perhaps most surprising, attribute values whose type is not CDATA are normalized. This means that all leading and trailing white space is stripped from the value, and runs of white space characters are compressed to a single space. This does not apply to any of the attributes in the examples so far because untyped attributes are not normalized. However, once you add a DTD, it is possible to declare that an attribute has type ID, IDREF, IDREFS, NMTOKEN, and several other types. Attributes of these types are always normalized before being passed to the client application.
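To see normalization happen, an attribute needs a declared type. The sketch below invents a tiny internal DTD subset for the Subtotal element (the note attribute and the declarations are assumptions made purely for illustration) and assumes the JDK's bundled parser, which, as far as I know, applies declarations from the internal subset even when validation is off. The class name NormalizedAttributes is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: compares a typed and an untyped attribute after parsing.
final class NormalizedAttributes {
    // Invented declarations: currency is an NMTOKEN, note is plain CDATA.
    static final String DOCUMENT =
        "<!DOCTYPE Subtotal ["
        + "<!ELEMENT Subtotal (#PCDATA)>"
        + "<!ATTLIST Subtotal currency NMTOKEN #IMPLIED note CDATA #IMPLIED>"
        + "]>"
        + "<Subtotal currency='  USD  ' note='  rush  order  '>393.85</Subtotal>";

    static String attribute(String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(DOCUMENT)));
            return doc.getDocumentElement().getAttribute(name);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }
}
```

Under these assumptions the NMTOKEN-typed currency value loses its surrounding spaces, while the CDATA-typed note value keeps every space the author wrote.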

Note

Tim Bray, one of the primary authors of XML 1.0, has admitted that normalization of attribute values was a mistake. In his words, "Why the $#%%!@! should attribute values be 'normalized' anyhow? This was a pure process failure: at no point during the 18-month development cycle of XML 1.0 did anyone stand up and say 'why are you doing this?' I'd bet big bucks that if someone had, the silly thing would have died a well-deserved death." ^[6]

^[6] Attribute normalization and character entities [http://www.lists.ic.ac.uk/hypermail/xml-dev/xml-dev-Jan-2000/1085.html], posted on the xml-dev mailing list, January 27, 2000.

XML Declaration

Most XML documents begin with an XML declaration. An XML declaration has a version attribute with the value 1.0 and may have optional standalone and encoding attributes. For example, this XML declaration states that the document is written in XML 1.0 in the ISO-8859-1 (Latin-1) character set and does not require the parser to read the external DTD subset:

 <?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>

The version attribute always has the value 1.0 . If XML 1.0 is ever revised, this may change to some other value. As I write this, there's a hotly debated proposal at the W3C for a new version of XML code named Blueberry, which would make XML marginally more compatible with Unicode 3.0 and later. It would also make it easier to edit with some brain-damaged IBM mainframe software that can't handle files where lines end in carriage returns, linefeeds, or both. If Blueberry is adopted (and I for one hope it isn't), it may lead to a new value for the version attribute. However, for now, version is effectively fixed with the value 1.0.

The encoding attribute identifies the character set and encoding in which the document is written. Whatever the encoding is, one of the jobs of the parser is to convert the document to Unicode before passing it to the client application. Most APIs don't offer any means of finding out what the original encoding was. You'll simply receive Unicode strings from which all traces of the original encoding have been removed.

The standalone attribute specifies whether the XML parser may have to read parts of the DTD that are outside the instance document to correctly parse the file. This is primarily a hint for the parser. Some parser APIs may tell you what the value was, but generally you don't need to worry about it. The parser either will or won't read external entities as necessary. By the time your code gets hold of the document, all of this will have already been taken care of. You need not concern yourself with it.
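The next sketch shows the parser doing the encoding conversion on your behalf. It assumes the JDK's DOM parser and the DOM Level 3 methods getXmlEncoding, getXmlVersion, and getXmlStandalone on org.w3c.dom.Document, which is one of the APIs that does happen to expose the declaration. The class name DeclarationInfo is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper: parses a document that is stored as ISO-8859-1 bytes.
final class DeclarationInfo {
    static Document parseLatin1(String xml) {
        try {
            // Simulate a Latin-1 file on disk: one byte per character.
            byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
            // The parser reads the encoding declaration and decodes the bytes itself.
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }
}
```

Given a document that starts with the declaration shown above and contains an accented letter, getTextContent() on the root element returns an ordinary Java string with the accent intact. The declared values themselves remain available through getXmlEncoding(), getXmlVersion(), and getXmlStandalone() if you ever need them, for instance to write the document back out as it came in.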

Comments

XML comments are almost identical to HTML comments. They begin with <!-- and end with -->. For example, here's a comment you might find in an order document:

 <!-- Please make sure this order goes out ASAP! -->

Everything between the <!-- and the --> should be ignored. In fact, most parsers and APIs do make the comments available to you if you want them, mostly so you can round trip documents (read them in and then write them back out again with everything still intact). However, beyond this use case, you really shouldn't pay much attention to comments in your programs. Some HTML systems abuse comments to support server-side includes or editor-specific extensions. Because XML is much more flexible than HTML, you can use elements, attributes, or, as a last resort, processing instructions for these use cases.

Processing Instructions

Processing instructions tell particular software how it should handle an XML document after the document has been parsed. Generally, processing instructions are used for meta-information that may apply to documents from many different domains and XML vocabularies. For example, the most common processing instruction, xml-stylesheet, tells a browser or other formatter where it can find the stylesheet it should apply to the document. This can be used with DocBook documents, XHTML documents, Human Resources Markup Language documents, or the custom XML application you invented last Tuesday to catalog your baseball card collection. For another example, the Apache XML Project's Cocoon application server reads cocoon-process processing instructions to figure out what processes to apply to a document before sending it to a user. This processing instruction tells Cocoon to replace the XInclude include elements with the contents of the documents they reference:

 <?cocoon-process type="xinclude"?>

The basic syntax of a processing instruction is <?, followed immediately by an XML name identifying the target of the processing instruction, followed by white space and any data at all, followed by ?>.

Unlike elements or attributes, processing instructions can be added to a document without considering whether or not the DTD or schema allows it. Most schema languages do not consider the presence, absence, or structure of processing instructions when determining validity. Furthermore, unlike elements, processing instructions can appear before, after, or inside the root element. They are frequently placed in the document prolog, although they can appear in the document body or after the root element as well.

Most of the time, the processing instruction is not associated with any one XML application. For example, an XML application might describe gene sequences, sixteenth-century Italian love poetry, financial records, or vector graphics. However, each of these might need to be loaded into a web browser, which would apply a stylesheet to it. Processing instructions can be inserted into a document to support this without changing or otherwise affecting the normal document structure. In essence, processing instructions provide an out-of-band channel for passing information to software other than the program that would normally read a document.

XML parsers report the target and contents of processing instructions to the client application. However, they provide no further support for interpreting the data in the processing instruction. For example, many processing instructions use a pseudo-attribute format like the following:

 <?xml-stylesheet type="text/xml" href="limited.xsl"?>

However, as far as the XML parser is concerned, the data in this processing instruction is merely a string that happens to contain some equals signs and quotation marks. These are not treated differently from any other character. ^[7] Both the syntax and the semantics of the data are completely up to the application reading the document. Processing instructions are specifically for information that is not related to XML.

^[7] JDOM and dom4j actually do provide special support for processing instructions written in this pseudo-attribute format. However, they both do a substantial amount of work in their own classes to support this interface, beyond what the parser provides.
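Because the parser hands over only a target and a data string, splitting pseudo-attributes is your job. The following is a deliberately naive sketch assuming the JDK's DOM parser. The class StylesheetInstruction, its methods, and the regular expression are all hypothetical; the pattern handles simple quoted values and makes no attempt to resolve character or entity references inside them.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical helper: pulls pseudo-attributes out of a processing instruction.
final class StylesheetInstruction {
    // name = "value" or name = 'value'; a simplification, not the official grammar.
    private static final Pattern PSEUDO_ATTRIBUTE =
        Pattern.compile("([\\w:.-]+)\\s*=\\s*(\"([^\"]*)\"|'([^']*)')");

    /** Pseudo-attributes of the first top-level instruction with the given target. */
    static Map<String, String> pseudoAttributes(String xml, String target) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n instanceof ProcessingInstruction
                        && ((ProcessingInstruction) n).getTarget().equals(target)) {
                    // getData() is just a string; the parser did not interpret it.
                    return split(((ProcessingInstruction) n).getData());
                }
            }
            return new LinkedHashMap<>();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    static Map<String, String> split(String data) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PSEUDO_ATTRIBUTE.matcher(data);
        while (m.find()) {
            result.put(m.group(1), m.group(3) != null ? m.group(3) : m.group(4));
        }
        return result;
    }
}
```

For the xml-stylesheet instruction shown above, the resulting map should contain type and href entries. A production-quality version would need to follow the rules of whichever specification defines the instruction in question.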

Entities

XML documents are not necessarily the same thing as XML files. A single XML document may be composed of several different files. Indeed, the pieces that make up an XML document may not be files at all, but may instead be records in a database, data sent out over the Internet by a web server in response to a CGI query, a small part of a much larger file, or something stranger still.

The individual storage units that make up any one XML document are called entities. Every XML document has at least one entity, the document entity. This is the storage unit, be it a file or something else, that holds the root element of the document. Every other entity in a document has a name. There are five such types of named entities, and they are classified according to three criteria:

Internal or External

The replacement text of an internal entity is defined as a string literal in the document's DTD. The replacement text of an external entity is read out of a different file located via a URL.

Parsed or Unparsed

A parsed entity contains XML. It is itself well formed, and may even be a complete XML document if it has a root element. (Some entities that are only intended to be used as parts of other documents do not have root elements.) You can think of a parsed entity as something that will be pasted right into the middle of an XML document, such that the resulting document would still be well formed.

An unparsed entity can contain anything at all, including binary data. Unparsed entities are not pasted (even metaphorically) into XML documents. Instead a URL to the entity's data is provided in an ENTITY declaration in the DTD. Then this entity is referenced in an attribute with the type ENTITY or ENTITIES in the document. An unparsed entity also has a notation that defines the type of data in the unparsed entity (for example, a GIF image or C source code). Like the URL, the notation is also specified in the DTD rather than in the instance document. In practice, unparsed entities and notations are not used much.

General or Parameter

A general entity is used within the instance document. A general entity reference begins with an & . A parameter entity is used within the DTD. A parameter entity reference begins with a % . Because this book focuses on processing instance documents, we'll consider general entities primarily.

Not all combinations are possible. In fact, there are exactly five kinds of named entities:

Internal Parsed General Entities

The familiar entity references such as &amp; and &copy; that are defined completely in the DTD are internal parsed general entities. For example, this declaration defines the copy entity as the text "Copyright":
 <!ENTITY copy "Copyright"> 
These entities are used in element content and attribute values.
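To see this expansion from the Java side, here is a minimal sketch that assumes the JAXP DOM parser bundled with the JDK and its default settings. The class name, the notice element, and "Example Corp." are invented for illustration. The parser replaces the &amp;copy; reference with its replacement text before the program sees the content.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class InternalEntityDemo {

    // Parses the given XML text and returns the text content of the root element.
    static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // The internal DTD subset declares the copy entity from the text above.
        String xml = "<!DOCTYPE notice [\n"
            + "  <!ENTITY copy \"Copyright\">\n"
            + "]>\n"
            + "<notice>&copy; 2002 Example Corp.</notice>";
        System.out.println(rootText(xml)); // prints "Copyright 2002 Example Corp."
    }
}
```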

External Parsed General Entities

External parsed general entities are just like internal parsed general entities except that their replacement text is read from a separate document rather than from the DTD. The document is identified by a relative or absolute URL. For example, this declaration defines the legal entity as the content read from the URL http://www.example.com/legal.xml:
 <!ENTITY legal SYSTEM "http://www.example.com/legal.xml"> 
The file from which such an entity is read is just like another XML document except that it has a text declaration instead of an XML declaration, may not have a document type declaration, and might not have a single root element.
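The following sketch shows one way to observe this from Java without touching the network. It assumes the JDK's JAXP DOM parser and installs a SAX EntityResolver that answers the request for the legal entity from an in-memory string. The class name and the replacement text are hypothetical; only the URL comes from the declaration above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ExternalEntityDemo {

    static final String LEGAL_URL = "http://www.example.com/legal.xml";

    // Hypothetical stand-in for the remote entity. It has two top-level
    // elements, which is allowed for an external parsed entity.
    static final String LEGAL_TEXT =
        "<p>All rights reserved.</p><p>No warranty.</p>";

    // Parses xml, answering requests for LEGAL_URL from LEGAL_TEXT.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver((publicId, systemId) ->
                LEGAL_URL.equals(systemId)
                    ? new InputSource(new StringReader(LEGAL_TEXT))
                    : null); // null asks the parser to use its default lookup
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE doc [\n"
            + "  <!ENTITY legal SYSTEM \"" + LEGAL_URL + "\">\n"
            + "]>\n"
            + "<doc>&legal;</doc>";
        Document doc = parse(xml);
        // Both p elements from the entity end up inside the doc element.
        System.out.println(doc.getElementsByTagName("p").getLength());
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Resolving the entity locally is a design choice for the example only; with no resolver installed the parser would try to fetch the URL itself.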

External Unparsed General Entities

External unparsed general entities refer to files containing non-XML data, often binary. They are declared similarly to external parsed entities, but they also have a notation. For example, these declarations define a notation named PNG for the type image/png, and an unparsed entity named logo at the URL http://www.example.com/logo.png that uses that notation:
 <!NOTATION PNG SYSTEM "image/png">
 <!ENTITY logo SYSTEM "http://www.example.com/logo.png" NDATA PNG>
Unparsed entities are referenced by attributes with type ENTITY or ENTITIES rather than by entity references. For example, such an attribute might be declared like this:
 <!ELEMENT figure EMPTY>
 <!ATTLIST figure source ENTITY #REQUIRED>
Instances of the figure element would look like this:
 <figure source="logo"/> 
The parser does not actually provide you with the contents of an unparsed entity. Instead it tells you the URI from which the data can be retrieved and the notation for that data. However, you have to use Java's networking and I/O classes to get the data at that URI.
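As a rough illustration of what the parser does report, the sketch below uses the SAX DTDHandler callback unparsedEntityDecl, which DefaultHandler already implements as a no-op. It assumes the JDK's bundled SAX parser; the class and method names are made up for this example. Only the entity name, its system identifier, and its notation name are recorded. No attempt is made to download the image.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class UnparsedEntityDemo {

    // Returns one "name systemId notation" string per unparsed entity declaration.
    static List<String> unparsedEntities(String xml) {
        List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void unparsedEntityDecl(String name, String publicId,
                                           String systemId, String notationName) {
                found.add(name + " " + systemId + " " + notationName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE figure [\n"
            + "  <!NOTATION PNG SYSTEM \"image/png\">\n"
            + "  <!ENTITY logo SYSTEM \"http://www.example.com/logo.png\" NDATA PNG>\n"
            + "  <!ELEMENT figure EMPTY>\n"
            + "  <!ATTLIST figure source ENTITY #REQUIRED>\n"
            + "]>\n"
            + "<figure source=\"logo\"/>";
        // The program would still have to open the URL itself to read the image.
        System.out.println(unparsedEntities(xml));
    }
}
```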

Internal Parsed Parameter Entities

Internal parsed parameter entities are used purely within the DTD. The replacement text is provided by a string literal in the DTD. References to these entities begin with a percent symbol. They're often used to parameterize content models and attribute types. For example, the DocBook DTD defines the intermod.redecl.module parameter entity as the word IGNORE:
 <!ENTITY % intermod.redecl.module "IGNORE"> 
Unlike a general entity reference, the %intermod.redecl.module; parameter entity reference can only be used in the DTD, not in the instance document. Because our focus is on instance documents, not DTDs, you won't find many of these in this book.

External Parsed Parameter Entities

External parsed parameter entities are used purely within the DTD. The replacement text is provided by a DTD fragment at a given URL. References to these entities begin with a percent symbol. They often connect the different parts of a modular DTD into one coherent whole. For example, the DocBook DTD defines the dbpool parameter entity using a PUBLIC ID that loads the DTD fragment at the relative URL dbpoolx.mod:
 <!ENTITY % dbpool PUBLIC  "-//OASIS//ELEMENTS DocBook XML Information Pool V4.1.2//EN" "dbpoolx.mod"> 

Again, because our focus is on instance documents and not DTDs, you won't see many of these in this book.

Namespaces

Namespaces are not part of XML 1.0. They were invented about a year after XML 1.0 was released to help sort out the rapidly expanding world of XML applications that all needed to be mixed together in the same documents. There are many good XML applications that don't use namespaces at all. For example, DocBook 4.2.0, the XML application in which this book was written, is completely free of namespaces, as are XML-RPC and RSS 0.9.1. However, even if you can write very useful XML applications without thinking about namespaces, you're going to encounter namespaces when you work with XML applications designed by other developers. Consequently it's important to have a solid understanding of them.

The key idea of namespaces is that each element is bound to a uniform resource identifier (URI; a URL in practice). If IBM only uses URIs in the ibm.com domain and Sun only uses URIs in the sun.com domain, then there won't be any confusion between Sun's Book element and IBM's Book element, even if they're used in the same document. Just look at the URIs to tell which is which.

Note

A URI identifies a resource, but it does not necessarily locate it. URIs include not only uniform resource locators (URLs) but also uniform resource names (URNs). For example, a URN for this book based on its ISBN is urn:isbn:0201771861; but this does not tell you where you can find a copy of the book. However, most developers agree that only absolute URLs should be used as namespace URIs, and most XML applications follow this suggestion.

The URIs are purely string identifiers. Even if the URI is a URL, the parser does not connect to the server and try to download the document found there. Indeed there may not be any such document. When plugged into web browsers, namespace URLs often produce 404 Not Found errors. You can use namespaces in standalone systems without any network connection at all. You don't even need access to DNS. For the same reason, two different URLs that point to the same page define two different namespaces. For example, the following URLs identify the same page but three different namespaces:

 http://ns.cafeconleche.org/Orders/
 http://ns.cafeconleche.org/Orders
 http://ns.cafeconleche.org/Orders/index.html

Because URIs contain many characters that are illegal in element names as well as being excessively long to type, short prefixes stand in for the URIs. The prefixes are separated from the local name by a colon. For example, instead of the URI http://www.w3.org/2001/XInclude you might use the prefix xinclude or xi. An include element in the http://www.w3.org/2001/XInclude namespace would then be written as xi:include. This element has the prefix xi, the local name include, the qualified name xi:include, and the namespace URI http://www.w3.org/2001/XInclude.

xmlns:prefix attributes bind particular prefixes to particular URIs within the element where the attribute appears. For example, inside this Order element, the prefix xi is bound to the URI http://www.w3.org/2001/XInclude:

 <Order xmlns:xi="http://www.w3.org/2001/XInclude">
   <xi:include href="order_details.xml"/>
 </Order>
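The four parts of the name can be read back through the DOM. This is a small sketch, assuming the JDK's JAXP DOM parser with namespace awareness switched on; the class and method names are invented for the example. XInclude processing is off by default, so the element is simply parsed, not resolved.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class QualifiedNameDemo {

    static final String XINCLUDE_NS = "http://www.w3.org/2001/XInclude";

    // Finds the first include element in the XInclude namespace and returns
    // "prefix localName qualifiedName namespaceURI".
    static String describeInclude(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP factories default to false
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element e = (Element)
                doc.getElementsByTagNameNS(XINCLUDE_NS, "include").item(0);
            return e.getPrefix() + " " + e.getLocalName() + " "
                + e.getNodeName() + " " + e.getNamespaceURI();
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse document", ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<Order xmlns:xi=\"" + XINCLUDE_NS + "\">"
            + "<xi:include href=\"order_details.xml\"/></Order>";
        // prints "xi include xi:include http://www.w3.org/2001/XInclude"
        System.out.println(describeInclude(xml));
    }
}
```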

Each prefix used in an element or attribute name must be bound to a URI. Failure to do this is a namespace well-formedness error. Although you can parse documents without considering namespaces, in practice most parsers and APIs check namespaces by default, and a violation of namespace well-formedness is almost as serious as a violation of XML 1.0 rules.
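A quick way to see the difference is to parse the same document twice. The sketch below assumes the JDK's bundled SAX parser; the class and method names are hypothetical. Note that JAXP factories are not namespace-aware unless you ask, so the check has to be turned on explicitly here.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class UnboundPrefixDemo {

    // Returns true if the parser reads the whole document without a fatal error.
    static boolean parses(String xml, boolean namespaceAware) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            factory.newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // well-formedness or namespace well-formedness error
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        // The xi prefix is used but never bound to a URI.
        String xml = "<Order><xi:include href=\"order_details.xml\"/></Order>";
        System.out.println(parses(xml, false)); // prints "true"
        System.out.println(parses(xml, true));  // prints "false"
    }
}
```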

The prefix can change as long as the URI stays the same. For example, this element is the same as the previous one:

 <Order xmlns:xinclude="http://www.w3.org/2001/XInclude">
   <xinclude:include href="order_details.xml"/>
 </Order>

You can also define a default namespace that applies to elements without prefixes. Example 1.6 places the Order element and all its descendants in the http://ns.cafeconleche.org/Orders/ namespace, even though none of them have prefixes.

Example 1.6 An XML Document That Uses a Default Namespace

 <?xml version="1.0" encoding="ISO-8859-1"?>
 <Order xmlns="http://ns.cafeconleche.org/Orders/">
   <Customer id="c32">Chez Fred</Customer>
   <Product>
     <Name>Birdsong Clock</Name>
     <SKU>244</SKU>
     <Quantity>12</Quantity>
     <Price currency="USD">21.95</Price>
     <ShipTo>
       <Street>135 Airline Highway</Street>
       <City>Narragansett</City> <State>RI</State> <Zip>02882</Zip>
     </ShipTo>
   </Product>
   <Subtotal currency='USD'>263.40</Subtotal>
   <Tax rate="7.0" currency='USD'>18.44</Tax>
   <Shipping method="USPS" currency='USD'>8.95</Shipping>
   <Total currency='USD'>290.79</Total>
 </Order>

Although it's most common to place the namespace binding attributes on the root element, they can appear on other elements deeper in the hierarchy. They can even override previous bindings in the ancestor elements. This is especially common with the binding of the default namespace. In Example 1.7, the Order, Customer, Product, Name, SKU, Quantity, Price, Subtotal, Tax, Shipping, and Total elements are all in the http://ns.cafeconleche.org/Orders/ namespace. However, the ShipTo, Street, City, State, and Zip elements are in the http://ns.cafeconleche.org/Address/ namespace.

Example 1.7 An XML Document That Uses Two Default Namespaces

 <?xml version="1.0" encoding="ISO-8859-1"?>
 <Order xmlns="http://ns.cafeconleche.org/Orders/">
   <Customer id="c32">Chez Fred</Customer>
   <Product>
     <Name>Birdsong Clock</Name>
     <SKU>244</SKU>
     <Quantity>12</Quantity>
     <Price currency="USD">21.95</Price>
     <ShipTo xmlns="http://ns.cafeconleche.org/Address/">
       <Street>135 Airline Highway</Street>
       <City>Narragansett</City> <State>RI</State> <Zip>02882</Zip>
     </ShipTo>
   </Product>
   <Subtotal currency='USD'>263.40</Subtotal>
   <Tax rate="7.0" currency='USD'>18.44</Tax>
   <Shipping method="USPS" currency='USD'>8.95</Shipping>
   <Total currency='USD'>290.79</Total>
 </Order>

Although less common, prefixes can also be attached to attribute names to indicate what namespace the attribute is in. For example, XLink uses this to distinguish between the XLink attributes such as type and href and attributes with the same names that might be used in elements that need to become XLinks. This ShipTo element is also a simple XLink to the recipient's e-mail address:

 <ShipTo xmlns="http://ns.cafeconleche.org/Address/"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         xlink:type="simple" xlink:href="mailto:chezfred@yahoo.com">
   <GiftRecipient>Samuel Johnson</GiftRecipient>
   <Street>271 Old Homestead Way</Street>
   <City>Woonsocket</City> <State>RI</State> <Zip>02895</Zip>
 </ShipTo>

Unprefixed attributes are never in any namespace. Unlike elements, they cannot be in the default namespace. Furthermore, they are not in the same namespace as the element to which they are attached. If an attribute does not have a prefix, it is not in a namespace.
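The rule is easy to confirm with a namespace-aware DOM parse. This sketch assumes the JDK's JAXP DOM parser. The sample document is a made-up combination of the Customer element and the XLink attributes shown earlier, and the class and method names are invented for illustration. The unprefixed id attribute reports no namespace at all, even though its element is in the default namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class AttributeNamespaceDemo {

    // Illustrative document: Customer is in the default (Orders) namespace.
    static final String DOC =
        "<Order xmlns=\"http://ns.cafeconleche.org/Orders/\""
        + " xmlns:xlink=\"http://www.w3.org/1999/xlink\">"
        + "<Customer id=\"c32\" xlink:type=\"simple\""
        + " xlink:href=\"mailto:chezfred@yahoo.com\">Chez Fred</Customer>"
        + "</Order>";

    static Element firstElement(String xml, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // "*" matches any namespace
            return (Element) doc.getElementsByTagNameNS("*", localName).item(0);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    // Returns the namespace URI of the named attribute, or null if it has none.
    static String attributeNamespace(String xml, String elementLocalName,
                                     String attributeQName) {
        Element e = firstElement(xml, elementLocalName);
        return e.getAttributeNode(attributeQName).getNamespaceURI();
    }

    public static void main(String[] args) {
        System.out.println(firstElement(DOC, "Customer").getNamespaceURI());
        System.out.println(attributeNamespace(DOC, "Customer", "id"));         // null
        System.out.println(attributeNamespace(DOC, "Customer", "xlink:type"));
    }
}
```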

On occasion, namespace prefixes are used in attribute values, element content, and even processing instructions. In these cases the nearest ancestor element that contains a binding for that prefix establishes what URI the prefix is mapped to. Inside an element with an xmlns:prefix attribute, we say that the namespace is in scope even if it isn't obviously used anywhere in that element. Namespaces in scope on an element include not only those that the element itself declares but also those that are declared on that element's ancestors. An element can redeclare a namespace prefix so that the prefix is bound to a different URI in the element and the element's children than it is bound to in the element's parent. Slightly more commonly, an element can change the default namespace that applies within the element and its content.
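When a prefix shows up in an attribute value, the program has to resolve it against the namespaces in scope itself. One way to do that, sketched below, is the DOM Level 3 method Node.lookupNamespaceURI, assuming a namespace-aware parse with the JDK's DOM implementation. The unit attribute, the c prefix, and both example.com URIs are invented for this example, as are the class and method names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class InScopeNamespaceDemo {

    // Hypothetical document: Shipping rebinds the c prefix to a different URI.
    static final String DOC =
        "<Order xmlns:c=\"http://ns.example.com/currency/v1\">"
        + "<Price unit=\"c:USD\">21.95</Price>"
        + "<Shipping xmlns:c=\"http://ns.example.com/currency/v2\""
        + " unit=\"c:USD\">8.95</Shipping>"
        + "</Order>";

    // Expands a prefixed value such as "c:USD" to "{uri}USD" using the
    // bindings in scope on the context element.
    static String expand(Element context, String prefixedValue) {
        int colon = prefixedValue.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("No prefix in " + prefixedValue);
        }
        String uri = context.lookupNamespaceURI(prefixedValue.substring(0, colon));
        return "{" + uri + "}" + prefixedValue.substring(colon + 1);
    }

    // Parses xml and expands the unit attribute of the first named element.
    static String unitOf(String xml, String elementName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element e = (Element) doc.getElementsByTagName(elementName).item(0);
            return expand(e, e.getAttribute("unit"));
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse document", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(unitOf(DOC, "Price"));    // binding inherited from Order
        System.out.println(unitOf(DOC, "Shipping")); // binding redeclared locally
    }
}
```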

When writing software to process XML documents that use namespaces, you almost always want to make your code dependent on the URI, not the prefix. If you're comparing two elements for equality, compare them by URI and local name, not prefix and local name. If you're searching for an element of a certain type, look for an element with the right URI and local name, not the right prefix and local name.
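As a minimal sketch of that advice, assuming the JDK's JAXP DOM parser with namespace awareness enabled, the comparison below treats xi:include and xinclude:include as the same element type because their URIs and local names match. The class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.util.Objects;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class NameComparisonDemo {

    // Compares two elements by namespace URI and local name, ignoring prefixes.
    static boolean sameName(Element a, Element b) {
        return Objects.equals(a.getNamespaceURI(), b.getNamespaceURI())
            && Objects.equals(a.getLocalName(), b.getLocalName());
    }

    // Parses xml and returns the first element whose local name is "include".
    static Element includeElement(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for URIs and local names
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return (Element) doc.getElementsByTagNameNS("*", "include").item(0);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        Element a = includeElement(
            "<Order xmlns:xi=\"http://www.w3.org/2001/XInclude\">"
            + "<xi:include href=\"order_details.xml\"/></Order>");
        Element b = includeElement(
            "<Order xmlns:xinclude=\"http://www.w3.org/2001/XInclude\">"
            + "<xinclude:include href=\"order_details.xml\"/></Order>");
        System.out.println(sameName(a, b));                          // prints "true"
        System.out.println(a.getNodeName().equals(b.getNodeName())); // prints "false"
    }
}
```

Objects.equals is used so that elements with no namespace, whose URI is null, can still be compared safely.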