XML documents are the class of data objects described by the XML Recommendation [XML]. All XML documents are made up of two parts:
For example, the following is an XML document:
<?xml version="1.0"?>
<body> content </body>
All XML documents have a logical and a physical structure. A document usually consists of a hierarchical structure of elements. An element consists of data (including null data) surrounded by start and end tags. In XML, you can define as many custom tag sets for your documents as you need.
Example 3-1 compares similar XML and HTML documents. HTML typically describes what a document looks like, whereas XML describes how a document is logically structured.
Example 3-1 Comparison of similar XML and HTML documents
XML Example

<sale-item>
  <head>House</head>
  <type>single family</type>
  <cond>like new</cond>
  <size>1400 sq. ft.</size>
  <bedroom>3</bedroom>
  <bath>1 1/2</bath>
  <lot>8000 sq. ft.</lot>
  <price>$158,000</price>
</sale-item>

HTML Example

<h1>House for Sale</h1>
<p align=center>Single family</p>
<br><i>like new</i> 1400 sq. ft.
<br>3 bedrooms
<br>1 1/2 baths
<br>lot size of 8000 sq. ft.
<br>asking $158,000
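Because the XML version names each piece of data, a program can look values up by tag instead of guessing from layout. The following is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module (neither is prescribed by the text); the helper name describe is hypothetical.

```python
import xml.etree.ElementTree as ET

# The XML half of Example 3-1.
SALE_ITEM = """<sale-item>
  <head>House</head>
  <type>single family</type>
  <cond>like new</cond>
  <size>1400 sq. ft.</size>
  <bedroom>3</bedroom>
  <bath>1 1/2</bath>
  <lot>8000 sq. ft.</lot>
  <price>$158,000</price>
</sale-item>"""


def describe(xml_text):
    """Return a {tag: text} mapping for the children of the root element."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


listing = describe(SALE_ITEM)
print(listing["price"])    # looked up by tag name, not by position
print(listing["bedroom"])
```

The HTML version offers no comparable handle: the price is only distinguishable as the text that follows the last <br>.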
The XML Recommendation defines the rules for creating the semantic tags that you use to describe data and for adding markup to documents. An XML document consists of text (data) plus XML markup. Note that an XML document is always interpreted as [Unicode]. If the document uses non-Unicode character codes, the processing agent maps them into Unicode code points when read. An XML markup language must follow standard rules that provide the following information:
Table 3-3 lists the components for encoding and decoding an XML document.
The processing application needs to know the syntax of the markup to determine what to do with the XML.
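The earlier note that an XML document is always interpreted as Unicode can be illustrated with a short sketch. It assumes Python's standard xml.etree.ElementTree module and a hypothetical one-element document stored in ISO-8859-1; the processor reads the declared encoding and hands the application Unicode text.

```python
import xml.etree.ElementTree as ET

# Bytes as they might be stored on disk: 0xE9 is "é" in ISO-8859-1.
raw = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>Caf\xe9</name>'

root = ET.fromstring(raw)

# The application sees a Unicode string, not the original bytes.
text = root.text
code_points = [hex(ord(ch)) for ch in text]
print(text, code_points)
```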
3.2.1 XML Parsing Process
To read an XML document, you need an XML parser/processor. It can be implemented as a browser, if the XML is just to be displayed, or as an application module, if the XML is to feed more complex processing. The XML Recommendation provides for two types of parser/processors: nonvalidating and validating. It also provides for two categories of XML documents: well formed (Section 3.2.2) and valid (Section 3.2.3). An XML parser must determine whether the markup is well formed and, if a DTD is present, may also determine whether the document is valid. The XML Recommendation does not require that XML documents have a DTD. All XML documents must follow the rules for being well formed or else they are, by definition, not XML documents. Not all XML documents have to be valid, but all valid documents are well formed.
A nonvalidating parser checks the XML document against the well-formedness constraints of XML. Note that a DTD can be present for a nonvalidating parser/processor. As described in Chapter 4, such a parser must still expand entities that are defined in the DTD, but it will not provide a valid/nonvalid indication. A validating parser checks the XML document against the validity constraints of XML and any constraints contained in the DTD.
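The behavior of a nonvalidating parser can be sketched with Python's standard xml.etree.ElementTree module, which is built on the nonvalidating expat parser. The memo below and its internal DTD subset are hypothetical: the DTD allows only a to element inside memo, yet the document also contains from, so it is well formed but not valid.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: well formed, but it violates its own DTD
# because <from> is not allowed by the <!ELEMENT memo (to)> declaration.
DOC = """<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ELEMENT memo (to)>
  <!ELEMENT to (#PCDATA)>
  <!ENTITY place "Big Stick Farm">
]>
<memo>
  <to>Jon</to>
  <from>Chris at &place;</from>
</memo>"""

root = ET.fromstring(DOC)          # no error: validity is not checked
sender = root.find("from").text    # the &place; reference has been expanded
print(sender)
```

The parse succeeds and the entity is expanded, but nothing reports that the document breaks the DTD's content model; that indication would require a validating parser.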
In many applications, the parsed information ends up in an internal data structure or in a database. Later, some code may modify the data structure, synthesize new XML data structures, or retrieve information from the database and create output XML. For example, many of today's Java-based processors are designed for use with Web applications. With a Java-based XML processor, the application uses the processor classes to read in the document. Once the application reads in the document, the information in the document becomes available to Java.
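The parse, modify, and regenerate cycle described above might look like the following sketch. The text discusses Java-based processors; this illustration uses Python's standard xml.etree.ElementTree module instead, and the dictionary layout is an assumption chosen for brevity.

```python
import xml.etree.ElementTree as ET

MEMO = """<memo>
  <to>Jon</to>
  <from>Chris</from>
  <subject>Reminder</subject>
</memo>"""

# 1. Parse: external XML text becomes an internal data structure (a dict).
record = {child.tag: child.text for child in ET.fromstring(MEMO)}

# 2. Modify the internal structure with ordinary application code.
record["subject"] = "Updated reminder"

# 3. Synthesize new XML from the modified structure.
out_root = ET.Element("memo")
for tag, text in record.items():
    ET.SubElement(out_root, tag).text = text
output = ET.tostring(out_root, encoding="unicode")
print(output)
```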
Generally, parsers just go from an external representation of XML to an internal one. Some processors also have facilities to interpret stylesheets and produce various output. For example, a browser-oriented processor translates an XML document into different types of documents such as HTML, RTF, or TeX. During translation, the parser checks whether the XML document is well formed and/or confirms the validity of the XML document. When a document meets the requirements for well-formed and/or valid documents, the parser or code generator transforms it into a different document type (see Figure 3-1).
Figure 3-1. Browser-oriented parsing process
Transforming an XML document into another document type requires a translation file or stylesheet. Section 3.7 discusses stylesheets.
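As a rough illustration of the translation step, the sketch below parses a memo and emits an HTML fragment. Python's standard library has no stylesheet engine, so a hand-written mapping stands in for the stylesheet here; the function name memo_to_html and the choice of HTML tags are assumptions, not part of any stylesheet language.

```python
import xml.etree.ElementTree as ET
from html import escape


def memo_to_html(xml_text):
    """Parse a memo and return an HTML fragment.

    ET.fromstring raises ET.ParseError if the input is not well formed,
    so no output is produced for a malformed document.
    """
    root = ET.fromstring(xml_text)
    subject = escape(root.findtext("subject", default=""))
    body = escape(root.findtext("body", default=""))
    return f"<h1>{subject}</h1>\n<p>{body}</p>"


html = memo_to_html(
    "<memo><subject>Reminder</subject>"
    "<body>Three PM meeting canceled.</body></memo>"
)
print(html)
```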
3.2.2 Well-Formed Documents
"Well formed" has an exact meaning in XML. A well-formed document adheres to the syntax rules specified by the XML 1.0 Recommendation. If the document is not well formed, that is, if it contains an XML syntax error, the XML processor stops and reports a fatal error. A textual object is a well-formed XML document if it meets the following criteria:
A well-formed XML document will meet the minimum requirement of being parseable as XML. Example 3-2 shows such a document. The first line in the example is the XML declaration. The XML declaration, if present, must begin the XML document and must be in lowercase. It tells the parser that the document is XML and that it conforms to the version 1.0 specification.
Example 3-2 A well-formed XML document
<?xml version="1.0"?>
<memo>
  <to>Jon</to>
  <from>Chris</from>
  <subject>Reminder</subject>
  <body>Three PM meeting canceled. Have a great weekend.</body>
</memo>
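A well-formedness check can be sketched by handing the text to a parser and treating any fatal error as a failure. This assumes Python's standard xml.etree.ElementTree module; the helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False on a fatal error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


good = "<memo><to>Jon</to><from>Chris</from></memo>"
bad_nesting = "<memo><to>Jon</from></memo>"   # end tag does not match start tag
two_roots = "<memo/><memo/>"                  # more than one root element

print(is_well_formed(good), is_well_formed(bad_nesting), is_well_formed(two_roots))
```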
A well-formed document also adheres to the following rules:
3.2.3 Valid XML Documents
An XML document is valid if it is well formed, has an associated DTD, and complies with the constraints expressed in that DTD. The DTD defines the grammar and vocabulary of a markup language, specifying what is and what is not allowed to appear in a document: for example, which tags can appear in the document and how they must nest within one another.
An XML document can contain the DTD, the XML document can link to an external DTD, or DTD material can appear in both places. Different documents and Web sites can share external DTDs. The DTD or its reference must appear before the first element in the document.
Example 3-3 shows a well-formed and hypothetically valid XML document. This example references an external DTD. Chapter 4 discusses the details of DTDs.
Example 3-3 A well-formed and valid XML document with an external DTD
<?xml version="1.0"?>
<!DOCTYPE memo SYSTEM "InternalMemo.dtd">
<memo>
  <to>Jon</to>
  <from>Chris</from>
  <subject>Reminder</subject>
  <body>Three PM meeting canceled. Let's meet at Big Stick Farm after work</body>
</memo>
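How a processor encounters the DTD reference before the first element can be sketched with Python's standard xml.parsers.expat module. Note the limits of this illustration: expat is nonvalidating and, as configured here, does not load InternalMemo.dtd, so the sketch only shows the reference being reported, not the document being validated. The handler names on_doctype and on_start are hypothetical.

```python
import xml.parsers.expat

DOC = """<?xml version="1.0"?>
<!DOCTYPE memo SYSTEM "InternalMemo.dtd">
<memo>
  <to>Jon</to>
</memo>"""

events = []


def on_doctype(name, system_id, public_id, has_internal_subset):
    # Called when the parser reaches the document type declaration.
    events.append(("doctype", name, system_id))


def on_start(name, attrs):
    # Called for every start tag, in document order.
    events.append(("element", name))


parser = xml.parsers.expat.ParserCreate()
parser.StartDoctypeDeclHandler = on_doctype
parser.StartElementHandler = on_start
parser.Parse(DOC, True)   # True marks this as the final chunk of input
print(events)
```

A validating processor would use the reported system identifier to fetch the DTD and then check the memo element and its children against it.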