3.2 XML Documents

XML documents are the class of data objects described by the XML Recommendation [XML]. An XML document is made up of two parts, the first of which is optional:

  • A prolog, which, if present, contains at least the XML declaration

  • A body that contains the actual marked-up document

For example, the following is an XML document:

 <?xml version="1.0"?>
 <body> content </body>

All XML documents have a logical and a physical structure. A document usually consists of a hierarchical structure of elements. An element consists of data (including null data) surrounded by start and end tags. In XML, you can define an unlimited number of custom tag sets for your documents.

Example 3-1 compares similar XML and HTML documents. HTML typically describes what a document looks like, whereas XML describes how a document is logically structured.

Example 3-1 Comparison of similar XML and HTML documents
 XML Example

 <sale-item>
   <head>House</head>
   <type>single family</type>
   <cond>like new</cond>
   <size>1400 sq. ft.</size>
   <bedroom>3</bedroom>
   <bath>1 1/2</bath>
   <lot>8000 sq. ft.</lot>
   <price>$158,000</price>
 </sale-item>

 HTML Example

 <h1>House for Sale</h1>
 <p align=center>Single family</p>
 <br><i>like new</i> 1400 sq. ft.
 <br>3 bedrooms
 <br>1 1/2 baths
 <br>lot size of 8000 sq. ft.
 <br>asking $158,000

The XML Recommendation defines the rules for creating the semantic tags that you use to describe data and for adding markup to documents. An XML document consists of text (data) plus XML markup. Note that an XML document is always interpreted as [Unicode]. If the document uses non-Unicode character codes, the processing agent maps them into Unicode code points when read. An XML markup language must follow standard rules that provide the following information:

  • The syntax for marking up, which follows strict rules. If the syntax is not exactly right, the parser stops processing and returns an error message.

  • The meaning behind the markup (standards for encoding data with information about itself).
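The mapping of non-Unicode character codes into Unicode can be observed with an ordinary parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module (not something the text prescribes) and a hypothetical <name> element. The document is stored as ISO-8859-1 bytes, yet the application receives Unicode code points.

```python
import xml.etree.ElementTree as ET

# Hypothetical document stored as ISO-8859-1 bytes; 0xE9 is "e with acute accent"
# in that encoding.
raw = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>Jos\xe9</name>'

root = ET.fromstring(raw)

# The parser has mapped the single byte 0xE9 to the Unicode code point U+00E9.
text = root.text
code_points = [hex(ord(ch)) for ch in text]
print(text, code_points)
```

The encoding declaration in the prolog is what tells the parser how to interpret the bytes; the application itself never sees the original encoding.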

Table 3-3 lists the components for encoding and decoding an XML document.

The processing application needs to know the syntax of the markup to determine what to do with the XML.


The use of XML is not limited to text markup. Thanks to its extensibility, XML could just as easily apply to sound markup or video markup. For example, as text markup the tag <EMPHASIZE> might display text as bold. Used in audio markup, the same tag might produce a louder voice. Used in logical assertion markup, it might indicate an assertion with higher strength, which would prevail when conflicts with other assertions arise. The actions of tags are generally defined by the applications.

Table 3-3. Components Related to Encoding and Decoding Using XML

Component                                  Function
XML document                               A data object containing a custom markup language
Document Type Definition (DTD; optional)   Specifies the markup syntax, that is, what it means to be a valid tag (see Chapter 4 for more information about DTDs)
Schema (optional)                          Specifies more detailed markup syntax and data typing (see Chapter 5 for more information about schemas)
Stylesheet (optional)                      Usually contains the display instructions for the graphical user interface (GUI), used by the processing application when the XML is intended for output presentation (see Section 3.7)
Display agent                              Combines the DTD, stylesheet logic, and XML document, and displays the document according to the rules and the data

3.2.1 XML Parsing Process

To read an XML document, you need an XML parser/processor, which can be implemented as a browser, if the XML is just to be displayed, or as an application module, if it is to feed more complex processing. The XML Recommendation provides for two types of parser/processors: nonvalidating and validating. It also provides for two categories of XML documents: well formed (Section 3.2.2) and valid (Section 3.2.3). Every XML parser must determine whether the markup is well formed; if a DTD is present, a validating parser also determines whether the document is valid. The XML Recommendation does not require that XML documents have a DTD. All XML documents must follow the rules for being well formed or else they are, by definition, not XML documents. Not every XML document has to be valid, but every valid document is well formed.

  • The XML processor provides access to the XML document's content and structure. An XML processor acting on behalf of the application, either independently or through a browser, reads and interprets the XML document, and, if present, the DTD, schema, and stylesheet. This processing agent uses the DTD and schema as part of the input and uses stylesheets for display output.

A nonvalidating parser checks the XML document against the well-formedness constraints of XML. Note that a DTD can be present even when a nonvalidating parser/processor is used. As described in Chapter 4, such a parser must still expand entities that are defined in the DTD, but it does not report whether the document is valid. A validating parser checks the XML document against both the validity constraints of XML and any constraints contained in the DTD.
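The behavior of a nonvalidating parser in the presence of a DTD can be sketched as follows. This is an illustration only, assuming Python's standard xml.etree.ElementTree module, which does not validate; the DTD declarations and the author entity are hypothetical and are not taken from InternalMemo.dtd or any other DTD in this book.

```python
import xml.etree.ElementTree as ET

# Hypothetical memo whose internal DTD subset declares an entity and a content
# model. The body violates the content model: <memo> is declared to contain
# only <from>, yet it also contains <to>.
doc = """<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ELEMENT memo (from)>
  <!ELEMENT from (#PCDATA)>
  <!ENTITY author "Chris">
]>
<memo><to>Jon</to><from>&author;</from></memo>"""

root = ET.fromstring(doc)

# The entity defined in the DTD is expanded into its replacement text...
sender = root.find("from").text
# ...but the undeclared <to> element is accepted, because no validation occurs.
recipient = root.find("to").text
print(sender, recipient)
```

A validating parser given the same input would report the document as invalid, although it is well formed.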

In many applications, the parsed information ends up in an internal data structure or in a database. Later, some code may modify the data structure, synthesize new XML data structures, or retrieve information from the database and create output XML. For example, many of today's Java-based processors are designed for use with Web applications. With a Java-based XML processor, the application uses the processor classes to read in the document. Once the application reads in the document, the information in the document becomes available to Java.
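The parse, modify, and output cycle described above can be sketched in a few lines. The text speaks of Java-based processors; the sketch below swaps in Python's standard xml.etree.ElementTree module purely for brevity, and the priority element is a hypothetical addition for illustration.

```python
import xml.etree.ElementTree as ET

source = "<memo><to>Jon</to><from>Chris</from><subject>Reminder</subject></memo>"

# Parse: the external XML text becomes an in-memory tree of elements.
memo = ET.fromstring(source)

# Modify the internal data structure.
memo.find("subject").text = "Updated reminder"
ET.SubElement(memo, "priority").text = "high"  # hypothetical new element

# Serialize: the modified tree becomes output XML again.
output = ET.tostring(memo, encoding="unicode")
print(output)
```

Once parsed, the document is ordinary program data; nothing about the tree ties it to the text it came from.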

Generally, parsers just go from an external representation of XML to an internal one. Some processors also have facilities to interpret stylesheets and produce various output. For example, a browser-oriented processor translates an XML document into different types of documents such as HTML, RTF, or TeX. During translation, the parser checks whether the XML document is well formed and/or confirms the validity of the XML document. When a document meets the requirements for well-formed and/or valid documents, the parser or code generator transforms it into a different document type (see Figure 3-1).

Figure 3-1. Browser-oriented parsing process


Transforming an XML document into another document type requires a translation file or stylesheet. Section 3.7 discusses stylesheets.

3.2.2 Well-Formed Documents

"Well formed" has an exact meaning in XML. A well-formed document adheres to the syntax rules specified by the XML 1.0 Recommendation. If the document is not well-formed and an error appears in the XML syntax, the XML processor stops and reports the presence of a fatal error. A textual object is a well-formed XML document if it meets the following criteria:

  • It contains one or more elements.

  • It meets all the well-formed constraints given in the XML 1.0 Recommendation [XML].

  • Each of the parsed entities directly or indirectly referenced within the document is well formed.


A parsed entity's contents are referred to as its replacement text; this text is considered an integral part of the document.

A well-formed XML document will meet the minimum requirement of being parseable as XML. Example 3-2 shows such a document. The first line in the example is the XML declaration. The XML declaration, if present, must begin the XML document and must be in lowercase. It tells the parser that the document is XML and that it conforms to the version 1.0 specification.


Case sensitivity is strictly observed in XML.

Example 3-2 A well-formed XML document
 <?xml version="1.0"?>
 <memo>
   <to>Jon</to>
   <from>Chris</from>
   <subject>Reminder</subject>
   <body>Three PM meeting canceled. Have a great weekend.</body>
 </memo>
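To illustrate what "parseable as XML" means in practice, the following minimal sketch feeds the memo from Example 3-2 to a parser and then tries a variant whose start and end tags differ only in case. It assumes Python's standard xml.etree.ElementTree module; any conforming parser should behave similarly.

```python
import xml.etree.ElementTree as ET

memo_text = """<?xml version="1.0"?>
<memo>
  <to>Jon</to>
  <from>Chris</from>
  <subject>Reminder</subject>
  <body>Three PM meeting canceled. Have a great weekend.</body>
</memo>"""

# A well-formed document parses without error and exposes its structure.
root = ET.fromstring(memo_text)
children = [child.tag for child in root]
print(root.tag, children)

# Case matters: <Memo> and </memo> are different names, so this is a fatal error.
try:
    ET.fromstring("<Memo>text</memo>")
    case_mismatch_accepted = True
except ET.ParseError:
    case_mismatch_accepted = False
print(case_mismatch_accepted)
```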

A well-formed document also adheres to the following rules:

  • Elements containing data have both start and end tags, for example, <to>Jon</to> in Example 3-2.

  • Elements containing no data can be written as a single tag that ends with slash greater than ("/>"). These so-called empty elements need not always take this form: an empty element is also well formed if its start tag is immediately followed by an end tag. For example, <termdef term='dog'/> and <termdef term='dog'></termdef> are equivalent. For more information, see the discussion of elements later in this chapter.

  • A document must contain exactly one element that contains all other elements. For instance, in Example 3-2:

     <memo> . . . </memo>
  • Nesting of elements is allowed but elements may not overlap.

  • Quotes must surround attribute values, as in the following example:

     <termdef  term='dog'/> 
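  • The less than ("<") and ampersand ("&") characters may appear in their literal form only where they begin markup, such as a start tag, an end tag, or an entity reference; in character data and attribute values they must otherwise be escaped.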

  • The following five predefined entity references are available to represent markup characters in content or single and double quotes:

    &amp;, &lt;, &gt;, &apos;, &quot;
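The rules in this list can be checked mechanically. The sketch below assumes Python's standard xml.etree.ElementTree and xml.sax.saxutils modules; is_well_formed is a hypothetical helper written for this illustration, and the sample strings are invented.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(text):
    """Return True if the parser accepts the text, False on a fatal error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


checks = {
    "nested elements": is_well_formed("<a><b>ok</b></a>"),
    "overlapping elements": is_well_formed("<a><b>bad</a></b>"),
    "quoted attribute": is_well_formed("<termdef term='dog'/>"),
    "unquoted attribute": is_well_formed("<termdef term=dog/>"),
    "two top-level elements": is_well_formed("<a/><b/>"),
    "bare ampersand": is_well_formed("<a>Q & A</a>"),
    "escaped ampersand": is_well_formed("<a>Q &amp; A</a>"),
    "start tag then end tag": is_well_formed("<a></a>"),
}
for rule, accepted in checks.items():
    print(rule, accepted)

# escape() handles &, <, and > by default; the quote characters are passed
# explicitly so that all five predefined entity references are covered.
escaped = escape("5 < 6 & \"x\"", {'"': "&quot;", "'": "&apos;"})
print(escaped)
```

Escaping text before placing it in a document, rather than building markup by hand, is the usual way to keep generated XML well formed.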

3.2.3 Valid XML Documents

An XML document is valid if it is well formed, has an associated DTD, and complies with the constraints expressed in that DTD. The DTD defines the grammar and vocabulary of a markup language, specifying what is and what is not allowed to appear in a document, for example, which tags can appear in the document and how they must nest within one another.

An XML document can contain the DTD, the XML document can link to an external DTD, or DTD material can appear in both places. Different documents and Web sites can share external DTDs. The DTD or its reference must appear before the first element in the document.

Example 3-3 shows a well-formed and hypothetically valid XML document. This example references an external DTD. Chapter 4 discusses the details of DTDs.

Example 3-3 A well-formed and valid XML document with an external DTD
 <?xml version="1.0"?>
 <!DOCTYPE memo SYSTEM "InternalMemo.dtd">
 <memo>
   <to>Jon</to>
   <from>Chris</from>
   <subject>Reminder</subject>
   <body>Three PM meeting canceled. Let's meet at Big Stick Farm
         after work</body>
 </memo>

Secure XML: The New Syntax for Signatures and Encryption
ISBN: 0201756056
Year: 2005
Pages: 186
