5.3 XML | ITV Handbook: Technologies and Standards

XML and HTML were designed with different goals: XML was designed to describe data and focuses on what the data is, whereas HTML was designed to display data and focuses on how the data looks. This section does not provide an XML tutorial; for a detailed tutorial, the reader is referred to [XML-TUTORIAL]. A discussion of the relationship between XML and SGML (and HTML) can be found in [SGML-XML].

XML was designed as an extensible data representation language [XML]. In HTML, the tags and structure are predefined: the author of an HTML document can use only tags that are defined in the HTML standard (such as <p> and <h1>). In contrast, with XML, the author can, and is required to, define the tags and document structure. Each element in an XML document must have both an opening and a closing tag, or use the combined empty-element form (for example, <item/>).

The processing model of XML documents can be regarded as an onion model (see Figure 5.10). Each element can declare the namespaces that apply to its child elements. Therefore, each element can be regarded as an XML subdocument with its own namespace.

Figure 5.10. The onion model induced by namespaces.

To illustrate the utility of this approach, consider an XML document that uses three namespaces (see Example 5.11): an organization-wide namespace, a department-wide namespace, and a temporary, developer-specific debugging namespace whose life span is a few hours. A program that processes organization-wide data can process the entire document, knowing exactly which portions of the data are out of scope for organization-wide processing logic. Each department could have its own customized version of the program that extends the organization-wide program and knows how to process the department-specific portion of the data. In addition, a software engineer debugging a program might want to introduce some debug information into the document without affecting any of the programs that process the data.

Each layer in the onion model defines a namespace whose scope narrows going from the root of the XML tree toward the leaves. Example 5.11 presents an XML document containing three namespaces, aliased to a, b, and d, where a is used with the largest scope, b is used with a more specific scope than a, and d is used with the most specific scope for debugging. Programs that can process portions marked by a can easily ignore and skip all portions of the document that are not marked by a. Programs that can process portions marked by b should be invoked by programs that can process content marked by a and should ignore or skip portions marked by d.

Example 5.11 Utilizing onion model induced by namespaces.

  <a:order xmlns:a="http://xyz.tv/schema/">
    <a:name>
      <a:first>John</a:first>
      <a:last>Smith</a:last>
    </a:name>
    <b:item xmlns:b="http://xyz.tv/musicdpt/schema/"
            xmlns:d="http://xyz.tv/musicdpt/debug/schema/">
      <b:albumtitle>Classic Timeless Pieces</b:albumtitle>
      <d:warehouse_id>unknown</d:warehouse_id>
    </b:item>
  </a:order>
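The layered processing described above can be sketched in Python with the standard-library xml.etree.ElementTree module. This is only an illustration under stated assumptions: the order document is embedded in a well-formed variant that uses xmlns declarations, and the helper names in_namespace, local_name, and collect are hypothetical, not part of any standard. The organization-wide pass descends only through elements in namespace a and hands off foreign subtrees; the department pass does the same for namespace b and sets the debug subtree aside.

```python
import xml.etree.ElementTree as ET

ORG_NS = "http://xyz.tv/schema/"            # alias a in Example 5.11
DEPT_NS = "http://xyz.tv/musicdpt/schema/"  # alias b in Example 5.11

# Assumed well-formed variant of the order document.
ORDER = """\
<a:order xmlns:a="http://xyz.tv/schema/">
  <a:name><a:first>John</a:first><a:last>Smith</a:last></a:name>
  <b:item xmlns:b="http://xyz.tv/musicdpt/schema/"
          xmlns:d="http://xyz.tv/musicdpt/debug/schema/">
    <b:albumtitle>Classic Timeless Pieces</b:albumtitle>
    <d:warehouse_id>unknown</d:warehouse_id>
  </b:item>
</a:order>
"""

def in_namespace(element, uri):
    # ElementTree stores qualified names as "{uri}local".
    return element.tag.startswith("{" + uri + "}")

def local_name(element):
    return element.tag.rsplit("}", 1)[-1]

def collect(element, uri, handoff):
    """Return local names of elements in namespace `uri`.

    Subtrees rooted in any other namespace are not processed here;
    they are appended to `handoff` for a more specific program.
    """
    if not in_namespace(element, uri):
        handoff.append(element)
        return []
    names = [local_name(element)]
    for child in element:
        names.extend(collect(child, uri, handoff))
    return names

root = ET.fromstring(ORDER)

# Organization-wide pass: everything outside namespace a is handed off.
dept_subtrees = []
org_names = collect(root, ORG_NS, dept_subtrees)

# Department pass: processes namespace b and sets debug content aside.
debug_subtrees = []
dept_names = []
for subtree in dept_subtrees:
    dept_names.extend(collect(subtree, DEPT_NS, debug_subtrees))

print(org_names)
print(dept_names)
print(len(debug_subtrees))
```

Neither pass needs to know anything about the debugging namespace; it is simply skipped, which is the property the onion model relies on.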

XML is expressive enough to represent both data and programs, although it may not be the most efficient way to do so. To illustrate this expressiveness, consider the task of representing a script function or a procedural program that loops through data to find and retrieve a specific data element. What is needed is to represent a simple for loop with an appropriate termination condition. Much of this expressiveness was utilized in the development of the XML Path Language (XPath) and the Extensible Stylesheet Language Transformations (XSLT) language (see Example 5.12).

There are several methods for defining the XML tags and document structure:

DTD: Document Type Definition, as described earlier for SGML.
XDR: XML-Data Reduced, based on the XML-Data specification [XDR], is an extension of DTD that enables specifying element types and relations between elements.
DCD: Document Content Description, an extension of DTD designed to be a Resource Description Framework (RDF) conforming language capable of expressing subclassing relationships and database interfaces [DCD].
XSD: XML Schema Definition, a language that is more expressive than the other schema languages listed here and is rapidly becoming the state of the art for XML schema specification [XSD].

Example 5.12 Recursive XSLT code producing an HTML rendering of an XML file.

  <xsl:template name="RecursiveRenderXML">
    <xsl:param name="xml"/>
    <xsl:param name="spacing"/>
    <xsl:for-each select="$xml/*">
      <BR/><xsl:value-of select="$spacing"/>&lt;<xsl:value-of select="name()"/>
      <xsl:for-each select="./@*">
        <xsl:text>&#160;</xsl:text><xsl:value-of select="name()"/>='<xsl:value-of select="."/>'
      </xsl:for-each>
      <xsl:choose>
        <xsl:when test="not(./*)">
          <xsl:variable name="v"><xsl:value-of select="."/></xsl:variable>
          <xsl:choose>
            <xsl:when test="normalize-space($v)=''">/&gt;</xsl:when>
            <xsl:otherwise>
              &gt;<xsl:value-of select="."/>&lt;/<xsl:value-of select="name()"/>&gt;
            </xsl:otherwise>
          </xsl:choose>
        </xsl:when>
        <xsl:otherwise>
          &gt;
          <xsl:call-template name="RecursiveRenderXML">
            <xsl:with-param name="xml" select="."/>
            <xsl:with-param name="spacing">&#160;&#160;&#160;&#160;&#160;
              <xsl:value-of select="$spacing"/></xsl:with-param>
          </xsl:call-template>
          <BR/><xsl:value-of select="$spacing"/>
          &lt;/<xsl:value-of select="name()"/>&gt;
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each>
  </xsl:template>

5.3.1 Document Object Model

The Document Object Model (DOM) is a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure, and style of documents [DOM]. The document can be further processed, and the results of that processing can be incorporated back into the presented page. An overview of DOM-related materials at W3C and around the Web can be found in [WWW].

As explained above, a markup document, such as an HTML page, is a tree. The DOM data structure, together with the associated API, enables storing, accessing, and modifying components of that tree. The DOM presents documents as a hierarchy of Node objects that also implement other, more specialized interfaces. Some types of nodes may have child nodes of various types, and others are leaf nodes that cannot have anything below them in the document structure. DTD, DCD, and XSD all specify constraints on the structure of the DOM tree.
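To make the tree view concrete, the following minimal sketch uses Python's standard-library xml.dom.minidom, which implements a subset of the W3C DOM. The sample document and the helper name outline are assumptions made for illustration. The sketch walks the hierarchy of Node objects, showing that element nodes have children while text nodes are leaves, and then modifies one component of the tree through the API.

```python
from xml.dom import minidom, Node

# Hypothetical sample document.
doc = minidom.parseString(
    "<order><name><first>John</first><last>Smith</last></name></order>"
)

KINDS = {
    Node.DOCUMENT_NODE: "Document",
    Node.ELEMENT_NODE: "Element",
    Node.TEXT_NODE: "Text",
}

def outline(node, depth=0):
    """Return one line per node: indentation, node kind, and nodeName."""
    label = KINDS.get(node.nodeType, "Other")
    lines = ["  " * depth + label + " " + node.nodeName]
    for child in node.childNodes:  # empty for leaf nodes such as Text
        lines.extend(outline(child, depth + 1))
    return lines

lines = outline(doc)
print("\n".join(lines))

# Modify a component of the tree: replace the text under <first>.
first = doc.getElementsByTagName("first")[0]
first.firstChild.data = "Jane"
updated = doc.documentElement.toxml()
print(updated)
```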

5.3.1.1 Hierarchical versus Flat Views

The DOM Core APIs present two somewhat different sets of interfaces to an XML/HTML document [DOM]:

An object-oriented approach with a hierarchy of inheritance
A simplified view that allows all manipulation to be done via the Node interface without requiring casts (in Java and other C-like languages) or query interface calls in Microsoft Component Object Model (COM) environments.

In practice, this means that there is a certain amount of redundancy in the API. The Microsoft Internet Explorer implementation supports both flat and hierarchical views, as well as a mixed semi-flat version. The W3C DOM working group considers the inheritance approach the primary view of the API, and the full set of functionality on Node to be extra functionality that users may employ, but it does not eliminate the need for methods on other interfaces that an object-oriented analysis would dictate. Thus, even though there is a generic nodeName attribute on the Node interface, there is still a tagName attribute on the Element interface; these two attributes contain the same value, but it is worthwhile to support both, given the different constituencies the DOM API must satisfy [DOM].
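The redundancy between the two views can be observed in a short Python sketch using the standard-library xml.dom.minidom (the sample document is an assumption). The flat view reads nodeName on any node after checking nodeType, whereas the hierarchical view reads tagName, which is specific to the Element interface.

```python
from xml.dom import minidom, Node

# Hypothetical sample document.
doc = minidom.parseString("<album><title>Classic Timeless Pieces</title></album>")
title = doc.documentElement.firstChild
text = title.firstChild

# Flat view: every node, whatever its type, exposes nodeType and nodeName.
flat_names = [n.nodeName for n in (doc, title, text)]

# Hierarchical view: tagName belongs to the Element interface only.
hier_name = title.tagName if title.nodeType == Node.ELEMENT_NODE else None

print(flat_names)
print(hier_name == title.nodeName)
```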

5.3.1.2 Nodes versus NodeLists

The DOM also specifies a NodeList interface to handle ordered lists of Nodes, such as the children of a Node or the elements returned by the getElementsByTagName() method of the Element interface (as exposed, for example, in ECMAScript), and a NamedNodeMap interface to handle unordered sets of nodes referenced by their name attribute, such as the attributes of an Element. NodeList and NamedNodeMap objects in the DOM are live; that is, changes to the underlying document structure are reflected in all relevant NodeList and NamedNodeMap objects. For example, if a DOM user gets a NodeList object containing the children of an Element, then subsequently adds more children to that element (or removes children, or modifies them), those changes are automatically reflected in the NodeList, without further action on the user's part. Likewise, changes to a Node in the tree are reflected in all references to that Node in NodeList and NamedNodeMap objects.
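The live behavior can be illustrated with a minimal Python sketch using the standard-library xml.dom.minidom; the playlist document and attribute name are assumptions. The sketch obtains the child NodeList and the attribute NamedNodeMap once and then observes them changing as the element is modified. Note that this relies on minidom's childNodes and attributes objects; the list returned by minidom's getElementsByTagName() is a snapshot and is not live, which is a deviation of that library from the behavior described above.

```python
from xml.dom import minidom

# Hypothetical sample document.
doc = minidom.parseString("<playlist><track/><track/></playlist>")
playlist = doc.documentElement

children = playlist.childNodes   # NodeList, obtained once
attrs = playlist.attributes      # NamedNodeMap, obtained once

counts = [children.length]

playlist.appendChild(doc.createElement("track"))
counts.append(children.length)   # the same object now reports one more child

playlist.removeChild(playlist.firstChild)
counts.append(children.length)   # and one fewer after the removal

attr_before = attrs.length
playlist.setAttribute("owner", "jsmith")
attr_after = attrs.length        # the map reflects the new attribute

print(counts, attr_before, attr_after)
```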

5.3.1.3 Namespaces

The DOM-2 supports XML namespaces by augmenting several DOM-1 interfaces that allow creation and manipulation of elements and attributes associated with a namespace. The special attributes used for declaring XML namespaces are exposed through the API and can be manipulated just like any other attribute. However, nodes are permanently bound to namespace URIs as they get created. Consequently, moving a node within a document, using the DOM, in no case results in a change of its namespace prefix or namespace URI. Similarly, creating a node with a namespace prefix and namespace URI, or changing the namespace prefix of a node, does not result in any addition, removal, or modification of any special attributes for declaring the appropriate XML namespaces. Namespace validation is not enforced; the DOM application is responsible. In particular, because the mapping between prefixes and namespace URIs is not enforced, in general, the resulting document cannot be serialized naively. For example, applications may have to declare every namespace in use when serializing a document.
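The serialization caveat can be demonstrated with a short Python sketch using the standard-library xml.dom.minidom, whose serializer, like the DOM itself, does not add namespace declarations automatically. The document content is an assumption; the xmlns namespace URI is the one reserved by W3C for namespace declaration attributes. Creating a namespaced element binds it to its URI but adds no declaration, so the naive output cannot be parsed back until the application declares the prefix itself.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

DEPT_NS = "http://xyz.tv/musicdpt/schema/"
XMLNS_NS = "http://www.w3.org/2000/xmlns/"  # namespace of xmlns:* attributes

doc = minidom.getDOMImplementation().createDocument(None, "order", None)
item = doc.createElementNS(DEPT_NS, "b:item")
doc.documentElement.appendChild(item)

# The node is bound to its namespace URI, but no declaration was created.
naive = doc.documentElement.toxml()

try:
    minidom.parseString(naive)
    naive_parses = True
except ExpatError:
    naive_parses = False         # the prefix b is unbound in the output

# The application declares the namespace explicitly before serializing.
item.setAttributeNS(XMLNS_NS, "xmlns:b", DEPT_NS)
declared = doc.documentElement.toxml()
reparsed = minidom.parseString(declared)

print(naive)
print(declared)
```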

DOM-1 is namespace ignorant, as its methods identify attribute nodes solely by their nodeName. In contrast, the DOM-2 methods related to namespaces identify attribute nodes by their namespaceURI and localName. Because of this fundamental difference, mixing both sets of methods can lead to unpredictable results. In particular, using setAttributeNS(), an element may have two (or more) attributes that have the same nodeName but different namespace URIs.
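The difference between the two lookup styles can be sketched in Python with the standard-library xml.dom.minidom; the attribute name and value are assumptions. The DOM-2 style method finds the attribute by namespace URI and local name, whereas the DOM-1 style method matches only the full nodeName, so a lookup by local name alone comes back empty. This sketch does not reproduce the case of two attributes sharing one nodeName, since minidom's handling of that case may differ from other implementations.

```python
from xml.dom import minidom

ORG_NS = "http://xyz.tv/schema/"

doc = minidom.getDOMImplementation().createDocument(None, "order", None)
order = doc.documentElement
order.setAttributeNS(ORG_NS, "a:id", "1001")   # hypothetical attribute

# DOM-2 style: identified by (namespace URI, local name).
by_ns = order.getAttributeNS(ORG_NS, "id")

# DOM-1 style: identified by nodeName, that is, the qualified name.
by_qname = order.getAttribute("a:id")
by_local = order.getAttribute("id")            # no attribute has this nodeName

attr = order.getAttributeNodeNS(ORG_NS, "id")
print(attr.nodeName, attr.localName, attr.namespaceURI)
print(repr(by_ns), repr(by_qname), repr(by_local))
```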

DOM-2 does not perform any URI normalization or canonicalization. The URIs given to the DOM are assumed to be valid (e.g., characters such as white spaces are properly escaped), and no lexical checking is performed. Absolute URI references are treated as strings and compared literally. How relative namespace URI references are treated is undefined. To ensure interoperability, only absolute namespace URI references (i.e., URI references beginning with a scheme name and a colon) should be used. Note that because the DOM does no lexical checking, the empty string will be treated as a real namespace URI in DOM-2 methods. Applications should use the value null as the namespace URI parameter for methods if they wish to have no namespace.
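Both points, literal comparison and the distinction between null and the empty string, can be observed in a small Python sketch using the standard-library xml.dom.minidom, where None plays the role of null. The sample document is an assumption: its two namespace URIs differ only in the letter case of the host name, which many URI-aware tools would treat as equivalent, yet the DOM treats them as different namespaces.

```python
from xml.dom import minidom

# Hypothetical document: two URIs that differ only in letter case.
DOC = (
    '<r xmlns:a="http://xyz.tv/schema/" xmlns:b="http://XYZ.tv/schema/">'
    "<a:x/><b:x/></r>"
)
doc = minidom.parseString(DOC)

# Namespace URIs are compared as plain strings, with no normalization.
lower = doc.getElementsByTagNameNS("http://xyz.tv/schema/", "x")
upper = doc.getElementsByTagNameNS("http://XYZ.tv/schema/", "x")

# None (null) means "no namespace"; the empty string is kept as given.
no_ns = doc.createElementNS(None, "plain")
empty_ns = doc.createElementNS("", "plain")

print(lower.length, upper.length)
print(no_ns.namespaceURI, repr(empty_ns.namespaceURI))
```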