The XML Document Structure | XML and ASP.NET


You will see the term document throughout this book. A document refers to a set of XML tags that follow the XML document syntax. Because the XML can be contained within a file, a stream, or a string, the term document is used to separate the implementation of how the bytes are stored from the logical grouping of those bytes into an informational item.

A document contains three sections: a prologue, a body, and an epilogue. To understand XML in .NET, you need to focus only on the body of an XML document and on what makes it an XML document. The following sections look at the components of XML and how to create your own XML documents.

XML Document Syntax

The XML document syntax is the backbone of XML. The document syntax describes what an XML document is.

An XML document:

Is composed of one or more nodes
Has one and only one root node
Has elements whose tags are properly nested
Has elements that contain both start and end tags

Take a look at some of these concepts in a more familiar manner. (We assume here that you have written at least simple HTML documents.) Consider the following HTML code for creating an anchor tag or hyperlink:

 <a href="http://www.microsoft.com">Microsoft.com</a>

The syntax of XML is recognizable if you have experience with HTML. XML also uses angle brackets (< and >) to delimit tags, and it uses the same begin tag and end tag sequence as HTML. In the XML document syntax, an element consists of a begin tag, optional content, and an end tag whose name matches the begin tag exactly, including case. Figure 1.1 shows the begin tag, end tag, and content of an XML element.

Figure 1.1. The begin tag, end tag, and content of an XML element.

Many browsers accept poorly formed HTML documents. For example, Internet Explorer gladly renders the following:

  <H1>  This is a heading where the tags don't match  </h1>

Notice that the tags use a different case. This might be acceptable to HTML browsers, but it is not acceptable in XML. XML is case-sensitive, and begin and end tags must match. Similarly, HTML elements do not always require an ending tag; one such HTML tag is the <br> tag. In XML, every element requires both a start tag and an end tag, or the combined empty-element form, such as <br/>.

If a document conforms to the preceding syntax rules, the document is said to be well formed. For example, the following can be considered well formed:

 <parent/>

 <parent></parent>

 <parent>data</parent>

 <parent><teacher>text</teacher><student>name</student></parent>

The following is not well formed because it contains multiple root nodes:

 <parent></parent><teacher></teacher>

This example is not well formed because the root element does not have an end tag:

 <root>

Here, you see a document that is not well formed because its children are not properly nested:

 <parent><teacher>text</student><student>name</teacher></parent>

The XML document syntax addresses what a document is and what its structure is. By conforming to these rules, you impose a common structure on your data that's flexible enough to meet different applications. The next step in understanding XML is to understand its building blocks.
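The well-formedness rules above can be checked mechanically. The following is a minimal sketch in Python rather than the .NET classes this book covers, assuming the standard library's xml.etree.ElementTree parser; the helper name is_well_formed is hypothetical and exists only for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = [
    "<parent/>",
    "<parent><teacher>text</teacher><student>name</student></parent>",
    "<parent></parent><teacher></teacher>",  # multiple root nodes
    "<root>",  # missing end tag
    "<parent><teacher>text</student><student>name</teacher></parent>",  # bad nesting
    "<H1>heading</h1>",  # begin and end tags differ in case
]

for sample in samples:
    print(is_well_formed(sample), sample)
```

A parser of this kind stops at the first violation, so the check reports only whether a document is well formed, not every rule that it breaks.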

Elements

Elements were mentioned in the previous section. An element is a named unit of an XML document, marked by a start tag and an end tag. The closest equivalent concept in HTML is a tag. For example, the following HTML creates a table:

 <table>
     <tr>
         <td>Hello</td><td>World</td>
     </tr>
 </table>

In XML, you can think of each of the HTML tags as XML elements. But the presence of a tag does not make an XML document. An XML document must conform to the XML document syntax, and the data contained in an element or attribute must not contain certain reserved characters (see the sections "CDATA" and "Entities"). Elements can contain other elements, character data, or both.

The text appearing between the start tag and end tag is called the element's content. If no data appears between the start and end tag, the element is said to be empty. An empty element can be represented either as

 <customer></customer>

or as

 <customer/>
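As a quick illustration that the two spellings of an empty element are equivalent to a parser, here is a small sketch in Python, assuming the standard library's xml.etree.ElementTree module (not the .NET classes discussed in this book):

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<customer></customer>")
short_form = ET.fromstring("<customer/>")

# Neither element carries text content or child elements.
print(long_form.tag, long_form.text, len(long_form))
print(short_form.tag, short_form.text, len(short_form))
```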

An element has a name and is associated with a namespace. See the sections "Names" and "Namespaces" for more information.

Elements are rather simple, and this simplicity helps make XML a successful technology. The ability to create new elements with little restriction is an integral part of XML.

Attributes

An attribute is a simple name and value pairing that describes a facet of an element. An attribute must have a name that is unique within its element, and each attribute has a quoted value that does not contain literal angled brackets (< or >).

 <?xml version='1.0'?>
 <children>
     <child nickname="Little Bit">Carson Allen Evans</child>
     <child nickname="Teeny Bit"/>
 </children>

In the preceding example, the nickname attribute describes a single facet of the child element. The name of the attribute is nickname, and the value of the attribute for Carson Allen Evans is Little Bit. An attribute name follows the qualified naming syntax (see the section "Names" for more information).

Because you can structure a document by using elements and/or attributes, the natural question arises, "When should I choose one over the other?" This is a long-standing debate. Some developers argue that attributes are no longer necessary because you can represent any attribute as a child element. Others argue that attributes make a document infinitely more readable. Attributes and child elements are used throughout this book. The only hard and fast rule is that each attribute name must be unique within an element. If you find that an attribute must appear more than once for an element, you need to switch to using nested elements instead of attributes.
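The uniqueness rule is enforced by the parser itself. The following sketch uses Python's standard xml.etree.ElementTree module (an assumption made for illustration; the book's own examples use the .NET classes) to read the nickname attributes and to show that a repeated attribute name is rejected as not well formed.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<children>"
    '<child nickname="Little Bit">Carson Allen Evans</child>'
    '<child nickname="Teeny Bit"/>'
    "</children>"
)

# Attributes are exposed as name/value pairs on each element.
nicknames = [child.get("nickname") for child in doc.findall("child")]
print(nicknames)

# Repeating an attribute name on one element is a well-formedness error.
try:
    ET.fromstring('<child nickname="Little Bit" nickname="Teeny Bit"/>')
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True
print(duplicate_rejected)
```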

Two attributes are predefined in the XML 1.0 Recommendation: xml:space and xml:lang.

xml:space

The xml:space attribute enables the document's author to explicitly state how white space is to be handled within the document. XML parsers, such as those used in the .NET Framework classes, can strip white space from a document unless otherwise instructed. The xml:space attribute signifies how white space in the current node and its children should be treated. The valid values for xml:space are preserve and default. Specifying a value of preserve signifies to the processor that the current node and its children should have white space preserved. This behavior can then be overridden in a child node by using a value of default to signify that the default white space handling should again be used.

 <data xml:space="preserve">
     <a>     </a>
     <b xml:space="default">
         <c>     </c>
     </b>
 </data>

Using this example, space is preserved for the data and a nodes. The b node overrides the behavior of its parent and uses default white space handling. Node c then uses default white space handling because its parent, b, overrode the handling locally.
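The inheritance rule just described can be sketched as a small tree walk. This example is in Python and assumes the standard xml.etree.ElementTree module, which reports xml:space under the reserved XML namespace URI; the function name effective_space is hypothetical. It only computes which setting applies to each node and does not itself strip or preserve any white space.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix by its namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

doc = ET.fromstring(
    '<data xml:space="preserve">'
    "<a>     </a>"
    '<b xml:space="default"><c>     </c></b>'
    "</data>"
)


def effective_space(element, inherited="default"):
    """Yield (tag, setting) pairs, inheriting xml:space from the parent."""
    setting = element.get(XML_SPACE, inherited)
    yield element.tag, setting
    for child in element:
        yield from effective_space(child, setting)


settings = dict(effective_space(doc))
print(settings)
```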

Table 1.1 lists the definitions of white space.

Table 1.1. White Space Characters in XML

ASCII Character Code	Description
9	Horizontal tab (vbTab)
10	Line feed
13	Carriage return
32	Space character

The .NET Framework classes dealing with XML provide properties to specify white space handling. It is more typical to handle white space formatting in the object model and parser or XSLT stylesheet than it is to explicitly declare the white space formatting within the XML document itself because the document should specify data and not its presentation or behavior.

xml:lang

The xml:lang attribute is used for internationalization. The valid values are language codes defined in ISO 639, language tags defined in RFC 1766, or user-defined language codes. Table 1.2 shows you some examples of valid values for the xml:lang attribute.

Table 1.2. Examples of ISO 639 Language Codes and Country Subcodes

Language Code (and Subcode)	Description
en-US	English (U.S.)
en-GB	English (Great Britain)
fr	French
es-MX	Spanish (Mexico)

Here, you see three different languages being used within the same document: U.S. English, French, and Spanish:

 <?xml version="1.0" encoding="UTF-8"?>
 <phrases>
      <phrase xml:lang="en-US">One beer, please</phrase>
      <phrase xml:lang="es-MX">Una cerveza, por favor</phrase>
      <phrase xml:lang="fr">Une bière, s'il vous plaît</phrase>
 </phrases>
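A consumer of such a document can use xml:lang to pick the phrase for a given language. The sketch below is in Python and assumes the standard xml.etree.ElementTree module; the dictionary name by_lang is hypothetical, and only two of the phrases are reproduced to keep the example short.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix by its namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

doc = ET.fromstring(
    "<phrases>"
    '<phrase xml:lang="en-US">One beer, please</phrase>'
    '<phrase xml:lang="fr">Une bière, s\'il vous plaît</phrase>'
    "</phrases>"
)

# Index each phrase by its declared language.
by_lang = {phrase.get(XML_LANG): phrase.text for phrase in doc.iter("phrase")}
print(by_lang["en-US"])
```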

xmlns

Actually, one more attribute is valid for elements, xmlns, which was not a part of the XML 1.0 Recommendation. This attribute associates a node with a given namespace. Namespaces are discussed in more detail in the section "Namespaces."

Comments

XML documents can also contain comments. The syntax for creating a comment in XML is the same as it is in HTML:

 <customer>
    <!--The following nodes are commented out and unreachable
      <foo>Testing</foo>
      <bar>Another test</bar>
    -->
 </customer>

Comments can contain any type of data except the sequence --, and they cannot be nested. XML parsers are allowed to ignore comments, so a comment might not be reachable as a node. The .NET Framework classes that deal with XML recognize comments as processable nodes.
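To illustrate a parser that does expose comments as nodes, here is a sketch using Python's standard xml.dom.minidom module (chosen for illustration; it is not one of the .NET classes mentioned above). The markup inside the comment is treated as comment text, not as elements.

```python
from xml.dom import Node, minidom

doc = minidom.parseString(
    "<customer><!--The following nodes are commented out"
    " <foo>Testing</foo> --></customer>"
)
root = doc.documentElement

# Separate the comment nodes from any real child elements.
comments = [n.data for n in root.childNodes if n.nodeType == Node.COMMENT_NODE]
elements = [n for n in root.childNodes if n.nodeType == Node.ELEMENT_NODE]
print(len(comments), len(elements))
```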

CDATA

CDATA sections contain data that would otherwise be recognized as markup. For example, elements cannot contain literal angled brackets in their content; neither can attributes. They can, however, contain the corresponding entity references (&lt; or &gt;). CDATA sections provide a means for containing this type of data without escaping it.

 <script language="JavaScript">
 var i;
 function getIncrement()
 {
  return(++i);
 }
 </script>

To represent the preceding JavaScript function block in an XML document, you need to use character entity references to denote the literal values that would otherwise break up the document's organization, as shown here:

 <?xml version="1.0"?>
 <function>
 &lt;script language=&quot;JavaScript&quot;&gt;
 var i;
 function getIncrement()
 {
  return(++i);
 }
 &lt;/script&gt;
 </function>

Instead of using character entity references, you can use a CDATA section to contain the data, as follows:

 <?xml version="1.0"?>
 <function>
 <![CDATA[ <script language="JavaScript">
 var i;
 function getIncrement()
 {
  return(++i);
 }
 </script>
 ]]>
 </function>

Because the angled brackets and double quotes are contained within the CDATA section, the document is now well formed.
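Both representations carry the same character data once parsed. The following sketch, in Python with the standard xml.etree.ElementTree module (an assumption for illustration), parses a shortened version of the script in each form and compares the resulting text.

```python
import xml.etree.ElementTree as ET

# Markup characters written as entity references.
escaped = ET.fromstring(
    "<function>&lt;script language=&quot;JavaScript&quot;&gt;"
    "var i;&lt;/script&gt;</function>"
)

# The same characters written literally inside a CDATA section.
cdata = ET.fromstring(
    '<function><![CDATA[<script language="JavaScript">'
    "var i;</script>]]></function>"
)

print(escaped.text)
print(escaped.text == cdata.text)
```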

Processing Instructions

Processing instructions (PIs) give the XML processor hints on how the document needs to be handled.

One processing instruction is commonly used to link an XML document to a particular XSL stylesheet:

 <?xml-stylesheet type="text/xsl" href="mystylesheet.xsl"?>

You can also declare your own processing instructions by using the processing instruction syntax. The name of a PI (its target) must be a valid NCName and cannot begin with "xml" because that prefix is reserved. Here's an example:

 <?my-instruction this is my custom PI?>

As you can see in the preceding code line, a processing instruction does not need to contain a formal attribute. It can contain any content except for the closing delimiter, which is ?>.

A common misconception is that the following is also a PI:

 <?xml version="1.0" encoding="UTF-8"?>

This looks like a PI, but it is actually the optional XML declaration that specifies the XML version and the encoding scheme. This misconception is prevalent because the only way to create the declaration by using MSXML is to create a PI that looks like the XML declaration. The difference might seem trivial, but it matters in the .NET Framework, which distinguishes between processing instructions and the XML declaration. For example, the XmlDocument class contains the CreateProcessingInstruction method and a separate CreateXmlDeclaration method. In addition, the XmlDeclaration class is separate from the XmlProcessingInstruction class.
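Other DOM implementations draw the same line. As a hedged illustration outside .NET, the sketch below uses Python's standard xml.dom.minidom module, which reports the stylesheet PI as a processing instruction node while the XML declaration does not appear among the document's child nodes.

```python
from xml.dom import Node, minidom

doc = minidom.parseString(
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<?xml-stylesheet type="text/xsl" href="mystylesheet.xsl"?>'
    b"<root/>"
)

# Collect the target of every processing instruction at document level.
pi_targets = [
    node.target
    for node in doc.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]
print(pi_targets)
```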

Names

Both elements and attributes have an associated local name and a namespace. A local name, or NCName, begins with an underscore or an alphabetic character, followed by zero or more alphanumeric characters, periods (full stops), hyphens, or underscores[2]. A name cannot have spaces in it. The following element names are valid:

 <Customers></Customers>

and

 <products></products>

and

 <_orders></_orders>

The following element name is not valid because it contains a space:

 <car type></car type>
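The naming rule above can be approximated with a regular expression. The following Python sketch is deliberately simplified: it accepts ASCII letters and digits only, whereas the real NCName production also allows many non-ASCII characters. The names NCNAME_ASCII and looks_like_ncname are hypothetical.

```python
import re

# Simplified, ASCII-only approximation of the NCName rule:
# a letter or underscore, then letters, digits, periods, hyphens, underscores.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def looks_like_ncname(name: str) -> bool:
    """Return True if name matches the simplified NCName pattern."""
    return NCNAME_ASCII.fullmatch(name) is not None


for candidate in ["Customers", "products", "_orders", "car type", "1order"]:
    print(candidate, looks_like_ncname(candidate))
```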

Namespaces

As mentioned previously, all elements and attributes also have an associated namespace that, combined with the local name, forms the name. A namespace is a Uniform Resource Identifier (URI), which is either a Uniform Resource Locator (URL) or Uniform Resource Name (URN). If not expressly declared, an element or attribute falls into the default namespace, which might be the null namespace.

The Visual Studio .NET Schema Designer uses a default namespace of www.tempuri.org. This might seem to imply that the document must be found at that location; however, this is not the case.

All domain names on the web are unique. For example, I cannot register the domain name Microsoft.com because that domain name is already taken. Because domain names are unique across the Internet, basing a namespace on your own domain name is an easy way to control its uniqueness, which is the rationale behind using a URL as a namespace. URLs are commonly used and understood, in theory, by most Internet users. URNs, however, are not familiar to many developers, so they simply choose what is convenient or better known. Neither has a real advantage over the other, except that a URN can show more clearly that a unique identifier does not imply the resource's location. I could have easily used example.com, tempuri.org, foo.bar, anything.anywhere, or any other valid URL, even though I might not have access to that URL and the URL might not even exist. There's no guarantee that the namespace has anything to do with the location of the document; it only implies uniqueness.

After you understand that the namespace name does not imply its location, you're ready to move on to the concept of namespaces and their use. Namespaces prevent naming collisions.

The following XML document expresses siblings in the Evans family:

 <siblings>
    <name>Bob Evans</name>
    <name>Keith Evans</name>
    <name>Michelle Schultz</name>
 </siblings>

The <name> element designates a sibling's name. But what if you want to distinguish between brothers and sisters? In that case, you can use namespaces, as follows:

 <siblings xmlns:brother="urn:sibling:brother"
           xmlns:sister="urn:sibling:sister">
    <brother:name>Bob Evans</brother:name>
    <brother:name>Keith Evans</brother:name>
    <sister:name>Michelle Schultz</sister:name>
 </siblings>

Now you can differentiate between brother and sister elements while using the same local name, name, for each element. A namespace declaration is composed of a namespace prefix and the URI that's associated with the namespace. Together, the namespace prefix and the local name form a qualified name, which is also referred to as a QName.
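A namespace-aware parser identifies each element by its namespace URI and local name, not by the prefix used in the document. The sketch below shows this in Python with the standard xml.etree.ElementTree module (an assumption for illustration); the short prefixes b and s in the lookup map are arbitrary and need not match the prefixes in the document.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<siblings xmlns:brother="urn:sibling:brother"'
    ' xmlns:sister="urn:sibling:sister">'
    "<brother:name>Bob Evans</brother:name>"
    "<brother:name>Keith Evans</brother:name>"
    "<sister:name>Michelle Schultz</sister:name>"
    "</siblings>"
)

# Prefixes here are local to the query; only the URIs have to match.
ns = {"b": "urn:sibling:brother", "s": "urn:sibling:sister"}
brothers = [e.text for e in doc.findall("b:name", ns)]
sisters = [e.text for e in doc.findall("s:name", ns)]
print(brothers, sisters)

# ElementTree stores the expanded name as {namespace-URI}local-name.
print(doc[0].tag)
```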

Suppose that you are developing an XSLT stylesheet. You want to output a node, message, and the XSLT namespace also includes a node called message. How do you differentiate between the two nodes and what their meanings are? The answer is to bind a prefix to the namespace and qualify the element names, as shown here:

 <xsl:stylesheet
       xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
       xmlns="urn:schemas-vbdna:sample">
     <xsl:template match="/">
         <hello>This is a sample, bound to the default namespace</hello>
         <message>This element is also bound to the default namespace</message>
         <xsl:message>However, this is a message from XSLT</xsl:message>
     </xsl:template>
 </xsl:stylesheet>

Two different namespaces are used in this example. One namespace was declared with a prefix, xsl, and the other used no prefix. By using the namespace prefix xsl, you actually change the name of the element. The XML parser and XPath functions now recognize the element name as <xsl:message>.

The namespace declared with no prefix at the root level of the document defines the default namespace. The <message> element is bound to the default namespace, as is the <hello> element.

Namespaces do not need to be declared at the root or document level; they can be declared at the element level as well. Here's an example:

 <?xml version="1.0"?>
 <doc>
     <test>This is a test</test>
     <data xmlns="example">
         <child>Bound to the example namespace</child>
         <sibling xmlns="">No longer bound</sibling>
     </data>
 </doc>

In this example, the <doc> element is not explicitly bound to any namespace, so it belongs to the null namespace. As previously mentioned, every element and attribute name consists of a local name and an associated namespace. The <test> element also belongs to the null namespace. The <data> element, however, is explicitly bound to the namespace example. When an element declares a default namespace, all its children are also bound to that namespace unless it's overridden. Therefore, the <child> element is bound to the example namespace.

To override a namespace, simply set the xmlns attribute to a new value. The <sibling> element, for example, overrides the namespace inherited from its parent by setting xmlns to an empty string. The <sibling> element and all its child elements then belong to the null namespace.
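The scoping rules for default namespaces can be observed by listing each element's expanded name. This is a minimal sketch in Python, assuming the standard xml.etree.ElementTree module, which writes a bound name as {namespace}local-name and leaves names in the null namespace bare.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<doc>"
    "<test>This is a test</test>"
    '<data xmlns="example">'
    "<child>Bound to the example namespace</child>"
    '<sibling xmlns="">No longer bound</sibling>'
    "</data>"
    "</doc>"
)

# Walk the tree in document order and collect each expanded element name.
tags = [element.tag for element in doc.iter()]
print(tags)
```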

Note

Namespaces are also covered in depth in Chapter 2, "XML Schemas in .NET."

Entities

Documents might need to contain angled brackets or quotation marks in element content. Containing all such data in CDATA sections to handle one or two occurrences of reserved characters would be tedious. In order to handle this, XML provides entity references that represent delimiter characters. Table 1.3 shows the five entity references provided by the XML 1.0 Recommendation.

Table 1.3. Entity References Provided by XML

Entity Reference	ASCII Code	Escape Character
`&amp;`	38	Ampersand (&)
`&quot;`	34	Double quote (")
`&apos;`	39	Apostrophe (')
`&lt;`	60	Less than (<)
`&gt;`	62	Greater than (>)
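Most XML libraries can apply these entity references for you. As an illustration outside .NET, the following Python sketch uses the standard library helpers xml.sax.saxutils.escape and quoteattr; the sample strings are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

raw = 'if (a < b && b > c) say("hi");'

# escape() replaces &, <, and > with their entity references.
escaped = escape(raw)
print(escaped)

# Parsing the escaped text yields the original characters again.
element = ET.fromstring(f"<code>{escaped}</code>")
print(element.text == raw)

# quoteattr() returns a quoted, escaped value that is safe as an attribute.
attribute = quoteattr('5" disk')
disk = ET.fromstring(f"<disk size={attribute}/>")
print(disk.get("size"))
```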

Besides entity references, you can also use character references. Character references use either decimal or hexadecimal to represent the character. When using decimal, the value must be preceded by &# and followed by a trailing semicolon, as shown here:

 &#160;  <!-- outputs a non-breaking space-->

If using hexadecimal, the value must be preceded by &#x and followed by a trailing semicolon, as shown here:

 &#xA0;  <!-- outputs a non-breaking space -->

Some entity references are available only within a given document type. For example, the non-breaking space entity reference (&nbsp;) is defined for HTML, but not for XML or XSLT. If you're using an XSLT stylesheet to generate XHTML, you might want to output a non-breaking space. To do this, you need to use either the decimal or hexadecimal character reference for a non-breaking space. Chapter 3, "XML Presentation," looks deeper into this issue.
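The difference between character references and the HTML-only entity can be seen with any XML parser. The sketch below is in Python and assumes the standard xml.etree.ElementTree module: both numeric forms yield the same character, while the named entity is rejected because no DTD defines it.

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal character references for the same character.
decimal = ET.fromstring("<p>&#160;</p>").text
hexadecimal = ET.fromstring("<p>&#xA0;</p>").text
print(decimal == hexadecimal)

# The named entity is not one of the five that XML predefines.
try:
    ET.fromstring("<p>&nbsp;</p>")
    nbsp_entity_defined = True
except ET.ParseError:
    nbsp_entity_defined = False
print(nbsp_entity_defined)
```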
