XML Documents | Beginning Visual C#® 2005

A complete set of data in XML is known as an XML document. An XML document could be a physical file on your computer or just a string in memory. However, it has to be complete in itself, and it must obey certain rules (you'll see what these are shortly). An XML document is made up of a number of different parts. The most important of these are XML elements, which contain the actual data of the document.

XML Elements

XML elements consist of an opening tag (the name of the element enclosed in angled brackets, such as <myElement>), the data within the element, and a closing tag (the same as the opening tag, but with a forward slash after the opening bracket: </myElement>).

For example, you might define an element to hold the title of a book like this:

<book>Tristram Shandy</book>

If you already know some HTML, you might be thinking that this looks very similar, and you'd be right! In fact, HTML and XML share much of the same syntax. The big difference is that XML doesn't have any predefined elements: you choose the names of your own elements, so there's no limit to the number of elements you can have. The most important point to remember is that XML, despite its name, isn't actually a language at all. Rather, it's a standard for defining languages (known as XML applications). Each of these languages has its own distinct vocabulary: a specific set of elements that can be used in the document and the structure these elements are allowed to take. As you'll shortly see, you can explicitly limit the elements allowed in the XML document. Alternatively, you can allow any elements, and allow the program using the document to work out for itself what the structure is.

Element names are case-sensitive, so <book> and <Book> are counted as different elements. This means that if you attempt to close a <book> element using a closing tag that doesn't have identical casing (for example, </BOOK>), your XML document won't be legal. Programs that read XML documents and analyze them by examining their individual elements are known as XML parsers, and they will reject any document that contains illegal XML.
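As a rough illustration of a parser rejecting mismatched casing, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree module as the parser. The choice of Python and the helper name parses are assumptions made for this example; they are not part of the chapter's own toolset.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The closing tag matches the opening tag exactly.
matching = parses("<book>Tristram Shandy</book>")

# The closing tag differs only in casing, so the element is never closed.
mismatched = parses("<book>Tristram Shandy</BOOK>")

print(matching, mismatched)
```

Any conforming XML parser should make the same distinction, although the error it reports will differ from one parser to another.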

Elements can also contain other elements, so you could modify this <book> element to include the author as well as the title by adding two subelements:

<book>
   <title>Tristram Shandy</title>
   <author>Lawrence Sterne</author>
</book>

However, overlapping elements aren't allowed, so you must close all subelements before the closing tag of the parent element. This means, for example, that you can't do this:

<book>
   <title>Tristram Shandy
   <author>Lawrence Sterne
   </title></author>
</book>

This is illegal, because the <author> element is opened within the <title> element, but the closing </title> tag comes before the closing </author> tag.
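To see the nesting rule in action, the following sketch feeds both versions of the <book> element to Python's standard-library xml.etree.ElementTree parser. Python is an assumption of this example, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

nested = """<book>
   <title>Tristram Shandy</title>
   <author>Lawrence Sterne</author>
</book>"""

overlapping = """<book>
   <title>Tristram Shandy
   <author>Lawrence Sterne
   </title></author>
</book>"""

# Properly nested: the subelements are available in document order.
book = ET.fromstring(nested)
child_names = [child.tag for child in book]

# Overlapping: the parser refuses the whole document.
try:
    ET.fromstring(overlapping)
    overlap_error = None
except ET.ParseError as exc:
    overlap_error = str(exc)  # the wording of the message is parser-specific

print(child_names)
print("rejected" if overlap_error else "accepted")
```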

There's one exception to the rule that all elements must have a closing tag. It's possible to have "empty" elements, with no nested data or text. In this case, you can simply add the closing tag straight after the opening tag, or you can use a shorthand syntax, adding the slash of the closing tag to the end of the opening tag:

<book />

This is identical to the full syntax:

<book></book>
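A quick way to convince yourself that the two forms are equivalent is to parse both and compare the results. This sketch assumes Python's standard-library xml.etree.ElementTree module; the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

shorthand = ET.fromstring("<book />")
full = ET.fromstring("<book></book>")

# Both forms yield an element with the same name...
same_name = shorthand.tag == full.tag

# ...with no text and no child elements...
both_empty = (
    shorthand.text is None and full.text is None
    and len(shorthand) == 0 and len(full) == 0
)

# ...and both are written back out identically by this library.
same_serialization = ET.tostring(shorthand) == ET.tostring(full)

print(same_name, both_empty, same_serialization)
```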

Attributes

As well as storing data within the body of the element, you can also store data within attributes, which are added within the opening tag of an element. Attributes are in the form

name="value"

where the value of the attribute must be enclosed in either single or double quotes. For example:

<book title="Tristram Shandy"></book>

<book title='Tristram Shandy'></book>

These are both legal, but this is not:

<book title=Tristram Shandy></book>
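The quoting rule can be checked with a parser. The sketch below assumes Python's standard-library xml.etree.ElementTree module, whose get method returns the value of a named attribute; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

double_quoted = ET.fromstring('<book title="Tristram Shandy"></book>')
single_quoted = ET.fromstring("<book title='Tristram Shandy'></book>")

# Either quoting style produces the same attribute value.
titles = (double_quoted.get("title"), single_quoted.get("title"))

# Without quotes the document is rejected.
try:
    ET.fromstring("<book title=Tristram Shandy></book>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False

print(titles, unquoted_accepted)
```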

At this point, you may be wondering why you need both ways of storing data in XML. What's the difference between

<book>
   <title>Tristram Shandy</title>
</book>

and

<book title="Tristram Shandy"></book>

The honest answer is that there isn't any earth-shatteringly fundamental difference between the two, and there isn't really any big advantage to using either. Elements are a better choice if there's a possibility that you'll need to add more information about that piece of data later: you can always add a subelement or an attribute to an element, but you can't do that for attributes. Arguably, elements are more readable and more elegant (but that's really a matter of personal taste). On the other hand, attributes consume less bandwidth if the document is sent over a network without compression (with compression there's not much difference) and are convenient for holding information that isn't essential to every user of the document. Probably the best advice is to use both, selecting whichever you're most comfortable with for storing a particular item of data, but there really are no hard and fast rules.
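From a program's point of view the two styles are read in slightly different ways, and only the element form leaves room to grow. The following is a sketch assuming Python's standard-library xml.etree.ElementTree module; the language attribute added at the end is hypothetical and exists only to illustrate the point.

```python
import xml.etree.ElementTree as ET

as_element = ET.fromstring("<book><title>Tristram Shandy</title></book>")
as_attribute = ET.fromstring('<book title="Tristram Shandy"></book>')

# Reading the same piece of data from each form.
title_from_element = as_element.findtext("title")
title_from_attribute = as_attribute.get("title")

# An element can later carry extra information about its data,
# here a hypothetical "language" attribute on <title>.
title = as_element.find("title")
title.set("language", "English")

# An attribute value is just a string, so there is nothing to attach to.
attribute_value_type = type(title_from_attribute).__name__

print(title_from_element, title_from_attribute, attribute_value_type)
```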

The XML Declaration

In addition to elements and attributes, XML documents can contain a number of other constituent parts. These individual parts of an XML document are known as nodes; elements, the text within elements, and attributes are all nodes of the XML document. Many of these are important only if you really want to delve deeply into XML. However, one type of node occurs in almost every XML document. It is the XML declaration, and if you include it, it must occur as the first node of the document.

The XML declaration is similar in format to an element, but has question marks inside the angled brackets. It always has the name xml, and it always has an attribute named version; currently, the only possible value for this is "1.0". The simplest possible form of the XML declaration is, therefore:

<?xml version="1.0"?>

Note

As of February 2004 W3C (www.w3c.org) has released a recommendation for XML 1.1, but at the time of this writing there are few or no real-life implementations of this recommendation.

Optionally, it can also contain the attributes encoding (with a value indicating the character set that should be used to read the document, such as "UTF-16" to indicate that the document uses the 16-bit Unicode character set) and standalone (with the value "yes" or "no" to indicate whether the XML document depends on any other files). However, these attributes are not required, and you will probably include only the version attribute in your own XML files.
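The effect of the declaration can be sketched with a parser. This example assumes Python's standard-library xml.etree.ElementTree module and uses a made-up title and the ISO-8859-1 encoding purely for illustration: the encoding attribute tells the parser how to turn the bytes into characters, and a declaration that isn't at the very start of the document is rejected.

```python
import xml.etree.ElementTree as ET

# Raw bytes in ISO-8859-1; byte 0xE9 is the letter e with an acute accent.
latin1_document = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<title>Caf\xe9 Society</title>"
)
decoded_title = ET.fromstring(latin1_document).text

# The declaration must come first: even a leading newline is too much.
misplaced = '\n<?xml version="1.0"?><title>Tristram Shandy</title>'
try:
    ET.fromstring(misplaced)
    misplaced_accepted = True
except ET.ParseError:
    misplaced_accepted = False

print(decoded_title, misplaced_accepted)
```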

Structure of an XML Document

One of the most important things about XML is that it offers a way of structuring data that is very different from relational databases. Most modern database systems store data in tables that are related to each other through values in individual columns. Each table stores data in rows and columns: each row represents a single record, and each column a particular item of data about that record. In contrast, XML data is structured hierarchically, a little like the folders and files in Windows Explorer. Each document must have a single root element within which all elements and text data are contained. If there is more than one element at the top level of the document, the document will not be legal XML. However, you can include other XML nodes at the top level, notably the XML declaration. So this is a legal XML document:

<?xml version="1.0"?>
<books>
   <book>Tristram Shandy</book>
   <book>Moby Dick</book>
   <book>Ulysses</book>
</books>

But this isn't:

<?xml version="1.0"?>
<book>Tristram Shandy</book>
<book>Moby Dick</book>
<book>Ulysses</book>

Under this root element, you have a great deal of flexibility about how you structure the data. Unlike relational data, in which every row has the same number of columns, there's no restriction on the number of subelements an element can have. And, although XML documents are often structured similarly to relational data, with an element for each record, XML documents don't need any predefined structure at all. This is one of the major differences between traditional relational databases and XML. Where relational databases always define the structure of the information before any data can be added, information can be stored in XML without this initial overhead, which makes it a very convenient way to store small blocks of data. As you will see shortly, it is quite possible to provide a structure for your XML, but unlike relational databases, no one will enforce this structure unless you ask for it explicitly.
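The single-root rule described above is easy to test with a parser. The sketch below assumes Python's standard-library xml.etree.ElementTree module and reuses the two book lists from this section; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

single_root = """<?xml version="1.0"?>
<books>
   <book>Tristram Shandy</book>
   <book>Moby Dick</book>
   <book>Ulysses</book>
</books>"""

several_roots = """<?xml version="1.0"?>
<book>Tristram Shandy</book>
<book>Moby Dick</book>
<book>Ulysses</book>"""

# One root element: the records underneath it can be walked like a hierarchy.
books = ET.fromstring(single_root)
titles = [book.text for book in books]

# Three top-level elements: the parser stops after the first one.
try:
    ET.fromstring(several_roots)
    several_roots_accepted = True
except ET.ParseError:
    several_roots_accepted = False

print(titles, several_roots_accepted)
```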

XML Namespaces

As you learned in Chapter 9, everyone can define their own C# classes. Likewise, everyone can define their own XML elements, and this gives rise to exactly the same problem: how do you know which elements belong to which vocabulary? As you might gather from the title of this section, this question is answered in a similar way. Just as you define namespaces to organize your C# types, you use XML namespaces to define your XML vocabularies. This allows you to include elements from a number of different vocabularies within a single XML document, without the risk of misinterpreting elements because (for example) two different vocabularies define a <customer> element.

XML namespaces can be quite complex, so I won't go into great detail here, but the basic syntax is simple. Specific elements or attributes are associated with a specific namespace using a prefix, followed by a colon. For example, <wrox:book> represents a <book> element that resides in the wrox namespace. But how do you know what namespace wrox represents? For this approach to work, you need to be able to guarantee that every namespace is unique. The easiest way to do this is to map the prefixes to something that's already known to be unique. And this is exactly what happens: somewhere in your XML document you need to associate any namespace prefixes with a Uniform Resource Identifier (URI). URIs come in several flavors, but the most common type is simply a Web address, such as "http://www.wrox.com".

To identify a prefix with a specific namespace, use the xmlns:prefix attribute within an element, setting its value to the unique URI that identifies that namespace. The prefix can then be used anywhere within that element, including any nested child elements. For example:

<?xml version="1.0"?>
<books>
   <book xmlns:wrox="http://www.wrox.com">
      <wrox:title>Beginning C#</wrox:title>
      <wrox:author>Karli Watson</wrox:author>
   </book>
</books>

Here, you can use the wrox: prefix with the <title> and <author> elements, because they are within the <book> element, where the prefix is defined. However, if you tried to add this prefix to the <books> element, the XML would be illegal, as the prefix isn't defined for this element.
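The following sketch shows how a namespace-aware parser treats this document. It assumes Python's standard-library xml.etree.ElementTree module, which reports each qualified name as the URI in braces followed by the local name. Note that it is the URI, not the prefix, that identifies the namespace: the query below uses a different prefix (w, chosen arbitrarily) and still finds the element.

```python
import xml.etree.ElementTree as ET

WROX = "http://www.wrox.com"

document = """<?xml version="1.0"?>
<books>
   <book xmlns:wrox="http://www.wrox.com">
      <wrox:title>Beginning C#</wrox:title>
      <wrox:author>Karli Watson</wrox:author>
   </book>
</books>"""

book = ET.fromstring(document).find("book")

# The prefix is replaced by the URI it was mapped to.
qualified_name = book.find(f"{{{WROX}}}title").tag

# A query can map any prefix it likes to the same URI.
author = book.findtext("w:author", namespaces={"w": WROX})

# Using the prefix on <books>, outside the element that defines it, is an error.
out_of_scope = document.replace("<books>", "<wrox:books>").replace(
    "</books>", "</wrox:books>"
)
try:
    ET.fromstring(out_of_scope)
    out_of_scope_accepted = True
except ET.ParseError:
    out_of_scope_accepted = False

print(qualified_name, author, out_of_scope_accepted)
```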

You can also define a default namespace for an element using the xmlns attribute:

<?xml version="1.0"?>
<books>
   <book xmlns="http://www.wrox.com">
      <title>Beginning Visual C#</title>
      <author>Karli Watson</author>
      <html:img src="/books/3/459/1/html/2/begvcsharp.gif"
                xmlns:html="http://www.w3.org/1999/xhtml" />
   </book>
</books>

Here, the default namespace for the <book> element is defined as "http://www.wrox.com". Everything within this element will, therefore, belong to this namespace, unless you explicitly request otherwise by adding a different namespace prefix, as you do for the <img> element (you set it to the namespace used by XML-compatible HTML documents).
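To check which namespace each element ends up in, you can parse the document and look at the qualified names. This sketch assumes Python's standard-library xml.etree.ElementTree module; the variable names are made up for the example. Notice that the <book> element itself belongs to the default namespace it declares, so a lookup for an unqualified book finds nothing.

```python
import xml.etree.ElementTree as ET

document = """<?xml version="1.0"?>
<books>
   <book xmlns="http://www.wrox.com">
      <title>Beginning Visual C#</title>
      <author>Karli Watson</author>
      <html:img src="/books/3/459/1/html/2/begvcsharp.gif"
                xmlns:html="http://www.w3.org/1999/xhtml" />
   </book>
</books>"""

books = ET.fromstring(document)

# <books> sits outside the default namespace declaration.
root_name = books.tag

# <book> and its unprefixed children belong to the default namespace;
# <html:img> belongs to the XHTML namespace.
book = books[0]
book_name = book.tag
child_names = [child.tag for child in book]

# Looking for a plain "book" misses the namespaced element.
unqualified_lookup = books.find("book")

print(root_name, book_name)
print(child_names)
```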

Well-Formed and Valid XML

I've been talking up until now about legal XML. In fact, XML distinguishes between two forms of legality. Documents that obey all the rules required by the XML standard itself are said to be well-formed. If an XML document is not well-formed, parsers will be unable to interpret it correctly, and will reject the document. To be well-formed, a document must:

Have one and only one root element
Have closing tags for every element (except for the shorthand syntax mentioned previously)
Not have any overlapping elements — all child elements must be fully nested within the parent
Have all attribute values enclosed in quotes

This isn't a complete list, by any means, but it does highlight the most common mistakes made by programmers who are new to XML.

However, XML documents can obey all these rules and still not be valid. Remember that I said earlier that XML is not itself a language, but a standard for defining XML applications. Well-formed XML documents simply comply with the XML standard; to be valid, they also need to conform to any rules specified for the XML application. Not all parsers check whether documents are valid; those that do are said to be validating parsers. But to check whether a document adheres to the rules of the application, you first need a way to specify what those rules are.
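Until a way of specifying rules has been introduced, the distinction can be sketched with a rule written by hand. The example below assumes Python's standard-library xml.etree.ElementTree module, which checks only well-formedness. The application rule it applies (the root must be <books> and may contain only <book> elements) is invented for this illustration, and the function is a stand-in for a validating parser, not a real one.

```python
import xml.etree.ElementTree as ET


def classify(text):
    """Classify a document as not well-formed, well-formed only, or valid."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return "not well-formed"
    # Hypothetical application rule, hard-coded for this sketch only.
    if root.tag == "books" and all(child.tag == "book" for child in root):
        return "valid"
    return "well-formed but not valid"


print(classify("<books><book>Ulysses</book></books>"))
print(classify("<books><magazine>Ulysses</magazine></books>"))
print(classify("<books><book>Ulysses</books>"))
```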

Validating XML Documents

XML supports two ways of defining which elements and attributes can be placed in a document and in what order: Document Type Definitions (DTDs) and schemas. DTDs use a non-XML syntax inherited from SGML, the parent of XML, and are gradually being replaced by schemas. DTDs don't allow you to specify the data types of the elements and attributes, and so are relatively inflexible and not used that much in the context of the .NET Framework. Schemas, on the other hand, are used frequently: they do allow you to specify data types, and they are written in an XML-compatible syntax. However, schemas are unfortunately very complex, and there are different formats for defining them, even within the .NET world!

Schemas

There are two separate formats for schemas supported by .NET: XML Schema Definition language (XSD) and XML-Data Reduced schemas (XDR). These formats are mutually incompatible, and you really need to be very familiar with XML before you attempt to write a schema in either of them, so I won't go into great detail here. Schemas can be either included within your XML document or kept in a separate file. It is, however, useful to be able to recognize the main elements in a schema, so I will explain the basic principles. To do this, you'll look at sample XSD and XDR schemas for this simple XML document, which contains basic details about a couple of Wrox's C# books:

<?xml version="1.0"?>
<books>
   <book>
      <title>Beginning Visual C#</title>
      <author>Karli Watson</author>
      <code>7582</code>
   </book>
   <book>
      <title>Professional C# 2nd Edition</title>
      <author>Simon Robinson</author>
      <code>7043</code>
   </book>
</books>

XSD Schemas

Elements in XSD schemas must belong to the namespace http://www.w3.org/2001/XMLSchema. If this namespace isn't included, the schema elements won't be recognized.

To associate the XML document with an XSD schema in another file, you need to add a schemalocation attribute to the root element:

<?xml version="1.0"?>
<books schemalocation="file://C:\BegVCSharp\XML\books.xsd">
   ...
</books>

Take a quick look at an example XSD schema:

<schema xmlns="http://www.w3.org/2001/XMLSchema">
   <element name="books">
      <complexType>
         <choice maxOccurs="unbounded">
            <element name="book">
               <complexType>
                  <sequence>
                     <element name="title" />
                     <element name="author" />
                     <element name="code" />
                  </sequence>
               </complexType>
            </element>
         </choice>
         <attribute name="schemalocation" />
      </complexType>
   </element>
</schema>

The first thing to notice here is that the default namespace is set to the XSD namespace. This tells the parser that all the elements in the document belong to the schema. If you don't specify this namespace, the parser will think that the elements are just normal XML elements and won't realize that it needs to use them for validation.

The entire schema is contained within an element called <schema> (with a lowercase "s" — remember that case is important!). Each element that can occur within the document must be represented by an <element> element. This element has a name attribute that indicates the name of the element. If the element is to contain nested child elements, you must include the <element> tags for these within a <complexType> element. Inside this, you specify how the child elements must occur. For example, you use a <choice> element to specify that any selection of the child elements can occur or <sequence> to specify that the child elements must appear in the same order as they are listed in the schema. If an element can appear more than once (as the <book> element does), you need to include a maxOccurs attribute within its parent element. Setting this to "unbounded" means that the element can occur as often as you like. Finally, any attributes must be represented by <attribute> elements, including your schemalocation attribute that tells the parser where to find the schema. Place this after the end of the list of child elements.
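Because an XSD schema is itself an XML document, you can inspect it with an ordinary parser. The sketch below assumes Python's standard-library xml.etree.ElementTree module and simply lists what the schema above declares. It does not validate anything (this module has no XSD validation), but it does show why the namespace matters: searching for a plain element, without the XSD namespace, finds nothing.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"

schema_text = """<schema xmlns="http://www.w3.org/2001/XMLSchema">
   <element name="books">
      <complexType>
         <choice maxOccurs="unbounded">
            <element name="book">
               <complexType>
                  <sequence>
                     <element name="title" />
                     <element name="author" />
                     <element name="code" />
                  </sequence>
               </complexType>
            </element>
         </choice>
         <attribute name="schemalocation" />
      </complexType>
   </element>
</schema>"""

schema = ET.fromstring(schema_text)

# Every <element> and <attribute> declaration, in document order.
declared_elements = [e.get("name") for e in schema.iter(f"{{{XSD}}}element")]
declared_attributes = [a.get("name") for a in schema.iter(f"{{{XSD}}}attribute")]

# The <choice> that wraps <book> is what allows it to repeat.
repeat_limit = schema.find(f".//{{{XSD}}}choice").get("maxOccurs")

# Without the namespace, the same search comes back empty.
found_without_namespace = list(schema.iter("element"))

print(declared_elements, declared_attributes, repeat_limit)
```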

XDR Schemas

To attach an external XDR schema to an XML document, you specify a namespace for the document with the value "x-schema:<schema_filename>":

<?xml version="1.0"?>
<books xmlns="x-schema:books.xdr">
   ...
</books>

The schema that follows is the XDR equivalent of the XSD schema you just looked at. As you can see, it is very different:

<Schema xmlns="urn:schemas-microsoft-com:xml-data">
   <ElementType name="title" content="textOnly" />
   <ElementType name="author" content="textOnly" />
   <ElementType name="code" content="textOnly" />
   <ElementType name="book" content="eltOnly">
      <group order="seq">
         <element type="title" />
         <element type="author" />
         <element type="code" />
      </group>
   </ElementType>
   <ElementType name="books" content="eltOnly">
      <element type="book" />
   </ElementType>
</Schema>

Again, the default namespace is set to tell the parser that all elements in the document belong to the schema definition, this time to "urn:schemas-microsoft-com:xml-data". Notice that (unlike XSD schemas) this is a proprietary format, so it won't work at all with non-Microsoft products. In fact, XDR schemas are particularly useful when working with SQL Server, Microsoft's database server, because it has in-built support for XDR.

This time the root element is <Schema> with a capital "S." This root element again contains the entire schema definition (remember that XML documents must have a single root element). After this, though, there's a big difference: the elements that will appear in your document are defined in reverse order! The reason for this is that each element in the document is represented in the schema by an <ElementType> element, and this contains an <element> element (note the lowercase e here) for each child element. Within the <element> tags, the type attribute is set to point to an <ElementType> element, and this must already have been defined. If you want to restrict how child elements can appear, you can use a <group> element within the <ElementType> and set its order attribute. In this case, it is set to "seq" to specify that the elements occur in the same sequence as in the schema, just like the <sequence> tag in the XSD schema!
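The "reverse order" layout is easier to see if you read the schema with a parser and record which child types each <ElementType> refers to. This is a sketch assuming Python's standard-library xml.etree.ElementTree module; it only inspects the schema text shown above and is not an XDR validator. The names children_by_type and defined_before_use are made up for the example.

```python
import xml.etree.ElementTree as ET

XDR = "urn:schemas-microsoft-com:xml-data"

schema_text = """<Schema xmlns="urn:schemas-microsoft-com:xml-data">
   <ElementType name="title" content="textOnly" />
   <ElementType name="author" content="textOnly" />
   <ElementType name="code" content="textOnly" />
   <ElementType name="book" content="eltOnly">
      <group order="seq">
         <element type="title" />
         <element type="author" />
         <element type="code" />
      </group>
   </ElementType>
   <ElementType name="books" content="eltOnly">
      <element type="book" />
   </ElementType>
</Schema>"""

schema = ET.fromstring(schema_text)

children_by_type = {}      # ElementType name -> types of its child elements
defined_before_use = True  # does every reference point at an earlier definition?

for element_type in schema.findall(f"{{{XDR}}}ElementType"):
    # Child references may sit directly inside or within a <group>.
    references = [ref.get("type") for ref in element_type.iter(f"{{{XDR}}}element")]
    if not set(references) <= set(children_by_type):
        defined_before_use = False
    children_by_type[element_type.get("name")] = references

for name, references in children_by_type.items():
    print(name, references)
print(defined_before_use)
```

The innermost elements (title, author, and code) come first and the root element books comes last, which is the opposite of the nesting order in the XSD version.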

Now you've covered the basic theory behind XML, so in the following Try It Out, you can have a go at creating XML documents. Fortunately, VS does a lot of the hard work for you, and it will even create an XSD schema based on your XML document without you having to write a single line of code!

Try It Out – Creating an XML Document in Visual Studio

Follow these steps to create an XML document.

Open VS and select File > New > File... from the menu (you don't need to have a project already open).
In the New File menu, select XML File and click Open. VS will create a new XML document for you. As Figure 23-1 shows, VS adds the XML declaration, complete with an encoding attribute (it also colors the attributes and elements, but this won't show up well in black and white print):

Figure 23-1
Save the file by pressing Ctrl+S or by selecting File > Save XMLFile1.xml from the menu. VS will ask you where to save the file and what to call the file; save it in the BegCSharp\Chapter23 folder as GhostStories.xml.
Move the cursor to the line underneath the XML declaration, and type the text <stories>. Notice how VS automatically puts the end tag in as soon as you type the greater than sign to close the opening tag.

Type in this XML file:

<?xml version="1.0" encoding="utf-8" ?>
<stories>
   <story>
      <title>A House in Aungier Street</title>
      <author>
         <name>Sheridan Le Fanu</name>
         <nationality>Irish</nationality>
      </author>
      <rating>eerie</rating>
   </story>
   <story>
      <title>The Signalman</title>
      <author>
         <name>Charles Dickens</name>
         <nationality>English</nationality>
      </author>
      <rating>atmospheric</rating>
   </story>
   <story>
      <title>The Turn of the Screw</title>
      <author>
         <name>Henry James</name>
         <nationality>American</nationality>
      </author>
      <rating>a bit dull</rating>
   </story>
</stories>

Right-click on the XML in the code window and select View Data Grid from the pop-up menu. Visual Studio displays the data from the XML file in a tabular format, as shown in Figure 23-2, as though it came from a relational database.

Figure 23-2

Note

Visual Studio is now displaying two tabs with the same caption (GhostStories.xml). These two windows represent two different views for the same file — if you change something in one it is immediately reflected in the other.

You can actually edit the data in this table, so you can modify your XML document here without even having to type the tags. Click on the box for the title column in the empty row at the bottom of the grid, and type Number 13. Now move to the rating box beside it, and type mysterious. This enters a new story, but you still need to enter the author. To do this, click on the plus sign next to the new row. This will bring up a link for the <author> element, as shown in Figure 23-3.

Figure 23-3
Click this link and another table will be displayed where you can enter the name and nationality of the author. Enter MR James and English in the two columns (make sure that you press Enter after typing the nationality, or the data will be lost). You should now see the information shown in Figure 23-4.

Figure 23-4

Now return to the first tab labeled GhostStories.xml. A new <story> element should have been added just before the closing </stories> tag:

   <story>
      <title>Number 13</title>
      <rating>mysterious</rating>
      <author>
         <name>MR James</name>
         <nationality>English</nationality>
      </author>
   </story>

As a final party trick, you can get VS to create an XSD schema for this XML document. On the XML menu, click Create Schema to have VS create a new XSD schema file for you that represents the data in the original XML document.
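What the data grid does with GhostStories.xml can be approximated in a few lines of code. The sketch below assumes Python's standard-library xml.etree.ElementTree module rather than Visual Studio. It adds the Number 13 story in the same element order the grid produced and then flattens every <story> into one row. The function name story_rows and the column order are choices made for this example, and the single flat table is a simplification of the grid's linked tables.

```python
import xml.etree.ElementTree as ET

ghost_stories = """<?xml version="1.0" encoding="utf-8" ?>
<stories>
   <story>
      <title>A House in Aungier Street</title>
      <author>
         <name>Sheridan Le Fanu</name>
         <nationality>Irish</nationality>
      </author>
      <rating>eerie</rating>
   </story>
   <story>
      <title>The Signalman</title>
      <author>
         <name>Charles Dickens</name>
         <nationality>English</nationality>
      </author>
      <rating>atmospheric</rating>
   </story>
   <story>
      <title>The Turn of the Screw</title>
      <author>
         <name>Henry James</name>
         <nationality>American</nationality>
      </author>
      <rating>a bit dull</rating>
   </story>
</stories>"""


def story_rows(stories):
    """Flatten each <story> into a (title, name, nationality, rating) row."""
    return [
        (
            story.findtext("title"),
            story.findtext("author/name"),
            story.findtext("author/nationality"),
            story.findtext("rating"),
        )
        for story in stories.findall("story")
    ]


stories = ET.fromstring(ghost_stories)

# Add the new story: title and rating first, then the nested author.
new_story = ET.SubElement(stories, "story")
ET.SubElement(new_story, "title").text = "Number 13"
ET.SubElement(new_story, "rating").text = "mysterious"
author = ET.SubElement(new_story, "author")
ET.SubElement(author, "name").text = "MR James"
ET.SubElement(author, "nationality").text = "English"

rows = story_rows(stories)
for row in rows:
    print(row)
```

Because the rows are looked up by element name, the different order of <rating> and <author> in the new story makes no difference to the result.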