What Is XML? | C# Developer's Guide to ASP.NET, XML, and ADO.NET


Here's a problem you've probably faced before. A customer or colleague comes to you asking for help working with an application that was written five years ago. Nobody who originally worked on the application still works for the company; the original developer died in a bizarre gardening accident some years back. The customer wants you to write a Web-based reporting system to handle the data emitted by this dinosaur application.

You now have the unenviable task of figuring out how this thing works, parsing the data it emits, and arranging that data in some recognizable format: a report.

Let's assume that the developer of the original application attempted to make it easy on you by expressing the data in some standardized format, maybe one in which elements within rows of data are separated from each other by a designated character, such as a comma or a tab. This is known as a delimited format. Listing 10.1 demonstrates a comma-delimited document.

Listing 10.1 A Comma-Delimited Document

 Jones,Machine Gun,401.32,New York
 Janson,Hand Grenade,79.95,Tuscaloosa
 Newton,Artillery Cannon,72.43,Paducah

However, a few problems occur with the delimited format. First, what happens if the data itself contains a comma or a tab? In this case, you're forced to use a more complicated delimiter, typically a comma with data enclosed in quotation marks. That different documents can use different delimiters is a problem in itself, though. There's no such thing as a single universal parse algorithm for delimited documents.
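The quoting problem can be sketched in a few lines. This is only an illustration, written in Python with its standard csv module rather than anything from the .NET toolset this book covers; the record below is hypothetical, built by putting a company name that contains a comma into the layout of Listing 10.1.

```python
import csv
import io

# Hypothetical record: the customer name itself contains a comma,
# so the producer had to wrap it in quotation marks.
line = '"United Disc, Incorporated",Hand Grenade,79.95,Tuscaloosa'

# Naive approach: split on every comma, including the one inside the name.
naive_fields = line.split(",")

# csv.reader understands the quoting convention and keeps the name intact.
quoted_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields))   # 5 fields: the name was cut in two
print(len(quoted_fields))  # 4 fields, as intended
print(quoted_fields[0])    # United Disc, Incorporated
```

The second parse works only because the reader and the writer agree on the quoting rule in advance, which is exactly the kind of per-document agreement a standard format is meant to remove.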

To make it even more difficult, different operating systems have different ideas about what constitutes the end of a line. Some systems (such as Windows) terminate a line with a carriage return and a line feed (ASCII 13 and 10, respectively), whereas others (such as Unix) just use a line feed.

Another problem: What is this data? Some of it, such as the customer's name and the item, is obvious. But what does the number 401.32 represent? Ideally, we want a document that is self-describing, one that tells us at a glance what all the data represents (or at least gives us a hint).

Another big problem with delimited documents: How can you represent related data? For example, it might be nice to view all the information about customers and orders in the same document. You can do this with a delimited document, but it can be awkward. And if you've written a parser that expects the first field to be the customer name and the fourth field to be the product name, adding any new fields between them breaks the parser.

Internet technology mavens realized that this scenario is frighteningly common in the world of software development, particularly in Internet development. XML was designed to replace delimited data, as well as other data formats, with something standard, easy to use and to understand, and powerful.

Advantages of XML

In a networked application, interoperability between various operating systems is crucial; the transfer of data from point A to point B in a standard, understandable way is what it's all about. For tasks that involve parsing data, then, using XML means spending less time worrying about the details of the parser itself and more time working on the application.

Here are some specific advantages of XML over other data formats:

XML documents are easily readable and self-describing. Like HTML, an XML document contains tags that indicate what each type of data is. With good document design, it should be reasonably simple for a person to look at an XML document and say, "This contains customers, orders, and prices."
XML is interoperable. Nothing about XML ties it to any particular operating system or underlying technology. You don't have to ask anyone's permission or pay anyone money to use XML. If the computer you're working on has a text editor, you can use it to create an XML document. Several types of XML parsers exist for virtually every operating system in use today (even really weird ones).
XML documents are hierarchical. It's easy to add related data to a node in an XML document without making the document unwieldy.
You don't have to write the parser. Several types of object-based parser components are available for XML. XML parsers work the same way on virtually every platform. The .NET platform contains support for the Internet-standard XML Document Object Model, but Microsoft has also thrown in a few XML parsing widgets that are easier to use and that perform better than the XML DOM; we'll cover these later in this chapter.
Changes to your document won't break the parser. Assuming that the XML you write is syntactically correct, you can add elements to your data structures without breaking backward compatibility with earlier versions of your application.
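The last point is easy to demonstrate. The sketch below uses Python's standard xml.etree.ElementTree module purely as a stand-in parser (the .NET parsers come later in the chapter), and the two documents and the read_total helper are hypothetical. Because the code asks for an element by name rather than by position, inserting a new CITY element does not disturb it.

```python
import xml.etree.ElementTree as ET

# Hypothetical "version 1" and "version 2" of the same record.
OLD_DOC = (
    "<ORDER><CUSTOMER>Jones</CUSTOMER>"
    "<TOTALAMOUNT>401.32</TOTALAMOUNT></ORDER>"
)
NEW_DOC = (
    "<ORDER><CUSTOMER>Jones</CUSTOMER>"
    "<CITY>New York</CITY>"
    "<TOTALAMOUNT>401.32</TOTALAMOUNT></ORDER>"
)


def read_total(xml_text):
    """Look the total up by element name, not by field position."""
    order = ET.fromstring(xml_text)
    return float(order.findtext("TOTALAMOUNT"))


print(read_total(OLD_DOC))
print(read_total(NEW_DOC))  # same value, despite the extra CITY element
```

A positional parser for the delimited format would have returned the city where it expected the amount.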

Is XML a panacea for every problem faced by software developers? XML won't wash your car or take out the garbage for you, but for many tasks that involve data, it's a good choice.

At the same time, Visual Studio .NET hides many of the implementation details from you. Relational data expressed in XML is abstracted in the form of a DataSet object. XML schemas (documents that define data types and relationships in XML) can be created visually, without writing code. In fact, Visual Studio .NET can generate XML schemas for you automatically by inspecting an existing database structure.

XML Document Structure and Syntax

XML documents must adhere to a standard syntax so that automated parsers can read them. Fortunately, the syntax is pretty simple to understand, especially if you've developed Web pages in HTML. The XML syntax is a bit more rigorous than that of HTML, but as you'll see, that's a good thing. There are a million ways to put together a bogus, sloppy HTML document, but the structure required by XML means that you get a higher level of consistency; no matter what your document contains, the rules that govern how an XML document can be parsed are the same.

Declaration

The XML declaration takes the same basic form in all XML documents. An XML declaration is shown in Listing 10.2.

Listing 10.2 XML 1.0 Declaration

 <?xml version="1.0"?>

The declaration says two things: This is an XML document (duh), and this document conforms to the XML 1.0 W3C recommendation (which you can get straight from the horse's mouth at http://www.w3.org/TR/REC-xml). Version 1.0 is by far the most widely used version of XML, so you will rarely see a declaration whose version number differs from the one in Listing 10.2, although many declarations also carry an optional encoding attribute. (The W3C later published an XML 1.1 recommendation, but it is rarely used in practice.)

NOTE

A W3C recommendation isn't quite the same as a bona fide Internet standard, but it's close enough for our purposes.

The XML declaration, when it exists, must be the very first thing in the document, at the start of the first line. The declaration does not have to exist, however; it is an optional part of an XML document. The idea behind a declaration is that you may have some automated tool that trawls document folders looking for XML. If your XML files contain declarations, it'll be much easier for such an automated process to locate XML documents (as well as to differentiate them from other marked-up documents, such as HTML Web pages).

Don't sweat it too much if you don't include a declaration line in the XML documents you create. Leaving it out doesn't affect how data in the document is parsed.
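As a quick check of both claims, here is a small sketch using Python's standard xml.etree.ElementTree module as a stand-in parser (an assumption made for illustration; other parsers may report errors differently). The sample document is hypothetical. It parses the same data with and without a declaration, then shows that a declaration preceded by anything, even a blank line, is rejected.

```python
import xml.etree.ElementTree as ET

BODY = "<ORDERS><ORDER><ID>33849</ID></ORDER></ORDERS>"

with_declaration = '<?xml version="1.0"?>\n' + BODY
without_declaration = BODY

# The declaration is optional: both documents yield the same data.
id_with = ET.fromstring(with_declaration).findtext("ORDER/ID")
id_without = ET.fromstring(without_declaration).findtext("ORDER/ID")
print(id_with == id_without)

# A declaration that is not at the very start of the document is an error.
try:
    ET.fromstring("\n" + with_declaration)
    misplaced_accepted = True
except ET.ParseError:
    misplaced_accepted = False
print(misplaced_accepted)
```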

Elements

An element is a part of an XML document that contains data. If you're accustomed to database programming or working with delimited documents, you can think of an element as a column or a field. XML elements are sometimes also called nodes.

XML documents must have exactly one top-level element (often called the root element) to be parsable. Listing 10.3 shows an XML document with a declaration and a single top-level element (but no actual data).

Listing 10.3 A Simple XML Document with a Declaration and a Top-Level Element

 <?xml version="1.0"?>
 <ORDERS>
 </ORDERS>

This document can be parsed, even though it contains no data. Note one important thing about the markup of this document: It contains both an open tag and a close tag for the <ORDERS> element. The closing tag is differentiated by the slash (/) character in front of the element name. Every XML element must have a closing tag; lack of a closing tag will cause the document to be unparsable. The XML declaration is not an element, so it does not require a closing tag.

This is an important difference between XML and HTML. In HTML, some elements require close tags, but many don't. Even for those elements that don't contain proper closing tags, the browser often attempts to correctly render the page (sometimes with quirky results).

XML, on the other hand, is the shrewish librarian of the data universe. It's not nearly as forgiving as HTML and will rap you on the knuckles if you cross it. If your XML document contains an element that's missing a close tag, the document won't parse. This is a common source of frustration among developers who use XML. Another kicker is that, unlike HTML, tag names in XML are case sensitive. This means that <ORDERS> and <orders> are considered to be two different and distinct tags.
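To see the librarian in action, the following sketch feeds three small documents to Python's standard xml.etree.ElementTree parser (used here only as a convenient illustration; the is_well_formed helper is a hypothetical name). One document is fine, one is missing a close tag, and one closes the element with a differently cased name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<ORDERS></ORDERS>"))         # matching open and close tags
print(is_well_formed("<ORDERS><ORDER></ORDERS>"))  # ORDER is never closed
print(is_well_formed("<ORDERS></orders>"))         # case mismatch in the close tag
```

Only the first document parses. The parser does not try to guess what the other two meant; it stops at the first error.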

Elements That Contain Data

The whole purpose of an XML element is to contain pieces of data. In the previous example, we left out the data. Listing 10.4 shows an evolved version of this document, this time with data in it.

Listing 10.4 An XML Document with Elements That Contain Data

 <?xml version="1.0"?>
 <ORDERS>
   <ORDER>
     <DATETIME>1/4/2000 9:32 AM</DATETIME>
     <ID>33849</ID>
     <CUSTOMER>Steve Farben</CUSTOMER>
     <TOTALAMOUNT>3456.92</TOTALAMOUNT>
   </ORDER>
 </ORDERS>

If you were to describe this document in English, you'd say that it contains a top-level ORDERS element and a single ORDER element, or node. The ORDER node is a child of the ORDERS element. The ORDER element itself contains four child nodes of its own: DATETIME, ID, CUSTOMER, and TOTALAMOUNT.

Adding a few additional orders to this document might give you something like Listing 10.5.

Listing 10.5 An XML Document with Multiple Child Elements Beneath the Top-Level Element

 <?xml version="1.0"?>
 <ORDERS>
   <ORDER>
     <DATETIME>1/4/2000 9:32 AM</DATETIME>
     <ID>33849</ID>
     <CUSTOMER>Steve Farben</CUSTOMER>
     <TOTALAMOUNT>3456.92</TOTALAMOUNT>
   </ORDER>
   <ORDER>
     <DATETIME>1/4/2000 9:32 AM</DATETIME>
     <ID>33856</ID>
     <CUSTOMER>Jane Colson</CUSTOMER>
     <TOTALAMOUNT>401.19</TOTALAMOUNT>
   </ORDER>
   <ORDER>
     <DATETIME>1/4/2000 9:32 AM</DATETIME>
     <ID>33872</ID>
     <CUSTOMER>United Disc, Incorporated</CUSTOMER>
     <TOTALAMOUNT>74.28</TOTALAMOUNT>
   </ORDER>
 </ORDERS>
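A document shaped like Listing 10.5 is simple to walk programmatically. The sketch below is one possible approach, using Python's standard xml.etree.ElementTree module as an illustrative parser rather than the .NET classes discussed later; the variable names are arbitrary. It collects the customer names and adds up the order totals, using Decimal so that the currency arithmetic stays exact.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# The document from Listing 10.5.
ORDERS_XML = """<?xml version="1.0"?>
<ORDERS>
  <ORDER>
    <DATETIME>1/4/2000 9:32 AM</DATETIME>
    <ID>33849</ID>
    <CUSTOMER>Steve Farben</CUSTOMER>
    <TOTALAMOUNT>3456.92</TOTALAMOUNT>
  </ORDER>
  <ORDER>
    <DATETIME>1/4/2000 9:32 AM</DATETIME>
    <ID>33856</ID>
    <CUSTOMER>Jane Colson</CUSTOMER>
    <TOTALAMOUNT>401.19</TOTALAMOUNT>
  </ORDER>
  <ORDER>
    <DATETIME>1/4/2000 9:32 AM</DATETIME>
    <ID>33872</ID>
    <CUSTOMER>United Disc, Incorporated</CUSTOMER>
    <TOTALAMOUNT>74.28</TOTALAMOUNT>
  </ORDER>
</ORDERS>"""

root = ET.fromstring(ORDERS_XML)   # root is the top-level ORDERS element
orders = root.findall("ORDER")     # its direct ORDER children

customers = [order.findtext("CUSTOMER") for order in orders]
grand_total = sum(Decimal(order.findtext("TOTALAMOUNT")) for order in orders)

print(customers)
print(grand_total)  # 3932.39
```

Note that the comma in "United Disc, Incorporated" needs no special treatment here, unlike in the delimited format.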

Here's where developers sometimes get nervous about XML. With a document like this, you can see that there's far more markup than data. Does this mean that all those extra bytes will squish your application's performance?

Maybe, but not necessarily. Consider an Internet application that uses XML on the server side. When this application needs to send data to the client, it first opens and parses the XML document (we'll discuss how XML parsing works later in this chapter). Then some sort of result (in all likelihood, a tiny subset of the data, stripped of markup) will be sent to the client Web browser. The fact that there's a bunch of markup there doesn't slow the data transfer down significantly.

At the same time, there is a way to express data more succinctly in an XML document, without the need for as many open and closing markup tags. You can do this through the use of attributes.

Attributes

An attribute is another way to enclose a piece of data in an XML document. An attribute is always part of an element; it typically modifies or is related to the information in the node. In a relational database application that emits XML, it's common to see foreign key data expressed in the form of attributes.

For example, a document that contains information about a sales transaction might use attributes as shown in Listing 10.6.

Listing 10.6 An XML Document with Elements and Attributes

 <?xml version="1.0"?>
 <ORDERS>
   <ORDER id="33849" custid="406">
     <DATETIME>1/4/2000 9:32 AM</DATETIME>
     <TOTALAMOUNT>3456.92</TOTALAMOUNT>
   </ORDER>
 </ORDERS>

As you can see from the example, attribute values are always enclosed in quotation marks. Using attributes tends to reduce the total size of the document (because you don't need to store open and close tags for the element). This has the effect of reducing the amount of markup at the expense (in some cases) of readability. Note that you are allowed to use either single or double quotation marks anywhere XML requires quotes.
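Reading attributes back out is as easy as reading elements. The following sketch again uses Python's standard xml.etree.ElementTree module as an illustrative parser. The document is Listing 10.6 with one deliberate change: the custid value is written with single quotes to show that both quoting styles are accepted. The shipdate attribute is hypothetical and intentionally absent.

```python
import xml.etree.ElementTree as ET

# Listing 10.6, with custid single-quoted to show both styles are legal.
ORDERS_XML = """<?xml version="1.0"?>
<ORDERS>
  <ORDER id="33849" custid='406'>
    <DATETIME>1/4/2000 9:32 AM</DATETIME>
    <TOTALAMOUNT>3456.92</TOTALAMOUNT>
  </ORDER>
</ORDERS>"""

order = ET.fromstring(ORDERS_XML).find("ORDER")

order_id = order.get("id")               # attribute values come back as strings
customer_id = order.get("custid")
ship_date = order.get("shipdate", "n/a")  # absent attribute: fall back to a default

print(order_id, customer_id, ship_date)
```

Attribute values always arrive as text, so converting "406" to a number is the application's job.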

This element/attribute syntax may look familiar from HTML, which uses attributes to assign values to elements the same way XML does. But remember that XML is a bit more rigid than HTML; a bracket out of place or a mismatched close tag will cause the entire document to be unparsable.

Enclosing Character Data

At the beginning of this chapter, we discussed the various dilemmas involved with delimited files. One of the problems with delimiters is that if the delimiter character exists within the data, it's difficult or impossible for a parser to know how to parse the data.

This problem is not confined to delimited files; XML has similar problems with containing delimiter characters. The problem arises because the de facto XML delimiter character (in actuality, the markup character) is the left angle bracket, also known as the less-than symbol. In XML, the ampersand character (&) can also throw off the parser.

You've got two ways to deal with this problem in XML: Either replace the forbidden characters with character entities or use a CDATA section as a way to delimit the entire data field.

Using Character Entities

You might be familiar with character entities from working with HTML. The idea is to take a character that might be interpreted as a part of markup and replace it with an escape sequence to prevent the parser from going haywire. Listing 10.7 provides an example of this.

Listing 10.7 An XML Document with Escape Sequences

 <?xml version="1.0"?>
 <ORDERS>
   <ORDER id="33849">
     <NAME>Jones &amp; Williams Certified Public Accountants</NAME>
     <DATETIME>1/4/2000 9:32 AM</DATETIME>
     <TOTALAMOUNT>3456.92</TOTALAMOUNT>
   </ORDER>
 </ORDERS>

Take a look at the data in the NAME element in the code example. Instead of an ampersand, the &amp; character entity is used. (If a data element contains a left bracket, it should be escaped with the &lt; character entity.)

When you use an XML parser to extract data with escape characters, the parser will automatically convert the escaped characters to their correct representation.
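The round trip can be sketched as follows, with Python's standard library standing in for whatever toolkit you actually use. The escape function from xml.sax.saxutils replaces the troublesome characters when the document is written, and the xml.etree.ElementTree parser restores them when the document is read. The variable names are arbitrary.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_name = "Jones & Williams Certified Public Accountants"

# Writing: replace markup characters with character entities.
escaped_name = escape(raw_name)
document = "<ORDER><NAME>" + escaped_name + "</NAME></ORDER>"
print(document)

# Reading: the parser converts the entities back automatically.
parsed_name = ET.fromstring(document).findtext("NAME")
print(parsed_name == raw_name)

# The left angle bracket is handled the same way.
print(escape("a < b"))
```

In practice you rarely escape by hand; a library that builds the document for you applies the same substitutions.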

Using CDATA Sections

An alternative to replacing delimiter characters is to use CDATA sections (sometimes loosely called CDATA elements). A CDATA section tells the XML parser not to interpret or parse characters that appear in the section.

Listing 10.8 demonstrates an example of the same XML document from before, this time delimited with a CDATA section rather than a character entity.

Listing 10.8 An XML Document with a CDATA Section

 <?xml version="1.0"?>
 <ORDERS>
   <ORDER id="33849">
     <NAME><![CDATA[Jones & Williams Certified Public Accountants]]></NAME>
     <DATETIME>1/4/2000 9:32 AM</DATETIME>
     <TOTALAMOUNT>3456.92</TOTALAMOUNT>
   </ORDER>
 </ORDERS>

In this example, the original data in the NAME element does not need to be changed, as it was in the previous example. Here, the data is wrapped in a CDATA section. The document is parsable, even though it contains a character that would otherwise be unparsable (the ampersand).

Which technique should you use? It's really up to you. You might prefer to use the CDATA method because it doesn't require altering the original data, but it has the disadvantage of adding a dozen or so bytes to each element.
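Whichever you choose, the application reading the document sees the same data. The sketch below, which uses Python's standard xml.etree.ElementTree module as an illustrative parser, parses a shortened, hypothetical NAME element written both ways and also counts the characters that the CDATA wrapper adds.

```python
import xml.etree.ElementTree as ET

entity_doc = "<NAME>Jones &amp; Williams</NAME>"
cdata_doc = "<NAME><![CDATA[Jones & Williams]]></NAME>"

entity_text = ET.fromstring(entity_doc).text
cdata_text = ET.fromstring(cdata_doc).text

# Both techniques produce identical text once parsed.
print(entity_text == cdata_text)

# Fixed cost of the CDATA wrapper: opening marker plus closing marker.
cdata_overhead = len("<![CDATA[") + len("]]>")
print(cdata_overhead)  # 12
```

By contrast, the entity approach costs four extra characters per ampersand, so which form is smaller depends on how many special characters the data contains.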

Abbreviated Close-Tag Syntax

For elements that contain no data, you can use an abbreviated syntax for element tags to reduce the amount of markup overhead contained in your document. Listing 10.9 demonstrates this.

Listing 10.9 An XML Document with Empty Elements

 <?xml version="1.0"?>
 <ORDERS>
   <ORDER id="33849" custid="406">
     <DATETIME>1/4/2000 9:32 AM</DATETIME>
     <TOTALAMOUNT />
   </ORDER>
 </ORDERS>

You can see from the example that the TOTALAMOUNT element contains no data. As a result, we can express it as <TOTALAMOUNT /> instead of <TOTALAMOUNT></TOTALAMOUNT>. It's perfectly legal to use either syntax in your XML documents; the abbreviated syntax is generally better, though, because it reduces the size of your XML document.
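To a parser, the two spellings are interchangeable. This short sketch uses Python's standard xml.etree.ElementTree module as an illustrative parser, with two hypothetical one-element documents, and confirms that both forms produce an element with no text content.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<ORDER><TOTALAMOUNT /></ORDER>")
long_form = ET.fromstring("<ORDER><TOTALAMOUNT></TOTALAMOUNT></ORDER>")

short_total = short_form.find("TOTALAMOUNT")
long_total = long_form.find("TOTALAMOUNT")

# Both elements exist, and neither one holds any text.
print(short_total.tag == long_total.tag)
print(short_total.text is None and long_total.text is None)
```

An application still has to decide what an empty TOTALAMOUNT means (zero, unknown, or not yet calculated); the syntax alone does not say.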
