Why XML? | XML and ASP.NET


XML is many things, but it is not a panacea. Simply adding an XML document or two to your application is not going to make anything magically better. There are reasons for XML, and there are certainly cases where XML is not appropriate. Admittedly, much hype surrounds XML. Before jumping into a discussion on what XML is, it is important to ask why you're learning about XML in the first place.

Standard Generalized Markup Language (SGML) traces its roots to the late 1960s. But SGML is a broad and complex technology, and it did not gain wide acceptance. Then Hypertext Markup Language (HTML) came along, and the world of markup languages changed.

Tim Berners-Lee created HTML in 1989. HTML is a descendant of SGML that displays data by using a fixed set of tags to signify different display elements. HTML is a greatly simplified version of SGML, and it quickly gained widespread acceptance because of its ease of use. Although HTML is great for presentation, the actual data for an HTML page is intermingled with the display elements. Soon, many people realized that HTML was too simple for complex data requirements.

The designers of XML recognized this and sought to develop a markup language that would truly separate data from its presentation. Thus, XML 1.0 was presented to the W3C as a working draft on November 14, 1996. The W3C working group had the following design goals in mind for XML:

It must be easily usable over the Internet.
It needs to support a wide variety of applications.
It must be compatible with SGML.
It must be easy to write programs that process XML documents.
The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
XML documents should be human-legible and reasonably clear.
The design needs to be prepared quickly.
The design needs to be formal and concise.
XML documents must be easy to create.
Terseness is of minimal importance.[1]

As you can see, ease of use and understandability were the main goals of XML from its beginning. More importantly, XML is a widely adopted standard markup format that actually achieves the preceding goals. Open a Microsoft Word file in Notepad and look at the markup involved in making a .doc file or a .rtf file: It is not trivial. XML makes the overall structure of a document trivial, leaving the implementation and dialect up to you.

In short, you use XML to describe data in a universal fashion: It's common across the boundaries of language and operating system (OS). You can use XML to create views of data that express relationships through hierarchies. You can use XML because it is clearer and less restrictive than many alternative languages.

Self-Describing Data

The term most associated with XML has to be self-describing data. How can something be self-describing? Look at this example.

 <?xml version="1.0"?>
 <Customer>
      <Name>Deanna Evans</Name>
      <Age>29</Age>
      <MaritalStatus>Married</MaritalStatus>
 </Customer>

You can tell a lot by looking at the preceding document. Because the XML is human-readable, you can discern that this data describes a single customer whose first name is Deanna and whose last name is Evans. You can infer that this customer's age is 29 and that the customer's marital status is Married. The document is easier to work with because it's self-describing. Suppose, instead, the information was represented differently, as shown here:

 $Deanna$Evans#1D$Married

Try to decipher this odd dialect. Any whole-word string value must be prefixed with a dollar sign ($). If a dollar sign is followed by a numerical value, the value indicates the number of empty spaces. All actual numbers are prefixed with a pound or hash symbol (#) and must be represented in hexadecimal format. This dialect cannot be easily read, and you certainly cannot infer that this data belongs to a customer.
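To make the cost of such a dialect concrete, here is a minimal Python sketch of a decoder for it. The function name decode_record is hypothetical, and the rules it applies are only the three described above. Note that even after decoding, nothing says what each value means.

```python
import re

def decode_record(record):
    """Decode the ad hoc '$'/'#' dialect into a list of Python values."""
    values = []
    # Each token is a prefix character followed by everything up to the next prefix.
    for prefix, body in re.findall(r"([$#])([^$#]+)", record):
        if prefix == "#":
            # Numbers are written in hexadecimal.
            values.append(int(body, 16))
        elif body.isdigit():
            # '$' followed by a number means that many empty spaces.
            values.append(" " * int(body))
        else:
            # '$' followed by a word is a plain string value.
            values.append(body)
    return values

print(decode_record("$Deanna$Evans#1D$Married"))
# ['Deanna', 'Evans', 29, 'Married']
```

The decoder recovers the values, but only because the rules were supplied out of band. The result is still an unlabeled list: nothing in the data says that 29 is an age or that the record is a customer.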

Looking at the XML example, you can see the data in a structured manner. XML imposes structure on the data, which makes it usable and understandable. By using readable tag names, you can designate a type with each element.
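As a rough illustration of how that structure helps a program, the following Python sketch reads the customer document with the standard library's xml.etree.ElementTree module and looks up values by element name. Python is used here only for illustration, and the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

CUSTOMER_XML = """<?xml version="1.0"?>
<Customer>
    <Name>Deanna Evans</Name>
    <Age>29</Age>
    <MaritalStatus>Married</MaritalStatus>
</Customer>"""

customer = ET.fromstring(CUSTOMER_XML)

# The root element's name says what kind of record this is.
print(customer.tag)  # Customer

# Each child element's name labels the value it carries.
record = {child.tag: child.text for child in customer}
print(record)
# {'Name': 'Deanna Evans', 'Age': '29', 'MaritalStatus': 'Married'}

# Values arrive as text; converting them is up to the application.
age = int(customer.findtext("Age"))
print(age)  # 29
```

No custom parsing rules were needed: the element names carry the labels, and any XML parser can recover them.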

The term self-describing is overused and somewhat arbitrary. Consider the following example:

 <b/>

This document is well-formed because it contains a single root element, the element name is a valid NCName, and the element tag is properly closed. From this example, we can infer that the root element has no content. Instead of saying "self-describing," it is more important to make a distinction between "human-readable" and "machine-readable." While the previous example does not seem very human-readable, it is certainly machine-readable.
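A short Python sketch shows what "machine-readable" means in practice. It assumes the standard library's xml.etree.ElementTree parser, and the helper name is_well_formed is hypothetical. The parser accepts the one-element document and rejects input that breaks the well-formedness rules.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

root = ET.fromstring("<b/>")
# The root element is named 'b' and has no text and no child elements.
print(root.tag, root.text, len(root))  # b None 0

print(is_well_formed("<b/>"))      # True
print(is_well_formed("<b>"))       # False: the element is never closed
print(is_well_formed("<b/><c/>"))  # False: more than one root element
```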

Toolset Support

One of the benefits of XML is that there's no fixed API to work with XML. It is flexible and easy to understand without requiring a single rigid set of methods. Instead, XML enjoys support from a variety of tools, ranging from different parsing APIs targeted for different languages to different text editors to help create XML documents easily. Examples of these tools include the following:

Xselerator by MarrowSoft is a great XSLT development tool that enables you to debug XSLT stylesheets.
XMLSpy by Altova is a great IDE for XML development and for working with XML Schemas.

While there is no fixed and rigid API for working with XML, there does exist a recommendation for representing XML as a programmable object model. This recommendation is known as the Document Object Model (DOM), specified in the W3C DOM Level 1 and 2 Core recommendations.

Implementors are urged to adhere to the recommendation within their implementations. Because different vendors adhere to the same recommendation, consistent behavior is expected between different implementations (for instance, a Java implementation should work exactly like a Visual Basic implementation). The DOM is explored in Chapter 5, "MSXML Parser," and in Chapter 6, "Exploring the System.Xml Namespace."
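To give a feel for the DOM's shape before those chapters, here is a minimal sketch that uses Python's xml.dom.minidom, a small DOM implementation from the Python standard library. It is not one of the parsers covered in this book and is used only as an assumption-free stand-in. The member names (documentElement, getElementsByTagName, createElement, appendChild) come from the W3C recommendation, so the same calls should look familiar in other DOM implementations.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<Customer><Name>Deanna Evans</Name><Age>29</Age></Customer>"
)

# documentElement is the root element of the document.
root = doc.documentElement
print(root.tagName)  # Customer

# getElementsByTagName returns every matching element in document order.
name_node = doc.getElementsByTagName("Name")[0]
print(name_node.firstChild.data)  # Deanna Evans

# New nodes are created by the document and then attached to the tree.
status = doc.createElement("MaritalStatus")
status.appendChild(doc.createTextNode("Married"))
root.appendChild(status)

child_names = [n.nodeName for n in root.childNodes
               if n.nodeType == n.ELEMENT_NODE]
print(child_names)  # ['Name', 'Age', 'MaritalStatus']
```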

Separation of Content and Presentation

Now that you can see how ugly markup can get, look at what is meant by "separation of content and presentation." I assume that you are familiar with HTML at this point, and likely are familiar with XML, at least in passing. What you might not have realized is what XML is achieving: An XML document can contain the data that's being displayed, and a number of different presentation methods can be applied for displaying the data.

HTML, by contrast, mixes the two. For example, a paragraph in an HTML page can actually be spread out over different table rows and table cells to make it more visually appealing. Because the presentation is intermixed with the content, it's difficult to extract the data from an HTML page in a common manner. Screen scraper programs that extract data from other websites typically run into problems when a website's content changes. The parser functions in the screen scraper application might look for something like this:

 <div id="stockquote">
    <table>
       <tr>
          <td>MSFT</td><td>50.27</td>
          <td>YHOO</td><td>8.11</td>
       </tr>
    </table>
 </div>

The web page's authors, however, decide to change the look and feel of their page. They change the data to the following:

 <p>Stock Quote for <i>MSFT</i>: <b>50.27</b></p>
 <p>Stock Quote for <i>YHOO</i>: <b>8.11</b></p>

The screen scraper program navigates to the remote site and looks for the markup in the first example, but it can't grab the stock quotes because the markup has changed. Unless you recode your parser, you cannot reliably know where in the HTML the stock quote resides.
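The failure is easy to reproduce. The sketch below is a hypothetical scraper written in Python: it assumes the quotes always appear as adjacent table cells, which is exactly the kind of layout assumption a real scraper ends up making. The names scrape_quotes and CELL_PAIR are illustrative only.

```python
import re

OLD_PAGE = """<div id="stockquote">
   <table>
      <tr>
         <td>MSFT</td><td>50.27</td>
         <td>YHOO</td><td>8.11</td>
      </tr>
   </table>
</div>"""

NEW_PAGE = """<p>Stock Quote for <i>MSFT</i>: <b>50.27</b></p>
<p>Stock Quote for <i>YHOO</i>: <b>8.11</b></p>"""

# The scraper is tied to the table layout: a symbol cell followed by a price cell.
CELL_PAIR = re.compile(r"<td>([A-Z]+)</td><td>([\d.]+)</td>")

def scrape_quotes(html):
    """Return {symbol: price} for every symbol/price cell pair found."""
    return {symbol: float(price) for symbol, price in CELL_PAIR.findall(html)}

print(scrape_quotes(OLD_PAGE))  # {'MSFT': 50.27, 'YHOO': 8.11}
print(scrape_quotes(NEW_PAGE))  # {}
```

The data on the page never changed, only its presentation, yet the scraper silently returns nothing.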

XML data, however, separates the presentation of data from the actual data itself. If you were interested in the raw data, you might access an XML file that did not contain display logic, such as the following:

 <?xml version="1.0"?>
 <quotes>
    <quote symbol="MSFT" price="50.27"/>
    <quote symbol="YHOO" price="8.11"/>
 </quotes>

The XML data is then displayed using HTML, XHTML, PDF, or a variety of other formats, but the raw data remains the same. If the display format changes, the raw data is unaffected.
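As a sketch of that idea, the following Python code loads the quotes once and renders them in two different HTML layouts. It uses plain string formatting rather than a stylesheet language such as XSLT, and the function names are hypothetical. The point is only that both views are derived from the same untouched data.

```python
import xml.etree.ElementTree as ET

QUOTES_XML = """<?xml version="1.0"?>
<quotes>
  <quote symbol="MSFT" price="50.27"/>
  <quote symbol="YHOO" price="8.11"/>
</quotes>"""

def load_quotes(xml_text):
    """Read (symbol, price) pairs from the raw data, ignoring any display concerns."""
    root = ET.fromstring(xml_text)
    return [(q.get("symbol"), q.get("price")) for q in root.findall("quote")]

def render_table(quotes):
    """First presentation: one table row per quote."""
    rows = "".join(f"<tr><td>{s}</td><td>{p}</td></tr>" for s, p in quotes)
    return f"<table>{rows}</table>"

def render_paragraphs(quotes):
    """Second presentation: one paragraph per quote."""
    return "\n".join(
        f"<p>Stock Quote for <i>{s}</i>: <b>{p}</b></p>" for s, p in quotes
    )

quotes = load_quotes(QUOTES_XML)
print(render_table(quotes))
print(render_paragraphs(quotes))
```

A consumer that reads the XML directly is unaffected when the site switches from one rendering function to the other.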

Interoperability and Data Transfer

One of the design goals of XML was, "XML shall be straightforwardly usable over the Internet." This challenge is more difficult than it first appears. The Internet is composed of many different binary formats, messaging protocols, and operating systems. To complicate matters, users of data might be from different parts of the world and might speak different languages or even use different characters in their text. By standardizing encoding, or the character set used in a document, XML allows a generic template for creating documents without compromising the design goals of making a document readable by humans. Again, the content is separate from the presentation.

XML relies on the International Organization for Standardization's Specification ISO 10646 to represent what a character is and what encoding is allowed. Relying on this standard makes XML accessible through different operating systems and networks. Imagine if no standard existed to represent a character: One operating system might consider a character four bytes of information, while another operating system might consider it two bytes.
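The byte-width problem can be seen directly. In the Python sketch below, the sample name "José" and the helper encode_document are assumptions made for illustration. The same character takes a different number of bytes in different encodings, yet a parser recovers the same text from each document because the XML declaration names the encoding in use.

```python
import xml.etree.ElementTree as ET

NAME = "José"  # sample value containing one non-ASCII character

def encode_document(encoding):
    """Serialize a tiny document, declaring and using the given encoding."""
    text = f'<?xml version="1.0" encoding="{encoding}"?><Name>{NAME}</Name>'
    return text.encode(encoding)

# The same character occupies a different number of bytes in each encoding.
widths = {enc: len("é".encode(enc))
          for enc in ("ISO-8859-1", "UTF-8", "UTF-32-LE")}
print(widths)  # {'ISO-8859-1': 1, 'UTF-8': 2, 'UTF-32-LE': 4}

# The parser reads the declared encoding and yields the same characters.
decoded = [ET.fromstring(encode_document(enc)).text
           for enc in ("ISO-8859-1", "UTF-8")]
print(decoded)  # ['José', 'José']
```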
