Introduction to XML Data | MCAD/MCSD: Visual Basic .NET XML Web Services and Server Components Study Guide

Extensible Markup Language (XML) is a language for marking up (or tagging) data so that the meaning of the data items, and the overall structure and relationships among them, are easy to understand. XML markup can be read and understood by people, and it is equally easy to parse and process with any of a wide range of software tools. XML data files are simple text documents that can be read by software on any computing platform and can travel over the Internet via HTTP.

Because XML was designed and its specification is maintained by the World Wide Web Consortium (W3C), www.w3.org, it is primarily thought of as an Internet or web technology. (The W3C is an international standards body that oversees Internet application standards such as HTML and XML.) However, XML is also useful in application integration. Because the XML format is not platform or programming language specific, it provides a quick way to pass data between applications with a minimal amount of conversion code.

The .NET Framework uses XML as the format for its configuration files and as a means to serialize object state when passing an object to a remote component. In this section, you will first learn about the basic rules for creating well-formed XML data files and see how a schema defines a particular format of XML. You will then learn the basics of working with XML data and the XSD language.

Understanding XML Basics

XML markup uses angle brackets (<…>) to enclose tag names that describe each data item, very much like HTML does. Matching pairs of tags enclose the data. These are called elements. The closing element tag begins with the forward slash (/) character.

Here is an example of a simple XML element that contains data, or what is called text content.

<job>Chief Executive Officer</job>

Elements can also contain data in the form of attributes. Attributes are enclosed inside the angle brackets and always take the form of a name/value pair. The value is enclosed in quotes.

Here is an example of an XML element that has an attribute named id, with a value of 1.

<job id="1">Chief Executive Officer</job>
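The difference between text content and attributes is easy to see from code. The following is a minimal sketch that uses Python's standard xml.etree.ElementTree module purely for illustration (Python is an assumption here; it is not one of the .NET tools this book covers). It parses an element that carries an id attribute and reads back its name, text content, and attribute value.

```python
import xml.etree.ElementTree as ET

# An element with one attribute (id) and text content.
job = ET.fromstring('<job id="1">Chief Executive Officer</job>')

print(job.tag)        # the element name
print(job.text)       # the text content between the tags
print(job.attrib)     # all attributes as name/value pairs
print(job.get("id"))  # one attribute value (always returned as a string)
```

Note that the attribute value comes back as a string even though it looks like a number; XML itself carries no data type information.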

Because one of the goals of XML is to be a universal medium for data exchange, XML files must follow some standard rules, resulting in a document that is said to be well formed. These rules are part of a W3C specification. Computer programs that read XML data are called XML parsers and they depend on XML data files to be well formed in order to interpret their content correctly. The standard behavior for an XML parser is to stop reading a file and report an error at the first point that it finds an incorrect character. If your XML data file conforms to the rules, and therefore is well formed, then any standard parser can read the data. Microsoft Internet Explorer (version 5 and later) is capable of parsing XML data and then displaying it with special formatting. Figure 7.1 shows a simple XML data file displayed in Internet Explorer.

Figure 7.1: An XML data file displayed in Internet Explorer

The rules for creating well-formed XML files are as follows:

Every XML document must have a uniquely named root element that encloses all of the data.
Every element must have matching opening and closing tags.
Elements at each level of the document hierarchy must be completely nested inside their parent elements (opening and closing tags of different elements cannot overlap).
Element tag names and attribute names are case sensitive (<Job> and </job> are not considered a match).
All attribute values must be enclosed in quotes (either single or double quotes).
Attribute names cannot repeat for a single element.
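To see these rules enforced, you can feed a parser small documents that each break one of them. The sketch below is illustrative only: it assumes Python's standard xml.etree.ElementTree module as the parser, and the sample documents and the first_error helper are hypothetical names invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents, each breaking one well-formedness rule.
BROKEN = {
    "two root elements": "<a/><b/>",
    "missing closing tag": "<joblist><jobs></joblist>",
    "overlapping elements": "<a><b></a></b>",
    "case mismatch": "<Job>Chief Executive Officer</job>",
    "unquoted attribute value": "<job id=1>Chief Executive Officer</job>",
    "repeated attribute name": '<job id="1" id="2">Chief Executive Officer</job>',
}

def first_error(text):
    """Return the parser's error message, or None if text is well formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None

# The parser stops at the first problem it finds in each document.
results = {rule: first_error(doc) for rule, doc in BROKEN.items()}
for rule, message in results.items():
    print(f"{rule}: {message}")
```

Each message includes the line and column where the parser gave up, which matches the stop-at-first-error behavior described earlier.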

Listing 7.1 shows an XML data file that follows these rules.

Listing 7.1: An XML Data File

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!-- This is a comment -->
<joblist>
  <jobs>
    <job_desc>New Hire - Job not specified</job_desc>
    <min_lvl>10</min_lvl>
    <max_lvl>50</max_lvl>
  </jobs>
  <jobs>
    <job_desc>Chief Executive Officer</job_desc>
    <min_lvl>200</min_lvl>
    <max_lvl>225</max_lvl>
  </jobs>
</joblist>

You can see that a uniquely named root element <joblist> is at the beginning of the data and that its matching closing tag </joblist> is the last line in the file.
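Because the file is well formed, any standard parser can walk its hierarchy. As a rough sketch (again assuming Python's standard xml.etree.ElementTree module rather than the .NET classes), the following code loads the data from Listing 7.1 and turns each <jobs> element into a dictionary of its child elements.

```python
import xml.etree.ElementTree as ET

LISTING_7_1 = """<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!-- This is a comment -->
<joblist>
  <jobs>
    <job_desc>New Hire - Job not specified</job_desc>
    <min_lvl>10</min_lvl>
    <max_lvl>50</max_lvl>
  </jobs>
  <jobs>
    <job_desc>Chief Executive Officer</job_desc>
    <min_lvl>200</min_lvl>
    <max_lvl>225</max_lvl>
  </jobs>
</joblist>
"""

# Parse from bytes so the parser honors the declared encoding.
root = ET.fromstring(LISTING_7_1.encode("utf-8"))

# One dictionary per <jobs> element, keyed by child element name.
jobs = [{child.tag: child.text for child in job} for job in root.findall("jobs")]

for job in jobs:
    # Text content is always a string; convert the levels explicitly.
    print(job["job_desc"], int(job["min_lvl"]), int(job["max_lvl"]))
```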

The first line in the file is a processing instruction, indicated by the <? syntax. This is a special processing instruction called the XML declaration; when it is present, it must be the very first thing in an XML data file. Processing instructions provide information that the parser can use while processing the file. The XML declaration indicates three attribute values: the version of the XML language that we are using, the encoding (for interpreting any extended characters), and the standalone attribute, which indicates (when set to yes) that no other files are needed to process this document. Other processing instructions can be included anywhere in the XML data file. They can contain information that is widely understood (such as a stylesheet instruction), or useful only to a custom parser.
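A parser exposes this information to your code. The sketch below is an illustration under stated assumptions: it uses Python's standard xml.dom.minidom module, and the xml-stylesheet instruction with its jobs.xsl file name is a hypothetical addition that does not appear in Listing 7.1.

```python
from xml.dom import minidom, Node

# The stylesheet instruction and jobs.xsl are hypothetical additions.
DOC = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>\n'
    b'<?xml-stylesheet type="text/xsl" href="jobs.xsl"?>\n'
    b'<!-- This is a comment -->\n'
    b'<joblist/>'
)

doc = minidom.parseString(DOC)

# Values read from the XML declaration.
print(doc.version, doc.encoding, doc.standalone)

# Other top-level nodes that come before the root element.
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]
comments = [node.data for node in doc.childNodes if node.nodeType == Node.COMMENT_NODE]

print(instructions)
print(comments)
```

In this DOM implementation the XML declaration is surfaced as properties of the document, while the stylesheet instruction and the comment appear as ordinary nodes alongside the root element.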

Following the processing instruction is a comment. This uses the same <!-- syntax that HTML comments use.

Now that you understand the basics of XML markup, you will see variations in the basic format as you work through the examples in this chapter. Next you'll learn how a schema definition language can be used to define and validate a specific format for XML markup.

Understanding XML Schema Definition

XML inherently enables you to create any element and attribute names that best describe your data and offers lots of flexibility in defining the hierarchical structure of a data file. This flexibility is useful, but when you are designing a format for XML that will be processed by your application code, or trying to conform to the format requirements of a system you want to exchange data with, you need a way to verify that data files are in the correct format.

When XML first became popular, the only means to validate the format of a data file was the Document Type Definition (DTD). DTD was inherited from SGML, the older markup language from which XML is derived. DTD was limited in what it could validate and used an unfamiliar, non-XML syntax. Most of the tools in the .NET Framework that can validate by using an XSD schema can also validate by using a DTD, if you need to support legacy data that uses a DTD.

Note

We will not cover DTD in detail here, but information about that technology is available in most XML reference books.

To overcome the shortcomings of DTD, the W3C designed and standardized what we now know as the XML Schema Definition (XSD) language. You might sometimes see references to an intermediate version called XML Data Reduced (XDR) that was used before the W3C finalized XSD. Although there are some similarities between XDR and XSD, XSD is much more sophisticated. Most of the tools available in the .NET Framework to perform validation provide support for the older technologies as well as XSD.

Listing 7.1 showed a simple XML data file with data from the jobs table of the pubs sample database. Listing 7.2 shows the XSD that describes this format.

Listing 7.2: The XSD Schema for the jobs Table

<?xml version="1.0" standalone="yes"?>
<xs:schema xmlns=""
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:msdata="urn:schemas-microsoft-com:xml-msdata">
  <xs:element name="joblist" msdata:IsDataSet="true">
    <xs:complexType>
      <xs:choice maxOccurs="unbounded">
        <xs:element name="jobs">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="job_id"
                  type="xs:short" minOccurs="0" />
              <xs:element name="job_desc"
                  type="xs:string" minOccurs="0" />
              <xs:element name="min_lvl"
                  type="xs:unsignedByte" minOccurs="0" />
              <xs:element name="max_lvl"
                  type="xs:unsignedByte" minOccurs="0" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>

The first thing to notice about this XSD file is that it is a well-formed XML document. This file can be parsed or processed by any program that can parse a well-formed XML data file. This enables the standard XML processing tools in the .NET Framework, as well as your custom code, to read, change, or create schema information programmatically. An XSD file is also a valid XML document because the element and attribute names are defined by the XSD specification. If you were to enter a tag name incorrectly (using uppercase letters in place of lowercase, for example) or to add a tag name that was not recognized, your parser would report an error and do no further processing on the file.
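Because a schema is ordinary XML, changing it from code takes nothing more than a general-purpose XML library. The following sketch makes some assumptions: it uses Python's standard xml.etree.ElementTree module instead of the .NET classes, and it works on a shortened, hypothetical version of Listing 7.2. It changes every minOccurs="0" setting to minOccurs="1" and writes the schema back out as text.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)  # keep the xs: prefix when serializing

# A shortened, hypothetical version of the schema in Listing 7.2.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="jobs">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="job_desc" type="xs:string" minOccurs="0" />
        <xs:element name="min_lvl" type="xs:unsignedByte" minOccurs="0" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = ET.fromstring(SCHEMA)

# Change every definition that carries minOccurs="0" to minOccurs="1".
changed = []
for definition in schema.iter(f"{{{XS}}}element"):
    if definition.get("minOccurs") == "0":
        definition.set("minOccurs", "1")
        changed.append(definition.get("name"))

updated = ET.tostring(schema, encoding="unicode")
print(changed)
print(updated)
```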

The schema file contains a standard XML declaration as its first line. This is followed by the root element <xs:schema>, which has several namespace declarations. XML namespaces serve much the same purpose as namespaces in your .NET Framework applications, although the syntax is different. In XML, the namespace is declared once and assigned a prefix. As you read through the XML file, all element names that use the prefix belong to that namespace. A colon character separates the prefix from the tag name. Namespaces add another level of qualification to an element name, either to resolve naming conflicts (by distinguishing one element name from another of the same name originating in another namespace) or simply to indicate where a particular element name is defined. The following schema snippet first shows the namespace declaration that defines the xs: prefix, by using a Uniform Resource Identifier (URI) that references the W3C, and then shows a tag name of element that is prefixed by xs: to indicate that it is part of that namespace:

xmlns:xs="http://www.w3.org/2001/XMLSchema"
<xs:element name="joblist" msdata:IsDataSet="true">

Note

All element tag names that begin with the xs: prefix are defined by the W3C XSD definition.

Another namespace prefix that is defined is msdata:. Elements and attributes prefixed with msdata: contain information that is specific to a schema created and used by Microsoft .NET Framework tools, and can be ignored by parsers on other platforms. The following code snippet shows the namespace declaration and an attribute that is added to the definition of the <joblist> element. The attribute with the msdata: prefix shows that the origin of this item of data was an ADO.NET DataSet:

xmlns:msdata="urn:schemas-microsoft-com:xml-msdata">
<xs:element name="joblist" msdata:IsDataSet="true">
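A namespace-aware parser resolves each prefix to its URI, so your code matches on the URI and not on the prefix characters. Here is a small sketch of that idea, assuming Python's standard xml.etree.ElementTree module, which writes a qualified name as {URI}name. The SNIPPET document is a cut-down, hypothetical version of the schema's opening tags.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
MSDATA = "urn:schemas-microsoft-com:xml-msdata"

# A cut-down, hypothetical version of the opening tags in Listing 7.2.
SNIPPET = f"""<xs:schema xmlns:xs="{XS}" xmlns:msdata="{MSDATA}">
  <xs:element name="joblist" msdata:IsDataSet="true" />
</xs:schema>"""

schema = ET.fromstring(SNIPPET)

# Prefixed names are expanded to {namespace URI}local-name.
print(schema.tag)

# Searches map a prefix of your choosing to the namespace URI.
element = schema.find("xs:element", {"xs": XS})
print(element.attrib)

# The unprefixed name attribute has no namespace; msdata:IsDataSet does.
is_dataset = element.get(f"{{{MSDATA}}}IsDataSet")
print(is_dataset)

# A different prefix bound to the same URI produces the same qualified name.
other = ET.fromstring(f'<xsd:schema xmlns:xsd="{XS}" />')
print(other.tag == schema.tag)
```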

The rest of the schema file contains an <xs:element> definition for each of the element tag names that occur in the data. These element definitions are nested inside each other in the same way that the elements are nested in the data file. First is the <xs:element name="joblist"> definition of the unique root element. It contains an <xs:complexType> element, which indicates that the <joblist> element contains a hierarchy of child elements or attributes, not only simple text content. Inside that is an <xs:choice maxOccurs="unbounded"> element. This indicates that the <joblist> root element can contain any number of child elements, each chosen from the definitions listed inside <xs:choice>; our example lists only one, the <jobs> element.
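One way to see this nesting is to print the schema as an indented outline. The sketch below is illustrative only. It assumes Python's standard xml.etree.ElementTree module, it uses a copy of Listing 7.2 with the msdata items and the empty default namespace declaration left out for brevity, and the outline function is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Listing 7.2 without the msdata items (simplified for this sketch).
LISTING_7_2 = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="joblist">
    <xs:complexType>
      <xs:choice maxOccurs="unbounded">
        <xs:element name="jobs">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="job_id" type="xs:short" minOccurs="0" />
              <xs:element name="job_desc" type="xs:string" minOccurs="0" />
              <xs:element name="min_lvl" type="xs:unsignedByte" minOccurs="0" />
              <xs:element name="max_lvl" type="xs:unsignedByte" minOccurs="0" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

def outline(node, depth=0):
    """Yield one indented line per schema node, outermost first."""
    kind = node.tag.replace(XS, "xs:")
    details = " ".join(f'{key}="{value}"' for key, value in node.attrib.items())
    yield "  " * depth + f"{kind} {details}".rstrip()
    for child in node:
        yield from outline(child, depth + 1)

schema = ET.fromstring(LISTING_7_2)
lines = list(outline(schema))
print("\n".join(lines))
```

The outline makes the pattern visible: each complex element is an xs:element that wraps an xs:complexType, which in turn wraps a grouping element (xs:choice or xs:sequence) that holds the child definitions.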

The <jobs> element is a direct child of the <joblist> root element and it is also a complex type. The <jobs> element has four child elements, which are listed inside a set of <xs:sequence> tags. The <xs:sequence> tag means that the child elements must always appear in the order shown in the schema. These elements do not contain any further child elements or attributes, only text content (the data), so they are known as simple types. Each definition includes a name attribute, which is taken from the column name in the DataSet, and a type attribute, which enables you to verify that appropriate data types are being used. The minOccurs (minimum number of occurrences) and maxOccurs (maximum number of occurrences) attributes can also appear in the definition; each defaults to 1 when omitted. By default, the ADO.NET methods create a schema that sets all the minOccurs attributes to zero (see Listing 7.2). A setting of minOccurs="0" indicates that the element is optional (that is, if the child element is missing from any of the <jobs> elements, the data file will still be considered valid). You might want to change the value to 1 to indicate that the element is required. You might also want to specify a maxOccurs value (use the value of unbounded to indicate that the element can be repeated any number of times) for elements that your format allows to repeat, as seen here:

<xs:element name="job_id" type="xs:short" minOccurs="1" />
<xs:element name="job_desc" type="xs:string" minOccurs="1"
    maxOccurs="unbounded" />
<xs:element name="min_lvl" type="xs:unsignedByte"
    minOccurs="1" />
<xs:element name="max_lvl" type="xs:unsignedByte"
    minOccurs="1" />

Notice that each of these simple type elements is defined by a single tag. The tag carries all pertinent data as attribute values, so it does not need separate opening and closing tags to enclose any content. In this case, you can use the short form, known as an empty-element tag: simply place the / character at the end of the opening tag, just before the closing angle bracket.
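To make the effect of minOccurs and maxOccurs concrete, the following sketch counts child elements and compares the counts with the limits in the definitions shown above. It is not a real XSD validator and it is not how the .NET Framework validates: it assumes Python's standard xml.etree.ElementTree module, it ignores element order and data types, and the occurrence_limits and occurrence_errors functions and the two sample <jobs> records are hypothetical.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# The four element definitions from the snippet, wrapped in xs:sequence.
SEQUENCE = """<xs:sequence xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="job_id" type="xs:short" minOccurs="1" />
  <xs:element name="job_desc" type="xs:string" minOccurs="1" maxOccurs="unbounded" />
  <xs:element name="min_lvl" type="xs:unsignedByte" minOccurs="1" />
  <xs:element name="max_lvl" type="xs:unsignedByte" minOccurs="1" />
</xs:sequence>"""

def occurrence_limits(sequence):
    """Map each child name to (minOccurs, maxOccurs); None means unbounded."""
    limits = {}
    for definition in sequence.findall(XS + "element"):
        low = int(definition.get("minOccurs", "1"))  # XSD default is 1
        high = definition.get("maxOccurs", "1")      # XSD default is 1
        limits[definition.get("name")] = (low, None if high == "unbounded" else int(high))
    return limits

def occurrence_errors(record, limits):
    """List occurrence violations for one <jobs> element (order is not checked)."""
    errors = []
    for name, (low, high) in limits.items():
        count = len(record.findall(name))
        if count < low:
            errors.append(f"{name}: found {count}, need at least {low}")
        elif high is not None and count > high:
            errors.append(f"{name}: found {count}, allowed at most {high}")
    return errors

limits = occurrence_limits(ET.fromstring(SEQUENCE))

# Hypothetical records: one complete (job_desc repeats), one missing elements.
complete = ET.fromstring(
    "<jobs><job_id>2</job_id><job_desc>Chief Executive Officer</job_desc>"
    "<job_desc>CEO</job_desc><min_lvl>200</min_lvl><max_lvl>225</max_lvl></jobs>"
)
incomplete = ET.fromstring("<jobs><job_desc>New Hire - Job not specified</job_desc></jobs>")

print(occurrence_errors(complete, limits))
print(occurrence_errors(incomplete, limits))
```

With minOccurs="1" on every definition, the second record is rejected because three required elements are missing. Under the minOccurs="0" settings that the ADO.NET methods generate by default, the same record would be accepted.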

Much more information can be added to an XSD schema to describe your data. This simple example is designed to show you the basics and help you understand the XSD files that are created for your applications in Visual Studio .NET. You can learn more about XSD schemas in the Visual Studio .NET documentation or at http://msdn.Microsoft.com/xml.

In the next section, you will learn how to create XML data files and XSD schemas directly from your ADO.NET DataSets.