Creating Valid Documents | XML and ASP.NET

A schema or Document Type Definition (DTD) contains the rules by which an XML document must abide. Schemas and DTDs are discussed in detail in a moment, but first, think of them as a set of rules. If the XML document conforms to the rules, it is said to be a valid document.

More precisely, XML documents are said to be valid if their content can be validated against either an XML Schema (XSD or XDR) or a DTD. This section discusses both schemas and DTDs, their benefits, their shortcomings, and their respective roles in .NET.

Why Validate Documents?

As with the other sections in this chapter, let's begin by answering some basic questions. Before jumping into how to validate documents, you must understand why you would want to validate a document.

Providing the rules and vocabulary for a document helps to communicate the grammar associated with the document. By describing what is valid content for the document, you develop a vocabulary that can be extended and understood by other developers.

Without explicitly stating the rules to which a document must conform, you put the burden of data validation on the program that generated the data, the program that consumes the data, or both to know the implicit rules associated with the XML document. Instead of relying on implicit validation, the DTD or schema can require explicit conformity to the referenced set of rules that makes the document valid.
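To make the cost of implicit validation concrete, here is a minimal Python sketch of the kind of hand-written checking a consuming program needs when no DTD or schema exists. The element names (SITES, LINKS, LINK) and the three rules are assumptions chosen for illustration, and the function name check_sites is hypothetical.

```python
import xml.etree.ElementTree as ET


def check_sites(xml_text):
    """Return a list of rule violations found in xml_text (empty if none)."""
    errors = []
    root = ET.fromstring(xml_text)
    # Rule 1 (assumed): the root element must be SITES.
    if root.tag != "SITES":
        errors.append("root element must be SITES")
    # Rule 2 (assumed): SITES may contain at most one LINKS element.
    links_nodes = root.findall("LINKS")
    if len(links_nodes) > 1:
        errors.append("SITES may contain at most one LINKS element")
    # Rule 3 (assumed): LINKS contains only LINK elements, each holding text only.
    for links in links_nodes:
        for child in links:
            if child.tag != "LINK":
                errors.append("unexpected element in LINKS: " + child.tag)
            elif len(child) > 0:
                errors.append("LINK must contain only text")
    return errors


good = "<SITES><LINKS><LINK>http://www.xmlandasp.net</LINK></LINKS></SITES>"
bad = "<SITES><LINKS/><LINKS><PAGE/></LINKS></SITES>"
print(check_sites(good))
print(check_sites(bad))
```

Every rule lives only in this code, so any other program that produces or consumes the data has to reimplement the same checks. A DTD or schema states the rules once, in a form that both sides can share.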

Consider a system where you receive a data feed from a customer. How would your customer know what is valid data and what is a valid structure of the document? You could give extensive wording and diagrams, but you risk losing something in the interpretation of a verbose document.

As a benefit, XML tools can also use DTDs or schemas to assist the developer in creating the document. For example, the XML editor included in Visual Studio .NET uses schemas to provide code completion and on-the-fly validation so that you know whether a document is valid during its creation.

Now you're ready to look at DTDs and schemas in more detail.

DTDs

A DTD is simply a syntax for declaring the grammar and vocabulary of an XML document. The DTD enables the developer to convey the structure of XML documents, as well as the content of XML documents.

This section covers DTDs only enough to convey their use and existence. DTDs are quickly losing ground to XML Schemas as the preferred validation mechanism for XML documents. This book focuses on the use of XML Schemas for validation over the use of DTDs.

What Is a DTD?

A DTD declares the structure and content of an XML file. It defines the content model of a document. Consider the following XML document:

 <?xml version="1.0" encoding="utf-8" standalone="yes" ?>
 <SITES>
      <LINKS>
           <LINK>http://www.Microsoft.com</LINK>
           <LINK>http://www.xmlandasp.net</LINK>
      </LINKS>
 </SITES>

Suppose that business rules are associated with the structure of this document, and that these rules are not immediately obvious. For example, suppose that you only accept one LINKS node as a child of the SITES node. Furthermore, there can be zero or more LINK elements as a child of the LINKS element. Finally, each LINK element contains text content. Here is the sample DTD that validates this data:

 <?xml version="1.0" encoding="utf-8" standalone="yes" ?>
 <!DOCTYPE SITES [
      <!ELEMENT SITES (LINKS)?>
      <!ELEMENT LINKS (LINK)*>
      <!ELEMENT LINK (#PCDATA)>
 ]>

To explain this a little further, the first line is the XML declaration, which specifies the version, the encoding, and the standalone attribute. The next line names the root element, SITES, using the DOCTYPE keyword. The DOCTYPE is part of the document's prolog, so it appears before the XML body content. Remember that an XML document has one, and only one, root element; the name given in the DOCTYPE declaration must match that root element.

The following line declares the SITES element and its content model. Its only permitted child is named LINKS, and the question mark (?) declares that LINKS appears either zero times or one time. Using this definition, the following XML is valid:

 <?xml version="1.0" encoding="utf-8" standalone="yes" ?>
 <SITES/>

This is because the LINKS element can occur zero times or one time, but cannot occur more than once.

The following line declares that the LINK child element can occur zero or more times:

 <!ELEMENT LINKS (LINK)*>

If you want to declare that at least one LINK element must be a child of the LINKS element, use the + notation to signify a cardinality of one or more:

 <!ELEMENT LINKS (LINK)+>

The following line in the sample DTD declares that the LINK element's content is made up of #PCDATA, or parsed character data:

 <!ELEMENT LINK (#PCDATA)>
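To see the prolog from a parser's point of view, the following sketch uses Python's standard xml.parsers.expat module to report the DOCTYPE name and each element declaration as it is read. The function name read_prolog is an assumption made for illustration. Note that expat is a non-validating parser: it reads the declarations but does not enforce them against the document body.

```python
import xml.parsers.expat

DOCUMENT = """<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<!DOCTYPE SITES [
  <!ELEMENT SITES (LINKS)?>
  <!ELEMENT LINKS (LINK)*>
  <!ELEMENT LINK (#PCDATA)>
]>
<SITES/>"""


def read_prolog(xml_text):
    """Return (doctype_name, declared_element_names) for xml_text."""
    found = {"doctype": None, "elements": []}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Called once, when the DOCTYPE declaration starts.
        found["doctype"] = name

    def on_element_decl(name, model):
        # Called once per <!ELEMENT ...> declaration, in document order.
        found["elements"].append(name)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.ElementDeclHandler = on_element_decl
    parser.Parse(xml_text, True)
    return found["doctype"], found["elements"]


doctype, elements = read_prolog(DOCUMENT)
print(doctype)
print(elements)
```

The DOCTYPE name reported here is the same name as the root element of the document body, which is the association the DOCTYPE declaration establishes.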

You can also declare a sequence of child elements. Suppose that you want to add an ARTICLES element as a child of the SITES root node. Furthermore, the LINKS element must always precede the ARTICLES node. To add this, you must change the sample DTD to the following:

 <?xml version="1.0" encoding="utf-8" standalone="yes" ?>
 <!DOCTYPE SITES [
      <!ELEMENT SITES (LINKS,ARTICLES)?>
      <!ELEMENT LINKS (LINK)*>
      <!ELEMENT LINK (#PCDATA)>
      <!ELEMENT ARTICLES (ARTICLE*)>
      <!ELEMENT ARTICLE (#PCDATA)>
 ]>

The following XML document is now valid using this DTD definition:

 <?xml version="1.0" encoding="utf-8" standalone="yes" ?>
 <!DOCTYPE SITES [
      <!ELEMENT SITES (LINKS,ARTICLES)?>
      <!ELEMENT LINKS (LINK)*>
      <!ELEMENT LINK (#PCDATA)>
      <!ELEMENT ARTICLES (ARTICLE*)>
      <!ELEMENT ARTICLE (#PCDATA)>
 ]>
 <SITES>
      <LINKS>
           <LINK>http://www.Microsoft.com</LINK>
           <LINK>http://www.xmlandasp.net</LINK>
      </LINKS>
      <ARTICLES>
           <ARTICLE>This is where an article may go</ARTICLE>
      </ARTICLES>
 </SITES>

The following XML, however, is not valid because the sequence was defined so that LINKS must precede ARTICLES:

 <?xml version="1.0" encoding="utf-8" standalone="yes" ?>
 <!DOCTYPE SITES [
      <!ELEMENT SITES (LINKS,ARTICLES)?>
      <!ELEMENT LINKS (LINK)*>
      <!ELEMENT LINK (#PCDATA)>
      <!ELEMENT ARTICLES (ARTICLE*)>
      <!ELEMENT ARTICLE (#PCDATA)>
 ]>
 <SITES>
      <ARTICLES>
           <ARTICLE>This is where an article may go</ARTICLE>
      </ARTICLES>
      <LINKS>
           <LINK>http://www.Microsoft.com</LINK>
           <LINK>http://www.xmlandasp.net</LINK>
      </LINKS>
 </SITES>

Suppose that you want to have a choice between two different element types. Instead of requiring both links and articles as children of the root node, suppose that you want one or the other. To represent this, use the OR notation, a vertical bar (|) between the alternatives:

 <?xml version="1.0" encoding="utf-8" standalone="yes" ?>
 <!DOCTYPE SITES [
      <!ELEMENT SITES (LINKS | ARTICLES)?>
      <!ELEMENT LINKS (LINK)*>
      <!ELEMENT LINK (#PCDATA)>
      <!ELEMENT ARTICLES (ARTICLE*)>
      <!ELEMENT ARTICLE (#PCDATA)>
 ]>
 <SITES>
      <ARTICLES>
           <ARTICLE>This is where an article may go</ARTICLE>
      </ARTICLES>
 </SITES>

This document is considered valid. The following document, however, is not valid because both elements are used when the OR condition was specified in the DTD:

 <?xml version="1.0" encoding="utf-8" standalone="yes" ?>
 <!DOCTYPE SITES [
      <!ELEMENT SITES (LINKS | ARTICLES)?>
      <!ELEMENT LINKS (LINK)*>
      <!ELEMENT LINK (#PCDATA)>
      <!ELEMENT ARTICLES (ARTICLE*)>
      <!ELEMENT ARTICLE (#PCDATA)>
 ]>
 <SITES>
      <ARTICLES>
           <ARTICLE>This is where an article may go</ARTICLE>
      </ARTICLES>
      <LINKS>
           <LINK>http://www.xmlandasp.net</LINK>
           <LINK>http://www.microsoft.com</LINK>
      </LINKS>
 </SITES>
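The occurrence indicators (?, *, +) and the sequence (,) and choice (|) operators behave much like their regular-expression counterparts. The following Python sketch makes that analogy concrete by hand-translating a few content models into regular expressions over a list of child element names. It is not a DTD validator: the translations and the helper name children_match are assumptions made purely for illustration.

```python
import re


def children_match(pattern, child_names):
    """Return True if the sequence of child element names fits the pattern."""
    # Encode the children as "NAME NAME " so every name ends with one space.
    encoded = "".join(name + " " for name in child_names)
    return re.fullmatch(pattern, encoded) is not None


# Hand-translated content models (assumptions of this sketch, not parser output).
SEQUENCE = r"(LINKS ARTICLES )?"   # <!ELEMENT SITES (LINKS,ARTICLES)?>
CHOICE = r"(LINKS |ARTICLES )?"    # <!ELEMENT SITES (LINKS | ARTICLES)?>
ONE_OR_MORE = r"(LINK )+"          # <!ELEMENT LINKS (LINK)+>

print(children_match(SEQUENCE, ["LINKS", "ARTICLES"]))   # order as declared
print(children_match(SEQUENCE, ["ARTICLES", "LINKS"]))   # wrong order
print(children_match(CHOICE, ["ARTICLES"]))              # one alternative
print(children_match(CHOICE, ["ARTICLES", "LINKS"]))     # both alternatives
print(children_match(ONE_OR_MORE, []))                   # no LINK children
```

Only the first and third calls return True, mirroring the valid and invalid documents shown above.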

So far, the focus has been only on elements. What if you want to specify an attribute? For example, what if you want to associate a name attribute with each LINK element? To use attributes, simply specify them in an ATTLIST section in the DTD, as shown here:

 <?xml version="1.0" encoding="utf-8" standalone="yes"?>
 <!DOCTYPE SITES [
      <!ELEMENT SITES (LINKS | ARTICLES)+>
      <!ELEMENT LINKS (LINK)*>
      <!ELEMENT LINK (#PCDATA)>
      <!ELEMENT ARTICLES (ARTICLE)*>
      <!ELEMENT ARTICLE (#PCDATA)>
      <!ATTLIST LINK
           name CDATA #REQUIRED
      >
 ]>
 <SITES>
      <ARTICLES>
           <ARTICLE>This is where an article may go</ARTICLE>
      </ARTICLES>
      <LINKS>
           <LINK name="xmlandasp.net">http://www.xmlandasp.net</LINK>
           <LINK name="Microsoft">http://www.microsoft.com</LINK>
      </LINKS>
 </SITES>

The #REQUIRED modifier for the attribute specifies that the attribute is required.
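As a small illustration of what an ATTLIST declaration conveys, the following Python sketch uses the standard xml.parsers.expat module to report each declared attribute along with its type and whether it is required. The trimmed-down document and the function name declared_attributes are assumptions made for this example. Because expat is non-validating, it reports the declaration but would not reject a LINK element that omits the attribute.

```python
import xml.parsers.expat

DOCUMENT = """<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<!DOCTYPE LINKS [
  <!ELEMENT LINKS (LINK)*>
  <!ELEMENT LINK (#PCDATA)>
  <!ATTLIST LINK name CDATA #REQUIRED>
]>
<LINKS>
  <LINK name="Microsoft">http://www.microsoft.com</LINK>
</LINKS>"""


def declared_attributes(xml_text):
    """Return (element, attribute, type, default, required) per declaration."""
    declarations = []

    def on_attlist(element, attribute, attr_type, default, required):
        # Called once for each attribute declared in an ATTLIST.
        declarations.append(
            (element, attribute, attr_type, default, bool(required))
        )

    parser = xml.parsers.expat.ParserCreate()
    parser.AttlistDeclHandler = on_attlist
    parser.Parse(xml_text, True)
    return declarations


print(declared_attributes(DOCUMENT))
```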

As you can see, DTDs can be useful for defining the structure and content model of an XML document.

Drawbacks of Using DTDs

The first drawback to using DTDs is that they are not themselves XML. DTDs use their own declaration syntax, so you cannot load, query, or transform a DTD with the same XML tools that you use for the documents it describes. Two types of DTDs exist: internal and external. In this section, only internal DTDs are used, just to keep things simple. External DTDs are DTDs that are external to the XML document and are referenced from within the XML document. Because an external DTD is not an XML document, working with DTDs becomes problematic: a different toolset must be used outside the usual XML APIs.

As you might have seen throughout the DTD examples, the content of elements was specified as #PCDATA and the content of the attribute as CDATA. This is because DTDs do not support data typing. You cannot require that the content of an element be a number or a date because, according to DTDs, all element content is simply text. You cannot specify the acceptable length of a string, nor can you specify restrictions on the string's contents.
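The practical consequence is that type checks fall back to application code. The following Python sketch assumes a hypothetical HITS element declared as (#PCDATA) that is meant to hold a non-negative integer. The DTD would accept any text, so the consumer has to check the value itself. The element name and the function name read_hits are illustrative assumptions.

```python
import re
import xml.etree.ElementTree as ET


def read_hits(xml_text):
    """Return the HITS value as an int, or raise ValueError if not numeric."""
    element = ET.fromstring(xml_text)
    text = (element.text or "").strip()
    # A DTD cannot express "digits only", so the check happens here.
    if re.fullmatch(r"[0-9]+", text) is None:
        raise ValueError("HITS must be a non-negative integer, got %r" % text)
    return int(text)


print(read_hits("<HITS>42</HITS>"))
try:
    read_hits("<HITS>many</HITS>")
except ValueError as error:
    print("rejected:", error)
```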

Another drawback to using DTDs is that names in a DTD must be unique. You can reference other elements, but you cannot define a new one by using the same name. For example, suppose that you want to change the DTD so that each ARTICLE element contains a child LINKS node, as follows:

 <?xml version="1.0" encoding="utf-8" standalone="yes" ?>
 <!DOCTYPE ROOT [
      <!ELEMENT ROOT (LINKS,ARTICLES)?>
      <!ELEMENT LINKS (LINK)*>
      <!ELEMENT ARTICLES (ARTICLE)*>
      <!ELEMENT LINK (#PCDATA)>
      <!ELEMENT ARTICLE (LINKS)>
 ]>
 <ROOT>
      <LINKS>
           <LINK>NewRiders.com</LINK>
      </LINKS>
      <ARTICLES>
           <ARTICLE>
                <LINKS>
                     <LINK>Microsoft.com</LINK>
                     <LINK>Xmlandasp.net</LINK>
                </LINKS>
           </ARTICLE>
      </ARTICLES>
 </ROOT>

Notice that both the ROOT and ARTICLE elements declare a child element of type LINKS.

You can reuse definitions, but you cannot redefine names. Suppose that, instead of reusing the LINKS definition, you want to redefine it. You might try something like the following, but this code is invalid because the name is already declared:

 <?xml version="1.0" encoding="utf-8" standalone="yes" ?>
 <!DOCTYPE ROOT [
      <!ELEMENT ROOT (LINKS,ARTICLES)?>
      <!ELEMENT LINKS (LINK)*>
      <!ELEMENT ARTICLES (ARTICLE)*>
      <!ELEMENT LINK (#PCDATA)>
      <!ELEMENT ARTICLE (LINKS)>
      <!ELEMENT LINKS (#PCDATA)>
 ]>

Finally, using namespaces with DTDs is difficult. It is not impossible, because a namespace-qualified name (QName) is simply a namespace prefix, a colon, and a local name, and a DTD can declare such a name like any other. The DTD itself has no concept of a namespace; it treats a name containing a colon as any other XML name. Declaring a namespace in a DTD is awkward, however, because you must supply the xmlns attributes through attribute lists on the elements that use the namespace prefix, and the prefix itself becomes fixed by the DTD.
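The following Python sketch shows why a fixed prefix is a problem. To a namespace-aware parser, two elements with different prefixes bound to the same namespace URI are the same element, whereas a DTD that declares a:LINK would treat b:LINK as a different, undeclared name. The namespace URI urn:example:links is a made-up value for illustration.

```python
import xml.etree.ElementTree as ET

# Same namespace URI, different prefixes (the URI is a hypothetical example).
doc_a = '<a:LINK xmlns:a="urn:example:links">http://www.xmlandasp.net</a:LINK>'
doc_b = '<b:LINK xmlns:b="urn:example:links">http://www.xmlandasp.net</b:LINK>'

tag_a = ET.fromstring(doc_a).tag
tag_b = ET.fromstring(doc_b).tag

# ElementTree expands each prefix to its URI, so both names come out identical.
print(tag_a)
print(tag_a == tag_b)
```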

XML Schemas are gaining ground on DTDs because schemas can define the same element name differently in different contexts, extend existing definitions, and work naturally with namespaces.

Validating XML Documents

The easiest way to validate an XML document against a DTD or schema is to load the XML document and its DTD in the XML editor in Visual Studio .NET. Select XML, Validate XML Data from the menu, and Visual Studio .NET reports any errors found in the DTD, as well as any content in the XML document that violates the DTD.

Validation Using Internet Explorer

Internet Explorer doesn't validate documents against either DTDs or schemas by default. However, iexmltls.exe is a free add-on to IE that performs validation. See http://msdn.microsoft.com/downloads/default.asp?url=/downloads/sample.asp?url=/MSDN-FILES/027/000/543/msdncompositedoc.xml to download this add-on.

To validate against a DTD using the XmlValidatingReader class in .NET, set the ValidationType property of the XmlValidatingReader object to ValidationType.DTD, as shown here:

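 Sub Validate()
     ' Wrap a plain text reader in a validating reader.
     Dim xmlReader As System.Xml.XmlTextReader = New System.Xml.XmlTextReader("c:\temp\xmlfile.xml")
     Dim vReader As System.Xml.XmlValidatingReader = New System.Xml.XmlValidatingReader(xmlReader)
     ' Validate against the DTD declared in the document.
     vReader.ValidationType = System.Xml.ValidationType.DTD
     AddHandler vReader.ValidationEventHandler, AddressOf ValidateCallback
     ' Reading through the document triggers validation; problems are reported to the callback.
     While vReader.Read()
     End While
     vReader.Close()
 End Sub

 ' Called once for each validation error or warning.
 Public Sub ValidateCallback(ByVal sender As Object, ByVal args As System.Xml.Schema.ValidationEventArgs)
     Debug.WriteLine(args.Message)
 End Sub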

You can also use the .NET base classes or MSXML to validate an XML document programmatically. Chapter 2 discusses the XmlValidatingReader class for validating XML documents, and the class is discussed in greater detail in Chapter 6, "Exploring the System.Xml Namespace." Chapter 5, "MSXML Parser," discusses how to validate documents by using MSXML.
