The XML Language | XML Programming Bible

XML is a subset, or a pared-down version, of SGML. If you are familiar with SGML, you will notice the similarities. Unlike SGML, XML has a simple syntax, which is one reason it has seen so much success, especially over the last year. In this section we go over the basic language syntax and semantics that you need to understand, such as elements, entities, comments, and processing instructions.

In various parts of this chapter we will refer to the term "schema." We are not speaking of XSD, but of a simple data schema. Specific references to XSD are made using a capitalized S, such as in "Schema," or the abbreviation XSD.

Elements

Elements represent the tags, or language, that you create with XML. To define an element in a DTD you use the following syntax.

 <!ELEMENT name type>

The name is the element name you want to define and the type is the type of content the element contains. This could be text, other elements, or a combination of the two. Here's an example to give you a better idea of what this means.

Suppose you want to create a customer schema that includes the customer's name and contact information. To provide another level of granularity, we break the name into first, middle, and last names, and contact into address and phone. Taking it one step further, we break address into street, city, state, and zip, and phone into home, work, and mobile. Figure 2-1 is a visual representation of this data model.

Element declarations in a DTD can appear in any order, but it is clearest to work from the top of the data model down, so we first define our customer element. This element contains two child elements as its content, so the DTD representation of this XML element takes the following form:

 <!ELEMENT customer (name , contact)>

Figure 2-1 A visual representation of our customer data model.

In this definition, for the document to be valid (more on what valid is later in the chapter), it must contain a single instance of the <name> element followed by a single instance of the <contact> element. You can impose some rules on this, such as having name OR contact or making one or both optional, but we will discuss those later.

Defining <name> and <contact> is similar because they are parent elements of child elements. So are <address> and <phone>. The definition for these can be represented as follows:

 <!ELEMENT name (first , middle , last)>
 <!ELEMENT contact (address , phone)>
 <!ELEMENT address (street , city , state , zip)>
 <!ELEMENT phone (home , work , mobile)>

The final step is to define the elements that actually hold the instance data. These include the child elements first, middle, last, street, city, state, zip, home, work, and mobile. We want these elements to hold only regular text, which is referred to as parsed character data, so we will define them as having #PCDATA. These definitions look like:

 <!ELEMENT first (#PCDATA)>
 <!ELEMENT middle (#PCDATA)>
 <!ELEMENT last (#PCDATA)>
 <!ELEMENT street (#PCDATA)>
 <!ELEMENT city (#PCDATA)>
 <!ELEMENT state (#PCDATA)>
 <!ELEMENT zip (#PCDATA)>
 <!ELEMENT home (#PCDATA)>
 <!ELEMENT work (#PCDATA)>
 <!ELEMENT mobile (#PCDATA)>

The last piece we need, which is actually the first item in our schema definition, is the <?xml> declaration. This is nothing more than

 <?xml version='1.0' encoding='UTF-8' ?>

Our completed schema shown in Listing 2-1, customer.dtd, can now be referenced and used in instance documents such as johndoe.xml in Listing 2-2, which is displayed in Microsoft Internet Explorer in Figure 2-2. We added comments in the johndoe.xml file to help you understand what information is included.

Listing 2-1 customer.dtd: The XML definition of the customer data model.

 <?xml version='1.0' encoding='UTF-8' ?>
 <!ELEMENT customer (name , contact)>
 <!ELEMENT name (first , middle , last)>
 <!ELEMENT contact (address , phone)>
 <!ELEMENT address (street , city , state , zip)>
 <!ELEMENT phone (home , work , mobile)>
 <!ELEMENT first (#PCDATA)>
 <!ELEMENT middle (#PCDATA)>
 <!ELEMENT last (#PCDATA)>
 <!ELEMENT street (#PCDATA)>
 <!ELEMENT city (#PCDATA)>
 <!ELEMENT state (#PCDATA)>
 <!ELEMENT zip (#PCDATA)>
 <!ELEMENT home (#PCDATA)>
 <!ELEMENT work (#PCDATA)>
 <!ELEMENT mobile (#PCDATA)>
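The walk from the data model to these declarations, parent elements first and text-only elements last, can be sketched in a few lines of Python. This is only an illustration: the MODEL dictionary and the element_declarations helper are hypothetical names introduced here, not part of any XML library, and the sketch covers only plain sequences and #PCDATA leaves.

```python
from collections import deque

# Hypothetical in-memory form of the Figure 2-1 data model:
# each parent element maps to its ordered list of child elements.
MODEL = {
    "customer": ["name", "contact"],
    "name": ["first", "middle", "last"],
    "contact": ["address", "phone"],
    "address": ["street", "city", "state", "zip"],
    "phone": ["home", "work", "mobile"],
}


def element_declarations(model, root):
    """Return <!ELEMENT ...> lines: parent elements first, then text leaves."""
    parents, leaves = [], []
    queue = deque([root])
    while queue:
        name = queue.popleft()
        children = model.get(name)
        if children:
            # A parent element: its content model is a sequence of children.
            parents.append(f"<!ELEMENT {name} ({' , '.join(children)})>")
            queue.extend(children)
        else:
            # A leaf element: it holds parsed character data only.
            leaves.append(f"<!ELEMENT {name} (#PCDATA)>")
    return parents + leaves


if __name__ == "__main__":
    print("\n".join(element_declarations(MODEL, "customer")))
```

Running the script prints one declaration per line, in the same parent-then-leaf order used in the listing.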

Listing 2-2 johndoe.xml: An instance of the customer.dtd document.

 <?xml version = "1.0"?>
 <!DOCTYPE customer SYSTEM "customer.dtd">
 <customer>
   <!--(name , contact)-->
   <name>
     <!--(first , middle , last)-->
     <first>John</first>
     <middle>Smithy</middle>
     <last>Doe</last>
   </name>
   <contact>
     <!--(address , phone)-->
     <address>
       <!--(street , city , state , zip)-->
       <street>123 Some Street</street>
       <city>Anytown</city>
       <state>NC</state>
       <zip>25555</zip>
     </address>
     <phone>
       <!--(home , work , mobile)-->
       <home>919.555.1212</home>
       <work>919.555.1213</work>
       <mobile>919.555.1214</mobile>
     </phone>
   </contact>
 </customer>
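To see how an application might read such an instance document, here is a minimal Python sketch using the standard library's xml.etree.ElementTree module. It is an assumption of this example that the document is held in a string, with the comments left out. Note that ElementTree only checks that the document is well-formed; it does not load customer.dtd or validate against it.

```python
import xml.etree.ElementTree as ET

# The johndoe.xml content, embedded as a string for this sketch.
JOHNDOE_XML = """<?xml version="1.0"?>
<!DOCTYPE customer SYSTEM "customer.dtd">
<customer>
  <name><first>John</first><middle>Smithy</middle><last>Doe</last></name>
  <contact>
    <address>
      <street>123 Some Street</street><city>Anytown</city>
      <state>NC</state><zip>25555</zip>
    </address>
    <phone>
      <home>919.555.1212</home><work>919.555.1213</work>
      <mobile>919.555.1214</mobile>
    </phone>
  </contact>
</customer>"""

root = ET.fromstring(JOHNDOE_XML)

# Child elements are addressed by a path relative to the root element.
full_name = " ".join(
    root.findtext(f"name/{part}") for part in ("first", "middle", "last")
)
city = root.findtext("contact/address/city")

# Iterating over an element yields its child elements in document order.
phones = {child.tag: child.text for child in root.find("contact/phone")}

if __name__ == "__main__":
    print(full_name, city, phones)
```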

Figure 2-2 Loading our johndoe.xml document in Microsoft Internet Explorer.

Element Attributes

Sometimes an element needs to be annotated by adding additional information. This step is called "adding attributes." This information can be used for a variety of purposes, but most commonly falls within one of the following categories:

As a distinguishing factor on the type of element
Additional descriptive information
Information to be used by the application in processing the data

Attributes are fairly easy to add to schemas and are often shared across elements. The basic syntax for adding attributes is:

 <!ATTLIST element name datatype #use >

The element refers to the element that the attribute should be associated with, while the name is the name you give the attribute. In your document instances you will use a name="value" pair to assign attribute values.

The datatype can be one of three kinds: a string type (CDATA), one of the tokenized types (ID, IDREF, IDREFS, ENTITY, ENTITIES, NMTOKEN, or NMTOKENS), or an enumerated type. The string type can take any literal string, while the tokenized types have varying lexical and semantic constraints. An enumerated type can take one of a list of possible values. If you need a refresher on what the tokenized data types mean, go to http://www.w3.org/TR/2000/REC-xml-20001006#sec-attribute-types.

Finally, the #use specifies whether the attribute is required. If it is required, this will contain #REQUIRED, and if not, #IMPLIED. Because an element can have more than one attribute, you are able to repeat the name, datatype, and #use combination within the same <!ATTLIST> instance. To help you understand this, let's create a new version of our customer.dtd schema, called customer_v2.dtd, and add attributes. This new schema will then be used to create instance documents such as johndoe_v2.xml in Listing 2-3.

We add a type attribute to our <customer> element and define it as an enumerated data type whose value must be either "current" or "past". We mark it #REQUIRED, so this attribute must be set. We also add a type attribute to our <mobile> element and define it as a string. This shows how more than one element can have the same attribute name but with a different meaning. Finally, we define a tokenized ID attribute named id for <customer>, which allows us to identify this element by a character ID, though the attribute is not required.

To create these attributes we need to add the following lines to our schema.

 <!ATTLIST customer
   type (current | past ) #REQUIRED
   id  ID #IMPLIED >
 <!ATTLIST mobile type CDATA #IMPLIED >

As you can see, this is a simple task. You see the names of the elements we created attributes for, the type of attributes they are, and whether they are required or not. You should also notice how our enumeration type attribute for <customer> is specified within the parentheses. The | character is a choice, or an OR condition, between the possible values of type.

Listing 2-3 johndoe_v2.xml: An updated document instance supporting our new attributes.

 <?xml version = "1.0"?>
 <!DOCTYPE customer SYSTEM "customer_v2.dtd">
 <customer type = "current" id = "abc">
   <name>
     <first>John</first>
     <middle>Smithy</middle>
     <last>Doe</last>
   </name>
   <contact>
     <address>
       <street>123 Some Street</street>
       <city>Anytown</city>
       <state>NC</state>
       <zip>25555</zip>
     </address>
     <phone>
       <home>919.555.1212</home>
       <work>919.555.1213</work>
       <mobile type = "phone">919.555.1214</mobile>
     </phone>
   </contact>
 </customer>
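As a rough illustration of how an application sees these attributes, the following Python sketch reads them with the standard library's xml.etree.ElementTree module. The document is a trimmed-down fragment of johndoe_v2.xml embedded as a string, and the ALLOWED_TYPES set is a hypothetical application-side check that mirrors the enumeration. ElementTree itself does not enforce the DTD.

```python
import xml.etree.ElementTree as ET

# A trimmed fragment of johndoe_v2.xml, embedded for this sketch.
JOHNDOE_V2 = """<customer type="current" id="abc">
  <contact>
    <phone>
      <work>919.555.1213</work>
      <mobile type="phone">919.555.1214</mobile>
    </phone>
  </contact>
</customer>"""

# Hypothetical check mirroring the (current | past) enumeration in the DTD.
ALLOWED_TYPES = {"current", "past"}

root = ET.fromstring(JOHNDOE_V2)

# get() returns the attribute value, or None when the attribute is absent.
customer_type = root.get("type")
customer_id = root.get("id")
mobile_type = root.find("contact/phone/mobile").get("type")
work_type = root.find("contact/phone/work").get("type")

type_is_allowed = customer_type in ALLOWED_TYPES

if __name__ == "__main__":
    print(customer_type, customer_id, mobile_type, work_type, type_is_allowed)
```

The same attribute name, type, carries a different meaning on <customer> than on <mobile>, which is why the application looks it up on a specific element rather than globally.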

Extra Flexibility

As we saw in the previous attribute example, conditions can be imposed on attributes. This functionality is also available in defining elements. You can represent OR statements and specify whether elements are optional or repeatable. Additionally, you can nest items within parentheses to add more flexibility. The table below lists the characters you should use to define these statements.

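Character	Description
?	Optional (zero or one occurrence)
+	Required and repeatable (one or more occurrences)
*	Optional and repeatable (zero or more occurrences)
|	OR (a choice between alternatives)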

For example, let's create a customer_v3.dtd, building on our second version, which does the following:

<middle> element optional and repeatable
<home> element optional
<mobile> element optional

To create this variation we need to change two lines of code to the following:

 <!ELEMENT name (first , middle* , last)>
 <!ELEMENT phone (home? , work , mobile?)>

Now we can create johndoe_v3.xml as shown in Listing 2-4 with two middle names and no home or mobile number, and the document will be valid.

Listing 2-4 johndoe_v3.xml: Changing our example document instance to include two middle names and no home or mobile number.

 <?xml version = "1.0" encoding = "UTF-8"?>
 <!DOCTYPE customer SYSTEM "customer_v3.dtd">
 <customer type = "current" id = "abc">
   <name>
     <first>John</first>
     <middle>Smithy</middle>
     <middle>Anderson</middle>
     <last>Doe</last>
   </name>
   <contact>
     <address>
       <street>123 Some Street</street>
       <city>Anytown</city>
       <state>NC</state>
       <zip>25555</zip>
     </address>
     <phone>
       <work>919.555.1213</work>
     </phone>
   </contact>
 </customer>
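One way to build intuition for the ?, +, *, and | characters is to notice that a content model behaves much like a regular expression over the sequence of child element names. The Python sketch below makes that analogy concrete. It is not a validating parser: content_model_to_regex and children_match are hypothetical helpers written for this illustration, and they do not handle #PCDATA or mixed content.

```python
import re

# A name, or one of the punctuation characters used in DTD content models.
TOKEN = re.compile(r"\s*([A-Za-z_][\w.-]*|[(),|?*+])")


def content_model_to_regex(model):
    """Translate a model such as '(first , middle* , last)' into a regex
    that matches a sequence of child names, each followed by one space."""
    parts = []
    pos = 0
    model = model.strip()
    while pos < len(model):
        match = TOKEN.match(model, pos)
        if match is None:
            raise ValueError(f"unsupported content model near {model[pos:]!r}")
        token = match.group(1)
        pos = match.end()
        if token == ",":
            continue  # a sequence is expressed by simple concatenation
        if token == "(":
            parts.append("(?:")
        elif token in (")", "|", "?", "*", "+"):
            parts.append(token)  # same meaning in DTDs and in regexes
        else:
            parts.append(f"(?:{re.escape(token)} )")
    return re.compile("".join(parts))


def children_match(model, child_names):
    """Return True if the ordered child names satisfy the content model."""
    sequence = "".join(name + " " for name in child_names)
    return content_model_to_regex(model).fullmatch(sequence) is not None


if __name__ == "__main__":
    print(children_match("(first , middle* , last)",
                         ["first", "middle", "middle", "last"]))
    print(children_match("(home? , work , mobile?)", ["work"]))
    print(children_match("(home? , work , mobile?)", ["home", "mobile"]))
```

Under this sketch, the <name> and <phone> elements of johndoe_v3.xml satisfy their version 3 content models, while a <phone> element without <work> does not.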

Entities

XML is made of entities and parsed or unparsed data. Entities are a single character construct or a collection of named constructs that are referenced in the document. Parsed data is made up of character data or markup and is processed by an XML processor. Unparsed data, on the other hand, is raw text not processed as XML. In this part we look at internal and external entity declarations and how they're referenced in instance documents.

Internal Entities

Internal entities are entities whose content is defined within the current schema. We focus on the definition first and then show you how to reference them.

Processors and Parsers

We need to clarify the difference between a processor and a parser (which we will talk about in Chapter 3). A processor is a software module that reads an XML instance document and provides access to its content and structure. The parser is the part of the processor that analyzes the markup and determines the structure of the document data. If it is a validating parser it can also perform a validation of the structure against a DTD.

Suppose you want to define an entity to hold the name of the author of the DTD so that it can be referenced in instance documents without explicitly typing it. The following code constructs this entity. Once it is defined, all an instance document author needs to do is place &dtd-author; in the document to reference "R. Allen Wyke".

 <!ENTITY dtd-author "R. Allen Wyke">
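To watch a parser replace such a reference, here is a small Python sketch using the standard library's xml.etree.ElementTree module. As an assumption made for this example, the entity is declared in the document's internal DTD subset rather than in an external .dtd file, because the standard library parser does not load external DTDs by default.

```python
import xml.etree.ElementTree as ET

# The entity is declared inside the DOCTYPE (the internal subset)
# so that the parser can see it without fetching an external file.
DOC = """<!DOCTYPE author [
  <!ELEMENT author (#PCDATA)>
  <!ENTITY dtd-author "R. Allen Wyke">
]>
<author>&dtd-author;</author>"""

author = ET.fromstring(DOC)
author_text = author.text  # the reference has been replaced by its value


def parses(xml_text):
    """Return True if the text parses, False on a parse error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Without a declaration, the same reference is an error.
undeclared_ok = parses("<author>&dtd-author;</author>")

if __name__ == "__main__":
    print(author_text, undeclared_ok)
```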

An entity like dtd-author, which is created within a schema and used in an instance document, is not the only type of entity. It is also possible to define a list of elements or attributes as an entity (known as a parameter entity) so that several declarations can share the same list. The syntax for defining a group of elements under a single name is

 <!ENTITY % name "elements">

Just as in a content model for an element, you can have one or more elements defined in elements, and you can impose conditions using the characters listed in the Extra Flexibility section. Defining reusable attribute entities, however, is slightly different. For these the entity must contain the whole attribute definition (name, datatype, and #use), exactly as it would appear inside an <!ATTLIST> declaration like the ones in customer_v2.dtd.

 <!ENTITY % name "attrname datatype #use attrname datatype #use">

These attribute-list entities also differ from attribute lists because they are referenced in the schema, not the instance document. This is accomplished by placing a % in front of the entity name and a semicolon after it, as in %name;.

To make sure we're all on the same page, let's build customer_v4.dtd on our version 3 schema and define some internal entities. We are going to:

Define a dtd-author entity
Create an element entity called cus-basic that contains <name> and <contact>
Define <author>, <internal>, and <external> elements and define the content model of <internal> and <external> with %cus-basic;
Replace the current content model of <customer> with a choice of <internal> or <external> AND include an <author> element instance
Define an attribute entity called attr-basic that contains id and type
Include the %attr-basic; list as attributes for <customer>, <internal>, and <external>

The first step is to create the dtd-author entity, which we showed you how to do earlier. Next we need to add an element entity named cus-basic that contains <name> and <contact> and define three new elements. The content model for <internal> and <external> should include our newly defined cus-basic. These three steps will require the following additions to our schema.

 <!ENTITY dtd-author "R. Allen Wyke">
 <!ENTITY % cus-basic "name , contact">
 <!ELEMENT internal (%cus-basic;)>
 <!ELEMENT external (%cus-basic;)>
 <!ELEMENT author (#PCDATA)>

Next we want to redefine the <customer> content model to include a choice of <internal> or <external> and an <author> instance. Following this step we will define an attribute entity list called attr-basic, which contains id and type, and use it to supply the only attributes for <customer>, <internal>, and <external>.

 <!ELEMENT customer ((internal | external) , author)>
 <!ENTITY % attr-basic " type CDATA #IMPLIED id  CDATA #IMPLIED">
 <!ATTLIST customer %attr-basic; >
 <!ATTLIST internal %attr-basic; >
 <!ATTLIST external %attr-basic; >
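To picture what the parser effectively sees once these parameter entity references are replaced, the following Python sketch performs a purely textual expansion. It is a simplification for illustration only: expand_parameter_entities is a hypothetical helper, it handles only internal entities with double-quoted values, and a real XML processor applies additional rules (for example, about whitespace around the replacement text).

```python
import re

# A few declarations from the steps above, embedded for this sketch.
DTD = """<!ENTITY % cus-basic "name , contact">
<!ENTITY % attr-basic " type CDATA #IMPLIED id  CDATA #IMPLIED">
<!ELEMENT internal (%cus-basic;)>
<!ATTLIST internal %attr-basic; >"""


def expand_parameter_entities(dtd_text):
    """Replace each %name; reference with the text declared for name."""
    # Collect declarations of the form <!ENTITY % name "value">.
    entities = dict(
        re.findall(r'<!ENTITY\s+%\s+([\w.-]+)\s+"([^"]*)"\s*>', dtd_text)
    )
    # Substitute references; an undeclared name raises KeyError.
    return re.sub(r"%([\w.-]+);", lambda m: entities[m.group(1)], dtd_text)


expanded = expand_parameter_entities(DTD)

if __name__ == "__main__":
    print(expanded)
```

After expansion, the <internal> declaration reads (name , contact), and its attribute list carries the type and id definitions directly.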

Figure 2-3 is a visual representation of our completed schema, and the DTD is shown in customer_v4.dtd.

Figure 2-3 A visual representation of our revised schema.

Here is what customer_v4.dtd contains:

 <?xml version='1.0' encoding='UTF-8' ?>
 <!ENTITY % attr-basic " type CDATA #REQUIRED id  CDATA #IMPLIED">
 <!ENTITY % cus-basic "name , contact">
 <!ELEMENT customer ((internal | external) , author)>
 <!ATTLIST customer %attr-basic; >
 <!ELEMENT name (first , middle* , last)>
 <!ELEMENT contact (address , phone)>
 <!ELEMENT address (street , city , state , zip)>
 <!ELEMENT phone (home? , work , mobile?)>
 <!ELEMENT first (#PCDATA)>
 <!ELEMENT middle (#PCDATA)>
 <!ELEMENT last (#PCDATA)>
 <!ELEMENT street (#PCDATA)>
 <!ELEMENT city (#PCDATA)>
 <!ELEMENT state (#PCDATA)>
 <!ELEMENT zip (#PCDATA)>
 <!ELEMENT home (#PCDATA)>
 <!ELEMENT work (#PCDATA)>
 <!ELEMENT mobile (#PCDATA)>
 <!ATTLIST mobile type CDATA #IMPLIED >
 <!ENTITY dtd-author "R. Allen Wyke">
 <!ELEMENT internal (%cus-basic;)>
 <!ATTLIST internal %attr-basic; >
 <!ELEMENT external (%cus-basic;)>
 <!ATTLIST external %attr-basic; >
 <!ELEMENT author (#PCDATA)>

Listing 2-5 johndoe_v4.xml: A sample document using the new schema.

 <?xml version = "1.0"?>
 <!DOCTYPE customer SYSTEM "customer_v4.dtd">
 <customer type = "current" id = "xyz">
   <internal type = "current" id = "xyz">
     <name>
       <first>John</first>
       <middle>Smithy</middle>
       <middle>Smithy</middle>
       <last>Doe</last>
     </name>
     <contact>
       <address>
         <street>123 Some Street</street>
         <city>Anytown</city>
         <state>NC</state>
         <zip>25555</zip>
       </address>
       <phone>
         <work>919.555.1213</work>
       </phone>
     </contact>
   </internal>
   <author>&dtd-author;</author>
 </customer>

Figure 2-4 shows a sample document, johndoe_v4.xml, loaded into Microsoft Internet Explorer. Notice how the parser replaced the entity reference to dtd-author with the string "R. Allen Wyke".

Figure 2-4 The dtd-author entity reference replaced.

External Entities

External entities are what the name implies: entities defined in external files. This file is "imported" using the <!ENTITY> declaration and includes a name, identifier (SYSTEM or PUBLIC), and a literal (URI or path), in the form:

 <!ENTITY name identifier "literal">

Although we will not go into much detail about external entities in this chapter, the point to note is that this is the method you use to include other resources, such as images and other schemas, in your schema.

A good example of how this might be used is if you want to build your schema in modules. If you wanted to include customer_v4.dtd as part of another schema so that it would inherit the data model, you could include the following:

 <!ENTITY % customer_v4.dtd SYSTEM "customer_v4.dtd">

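If customer_v4.dtd is not in the current working directory, you need to include a path to the file. Declaring the entity does not pull the file in by itself; you also reference it with %customer_v4.dtd; at the point in your schema where its declarations should appear. Additionally, if you have stored your schema on a server accessible to your customers or partners, you could replace SYSTEM with PUBLIC and include the URI in the schema. Note that the PUBLIC form takes two literals, a public identifier followed by the URI; the public identifier shown here is only a placeholder.

 <!ENTITY % customer_v4.dtd PUBLIC "-//Example//DTD Customer v4//EN" "http://www.microsoft.com/dtds/customer_v4.dtd">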

Comments

Comments within your XML schemas and documents are useful and important, as they are in any type of language. Although the ability to create human-readable markup is one of the objectives of XML, sections, elements, attributes, or other items might need more description.

XML follows the syntax HTML uses for comments. They begin with <!-- and end with -->. A few examples follow. In the second example we show how these comments can span multiple lines.

 <!-- here is a one line comment -->
 <!-- here is a comment
      that spans more than one line -->

The XML grammar does not allow the string -- (two hyphens) inside a comment, so your comment cannot end with ---> (three hyphens). Doing so violates the document's requirement to be well-formed, which we discuss at the end of the chapter.
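The comment rules can be tried out with a short Python sketch that uses the standard library's xml.dom.minidom module, which keeps comment nodes in the parsed tree. The two sample strings and the helper names comments_in and is_well_formed are assumptions made for this illustration.

```python
from xml.dom import Node, minidom
from xml.parsers.expat import ExpatError

GOOD = "<customer><!-- here is a one line comment --><name/></customer>"
BAD = "<customer><!-- this comment ends badly ---><name/></customer>"


def comments_in(xml_text):
    """Return the text of each comment directly inside the root element."""
    root = minidom.parseString(xml_text).documentElement
    return [n.data for n in root.childNodes if n.nodeType == Node.COMMENT_NODE]


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        minidom.parseString(xml_text)
    except ExpatError:
        return False
    return True


if __name__ == "__main__":
    print(comments_in(GOOD))
    print(is_well_formed(GOOD), is_well_formed(BAD))
```

The parser accepts the first string and reports the second as not well-formed because of the three hyphens before the closing angle bracket.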

Processing Instructions

Another function of XML is to pass processing instructions within the document. The thought behind this functionality is to remove, as much as possible, any software-specific markup that your schema might otherwise require. This avoids adding elements and attributes to your schema that do not help describe the data and exist only for the benefit of an application.

As an example, say you're passing an instance of the customer_v4.dtd schema to an application that will transform it into another XML dialect, such as a Microsoft BizTalk-compatible one. Say this application is a Web Service and is accessible through a URL. To include this reference you could have something like the following, where dtd2biztalk is the target name and href="http://www.microsoft.com/scripts/dtd2biztalk.exe" is the instruction your application would understand:

 <?dtd2biztalk href="http://www.microsoft.com/scripts/dtd2biztalk.exe"?>
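As a rough sketch of how an application might pick up such an instruction, the Python code below uses the standard library's xml.dom.minidom module, which exposes processing instructions as nodes with a target and a data string. The one-line sample document is an assumption made for this example. XML itself does not interpret the data string, so extracting the href value with a regular expression is the application's own convention here.

```python
import re
from xml.dom import Node, minidom

# A minimal document carrying the processing instruction before the root.
DOC = (
    '<?dtd2biztalk href="http://www.microsoft.com/scripts/dtd2biztalk.exe"?>'
    "<customer/>"
)

doc = minidom.parseString(DOC)

# Processing instructions appear as child nodes of the document.
pi = next(
    n for n in doc.childNodes if n.nodeType == Node.PROCESSING_INSTRUCTION_NODE
)
target = pi.target  # the name right after "<?"
data = pi.data      # everything after the target, as one uninterpreted string

# The application decides how to read the data; here it looks for href="...".
match = re.search(r'href="([^"]*)"', data)
href = match.group(1) if match else None

if __name__ == "__main__":
    print(target, href)
```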