Building XML

XML is considered the building block for Web services. It provides a standard format that can cross all boundaries, enabling us to share information with literally anyone. However, this empowerment comes with a cost. XML is not the most efficient mechanism for transferring or working with data. To use XML as efficiently as possible, we need a good understanding of what XML is, when to use it, and how to use it.

The Background on Markup Languages

XML is a markup language for defining structured data. Markup languages are sometimes referred to as metalanguages. These languages are standardized mechanisms for describing text. These mechanisms are designed to communicate information from the writer to the reader through a series of predetermined notations. For instance, if you print out this page and put a line through some words that don't belong (like this), most readers will understand that you think those words should be removed. You can almost think of markup languages as programmatic shorthand applied to a document.

Note 

The word markup refers to the process of using some annotation to communicate handling instructions to a processor. This technique was originally used to communicate instructions to typesetters regarding the treatment of content for publishing.

Any markup language is considered a member of the family of languages that has been derived from SGML (Standard Generalized Markup Language). SGML is an international standard (ISO 8879:1986) for a markup language established in 1986 and is considered to be the father of computer-based markup languages.

Note 

Unfortunately, the detailed history of SGML seems to be up for debate and a contentious issue for many of the people who were involved. What can be said is that various efforts of multiple individuals dating back to the 1960s led to the development and standardization of SGML.

HTML (HyperText Markup Language) and XML are the most popular markup languages. There are also many other markup languages, including offshoots of HTML and XML that are starting to blur in distinction from their counterparts. Theoretically, the various markup languages of this family have rules that make them somewhat unique, but they are all markup languages.

The common bond between these languages is the standardized notation used to describe the actual data. There is nothing inherently powerful or special about these tags except that everyone has agreed to recognize these demarcations as tags and distinguish them from the data they are treating. A tag itself is simply some metadata enclosed between a less-than sign and a greater-than sign, as shown here:

 <tag> 

  • Data that defines other data is often referred to as metadata.

Depending on the implementation, the metadata in this tag will be communicating some information concerning the handling of the data it is related to. The specification for a group of implementations would define the encoding for a markup language. Specifications are typically drafted and maintained through a standards body. The World Wide Web Consortium (W3C) is the standards organization that maintains the HTML and XML specifications. For more information on W3C's organization and processes for drafting and revising specifications, visit its Web site at http://www.w3.org.

Every language has syntax to define what is considered valid and invalid code. The specification for a markup language defines its syntax and/or usage. The latest version of the entire specification for XML is on the W3C site at http://www.w3.org/TR/REC-xml.

XML and markup languages in general have a much less complicated syntax than actual programming languages like C++, Java, or even Visual Basic. The simplicity is both a byproduct and a goal. It is a byproduct because the requirements of XML are much less than those of a programming tool because we are defining data, not building applications. In fact, it is necessary to have a clear distinction between programming languages and markup languages. One allows you to develop an application, and the other can only be used to create the interface or instruction set for an application. In fact, markup language code doesn't even have the ability to interpret or validate itself. All it really does is communicate state to the application on a very simplistic level.

Take, for instance, the following HTML code:

 <table border="0">
      <tr>
           <td>City</td>
           <td>State</td>
      </tr>
 </table>

As any programmer will attest, tables are one of the more complex structures in HTML. However, even with this set of tags, the HTML isn't actually creating the table; the browser reading the code is. The HTML is laying out instructions for the browser concerning how many cells are in the table, some aspects of what it should look like, and what the cells' content is. If you were to remove the first </td> tag, what would happen? The HTML would be invalid, but the code has no way of validating itself. Browsers have become very intelligent and forgiving over the years because it is up to them to either handle this issue or ignore it. A browser won't get any help resolving an error from the HTML code itself.

At a glance, HTML and XML look very similar. If you weren't an experienced developer in one or the other, you might not be able to distinguish the difference. For instance, if you saw the following line of code, would you be able to determine if it was HTML or XML?

 <H1>Fundamental Web Services</H1> 

Of course, if you were an HTML developer, you might recognize this as the Header 1 tag and assume it is HTML. However, this could also be a valid XML tag if the appropriate definitions were given. Given the rest of the document, you could likely determine which language it is because you would have more data to either qualify or disqualify the code as HTML or XML. Let's go ahead and look at the details of XML so we can make a distinction.

The Purpose of XML

The first thing we need to understand about XML is why it was created. After all, we have been using HTML a long time, and it has obviously been accepted worldwide. Why complicate things?

HTML was created for one reason: to present and link academic documents over a network. We have obviously taken HTML a long way, but we couldn't take the next step within the confines of that original design. HTML was designed as a markup language for people. Tags are available for defining fonts and tables and colors and images and links, but they are all tags for instructing the browser how to display data. There are no tags to define actual data contained in the document.

Once Web pages became commonplace, organizations wanted to accomplish more. They wanted to take the information they were sharing between people over the Internet and share it between systems. This has proven to be a challenge using existing technologies. Too often applications have been written to read HTML delivered over the Web and to extract the data they wanted. And we thought building client-server applications meant the end of screen scraping!

  • Screen scraping is the practice of programmatically reading in a screen of data meant for human eyes and parsing through it to identify and extract the pertinent data.

The following code is a sample of how HTML might present some data to a user:

 <b><i>Last Name: </i></b>Smith 

By quickly scanning this code, you know that it will display the last name of some person named Smith. Would a computer be able to read this code and know that Smith is the last name of some person? Certainly you could design an algorithm to process this one instance, but would it also work for the following example?

 <b><i>Last Name:</i></b>Van Allen 

If the preceding algorithm parsed through the HTML code to look for Last Name, continued past the tags </i> and </b>, and read Smith, would it have stopped at the first space it read? If so, you would get a last name of Van from this line of code. Assuming your algorithm dealt with that scenario, would it have handled this?

 <b><i>Last Name:</b></i>Van Allen 

You can imagine the additional issues that would come from adding font treatments, table elements, or rearranging the page layout. The point is that HTML was designed as a very flexible language that could easily change the presentation of a document. It was designed for people, not computers.
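The fragility described above can be sketched in a few lines of Python. The pattern and the function name scrape_last_name are hypothetical, chosen only for illustration. The scraper assumes that the closing tags always appear in the order </i></b> and that a last name ends at the first space.

```python
import re

# Hypothetical naive scraper: it expects the exact closing-tag order
# </i></b> and treats the first space as the end of the name.
PATTERN = re.compile(r"Last Name:\s*</i></b>(\S+)")


def scrape_last_name(html):
    """Return the text the naive pattern captures, or None if it finds nothing."""
    match = PATTERN.search(html)
    return match.group(1) if match else None


samples = [
    "<b><i>Last Name: </i></b>Smith",
    "<b><i>Last Name:</i></b>Van Allen",
    "<b><i>Last Name:</b></i>Van Allen",
]
results = [scrape_last_name(sample) for sample in samples]
print(results)
```

The first sample yields the full name, the second is cut off at the space, and the third yields nothing because the closing tags are in a different order. Each fix to the pattern only covers one more presentation variant.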

For programmatic access to information, we needed more structure. An application doesn't care about the treatment of information; it just wants to be able to work with the data. To do so, the application has to have the ability to interpret the data. The current set of HTML tags does not provide that functionality.

Additionally, the forgivingness of HTML (the tag structure necessary for validity in HTML is very loose) made it convenient for a whole industry of developers to learn a new language, but that approach is counterproductive to the use of markup languages by applications. If a piece of information is supposed to be annotated by a beginning tag and an ending tag, then it must be that way to work. Guessing at the intention programmatically is not only difficult, but also risky.

Along came XML to help with this problem. It was designed to address these issues by describing the data itself and not the presentation of the data. This approach obviously gives much more consideration to presenting information to a computer system than to a human user.

Note 

When you realize the distinction between the intended use of XML and HTML, you realize just how inaccurate it is to think that XML will replace HTML.

Now that we understand why XML was developed and for what it was intended, let's roll up our sleeves and get into it!

XML Structures

When writing in a metalanguage, first you need to consider the components involved. All XML code is defined through documents, elements, and attributes. For an XML data set to be well-formed, it must use these components in the correct manner. Otherwise, any XML parser will reject it.

  • Well-formed XML is XML that complies with the syntax rules of the XML specification. Well-formed XML is often confused with valid XML, but the two are very different things. Valid XML refers to XML that complies with any declared definition(s).

Elements

A data point in XML is called an element. Elements are the basic building blocks for XML. They are used to define every piece of data. The following code is an example of an element called "city":

 <city>Dallas</city> 

Tags

You'll notice that the element is annotated with two tags. The metadata in each tag is often called the element name. However, as we will see later, this may be misleading if additional information is provided in the tag.

A distinction is also made between the two tags surrounding the element data. The start tag is the XML code preceding the data, and the end tag is the segment of XML code following the data. Together, these tags define an element of XML. At a minimum, the difference between the end tag and the start tag is the / sign preceding the metadata in the end tag. This is standard notation. Additionally, any attributes that are defined for the element will be declared only in the start tag, not in the end tag. We will look at attributes in the next section.
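As a small illustration of how a parser sees these pieces, here is a sketch using Python's standard xml.etree.ElementTree module. The choice of Python and of this module is an assumption made for the example; any XML parser exposes the same two parts.

```python
import xml.etree.ElementTree as ET

# Parse the "city" element used in the text above.
city = ET.fromstring("<city>Dallas</city>")

element_name = city.tag   # the metadata shared by the start and end tags
element_data = city.text  # the data between the two tags
print(element_name, element_data)
```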

Tip 

As we go through some XML in this book, you may want to modify and test some of the code on your own. Several XML parsers are available, but the easiest option may be to use Internet Explorer (version 5.0 or higher) to run quick tests. (See Figure 3-1.) Internet Explorer includes the Microsoft XML parser, which has been regularly updated to keep pace with the changes made by revisions to the specification. It provides a quick view of any XML file with a collapsible/expandable view, but unfortunately the current version can check only how well-formed code is, not its validity. You can update your browser to support validation if you install the MSXML 4.0 parser.

Figure 3-1: Internet Explorer view of an XML document

Empty Elements

In the "Dallas" example, the element contains some data. However, it can also be acceptable for an element to be defined, but contain no data. Such an element, as seen below, is called an empty element.

 <city></city> 

For reasons we will discuss later, it is often better to have an empty element than simply to remove the element or leave it out. Even if there is no data, the element must have both start and end tags. Without them, it would not be well-formed XML. For example, the following code would produce an error when run through an XML parser:

 <location>
      <city>
 </location>

However, there is an acceptable shorthand method for defining elements that are present but empty. This is done by adding a / to the end of what would normally be the start tag. Here is an example of how this shorthand would look:

 <city/> 
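The three forms of the empty element can be compared with a short Python sketch. It assumes the standard xml.etree.ElementTree parser, and the helper name is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The long form and the shorthand describe the same empty element.
long_form = ET.fromstring("<city></city>")
short_form = ET.fromstring("<city/>")
same_element = (long_form.tag, long_form.text) == (short_form.tag, short_form.text)

# A start tag with no matching end tag is rejected.
missing_end_tag = is_well_formed("<location><city></location>")
print(same_element, missing_end_tag)
```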

Nesting

Any procedural or functional language depends on nesting for its code structure, and metalanguages possibly even more so than other languages. An understanding of proper nesting is crucial to writing syntactically proper XML code. If tags are not nested properly, a parsing application will have no way of understanding the relationships between the various elements in your data. This is one area in which HTML parsers are very forgiving, so you will want to make sure that you don't continue any bad habits you may have from working with HTML code.

Nesting is the idea of grouping start and end tags in the appropriate sequence. Here is an example of nesting using just two elements:

 <element>
      <nested_element>
      </nested_element>
 </element>

Notice how the nested element's start tag comes after the element's start tag. That means the nested element's end tag must come before the element's end tag.

This makes the nested element entirely contained within the element. The idea is to avoid the intersecting of elements. Here is an example of an intersection:

 <element>
      <nested_element>
      </element>
 </nested_element>

In this example the element is closed before the nested element is closed. This would cause an error in an XML parser, but would be overlooked by most HTML parsers. Because start tags can be combined and used in various ways, the key is keeping track of the end tags. Make sure that each end tag only follows its start tag mate or a nested element's end tag.
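The rule for end tags amounts to a stack: every start tag is pushed, and every end tag must match the most recently opened element. The following Python sketch shows the idea under simplifying assumptions. The regular expression and the function name first_nesting_error are hypothetical, and the pattern ignores comments, CDATA sections, and attribute values that contain a > character. A real parser does considerably more.

```python
import re

# Simplified tag pattern: optional "/", a name, optional attributes, optional "/".
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^>]*?(/?)>")


def first_nesting_error(xml_text):
    """Return the name of the first tag that breaks nesting, or None if all is well."""
    open_tags = []
    for closing, name, self_closing in TAG.findall(xml_text):
        if self_closing:
            continue                      # <name/> opens and closes at once
        if not closing:
            open_tags.append(name)        # start tag: remember it
        elif not open_tags or open_tags.pop() != name:
            return name                   # end tag does not match the latest start tag
    return open_tags[-1] if open_tags else None  # anything left was never closed


nested = "<element><nested_element></nested_element></element>"
intersecting = "<element><nested_element></element></nested_element>"
print(first_nesting_error(nested), first_nesting_error(intersecting))
```

For the intersecting example, the check stops at the end tag for element, because nested_element is still open at that point.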

Tip 

It is a good idea to use indentation to keep track of your nested pairs when writing your XML even though it makes the writing go more slowly. If you back out of an element's indentation only when the end tag closes the element, it will help you catch errors. Some editors may do this for you, although not without some inconsistency.

Element Relationships

Elements that are nested within other elements are referred to as children or child nodes. Not surprisingly, the elements that contain other elements are referred to as parent nodes.

Elements do not have to be nested inside other elements. Elements that reside at the same level are often called peers or siblings. Here is an example of two nested element siblings:

 <element>
      <nested_element/>
      <nested_element/>
 </element>

Notice that the nested elements have the same name. This is allowed and acceptable. In fact, this is an example of the extensible nature of the XML specification, which gives you a lot of flexibility. With multiple instances of the same element allowed, the structure is almost equivalent to the dynamically allocated array used in some programming languages. We will see later that additional mechanisms are available to keep our XML data from using this feature in error.
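The array comparison is easy to see from the application side. In this sketch, which assumes Python's standard xml.etree.ElementTree module, the repeated siblings come back as an ordinary list whose length is not fixed in advance.

```python
import xml.etree.ElementTree as ET

document = ET.fromstring(
    "<element><nested_element/><nested_element/></element>"
)

# findall returns every direct child with the given name, in document order.
siblings = document.findall("nested_element")
sibling_count = len(siblings)

for position, sibling in enumerate(siblings):
    print(position, sibling.tag)
```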

Attributes

A structure is available to provide more information about an element. Attributes are declared and assigned a value in the start tag of an element. The value is declared within a pair of double quotes, as shown here:

 <element attribute="value"/>

The attribute value has to be defined even if it is empty, as shown here:

 <element attribute=""/> 

Tip 

Single quotes can also be used to define the values of attributes but are merely translated into double quotes by most parsers. This can be helpful if you are building XML strings with programming languages in which double quotes have significance. Whenever feasible, you are better off sticking with just double quotes to ensure compliance with applications and maintaining consistency throughout your code.
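A quick way to confirm that the two quoting styles carry the same information is to parse both and compare the results. This sketch assumes Python's standard xml.etree.ElementTree module; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

double_quoted = ET.fromstring('<element attribute="value"/>')
single_quoted = ET.fromstring("<element attribute='value'/>")
empty_value = ET.fromstring('<element attribute=""/>')

# Once parsed, the quoting style is gone and only the name/value pair remains.
same_attributes = double_quoted.attrib == single_quoted.attrib
empty_is_defined = empty_value.get("attribute") == ""
print(double_quoted.attrib, same_attributes, empty_is_defined)
```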

Attributes allow you to define information without using extra elements and thus take less space. An attribute requires less space to declare and define because there are no start and end tags. However, some people think that attributes do not provide any advantages because they are more difficult to parse and work with. We will look at some of these arguments when we start working with XML data in the next chapter, but remember that you can choose whether to define your data through attributes or elements. Let's look at some examples.

Data defined through XML elements only:

 <customer>
      <firstname>John</firstname>
      <lastname>Smith</lastname>
      <birthday>071671</birthday>
      <favorite_sport>hockey</favorite_sport>
 </customer>

The same data defined through XML attributes:

 <customer firstname="John" lastname="Smith" birthday="071671"
   favorite_sport="hockey"/>

If you look closely at both sets of code, you will notice that they both expose the same data set. However, the strict element approach takes 140 characters, while the attribute approach takes only 87 characters. That is a difference of about 38 percent. Depending on your priorities and situation, this may be pertinent to your application. I recommend reserving judgment until we try working with the XML built with each of these approaches.
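The character counts can be reproduced with a short Python sketch. It assumes that the white space used to lay out the element version is not counted and that the attribute version is written on one line with single spaces between attributes. The helper name markup_length is made up for this example.

```python
import re

element_form = """
<customer>
     <firstname>John</firstname>
     <lastname>Smith</lastname>
     <birthday>071671</birthday>
     <favorite_sport>hockey</favorite_sport>
</customer>
"""

attribute_form = (
    '<customer firstname="John" lastname="Smith" '
    'birthday="071671" favorite_sport="hockey"/>'
)


def markup_length(xml_text):
    """Count characters after dropping layout white space between tags."""
    return len(re.sub(r">\s+<", "><", xml_text.strip()))


element_chars = markup_length(element_form)
attribute_chars = markup_length(attribute_form)
savings_percent = round((element_chars - attribute_chars) / element_chars * 100)
print(element_chars, attribute_chars, savings_percent)
```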

Tip 

Referencing the metadata within a tag as the name of the element is complicated if you have an attribute of that element defined as "name." For instance, in the tag <customer name="John Smith"/>, is the name of the element "customer" or "John Smith"? I recommend avoiding such an attribute for this reason along with the fact that ambiguously named attributes have limited value.

Documents

Up to this point, we have been talking about XML as free-floating code. One aspect of XML that differentiates it from many other data languages is its top-level structure. Any well-formed XML must always have a single root-level element encapsulating the entire data set. This qualifies an XML data set as a document.

Note 

Information can be provided prior to the root node in a well-formed XML document. This can largely be regarded as optional header information, which I will discuss in the next section.

The Document Node

Often the top-level element is called the document node. Here is an example of an XML document:

 <document>
      <element>
           <nested_element>
           </nested_element>
      </element>
      <element attribute="value"/>
 </document>

The Invalid Document

Looking at just this structure of XML data, it may seem hard to conceive of another structure in which the XML data would not be considered a well-formed document. To reinforce this definition, here is an example of XML data that is not well formed because it is not in a document.

 <customer>
      <firstname>John</firstname>
      <lastname>Smith</lastname>
 </customer>
 <order>
      <number>1198054</number>
      <total>157.45</total>
 </order>

Here we have some XML defining a customer and an order, presumably related to the customer. We can only presume, because no document node is provided to tell us explicitly that they are related. So we see here an example of what is not a document as well as why we have documents. With a root node, a parser can theoretically look at the first node of an XML document and discern what the information contained pertains to. This helps us to make our algorithms a little more efficient so that we can avoid spending time digging into XML and finding nothing of value.
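A parser makes the same point quickly. The sketch below assumes Python's standard xml.etree.ElementTree module, and the helper name parses is hypothetical. It feeds the parser the two sibling elements on their own and then the same text wrapped in a single root element.

```python
import xml.etree.ElementTree as ET

fragment = (
    "<customer><firstname>John</firstname><lastname>Smith</lastname></customer>"
    "<order><number>1198054</number><total>157.45</total></order>"
)


def parses(xml_text):
    """Return True if the text is accepted as a single well-formed document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


without_root = parses(fragment)                            # two top-level elements
with_root = parses("<document>" + fragment + "</document>")  # one root element
print(without_root, with_root)
```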

The Well-formed Document

To make the preceding XML a well-formed document, we have a couple of choices. One is to simply add a root-level node, as shown here:

 <document>
      <customer>
           <firstname>John</firstname>
           <lastname>Smith</lastname>
      </customer>
      <order>
           <number>1198054</number>
           <total>157.45</total>
      </order>
 </document>

Adding a root-level node made this XML well formed, but it is really the easy way out. If we put a little more thought into it, we can come up with a more usable solution. For instance, since the order belongs to the customer, we could just make the customer the root node and relate everything to that entity. This way a parser will know at the first level what information it can find in the document. Here is an example:

 <customer>
      <firstname>John</firstname>
      <lastname>Smith</lastname>
      <order>
           <number>1198054</number>
           <total>157.45</total>
      </order>
 </customer>

In this implementation, we have also chosen to nest order information as subelements under the root customer node. Other implementations could have the order number and total listed as distinct elements directly under the customer node, as shown next. While this data set provides the same information, it does so in a subtly different way that may have bigger implications than you realize.

 <customer>
      <firstname>John</firstname>
      <lastname>Smith</lastname>
      <order_number>1198054</order_number>
      <order_total>157.45</order_total>
 </customer>

One advantage of the document approach is that we are helping our applications to discern, through the parser, what type of data is contained in the document. This same approach can and should be carried throughout the document as a best practice. Whenever you group information in categories, you limit the amount of "crawling" your applications have to do to work with the data. This concept is very similar to the idea of normalizing a database. XML can, after all, be considered a flat file representation of a data source.

  • Normalization refers to the practice of designing a data model based on normal forms that eliminate redundant data by defining relationships.
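To make the difference in "crawling" concrete, here is a sketch of how an application might pull the order data out of each layout. It assumes Python's standard xml.etree.ElementTree module, and the order_ prefix test used for the flat layout is an assumption about how such data would have to be picked out.

```python
import xml.etree.ElementTree as ET

nested = ET.fromstring(
    "<customer><firstname>John</firstname><lastname>Smith</lastname>"
    "<order><number>1198054</number><total>157.45</total></order></customer>"
)
flat = ET.fromstring(
    "<customer><firstname>John</firstname><lastname>Smith</lastname>"
    "<order_number>1198054</order_number><order_total>157.45</order_total></customer>"
)

# Grouped layout: one lookup returns the whole order.
order = nested.find("order")
nested_order = {child.tag: child.text for child in order}

# Flat layout: every child of customer must be inspected by name.
flat_order = {
    child.tag: child.text for child in flat if child.tag.startswith("order_")
}
print(nested_order, flat_order)
```

In the grouped layout the application asks for the order and receives everything that belongs to it. In the flat layout it has to know, or guess, which of the customer's children describe the order.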

The Structure of Your Data

Just as you can overnormalize a database, you can "overtier" your XML data set. For instance, if we take our customer example, we can break it down as shown here:

 <customer>
      <name>
           <first>John</first>
           <last>Smith</last>
      </name>
      <order>
           <number>1198054</number>
           <total>157.45</total>
      </order>
 </customer>

Here we are pulling the first and last name into a subelement called name. Hopefully this seems pretty ridiculous for this scenario because it is overkill. To help discern when information should be broken out and contained in a subelement, I try to follow a few rules:

  • The data can be grouped into a sensible category.

  • The data in the category may possibly change.

  • The data is typically accessed as a group and rarely accessed separately.

  • The data is optional for some processes needing other information in the document.

  • The data never or only occasionally contains just one nonempty node.

If we apply these rules to the data set we have been working with, we can probably agree that the following is the most appropriate structure for our document:

 <customer>
      <firstname>John</firstname>
      <lastname>Smith</lastname>
      <order>
           <number>1198054</number>
           <total>157.45</total>
      </order>
 </customer>

The XML Declaration

The XML specification also recommends that you include an XML declaration at the top of a document. This is encouraged, but is not a requirement, since no parser will produce an error from the lack of such a declaration. When you use an XML declaration, it must be the very first thing in the document, before any elements, comments, or white space.

This declaration would read as shown here:

 <?xml version="1.0"?> 

This is basically the equivalent of the standard HTML header tag, as shown here:

 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> 

Note 

If and when the XML specification is updated and backward compatibility is not maintained, this declaration might play a much more vital role. Until then, it can be treated as optional header information.

The Text Declaration

One piece of additional information that can be specified in the XML declaration is the text declaration. This contains the encoding type of the document. Usually you will use one of the standard character encoding types of UTF-8 and UTF-16, which all parsers must support by default and which do not need to be explicitly declared. Other encoding schemes you can specify include ISO-10646-UCS-2, ISO-10646-UCS-4, and ISO-8859-2. Any encoding type specified by the Internet Assigned Numbers Authority (IANA) is recommended. Only one encoding type can be specified per document. The declaration would look like this:

 <?xml version="1.0" encoding="EUC-JP"?>

  • Encoding is the process of defining data on a binary level.
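The effect of the declaration can be seen by handing a parser raw bytes. This sketch assumes Python's standard xml.etree.ElementTree module. The ISO-8859-1 encoding and the sample city name are assumptions chosen for the example, not values taken from the text above.

```python
import xml.etree.ElementTree as ET

# Assumption: ISO-8859-1 is used here only because it makes a compact example.
city_name = "M\u00e1laga"  # contains one non-ASCII character
raw = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><city>' + city_name + "</city>"
).encode("iso-8859-1")

# The parser reads the declaration and decodes the bytes accordingly.
city = ET.fromstring(raw)
decoded_name = city.text

# A declaration that is not at the very start of the document is rejected.
try:
    ET.fromstring(b' <?xml version="1.0"?><city/>')
    leading_space_ok = True
except ET.ParseError:
    leading_space_ok = False
print(decoded_name, leading_space_ok)
```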

Syntax Rules

Now that we have an idea of how to structure our XML documents and the data in them, let's take a look at some of the syntax rules.

Character Conventions

First, we need to cover the character conventions supported by XML. The most common issue with new XML developers is the use of spaces in tag names. Spaces are not legal characters in XML element names. This can be disturbing unless you realize the reason. Consider the following two lines of code:

 ...<first name/>...
 ...<first name=""/>...

When XML parsers encounter a space after the element name, they think you are declaring an attribute. Of course you have to define the value even if it is null, and the parser has no way to discern how the data should be interpreted.
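The two lines above can be run through a parser to see the difference. The sketch assumes Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# "name" is read as an attribute with no value, which is not well formed.
try:
    ET.fromstring("<first name/>")
    space_in_name_ok = True
except ET.ParseError:
    space_in_name_ok = False

# With a value supplied, the same text is an element "first" with attribute "name".
with_attribute = ET.fromstring('<first name=""/>')
print(space_in_name_ok, with_attribute.tag, with_attribute.attrib)
```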

Lowercase Code

Notice that in all of these examples I use lowercase letters. XML is case sensitive, and I prefer to avoid problems in that area by just using lowercase letters when possible. For more advanced or complex names, I prefer to concatenate them with uppercase characters or use the underscore character for readability. The important thing, obviously, is to standardize on an approach in your environment to keep down the number of errors.
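Case sensitivity shows up in two places: matching start and end tags, and looking elements up by name. The following sketch assumes Python's standard xml.etree.ElementTree module, and the location wrapper element is made up for the example.

```python
import xml.etree.ElementTree as ET

# A start tag and an end tag that differ only in case do not match.
try:
    ET.fromstring("<City>Dallas</city>")
    mixed_case_ok = True
except ET.ParseError:
    mixed_case_ok = False

# Lookups are case sensitive as well.
location = ET.fromstring("<location><city>Dallas</city></location>")
lower_match = location.find("city")
upper_match = location.find("City")
print(mixed_case_ok, lower_match is not None, upper_match is not None)
```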

Special Characters and the CDATA Tag

As usual, there are some special characters that we need to be aware of. These are special characters because of the meaning they have for XML in defining entities and values. These characters need to be replaced by numeric references.

Table 3-1 identifies these special characters and the numeric equivalents that can be used in their stead.

Table 3-1: XML Special Characters

CHARACTER    NAME            NUMERIC REFERENCE
<            less than       &#60;
>            greater than    &#62;
"            double quote    &#34;
&            ampersand       &#38;
'            apostrophe      &#39;

Another way to contain the special characters in a nonfunctional mode is by using the CDATA tag. This tag tells the parser to treat the entire content as character data and not markup. The CDATA tag uses square brackets to enclose that content, tags included, as shown here:

 <![CDATA[<city>San Jose</city>]]> 

This tag should not be confused with the element tag because it cannot be nested.
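Both approaches can be checked against a parser. The sketch below assumes Python's standard xml.etree.ElementTree module. The mapping and the helper name to_numeric_references are hypothetical, and the note and company element names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Decimal character references for the five special characters.
NUMERIC_REFERENCES = {
    "<": "&#60;",
    ">": "&#62;",
    '"': "&#34;",
    "&": "&#38;",
    "'": "&#39;",
}


def to_numeric_references(text):
    """Replace each special character with its numeric reference in one pass."""
    return text.translate(str.maketrans(NUMERIC_REFERENCES))


# Numeric references and a CDATA section deliver the same character data.
referenced = ET.fromstring(
    "<note>" + to_numeric_references("<city>San Jose</city>") + "</note>"
)
cdata = ET.fromstring("<note><![CDATA[<city>San Jose</city>]]></note>")
same_text = referenced.text == cdata.text

company = ET.fromstring(
    "<company>" + to_numeric_references("AT&T") + "</company>"
)
print(cdata.text, same_text, company.text)
```

In both cases the application receives the text <city>San Jose</city> as data. No city element is created.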

Comments

Comments should also be a part of any language's syntax, and XML is no different. Fortunately the XML comment syntax is the same as HTML's, so if you are familiar with that, you don't have to learn anything new! Comments are handled through a single tag along with some additional annotations. Here is a sample:

 <!-- This is a comment --> 

Special HTML Tags

Special HTML tags are usually not valid in XML. For example, the nonbreaking space special character in HTML is "&nbsp;", which produces an error in an XML parser because XML does not predefine that entity. There are some special tags for XML that allow you to accomplish many of the same things. We will look at some of those a little later.

Note 

This issue may seem rather moot since XML wasn't intended for presentation delivery. However, this becomes an issue once you get into a situation where your HTML needs to be XML compliant.
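The nonbreaking space case is easy to reproduce. This sketch assumes Python's standard xml.etree.ElementTree module, and the p element is made up for the example. The numeric reference &#160; is shown as one possible substitute. It is an assumption of this example, not necessarily the technique covered later.

```python
import xml.etree.ElementTree as ET

# The HTML entity is not defined in plain XML, so the parser rejects it.
try:
    ET.fromstring("<p>Fundamental&nbsp;Web</p>")
    nbsp_ok = True
except ET.ParseError:
    nbsp_ok = False

# A numeric character reference for the same character is accepted.
numeric = ET.fromstring("<p>Fundamental&#160;Web</p>")
print(nbsp_ok, repr(numeric.text))
```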

While we now understand how to build XML data sets, we need to establish definitions for our documents. This is important not only for maintaining consistency in our own usage, but also for communicating data to our partners. If we communicate a standard definition for each document, others will be able to parse and understand what our data means. This is a critical component to the implementation of another's Web service.




Architecting Web Services
ISBN: 1893115585
Year: 2001
Pages: 77
