9.3 The All-Powerful Wizard of XML | Internet-Enabled Business Intelligence



	Internet-Enabled Business Intelligence By William A. Giovinazzo
	Chapter 9. eXtensible Markup Language

9.3 The All-Powerful Wizard of XML

The previous sections describe the wonderfulness that is XML. Actually, there is a problem with XML: it is very simple. In fact, it is so simple that people like me have a difficult time writing substantive books on it. How do you fill 700 pages with something as easy to understand as XML? Well, in this section, we will show you how to put together an XML document. There are many very good books on XML that provide all sorts of information about the markup language, but understanding the guts of XML, the truly substantive parts, takes just a few pages. We'll prove it by doing so here.

As we have already stated several times in this chapter, XML provides a means to create a structure into which you place your document. There are three simple syntactical rules to follow to create this structure. The structure is a tree consisting of a root with elements, attributes, and text. It begins at the root, from which extend branches, which are the elements. These branches can lead to other branches or to leaves, which ultimately contain the data. The rules to create the structure are as follows:

Everything is based on one and only one root.
Any tag that is opened must also be closed.
Attributes must be quoted.
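
To see the three rules at work, here is a minimal sketch in Python that hands a few small documents to the standard library parser, xml.etree.ElementTree, and reports whether each one is accepted. The sample documents and the helper name is_well_formed are made up for illustration; they are not taken from the figures in this chapter.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a single XML tree."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample documents: one that obeys the rules, one violation per rule.
samples = {
    "follows all three rules": '<address type="home">12 Main St</address>',
    "two roots": "<address>12 Main St</address><address>9 Elm St</address>",
    "tag never closed": "<address><city>Springfield</address>",
    "unquoted attribute": "<address type=home>12 Main St</address>",
}

for label, text in samples.items():
    print(label, "->", is_well_formed(text))
```

Only the first sample should be accepted; each of the others breaks exactly one of the rules above, and the parser refuses to build a tree from it.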

The simplicity of these rules allows almost limitless variations. We may wish, for example, to create a structure that defines an address. In Figure 9.4, we have diagrammed the address structure and shown the XML that describes this structure. This is a simple structure composed of one tag, address. To create this structure, we open a tag and then close it. The tag is opened with its name in angle brackets: <tagname>. The tag is then closed with angle brackets as well, but with a slash preceding the tag name: </tagname>. Each such pair of tags creates one component of the structure. In our example, we have just one component, address. All the data for the address is within this one structure. In some cases, this would be fine, but what if we were interested in the components of the address? We are no better off with this structure than if we had simply written the address as free-form text.
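
The limitation of the one-tag structure is easy to demonstrate. The sketch below parses a single address element with Python's standard library; the address text itself is a made-up example, since the exact content of the figure is an assumption here.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-tag address; the figure's actual text may differ.
simple = "<address>John Smith, 12 Main Street, Springfield, IL 62701</address>"

root = ET.fromstring(simple)
print(root.tag)    # the only named component in the structure
print(len(root))   # number of child elements under the root
print(root.text)   # the whole address, as one undivided string
```

The parser gives us one element with no children, so picking out the city still means cutting up a string by hand.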

Figure 9.4. Simple address structure.


In order to access the components of the address, we can define a more elaborate structure. This is shown in Figure 9.5. Here we see a complete layout of the address with each component fully expanded. In this example, we have a tree-like structure where Address is the root. From this root the tree expands to the different branches of Name, House, and Locality. Each branch expands to the individual data elements, or leaves.

Figure 9.5. A complete address.


By simply parsing this XML document, we can create the structure of the address. This structure of course makes it much simpler to extract the individual elements of the address. If, for example, we wish to extract the city or zip code from the address, we can do so easily. This gives us a great advantage over simple text blocks or even the previous XML address structure. No longer are we dealing with one amorphous blob of ASCII data, but with a well-defined structure whose individual components have a distinct beginning, middle, and end. Applications working with this document can easily parse the document correctly. Once the document is parsed, the receiving application can apply the appropriate style sheets or embed any desired link.
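
As a rough illustration of that extraction, the following Python sketch parses an expanded address and pulls out the city and zip code by path. The branches Name, House, and Locality come from the description above; the leaf names (First, Last, Number, Street, City, State, Zip) and all the data values are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Branch names follow the text; leaf names and values are hypothetical.
detailed = """
<Address>
  <Name><First>John</First><Last>Smith</Last></Name>
  <House><Number>12</Number><Street>Main Street</Street></House>
  <Locality><City>Springfield</City><State>IL</State><Zip>62701</Zip></Locality>
</Address>
"""

root = ET.fromstring(detailed)

# The branches directly under the root.
branches = [child.tag for child in root]

# Individual leaves are addressed by their path from the root.
city = root.findtext("Locality/City")
zip_code = root.findtext("Locality/Zip")

print(branches)
print(city, zip_code)
```

No string slicing is needed: each component is reached by name, which is the advantage the expanded structure has over the one-tag version.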

This document is described as well-formed; that is, it meets all the syntactical XML requirements. First, the structure of the document has one root, Address. Second, every tag that has been opened is also closed (unless I made a syntactical error that proofreading did not catch). The third rule, concerning attributes, is inapplicable at this point, since the address does not have any attributes. Note that well-formedness is the base level for XML documents. A well-formed document creates a tree-like structure. Documents that are not well-formed do not create this structure and cannot be considered XML documents.

In reviewing the example, we see that there is really no way with a well-formed document to verify the structure itself. The example does not include the country of the address. This may be correct, but then again, it may not be. There is no arbiter of structure in a well-formed document to mandate what should and shouldn't be included in a structure. The receiving application assumes that a well-formed XML document is correct.

A valid document is a well-formed document whose structure is defined in a DTD and conforms to that definition. Whether to require XML documents to be valid or merely well-formed is up to the individual developer. As with all design issues, the system architect needs to determine the need, as well as the cost, for meeting the higher standard. If the cost of dealing with well-formed yet invalid documents is less than the cost of requiring validity, then it is probably not worth enforcing the higher standard.

The DTD provides a definition of our document's structure. It enables a parser to understand the structural pieces of the document and make them available to the receiving application. In Figure 9.6, we see the DTD for an annual report.

Figure 9.6. DTD of an annual report.

 <!-- Root element: a header, one or more business units, and a corporate summary -->
 <!ELEMENT report        (header, unit+, corp-sum)>
 <!-- year_qtr is a required attribute of the report element -->
 <!ATTLIST report        year_qtr CDATA #REQUIRED>
 <!ELEMENT header        (rpt_title, date, executive-sum)>
 <!ELEMENT rpt_title     (#PCDATA)>
 <!ELEMENT date          (#PCDATA)>
 <!ELEMENT executive-sum (title, paragraph+)>
 <!ELEMENT paragraph     (#PCDATA)>
 <!ELEMENT unit          (title, overview, table)>
 <!ELEMENT overview      (paragraph+)>
 <!ELEMENT table         (head-row, data-row+)>
 <!ELEMENT head-row      (item-col, data-col)>
 <!ELEMENT item-col      (#PCDATA)>
 <!ELEMENT data-col      (#PCDATA)>
 <!ELEMENT data-row      (item-name, data-value)>
 <!ELEMENT title         (#PCDATA)>
 <!ELEMENT item-name     (#PCDATA)>
 <!ELEMENT data-value    (#PCDATA)>
 <!ELEMENT corp-sum      (title, overview)>

We see in the figure above how to declare the document and the structures that make up the document. Each element declaration follows the same basic format.

 <!ELEMENT name (content specification)>

where

<!ELEMENT
    Key phrase specifying that this is an element type declaration.
name
    The name of the element being declared.
(content specification)
    Specifies the content of the element. An element can contain
    - Other elements: In this case, the names of the contained elements are enclosed in the parentheses. Multiple elements are delimited by commas. This is demonstrated in the first line of the example with the declaration of the report element.
    - EMPTY: Signifies that the element is empty. This keyword is written without parentheses.
    - #PCDATA: Signifies that the element contains parsed character data. (CDATA, unparsed character data, is a type for attribute values; it cannot be used as the content specification of an element.)
content specification+
    A plus sign after an item signifies that there can be one or more occurrences of it. This can be seen in the declaration of the executive summary element, which allows multiple paragraphs to be included within the executive summary.
>
    Closes the element declaration.

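Because every element declaration has the same shape, the pieces can be picked apart mechanically. The Python sketch below does so with a regular expression for three declarations copied from the report DTD. It is only an illustration of the format: the pattern and the helper name read_declarations are our own, and the pattern handles just the simple parenthesized form, not EMPTY, nested groups, or attribute lists.

```python
import re

# Matches the basic form: <!ELEMENT name (content specification)>
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([\w.-]+)\s+\(([^)]*)\)\s*>")

dtd_fragment = """
<!ELEMENT header    (rpt_title, date, executive-sum)>
<!ELEMENT rpt_title (#PCDATA)>
<!ELEMENT overview  (paragraph+)>
"""

def read_declarations(text):
    """Map each declared element name to the items of its content specification."""
    declared = {}
    for name, spec in ELEMENT_DECL.findall(text):
        # Items inside the parentheses are delimited by commas.
        declared[name] = [item.strip() for item in spec.split(",")]
    return declared

decls = read_declarations(dtd_fragment)
for name, items in decls.items():
    print(name, "contains", items)
```

A real DTD processor does far more than this, but the sketch shows how little there is to the declaration syntax itself: a name, and a comma-delimited list of what goes inside it.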
We also see in this DTD a specification for an attribute. Written in full XML syntax, the statement that specifies the attribute is:

 <!ATTLIST report year_qtr CDATA #REQUIRED>

The declaration names the element that carries the attribute (report), then the attribute itself (year_qtr), its type (CDATA), and whether it must be supplied (#REQUIRED). The attribute allows the developer to add explanatory notes to the document; in effect, metadata. The attribute is not part of the content of the element, but a way of providing descriptive information about the XML document or the specific element. In this particular example, the attribute year_qtr provides information on the fiscal year and quarter of the report. Often, the attributes provide hyperlinks to other data elements.

The DTD, however, simply describes the structure of the document; it does not provide the content. Figure 9.7 charts this structure. The structure tells us what components are expected and how they fit together. The content of this structure is something else completely. We present the content of the quarterly report in Figure 9.8. Here we can see part of the appeal of XML. The report can be easily archived within a database. Later, applications can compare individual components of this report with past reports. It may be desirable to break up the report and distribute just the corporate summary along with the individual departments' results. The point here is that we now have a structure with which we can work.

Figure 9.7. Quarterly report structure.


The DTD provides a means for the application receiving the document to determine if the structure is complete. In order for a document to be valid, it must conform to its DTD. Documents can, however, meet the standard of well-formed without a DTD. Each has its purpose. Often, when working over the Internet, an application may not have access to a DTD. There may be nothing wrong with the actual document or with the XML structure it describes; it simply may not have an associated DTD. In such cases, meeting the well-formed standard may suffice.
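
The difference between the two standards can be sketched in a few lines of Python. The standard library parser used here does not validate against a DTD, so the check below is a hand-written stand-in: the CONTENT_MODEL table transcribes two declarations from the report DTD by hand, and the function matches_model is our own simplified checker that understands only ordered lists of children and a trailing plus sign. It is not a substitute for a validating parser.

```python
import xml.etree.ElementTree as ET

# Hand-transcribed stand-in for two DTD declarations:
#   corp-sum (title, overview)   and   overview (paragraph+)
CONTENT_MODEL = {
    "corp-sum": ["title", "overview"],
    "overview": ["paragraph+"],
}

def matches_model(element):
    """Check an element's children, in order, against CONTENT_MODEL (simplified)."""
    model = CONTENT_MODEL.get(element.tag)
    if model is None:
        return True  # nothing declared in this sketch, so accept as-is
    children = [child.tag for child in element]
    pos = 0
    for item in model:
        name = item.rstrip("+")
        if pos >= len(children) or children[pos] != name:
            return False  # an expected child is missing or out of order
        pos += 1
        if item.endswith("+"):
            # A plus sign allows further occurrences of the same child.
            while pos < len(children) and children[pos] == name:
                pos += 1
    if pos != len(children):
        return False  # extra children that the model does not allow
    return all(matches_model(child) for child in element)

# Both documents parse, so both are well-formed.
complete = ET.fromstring(
    "<corp-sum><title>A Healthy Outlook</title>"
    "<overview><paragraph>A very good quarter.</paragraph></overview></corp-sum>"
)
missing_title = ET.fromstring(
    "<corp-sum><overview><paragraph>A very good quarter.</paragraph></overview></corp-sum>"
)

print(matches_model(complete))
print(matches_model(missing_title))
```

Both documents get through the parser without complaint. Only the structural check notices that the second one has lost its title, which is exactly the job the DTD does for a receiving application.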

Figure 9.8. Quarterly report.

 <report year_qtr="2001">
   <header>
     <rpt_title>Big Time Corporation Third Quarter Report</rpt_title>
     <date>March 17, 2001</date>
     <executive-sum>
       <title>Another Record Breaking Quarter</title>
       <paragraph>This had been a really terrific.. </paragraph>
       <!-- ..more paragraphs.. -->
     </executive-sum>
   </header>
   <unit>
     <title>Western Sales Region</title>
     <overview><paragraph>..The west is wild with sales ..</paragraph></overview>
     <table>
       <head-row>
         <item-col>Products</item-col><data-col>Units Sold</data-col>
       </head-row>
       <data-row>
         <item-name>Tapioca</item-name><data-value>4000</data-value>
       </data-row>
       <!-- ..more data rows.. -->
     </table>
   </unit>
   <!-- ..more business units.. -->
   <corp-sum>
     <title>A Healthy Outlook</title>
     <overview><paragraph>..This has been a very good quarter ..</paragraph></overview>
   </corp-sum>
 </report>
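
To show what working with this structure might look like, here is a short Python sketch that parses a trimmed copy of the report and pulls out a few pieces, including the corporate summary on its own. The trimmed copy rests on two assumptions: that year_qtr is written as an attribute of the report element, and that the figure 4000 is wrapped in a data-value element, as the DTD requires.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the quarterly report; see the assumptions noted above.
report_xml = """
<report year_qtr="2001">
  <header>
    <rpt_title>Big Time Corporation Third Quarter Report</rpt_title>
    <date>March 17, 2001</date>
  </header>
  <unit>
    <title>Western Sales Region</title>
    <table>
      <data-row><item-name>Tapioca</item-name><data-value>4000</data-value></data-row>
    </table>
  </unit>
  <corp-sum>
    <title>A Healthy Outlook</title>
  </corp-sum>
</report>
"""

report = ET.fromstring(report_xml)

# Metadata carried by the attribute, and content carried by the elements.
year_qtr = report.get("year_qtr")
title = report.findtext("header/rpt_title")

# Collect every data row in the report, whichever unit it belongs to.
units_sold = {
    row.findtext("item-name"): int(row.findtext("data-value"))
    for row in report.iter("data-row")
}

# Break the corporate summary out as a document fragment of its own.
corp_sum = ET.tostring(report.find("corp-sum"), encoding="unicode")

print(year_qtr, title)
print(units_sold)
print(corp_sum)
```

The same few calls would serve to archive the pieces in a database or to compare one quarter's rows with another's.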

We discussed earlier how the XML standard can be extended for specific industries and applications. The creation of a DTD is how we achieve this extension. Organizations wishing to create a standard by which applications can communicate over the Internet create and distribute the DTD for that standard. Documents are passed and validated against this DTD. For example, if we wish to establish an electronic data interchange (EDI) standard so that we can communicate more easily with both customers and suppliers, we simply create a DTD that describes the documents we expect to receive and transmit. We now have a common language by which our organizations can communicate with one another.

