Making Sense of XML Data


When you look at XML data for the first time, it can be somewhat confusing. Unlike Excel worksheets and Access data tables, which present data in well-defined rows, columns, and cells, XML data can run together in a big chunk or be indented with lots of white space. Pieces of XML data are delimited by a number of symbols, such as greater-than (>), less-than (<), and forward slash (/) symbols, unfamiliar character sequences such as &amp; and &gt;, and lots of quotation marks. Understanding the W3C XML rules and how these symbols are used in XML is the first task you must master before you can create, read, analyze, and exchange XML data.

To familiarize yourself with what XML data looks like, use Internet Explorer to open and view the CustList.xml file in the Chap10 folder. We’ll use this file in several of the exercises in this chapter. As you learn more, the structure and meaning of the XML document will become clearer.

Note

To view XML files in Internet Explorer, you must be using version 5 or later.

Note

For the definitive explanation of the XML rules, you can go to the World Wide Web Consortium’s (W3C) Web site at http://www.w3.org/xml. You should be aware that these rules are written in a very precise, complex manner. Reading them is similar to reading architectural blueprints: if you aren’t a trained architect, it takes some patience to understand blueprints, and unless you understand technical notation such as Backus-Naur form (BNF), reading the W3C XML rules can similarly test your patience!

Basic XML Terminology

As with any computing technology, XML includes terminology that describes and simplifies what can seem at first to be hard-to-understand concepts. In this section, I’ll highlight some of the key XML terms that you should become familiar with.

The term XML data is used to define the letters, numbers, and symbols that follow the rules of XML. A group of XML data can be stored in an XML document, such as a text file with the file extension .xml. An XML document that contains XML data that adheres to the rules of XML is referred to as a well-formed XML document. When a well-formed XML document adheres to one or more XML schemas, the XML document is also known as a valid XML document.

XML documents consist of one or more elements. An XML element consists of the element’s name, zero or more element characteristics (or properties) known as attributes, and possibly some content. Content can consist of data such as letters, numbers, or symbols, and possibly even more elements (known as child elements) that contain even more data, and so on. Every element is defined by a start tag, which begins with the symbol < and ends with the symbol >, and an end tag, which begins with the symbols </ and ends with the symbol >. For instance, a simple firstname element would be represented as <firstname>Paul</firstname>. A language attribute could be added to make the firstname element look like this: <firstname language="us-en">Paul</firstname>. Elements that contain only the element’s name or just attributes (these elements are known as empty elements) can be defined by a single shorthand tag that begins with the symbol < and ends with the symbols /> instead of using adjoining start and end tags. For example, the simplest form of an empty address element would be represented as <address></address> or just <address/>.
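The parts of an element described above can be inspected with a short Python sketch. It assumes Python's standard xml.etree.ElementTree module, which this book does not use; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Parse the firstname element from the text, including its language attribute.
firstname = ET.fromstring('<firstname language="us-en">Paul</firstname>')

name = firstname.tag        # the element's name
attrs = firstname.attrib    # the element's attributes, as a dictionary
content = firstname.text    # the element's character content

# The two spellings of an empty element parse to equivalent elements:
# neither has child elements or text.
long_form = ET.fromstring("<address></address>")
short_form = ET.fromstring("<address/>")
both_empty = (
    len(long_form) == 0 and len(short_form) == 0
    and not long_form.text and not short_form.text
)

print(name, attrs, content, both_empty)
```

Note that the attribute value must be written with straight quotation marks in the XML text itself; typographic (curly) quotes are not accepted by XML parsers.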

Namespaces can be used to distinguish identical element names that are used by two or more schemas referred to by a single XML document. Namespaces consist of two required parts. The first is the namespace prefix, a series of characters that’s used to identify the namespace. The namespace prefix, along with a colon symbol, precedes an element’s name to indicate that the element is defined by that particular namespace. The second required part is a Uniform Resource Identifier (URI), which serves as the full, unique name of the namespace that the prefix stands for. A URI can be any series of letters, numbers, and symbols that are reasonably assured to be unique across all space and time. By convention, a URI consists of an organization’s public Internet address for its home Web page, along with a unique series of letters, numbers, and symbols administered and handed out by the organization to its members for the purpose of defining namespaces.

Note

A URI doesn’t necessarily match a real Web address. Some organizations, including Microsoft, have begun to use a special series of characters that do not even look like Web addresses, such as the namespace URI urn:schemas-microsoft-com:officedata, to be sure that no confusion exists about whether these URIs are real Web addresses.

Note

Some software applications require that specific namespaces and URIs be included in XML documents for the application’s features to work correctly with the XML data. Unrecognized namespaces and URIs are either ignored or rejected back to the user for clarification by XML-aware software applications.

For example, a sample namespace could be http://www.microsoft.com/schemas/12-01-2002 with a namespace prefix of msft. In XML, this would be represented as xmlns:msft="http://www.microsoft.com/schemas/12-01-2002", and a postalcode element that’s part of the msft namespace would be represented as <msft:postalcode>98052</msft:postalcode>.
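As a rough illustration of how software pairs a prefix with its URI, the following Python sketch parses the postalcode example. It assumes Python's standard xml.etree.ElementTree module; the enclosing msft:address element is hypothetical and was added only so that the namespace declaration has somewhere to live.

```python
import xml.etree.ElementTree as ET

MSFT_URI = "http://www.microsoft.com/schemas/12-01-2002"

# The msft:address wrapper is a hypothetical element that carries the
# xmlns:msft declaration for the postalcode element from the text.
doc = (
    '<msft:address xmlns:msft="' + MSFT_URI + '">'
    "<msft:postalcode>98052</msft:postalcode>"
    "</msft:address>"
)
root = ET.fromstring(doc)

# ElementTree replaces the prefix with the full URI in braces.
qualified_tag = root[0].tag

# To search by prefix, supply a prefix-to-URI mapping.
postalcode = root.find("msft:postalcode", {"msft": MSFT_URI})

print(qualified_tag, postalcode.text)
```

The parser reports the element name as the URI plus the local name, which suggests that the URI, not the prefix, is what actually identifies the namespace.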

Elements, which are also sometimes referred to as nodes, can be organized into a structure that looks much like a genealogical family tree. Elements can contain child elements, grandchild elements, great-grandchild elements, and so on. Elements can also have sibling elements, and they can be contained by parent elements, grandparent elements, great-grandparent elements, and so on. For example, a customerlist element might contain several customer child elements. Each customer element would have other customer sibling elements. Furthermore, each customer element might contain a customername child element. Each customername element’s grandparent element is the customerlist element, as shown in this very simple XML document:

    <customerlist>
        <customer>
            <customername>Paul Cornell</customername>
        </customer>
        <customer>
            <customername>Nancy Davolio</customername>
        </customer>
    </customerlist>
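The family-tree relationships in the customerlist document can be explored with a minimal Python sketch. It assumes the standard xml.etree.ElementTree module; because that module does not record parents, the parent_of dictionary is a helper built here for illustration.

```python
import xml.etree.ElementTree as ET

doc = """
<customerlist>
  <customer>
    <customername>Paul Cornell</customername>
  </customer>
  <customer>
    <customername>Nancy Davolio</customername>
  </customer>
</customerlist>
""".strip()

root = ET.fromstring(doc)

# The customer elements are children of customerlist and siblings of each other.
customers = root.findall("customer")
names = [customer.findtext("customername") for customer in customers]

# ElementTree elements do not store their parent, so build a
# child-to-parent lookup by walking the whole tree once.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Walk up two levels from the first customername element.
first_name_element = customers[0].find("customername")
grandparent = parent_of[parent_of[first_name_element]]

print(names, grandparent.tag)
```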

Processing instructions are special XML statements (they are not elements) that can be used by applications to carry out specific actions. A processing instruction is a single XML statement that begins with the symbols <? and ends with the symbols ?>. For example, the processing instruction <?xml version="1.0"?> identifies to XML applications that the file’s contents contain XML-formatted data.

XML applications can change elements in XML documents from one schema into another schema by applying a style sheet. Style sheets are well-formed XML documents that adhere to the Extensible Stylesheet Language Transformations (XSLT) specification. For example, if one organization uses the purchaseorder element to describe a purchase order, and another organization uses the requisition element to describe the same purchase order, XSLT can transform all the purchaseorder elements to requisition elements.
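To make the effect of such a transformation concrete, here is a small Python sketch that renames purchaseorder elements to requisition elements. It is not XSLT: Python's standard library has no XSLT processor, so this only imitates the result with the xml.etree.ElementTree module. The orders wrapper element and the order numbers are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: an orders element holding two purchase orders.
doc = (
    "<orders>"
    "<purchaseorder>1001</purchaseorder>"
    "<purchaseorder>1002</purchaseorder>"
    "</orders>"
)
root = ET.fromstring(doc)

# Collect the matching elements first, then rename each one.
for element in list(root.iter("purchaseorder")):
    element.tag = "requisition"

result = ET.tostring(root, encoding="unicode")
print(result)
```

A real XSLT style sheet expresses the same renaming declaratively, as template rules, so the transformation can be exchanged between organizations as an XML document of its own.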

Note

A complete discussion of XSLT is outside the scope of this book. For more information about XSLT, see the W3C Web site at http://www.w3.org/tr/xslt.

Basic XML Rules

Now that you understand some of the basic terminology of XML, you can learn some of its basic rules. Understanding the following eight rules will help you create, analyze, and exchange well-formed XML.

  1. XML element names and attributes are case-sensitive. XML tags such as <region></region>, <Region></Region>, and <REGION></REGION> do not refer to the same element. The case-sensitivity of XML tags is different from the rules of Hypertext Markup Language (HTML), which does not have this requirement.

  2. Every XML start tag must have a matching end tag. In HTML, you can type <p>This is a paragraph to represent a paragraph. However, in XML you must type <p>This is a paragraph</p>.

  3. An empty XML element is represented as an adjoining start tag and end tag or a special XML tag known as an empty-element tag. For instance, an empty list element could be represented as <list></list> or just <list/>.

  4. Every XML document should start with a special XML processing instruction called the XML declaration. This processing instruction should contain, at a minimum, the text <?xml version="1.0"?>. This declaration signals to XML-aware applications that any accompanying data should be treated as XML.

  5. Every XML document must have one and only one root element, known also as the document element. The document element appears after the XML declaration and contains all the other elements in the XML document.

  6. All attribute values must be enclosed within quotation marks. In HTML, you can type attributes such as <table rows=5 cols=2></table>. In XML, however, you must type attributes as <table rows="5" cols="2"></table>.

  7. Certain keyboard characters and other symbols have special uses in XML, for example the < and > symbols used in element tags. When these characters or symbols are part of the content (that is, when they are not used as part of the XML markup), they must be represented by replacement characters known as escape sequences. For example, a literal less-than (<) symbol or ampersand (&) is never allowed in element content (except in a CDATA section as described in the next item), and the greater-than (>) symbol is conventionally escaped as well. To represent a greater-than symbol, you use the escape sequence &gt;. Here’s an example:

    <sentence>This is how you represent a greater-than (&gt;) symbol. </sentence>

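    Other symbols that have escape sequences include less-than (&lt;), ampersand (&amp;), quotation mark (&quot;), and apostrophe (&apos;). The less-than symbol and the ampersand must always be escaped in content; quotation marks and apostrophes need to be escaped only inside an attribute value that is enclosed by the same character.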

  8. If you want to type a series of letters, numbers, and symbols without worrying about escape sequences, you can enclose the content in a CDATA section. A CDATA section starts with the symbols <![CDATA[ and ends with the symbols ]]>. Here’s an example:

    <![CDATA[Look, I can type whatever I want here, including symbols such as <, >, and &, without breaking the XML rules!]]>

    The XML rules are suspended for the most part inside CDATA sections. CDATA sections are typically used to exchange complex or precise series of letters, numbers, and symbols.
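Several of the rules above can be checked mechanically. The following Python sketch assumes the standard xml.etree.ElementTree and xml.sax.saxutils modules; the is_well_formed helper and the note wrapper element are hypothetical names introduced for illustration. The CDATA example uses the opening sequence <![CDATA[ defined by the W3C XML specification.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Each entry pairs a rule with a sample the parser accepts or rejects.
checks = {
    "mismatched case (rule 1)": is_well_formed("<region></Region>"),
    "missing end tag (rule 2)": is_well_formed("<p>This is a paragraph"),
    "empty-element tag (rule 3)": is_well_formed("<list/>"),
    "two root elements (rule 5)": is_well_formed("<a/><b/>"),
    "unquoted attributes (rule 6)": is_well_formed("<table rows=5 cols=2></table>"),
    "quoted attributes (rule 6)": is_well_formed('<table rows="5" cols="2"></table>'),
    "raw < in content (rule 7)": is_well_formed("<sentence>1 < 2</sentence>"),
}

# Rule 7: escape() replaces &, <, and > with their escape sequences,
# and the parser turns them back into the original characters.
escaped = escape("1 < 2 & 3 > 2")
sentence = ET.fromstring("<sentence>" + escaped + "</sentence>")

# Rule 8: inside a CDATA section the same symbols need no escaping.
cdata = ET.fromstring("<note><![CDATA[symbols such as <, >, and &]]></note>")

for description, accepted in checks.items():
    print(description, "->", "accepted" if accepted else "rejected")
print(escaped)
print(sentence.text)
print(cdata.text)
```

Only the empty-element tag and the quoted-attribute samples are accepted; every other sample in the checks dictionary breaks one of the rules and is rejected by the parser.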




Accessing and Analyzing Data With Microsoft Excel
ISBN: 073561895X
Year: 2006
Pages: 137
Authors: Paul Cornell
