24.2. Understanding XML
Databases aren't the only place you can find massive amounts of information. All kinds of specialized software programs store vast quantities of data in their own way. More often than not, when you need to get information out of (or into) one of these programs, you'll use XML .
XML is an all-purpose system for structuring and organizing data in a file. XML lets you exchange information with just about anyone , and send your spreadsheet data to other businesses that don't use Excel, or analyze raw information created with other programs.
For example, instead of saving data in Word documents, Excel spreadsheets, or ordinary text files, you can save data in an XML file. XML alone sounds pretty modest, but this simplicity is deceiving. Two factors make XML really special:
Note: In a confusing move, Microsoft's decided to limit XML support to certain editions of Office 2007. If you don't have Office's full Professional or Enterprise Editions, you may want to stop reading now. All other Office versions (including Standard, Small Business, Student and Teacher) omit the XML goodies that this chapter explores.
24.2.1. What Is XML, Really?
Contrary to what many people believe, XML is not a data format (as is, for example HTML, the format used to create Web pages). If XML were an ordinary data format, it wouldn't be nearly as useful because, no matter how good a format is, it can't suit everyone. Even though almost every company needs to create invoices, most companies wouldn't be happy with a generic format for storing invoice information. One company might need to track customer names , while another might track customer ID numbers . The bottom line is that most companies need to store slightly different data in slightly different ways. A one- size -fits-all solution is pretty much always doomed to failure.
So if XML isn't a data format, what is it? Technically, XML is a meta-language , which is a fancy way of saying that XML is a language for creating other languages. XML does this by setting out a few simple rules that let you build your own data format that's just right for your data.
For example, Acme Company can build an XML format for invoices and call it AcmeInvoice. Meanwhile, Budget Company can build its own XML invoice format and call it BudgetInvoice. Even though both these formats are designed to store invoice information, they can contain completely different kinds of data. XML's flexibility is its strength.
At the same time, XML's flexibility can create problems. Imagine that a bank named Worldwide Green sets up a system to automatically process XML invoices in a specific format. The system works smoothly until Acme Corporation sends along its own homegrown invoice. Even though Acme's invoice uses XML, it doesn't conform to the XML that the bank expects, and so it gums up the bank's automated invoice-processing application. Suddenly, XML doesn't look so useful.
The bottom line: XML holds the promise of universal data sharingbut if you don't create some rules and follow them, you're left with a bunch of incompatible formats.
Note: XML is really quite simple. However, you'll find a slew of other standards with names like XML Schema and XSLT that work in conjunction with XML, and provide solutions for validating XML, searching XML, transforming XML, and so on. These other standards are quite complex and aren't discussed in this book. For more information, pick up a book like Learning XML by Erik Ray (O'Reilly), or see the Web site www.w3schools.com/xml.
24.2.2. Three Rules of XML
To better understand how to configure Excel to handle XML, look at a simple example. Technically, you don't need to know what XML looks like in order to use Excel's XML features, but the more you understand, the less confusing life will be. In this section, you'll learn the three most important rules that shape all XML documents. If you already know a little about XML, feel free to skip ahead.
By the way, good news before you even start: XML is written in a text-based, human-readable format. That means you can use a program like Notepad to crack open an existing XML file and get a basic idea of its format and structure. You can even write an XML file from scratch using Notepad. You can't do the same with the average Excel spreadsheet file because they're stored in a binary format that you can read only when you're looking at the file in Excel. (If you open an ordinary Excel file in Notepad, you see a jumble of indecipherable symbols.)
220.127.116.11. The prolog
All respectable XML documents start with something called a document prolog . This bit simply announces that what you're looking at is an XML document. It can also indicate the encoding of the document, which sometimes specifies that the document uses a special character set (for example, a non-English alphabet).
Here's a typical document prolog, indicating that this document uses Version 1.0 of the XML standard (the most prevalent version):
<?xml version="1.0" ?>
If you're creating an XML document by hand, you should make sure you place the document prolog as the very first line of the file.
The element is the basic building block of any XML document. Elements are information containers. If you wanted to store a person's name, you might create an element called Name . (For more on the infinite variety of elements anyone can create, see the box in Section 18.104.22.168.)
A typical element is composed of a start tag and an end tag. The actual information goes between these two tags. You can easily recognize start tags and end tags because they use angle brackets <>. For example, here's one possible start tag:
This tag marks the start of the Name element. The end tag looks almost identical, except it begins with the characters </ instead of just <. Here's what you need to end the Name element:
To actually store some information in an XML document, you just insert the content between the start and end tag of an element. Here's how you might store someone's name in an XML document:
You could create a list of names by putting one <Name> element after the other, or you could add other elements that store different types of information, like address, title, employer, and so on. You put all these tags together in a file to make an XML document.
So far, you've seen examples of XML elements that contain text. You can also create an element that contains one or more additional elements. This is a basic principle for organizing information in XML.
Imagine you want to keep track of several people's names and ages. The following format isn't especially clear because it's hard to tell which person connects to which age:
<Name>Lisa Chen</Name> <Age>19</Age> <Name>Bill Harrison</Name> <Age>48</Age>
Better to group the <Name> and <Age> elements together for each person, and put them inside another element. Here's an example:
<Person> <Name>Lisa Chen</Name> <Age>19</Age> </Person> <Person> <Name>Bill Harrison</Name> <Age>48</Age> </Person>
Here, the two <Person> elements each represent a distinct individual. Information about each person is stored in <Name> and <Age> elements that are nested inside the appropriate <Person> element.
There's no limit to how many layers deep you can nest information, making this method of organizing information extremely flexible. In fact, it's part of the reason that XML can work with so many different types of data.
XML imposes one more rule. Every document must start with a single element that you place right after the document prolog. You place all the other content inside this element, which is called the root or document element . So far, the examples you've seen are only excerpts of XML. The following listing shows a complete, valid XML documenta list with information about two peoplethat starts off with the document element <PeopleList>:
<?xml version="1.0" ?> <PeopleList> <Person> <Name>Lisa Chen</Name> <Age>19</Age> </Person> <Person> <Name>Bill Harrison</Name> <Age>48</Age> </Person> </PeopleList>
You could enhance this document by adding more <Person> elements, or by adding different elements to track additional information about each person.
You've probably noticed that these XML examples indent each level of elements. That indentation makes the overall structure easier to read, but it's not required. In fact, applications that read XML (including Excel) ignore all the white space between elements, so it doesn't matter if you add spaces, tabs, and blank lines. In fact, as far as computers are concerned , the document above is exactly the same as the following, much less human-friendly version:
<?xml version="1.0" ?> <PeopleList><Person><Name>Lisa Chen</Name><Age>19</ Age></Person><Person><Name>Bill Harrison</Name><Age>48< /Age></Person></PeopleList>
24.2.3. XML Files and Schemas
As you've already learned, a file is one place you can store XML documents. But you can just as easily place XML documents in databases or other storage locations. In fact, sometimes XML data isn't stored anywhereinstead, you just use it to send information between applications over the Internet. However, when you use XML with Excel, you're always using XML files (unless your company's created a custom solution using Excel's heavy-duty programming features). Most XML files have the extension .xml. It makes perfect sense to take the person list document shown earlier and place it in a text file named PersonList.xml.
Tip: The Excel XML features make the most sense when you have information that's already in an XML format. If you don't have XML information that you need to work with, there's really no reason to play around with Excel's XML features.
There's another type of XML document that's extremely important: XML schemas . Schemas are designed to solve a common problemnamely, defining the rules for a specific XML-based format. A schema may indicate the element names you can use, how you can arrange the elements, and the type of information each element can contain. An XML-friendly application like Excel can use the schema to verify that an XML document uses the right structure and contains the appropriate content. In an ideal world, every time a company created an XML format, they'd write an XML schema that defines it. (You probably won't be surprised to learn this doesn't always happen.) Excel doesn't include schemas of its own, so it's up to you to supply the schema based on the XML you want to use.
This book doesn't look at XML schemas in detail because they're a more complicated standard than XML. However, even if you don't know how to build your own schemas, you can still use other people's schemas with Excel. In fact, you'll find that doing so actually simplifies your life. In order to use a schema, you simply need to have a copy of it in a file (schemas themselves are complex and ugly and beyond the scope of what a typical office needsor wantsto learn). Usually, schema files have the extension .xsd.
Tip: For a more comprehensive beginner's introduction to XML and XML schemas, check out the excellent online tutorial provided by W3 Schools at www.w3schools.com/xml.