There are several different ways to load XML data into Excel. Some are useful mostly for quick exploration and maybe some editing, while others are more appropriate for creating spreadsheets that use XML as a data source that can be easily replaced with new data whenever appropriate. All of these mechanisms share a common approach for showing XML data in the spreadsheet, so it's worth taking a moment to examine how Excel handles XML structures before moving into the mechanics of importing data.
When Excel opens an XML file, it imports data from it. If you make changes to the XML file while Excel is working with the data it has imported from that file, changes to the XML will not be reflected in the Excel spreadsheet.
6.2.1 Tables and Trees
Excel, like all spreadsheets, is built on a grid. Information is organized into rows and columns, and this worksheet grid (as well as relationships among multiple worksheet grids in a workbook) is used to create cross-references between different sections of information. Within the grid, Excel is enormously flexible. Information doesn't have to follow neat table structures pricing data could, if desired, run diagonally down a spreadsheet. It's easier to work with ranges of information if it stays in a single row or column, though, so most spreadsheets combine table areas that contain raw data and then either tables of results or cells along the fringes of the tables.
XML has no built-in notion of a grid. While it's certainly possible to represent a spreadsheet's rows and columns of cells within a worksheet as XML (and Chapter 7 will explore how Microsoft's chosen to do this), there's no guarantee that any given XML document will neatly fit into the native structures of Excel. There are a few simple but critical conditions that must apply to XML documents for them to be used easily as source data for Excel:
Effectively, Excel only works well with a small subset of the many possible XML document structures. The Excel subset, however, is an extremely common subset in practice. Enormous amounts of data are available in XML formats that work well with Excel.
6.2.2 Opening XML Documents Directly
The standard Excel dialog box for opening files shows XML files (or files ending in the extension .xml) right along with Excel spreadsheets, as shown in Figure 6-1.
Figure 6-1. XML files appearing in the Excel Open dialog box
Only one of the choices presented here is a traditional Excel spreadsheet, twoPlusTwo.xls. The other files are XML files. XML files that Excel knows belong to Microsoft Word (thanks to the mso-application processing instruction), the ch02-x series, are marked with the Word icon, while ch0601.xml, an Excel SpreadsheetML file, has the Excel icon. XML files using other vocabularies get a different icon. On my system, they get a Mozilla logo, but they may have a different logo on your system, depending on what XML-processing software you have installed.
Whatever logo appears, however, you can attempt to open any XML file. If the XML contains anything other than Excel's own SpreadsheetML, covered in Chapter 7, you'll see the dialog box shown in Figure 6-2.
Figure 6-2. Dialog box for choosing how to handle XML document importation
220.127.116.11 Opening documents as a list
We'll start with a simple XML document recording (imaginary) sales of books to explore how these different options work, shown in Example 6-1.
Example 6-1. A simple XML document for analysis in Excel
<?xml version="1.0" encoding="UTF-8"?> <sales> <sale> <date>10/5/2003</date> <ISBN>0596005385</ISBN> <Title>Office 2003 XML Essentials</Title> <PriceUS>34.95</PriceUS> <quantity>200</quantity> <customer >Zork's Books</customer> </sale> <sale> <date>10/5/2003</date> <ISBN>0596002920</ISBN> <Title>XML in a Nutshell, 2nd Edition</Title> <PriceUS>39.95</PriceUS> <quantity>90</quantity> <customer >Zork's Books</customer> </sale> <sale> <date>10/5/2003</date> <ISBN>0596002378</ISBN> <Title>SAX2</Title> <PriceUS>29.95</PriceUS> <quantity>300</quantity> <customer >Zork's Books</customer> </sale> <sale> <date>10/7/2003</date> <ISBN>0596005385</ISBN> <Title>Office 2003 XML Essentials</Title> <PriceUS>34.95</PriceUS> <quantity>10</quantity> <customer >Books of Glory</customer> </sale> <sale> <date>10/7/2003</date> <ISBN>0596002920</ISBN> <Title>XML in a Nutshell, 2nd Edition</Title> <PriceUS>39.95</PriceUS> <quantity>25</quantity> <customer >Books of Glory</customer> </sale> <sale> <date>10/7/2003</date> <ISBN>0596002378</ISBN> <Title>SAX2</Title> <PriceUS>29.95</PriceUS> <quantity>5</quantity> <customer >Books of Glory</customer> </sale> <sale> <date>10/18/2003</date> <ISBN>0596002378</ISBN> <Title>SAX2</Title> <PriceUS>29.95</PriceUS> <quantity>15</quantity> <customer >Title Wave</customer> </sale> <sale> <date>10/21/2003</date> <ISBN>0596002920</ISBN> <Title>XML in a Nutshell, 2nd Edition</Title> <PriceUS>39.95</PriceUS> <quantity>15</quantity> <customer >Books for You</customer> </sale> </sales>
If you open this document from Excel and choose "Open as an XML List," you'll see the dialog box shown in Figure 6-3.
Figure 6-3. Excel's warning that no schema is in use
If you just go ahead and click OK, Excel will look at the document, infer a schema for it, build a list based on that schema, and import the contents of the XML document into that list. You'll be rewarded with the spreadsheet result shown in Figure 6-4.
Figure 6-4. The XML document shown in Example 6-1 presented as an XML list in Excel
Excel not only imports the data from the XML document, it uses the element and attribute names as list headers. The drop down tabs to the right of the list headers let you organize the information as you'd like, as shown in Figure 6-5.
Figure 6-5. Choosing a sort or filter from a drop-down
If you choose "Sort Ascending," for instance, you'd see the list sorted by ISBN, as shown in Figure 6-6.
Figure 6-6. Sorting the data from the XML document by ISBN
Excel also offers some basic functionality for totaling and averaging the contents of these lists. Right-clicking on the list anywhere inside the blue box brings up a menu. If you choose List Total Row, you'll see an extra row appear at the bottom of the list, as shown in Figure 6-7.
Figure 6-7. A total row added to the spreadsheet
By default, Excel just does a sum of the right-most column. That's common practice for spreadsheets, though in this case it works badly, since the IDs aren't exactly addable. Clicking on the cells in the total row brings up a drop-down tab. Figure 6-8 shows the choices it offers.
Figure 6-8. Total row options
For quantity, it might be nice to know the total number of units ordered. We'll select Count for ISBN so we know how many orders we have. Figure 6-9 shows the results.
Figure 6-9. Total row results
This is somewhat useful, but odds are good that we want to be able to perform more sophisticated calculations on the information. Fortunately, we can access the information in the list from the rest of Excel. For starters, we might well want a column that provides the total cost of an order the quantity times the price. Because this is just Excel data, that's easily done. We'll add a "Total" header in cell H1, and then a formula, =D2*E2, in cell H2. If we copy that formula from H2 to cells H3-9, we get the results shown in Figure 6-10.
Figure 6-10. Total column results
Because we put this column right next to the XML data, Excel added this column to the list, and gave it the same sort and total capabilities of the rest of the list. Formulas can reference this data from other workbooks or from non-adjacent cells, though they won't be built into the list the same way.
While Excel provides no means of referring to data in this list by list name and column, you can safely reference the range and have Excel automatically adjust if a user reloads the XML document or modifies the information.
To show how to reference data, we'll create some formulas on Sheet2 that reference the range containing the XML data in Sheet1, as shown in Figure 6-11.
Figure 6-11. Calculations on the XML data
If we go back to Sheet1, and right-click on the XML list area, the XML sub-menu lets you Import... new data. When we import the data in ch0602.xml, a slightly extended version of the same information, Excel presents the data as shown in Figure 6-12.
Figure 6-12. Adding more XML data
It's the same data, with a few extra sales. If we now return to Sheet2, as shown in Figure 6-13, we can see that the sales figures and the formulas have updated smoothly.
Figure 6-13. Automatically updated calculations on the XML data
Excel only does this updating when new data is imported, not when changes are made to the original file, so you should set user expectations appropriately. This is a very simple example, admittedly, but you can build much more sophisticated spreadsheet applications on these same principles.
18.104.22.168 Opening documents as a read-only workbook
If you're extracting data from XML documents, you may find it useful to open them as read-only workbooks. The presentation of the information is very different, and there's no option for exporting the XML back out of the spreadsheet, but more explicit information about where the information came from is provided in the header rows. If we open ch0601.xml and select "Open as a read-only workbook," we see the result shown in Figure 6-14.
Figure 6-14. Sales data loaded as a read-only workbook
Information about where the data came from is provided in the XPath-like headers. The #agg information seems to be aggregated information, though in this case there's only one item per column.
The name "read-only workbook" is slightly misleading. You can make changes to the data, and you can save this file elsewhere. The "read-only" just means that you can't make changes to the original XML document using this approach; if you save the file, it's saved as an Excel workbook. It isn't nearly as flexible as the list approach, but it also lets you extract information from a wider variety of documents. Given its lack of flexibility and Microsoft's lack of documentation for the resulting format, this feature is probably best used only when you want to dump content from a document into Excel and don't mind doing a lot of organization yourself.
22.214.171.124 Using the XML source task pane
Opening an XML document using the XML Source task pane produces results that are much like the list created by opening the document as an XML list, but it allows you to have more control over what appears in the list and what doesn't. Many XML documents, for example, have header information followed by repeating sections. If opened directly as a list, Excel will produce a lot of columns that repeat the header information, when it really only appeared once. Using the XML Source task pane lets you choose what elements or attributes you want to appear in the Excel grid, and is especially useful when you only want to see or work with a subset of the information used in a document.
To show off the source task pane, we'll open ch0601.xml and select "Use the XML Source task pane." If, like ch0601.xml, the XML document doesn't contain a reference to an XML Schema, Excel displays the same warning that was shown in Figure 6-3, and then generates a schema based on what it finds in the document, producing the XML task pane contents shown in Figure 6-15.
Figure 6-15. Using the XML Source task pane to select XML document parts for display in Excel
To put information on to the spreadsheet, click on items in the task pane and drag them over to the grid. If you drag the date over to cell A3, you'll see the result shown in Figure 6-16.
Figure 6-16. Adding a component from the XML Source pane adds a column, but not the data
While we originally started out as if we were loading a document into Excel, Excel instead loaded the structure of the document rather than its contents. Using the XML Source task pane means building the structure you want in the spreadsheet from the parts in the XML document and then importing the XML document's content. If you drag more of them over and align them side by side, Excel will create a single large list, as shown in Figure 6-17 (If some parts of a document don't repeat, you can place them in cells that are not adjacent to the main body of the list.).
Figure 6-17. A list, created from the task pane
Populating that list takes an extra step. If the List toolbar is visible (and you can find it at View Toolbars if it isn't already visible), you can click on the Import XML Data button, as the task pane advises, find your XML document, and import it. If the toolbar isn't visible, right-clicking on the list will bring up a menu with an XML entry. Select Import . . . from that menu, choose your XML document, and Excel will import the data. Figure 6-18 shows what Excel produces if you import ch0601.xml into this list.
Figure 6-18. A filled-in list, with the List toolbar turned on
At this point, you can work with the list the same way you could when the list was loaded directly. One important feature of building the list this way that you don't get when documents are loaded directly is that you can also place non-repeating elements on the spreadsheet. Let's suppose the sales element also contained an element named store, identifying which store had these sales. Figure 6-19 shows the store element placed above the rest of the list, displaying the value of store once and only once.
Figure 6-19. A filled-in list, with a single element above the repeating portion of the list
6.2.3 Working with XML Maps
Opening XML documents directly is a great way to get started or to quickly analyze information, but in the long run you'll probably want to build spreadsheets that take a more structured approach. The XML Source pane lets you define XML Maps, describing the relationships between XML document structures and the lists that actually appear in your spreadsheet. These maps are built on XML Schemas, though they may either be schemas you specify or, as the previous examples showed, schemas that Excel derived by example from documents.
Most interactions with XML Maps take place through the task pane's XML Source view, or through schemas or documents which you use as a foundation for the map. Once you've created a map, there isn't much you can do through the Excel interface to change its basic structures, so getting your schema right in the first place is a critical step in creating spreadsheets that work with XML.
126.96.36.199 Excel and XML Schema
The XML Schema Recommendation provides a much more comprehensive set of tools for describing information than Excel actually needs. As noted earlier, Excel is good primarily for interacting with certain kinds of document structures, so some document-oriented features of XML Schema (like types that use mixed content) don't work with Excel. Similarly, Excel has had its own set of types for internal consumption for over a decade, and retrofitting Excel with the complete XML Schema datatype system probably would not be wise. Microsoft uses a combination of existing types to support the larger XML Schema system, as shown in Table 6-1.
These differences mean that you should not expect Excel to keep close track of the validation specified by the schema. Excel will behave as it has always behaved, with a set of rules for mapping between Excel and XSD. Formats that are represented as text will be imported or exported as they appear, while formats that have a more complex type may be formatted by Excel in the spreadsheet and according to XSD rules in the XML.
Excel also performs similar simplifications on content models. Excel is not designed as an über-XML-document processor, and it doesn't need the structural type information provided by XML Schema. From Excel's perspective, it needs to know what data goes together as a row and in what columns. Simpler structures are more manageable, and far less likely to break. While there may be times you need to work with XML that arrived with a complex schema, it may be easier in such cases to break the documents into smaller pieces and use simpler schemas if possible.
188.8.131.52 Creating an XML Map
Although some of the techniques described earlier in Section 6.2.2Section 6.2.2 create XML Maps, there are times when you'll want to incorporate data from XML documents in an existing spreadsheet, and those techniques don't work as well for that.
To create an XML Map in an existing spreadsheet, you need to bring up the task pane (View Task Pane, if it's not already there) and select XML Source from the drop-down menu at the top of the task pane. Unless you've done XML work with this spreadsheet before, you should have an empty task pane, like the one shown in Figure 6-20.
Figure 6-20. The XML Source task pane, before any sources are listed
To create an XML Map, click on the XML Maps... button. The XML Maps dialog box, shown in Figure 6-21, will appear, empty.
Figure 6-21. A fresh XML Maps dialog box
To add a map, click the Add . . . button. For an example, we'll use a document structure that both fits the Excel grid approach and tests out how it handles a variety of the XML Schema datatypes listed Table 6-1, using a contract management system as a guide.
These are some pretty simple contracts, which are just about payments on birthdays, enough information to get a sense of how Excel treats different datatypes. The schema for the contract description is listed in Example 6-2.
Example 6-2. A simple schema for contracts
<?xml version="1.0" encoding="UTF-8"?> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" targetNamespace="http://simonstl.com/ns/example/contract" xmlns:contract="http://simonstl. com/ns/example/contract"> <xs:element name="contracts"> <xs:complexType> <xs:sequence> <xs:element maxOccurs="unbounded" ref="contract:contract"/> </xs:sequence> </xs:complexType> </xs:element> <xs:element name="contract"> <xs:complexType> <xs:sequence> <xs:element ref="contract:recipient"/> <xs:element ref="contract:signing_date"/> <xs:element ref="contract:signing_time"/> <xs:element ref="contract:birthyear"/> <xs:element ref="contract:birthday"/> <xs:element ref="contract:male"/> <xs:element ref="contract:payment_amount"/> <xs:element ref="contract:years_to_pay"/> </xs:sequence> </xs:complexType> </xs:element> <xs:element name="recipient" type="xs:string"/> <xs:element name="signing_date" type="xs:date"/> <xs:element name="signing_time" type="xs:time"/> <xs:element name="birthyear" type="xs:gYear"/> <xs:element name="birthday" type="xs:gMonthDay"/> <xs:element name="male" type="xs:boolean"/> <xs:element name="payment_amount" type="xs:decimal"/> <xs:element name="years_to_pay" type="xs:integer"/> </xs:schema>
We'll use this schema to create an XML map by clicking the Add... button and selecting this schema from the browse dialog that appears. When the Multiple Roots dialog box shown in Figure 6-22 appears, select "contracts" from the list and click OK.
Figure 6-22. Choosing the root element for the map
You'll be rewarded with the result shown in Figure 6-23, a new XML Map that is named contract_map, after the root element, which describes the namespace http://simonstl.com/ns/example/contract.
Figure 6-23. The XML Map, ready to go
If you click OK, you'll see XML components ready to be used in the XML Source task pane, as shown in Figure 6-24.
Figure 6-24. XML components, ready for use
The XML Source pane uses a number of icons to describe the structure of the XML document, much like those used to represent files and folders in the Windows Explorer. These icons are shown in Table 6-2.
If you drag the ns1:contract icon to cell A1, you'll get a list based on this map set up and ready for use, as shown in Figure 6-25.
Figure 6-25. A list based on the XML Map, ready for use
I tend to clean these up and remove the "ns1" prefixes, as you'll see in later examples. You can also dismiss the XML Source pane, and bring it back up only if you need its "Verify Map for Export . . . " option.
If you select a cell in Row 2, and select Format Cells, you can see how Excel has formatted the data automatically. For example, if you do this to the signing date, you'll see the result in Figure 6-26.
Figure 6-26. Cell formatting applied by Excel to dates
While Excel has used the schema to determine cell formatting, it is not currently using the datatypes in the schema for any kind of data validation. If you import an XML document (or type in data) that doesn't correspond to Excel's expectations, it will format it as text. To make Excel use the schema for validation which only happens on import and export in any event you need to right-click on the list, select the XML sub-menu, and then select XML Map Properties. The dialog box shown in Figure 6-27 will appear.
Figure 6-27. The XML Map Properties dialog box
The "Validate data against schema for import and export" is always turned off by default. While that may seem strange in contexts where you want validation to check user data, it also avoids some odd problems. It's possible that users will want to import documents that have problems so that they can repair them. It's also possible that users will be so frustrated by a document that they want to send it to someone else to sort out, without being told they can't save the file.
We'll use two test documents to explore how this works. The first one, shown in Example 6-3, is a deliberately invalid XML document, with all kinds of data that doesn't match the datatypes used by the schema. The second, shown in Example 6-4, is a document that is valid against the schema we've used.
Example 6-3. An invalid document for use in the map
<contracts xmlns="http://simonstl.com/ns/example/contract"> <!--This document is NOT VALID.--> <contract> <recipient>Jedidiah Smith</recipient> <signing_date>June 27, 1992</signing_date> <signing_time>4 PM</signing_time> <birthyear>62</birthyear> <birthday>23 November</birthday> <male>yes</male> <payment_amount>$27</payment_amount> <years_to_pay>two</years_to_pay> </contract> <contract> <recipient>Jane Zinger</recipient> <signing_date>April 22, 2001</signing_date> <signing_time>6:30 PM</signing_time> <birthyear>75</birthyear> <birthday>19 July</birthday> <male>no</male> <payment_amount>$42</payment_amount> <years_to_pay>four</years_to_pay> </contract> </contracts>
Example 6-4. A valid document for use in the map
<contracts xmlns="http://simonstl.com/ns/example/contract"> <!--This document is VALID.--> <contract> <recipient>Josiah Smith</recipient> <signing_date>1999-06-03</signing_date> <signing_time>09:03:22</signing_time> <birthyear>1962</birthyear> <birthday>--06-21</birthday> <male>true</male> <payment_amount>0004002.00200</payment_amount> <years_to_pay>26</years_to_pay> </contract> <contract> <recipient>Jane Zang</recipient> <signing_date>1999-04-03</signing_date> <signing_time>11:04:28</signing_time> <birthyear>1968</birthyear> <birthday>--04-23</birthday> <male>false</male> <payment_amount>000401.0200</payment_amount> <years_to_pay>2</years_to_pay> </contract> </contracts>
If a user attempts to import Example 6-3 into this map, they'll get the list of warnings shown in Figure 6-28.
Figure 6-28. "Some data was imported as text" errors on import
While this may dissuade some users, it doesn't sound like a big deal, and all those "Complete"s are pretty reassuring. The map also looks all right in Excel, if you aren't cued in to the formatting. Figure 6-29 shows the import results.
Figure 6-29. Bad results that look like they might be okay in Excel
If you export this map, as shown in the next section, Excel goes right ahead with it. If you use the "Verify Map for Export" link on the XML Source task pane, Excel notifies you that "contract_map is exportable." The results of the export, shown in Example 6-5, make it clear that Excel has imported and exported the document, as it's added the ns1 prefix everywhere.
Example 6-5. The exported version of the broken document
<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <ns1:contracts xmlns:ns1="http://simonstl.com/ns/example/contract"> <ns1:contract> <ns1:recipient>Jedidiah Smith</ns1:recipient> <ns1:signing_date>June 27, 1992</ns1:signing_date> <ns1:signing_time>4 PM</ns1:signing_time> <ns1:birthyear>62</ns1:birthyear> <ns1:birthday>23 November</ns1:birthday> <ns1:male>yes</ns1:male> <ns1:payment_amount>$27</ns1:payment_amount> <ns1:years_to_pay>two</ns1:years_to_pay> </ns1:contract> <ns1:contract> <ns1:recipient>Jane Zinger</ns1:recipient> <ns1:signing_date>April 22, 2001</ns1:signing_date> <ns1:signing_time>6:30 PM</ns1:signing_time> <ns1:birthyear>75</ns1:birthyear> <ns1:birthday>19 July</ns1:birthday> <ns1:male>no</ns1:male> <ns1:payment_amount>$42</ns1:payment_amount> <ns1:years_to_pay>four</ns1:years_to_pay> </ns1:contract> </ns1:contracts>
If, however, we turn on "Validate data against schema for import and export," we'll get an extra error message, shown in Figure 6-30.
Figure 6-30. An error message produced by failed validation
Close that window, however, and you have the same imported result shown in Figure 6-29. Excel is not very interested in blocking bad data. If you select the schema validation error and click Details . . . you'll see the information presented in Figure 6-31.
Figure 6-31. A report on schema validation failure
This is somewhat more meaningful, but:
Hopefully, future versions of Excel will provide better support for validation on import that is more helpful to users and more useful for developers. Similarly, exporting this document produces the same result already shown in Example 6-5, but produces a warning message, shown in Figure 6-32.
Figure 6-32. The error message from exporting an invalid XML document
Hopefully users will see this and at least know there's a problem, and perhaps being able to export XML will make it easier for them to pass it to someone else who can clean it up before schema-dependent processing takes place.
If, on the other, we import Example 6-4, the valid document, we get no error messages and a spreadsheet whose formatting conforms to the expectations described in Table 6-1. Figure 6-33 shows this valid document after import into our list area.
Figure 6-33. A valid XML document, imported
This document also exports perfectly well. Given good examples, users should be able to produce good results. If users get used to seeing error messages with no obvious ill effects, though, it may cause trouble down the road.