7.2 Reading XML Spreadsheets

The SpreadsheetML vocabulary is generally much smaller than the WordML vocabulary, and more approachable. While it also comes with lots of metadata, the structured nature of spreadsheets is easily captured with relatively concise XML. We'll start with a very simple test spreadsheet, adding two numbers, as shown in Figure 7-5.

Figure 7-5. A simple spreadsheet for an initial test

This spreadsheet adds 2 and 2, using the SUM function in cell A3 to add the values of cells A1 and A2. If we save the spreadsheet shown in Figure 7-5 as an XML Spreadsheet, Excel generates the XML file shown in Example 7-1.

Example 7-1. A simple Excel spreadsheet saved as XML
<?xml version="1.0"?> <?mso-application prog?> <Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"  xmlns:o="urn:schemas-microsoft-com:office:office"  xmlns:x="urn:schemas-microsoft-com:office:excel"  xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"  xmlns:html="http://www.w3.org/TR/REC-html40">  <DocumentProperties xmlns="urn:schemas-microsoft-com:office:office">   <Author>Simon St.Laurent</Author>   <LastAuthor>Simon St.Laurent</LastAuthor>   <Created>2003-03-19T20:21:31Z</Created>   <LastSaved>2003-03-19T20:23:08Z</LastSaved>   <Company>simonstl.com</Company>   <Version>11.4920</Version>  </DocumentProperties>  <OfficeDocumentSettings xmlns="urn:schemas-microsoft-com:office:office">   <DownloadComponents/>   <LocationOfComponents HRef="file:///C:\MSOCache\All%20Users\20000409-6000-11D3      8CFE-0150048383C9\"/>  </OfficeDocumentSettings>  <ExcelWorkbook xmlns="urn:schemas-microsoft-com:office:excel">   <WindowHeight>8955</WindowHeight>   <WindowWidth>11355</WindowWidth>   <WindowTopX>360</WindowTopX>   <WindowTopY>120</WindowTopY>   <ProtectStructure>False</ProtectStructure>   <ProtectWindows>False</ProtectWindows>  </ExcelWorkbook>  <Styles>   <Style ss: ss:Name="Normal">    <Alignment ss:Vertical="Bottom"/>    <Borders/>    <Font/>    <Interior/>    <NumberFormat/>    <Protection/>   </Style>  </Styles>  <Worksheet ss:Name="Sheet1">   <Table ss:ExpandedColumnCount="1" ss:ExpandedRowCount="3" x:FullColumns="1"    x:FullRows="1">    <Row>     <Cell><Data ss:Type="Number">2</Data></Cell>    </Row>    <Row>     <Cell><Data ss:Type="Number">2</Data></Cell>    </Row>    <Row>     <Cell ss:Formula="=SUM(R[-2]C, R[-1]C)"><Data ss:Type="Number">4</ Data></Cell>    </Row>   </Table>   <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">    <Print>     <ValidPrinterInfo/>     <HorizontalResolution>600</HorizontalResolution>     <VerticalResolution>600</VerticalResolution>    </Print>    <Selected/>    <Panes>     <Pane>      <Number>3</Number>      <ActiveRow>1</ActiveRow>      <ActiveCol>1</ActiveCol>     </Pane>    </Panes>    <ProtectObjects>False</ProtectObjects>    <ProtectScenarios>False</ProtectScenarios>   </WorksheetOptions>  </Worksheet>  <Worksheet ss:Name="Sheet2">   <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">    <ProtectObjects>False</ProtectObjects>    <ProtectScenarios>False</ProtectScenarios>   </WorksheetOptions>  </Worksheet>  <Worksheet ss:Name="Sheet3">   <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">    <ProtectObjects>False</ProtectObjects>    <ProtectScenarios>False</ProtectScenarios>   </WorksheetOptions>  </Worksheet> </Workbook>

The spreadsheet begins with an XML declaration and a processing instruction identifying this document as an Excel.sheet:

<?xml version="1.0"?> <?mso-application prog?>

After those formalities, the Workbook element appears. The Workbook element is the root element for all SpreadsheetML files, and contains most of the namespace declarations that will be used in the rest of the document:

<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"  xmlns:o="urn:schemas-microsoft-com:office:office"  xmlns:x="urn:schemas-microsoft-com:office:excel"  xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"  xmlns:html="http://www.w3.org/TR/REC-html40">

Unlike Word, which prefixed all of its element and attribute names with w, Excel uses no namespace prefix by default, using xmlns="urn:schemas-microsoft-com:office:spreadsheet" to declare its default namespace, the namespace you'll undoubtedly find most important if you need to get to the data contained in the spreadsheet grid.

Because unprefixed attributes don't have a namespace, Excel also uses the declaration xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" to associate the ss (for spreadsheet) prefix with the same URI. The elements in the document will be unprefixed, while their attributes will be prefixed with ss, but all of these components will have precisely the same namespace URI, "urn:schemas-microsoft-com:office:spreadsheet".

As we'll see, the declaration for the o prefix doesn't actually get used in this document. The x prefix is used for a few attributes later, and the html prefix is used if there is HTML in the spreadsheet somewhere.

The first child element, DocumentProperties, contains the metadata about the document. While all of these elements use no namespace prefix, the DocumentProperties element redefines the default namespace with its own xmlns attribute. Unprefixed elements in this space have the same namespace URI as the o prefix elsewhere.

<DocumentProperties xmlns="urn:schemas-microsoft-com:office:office">   <Author>Simon St.Laurent</Author>   <LastAuthor>Simon St.Laurent</LastAuthor>   <Created>2003-03-19T20:21:31Z</Created>   <LastSaved>2003-03-19T20:23:08Z</LastSaved>   <Company>simonstl.com</Company>   <Version>11.4920</Version>  </DocumentProperties>

Most of this information is pretty straightforward. Perhaps the most interesting aspect is that the markup is extremely similar to its counterpart in Word, except for whitespace and the meaningless namespace prefix. Excel stores less information than Word, as Example 7-2 demonstrates, but content managers can rely on these pieces to collect metadata from both Word and Excel files without concern for the surrounding context.

Example 7-2. WordML document properties (whitespace added for readability)
<o:DocumentProperties>   <o:Title>Hello World</o:Title>   <o:Author>Simon St.Laurent</o:Author>   <o:LastAuthor>Simon St.Laurent</o:LastAuthor>    <o:Revision>2</o:Revision>   <o:TotalTime>0</o:TotalTime>   <o:Created>2003-03-14T00:21:00Z</o:Created>   <o:LastSaved>2003-03-14T00:21:00Z</o:LastSaved>   <o:Pages>1</o:Pages>   <o:Words>1</o:Words>   <o:Characters>12</o:Characters>   <o:Company>O'Reilly &amp; Associates</o:Company>   <o:Lines>1</o:Lines>   <o:Paragraphs>1</o:Paragraphs>   <o:CharactersWithSpaces>12</o:CharactersWithSpaces>   <o:Version>11.4920</o:Version> </o:DocumentProperties>

Getting back to the Excel markup, the next piece is pretty application-specific and probably not very useful to other applications. Like the DocumentProperties element, it declares its own default namespace rather than using the o prefix defined at the start of the document.

<OfficeDocumentSettings xmlns="urn:schemas-microsoft-com:office:office">   <DownloadComponents/>   <LocationOfComponents HRef="file:///C:\MSOCache\All%20Users\20000409-6000-11D3      8CFE-0150048383C9\"/>  </OfficeDocumentSettings>

Next we have the ExcelWorkbook element, with information about the window settings and protected status of the workbook:

<ExcelWorkbook xmlns="urn:schemas-microsoft-com:office:excel">   <WindowHeight>8955</WindowHeight>   <WindowWidth>11355</WindowWidth>   <WindowTopX>360</WindowTopX>   <WindowTopY>120</WindowTopY>   <ProtectStructure>False</ProtectStructure>   <ProtectWindows>False</ProtectWindows>  </ExcelWorkbook>

Again, this element could have used the x prefix defined in the root element, but opts to redeclare the default namespace. After this information about the presentation of the spreadsheet generally, we have information about the styles used in the document, stored in the Styles element:

<Styles>   <Style ss: ss:Name="Normal">    <Alignment ss:Vertical="Bottom"/>    <Borders/>    <Font/>    <Interior/>    <NumberFormat/>    <Protection/>   </Style>  </Styles>

Because this spreadsheet is very very simple, there are just a few defaults here. All of the cells in this stylesheet use the Normal style, with no special formatting. Nevertheless, this empty set of elements gives you some idea of what you can do here.

After these preparations, we reach the Worksheet elements. Each of these elements represents one complete worksheet in Excel. Since Excel created three worksheets by default, there are three Worksheet elements here. Spreadsheets with more or fewer worksheets will have as many Worksheet elements as appropriate. The first of the three Worksheet elements is the one containing our data:

<Worksheet ss:Name="Sheet1">   <Table ss:ExpandedColumnCount="1" ss:ExpandedRowCount="3" x:FullColumns="1"        x:FullRows="1">    <Row>     <Cell><Data ss:Type="Number">2</Data></Cell>    </Row>    <Row>     <Cell><Data ss:Type="Number">2</Data></Cell>    </Row>    <Row>     <Cell ss:Formula="=SUM(R[-2]C, R[-1]C)"><Data ss:Type="Number">4</ Data></Cell>    </Row>   </Table>   <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">    <Print>     <ValidPrinterInfo/>     <HorizontalResolution>600</HorizontalResolution>     <VerticalResolution>600</VerticalResolution>    </Print>    <Selected/>    <Panes>     <Pane>      <Number>3</Number>      <ActiveRow>1</ActiveRow>      <ActiveCol>1</ActiveCol>     </Pane>    </Panes>    <ProtectObjects>False</ProtectObjects>    <ProtectScenarios>False</ProtectScenarios>   </WorksheetOptions>  </Worksheet>

The guts of the worksheet are stored in the Table element, while other information about the worksheet is stored in the WorksheetOptions element. For the most part, if you're trying to extract the contents of spreadsheets or create new spreadsheets from existing information, the Table element will be at the heart of your work. The Table element defines the space it contains:

<Table ss:ExpandedColumnCount="1" ss:ExpandedRowCount="3" x:FullColumns="1"       x:FullRows="1">

The ss:ExpandedColumnCount indicates that this spreadsheet has one column, while ss:ExpandedRowCount indicates that this spreadsheet has three rows. Knowing the number of rows and columns gives Excel a chance to prepare for the incoming data. The x:FullColumns and x:FullRows attributes appear to do nothing.

In current versions of SpreadsheetML, multiple Table elements are permitted, but Excel only uses the first of them. According to Microsoft's "Overview of SpreadsheetML," which comes with the Microsoft Office XML Schemas mentioned at the start of this chapter, this will let future versions of Excel "support multiple overlapping ranges by having multiple Table elements."

The contents of the Table element represent the stylesheet as a set of Row elements which themselves contain Cell elements:

   <Row>     <Cell><Data ss:Type="Number">2</Data></Cell>    </Row>    <Row>     <Cell><Data ss:Type="Number">2</Data></Cell>    </Row>    <Row>     <Cell ss:Formula="=SUM(R[-2]C, R[-1]C)"><Data ss:Type="Number">4</ Data></Cell>    </Row>

The first two of these Row elements are identical, containing a Cell element whose Data element contains the value 2. The ss:Type attribute identifies this information as a Number a notable departure from the W3C XML Schema data typing used elsewhere in the Office applications, but consistent with the mapping previously described in Table 6-1. The third row contains a calculated result, the 4 inside of the Data element, as well as the type information and the formula by which that result was calculated. The inclusion of calculated values may make some kinds of import from Excel much easier, and you can always check for the presence of the ss:Formula attribute if you want to exclude calculated values from your processing.

Looking more closely at the ss:Formula attribute, it's fairly clear that using these formulas in other contexts will require reconstructing the table:

ss:Formula="=SUM(R[-2]C, R[-1]C)"

The formula reflects Excel's internal expectations for working with the information, most notably the expectation that the entire table will be available for navigation using relative references between cells. The Row and Cell elements reflect this same structure, so programs built around this XML have a good chance of interpreting these formulas, but decoding them will take some custom logic (XSLT 1.0 won't easily build and navigate this grid) and an object model for storing all the rows and cells at any given time. Depending on the type of information you need from the spreadsheet, this may not matter. If you're importing it into another spreadsheet-like structure, you may have a lot of work to do. If you just want the data, ignoring the formulas shouldn't be a problem.

The WorksheetOptions element contains other information about the worksheet's presentation and operation:

  <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">    <Print>     <ValidPrinterInfo/>     <HorizontalResolution>600</HorizontalResolution>     <VerticalResolution>600</VerticalResolution>    </Print>    <Selected/>    <Panes>     <Pane>      <Number>3</Number>      <ActiveRow>1</ActiveRow>      <ActiveCol>1</ActiveCol>     </Pane>    </Panes>    <ProtectObjects>False</ProtectObjects>    <ProtectScenarios>False</ProtectScenarios>   </WorksheetOptions>

Again, the WorksheetOptions element redefines the default namespace, assigning the same value to no prefix that was assigned to the x prefix at the start of the document. All of this information is considered specific to Excel, not to the spreadsheet generally. While the data here can be useful if you're creating spreadsheet applications, it's not information you'll use for the spreadsheet data itself.

The next two worksheets are empty, so they are represented by relatively minimal placeholders, followed by the closing tag of Workbook:

<Worksheet ss:Name="Sheet2">   <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">    <ProtectObjects>False</ProtectObjects>    <ProtectScenarios>False</ProtectScenarios>   </WorksheetOptions>  </Worksheet>  <Worksheet ss:Name="Sheet3">   <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">    <ProtectObjects>False</ProtectObjects>    <ProtectScenarios>False</ProtectScenarios>   </WorksheetOptions>  </Worksheet> </Workbook>

If all you're concerned with is extracting the data from the spreadsheet, you now have a solid set of basic parts: the Workbook, Worksheet, Row, Cell, and Data elements. For getting information into and out of Excel, that core provides most of the substance you'll need.

7.2.1 Working with More Complex Spreadsheets

While the 2+2=4 example does a nice job of showing the basic structure Excel uses to store spreadsheets in XML, the odds are excellent that you'll need to work with more complicated spreadsheets and formulas. Excel also offers a few structures notably named cells and ranges that can make it much easier to work with Excel data, reducing the otherwise constant need to keep track of how an XML cell corresponds to a particular location on the spreadsheet grid.

We'll start with the spreadsheet shown in Figure 7-6, a list of items sold, with IDs, descriptions, prices, named ranges for all of those, and a calculated total for each transaction.

Figure 7-6. A spreadsheet with more data and named ranges

The "Critters" named range includes the contents of the Critter column, and so on. When this spreadsheet is saved as an XML document, the Worksheet element representing Sheet1 looks like Example 7-3.

Example 7-3. The Worksheet portion of the XML representation of Figure 7-6
 <Worksheet ss:Name="Sheet1">   <Table ss:ExpandedColumnCount="5" ss:ExpandedRowCount="9" x:FullColumns="1"        x:FullRows="1">    <Column ss:AutoFitWidth="0" ss:Width="73.5"/>    <Column ss:AutoFitWidth="0" ss:Width="96.75"/>    <Column ss:Index="5" ss:AutoFitWidth="0" ss:Width="56.25"/>    <Row>     <Cell><Data ss:Type="String">ID Number</Data></Cell>     <Cell><Data ss:Type="String">Critter</Data></Cell>     <Cell><Data ss:Type="String">Price</Data></Cell>     <Cell><Data ss:Type="String">Quantity</Data></Cell>     <Cell><Data ss:Type="String">Total</Data></Cell>    </Row>    <Row>     <Cell><Data ss:Type="Number">4627</Data><NamedCell ss:Name="ID"/ ></Cell>     <Cell><Data ss:Type="String">Diplodocus</Data><NamedCell ss:Name="Critters"/>        </Cell>     <Cell ss:Style><Data ss:Type="Number">22.5</Data><NamedCell       ss:Name="Price"/></Cell>     <Cell><Data ss:Type="Number">127</Data><NamedCell ss:Name= "Quantity"/></Cell>     <Cell ss:Style ss:Formula="=RC[-2]*RC[-1]"><Data  ss:Type="Number">2857.5</Data></Cell>    </Row>    <Row>     <Cell><Data ss:Type="Number">3912</Data><NamedCell ss:Name="ID"/ ></Cell>     <Cell><Data ss:Type="String">Brontosaurus</Data><NamedCell ss         Name="Critters"/></Cell>     <Cell ss:Style><Data ss:Type="Number">17.5</Data><NamedCell       ss:Name="Price"/></Cell>     <Cell><Data ss:Type="Number">74</Data><NamedCell ss:Name= "Quantity"/></Cell>     <Cell ss:Style ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number"         1295</Data></Cell>    </Row>    <Row>     <Cell><Data ss:Type="Number">9845</Data><NamedCell ss:Name="ID"/ ></Cell>     <Cell><Data ss:Type="String">Triceratops</Data>         <NamedCell ss:Name="Critters"/></Cell>     <Cell ss:Style><Data ss:Type="Number">12</Data><NamedCell       ss:Name="Price"/></Cell>     <Cell><Data ss:Type="Number">91</Data><NamedCell ss:Name= "Quantity"/></Cell>     <Cell ss:Style ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">         1092</Data></Cell>    </Row>    <Row>     <Cell><Data ss:Type="Number">9625</Data><NamedCell ss:Name="ID"/ ></Cell>     <Cell><Data ss:Type="String">Vulcanodon</Data><NamedCell ss:Name= "Critters"/>        </Cell>     <Cell ss:Style><Data ss:Type="Number">19</Data><NamedCell       ss:Name="Price"/></Cell>     <Cell><Data ss:Type="Number">108</Data><NamedCell ss:Name= "Quantity"/></Cell>     <Cell ss:Style ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">         2052</Data></Cell>    </Row>    <Row>     <Cell><Data ss:Type="Number">5903</Data><NamedCell ss:Name="ID"/></Cell>     <Cell><Data ss:Type="String">Stegosaurus</Data>        <NamedCell ss:Name="Critters"/></Cell>     <Cell ss:Style><Data ss:Type="Number">18.5</Data><NamedCell       ss:Name="Price"/></Cell>     <Cell><Data ss:Type="Number">63</Data><NamedCell ss:Name="Quantity"/></Cell>     <Cell ss:Style ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">         1165.5</Data></Cell>    </Row>    <Row>     <Cell><Data ss:Type="Number">1824</Data><NamedCell ss:Name="ID"/></Cell>     <Cell><Data ss:Type="String">Monoclonius</Data>         <NamedCell ss:Name="Critters"/></Cell>     <Cell ss:Style><Data ss:Type="Number">16.5</Data><NamedCell       ss:Name="Price"/></Cell>     <Cell><Data ss:Type="Number">133</Data><NamedCell ss:Name="Quantity"/></Cell>     <Cell ss:Style ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">         2194.5</Data></Cell>    </Row>    <Row>     <Cell><Data ss:Type="Number">9728</Data><NamedCell ss:Name="ID"/></Cell>     <Cell><Data ss:Type="String">Megalosaurus</Data>        <NamedCell ss:Name="Critters"/></Cell>     <Cell ss:Style><Data ss:Type="Number">23</Data><NamedCell       ss:Name="Price"/></Cell>     <Cell><Data ss:Type="Number">128</Data><NamedCell ss:Name="Quantity"/></Cell>     <Cell ss:Style ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">         2944</Data></Cell>    </Row>    <Row>     <Cell><Data ss:Type="Number">8649</Data><NamedCell ss:Name="ID"/></Cell>     <Cell><Data ss:Type="String">Barosaurus</Data><NamedCell ss:Name="Critters"/>        </Cell>     <Cell ss:Style><Data ss:Type="Number">17</Data><NamedCell       ss:Name="Price"/></Cell>     <Cell><Data ss:Type="Number">91</Data><NamedCell ss:Name="Quantity"/></Cell>     <Cell ss:Style ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">         1547</Data></Cell>    </Row>   </Table>   <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">    <Print>     <ValidPrinterInfo/>     <HorizontalResolution>600</HorizontalResolution>     <VerticalResolution>600</VerticalResolution>    </Print>    <Selected/>    <Panes>     <Pane>      <Number>3</Number>      <ActiveRow>8</ActiveRow>      <ActiveCol>4</ActiveCol>     </Pane>    </Panes>    <ProtectObjects>False</ProtectObjects>    <ProtectScenarios>False</ProtectScenarios>   </WorksheetOptions>  </Worksheet>

It has the same pattern of Row elements containing Cell elements (the Column information is strictly for formatting), and the same surrounding metadata, but it also now contains additional information in many of its Cell elements:

   <Row>     <Cell><Data ss:Type="Number">4627</Data><NamedCell ss:Name="ID"/></Cell>     <Cell><Data ss:Type="String">Diplodocus</Data>        <NamedCell ss:Name="Critters"/></Cell>     <Cell ss:Style><Data ss:Type="Number">22.5</Data><NamedCell       ss:Name="Price"/></Cell>     <Cell><Data ss:Type="Number">127</Data><NamedCell ss:Name="Quantity"/></Cell>     <Cell ss:Style ss:Formula="=RC[-2]*RC[-1]">        <Data ss:Type="Number">2857.5</Data></Cell>    </Row>

With the addition of the NamedCell element and its ss:Name attribute, we now have a way to select cells from the row by name in addition to position. The XML spreadsheet also contains a summary of the named ranges in a Names element that precedes the Worksheet elements:

 <Names>   <NamedRange ss:Name="Critters" ss:RefersTo="=Sheet1!R2C2:R9C2"/>   <NamedRange ss:Name="ID" ss:RefersTo="=Sheet1!R2C1:R9C1"/>   <NamedRange ss:Name="Price" ss:RefersTo="=Sheet1!R2C3:R9C3"/>   <NamedRange ss:Name="Quantity" ss:RefersTo="=Sheet1!R2C4:R9C4"/>  </Names>

While the Names element is useful to Excel in loading a document, you may not find processing it (or even creating it) with other applications, notably XSLT and XPath, to be much fun once again, you need to have the grid available to figure out (or assign) the references. Fortunately, Excel can recreate named ranges from just the NamedCell information, so you don't need to worry about this extra step unless you want to.

Office 2003 XML
Office 2003 XML
ISBN: 0596005385
EAN: 2147483647
Year: 2003
Pages: 135

Similar book on Amazon

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net