7.3 Extracting Information from XML Spreadsheets


When the spreadsheet data arrives in a form like Example 7-3, it's easy to extract the data using tools like XSLT. All the cells in the area used contain data, and it's just a simple table. If, for example, we wanted to extract the data in this spreadsheet and produce a much lighter XML document containing just the data, the stylesheet might look like that shown in Example 7-4.

Example 7-4. A simple stylesheet for extracting data from Excel tables
<xsl:stylesheet version="1.0"    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"   xmlns="http://simonstl.com/ns/dinosaurs/"   xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"  > <xsl:output method="xml" omit-xml-declaration="yes" indent="yes" encoding="US-    ASCII"/> <xsl:template match="/">   <xsl:apply-templates select="ss:Workbook"/>  </xsl:template> <xsl:template match="ss:Workbook">   <dinosaurs>        <xsl:apply-templates select="ss:Worksheet[@ss:Name = 'Sheet1']"/>    </dinosaurs> </xsl:template> <xsl:template match="ss:Worksheet">    <xsl:apply-templates select="ss:Table" /> </xsl:template> <xsl:template match="ss:Table">    <xsl:apply-templates select="ss:Row[position( ) &gt; 1]" /> </xsl:template> <xsl:template match="ss:Row"> <sale>    <IDnum><xsl:apply-templates select="ss:Cell[1]" /></IDnum>    <critter><xsl:apply-templates select="ss:Cell[2]" /></critter>    <price><xsl:apply-templates select="ss:Cell[3]" /></price>    <quantity><xsl:apply-templates select="ss:Cell[4]" /></quantity>    <total><xsl:apply-templates select="ss:Cell[5]" /></total> </sale> </xsl:template> </xsl:stylesheet>

Note the namespace declarations in the root xsl:stylesheet element. If you forget any of these, your stylesheet won't behave as expected, even though everything else looks right.

See Appendix B for more information if you're unfamiliar with XSLT and XSLT processing tools.


Most of the work here is done by the last template, which just matches the rows in Sheet1. The prior templates guide the stylesheet past all the Excel metadata, into Sheet1, and make sure that it skips the first Row element, which contains the column titles. The last template puts the contents of the first Cell element into an element named IDnum, the second Cell element into an element named critter, and so on. The results of running this stylesheet against the XML document in Example 7-2 are shown in Example 7-5.

Example 7-5. Simple XML produced by using XSLT on SpreadsheetML
<dinosaurs xmlns="http://simonstl.com/ns/dinosaurs/" xmlns:ss="urn:schemas-microsoft-com: office:spreadsheet"> <sale> <IDnum>4627</IDnum> <critter>Diplodocus</critter> <price>22.5</price> <quantity>127</quantity> <total>2857.5</total> </sale> <sale> <IDnum>3912</IDnum> <critter>Brontosaurus</critter> <price>17.5</price> <quantity>74</quantity> <total>1295</total> </sale> <sale> <IDnum>9845</IDnum> <critter>Triceratops</critter> <price>12</price> <quantity>91</quantity> <total>1092</total> </sale> <sale> <IDnum>9625</IDnum> <critter>Vulcanodon</critter> <price>19</price> <quantity>108</quantity> <total>2052</total> </sale> <sale> <IDnum>5903</IDnum> <critter>Stegosaurus</critter> <price>18.5</price> <quantity>63</quantity> <total>1165.5</total> </sale> <sale> <IDnum>1824</IDnum> <critter>Monoclonius</critter> <price>16.5</price> <quantity>133</quantity> <total>2194.5</total> </sale> <sale> <IDnum>9728</IDnum> <critter>Megalosaurus</critter> <price>23</price> <quantity>128</quantity> <total>2944</total> </sale> <sale> <IDnum>8649</IDnum> <critter>Barosaurus</critter> <price>17</price> <quantity>91</quantity> <total>1547</total> </sale> </dinosaurs>

This kind of extraction is easy, but it's fairly unusual that real-world spreadsheets will be this convenient. It's not impossible, of course I get a spreadsheet whose first sheet is structured like this once a week but there are many tougher cases. Lots of spreadsheets skip rows and cells, have areas that are used for different kinds of content, and present additional challenges to developers who need to extract information from them. Fortunately, while every spreadsheet is different, there are a few basic patterns that can help you reach into them. Figure 7-7 shows a spreadsheet with much the same data as that in Figure 7-6, but with a few complicating factors.

Figure 7-7. A spreadsheet with gaps and individual data components
figs/oxml_0707.gif


The first row contains a date identifying when the data is from, the second row is blank, rows three to eleven contain the same data shown in Figure 7-6, and row twelve shows a total. Examining the Table element in the SpreadsheetML, listed in Example 7-6, shows how Excel treats these skipped rows and columns.

Example 7-6. More complex XML produced from the spreadsheet in Figure 7-7
<Table ss:ExpandedColumnCount="5" ss:ExpandedRowCount="12" x:FullColumns="1"    x:FullRows="1">    <Column ss:AutoFitWidth="0" ss:Width="73.5"/>    <Column ss:AutoFitWidth="0" ss:Width="96.75"/>    <Column ss:Index="5" ss:AutoFitWidth="0" ss:Width="56.25"/>    <Row>     <Cell><Data ss:Type="String">Sales for:</Data></Cell>     <Cell ss:Style><Data ss:Type="DateTime">2004-01-01T00:00:00.000        </Data><NamedCell       ss:Name="Date"/></Cell>    </Row>    <Row ss:Index="3">     <Cell><Data ss:Type="String">ID Number</Data></Cell>     <Cell><Data ss:Type="String">Critter</Data></Cell>     <Cell><Data ss:Type="String">Price</Data></Cell>     <Cell><Data ss:Type="String">Quantity</Data></Cell>     <Cell><Data ss:Type="String">Total</Data></Cell>    </Row>    <Row>     <Cell><Data ss:Type="Number">4627</Data><NamedCell ss:Name="ID"/></Cell>     <Cell><Data ss:Type="String">Diplodocus</Data><NamedCell ss:Name="Critters"/>        </Cell>     <Cell ss:Style><Data ss:Type="Number">22.5</Data><NamedCell       ss:Name="Price"/></Cell>     <Cell><Data ss:Type="Number">127</Data><NamedCell ss:Name="Quantity"/></Cell>     <Cell ss:Style ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">        2857.5</Data><NamedCell       ss:Name="Total"/></Cell>    </Row>    <Row>     <Cell><Data ss:Type="Number">3912</Data><NamedCell ss:Name="ID"/></Cell>     <Cell><Data ss:Type="String">Brontosaurus</Data>        <NamedCell ss:Name="Critters"/></Cell>     <Cell ss:Style><Data ss:Type="Number">17.5</Data><NamedCell       ss:Name="Price"/></Cell>     <Cell><Data ss:Type="Number">74</Data><NamedCell ss:Name="Quantity"/></Cell>     <Cell ss:Style ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">        1295</Data><NamedCell       ss:Name="Total"/></Cell>    </Row>    <Row>     <Cell><Data ss:Type="Number">9845</Data><NamedCell ss:Name="ID"/></Cell>     <Cell><Data ss:Type="String">Triceratops</Data>        <NamedCell ss:Name="Critters"/></Cell>     <Cell ss:Style><Data ss:Type="Number">12</Data><NamedCell       ss:Name="Price"/></Cell>     <Cell><Data ss:Type="Number">91</Data><NamedCell ss:Name="Quantity"/></Cell>     <Cell ss:Style ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">        1092</Data><NamedCell       ss:Name="Total"/></Cell>    </Row>    <Row>     <Cell><Data ss:Type="Number">9625</Data><NamedCell ss:Name="ID"/></Cell>     <Cell><Data ss:Type="String">Vulcanodon</Data>        <NamedCell ss:Name="Critters"/></Cell>     <Cell ss:Style><Data ss:Type="Number">19</Data><NamedCell       ss:Name="Price"/></Cell>     <Cell><Data ss:Type="Number">108</Data><NamedCell ss:Name="Quantity"/></Cell>     <Cell ss:Style ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">        2052</Data><NamedCell       ss:Name="Total"/></Cell>    </Row>    <Row>     <Cell><Data ss:Type="Number">5903</Data><NamedCell ss:Name="ID"/></Cell>     <Cell><Data ss:Type="String">Stegosaurus</Data>        <NamedCell ss:Name="Critters"/></Cell>     <Cell ss:Style><Data ss:Type="Number">18.5</Data><NamedCell       ss:Name="Price"/></Cell>     <Cell><Data ss:Type="Number">63</Data><NamedCell ss:Name="Quantity"/></Cell>     <Cell ss:Style ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">        1165.5</Data><NamedCell       ss:Name="Total"/></Cell>    </Row>    <Row>     <Cell><Data ss:Type="Number">1824</Data><NamedCell ss:Name="ID"/></Cell>     <Cell><Data ss:Type="String">Monoclonius</Data>        <NamedCell ss:Name="Critters"/></Cell>     <Cell ss:Style><Data ss:Type="Number">16.5</Data><NamedCell       ss:Name="Price"/></Cell>     <Cell><Data ss:Type="Number">133</Data><NamedCell ss:Name="Quantity"/></Cell>     <Cell ss:Style ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">        2194.5</Data><NamedCell       ss:Name="Total"/></Cell>    </Row>    <Row>     <Cell><Data ss:Type="Number">9728</Data><NamedCell ss:Name="ID"/></Cell>     <Cell><Data ss:Type="String">Megalosaurus</Data>        <NamedCell ss:Name="Critters"/></Cell>     <Cell ss:Style><Data ss:Type="Number">23</Data><NamedCell       ss:Name="Price"/></Cell>     <Cell><Data ss:Type="Number">128</Data><NamedCell ss:Name="Quantity"/></Cell>     <Cell ss:Style ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">         2944</Data><NamedCell       ss:Name="Total"/></Cell>    </Row>    <Row>     <Cell><Data ss:Type="Number">8649</Data><NamedCell ss:Name="ID"/></Cell>     <Cell><Data ss:Type="String">Barosaurus</Data><NamedCell ss:Name="Critters"/>        </Cell>     <Cell ss:Style><Data ss:Type="Number">17</Data><NamedCell       ss:Name="Price"/></Cell>     <Cell><Data ss:Type="Number">91</Data><NamedCell ss:Name="Quantity"/></Cell>     <Cell ss:Style ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">        1547</Data><NamedCell       ss:Name="Total"/></Cell>    </Row>    <Row>     <Cell ss:Index="4"><Data ss:Type="String">Total:</Data></Cell>     <Cell ss:Style ss:Formula="=SUM(R[-8]C:R[-1]C)">        <Data ss:Type="Number">15147.5</Data><NamedCell       ss:Name="GrandTotal"/></Cell>    </Row>   </Table>

Excel doesn't report blank rows or cells. Instead, the first Row or Cell element after the blanks has an ss:Index attribute identifying its position. This means that stylesheets and other processors can't just count their way through the grid they have to keep track of where the SpreadsheetML says things go.

Converting this spreadsheet to XML like that shown in Example 7-4 will be somewhat more difficult. There are two approaches that can be applied to this. The first approach, the stylesheet in Example 7-7, modifies the stylesheet shown in Example 7-4, and the changes are highlighted.

Example 7-7. A modified stylesheet for dealing with the new spreadsheet
<xsl:stylesheet version="1.0"    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"   xmlns="http://simonstl.com/ns/dinosaurs/"   xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"  > <xsl:output method="xml" omit-xml-declaration="yes" indent="yes" encoding="US-    ASCII"/> <xsl:template match="/">   <xsl:apply-templates select="ss:Workbook"/>  </xsl:template> <xsl:template match="ss:Workbook">   <dinosaurs>        <xsl:apply-templates select="ss:Worksheet[@ss:Name = 'Sheet1']"/>    </dinosaurs> </xsl:template> <xsl:template match="ss:Worksheet">    <date><xsl:value-of select="ss:Table/ss:Row/ss:Cell[@ss:StyleID = 's21']" />      </date>    <xsl:apply-templates select="ss:Table" /> </xsl:template> <xsl:template match="ss:Table">    <xsl:apply-templates select="ss:Row[position( ) &gt; 2]" /> <!--Note that because Excel skips the blank row, the third row is in position 2--> </xsl:template> <xsl:template match="ss:Row[ss:Cell[4]]"> <sale>    <IDnum><xsl:apply-templates select="ss:Cell[1]" /></IDnum>    <critter><xsl:apply-templates select="ss:Cell[2]" /></critter>    <price><xsl:apply-templates select="ss:Cell[3]" /></price>    <quantity><xsl:apply-templates select="ss:Cell[4]" /></quantity>    <total><xsl:apply-templates select="ss:Cell[5]" /></total> </sale> </xsl:template> <xsl:template match="ss:Row"> <total><xsl:apply-templates select="ss:Cell[2]" /></total> </xsl:template> </xsl:stylesheet>

Running this stylesheet against the SpreadsheetML produces XML much like that shown in Example 7-6, shown here in Example 7-8.

Example 7-8. XML produced by using XSLT on more complex SpreadsheetML
<dinosaurs xmlns="http://simonstl.com/ns/dinosaurs/" xmlns:ss="urn:schemas-microsoft-com: office:spreadsheet"> <date>2004-01-01T00:00:00.000</date> <sale> <IDnum>4627</IDnum> <critter>Diplodocus</critter> <price>22.5</price> <quantity>127</quantity> <total>2857.5</total> </sale> <sale> <IDnum>3912</IDnum> <critter>Brontosaurus</critter> <price>17.5</price> <quantity>74</quantity> <total>1295</total> </sale> <sale> <IDnum>9845</IDnum> <critter>Triceratops</critter> <price>12</price> <quantity>91</quantity> <total>1092</total> </sale> <sale> <IDnum>9625</IDnum> <critter>Vulcanodon</critter> <price>19</price> <quantity>108</quantity> <total>2052</total> </sale> <sale> <IDnum>5903</IDnum> <critter>Stegosaurus</critter> <price>18.5</price> <quantity>63</quantity> <total>1165.5</total> </sale> <sale> <IDnum>1824</IDnum> <critter>Monoclonius</critter> <price>16.5</price> <quantity>133</quantity> <total>2194.5</total> </sale> <sale> <IDnum>9728</IDnum> <critter>Megalosaurus</critter> <price>23</price> <quantity>128</quantity> <total>2944</total> </sale> <sale> <IDnum>8649</IDnum> <critter>Barosaurus</critter> <price>17</price> <quantity>91</quantity> <total>1547</total> </sale> <total>15147.5</total> </dinosaurs>

A smarter approach uses the NamedCell element's ss:Name attribute, producing a similar result without relying on changeable details like row and cell positions. The stylesheet in Example 7-9 uses XSLT predicates to test for these attributes, yielding a stylesheet whose functionality is easier to discern. Places where this stylesheet references named ranges and cells are highlighted in bold.

Example 7-9. A SpreadsheetML transform that relies on named range information
<xsl:stylesheet version="1.0"    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"   xmlns="http://simonstl.com/ns/dinosaurs/"   xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"  > <xsl:output method="xml" omit-xml-declaration="yes" indent="yes" encoding="US-    ASCII"/> <xsl:template match="ss:Workbook">   <dinosaurs>        <xsl:apply-templates select="ss:Worksheet[@ss:Name = 'Sheet1']"/>    </dinosaurs> </xsl:template> <xsl:template match="ss:Worksheet">    <date><xsl:value-of select="ss:Table/ss:Row/ss:Cell[ss:NamedCell/@ss:Name =      'Date']" /></date>    <xsl:apply-templates select="ss:Table" /> <total><xsl:value-of select="ss:Table/ss:Row/ss:Cell[ss:NamedCell/@ss:Name = 'GrandTotal']" /></total> </xsl:template> <xsl:template match="ss:Table">    <xsl:apply-templates select="ss:Row[position( ) &gt; 2]" /> </xsl:template> <!--Only create sale elements for Rows which start with an ID--> <xsl:template match="ss:Row[ss:Cell[1]/ss:NamedCell/@ss:Name='ID']"> <sale>    <IDnum><xsl:apply-templates select="ss:Cell[ss:NamedCell/@ss:Name='ID']" />   </IDnum>    <critter><xsl:apply-templates select="ss:Cell[ss:NamedCell/@ss:Name='Critters']" /></critter>    <price><xsl:apply-templates select="ss:Cell[ss:NamedCell/@ss:Name='Price']" />      </price>    <quantity><xsl:apply-templates select="ss:Cell[ss:NamedCell/@ss:Name='Quantity']" /></quantity>    <total><xsl:apply-templates select="ss:Cell[ss:NamedCell/@ss:Name='Total']" />      </total> </sale> </xsl:template> <xsl:template match="ss:Row" /> </xsl:stylesheet>

This stylesheet will produce exactly the same output as the stylesheet in Example 7-7, which will look like the result in Example 7-8.



Office 2003 XML
Office 2003 XML
ISBN: 0596005385
EAN: 2147483647
Year: 2003
Pages: 135

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net