Recipe12.7.Parsing Microsoft Excel s XML

Recipe 12.7. Parsing Microsoft Excel's XML

Credit: Thomas Guettler

Problem

You have Microsoft Excel spreadsheets saved in XML form, and want to parse them into memory as Python nested lists.

Solution

The XML form of Excel spreadsheets is quite simple: all text is in Cell tags, which are nested in Row tags nested in Table tags. SAX makes it quite simple to parse this kind of XML into memory:

import sys from xml.sax import saxutils, parse class ExcelHandler(saxutils.DefaultHandler):     def _ _init_ _(self):         self.chars = [  ]         self.cells = [  ]         self.rows = [  ]         self.tables = [  ]     def characters(self, content):         self.chars.append(content)     def startElement(self, name, atts):         if name=="Cell":             self.chars = [  ]         elif name=="Row":             self.cells=[  ]         elif name=="Table":             self.rows = [  ]     def endElement(self, name):         if name=="Cell":             self.cells.append(''.join(self.chars))         elif name=="Row":             self.rows.append(self.cells)         elif name=="Table":             self.tables.append(self.rows) if _ _name_ _ == '_ _main_ _':     excelHandler = ExcelHandler( )     parse(sys.argv[1], excelHandler)     print excelHandler.tables

Discussion

The structure of the parser presented in this recipe is pleasingly simple: at each of three logical nesting levels of data, we collect content into a list. Each time a tag of a given level begins, we start with an empty list for it; each time the tag ends, we append the tag's contents to the list of the next upper level. The net result is that the top-level list, the one named tables, accumulates all of the spreadsheet's contents with the proper structure (a triply nested list). At the lowest level, of course, we join all the text strings that are reported as being within the same cell into a single cell content text string, when we accumulate, because the division between the various strings is just an artefact of the XML parsing process.

For example, consider a tiny spreadsheet with one column and three rows, where the first two rows each hold the number 2 and the third one holds the number 4 obtained by summing the numbers in the first two rows with an Excel formula. The relevant snippet of the Excel XML output (XMLSS format, as Microsoft calls it) is then:

<Table ss:ExpandedColumnCount="1" ss:ExpandedRowCount="3"        x:FullColumns="1" x:FullRows="1">    <Row>     <Cell><Data ss:Type="Number">2</Data></Cell>    </Row>    <Row>     <Cell><Data ss:Type="Number">2</Data></Cell>    </Row>    <Row>     <Cell ss:Formula="=SUM(R[-2]C, R[-1]C)">           <Data ss:Type="Number">4</Data></Cell>    </Row> </Table>

and running the script in this recipe over this file emits:

[[[u'2'], [u'2'], [u'4']]]

As you can see, the XMLSS file also contains a lot of supplementary information that this recipe is not collectingthe attributes hold information about the type of data (number or string), the formula used for the computation (if any), and so on. If you need any or all of this supplemental information, it's not hard to enrich this recipe to record and use it.

Recipe12.7.Parsing Microsoft Excel s XML