
Recipe 2.26. Extracting Text from OpenOffice.org Documents

Credit: Dirk Holtwick

Problem

You need to extract the text content (with or without the attendant XML markup) from an OpenOffice.org document.

Solution

An OpenOffice.org document is just a zip file that aggregates XML documents according to a well-documented standard. To access our precious data, we don't even need to have OpenOffice.org installed:

import zipfile, re
rx_stripxml = re.compile("<[^>]*?>", re.DOTALL|re.MULTILINE)
def convert_OO(filename, want_text=True):
    """ Convert an OpenOffice.org document to XML or text. """
    zf = zipfile.ZipFile(filename, "r")
    data = zf.read("content.xml")
    zf.close()
    if want_text:
        data = " ".join(rx_stripxml.sub(" ", data).split())
    return data
if __name__=="__main__":
    import sys
    if len(sys.argv)>1:
        for docname in sys.argv[1:]:
            print 'Text of', docname, ':'
            print convert_OO(docname)
            print 'XML of', docname, ':'
            print convert_OO(docname, want_text=False)
    else:
        print 'Call with paths to OO.o doc files to see Text and XML forms.'
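The recipe's listing is written for Python 2 (note the print statements). Under Python 3, the read method of a ZipFile returns bytes, so the data must be decoded before a string regular expression can be applied to it. The following is only a minimal sketch of such an adaptation: the name convert_oo_py3 and the encoding parameter are our own illustrative choices, and the UTF-8 default is an assumption about how content.xml is encoded.

```python
import re
import zipfile

# Same pattern as the recipe: any tag, from "<" up to the next ">".
rx_stripxml = re.compile(r"<[^>]*?>", re.DOTALL)

def convert_oo_py3(filename, want_text=True, encoding="utf-8"):
    """Return the XML or the plain text of an OpenOffice.org document as a str.

    Hypothetical Python 3 variant of convert_OO; the default encoding
    is an assumption, not something the recipe states.
    """
    # ZipFile is a context manager, so the archive is closed even on errors.
    with zipfile.ZipFile(filename, "r") as zf:
        # read() returns bytes in Python 3; decode before using a str regex.
        data = zf.read("content.xml").decode(encoding)
    if want_text:
        # Replace each tag with a space, then normalize whitespace.
        data = " ".join(rx_stripxml.sub(" ", data).split())
    return data
```

Calling convert_oo_py3(path) returns the rough text, and convert_oo_py3(path, want_text=False) returns the XML unchanged, just as with the original function.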

Discussion

OpenOffice.org documents are zip files, and in addition to other contents, they always contain the file content.xml. This recipe's job, therefore, essentially boils down to just extracting this file. By default, the recipe then throws away XML tags with a simple regular expression, splits the result by whitespace, and joins it up again with a single blank to save space. Of course, we could use an XML parser to get information in a vastly richer and more structured way, but if all we need is the rough textual content, this fast, rough-and-ready approach may suffice.
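For comparison, here is a rough sketch of the XML-parser route mentioned above, using the standard library's xml.etree.ElementTree under Python 3. The function name extract_text_with_parser is hypothetical, and this is only one of several ways a parser could be applied. It collects every text fragment in document order and applies the same whitespace normalization as the recipe.

```python
import zipfile
import xml.etree.ElementTree as ET

def extract_text_with_parser(filename):
    """Return the text of content.xml, gathered by an XML parser.

    Hypothetical helper, shown only to contrast with the regex approach.
    """
    with zipfile.ZipFile(filename, "r") as zf:
        # fromstring() accepts the raw bytes read from the archive.
        root = ET.fromstring(zf.read("content.xml"))
    # itertext() yields each text fragment of the tree in document order.
    fragments = root.itertext()
    # Same normalization as the recipe: split on whitespace, rejoin with one blank.
    return " ".join(" ".join(fragments).split())
```

One practical difference is that a parser decodes entity references such as &amp;, whereas the tag-stripping regular expression leaves them in the output as they appear in the XML.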

Specifically, the regular expression rx_stripxml matches any XML tag (opening or closing) from the leading < to the terminating >. Inside function convert_OO, in the statements guarded by if want_text, we use that regular expression to change every XML tag into a space, then normalize whitespace by splitting (i.e., calling the string method split, which splits on any sequence of whitespace), and rejoining (with " ".join, to use a single blank character as the joiner). Essentially, this split-and-rejoin process changes any sequence of whitespace into a single blank character. More advanced ways to extract all text from an XML document are shown in Recipe 12.3.
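The steps just described can be traced on a small string. This is only an illustration under Python 3: the sample markup is made up for the purpose, and a real content.xml is far larger.

```python
import re

# The recipe's pattern: any tag, from "<" up to the next ">".
rx_stripxml = re.compile(r"<[^>]*?>", re.DOTALL | re.MULTILINE)

# Made-up sample with a line break and extra blanks inside a paragraph.
sample = "<text:p>First\n  paragraph</text:p><text:p>Second</text:p>"

# Step 1: every tag becomes a single space.
no_tags = rx_stripxml.sub(" ", sample)

# Step 2: split() with no arguments breaks on any run of whitespace.
words = no_tags.split()          # ['First', 'paragraph', 'Second']

# Step 3: " ".join() glues the words back together with exactly one blank.
text = " ".join(words)           # 'First paragraph Second'
```

Replacing each tag with a space, rather than with an empty string, matters here: otherwise the end of one paragraph and the start of the next would fuse into paragraphSecond.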
