Recipe 2.27. Extracting Text from Microsoft Word Documents

Credit: Simon Brunning, Pavel Kosina

Problem

You want to extract the text content from each Microsoft Word document in a directory tree on Windows into a corresponding text file.

Solution

With the PyWin32 extension, we can access Word itself, through COM, to perform the conversion:

    import fnmatch, os, sys, win32com.client
    wordapp = win32com.client.gencache.EnsureDispatch("Word.Application")
    try:
        for path, dirs, files in os.walk(sys.argv[1]):
            for filename in files:
                if not fnmatch.fnmatch(filename, '*.doc'): continue
                doc = os.path.abspath(os.path.join(path, filename))
                print("processing %s" % doc)
                wordapp.Documents.Open(doc)
                docastxt = doc[:-3] + 'txt'
                wordapp.ActiveDocument.SaveAs(docastxt,
                    FileFormat=win32com.client.constants.wdFormatText)
                wordapp.ActiveDocument.Close()
    finally:
        # ensure Word is properly shut down even if we get an exception
        wordapp.Quit()
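
The per-document step can also be isolated in a small function, so that each document is closed even when saving it fails. What follows is a minimal sketch rather than part of the original recipe: the name convert_to_text and its parameters are hypothetical, and the sketch assumes that Documents.Open returns the document it has just opened and that wdFormatText has the numeric value 2, as Microsoft's documentation of the Word object model describes.

```python
WD_FORMAT_TEXT = 2  # assumed numeric value of Word's wdFormatText constant


def convert_to_text(wordapp, doc_path, txt_path, text_format=WD_FORMAT_TEXT):
    """Open doc_path in Word, save it as plain text to txt_path, then close it.

    wordapp is expected to be a Word.Application COM object, for example the
    one returned by win32com.client.gencache.EnsureDispatch in the Solution.
    (Hypothetical helper, not part of the original recipe.)
    """
    # work with the document returned by Open instead of ActiveDocument
    document = wordapp.Documents.Open(doc_path)
    try:
        document.SaveAs(txt_path, FileFormat=text_format)
    finally:
        # close this document even if SaveAs raised an exception
        document.Close()
```

With such a helper, the three calls on wordapp in the body of the Solution's loop could be replaced by convert_to_text(wordapp, doc, docastxt). Because the function only needs an object that offers Documents.Open, it can also be exercised with a stand-in object on a machine that has no Word installed.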

Discussion

A useful aspect of most Windows applications is that you can script them via COM, and the PyWin32 extension makes it fairly easy to perform COM scripting from Python. The extension enables you to write Python scripts to perform many kinds of Windows tasks. The script in this recipe's Solution drives Microsoft Word to extract the text from every .doc file in a directory tree into a corresponding .txt text file. The root of the tree is taken from the script's first command-line argument, sys.argv[1].

Using the os.walk function, we can access every subdirectory in a tree with a simple for statement, without recursion. With the fnmatch.fnmatch function, we can check a filename to determine whether it matches an appropriate wildcard, here '*.doc'. Once we have determined the name of a Word document file, we process that name with functions from os.path to turn it into a complete absolute path (Word runs as a separate process, so a path relative to the script's current directory would not be reliable), and have Word open it, save it as text, and close it again.
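
The directory walk and the filename matching do not depend on Word at all, so they can be tried out separately. The following is a minimal sketch under that assumption: the generator name find_word_docs and its pattern parameter are hypothetical, and os.path.splitext stands in for the Solution's doc[:-3] slice as one way of swapping the extension.

```python
import fnmatch
import os


def find_word_docs(root, pattern='*.doc'):
    """Yield (doc_path, txt_path) pairs for each matching file under root.

    (Hypothetical helper, not part of the original recipe.)
    """
    # os.walk visits root and every subdirectory, with no explicit recursion
    for path, dirs, files in os.walk(root):
        for filename in files:
            if not fnmatch.fnmatch(filename, pattern):
                continue
            # build the complete absolute path of the document
            doc_path = os.path.abspath(os.path.join(path, filename))
            # replace the extension, whatever its length, with .txt
            txt_path = os.path.splitext(doc_path)[0] + '.txt'
            yield doc_path, txt_path
```

Looping over find_word_docs(sys.argv[1]) and printing each pair shows which files the Solution's script would process and where the text would be written, without starting Word.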

If you don't have Word, you may need to take a completely different approach. One possibility is to use OpenOffice.org, which is able to load Word documents. Another is to use a program specifically designed to read Word documents, such as Antiword, found at http://www.winfield.demon.nl/. However, we have not explored these alternative options.
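
For readers who want to try the Antiword route, a rough sketch follows. It has not been tried against real documents here and rests on assumptions: that an antiword executable is installed and can be found on the PATH, and that, when given a single file name as its argument, it writes the document's text to standard output. The function names antiword_command and extract_text_with_antiword are hypothetical.

```python
import subprocess


def antiword_command(doc_path, executable='antiword'):
    """Build the argument list for running Antiword on one document."""
    return [executable, doc_path]


def extract_text_with_antiword(doc_path, executable='antiword'):
    """Return, as bytes, whatever Antiword writes to standard output.

    Assumes the executable is installed; raises OSError if it cannot be
    started and subprocess.CalledProcessError if it exits with an error.
    (Hypothetical helper, not part of the original recipe.)
    """
    return subprocess.check_output(antiword_command(doc_path, executable))
```

The result is returned as bytes because the encoding of Antiword's output depends on how it is configured; decoding it is left to the caller.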

See Also

Mark Hammond, Andy Robinson, Python Programming on Win32 (O'Reilly), for documentation on PyWin32; http://msdn.microsoft.com, for Microsoft's documentation of the object model of Microsoft Word; Library Reference and Python in a Nutshell sections on modules fnmatch and os.path, and function os.walk.



Python Cookbook
ISBN: 0596007973
Year: 2004
Pages: 420