Recipe 2.27. Extracting Text from Microsoft Word Documents

Credit: Simon Brunning, Pavel Kosina

Problem

You want to extract the text content from each Microsoft Word document in a directory tree on Windows into a corresponding text file.

Solution

With the PyWin32 extension, we can access Word itself, through COM, to perform the conversion:

    import fnmatch, os, sys, win32com.client
    wordapp = win32com.client.gencache.EnsureDispatch("Word.Application")
    try:
        for path, dirs, files in os.walk(sys.argv[1]):
            for filename in files:
                if not fnmatch.fnmatch(filename, '*.doc'): continue
                doc = os.path.abspath(os.path.join(path, filename))
                # parenthesized form works as a statement or as a function call
                print("processing %s" % doc)
                wordapp.Documents.Open(doc)
                # replace the trailing 'doc' extension with 'txt'
                docastxt = doc[:-3] + 'txt'
                wordapp.ActiveDocument.SaveAs(docastxt,
                    FileFormat=win32com.client.constants.wdFormatText)
                wordapp.ActiveDocument.Close()
    finally:
        # ensure Word is properly shut down even if we get an exception
        wordapp.Quit()

Discussion

A useful aspect of most Windows applications is that you can script them via COM, and the PyWin32 extension makes it fairly easy to perform COM scripting from Python. The extension enables you to write Python scripts to perform many kinds of Windows tasks. The script in this recipe's Solution drives Microsoft Word to extract the text from every .doc file in a directory tree into a corresponding .txt text file.

Using the os.walk function, we can access every subdirectory in a tree with a simple for statement, without recursion. With the fnmatch.fnmatch function, we can check a filename to determine whether it matches an appropriate wildcard, here '*.doc'. Once we have determined the name of a Word document file, we process that name with functions from os.path to turn it into a complete absolute path, and have Word open it, save it as text, and close it again.

If you don't have Word, you may need to take a completely different approach. One possibility is to use OpenOffice.org, which is able to load Word documents.
Another is to use a program specifically designed to read Word documents, such as Antiword, found at http://www.winfield.demon.nl/. However, we have not explored these alternative options.

See Also

Mark Hammond and Andy Robinson, Python Programming on Win32 (O'Reilly), for documentation on PyWin32; http://msdn.microsoft.com, for Microsoft's documentation of the object model of Microsoft Word; Library Reference and Python in a Nutshell sections on modules fnmatch and os.path, and function os.walk.
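
The directory traversal described in the Discussion can be tried on its own, without Word or PyWin32 installed. The following is a minimal sketch of that part only: the helper name iter_word_documents and its pattern parameter are hypothetical and not part of the recipe, and the sketch uses os.path.splitext rather than the Solution's slice to build the output name, so it does not assume a three-character extension.

```python
import fnmatch
import os


def iter_word_documents(root, pattern='*.doc'):
    """Yield (doc_path, txt_path) pairs for matching files under root.

    Hypothetical helper for illustration; it only computes paths and
    never opens Word or touches the files themselves.
    """
    # os.walk visits root and every subdirectory without explicit recursion
    for path, dirs, files in os.walk(root):
        for filename in files:
            # skip anything that does not match the wildcard
            if not fnmatch.fnmatch(filename, pattern):
                continue
            # build a complete absolute path to the document
            doc = os.path.abspath(os.path.join(path, filename))
            # swap the extension for '.txt' to name the output file
            base, ext = os.path.splitext(doc)
            yield doc, base + '.txt'
```

Looping over iter_word_documents(sys.argv[1]) and printing each pair shows which files the Solution's script would convert, and where the text would be written, before Word is driven through COM.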