Recipe 12.11. Using MSHTML to Parse XML or HTML

Credit: Bill Bell

Problem

Your Python application, running on Windows, needs to use the Microsoft MSHTML COM component, which is also the parser that Microsoft Internet Explorer uses to parse HTML and XML web pages.

Solution

As usual, PyWin32 lets our Python code access COM quite simply:

from win32com.client import Dispatch

html = Dispatch('htmlfile')    # the disguise for MSHTML as a COM server
html.writeln(
    "<html><head><title>A title</title>"
    "<meta name='a name' content='page description'></head>"
    "<body>This is some of it. <span>And this is the rest.</span>"
    "</body></html>")
# print(...) with a single argument behaves the same on Python 2 and Python 3
print("Title: %s" % (html.title,))
print("Bag of words from body of the page: %s" % (html.body.innerText,))
print("URL associated with the page: %s" % (html.url,))
print("Display of name:content pairs from the metatags: ")
metas = html.getElementsByTagName("meta")
for m in range(metas.length):
    print("\t%s: %s" % (metas[m].name, metas[m].content))

Discussion

While Python offers many ways to parse HTML or XML, as long as you're running your programs only on Windows, MSHTML is very speedy and simple to use. As the recipe shows, you can simply use the writeln method of the COM object to feed the page into MSHTML and then you can use the methods and properties of the components to get at all kinds of aspects of the page's DOM. Of course, you can get the string of markup and text to feed into MSHTML in any way that suits your application, such as by using the Python Standard Library module urllib if you're getting a page from some URL.
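One way to get the markup from a URL and hand it to MSHTML might look like the following sketch. It targets Python 3, where urlopen lives in urllib.request. The helper names fetch_markup and parse_with_mshtml, and the UTF-8 fallback encoding, are assumptions made for illustration and are not part of the recipe. The win32com import is deferred so that the fetching half can be tried on any platform.

```python
from urllib.request import urlopen


def fetch_markup(url, fallback_encoding="utf-8"):
    """Return the resource at url, decoded to a text string.

    Hypothetical helper: the name and the fallback encoding are
    illustrative assumptions, not part of the recipe.
    """
    with urlopen(url) as response:
        raw = response.read()
        # Use the charset declared in the response headers when there is one.
        charset = response.headers.get_content_charset() or fallback_encoding
    # Replace undecodable bytes rather than failing on a sloppy page.
    return raw.decode(charset, "replace")


def parse_with_mshtml(markup):
    """Feed markup to MSHTML and return the document object (Windows only)."""
    # Imported here so that the module still loads where PyWin32 is absent.
    from win32com.client import Dispatch
    html = Dispatch('htmlfile')
    html.writeln(markup)
    return html
```

On Windows, something like parse_with_mshtml(fetch_markup(some_url)) should then give you the same kind of object the recipe works with, so html.title and html.body.innerText can be read as shown in the Solution.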

Since the enriched DOM that MSHTML makes available is quite complicated, I suggest you experiment with it in the PythonWin interactive environment that comes with PyWin32. The strength of PythonWin for such exploratory tasks is that it displays all of the properties and methods made available by each interface.
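Outside PythonWin, a small tree walker can give a first overview of a parsed page. The sketch below assumes that each node exposes the W3C DOM names nodeName and childNodes, and that childNodes offers length and item(); I would expect MSHTML nodes to behave this way, but the code has not been run against MSHTML here. The function name outline is hypothetical, and xml.dom.minidom serves only as a stand-in so the example runs on any platform.

```python
from xml.dom.minidom import parseString


def outline(node, depth=0, lines=None):
    """Collect one indented line per node, walking the tree depth first.

    Hypothetical helper: relies only on nodeName, childNodes,
    childNodes.length and childNodes.item(), as named in the W3C DOM.
    """
    if lines is None:
        lines = []
    lines.append("  " * depth + node.nodeName)
    children = node.childNodes
    for i in range(children.length):
        outline(children.item(i), depth + 1, lines)
    return lines


if __name__ == "__main__":
    # minidom stands in for MSHTML here; on Windows you could try
    # passing html.body from the recipe instead (an untested assumption).
    doc = parseString(
        "<html><head><title>A title</title></head>"
        "<body>This is some of it. <span>And this is the rest.</span>"
        "</body></html>")
    print("\n".join(outline(doc.documentElement)))
```

Returning a list of lines rather than printing directly keeps the walker easy to reuse, for example to search the outline for a tag name before digging into that part of the DOM interactively.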
