

Recipe 12.11. Using MSHTML to Parse XML or HTML

Credit: Bill Bell

Problem

Your Python application, running on Windows, needs to use the Microsoft MSHTML COM component, which is also the parser that Microsoft Internet Explorer uses to parse HTML and XML web pages.

Solution

As usual, PyWin32 lets our Python code access COM quite simply:

from win32com.client import Dispatch
html = Dispatch('htmlfile')    # the disguise for MSHTML as a COM server
html.writeln( "<html><head><title>A title</title>"
     "<meta name='a name' content='page description'></head>"
     "<body>This is some of it. <span>And this is the rest.</span>"
     "</body></html>" )
print "Title: %s" % (html.title,)
print "Bag of words from body of the page: %s" % (html.body.innerText,)
print "URL associated with the page: %s" % (html.url,)
print "Display of name:content pairs from the metatags: "
metas = html.getElementsByTagName("meta")
for m in xrange(metas.length):
    print "\t%s: %s" % (metas[m].name, metas[m].content,)

Discussion

While Python offers many ways to parse HTML or XML, as long as you're running your programs only on Windows, MSHTML is very speedy and simple to use. As the recipe shows, you can simply use the writeln method of the COM object to feed the page into MSHTML and then you can use the methods and properties of the components to get at all kinds of aspects of the page's DOM. Of course, you can get the string of markup and text to feed into MSHTML in any way that suits your application, such as by using the Python Standard Library module urllib if you're getting a page from some URL.
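For comparison, here is a minimal portable sketch of one of those other ways: it pulls the same title and meta name:content pairs out of the recipe's sample page using only the standard library. It is written for Python 3 (the recipe's own listing is Python 2), the class name PageSummary is made up for this illustration, and it does not touch MSHTML at all, so it says nothing about MSHTML's speed or DOM.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collect the <title> text and the name/content pairs of <meta> tags.

    PageSummary is a hypothetical helper for this sketch, not part of the recipe.
    """

    def __init__(self):
        super().__init__()
        self.title = ""
        self.metas = []            # list of (name, content) tuples
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes:
                self.metas.append(
                    (attributes["name"], attributes.get("content", "")))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text found between <title> and </title> is kept.
        if self._in_title:
            self.title += data


page = ("<html><head><title>A title</title>"
        "<meta name='a name' content='page description'></head>"
        "<body>This is some of it. <span>And this is the rest.</span>"
        "</body></html>")

summary = PageSummary()
summary.feed(page)
summary.close()
print("Title:", summary.title)
for name, content in summary.metas:
    print("\t%s: %s" % (name, content))
```

The event-driven parser needs more code than the MSHTML version because it builds no DOM; that difference is the convenience the recipe is pointing at.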

Since the DOM that MSHTML makes available is rich and complicated, I suggest you experiment with it in the PythonWin interactive environment that comes with PyWin32. The strength of PythonWin for such exploratory tasks is that it displays all of the properties and methods made available by each interface.

See Also

A detailed reference to MSHTML, albeit oriented to Visual Basic and C# users, can be found at http://www.xaml.net/articles/type.asp?o=MSHTML.


Chapter 13. Network Programming

Introduction

Recipe 13.1.  Passing Messages with Socket Datagrams

Recipe 13.2.  Grabbing a Document from the Web

Recipe 13.3.  Filtering a List of FTP Sites

Recipe 13.4.  Getting Time from a Server via the SNTP Protocol

Recipe 13.5.  Sending HTML Mail

Recipe 13.6.  Bundling Files in a MIME Message

Recipe 13.7.  Unpacking a Multipart MIME Message

Recipe 13.8.  Removing Attachments from an Email Message

Recipe 13.9.  Fixing Messages Parsed by Python 2.4 email.FeedParser

Recipe 13.10.  Inspecting a POP3 Mailbox Interactively

Recipe 13.11.  Detecting Inactive Computers

Recipe 13.12.  Monitoring a Network with HTTP

Recipe 13.13.  Forwarding and Redirecting Network Ports

Recipe 13.14.  Tunneling SSL Through a Proxy

Recipe 13.15.  Implementing the Dynamic IP Protocol

Recipe 13.16.  Connecting to IRC and Logging Messages to Disk

Recipe 13.17.  Accessing LDAP Servers


Introduction

Credit: Guido van Rossum, creator of Python

Network programming is one of my favorite Python applications. I wrote or started most of the network modules in the Python Standard Library, including the socket and select extension modules and most of the protocol client modules (such as ftplib). I also wrote a popular server framework module, SocketServer, and two web browsers in Python, the first predating Mosaic. Need I say more?

Python's roots lie in a distributed operating system, Amoeba, which I helped design and implement in the late 1980s. Python was originally intended to be the scripting language for Amoeba, since it turned out that the Unix shell, while ported to Amoeba, wasn't very useful for writing Amoeba system administration scripts. Of course, I designed Python to be platform independent from the start. Once Python was ported from Amoeba to Unix, I taught myself BSD socket programming by wrapping the socket primitives in a Python extension module and then experimenting with them using Python; this was one of the first extension modules.

This approach proved to be a great early testimony to Python's strengths. Writing socket code in C is tedious: the code necessary to do error checking on every call quickly overtakes the logic of the program. Quick: in which order should a server call accept, bind, connect, and listen? This is remarkably difficult to find out if all you have is a set of Unix manpages. In Python, you don't have to write separate error-handling code for each call, making the logic of the code stand out much more clearly. You can also learn about sockets by experimenting in an interactive Python shell, where misconceptions about the proper order of calls and the argument values that each call requires are cleared up quickly through Python's immediate error messages.
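To make the quiz concrete, here is a minimal sketch of a one-shot TCP exchange on the loopback interface. It is written for Python 3, and the helper name serve_once, the uppercase-echo behavior, and the use of port 0 (which asks the operating system for any free port) are choices made for this illustration. The server calls bind, then listen, then accept; connect is called only by the client.

```python
import socket
import threading


def serve_once(server_sock):
    """Accept a single connection and send back what arrives, uppercased."""
    conn, _addr = server_sock.accept()      # step 3: wait for one client
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())


server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.settimeout(5.0)                      # do not hang forever if no client shows up
server.bind(("127.0.0.1", 0))               # step 1: bind (port 0 = any free port)
server.listen(1)                            # step 2: listen
host, port = server.getsockname()           # find out which port was assigned

worker = threading.Thread(target=serve_once, args=(server,))
worker.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect((host, port))                # only the client calls connect
client.sendall(b"hello")
reply = client.recv(1024)
client.close()

worker.join()
server.close()
print(reply)
```

No return codes are checked anywhere in the sketch: if any call fails, Python raises an exception at that call, which is the point the paragraph above makes about error handling.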

Python has come a long way since those first days, and now few applications use the socket module directly; most use much higher-level modules such as urllib or smtplib, and third-party extensions such as the Twisted framework, whose popularity keeps growing. The examples in this chapter are a varied bunch: some construct and send complex email messages, while others dwell on lower-level issues such as tunneling. My favorite is Recipe 13.11, which implements PyHeartBeat: it's useful, it uses the socket module, and it's simple enough to be an educational example. I do note, with that mixture of pride and sadness that always accompanies a parent's observation of children growing up, that, since the Python Cookbook's first edition, even PyHeartBeat has acquired an alternative server implementation based on Twisted!

Nevertheless, my own baby, the socket module itself, is still the foundation of all network operations in Python. It's a plain transliteration of the socket APIs (first introduced in BSD Unix and now widespread on all platforms) into the object-oriented paradigm. You create socket objects by calling the socket.socket factory function, then you call methods on these objects to perform typical low-level network operations. You don't have to worry about allocating and freeing memory for buffers and the like; Python handles that for you automatically. You express IP addresses as (host, port) pairs, in which host is a string in either dotted-quad ('1.2.3.4') or domain-name ('www.python.org') notation. As you can see, even low-level modules in Python aren't as low level as all that.
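As a small illustration of the points above, the following sketch sends one datagram between two sockets on the same machine. It assumes Python 3 and the loopback address; the variable names and the b"ping" payload are invented for the example. Both sockets come from socket.socket, every address is a (host, port) pair, and no buffer is allocated or freed by hand.

```python
import socket

# A datagram (UDP) socket that will receive; port 0 asks the OS for a free port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(5.0)                 # fail with an exception instead of blocking forever
address = receiver.getsockname()         # a (host, port) pair, e.g. ('127.0.0.1', <some port>)

# A second socket that sends a single datagram to that address.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"ping", address)

# recvfrom returns the payload and the sender's own (host, port) pair.
payload, source = receiver.recvfrom(1024)

sender.close()
receiver.close()
print(payload, "from", source)
```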

Despite the various conveniences, the socket module still exposes the actual underlying functionality of your operating system's network sockets. If you're at all familiar with sockets, you'll quickly get the hang of Python's socket module, using Python's own Library Reference. You'll then be able to play with sockets interactively in Python to become a socket expert, if that is what you want. The classic, highly recommended work on this subject is W. Richard Stevens, UNIX Network Programming, Volume 1: Networking APIs - Sockets and XTI, 2d ed. (Prentice-Hall). For many practical uses, however, higher-level modules will serve you better.

The Internet uses a sometimes dazzling variety of protocols and formats, and the Python Standard Library supports many of them. In the Python Standard Library, you will find dozens of modules dedicated to supporting specific Internet protocols (such as smtplib to support the SMTP protocol to send mail and nntplib to support the Network News Transfer Protocol (NNTP) to send and receive Network News). In addition, you'll find about as many modules that support specific Internet formats (such as htmllib to parse HTML data, and the email package to parse and compose various formats related to email, including attachments and encoding).

I cannot even come close to doing justice to the powerful array of tools mentioned in this introduction, nor will you find all of these modules and packages used in this chapter, nor in this book, nor in most programming shops. You may never need to write any program that deals with Network News, for example; if that is the case, you don't need to study nntplib. But it is still reassuring to know it's there (part of the "batteries included" approach of the Python Standard Library).

Two higher-level modules that stand out from the crowd, however, are urllib and urllib2. Each of these two modules can deal with several protocols through the magic of URLs: those now-familiar strings, such as http://www.python.org/index.html, that identify a protocol (such as http), a host and port (such as www.python.org, port 80 being the default for the HTTP protocol), and a specific resource at that address (such as /index.html). urllib is very simple to use, but urllib2 is more powerful and extensible. HTTP is the most popular protocol for URLs, but these modules also support several others, such as FTP. In many cases, you'll be able to use these modules to write typical client-side scripts that interact with any of the supported protocols much more quickly and with less effort than it might take with the various protocol-specific modules.
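The anatomy of a URL described above can be checked with a short sketch. It assumes Python 3, where URL splitting lives in urllib.parse; the describe function and the DEFAULT_PORTS table are hypothetical helpers written for this example, since urlsplit itself reports no port when the URL does not spell one out.

```python
from urllib.parse import urlsplit

# Well-known default ports; this table is an assumption of the sketch,
# not something the standard library exposes under this name.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def describe(url):
    """Return (protocol, host, port, resource) for a URL."""
    parts = urlsplit(url)
    port = parts.port                        # None unless the URL names a port
    if port is None:
        port = DEFAULT_PORTS.get(parts.scheme)
    return parts.scheme, parts.hostname, port, parts.path


print(describe("http://www.python.org/index.html"))
# prints ('http', 'www.python.org', 80, '/index.html')
```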

To illustrate, I'd like to conclude with a cookbook example of my own. It's similar to Recipe 13.2, but, rather than a program fragment, it's a little script. I call it wget.py because it does everything for which I've ever needed wget. (In fact, I originally wrote this script on a system where wget wasn't installed but Python was; writing wget.py was a more effective use of my time than downloading and installing the real thing.)

import sys, urllib
def reporthook(*a): print a
for url in sys.argv[1:]:
    i = url.rfind('/')
    file = url[i+1:]
    print url, "->", file
    urllib.urlretrieve(url, file, reporthook)

Pass this script one or more URLs as command-line arguments; the script retrieves them into local files whose names match the last components of the URLs. The script also prints progress information of the form:

(block number, block size, total size)

Obviously, it's easy to improve on this script; but it's only seven lines, it's readable, and it works, and that's what's so cool about Python.
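As one possible improvement, here is a sketch of two small refinements, assuming Python 3 (where urlretrieve lives in urllib.request). The helper names local_name, percent_done, and fetch, and the index.html fallback, are inventions of this sketch rather than part of the script above. The first refinement handles URLs that end in a slash, for which the original script computes an empty file name; the second turns the progress tuple into a percentage.

```python
import posixpath
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def local_name(url, default="index.html"):
    """Return the last path component of url, or default if it is empty."""
    name = posixpath.basename(urlsplit(url).path)
    return name or default


def percent_done(block_number, block_size, total_size):
    """Turn the three reporthook arguments into a percentage (None if unknown)."""
    if total_size <= 0:          # the total size is not always known
        return None
    # The last block is usually only partly filled, so cap the result at 100.
    return min(100, block_number * block_size * 100 // total_size)


def reporthook(block_number, block_size, total_size):
    percent = percent_done(block_number, block_size, total_size)
    print("size unknown" if percent is None else "%3d%%" % percent)


def fetch(url):
    """Retrieve url into the current directory and return the local file name."""
    name = local_name(url)
    print(url, "->", name)
    urlretrieve(url, name, reporthook)
    return name
```

Calling fetch(url) for each entry of sys.argv[1:] gives roughly the behavior of the original script; nothing is downloaded until fetch is called.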

Another cool thing about Python is that you can incrementally improve a program like this, and after it's grown by two or three orders of magnitude, it's still readable, and it still works! To see what this particular example might evolve into, check out Tools/webchecker/websucker.py in the Python source distribution. Enjoy!