Recipe 12.11. Using MSHTML to Parse XML or HTMLCredit: Bill Bell ProblemYour Python application, running on Windows, needs to use the Microsoft MSHTML COM component, which is also the parser that Microsoft Internet Explorer uses to parse HTML and XML web pages. SolutionAs usual, PyWin32 lets our Python code access COM quite simply:
from win32com.client import Dispatch
html = Dispatch('htmlfile') # the disguise for MSHTML as a COM server
html.writeln( "<html><header><title>A title</title>"
"<meta name='a name' content='page description'></header>"
"<body>This is some of it. <span>And this is the rest.</span>"
"</body></html>" )
print "Title: %s" % (html.title,)
print "Bag of words from body of the page: %s" % (html.body.innerText,)
print "URL associated with the page: %s" % (html.url,)
print "Display of name:content pairs from the metatags: "
metas = html.getElementsByTagName("meta")
for m in xrange(metas.length):
print "\t%s: %s" % (metas[m].name, metas[m].content,)
Discussion
While Python offers many ways to parse HTML or XML, as long as you're running your programs only on Windows, MSHTML is very speedy and simple to use. As the recipe shows, you can simply use the
writeln
method of the COM object to feed the page into MSHTML and then you can use the
Since the structure of the enriched DOM that MSHTML makes available is quite rich and complicated, I suggest you experiment with it in the PythonWin interactive environment that comes with PyWin32. The strength of PythonWin for such exploratory
See AlsoA detailed reference to MSHTML, albeit oriented to Visual Basic and C# users, can be found at http://www.xaml.net/articles/type.asp?o=MSHTML. |
Chapter 13. Network Programming
Introduction Recipe 13.1. Passing Messages with Socket Datagrams Recipe 13.2. Grabbing a Document from the Web Recipe 13.3. Filtering a List of FTP Sites Recipe 13.4. Getting Time from a Server via the SNTP Protocol Recipe 13.5. Sending HTML Mail Recipe 13.6. Bundling Files in a MIME Message Recipe 13.7. Unpacking a Multipart MIME Message Recipe 13.8. Removing Attachments from an Email Message Recipe 13.9. Fixing Messages Parsed by Python 2.4 email.FeedParser Recipe 13.10. Inspecting a POP3 Mailbox Interactively Recipe 13.11. Detecting Inactive Computers Recipe 13.12. Monitoring a Network with HTTP Recipe 13.13. Forwarding and Redirecting Network Ports Recipe 13.14. Tunneling SSL Through a Proxy Recipe 13.15. Implementing the Dynamic IP Protocol Recipe 13.16. Connecting to IRC and Logging Messages to Disk Recipe 13.17. Accessing LDAP Servers |
IntroductionCredit: Guido van Rossum, creator of Python
Network programming is one of my favorite Python applications. I wrote or started most of the network modules in the Python Standard Library, including the
socket
and
select
extension modules and most of the protocol client modules (such as
Python's roots lie in a distributed operating system, Amoeba, which I helped design and implement in the late 1980s. Python was originally intended to be the scripting language for Amoeba, since it turned out that the Unix shell, while ported to Amoeba, wasn't very useful for writing Amoeba system administration scripts. Of course, I designed Python to be platform independent from the start. Once Python was ported from Amoeba to Unix, I taught
This approach proved to be a great early testimony of Python's strengths. Writing socket code in C is
Python has come a long way since those first days, and now few applications use the
socket
module directly; most use much higher-level modules such as
urllib
or
smtplib
, and third-party extensions such as the Twisted framework, whose popularity keeps growing. The examples in this chapter are a varied bunch: some construct and send complex email messages, while others dwell on lower-level issues such as tunneling. My favorite is Recipe 13.11, which implements
PyHeartBeat
: it's useful, it uses the
socket
module, and it's simple enough to be an educational example. I do note, with that mixture of
Nevertheless, my own baby, the
socket
module itself, is still the foundation of all network operations in Python. It's a plain transliteration of the socket APIsfirst introduced in BSD Unix and now widespread on all platformsinto the object-oriented paradigm. You create socket objects by calling the
socket.socket
factory function, then you call
Despite the various
The Internet uses a sometimes dazzling variety of protocols and formats, and the Python Standard Library supports many of them. In the Python Standard Library, you will find dozens of modules dedicated to supporting specific Internet protocols (such as
smtplib
to support the SMTP protocol to send mail and
I cannot even come close to doing
Two higher-level modules that stand out from the
To
import sys, urllib
def reporthook(*a): print a
for url in sys.argv[1:]:
i = url.rfind('/')
file = url[i+1:]
print url, "->", file
urllib.urlretrieve(url, file, reporthook)
Pass this script one or more URLs as command-line arguments; the script retrieves them into local files whose
(block number, block size, total size) Obviously, it's easy to improve on this script; but it's only seven lines, it's readable, and it worksand that's what's so cool about Python.
Another cool thing about Python is that you can incrementally improve a program like this, and after it's grown by two or three orders of magnitude, it's still readable, and it still works! To see what this particular example might
|