27.6 Internet-Related Modules

Python is used in a wide variety of Internet-related tasks, from building web servers to crawling the Web to "screen-scraping" web sites for data. This section briefly describes the modules most often used for such tasks that ship with Python's core. For more detailed examples of their use, we recommend Lundh's Python Standard Library and Martelli and Ascher's Python Cookbook (O'Reilly). There are also many third-party add-ons worth knowing about before embarking on a significant web- or Internet-related project.

27.6.1 The Common Gateway Interface: The cgi Module

Python programs often process forms from web pages. To make this task easy, the standard Python distribution includes a module called cgi. Chapter 28 includes an example of a Python script that uses the CGI, so we won't cover it any further here.

27.6.2 Manipulating URLs: The urllib and urlparse Modules

Uniform resource locators (URLs) are strings such as http://www.python.org that are now ubiquitous. Three modules (urllib, urllib2, and urlparse) provide tools for processing URLs.

The urllib module defines a few functions for writing programs that must be active users of the Web (robots, agents, etc.). These are listed in Table 27-9.

Table 27-9. Functions of the urllib module
Function name	Behavior
`urlopen(url [, data])`	Opens (for reading) a network object denoted by a URL; it can also open local files: >>> page = urlopen('http://www.python.org') >>> page.readline( ) '<HTML>\012' >>> page.readline( ) '<!-- THIS PAGE IS AUTOMATICALLY GENERATED.DO NOT EDIT. -->\012'
`urlretrieve(url [, filename][, hook])`	Copies a network object denoted by a URL to a local file (uses a cache): >>> urllib.urlretrieve('http://www.python.org/', 'wwwpython.html')
`urlcleanup( )`	Cleans up the cache used by `urlretrieve`.
`quote(string[, safe])`	Replaces special characters in string using the `%xx` escape. The optional safe parameter specifies additional characters that shouldn't be quoted; its default value is '/': >>> quote('this & that @ home') 'this%20%26%20that%20%40%20home'
`quote_plus(string[, safe])`	Like `quote( )`, but also replaces spaces by plus signs.
`unquote(string)`	Replaces `%xx` escapes by their single-character equivalent: >>> unquote('this%20%26%20that%20%40%20home') 'this & that @ home'
`urlencode(dict)`	Converts a dictionary to a URL-encoded string, suitable to pass to `urlopen( )` as the optional data argument: >>> locals( ) {'urllib': <module 'urllib'>, '__doc__': None, 'x': 3, '__name__': '__main__', '__builtins__': <module '__builtin__'>} >>> urllib.urlencode(locals( )) 'urllib=%3cmodule+%27urllib%27%3e&__doc__=None&x=3&__name__=__main__&__builtins__=%3cmodule+%27__builtin__%27%3e'
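
The quoting functions in Table 27-9 can be tried without any network access. The following is a minimal sketch, not a definitive recipe: the `fields` dictionary is a hypothetical form submission, and the fallback import assumes that on Python 3 the same functions are found in `urllib.parse` rather than `urllib`.

```python
# The names below live in urllib on Python 2 (as in the table above);
# the fallback assumes Python 3, where they moved to urllib.parse.
try:
    from urllib import quote, quote_plus, unquote, urlencode
except ImportError:
    from urllib.parse import quote, quote_plus, unquote, urlencode

# Hypothetical form fields, used only for illustration.
fields = {'q': 'this & that'}

escaped = quote('this & that @ home')      # spaces become %20
plus_escaped = quote_plus('this & that')   # spaces become +
restored = unquote(escaped)                # undoes the %xx escapes
query = urlencode(fields)                  # key=value pairs, values quoted

print(escaped)
print(plus_escaped)
print(restored)
print(query)
```

The string produced by `urlencode` is the kind of value that can be passed to `urlopen( )` as its optional data argument.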

The module urllib2 focuses on opening URLs that the simpler urllib doesn't know how to deal with, and provides an extensible framework for new kinds of URLs and protocols. Use it when you need to handle passwords, digest authentication, proxies, HTTPS URLs, and other less common kinds of URLs.
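
As a rough illustration of that extensible framework, the sketch below assembles an opener that would supply a password when a server asks for basic authentication. The URL, username, and password are hypothetical, nothing is actually fetched, and the fallback import assumes that on Python 3 these classes are found in `urllib.request`.

```python
try:
    import urllib2 as request_lib          # Python 2 name used in the text
except ImportError:
    import urllib.request as request_lib   # assumed Python 3 location

# Hypothetical URL and credentials, for illustration only.
url = 'http://www.example.com/private/'

# Register the credentials for that URL, whatever realm the server names.
passwords = request_lib.HTTPPasswordMgrWithDefaultRealm()
passwords.add_password(None, url, 'user', 'secret')

# Build an opener whose handler chain includes basic authentication.
auth_handler = request_lib.HTTPBasicAuthHandler(passwords)
opener = request_lib.build_opener(auth_handler)

request = request_lib.Request(url)
# opener.open(request) would perform the fetch; it is deliberately
# not called here so that the sketch runs without a network.
print(request.get_full_url())
```

Other handlers (for proxies or digest authentication, for example) can be passed to `build_opener` in the same way.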

The module urlparse defines a few functions that simplify taking URLs apart and putting new URLs together. These are listed in Table 27-10.

Table 27-10. Functions of the urlparse module
Function name	Behavior
`urlparse(urlstring[, default_scheme[, allow_fragments]])`	Parses a URL into six components, returning a six-tuple (addressing scheme, network location, path, parameters, query, fragment identifier): >>> urlparse('http://www.python.org/FAQ.html') ('http', 'www.python.org', '/FAQ.html', '', '', '')
`urlunparse(tuple)`	Constructs a URL string from a tuple as returned by `urlparse( )`
`urljoin(base, url[, allow_fragments])`	Constructs a full (absolute) URL by combining a base URL (`base`) with a relative URL (`url`): >>> urljoin('http://www.python.org', 'doc/lib') 'http://www.python.org/doc/lib'
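
The three functions in Table 27-10 fit together naturally: parse a URL, change one component, and put it back together. This is only a sketch; the query string and fragment in the sample URL are made up for illustration, and the fallback import assumes that on Python 3 the functions are found in `urllib.parse`.

```python
try:
    from urlparse import urlparse, urlunparse, urljoin       # Python 2
except ImportError:
    from urllib.parse import urlparse, urlunparse, urljoin   # assumed Python 3 location

# Take a URL apart; the query and fragment here are illustrative only.
parts = urlparse('http://www.python.org/doc/lib?x=1#top')
scheme, netloc, path, params, query, fragment = parts

# Reuse the scheme and host, but point at a different path.
rebuilt = urlunparse((scheme, netloc, '/FAQ.html', '', '', ''))

# Resolve a relative link against a base URL.
joined = urljoin('http://www.python.org/doc/', 'lib/index.html')

print(tuple(parts))
print(rebuilt)
print(joined)
```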

27.6.3 Specific Internet Protocols

The most commonly used protocols built on top of TCP/IP are supported by modules named after them. The telnetlib module lets you act like a Telnet client. The httplib module lets you talk to web servers using HTTP. The ftplib module is for transferring files using FTP. The gopherlib module is for browsing Gopher servers (now fairly rare). In the domains of mail and news, you can use the poplib and imaplib modules for reading mail on POP3 and IMAP servers, respectively, the smtplib module for sending mail, and the nntplib module for reading and posting Usenet news on NNTP servers.

There are also modules for building Internet servers, specifically a generic socket-based IP server (SocketServer), a simple web server (SimpleHTTPServer), a CGI-compliant HTTP server (CGIHTTPServer), and a module for building asynchronous socket-handling services (asyncore).
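
To give a feel for the generic server module, here is a minimal sketch of a one-shot echo server that listens on a free local port, answers a single client, and shuts down. The `EchoHandler` class and `echo_once` function are hypothetical names introduced for this example, and the fallback import assumes the module is spelled `socketserver` on Python 3.

```python
import socket
import threading

try:
    import SocketServer as socketserver   # Python 2 name used in the text
except ImportError:
    import socketserver                   # assumed Python 3 spelling


class EchoHandler(socketserver.StreamRequestHandler):
    """Hypothetical handler: sends one received line back unchanged."""

    def handle(self):
        line = self.rfile.readline()
        self.wfile.write(line)


def echo_once(message):
    """Start a throwaway local server, send it one line, return the reply."""
    # Port 0 asks the operating system for any free port.
    server = socketserver.TCPServer(('127.0.0.1', 0), EchoHandler)
    worker = threading.Thread(target=server.handle_request)
    worker.start()
    client = socket.create_connection(server.server_address)
    try:
        client.sendall(message)
        reply = client.makefile('rb').readline()
    finally:
        client.close()
        worker.join()
        server.server_close()
    return reply


print(echo_once(b'hello\n'))
```

A long-running service would call `serve_forever( )` instead of `handle_request( )`; the single-request form is used here only so the example terminates on its own.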

Support for web services currently consists of a core library to process XML-RPC client-side calls (xmlrpclib), as well as a simple XML-RPC server implementation (SimpleXMLRPCServer). Support for SOAP is likely to be added when the SOAP standard becomes more stable.
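
The marshalling half of the XML-RPC client library can be exercised without a server. The sketch below encodes a call as XML and decodes it again; the method name `add` is hypothetical, and the fallback import assumes the module is available as `xmlrpc.client` on Python 3.

```python
try:
    import xmlrpclib                      # Python 2 name used in the text
except ImportError:
    import xmlrpc.client as xmlrpclib     # assumed Python 3 location

# Encode a call to a hypothetical remote method 'add' with two arguments.
payload = xmlrpclib.dumps((2, 3), methodname='add')

# Decode the XML again, as a server would on receipt.
params, method = xmlrpclib.loads(payload)

print(payload)
print(method)
print(params)
```

In normal use, these steps are hidden behind a `ServerProxy` object, which sends the encoded call over HTTP and decodes the response.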

27.6.4 Processing Internet Data

Once you use an Internet protocol to obtain files from the Internet (or before you serve them to the Internet), you often must process these files. They come in many different formats. Table 27-11 lists each module in the standard library that processes a specific kind of Internet-related file format (there are others for sound and image format processing; see the library reference manual).

Table 27-11. Modules dedicated to Internet file processing
Module name	File format
`sgmllib`	A simple parser for SGML files.
`htmllib`	A parser for HTML documents.
`formatter`	Generic output formatter and device interface.
`rfc822`	Parse RFC-822 mail headers (e.g., "Subject: hi there!").
`mimetools`	Tools for parsing MIME-style message bodies (a.k.a. file attachments).
`multifile`	Support for reading files that contain distinct parts.
`binhex`	Encode and decode files in binhex4 format.
`uu`	Encode and decode files in uuencode format.
`binascii`	Convert between binary and various ASCII-encoded representations.
`xdrlib`	Encode and decode XDR data.
`mailcap`	Mailcap file handling.
`mimetypes`	Mapping of filename extensions to MIME types.
`base64`	Encode and decode MIME base64 encoding.
`quopri`	Encode and decode MIME quoted-printable encoding.
`mailbox`	Read various mailbox formats.
`mimify`	Convert mail messages to and from MIME format.
`email`	A package for parsing, handling, and generating email messages.
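
A few of the modules in Table 27-11 can be sampled in a handful of lines. This is a sketch under stated assumptions rather than a complete tour: the filename and message text are invented for illustration, and the `base64.b64encode` and `b64decode` function names assume a Python release that provides them.

```python
import base64
import email
import mimetypes
import quopri

# base64: encode binary data as ASCII text and back again.
encoded = base64.b64encode(b'hello')
decoded = base64.b64decode(encoded)

# quopri: quoted-printable escapes only the bytes that need it.
qp = quopri.encodestring(b'caf\xe9')
qp_decoded = quopri.decodestring(qp)

# mimetypes: guess a MIME type from a (hypothetical) filename.
guess = mimetypes.guess_type('index.html')

# email: parse a small, made-up message and read one header.
msg = email.message_from_string('Subject: hi there!\n\nBody text\n')
subject = msg['Subject']

print(encoded)
print(guess)
print(subject)
```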

27.6.5 XML Processing

Python comes with a rich set of XML-processing tools. These include parsers, DOM interfaces, SAX interfaces, and more, as shown in Table 27-12.

Table 27-12. Some of the XML modules in the core distribution
Module name	Description
`xml.parsers.expat`	An interface to the Expat nonvalidating XML parser
`xml.dom`	Document Object Model (DOM) API for Python
`xml.dom.minidom`	Lightweight DOM implementation
`xml.dom.pulldom`	Support for building partial DOM trees from SAX events
`xml.sax`	Package containing SAX2 base classes and convenience functions
`xml.sax.handler`	Base classes for SAX event handlers.
`xml.sax.saxutils`	Convenience functions and classes for use with SAX.
`xml.sax.xmlreader`	Interface that SAX-compliant XML parsers must implement.
`xmllib`	A parser for XML documents.
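
To contrast the two main styles in Table 27-12, the sketch below reads the same small document once with the lightweight DOM and once with a SAX handler. The XML document and the `TitleCollector` class are invented for this example.

```python
import xml.dom.minidom
import xml.sax
import xml.sax.handler

# A made-up document, used only for illustration.
document = ('<books>'
            '<book title="Python Cookbook"/>'
            '<book title="Python and XML"/>'
            '</books>')

# DOM style: build the whole tree in memory, then query it.
dom = xml.dom.minidom.parseString(document)
dom_titles = [node.getAttribute('title')
              for node in dom.getElementsByTagName('book')]


# SAX style: react to events as the parser walks the document.
class TitleCollector(xml.sax.handler.ContentHandler):
    """Hypothetical handler that records each book's title attribute."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.titles = []

    def startElement(self, name, attrs):
        if name == 'book':
            self.titles.append(attrs.getValue('title'))


collector = TitleCollector()
xml.sax.parseString(document.encode('utf-8'), collector)

print(dom_titles)
print(collector.titles)
```

The DOM version is shorter for small inputs; the SAX version never holds the whole document in memory, which matters for large files.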

See the standard library reference for details, or the Python Cookbook (O'Reilly) for example tasks easily solved using the standard XML libraries. The XML facilities are developed by the XML Special Interest Group, which publishes versions of the XML package in between Python releases. See http://www.python.org/topics/xml for details and the latest version of the code. For expanded coverage, consider Python and XML, by Christopher A. Jones and Fred L. Drake, Jr. (O'Reilly).
