Recipe 14.4. Checking for a Web Page's Existence

Credit: James Thiele, Rogier Steehouder

Problem

You want to check whether an HTTP URL corresponds to an existing web page.

Solution

With httplib, you can check for a page's existence without downloading the page itself: you request just its headers. Here's a module implementing a function to perform this task:

"""
httpExists.py

A quick and dirty way to check whether a web file is there.

Usage:
>>> import httpExists
>>> httpExists.httpExists('http://www.python.org/')
True
>>> httpExists.httpExists('http://www.python.org/PenguinOnTheTelly')
Status 404 Not Found : http://www.python.org/PenguinOnTheTelly
False
"""
import httplib, urlparse

def httpExists(url):
    host, path = urlparse.urlsplit(url)[1:3]
    if ':' in host:
        # port specified, try to use it
        host, port = host.split(':', 1)
        try:
            port = int(port)
        except ValueError:
            print 'invalid port number %r' % (port,)
            return False
    else:
        # no port specified, use default port
        port = None
    try:
        connection = httplib.HTTPConnection(host, port=port)
        connection.request("HEAD", path)
        resp = connection.getresponse()
        if resp.status == 200:       # normal 'found' status
            found = True
        elif resp.status == 302:     # recurse on temporary redirect
            found = httpExists(urlparse.urljoin(url,
                               resp.getheader('location', '')))
        else:                        # everything else -> not found
            print "Status %d %s : %s" % (resp.status, resp.reason, url)
            found = False
    except Exception, e:
        print e.__class__, e, url
        found = False
    return found

def _test():
    import doctest, httpExists
    return doctest.testmod(httpExists)

if __name__ == "__main__":
    _test()
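The module above targets Python 2. On Python 3, httplib and urlparse are available as http.client and urllib.parse, so the same HEAD-based check might be sketched as follows. This is only one possible port, not the authors' code: the name http_exists, the max_redirects limit, the timeout value, and the fallback to "/" for an empty path are all assumptions added for illustration.

```python
import http.client
import urllib.parse


def http_exists(url, max_redirects=5, timeout=10):
    """Return True if a HEAD request for url ends in a 200 response."""
    parts = urllib.parse.urlsplit(url)
    if parts.scheme != "http" or not parts.hostname:
        return False  # like the recipe, only plain HTTP URLs are handled
    try:
        port = parts.port  # None means default port; a bad port raises ValueError
    except ValueError:
        return False
    # Assumption: an empty path is sent as "/", and any query string is kept.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        connection = http.client.HTTPConnection(parts.hostname, port,
                                                timeout=timeout)
        try:
            connection.request("HEAD", target)
            response = connection.getresponse()
            status = response.status
            location = response.getheader("Location", "")
        finally:
            connection.close()
    except (OSError, ValueError, http.client.HTTPException):
        return False  # unreachable host, malformed request, bad response, ...
    if status == 200:
        return True
    if status == 302 and location and max_redirects > 0:
        # Follow a temporary redirect, but only a bounded number of times.
        return http_exists(urllib.parse.urljoin(url, location),
                           max_redirects - 1, timeout)
    return False
```

Bounding the number of redirects is a design choice of this sketch: it keeps two pages that redirect to each other from causing unbounded recursion.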

Discussion

This recipe is very simple and runs quite fast, thanks to the HTTP command HEAD, which fetches just the headers, not the body, of the page. It may, however, be too simplistic for your specific needs: the HTTP status codes you have to deal with may go beyond 200 (success) and 302 (temporary redirect) to include permanent redirects, temporary inaccessibility, permission problems, and so on.
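One way to extend the check along these lines is to map each status code to a coarse verdict before deciding what to report. The sketch below is an illustration only: the function name classify_status and the verdict strings are hypothetical, and which codes deserve their own category depends on the site being checked.

```python
from http import HTTPStatus


def classify_status(status):
    """Map an HTTP status code to a coarse link-checking verdict."""
    if status == HTTPStatus.OK:                                   # 200
        return "found"
    if status in (HTTPStatus.MOVED_PERMANENTLY,                   # 301
                  HTTPStatus.PERMANENT_REDIRECT):                 # 308
        return "moved permanently"
    if status in (HTTPStatus.FOUND,                               # 302
                  HTTPStatus.SEE_OTHER,                           # 303
                  HTTPStatus.TEMPORARY_REDIRECT):                 # 307
        return "moved temporarily"
    if status in (HTTPStatus.UNAUTHORIZED,                        # 401
                  HTTPStatus.FORBIDDEN):                          # 403
        return "access denied"
    if status == HTTPStatus.SERVICE_UNAVAILABLE:                  # 503
        return "temporarily unavailable"
    if status in (HTTPStatus.NOT_FOUND, HTTPStatus.GONE):         # 404, 410
        return "missing"
    return "other"
```

A link checker could then treat "moved permanently" as a link worth updating, and "temporarily unavailable" as a reason to retry later rather than as a broken link.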

In my case, I needed to check the correctness of a huge number of mutual links among pages of a site generated by a complex web application on an intranet, so I knew I could rely on a simple check for "200 or bust." At any rate, you can use this simple recipe as a starting point and add whatever refinements you find you actually need.
