Recipe 14.4. Checking for a Web Page's Existence

Credit: James Thiele, Rogier Steehouder

Problem

You want to check whether an HTTP URL corresponds to an existing web page.

Solution

Using httplib allows you to easily check for a page's existence without actually downloading the page itself, just its headers. Here's a module implementing a function to perform this task:

    """
    httpExists.py

    A quick and dirty way to check whether a web file is there.

    Usage:
    >>> import httpExists
    >>> httpExists.httpExists('http://www.python.org/')
    True
    >>> httpExists.httpExists('http://www.python.org/PenguinOnTheTelly')
    Status 404 Not Found : http://www.python.org/PenguinOnTheTelly
    False
    """
    import httplib, urlparse

    def httpExists(url):
        host, path = urlparse.urlsplit(url)[1:3]
        if ':' in host:
            # port specified, try to use it
            host, port = host.split(':', 1)
            try:
                port = int(port)
            except ValueError:
                print 'invalid port number %r' % (port,)
                return False
        else:
            # no port specified, use default port
            port = None
        try:
            connection = httplib.HTTPConnection(host, port=port)
            connection.request("HEAD", path)
            resp = connection.getresponse()
            if resp.status == 200:
                # normal 'found' status
                found = True
            elif resp.status == 302:
                # recurse on temporary redirect
                found = httpExists(urlparse.urljoin(url,
                                   resp.getheader('location', '')))
            else:
                # everything else -> not found
                print "Status %d %s : %s" % (resp.status, resp.reason, url)
                found = False
        except Exception, e:
            print e.__class__, e, url
            found = False
        return found

    def _test():
        import doctest, httpExists
        return doctest.testmod(httpExists)

    if __name__ == "__main__":
        _test()

Discussion

While this recipe is very simple and runs quite fast (thanks to the ability to use the HTTP command HEAD to get just the headers, not the body, of the page), it may be too simplistic for your specific needs: the HTTP result codes you might need to deal with may go beyond the simple 200 success code, and 302 temporary redirect, to include permanent redirects, temporary
inaccessibility, permission problems, and so on. In my case, I needed to check the correctness of a huge number of mutual links among pages of a site generated by a complex web application on an intranet, so I knew I had the privilege of relying on a simple check for "200 or bust." At any rate, you can use this simple recipe as a starting point and add whatever refinements you determine you actually need.

See Also

Documentation on the urlparse and httplib standard library modules in the Library Reference and Python in a Nutshell.
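
The refinements mentioned in the Discussion (permanent redirects, temporary inaccessibility, permission problems) might be layered onto the recipe roughly as follows. This is a minimal sketch, not the recipe authors' code. It assumes Python 3, where httplib and urlparse are named http.client and urllib.parse. The names classify_status, http_exists, REDIRECT_CODES, CONNECTION_CLASSES, and MAX_REDIRECTS are hypothetical, and the status groupings, the 10-second timeout, the redirect limit of 5, and the support for https URLs are illustrative choices rather than anything the recipe prescribes.

```python
import http.client
import urllib.parse

# Assumed groupings; adjust them to what "exists" should mean for your site.
REDIRECT_CODES = {301, 302, 303, 307, 308}
MAX_REDIRECTS = 5  # assumed cap that guards against redirect loops

# Assumption: https is handled too, since redirects to https are common.
CONNECTION_CLASSES = {
    'http': http.client.HTTPConnection,
    'https': http.client.HTTPSConnection,
}


def classify_status(status):
    """Map an HTTP status code to a coarse category name."""
    if 200 <= status < 300:
        return 'found'
    if status in REDIRECT_CODES:
        return 'redirect'
    if status in (401, 403):
        return 'denied'        # permission problems
    if status == 503:
        return 'unavailable'   # temporary inaccessibility
    return 'missing'


def http_exists(url, redirects_left=MAX_REDIRECTS):
    """Return True if a HEAD request for url, after redirects, yields 2xx."""
    try:
        parts = urllib.parse.urlsplit(url)
        conn_cls = CONNECTION_CLASSES.get(parts.scheme)
        if conn_cls is None or not parts.hostname:
            return False
        # Send '/' for an empty path and keep any query string.
        path = parts.path or '/'
        if parts.query:
            path += '?' + parts.query
        # parts.port raises ValueError when the port is not a valid number.
        connection = conn_cls(parts.hostname, parts.port, timeout=10)
        try:
            connection.request('HEAD', path)
            resp = connection.getresponse()
            status = resp.status
            location = resp.getheader('Location', '')
        finally:
            connection.close()
    except (OSError, ValueError, http.client.HTTPException):
        return False
    kind = classify_status(status)
    if kind == 'redirect' and location and redirects_left > 0:
        # Resolve a relative Location against the current URL and retry.
        target = urllib.parse.urljoin(url, location)
        return http_exists(target, redirects_left - 1)
    return kind == 'found'
```

Keeping the classification in its own function means a link checker could report 'denied' or 'unavailable' pages separately from missing ones instead of collapsing everything into True or False, and the redirect counter stops two pages that redirect to each other from recursing forever.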