Parsing URLs | Python Phrasebook

import urlparse
parsedTuple = urlparse.urlparse(
    "http://www.google.com/search?hl=en&q=urlparse&btnG=Google+Search")
unparsedURL = urlparse.urlunparse((URLscheme,
    URLlocation, URLpath, '', '', ''))
newURL = urlparse.urljoin(unparsedURL,
    "/module-urllib2/request-objects.html")

The urlparse module included with Python makes it easy to break down URLs into specific components and reassemble them. This is useful when processing HTML documents, for example when resolving the relative links that a page contains.

The urlparse(urlstring [, default_scheme [, allow_fragments]]) function takes the URL provided in urlstring and returns the tuple (scheme, netloc, path, parameters, query, fragment). The tuple can then be used to determine the scheme (HTTP, FTP, and so on), the server address, the file path, and the other components of the URL.
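As a minimal sketch of the tuple described above, the following unpacks all six components of a URL. The example URL is hypothetical and was chosen only so that every component is non-empty. The import fallback is an assumption that goes beyond the text: in Python 3 the same function is found in urllib.parse, while the urlparse module named here is the Python 2 location.

```python
try:
    from urllib.parse import urlparse  # Python 3 location (not covered in the text)
except ImportError:
    from urlparse import urlparse      # Python 2 module described in the text

# Hypothetical URL that exercises all six components
url = "http://www.example.com/docs/index.html;type=a?lang=en&page=2#intro"

# The result unpacks like a plain six-element tuple
scheme, netloc, path, parameters, query, fragment = urlparse(url)

print(scheme)      # http
print(netloc)      # www.example.com
print(path)        # /docs/index.html
print(parameters)  # type=a
print(query)       # lang=en&page=2
print(fragment)    # intro
```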

The urlunparse(tuple) function accepts the tuple (scheme, netloc, path, parameters, query, fragment) and reassembles it into a properly formatted URL that can be used by the other HTML parsing modules included with Python.
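The reassembly step can be sketched as a round trip: build a URL from a six-element tuple, then parse it again and confirm that the same components come back. The component values are hypothetical, and the import fallback assumes that Python 3 may be in use, where these functions are found in urllib.parse.

```python
try:
    from urllib.parse import urlparse, urlunparse  # Python 3 location (not covered in the text)
except ImportError:
    from urlparse import urlparse, urlunparse      # Python 2 module described in the text

# Hypothetical components: (scheme, netloc, path, parameters, query, fragment)
components = ("http", "www.example.com", "/docs/index.html", "", "lang=en", "top")

rebuilt = urlunparse(components)
print(rebuilt)  # http://www.example.com/docs/index.html?lang=en#top

# Parsing the rebuilt URL yields the original components again
roundtrip = tuple(urlparse(rebuilt))
print(roundtrip == components)  # True
```

Empty strings stand in for components that are not needed, as in the parameters slot here.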

The urljoin(base, url [, allow_fragments]) function accepts a base URL as the first argument and then joins whatever relative URL is specified in the second argument. The urljoin function is extremely useful in processing several files in the same location by joining new filenames to the existing base URL location.

Note

If the relative path does not start with the root (/) character, the rightmost segment of the base URL path is replaced with the relative path. For example, a base URL of http://www.testpage.com/pub and a relative URL of test.html join to form the URL http://www.testpage.com/test.html, not http://www.testpage.com/pub/test.html. If you want to keep the end directory in the path, make sure to end the base URL string with a / character.
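The behavior in this note can be checked with a short sketch that joins the same relative URL to a base with and without a trailing / character, and then joins a root-relative path for comparison. The variable names are illustrative only, and the import fallback assumes that Python 3 may be in use, where urljoin is found in urllib.parse.

```python
try:
    from urllib.parse import urljoin  # Python 3 location (not covered in the text)
except ImportError:
    from urlparse import urljoin      # Python 2 module described in the text

# No trailing slash: the last path segment ("pub") is replaced
without_slash = urljoin("http://www.testpage.com/pub", "test.html")
print(without_slash)  # http://www.testpage.com/test.html

# Trailing slash: "pub/" is treated as a directory and is kept
with_slash = urljoin("http://www.testpage.com/pub/", "test.html")
print(with_slash)     # http://www.testpage.com/pub/test.html

# A relative URL starting with "/" replaces the whole base path
rooted = urljoin("http://www.testpage.com/pub/", "/test.html")
print(rooted)         # http://www.testpage.com/test.html
```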

import urlparse

URLscheme = "http"
URLlocation = "www.python.org"
URLpath = "lib/module-urlparse.html"
modList = ("urllib", "urllib2",
           "httplib", "cgilib")

#Parse address into tuple
print "Parsed Google search for urlparse"
parsedTuple = urlparse.urlparse(
    "http://www.google.com/search?hl=en&q=urlparse&btnG=Google+Search")
print parsedTuple

#Unparse list into URL
print "\nUnparsed python document page"
unparsedURL = urlparse.urlunparse(
    (URLscheme, URLlocation, URLpath, '', '', ''))
print "\t" + unparsedURL

#Join path to new file to create new URL
print "\nAdditional python document pages using join"
for mod in modList:
    newURL = urlparse.urljoin(unparsedURL,
                    "module-%s.html" % (mod))
    print "\t" + newURL

#Join path to subpath to create new URL
print "\nPython document pages using join of sub-path"
newURL = urlparse.urljoin(unparsedURL,
         "module-urllib2/request-objects.html")
print "\t" + newURL

URL_parse.py

Parsed Google search for urlparse
('http', 'www.google.com', '/search', '', 'hl=en&q=urlparse&btnG=Google+Search', '')

Unparsed python document page
        http://www.python.org/lib/module-urlparse.html

Additional python document pages using join
        http://www.python.org/lib/module-urllib.html
        http://www.python.org/lib/module-urllib2.html
        http://www.python.org/lib/module-httplib.html
        http://www.python.org/lib/module-cgilib.html

Python document pages using join of sub-path
        http://www.python.org/lib/module-urllib2/request-objects.html

Output from URL_parse.py code