16.8. More on HTML and URL Escapes
Perhaps the subtlest change in the last section's rewrite is that, for robustness, this version's reply script (Example 16-23) also calls cgi.escape for the language name, not just for the language's code snippet. This wasn't required in languages2.py (Example 16-20) for the known language names in our selection list table. However, it is not impossible that someone could pass the script a language name with an embedded HTML character as a query parameter. For example, a URL such as:
embeds a < in the language name parameter (the name is a<b). When submitted, this version uses cgi.escape to properly translate the < for use in the reply HTML, according to the standard HTML escape conventions discussed earlier:
<TITLE>Languages</TITLE>
<H1>Syntax</H1><HR>
<H3>a&lt;b</H3><P><PRE>
Sorry--I don't know that language
</PRE></P><BR>
<HR>
The original version doesn't escape the language name, so the embedded <b is interpreted as an HTML tag (which may make the rest of the page render in bold font!). As you can probably tell by now, text escapes are pervasive in CGI scripting: even text that you may think is safe must generally be escaped before being inserted into the HTML code in the reply stream.
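The difference between the two versions is easy to reproduce outside a CGI script. The following is only a minimal sketch: it assumes Python 3, where html.escape stands in for cgi.escape, and the reply_heading function is a hypothetical helper rather than code from the book's examples.

```python
from html import escape   # assumed Python 3 stand-in for cgi.escape

def reply_heading(language, escaped=True):
    """Build the <H3> line of a reply page (hypothetical helper)."""
    text = escape(language) if escaped else language
    return '<H3>%s</H3>' % text

unsafe = reply_heading('a<b', escaped=False)   # '<b' would start a bold tag
safe = reply_heading('a<b')                    # '<' becomes '&lt;'
print(unsafe)
print(safe)
```

Only the second heading is safe to write to the reply stream; the first leaves the browser free to treat <b as markup.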
Because the Web is a text-based medium that combines multiple language syntaxes, multiple formatting rules may apply: one for URLs and another for HTML. We met HTML escapes earlier in this chapter; URLs, and combinations of HTML and URLs, merit a few additional words.
16.8.1. URL Escape Code Conventions
Notice that in the prior section, although it's wrong to embed an unescaped < in the HTML code reply, it's perfectly all right to include it literally in the URL string used to trigger the reply. In fact, HTML and URLs define completely different characters as special. For instance, although & must be escaped as &amp; inside HTML code, we have to use other escaping schemes to code a literal & within a URL string (where it normally separates parameters). To pass a language name like a&b to our script, we have to type the following URL:
Here, %26 represents &: the & is replaced with a % followed by the hexadecimal value (0x26) of its ASCII code value (38). Similarly, as we suggested at the end of Chapter 14, to name C++ as a query parameter in an explicit URL, + must be escaped as %2b:
Sending C++ unescaped will not work, because + is special in URL syntax: it represents a space. By URL standards, most nonalphanumeric characters are supposed to be translated to such escape sequences, and spaces are replaced by + signs. Technically, this convention is known as the application/x-www-form-urlencoded query string format, and it's part of the magic behind those bizarre URLs you often see at the top of your browser as you surf the Web.
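The %XX rule described above can be worked out by hand. The sketch below is illustrative only; percent_escape is a hypothetical helper that applies the rule to a single character, not a library function.

```python
def percent_escape(char):
    """Return the %XX escape for one character (hypothetical helper)."""
    return '%%%02x' % ord(char)

print(hex(ord('&')), percent_escape('&'))   # 0x26 %26
print(hex(ord('+')), percent_escape('+'))   # 0x2b %2b

# Escape the reserved character in each language name by hand
amp_name = 'a&b'.replace('&', percent_escape('&'))
cpp_name = 'C++'.replace('+', percent_escape('+'))
print(amp_name, cpp_name)                   # a%26b C%2b%2b
```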
16.8.2. Python HTML and URL Escape Tools
If you're like me, you probably don't have the hexadecimal value of the ASCII code for & committed to memory (though Python's hex(ord(c)) can help). Luckily, Python provides tools that automatically implement URL escapes, just as cgi.escape does for HTML escapes. The main thing to keep in mind is that HTML code and URL strings are written with entirely different syntax, and so employ distinct escaping conventions. Web users don't generally care, unless they need to type complex URLs explicitly; browsers handle most escape code details internally. But if you write scripts that must generate HTML or URLs, you need to be careful to escape characters that are reserved in either syntax.
Because HTML and URLs have different syntaxes, Python provides two distinct sets of tools for escaping their text. In the standard Python library, cgi.escape escapes text to be embedded in HTML, and urllib.quote and urllib.quote_plus escape text to be embedded in URLs.
The urllib module also has tools for undoing URL escapes (unquote, unquote_plus), but HTML escapes are undone during HTML parsing at large (e.g., by Python's htmllib module). To illustrate the two escape conventions and tools, let's apply each tool set to a few simple examples.
16.8.3. Escaping HTML Code
As we saw earlier, cgi.escape translates code for inclusion within HTML. We normally call this utility from a CGI script, but it's just as easy to explore its behavior interactively:
>>> import cgi
>>> cgi.escape('a < b > c & d "spam"', 1)
'a &lt; b &gt; c &amp; d &quot;spam&quot;'
>>> s = cgi.escape("1<2 <b>hello</b>")
>>> s
'1&lt;2 &lt;b&gt;hello&lt;/b&gt;'
Python's cgi module automatically converts characters that are special in HTML syntax according to the HTML convention. It translates <, >, &, and with an extra true argument, ", into escape sequences of the form &X;, where the X is a mnemonic that denotes the original character. For instance, &lt; stands for the "less than" operator (<) and &amp; denotes a literal ampersand (&).
There is no unescaping tool in the CGI module, because HTML escape code sequences are recognized within the context of an HTML parser, like the one used by your web browser when a page is downloaded. Python comes with a full HTML parser, too, in the form of the standard module htmllib, which imports and specializes tools in the module sgmllib (HTML is a kind of SGML syntax). We won't go into details on the HTML parsing tools here (see the library manual for details), but to illustrate how escape codes are eventually undone, here is the SGML module at work reading back the last output from earlier:
>>> from sgmllib import TestSGMLParser
>>> p = TestSGMLParser(1)
>>> s
'1&lt;2 &lt;b&gt;hello&lt;/b&gt;'
>>> for c in s:
...     p.feed(c)
...
>>> p.close()
data: '1<2 <b>hello</b>'
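For readers on Python 3, where the sgmllib and htmllib modules are no longer available, the same round trip can be sketched with the html module. This is an assumed equivalent rather than the book's own code: html.escape plays the role of cgi.escape (and escapes quotes by default), and html.unescape undoes the escapes without running a full parser.

```python
from html import escape, unescape   # assumed Python 3 equivalents

quoted = escape('a < b > c & d "spam"')
print(quoted)

s = escape("1<2 <b>hello</b>")
print(s)

# Undo the escapes, as an HTML parser would when reading the page
restored = unescape(s)
print(restored)
```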
16.8.4. Escaping URLs
By contrast, URLs reserve other characters as special and must adhere to different escape conventions. As a result, we use different Python library tools to escape URLs for transmission. Python's urllib module provides two tools that do the translation work for us: quote, which implements the standard %XX hexadecimal URL escape code sequences for most nonalphanumeric characters, and quote_plus, which additionally translates spaces to + signs. The urllib module also provides functions for unescaping quoted characters in a URL string: unquote undoes %XX escapes, and unquote_plus also changes plus signs back to spaces. Here is the module at work, at the interactive prompt:
>>> import urllib
>>> urllib.quote("a & b #! c")
'a%20%26%20b%20%23%21%20c'
>>> urllib.quote_plus("C:\stuff\spam.txt")
'C%3a%5cstuff%5cspam.txt'
>>> x = urllib.quote_plus("a & b #! c")
>>> x
'a+%26+b+%23%21+c'
>>> urllib.unquote_plus(x)
'a & b #! c'
URL escape sequences embed the hexadecimal values of nonsafe characters following a % sign (usually, their ASCII codes). In urllib, nonsafe characters are usually taken to include everything except letters, digits, and a handful of safe special characters (underscore, comma, period, and hyphen, plus / by default). You can also specify a string of safe characters as an extra argument to the quote calls to customize the translations; in quote the argument defaults to /, but passing an empty string forces / to be escaped:
>>> urllib.quote("uploads/index.txt")
'uploads/index.txt'
>>> urllib.quote("uploads/index.txt", '')
'uploads%2findex.txt'
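The role of the safe argument can be checked with a short script. This sketch assumes Python 3, where these functions live in urllib.parse and emit uppercase hex digits (equivalent to the lowercase forms). It also shows that quote treats / as safe by default, whereas quote_plus uses an empty safe string by default.

```python
from urllib.parse import quote, quote_plus, unquote_plus   # assumed Python 3 location

kept = quote('uploads/index.txt')               # '/' is safe by default in quote
forced = quote('uploads/index.txt', safe='')    # empty safe string escapes '/'
plus_default = quote_plus('uploads/index.txt')  # quote_plus escapes '/' by default
print(kept, forced, plus_default)

# Escapes are reversible: unquote_plus restores the original text
x = quote_plus('a & b #! c')
round_trip = unquote_plus(x)
print(x, '->', round_trip)
```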
Note that Python's cgi module also translates URL escape sequences back to their original characters and changes + signs to spaces during the process of extracting input information. Internally, cgi.FieldStorage automatically calls urllib.unquote if needed to parse and unescape parameters passed at the end of URLs (most of the translation happens in cgi.parse_qs). The upshot is that CGI scripts get back the original, unescaped URL strings, and don't need to unquote values on their own. As we've seen, CGI scripts don't even need to know that inputs came from a URL at all.
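The unescaping that happens during input parsing can be observed directly. The following sketch assumes Python 3, where parse_qs is found in urllib.parse; the parameter names language and note are made up for illustration.

```python
from urllib.parse import parse_qs, quote_plus   # assumed Python 3 location

# Build a query string the way a careful client would
query = 'language=' + quote_plus('C++') + '&note=' + quote_plus('a & b')
print(query)

# The parser hands back the original, unescaped values
fields = parse_qs(query)
print(fields)

# Without escaping, each + in the value is read as a space
broken = parse_qs('language=C++')
print(broken)
```

The last call shows why C++ must be sent as C%2B%2B: the unescaped form arrives at the script as the letter C followed by two spaces.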
16.8.5. Escaping URLs Embedded in HTML Code
But what do we do for URLs inside HTML? That is, how do we escape when we generate and embed text inside a URL, which is itself embedded inside generated HTML code? Some of our earlier examples used hardcoded URLs with appended input parameters inside <A HREF> hyperlink tags; the file languages2.py, for instance, prints HTML that includes a URL:
Because the URL here is embedded in HTML, it must at least be escaped according to HTML conventions (e.g., any < characters must become &lt;), and any spaces should be translated to + signs per URL conventions. A cgi.escape(url) call followed by the string url.replace(" ", "+") would take us this far, and would probably suffice for most cases.
That approach is not quite enough in general, though, because HTML escaping conventions are not the same as URL conventions. To robustly escape URLs embedded in HTML code, you should instead call urllib.quote_plus on the URL string, or at least most of its components, before adding it to the HTML text. The escaped result also satisfies HTML escape conventions, because urllib translates more characters than cgi.escape, and the % in URL escapes is not special to HTML.
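As a rough illustration of this advice, the sketch below URL-escapes a parameter value before placing it in a hyperlink, and HTML-escapes the visible link text. It assumes Python 3 (html.escape and urllib.parse.quote_plus); the make_link helper, the script name, and the filename parameter are all hypothetical.

```python
from html import escape                  # assumed Python 3 stand-in for cgi.escape
from urllib.parse import quote_plus      # assumed Python 3 location of quote_plus

def make_link(script, filename, label):
    """Return an <A HREF> tag with a URL-escaped parameter (hypothetical helper)."""
    url = '%s?filename=%s' % (script, quote_plus(filename))
    return '<A HREF="%s">%s</A>' % (url, escape(label))

link = make_link('getfile.py', 'a<b & c.txt', 'view <source>')
print(link)
```

The quoted value contains no <, >, &, or " characters, so it is also safe inside the HTML attribute.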
16.8.5.1. HTML and URL conflicts: &
But there is one more very subtle (and thankfully rare) wrinkle: you may also have to be careful with & characters in URL strings that are embedded in HTML code (e.g., within <A> hyperlink tags). The & symbol is both a query parameter separator in URLs (?a=1&b=2) and the start of escape codes in HTML (&lt;). Consequently, there is a potential for collision if a query parameter name happens to be the same as an HTML escape sequence code. The query parameter name amp, for instance, which shows up as &amp=1 in parameters two and beyond on the URL, may be treated as an HTML escape by some HTML parsers and translated to &=1.
Even if parts of the URL string are URL-escaped, when more than one parameter is separated by a &, the & separator might also have to be escaped as &amp; according to HTML conventions. To see why, consider the following HTML hyperlink tag with query parameter names name, job, amp, sect, and lt:
When rendered in most browsers tested, this URL link winds up looking incorrectly like this (the S character is really a non-ASCII section marker):
The first two parameters are retained as expected (name=a, job=b), because name is not preceded with an & and &job is not recognized as a valid HTML character escape code. However, the &amp, &sect, and &lt parts are interpreted as special characters, because they do name valid HTML escape codes.
16.8.5.2. Avoiding conflicts
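The collision can be simulated without a browser. In this sketch, Python 3's html.unescape is used as a stand-in for a lenient HTML parser (an assumption; real browsers apply their own rules), and the query string uses the parameter names discussed above.

```python
from html import unescape            # stand-in for a lenient HTML parser (assumption)
from urllib.parse import parse_qs    # assumed Python 3 location

raw = 'name=a&job=b&amp=c&sect=d&lt=e'

# &amp, &sect, and &lt are taken as character escapes; &job is not
seen = unescape(raw)
print(seen)

# The server then parses a damaged query string
params = parse_qs(seen)
print(params)
```

Only name and job survive intact; the amp, sect, and lt parameters are lost.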
To make this work as expected, the & separators should be escaped if your parameter names may clash with an HTML escape code:
Browsers render this fully escaped link as expected:
Because of this conflict between HTML and URL syntax, most server tools (including Python's cgi module) also allow a semicolon to be used as a separator instead of &; the following link, for example, works the same as the fully escaped URL, but does not require an extra HTML escaping step (at least not for the ;):
To test this for yourself, put these links in an HTML file, open the file in your browser using its http://localhost/badlink.html URL, and view the links when followed. The HTML file in Example 16-24 will suffice.
Example 16-24. PP3E\Internet\Web\badlink.html
When these links are clicked, they invoke the simple CGI script in Example 16-25. This script displays the inputs sent from the client on the standard error stream to avoid any additional translations (for our locally running web server in Example 16-1, this routes the printed text to the server's console window).
Example 16-25. PP3E\Internet\Web\cgi-bin\badlink.py
When the "escaped" second link in the HTML page is followed by the Firefox web browser, we get back the correct parameters set on the server as a result of the HTML escaping employed, as seen in the web server's console window:
..."GET /cgi-bin/badlink.py?name=a&job=b&amp=c&sect=d&lt=e HTTP/1.1" 200 -
name => a
job => b
amp => c
sect => d
lt => e
But the accidental HTML escapes cause serious issues for the first "unescaped" link: the client's HTML parser translates these in unintended ways:
..."GET /cgi-bin/badlink.py?name=a&job=b&=c%A7=d%3C=e HTTP/1.1" 200 -
name => a
job => b
=> cº=d<=e
The third "alternative" link produces the same output as the "escaped" link, because ; doesn't cause collisions in HTML, as & does. We get the intended set of parameter names:
..."GET /cgi-bin/badlink.py?name=a;job=b;amp=c;sect=d;lt=e HTTP/1.1" 200 -
name => a
job => b
amp => c
sect => d
lt => e
The moral of this story is that unless you can be sure that the names of all but the leftmost URL query parameters embedded in HTML are not the same as the name of any HTML character escape code like amp, you should generally either use a semicolon as a separator if supported by your tools, or run the entire URL through cgi.escape after escaping its parameter names and values with urllib.quote_plus:
>>> import cgi
>>> cgi.escape('file.py?name=a&job=b&amp=c&sect=d&lt=e')
'file.py?name=a&amp;job=b&amp;amp=c&amp;sect=d&amp;lt=e'
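Both escaping steps can be combined in one helper. The sketch below assumes Python 3 (html.escape in place of cgi.escape, urllib.parse in place of urllib); html_safe_url is a hypothetical function, and html.unescape again stands in for the browser's HTML parser.

```python
from html import escape, unescape              # assumed Python 3 equivalents
from urllib.parse import quote_plus, parse_qs

def html_safe_url(script, params):
    """URL-escape each name and value, then HTML-escape the whole URL (hypothetical helper)."""
    query = '&'.join('%s=%s' % (quote_plus(k), quote_plus(v)) for k, v in params)
    return escape(script + '?' + query)

params = [('name', 'a'), ('job', 'b'), ('amp', 'c'), ('sect', 'd'), ('lt', 'e')]
href = html_safe_url('file.py', params)
print(href)

# What the server receives after the HTML escapes are undone on the client
decoded = unescape(href)
received = parse_qs(decoded.split('?', 1)[1])
print(received)
```

With the separators escaped as &amp;, all five parameter names arrive intact.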
Having said that, I should add that some examples in this book do not escape & URL separators embedded within HTML simply because their URL parameter names are known not to conflict with HTML escapes. In fact, this concern is likely to be very rare in practice, since your program usually controls the set of parameter names it expects. This is not, however, the most general solution, especially if parameter names may be driven by a dynamic database; when in doubt, escape much and often.