Recipe 17.4. Encoding Special Characters in Web Output

Problem

Certain characters are special in web pages and must be encoded if you want to display them literally. Because database content often contains instances of these characters, scripts that include query results in web pages should encode those results to prevent browsers from misinterpreting the information.

Solution

Use the methods that are provided by your API for performing HTML-encoding and URL-encoding.

Discussion

HTML is a markup language: it uses certain characters as markers that have a special meaning. To include literal instances of these characters in a page, you must encode them so that they are not interpreted as having their special meanings. For example, < should be encoded as &lt; to keep a browser from interpreting it as the beginning of a tag. Furthermore, there are actually two kinds of encoding, depending on the context in which you use a character. One encoding is appropriate for HTML text; another is used for text that is part of a URL in a hyperlink.

The MySQL table-display scripts shown in Recipes 17.2 and 17.3 are simple demonstrations of how to produce web pages using programs. But with one exception, the scripts have a common failing: they take no care to properly encode special characters that occur in the information retrieved from the MySQL server. (The exception is the JSP version of the script. The <c:out> tag used there handles encoding automatically, as we'll discuss shortly.)

As it happens, I deliberately chose information to display that is unlikely to contain any special characters; the scripts should work properly even in the absence of any encoding. However, in the general case, it's unsafe to assume that a query result will contain no special characters, so you must be prepared to encode it for display in a web page. Neglecting to do this often results in scripts that generate pages containing malformed HTML that displays incorrectly.

This recipe describes how to handle special characters, beginning with some general principles, and then discusses how each API implements encoding support. The API-specific examples show how to process information drawn from a database table, but they can be adapted to any content you include in a web page, no matter its source.

General encoding principles

One form of encoding applies to characters that are used in writing HTML constructs; another applies to text that is included in URLs. It's important to understand this distinction so that you don't encode text inappropriately.

NOTE

Encoding text for inclusion in a web page is an entirely different issue from encoding special characters in data values for inclusion in an SQL statement. Section 2.5 discusses the latter issue.

Encoding characters that are special in HTML. HTML markup uses < and > characters to begin and end tags, & to begin special entity names (such as &nbsp; to signify a nonbreaking space), and " to quote attribute values in tags (such as <p align="left">). Consequently, to display literal instances of these characters, you should encode them as HTML entities so that browsers or other clients understand your intent. To do this, convert the special characters <, >, &, and " to the corresponding HTML entity designators shown in the following table.

Special character	HTML entity
`<`	`&lt;`
`>`	`&gt;`
`&`	`&amp;`
`"`	`&quot;`

Suppose that you want to display the following string literally in a web page:

Paragraphs begin and end with <p> & </p> tags.

If you send this text to the client browser exactly as shown, the browser will misinterpret it: the <p> and </p> tags will be taken as paragraph markers and the & may be taken as the beginning of an HTML entity designator. To display the string the way you intend, encode the special characters as the &lt;, &gt;, and &amp; entities:

Paragraphs begin and end with &lt;p&gt; &amp; &lt;/p&gt; tags.
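The conversion just shown can be reproduced with a few lines of code. The following is a minimal sketch in Python 3, assuming the standard library's html.escape() function; the helper name encode_html_text is hypothetical and used only for illustration.

```python
import html

def encode_html_text(text):
    """Return text with <, >, and & replaced by HTML entities."""
    # quote=False leaves " and ' alone, which is fine for text between tags
    return html.escape(text, quote=False)

raw = "Paragraphs begin and end with <p> & </p> tags."
print(encode_html_text(raw))
# Paragraphs begin and end with &lt;p&gt; &amp; &lt;/p&gt; tags.
```

The & character is converted along with < and >, so any & that is already present in the data cannot be mistaken for the start of an entity.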

The principle of encoding text this way is also useful within tags. For example, HTML tag attribute values usually are enclosed within double quotes, so it's important to perform HTML-encoding on attribute values. Suppose that you want to include a text-input box in a form, and you want to provide an initial value of Rich "Goose" Gossage to be displayed in the box. You cannot write that value literally in the tag like this:

<input type="text" name="player_name" value="Rich "Goose" Gossage" />

The problem here is that the double-quoted value attribute includes internal double quotes, which makes the <input> tag malformed. The proper way to write it is to encode the double quotes:

<input type="text" name="player_name" value="Rich &quot;Goose&quot; Gossage" />

When a browser receives this text, it decodes the &quot; entities back to " characters and interprets the value attribute value properly.
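As a rough illustration of attribute encoding, the sketch below builds the same <input> tag in Python 3. It assumes html.escape() from the standard library; the function name text_input is hypothetical.

```python
import html

def text_input(name, value):
    """Build an <input> tag whose attribute values are HTML-encoded."""
    # quote=True (the default) also encodes " as &quot; and ' as &#x27;
    return '<input type="text" name="%s" value="%s" />' % (
        html.escape(name, quote=True),
        html.escape(value, quote=True),
    )

print(text_input("player_name", 'Rich "Goose" Gossage'))
# <input type="text" name="player_name" value="Rich &quot;Goose&quot; Gossage" />
```

Encoding the double quotes is what keeps the internal quotes from terminating the value attribute early.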

Encoding characters that are special in URLs. URLs for hyperlinks that occur within HTML pages have their own syntax and their own encoding. This encoding applies to attributes within several tags:

<a href="URL">
<img src="URL">
<form action="URL">
<frame src="URL">

Many characters have special meaning within URLs, such as :, /, ?, =, &, and ;. The following URL contains some of these characters:

http://localhost/myscript.php?id=428&name=Gandalf

Here the : and / characters segment the URL into components, the ? character indicates that parameters are present, and the & character separates the parameters, each of which is specified as a name=value pair. (The ; character is not present in the URL just shown, but commonly is used instead of & to separate parameters.) If you want to include any of these characters literally within a URL, you must encode them to prevent the browser from interpreting them with their usual special meaning. Other characters such as spaces require special treatment as well. Spaces are not allowed within a URL, so if you want to reference a page named my home page.html on the local host, the URL in the following hyperlink won't work:

<a href="http://localhost/my home page.html">My Home Page</a>

URL-encoding for special and reserved characters is performed by converting each such character to % followed by two hexadecimal digits representing the character's ASCII code. For example, the ASCII value of the space character is 32 decimal, or 20 hexadecimal, so you'd write the preceding hyperlink like this:

<a href="http://localhost/my%20home%20page.html">My Home Page</a>

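Sometimes you'll see spaces encoded as + in URLs. That is legal too, but only in the parameter part of a URL (after the ?), where it follows the convention used for form submissions; elsewhere in a URL, + is a literal plus sign.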
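To make the arithmetic concrete, here is a short sketch in Python 3, assuming the quote(), quote_plus(), unquote(), and unquote_plus() functions from the standard urllib.parse module. It shows the hexadecimal code for a space and both ways of encoding one.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# The space character is ASCII 32 decimal, which is 20 hexadecimal
print("%%%02X" % ord(" "))            # %20

# quote() leaves "/" alone by default, which suits the path part of a URL
print(quote("/my home page.html"))    # /my%20home%20page.html

# quote_plus() targets query-string values and writes spaces as +
print(quote_plus("my home page"))     # my+home+page

# Only the form-style decoder turns + back into a space
print(unquote_plus("my+home+page"))   # my home page
print(unquote("my+home+page"))        # my+home+page
```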

Use the appropriate encoding method for the context. Be sure to encode information properly for the context in which you're using it. Suppose that you want to create a hyperlink to trigger a search for items matching a search term, and you want the term itself to appear as the link label that is displayed in the page. In this case, the term appears as a parameter in the URL, and also as HTML text between the <a> and </a> tags. If the search term is "cats & dogs", the unencoded hyperlink construct looks like this:

<a href="/cgi-bin/myscript?term=cats & dogs">cats & dogs</a>

That is incorrect because & is special in both contexts and the spaces are special in the URL. The link should be written like this instead:

<a href="/cgi-bin/myscript?term=cats%20%26%20dogs">cats &amp; dogs</a>

Here, & is HTML-encoded as &amp; for the link label, and is URL-encoded as %26 for the URL, which also includes spaces encoded as %20.
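The two-context rule can be sketched in Python 3 as follows. This assumes html.escape() and urllib.parse.quote() from the standard library; the function name search_link and its script parameter are hypothetical.

```python
import html
from urllib.parse import quote

def search_link(term, script="/cgi-bin/myscript"):
    """Return an <a> element that uses term both in the URL and as the label."""
    # URL context: safe="" makes quote() encode every reserved character
    url = script + "?term=" + quote(term, safe="")
    # HTML context: the href value and the label are both HTML-encoded
    return '<a href="%s">%s</a>' % (html.escape(url, quote=True), html.escape(term))

print(search_link("cats & dogs"))
# <a href="/cgi-bin/myscript?term=cats%20%26%20dogs">cats &amp; dogs</a>
```

The same input string passes through a different encoder for each context, which is the point of the example: neither function alone would produce a correct link.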

Granted, it's a pain to encode text before writing it to a web page, and sometimes you know enough about a value that you can skip the encoding. (See the sidebar, "Do You Always Need to Encode Web Page Output?") But encoding is the safe thing to do most of the time. Fortunately, most APIs provide functions to do the work for you. This means you need not know every character that is special in a given context. You just need to know which kind of encoding to perform, so that you can call the appropriate function to produce the intended result.

Do You Always Need to Encode Web Page Output?

If you know a value is legal in a particular context within a web page, you need not encode it. For example, if you obtain a value from an integer-valued column in a database table that cannot be NULL, it must necessarily be an integer. No HTML- or URL-encoding is needed to include the value in a web page, because digits are not special in HTML text or within URLs. On the other hand, suppose that you solicit an integer value using a field in a web form. You might be expecting the user to provide an integer, but the user might be confused and enter an illegal value. You could handle this by displaying an error page that shows the value and explains that it's not an integer. But if the value contains special characters and you don't encode it, the page won't display the value properly, possibly further confusing the user.
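The form-field scenario might look like the following sketch in Python 3. The function name integer_error_page and the wording of the message are hypothetical, and the value is assumed to arrive as a string.

```python
import html

def integer_error_page(raw_value):
    """Return None if raw_value is an integer, else an HTML error message."""
    try:
        int(raw_value)
        return None  # a valid integer needs no error message
    except ValueError:
        # The rejected value is arbitrary user input, so HTML-encode it
        return "<p>Not an integer: %s</p>" % html.escape(raw_value)

print(integer_error_page("42"))         # None
print(integer_error_page("<b>4</b>2"))
# <p>Not an integer: &lt;b&gt;4&lt;/b&gt;2</p>
```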

Encoding special characters using web APIs

The following encoding examples show how to pull values out of MySQL and perform both HTML-encoding and URL-encoding on them to generate hyperlinks. Each example reads a table named phrase that contains short phrases and then uses its contents to construct hyperlinks that point to a (hypothetical) script that searches for instances of the phrases in some other table. The table contains the following rows:

mysql> SELECT phrase_val FROM phrase ORDER BY phrase_val;
+--------------------------+
| phrase_val               |
+--------------------------+
| are we "there" yet?      |
| cats & dogs              |
| rhinoceros               |
| the whole > sum of parts |
+--------------------------+

The goal here is to generate a list of hyperlinks using each phrase both as the hyperlink label (which requires HTML-encoding) and in the URL as a parameter to the search script (which requires URL-encoding). The resulting links look something like this:

<a href="/cgi-bin/mysearch.pl?phrase=are%20we%20%22there%22%20yet%3F">
are we &quot;there&quot; yet?</a>
<a href="/cgi-bin/mysearch.pl?phrase=cats%20%26%20dogs">
cats &amp; dogs</a>
<a href="/cgi-bin/mysearch.pl?phrase=rhinoceros">
rhinoceros</a>
<a href="/cgi-bin/mysearch.pl?phrase=the%20whole%20%3E%20sum%20of%20parts">
the whole &gt; sum of parts</a>

The initial part of the href attribute value will vary per API. Also, the links produced by some APIs will look slightly different because they encode spaces as + rather than as %20.

Perl. The Perl CGI.pm module provides two methods, escapeHTML() and escape(), that handle HTML-encoding and URL-encoding. There are three ways to use these methods to encode a string $str:

Invoke escapeHTML() and escape() as CGI class methods using a CGI:: prefix:
```
use CGI;
printf "%s\n%s\n", CGI::escape ($str), CGI::escapeHTML ($str);
```
Create a CGI object and invoke escapeHTML() and escape() as object methods:
```
use CGI;
my $cgi = CGI->new;
printf "%s\n%s\n", $cgi->escape ($str), $cgi->escapeHTML ($str);
```
Import the names explicitly into your script's namespace. In this case, neither a CGI object nor the CGI:: prefix is necessary and you can invoke the methods as standalone functions. The following example imports the two method names in addition to the set of standard names:
```
use CGI qw(:standard escape escapeHTML);
printf "%s\n%s\n", escape ($str), escapeHTML ($str);
```

I prefer the last alternative because it is consistent with the CGI.pm function call interface that you use for other imported method names. Just remember to include the encoding method names in the use CGI statement for any Perl script that requires them, or you'll get "undefined subroutine" errors when the script executes.

The following code reads the contents of the phrase table and produces hyperlinks from them using escapeHTML() and escape():

my $stmt = "SELECT phrase_val FROM phrase ORDER BY phrase_val";
my $sth = $dbh->prepare ($stmt);
$sth->execute ();
while (my ($phrase) = $sth->fetchrow_array ())
{
  # URL-encode the phrase value for use in the URL
  my $url = "/cgi-bin/mysearch.pl?phrase=" . escape ($phrase);
  # HTML-encode the phrase value for use in the link label
  my $label = escapeHTML ($phrase);
  print a ({-href => $url}, $label), br (), "\n";
}

Ruby. The Ruby cgi module contains two methods, CGI.escapeHTML() and CGI.escape(), that perform HTML-encoding and URL-encoding. However, both methods raise an exception unless the argument is a string. One way to deal with this is to apply the to_s method to any argument that might not be a string, to force it to string form and convert nil to the empty string. For example:

stmt = "SELECT phrase_val FROM phrase ORDER BY phrase_val"
dbh.execute(stmt) do |sth|
  sth.fetch do |row|
    # make sure that the value is a string
    phrase = row[0].to_s
    # URL-encode the phrase value for use in the URL
    url = "/cgi-bin/mysearch.rb?phrase=" + CGI.escape(phrase)
    # HTML-encode the phrase value for use in the link label
    label = CGI.escapeHTML(phrase)
    page << cgi.a("href" => url) { label } + cgi.br + "\n"
  end
end

page is used here as a variable that "accumulates" page content and that eventually you pass to cgi.out to display the page.

PHP. In PHP, the htmlspecialchars() and urlencode() functions perform HTML-encoding and URL-encoding. Use them as follows:

$stmt = "SELECT phrase_val FROM phrase ORDER BY phrase_val";
$result =& $conn->query ($stmt);
if (!PEAR::isError ($result))
{
  while (list ($phrase) = $result->fetchRow ())
  {
    # URL-encode the phrase value for use in the URL
    $url = "/mcb/mysearch.php?phrase=" . urlencode ($phrase);
    # HTML-encode the phrase value for use in the link label
    $label = htmlspecialchars ($phrase);
    printf ("<a href=\"%s\">%s</a><br />\n", $url, $label);
  }
  $result->free ();
}

Python. In Python, the cgi and urllib modules contain the relevant encoding methods. cgi.escape() and urllib.quote() perform HTML-encoding and URL-encoding. However, both methods raise an exception unless the argument is a string. One way to deal with this is to apply the str() function to any argument that might not be a string, to force it to string form and convert None to the string "None". (If you want None to convert to the empty string, you need to test for it explicitly.) For example:

import cgi
import urllib

stmt = "SELECT phrase_val FROM phrase ORDER BY phrase_val"
cursor = conn.cursor ()
cursor.execute (stmt)
for (phrase,) in cursor.fetchall ():
  # make sure that the value is a string
  phrase = str (phrase)
  # URL-encode the phrase value for use in the URL
  url = "/cgi-bin/mysearch.py?phrase=" + urllib.quote (phrase)
  # HTML-encode the phrase value for use in the link label
  label = cgi.escape (phrase, 1)
  print "<a href=\"%s\">%s</a><br />" % (url, label)
cursor.close ()

The first argument to cgi.escape() is the string to be HTML-encoded. By default, this function converts <, >, and & characters to their corresponding HTML entities. To tell cgi.escape() to also convert double quotes to the &quot; entity, pass a second argument of 1, as shown in the example. This is especially important if you're encoding values to be placed into a double-quoted tag attribute.
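The code just shown is written for Python 2. In Python 3, urllib.quote() moved to urllib.parse.quote(), and cgi.escape() was removed in Python 3.8 in favor of html.escape(). The following sketch shows one possible Python 3 rendering of the same loop. It is not part of the original recipe: the function name phrase_links is hypothetical, and a plain list of one-element tuples stands in for the rows that cursor.fetchall() would return.

```python
import html
from urllib.parse import quote

def phrase_links(rows, script="/cgi-bin/mysearch.py"):
    """Yield one hyperlink per (phrase,) row, as a cursor would return them."""
    for (phrase,) in rows:
        # Test for None explicitly so that NULL becomes "" rather than "None"
        phrase = "" if phrase is None else str(phrase)
        # URL-encode the phrase value for use in the URL
        url = script + "?phrase=" + quote(phrase, safe="")
        # HTML-encode the phrase value for use in the link label
        label = html.escape(phrase, quote=True)
        yield '<a href="%s">%s</a><br />' % (url, label)

# Sample rows standing in for cursor.fetchall()
rows = [('are we "there" yet?',), ("cats & dogs",), (None,)]
for link in phrase_links(rows):
    print(link)
```

Unlike cgi.escape(), html.escape() converts quote characters by default, so the second argument is written out here only to make the intent explicit.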

Java. The <c:out> JSTL tag automatically performs HTML-encoding for JSP pages. (Strictly speaking, it performs XML-encoding, but the set of characters affected is <, >, &, ", and ', which includes all those needed for HTML-encoding.) By using <c:out> to display text in a web page, you need not even think about converting special characters to HTML entities. If for some reason you want to suppress encoding, invoke <c:out> with an escapeXml attribute value of false:

<c:out value="value to display" escapeXml="false"/>

To URL-encode parameters for inclusion in a URL, use the <c:url> tag. Specify the URL string in the tag's value attribute, and include any parameter values and names in <c:param> tags in the body of the <c:url> tag. A parameter value can be given either in the value attribute of a <c:param> tag or in its body. Here's an example that shows both uses:

<c:url var="urlStr" value="myscript.jsp">
  <c:param name="id" value="47"/>
  <c:param name="color">sky blue</c:param>
</c:url>

This will URL-encode the values of the id and color parameters and add them to the end of the URL. The result is placed in an object named urlStr, which you can display as follows:

<c:out value="${urlStr}"/>
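For comparison with the Python section, the same idea of URL-encoding name=value parameters and appending them to a script name can be sketched with Python 3's urllib.parse.urlencode(). This is only an illustration of parameter encoding in general, not a statement of what <c:url> itself emits.

```python
from urllib.parse import urlencode, quote

params = {"id": "47", "color": "sky blue"}

# Default form-style encoding writes spaces as +
print("myscript.jsp?" + urlencode(params))
# myscript.jsp?id=47&color=sky+blue

# quote_via=quote writes spaces as %20 instead
print("myscript.jsp?" + urlencode(params, quote_via=quote))
# myscript.jsp?id=47&color=sky%20blue
```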

NOTE

The <c:url> tag does not encode special characters such as spaces in the string supplied in its value attribute. You must encode them yourself, so it's probably best to avoid creating pages with spaces in their names; then the need to encode them never arises.

To display entries from the phrase table, use the <c:out> and <c:url> tags as follows:

<sql:query dataSource="${conn}" var="rs">
  SELECT phrase_val FROM phrase ORDER BY phrase_val
</sql:query>
<c:forEach items="${rs.rows}" var="row">
  <%-- URL-encode the phrase value for use in the URL --%>
  <c:url var="urlStr" value="/mcb/mysearch.jsp">
    <c:param name="phrase" value="${row.phrase_val}"/>
  </c:url>
  <a href="<c:out value="${urlStr}"/>">
    <%-- HTML-encode the phrase value for use in the link label --%>
    <c:out value="${row.phrase_val}"/>
  </a>
  <br />
</c:forEach>