11.5. Processing Form Data

The form concept in HTML is rather simple, even simplistic. This has been obscured by the superficial complexity of elements used to construct a form as well as by the variation in technologies for processing form data. The basic idea is the following:

A form element in HTML defines a data structure for a fill-out form and indicates (in an action attribute) the address of the software that processes the form data when submitted, the form handler.
A form element contains input fields, called controls in HTML specifications. An input field allows a user to select between alternatives, type in data, insert a file, or submit the form data.
When a form is submitted, typically by clicking on a submit button (defined by an input field), the web browser takes the contents of all input fields, encodes them in a particular way, and submits this data to the form handler.
The data may pass through some interface (such as Common Gateway Interface) that converts the data to a format that is more easily processed by the form handler.
The form handler usually decodes the form data to a suitable format, often splitting it into different variables corresponding to the fields of the form.
The rest is up to the form handler. It may, and normally should, send the browser some response, such as search results, a notification or an error message, or the next part of a logical form divided into parts.
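The round trip described in the steps above can be sketched in a few lines. This is only an illustration in Python (not one of the languages used elsewhere in this section), and the field names and values are hypothetical:

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical field names and values, standing in for what a user typed.
fields = {"name": "Jane Doe", "email": "jane@example.org"}

# Roughly what the browser does on submission: encode all fields
# into a single URL-encoded string.
encoded = urlencode(fields)

# Roughly what the form handler (or an interface in front of it) does:
# URL decode the string and split it into one variable per field.
decoded = {name: values[0] for name, values in parse_qs(encoded).items()}

print(encoded)
print(decoded)
```

The encoded string contains + for the space and %40 for the @ sign; after decoding, the handler sees the values as they were typed.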

Originally, form handling was designed for ASCII data. When the GET method is used (the form element has the attribute method="GET", which is the default), the form data is encoded into a URL using URL encoding as described in Chapter 6. Thereby the character repertoire is restricted to ASCII. Form data processing is undefined in other cases, though in practice, other encodings have been used, relying on extended URL encoding. Using method="POST" is in principle safer, since that way, the form data is passed as a separate block of data, not as part of any URL.
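The ambiguity of extended URL encoding can be made concrete with a small sketch. It is written in Python purely for illustration and assumes that the browser and the form handler disagree about the encoding; the variable names are made up for the example:

```python
from urllib.parse import quote, unquote

text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# The same character yields different URL-encoded octets per encoding.
as_latin1 = quote(text, encoding="iso-8859-1")  # "%E9"
as_utf8 = quote(text, encoding="utf-8")         # "%C3%A9"

# A handler that assumes the wrong encoding gets a wrong result.
# The lone octet E9 is not valid UTF-8, so U+FFFD is substituted.
latin1_read_as_utf8 = unquote(as_latin1, encoding="utf-8", errors="replace")

# The two UTF-8 octets are read as two separate ISO-8859-1 characters.
utf8_read_as_latin1 = unquote(as_utf8, encoding="iso-8859-1")

print(as_latin1, as_utf8)
print(repr(latin1_read_as_utf8), repr(utf8_read_as_latin1))
```

Nothing in the URL itself says which of the two forms was meant, which is why the handler needs to know the encoding by other means.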

A web author who sets up a form should consider the potential problems caused by non-ASCII input, even if he has no intentions of processing such data. We will here present some basic problems and solutions. More details are available on the page "FORM submission and i18n," http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html.

You cannot prevent people from writing strange characters in form fields. You can only be prepared to handle them somehow.

11.5.1. Decoding Form Data

The usual tools for decoding form data in programming languages extract the values of form fields and decode the URL encoding. This is typically automatic in advanced programming tools. For example, using Perl and the CGI.pm library for CGI scripting, you would use code like the following to retrieve the value of a field foo to a variable $zap, as URL decoded:

use CGI qw(:standard);
$zap = param('foo');

Thus, a character that was typed as @ and URL encoded as %40, is again @ after this operation. In PHP, for example, you would do the same thing as follows:

$zap = $_GET['foo'];

In some cases, you might wish to use functions that specifically URL decode data, such as urldecode in PHP. It is, however, important to avoid URL decoding twice, since decoding already decoded data can result in completely wrong results.
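The effect of decoding twice is easy to demonstrate. The following sketch uses Python for illustration only and assumes a user who really typed a percent sign followed by two digits:

```python
from urllib.parse import quote, unquote

typed = "100%41"       # hypothetical input containing a literal percent sign
sent = quote(typed)    # the percent sign itself is encoded as %25

once = unquote(sent)   # correct: the text as typed
twice = unquote(once)  # wrong: "%41" is now misread as an encoded "A"

print(sent, once, twice)
```

After one decoding step, the handler has the original text. The second step silently turns it into different data.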

If you wish to use the decoded data on an HTML page, typically in the content of the result page that the form handler sends, you need to escape the markup-significant characters < and & and possibly quotation marks, as usual in HTML. Programming languages often have built-in functions like HTMLescape for the purpose. However, there are problems with this, due to the way browsers may represent special characters, as explained in "Avoid Oddities by Using UTF-8" later in the chapter.
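As a minimal sketch of the plain escaping step, here is the Python standard library's `html.escape` function, used only as an illustration of what such built-in functions do. The sample value is hypothetical:

```python
import html

# Hypothetical decoded field value containing markup-significant characters.
value = 'AT&T <b>"fast"</b>'

# Escape &, <, > and, with quote=True, quotation marks as well.
safe = html.escape(value, quote=True)

print(safe)
```

Note that this kind of function escapes every ampersand, including one that starts a character reference.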

11.5.2. Recognizing the Encoding

Extracting fields from the form data and URL decoding them is not sufficient. You need to find out the encoding in which the data should be interpreted. The encoding should be the same as on the page where the form appears. Although HTML specifications define an accept-charset attribute for specifying the encoding of form data, it has not been implemented. Instead, browsers use the page's encoding if they can. We cannot always know for sure that a browser has got this right, though.

It is possible that a browser receives a document that is, say, ISO-8859-15 encoded and announced as such, but the browser actually treats it as ISO-8859-1 or windows-1252 encoded. The user would usually observe nothing wrong, especially if all characters used on the page have the same code numbers in all the encodings. However, if she fills out and submits a form, her data might get distorted. If she enters a character that has a different code in ISO-8859-15 than in the code actually used, the form handler interprets it incorrectly.

A simple heuristic check is to include a hidden field in the form and check its value in the form handler. The field should contain some characters that have different codes in encodings that might actually be used by browsers. The euro sign U+20AC, representable in HTML as &#8364;, is a useful diagnostic character, since it has different codes in Unicode and windows-1252, and it does not belong to ISO-8859-1 at all. You could also include some other character, one that does not appear in windows-1252. For example:

<input type="hidden" name="euro" value="&#8364;">
<input type="hidden" name="Omega" value="&#937;">

In the form handler, you can check that the value of the euro field is what it should be. For example, if your document was sent as ISO-8859-15 encoded, the value should be octet A4 in hexadecimal. If your document was sent as UTF-8, the value should be the UTF-8 encoded form of U+20AC, which is E2 82 AC.

If the test fails, you know that something went wrong, and normal form processing should be prevented. The form handler could, for example, just send back an error message like the following: "Form data cannot be processed. Unfortunately, your browser is not able to handle the character encoding UTF-8. Therefore, we cannot ensure that your data would be processed correctly."
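The check on the hidden euro field could look roughly like the following. This is a sketch in Python, not the form handler of any particular framework. The function name `euro_field_ok` and the table `EXPECTED_EURO` are hypothetical, and the function is assumed to receive the field value while it is still URL encoded:

```python
from urllib.parse import unquote_to_bytes

# Octets expected for the euro sign U+20AC in some page encodings.
EXPECTED_EURO = {
    "utf-8": b"\xe2\x82\xac",
    "iso-8859-15": b"\xa4",
    "windows-1252": b"\x80",
}

def euro_field_ok(raw_value: str, page_encoding: str) -> bool:
    """Compare the octets of the hidden field with the expected ones.

    raw_value is the still URL-encoded value of the hidden field.
    """
    return unquote_to_bytes(raw_value) == EXPECTED_EURO[page_encoding]

# The page was sent as UTF-8 and the browser used UTF-8: the test passes.
print(euro_field_ok("%E2%82%AC", "utf-8"))
# The page was sent as UTF-8 but the browser used windows-1252: it fails.
print(euro_field_ok("%80", "utf-8"))
```

Comparing octets rather than decoded characters keeps the test independent of any decoding the handler might do later.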

11.5.3. Avoid Oddities by Using UTF-8

There is a particular reason to use UTF-8 on pages that contain a form. If the user enters a character that cannot be represented in the encoding of the page, there is no rule that says what a browser should do. It would be natural to expect that it issues an error message, or perhaps omits such a character or replaces it with a suitable control character. However, what browsers normally do is convert the character to a character reference (in decimal) and then include this value as URL encoded into the form data.

For example, assume that your page is ISO-8859-1 encoded and contains a form with a text input field. If the user enters, for example, the Greek capital letter omega Ω, browsers will typically convert it to the character reference &#937; and then URL encode this to the following: %26%23937%3B. Although this is quite illogical (character references belong to HTML source, not to encoded data) and does not conform to any specification, you need to take it into account. A user may fill out your form using characters he finds natural or necessary, without realizing the limitations of the encoding.
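The browser behavior described above can be imitated in Python, whose `xmlcharrefreplace` error handler happens to produce the same decimal character references. This is only a sketch of the effect, not a description of how any browser is implemented:

```python
import html
from urllib.parse import quote, unquote

omega = "\u03a9"  # GREEK CAPITAL LETTER OMEGA

# Omega is not representable in ISO-8859-1, so it is replaced
# by a decimal character reference before URL encoding.
octets = omega.encode("iso-8859-1", errors="xmlcharrefreplace")
sent = quote(octets)

# On the server side, URL decoding gives back the reference, not the letter.
received = unquote(sent)

# Only an additional HTML-style unescaping step recovers the character.
recovered = html.unescape(received)

print(sent, received, recovered)
```

The handler thus receives seven ASCII characters where the user typed one letter.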

Sometimes you may find it useful to keep special characters as character references. You need to be careful, however. If you use normal tools or algorithms to HTML escape the data retrieved from form fields, you would escape & as &amp; and break the idea. On the other hand, plain & in the data needs to be escaped. In principle, we cannot distinguish the string "&#937;" that a browser generated from Ω from the same string typed by the user. Effectively, you need to treat them as equivalent, as a matter of form handler functionality, and you need to use an HTML escape method that leaves character references intact. On the other hand, you could avoid the problem altogether.
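An escape method that leaves character references intact might be sketched as follows. The Python function `escape_keeping_references` is hypothetical, and the sketch assumes that only numeric references (decimal or hexadecimal) should be preserved:

```python
import re

# An ampersand that does NOT start a numeric character reference.
_BARE_AMP = re.compile(r"&(?!#(?:[0-9]+|[xX][0-9a-fA-F]+);)")

def escape_keeping_references(text: str) -> str:
    """HTML escape text but keep numeric character references as they are."""
    text = _BARE_AMP.sub("&amp;", text)
    text = text.replace("<", "&lt;").replace(">", "&gt;")
    return text.replace('"', "&quot;")

print(escape_keeping_references("&#937; & <b>"))
```

A reference typed by the user is preserved in the same way as one generated by a browser, which matches the decision to treat the two as equivalent.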

By using UTF-8, you avoid the problem, since all Unicode characters are representable in it. On the other hand, you need to handle the encoding, and this would be nontrivial, if your server-side programming language does not support Unicode. However, even when you need to process the data as an octet sequence to be interpreted by your code, you can process ASCII data easily: all octets in the range 0..7F are ASCII characters.
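The property mentioned above means that UTF-8 data can be split at ASCII delimiters without decoding it first. A small sketch in Python, with hypothetical sample data:

```python
# Octets as a form handler might receive them after URL decoding.
raw = "caf\u00e9,\u03a9mega".encode("utf-8")

# Splitting at the ASCII comma (octet 2C) is safe: octets in the range
# 0..7F never occur inside the encoded form of a non-ASCII character.
parts = raw.split(b",")

# Every octet of a non-ASCII character is above 7F.
high_octets = [octet for octet in raw if octet > 0x7F]

print(parts)
print([hex(octet) for octet in high_octets])
```

Each part can then be decoded, stored, or passed on as UTF-8 separately.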

11.5.4. Using UTF-8

The following demonstration code is a Perl script, intended to be used as a CGI script, and it uses the CGI.pm library (see http://search.cpan.org/dist/CGI.pm/CGI.pm) for the creation of an HTML form and for processing the form data. The script creates a UTF-8 encoded HTML document containing a form and decodes the form data into UTF-8 format, and then writes the data to a file in UTF-8 encoding, in append mode. (In real life, you would want to include some checks against excessive amounts of data and other abuse.)

#!/usr/local/gnu/bin/perl
use CGI qw(:standard);
use Encode;

# Send the response as UTF-8 and declare that in the HTTP header.
binmode STDOUT, ":utf8";
print header(-charset => 'utf-8');
print start_html(-title => 'Collecting words', -encoding => 'utf-8'),
      h1('Collecting words');

if (param()) {
    # Form data was submitted: append the word to the file as UTF-8.
    if (open(OUT, ">>:utf8", "words.txt")) {
        $word = Encode::decode_utf8(param('word'));
        print OUT "$word\n";
        print p("Thank you for \x{201c}$word\x{201d}!");
    }
    else {
        print p("Internal error, sorry!");
        exit(0);
    }
}
else {
    # No form data yet: show the form.
    print start_form,
          "Some word(s): ", textfield('word'),
          submit(-name => 'Submit'),
          end_form;
}
print end_html;

11.5.5. Submitting a File

When you use a form with a file input field (<input type="file">), the browser creates a special input widget where the user can pick a file from his system. The contents of the file will be included into the form data as one of the parts of a multipart message. The part has headers of its own, where the encoding could be specified. However, in practice, the browser will just copy the contents of the file octet by octet, and it will insert a header that specifies the media type of the data according to the file system properties. For example, if the filename suffix is .txt, the browser includes a header that specifies the media type as text/plain without charset indication.

The conclusion is that the encoding and even media types of submitted files remain unknown. Human intervention or application-related heuristics are needed to deduce such information. In some cases, you might include a field where the user can specify the encoding of a file, but this would probably be too challenging for most users.
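To illustrate what a form handler actually sees, the following Python sketch parses a hand-written multipart body with the standard `email` package. The boundary, field name, filename, and file contents are all made up for the example, and the fallback to windows-1252 is just one possible heuristic, not a recommendation:

```python
from email import message_from_bytes

# A hypothetical multipart/form-data submission with one file part.
body = (
    b"Content-Type: multipart/form-data; boundary=XyZ\r\n"
    b"\r\n"
    b"--XyZ\r\n"
    b'Content-Disposition: form-data; name="upload"; filename="notes.txt"\r\n'
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"caf\xe9\r\n"
    b"--XyZ--\r\n"
)

part = message_from_bytes(body).get_payload()[0]
media_type = part.get_content_type()    # taken from the part's header
charset = part.get_content_charset()    # None: no charset was indicated
octets = part.get_payload(decode=True)  # the file contents as raw octets

# Without a charset, the handler can only guess. Here: try UTF-8 first,
# then fall back to windows-1252 (an assumption that may well be wrong).
try:
    text = octets.decode("utf-8")
except UnicodeDecodeError:
    text = octets.decode("windows-1252")

print(media_type, charset, text.strip())
```

The guess happens to work for this sample, but the same octets would be a different character in, say, ISO-8859-7, and nothing in the submission tells the handler which one was meant.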