Character encodings


Early applications were typically built for a single language. Those were the days of "code pages". A code page described how binary values mapped to human-readable characters, and a running program was assumed to operate within a single code page. This approach was adequate until internationalization came along and raised the question of how an application could represent text in multiple languages. Hence came character sets and encodings.

Character sets are sets of text and graphic symbols, each mapped to a positive integer. ASCII was one of the first character sets in widespread use. ASCII, though efficient, is good at representing only US English.

A character encoding, as mentioned earlier, maps characters to fixed-width units. It also defines ordering rules and byte serialization guidelines. A single character set can have multiple encodings. For example, Java programs can represent the Cyrillic character set using the KOI8-R or KOI8-U encodings. Unicode enables us to write multilingual applications.
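To make this concrete, here is a minimal sketch (it assumes the Java runtime includes the KOI8-R charset, which standard JREs normally do) showing that the same Cyrillic text serializes to different byte sequences under different encodings:

 import java.io.UnsupportedEncodingException;

 public class EncodingDemo {
     public static void main(String[] args) throws UnsupportedEncodingException {
         // A Russian greeting ("Privet") written with Unicode escapes
         String word = "\u041F\u0440\u0438\u0432\u0435\u0442";

         // The same six characters become different byte sequences under different encodings.
         System.out.println("KOI8-R bytes: " + word.getBytes("KOI8-R").length); // 6  (one byte per letter)
         System.out.println("UTF-8 bytes:  " + word.getBytes("UTF-8").length);  // 12 (two bytes per letter)

         // Plain ASCII text stays one byte per character in UTF-8.
         System.out.println("UTF-8 bytes for hello: " + "hello".getBytes("UTF-8").length); // 5
     }
 }
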

Other examples of encodings include the ISO 8859 family, UTF-8, and so on. UTF-8 (Unicode Transformation Format) encodes Unicode characters as one to four bytes; a UTF-8 byte whose high-order bit is zero is equivalent to a 7-bit ASCII character. You might have come across many JSP pages that have a line that looks like:

 <%@ page contentType="text/html;charset=UTF-8" language="java" %> 

Here, charset=UTF-8 indicates that the page uses a response encoding of UTF-8. When internationalizing the web tier, you need to consider three types of encodings:

  • Request encoding

  • Page encoding

  • Response encoding

Request encoding is the encoding used for the request parameters. Browsers typically send the request encoding in the Content-Type header. If it is not present, the servlet container uses ISO-8859-1 as the default encoding.
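A minimal servlet sketch of how this plays out (the servlet and parameter names are hypothetical) is shown below; note that the encoding must be set before the first parameter is read:

 import java.io.IOException;
 import javax.servlet.ServletException;
 import javax.servlet.http.HttpServlet;
 import javax.servlet.http.HttpServletRequest;
 import javax.servlet.http.HttpServletResponse;

 public class GreetingServlet extends HttpServlet {
     protected void doPost(HttpServletRequest request, HttpServletResponse response)
             throws ServletException, IOException {
         // getCharacterEncoding() returns null when the browser sent no charset in the
         // Content-Type header; the container would then fall back to ISO-8859-1.
         if (request.getCharacterEncoding() == null) {
             // Must be called before the first getParameter() call to take effect.
             request.setCharacterEncoding("UTF-8");
         }
         String name = request.getParameter("name"); // decoded using the chosen encoding
         // ... process the parameter ...
     }
 }
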

Page encoding is used in JSP pages to indicate the character encoding for that file. You can find the page encoding from:

  • The Page Encoding value of a JSP property group whose URL pattern matches the page. To see how JSP property groups work, you can go to the following URL: http://java.sun.com/j2ee/1.4/docs/tutorial/doc/JSPIntro13.html#wp72193

  • The pageEncoding attribute specified along with the page directive in a JSP page. If the value of the pageEncoding attribute differs from the value specified in the JSP property group, a translation error can occur.

  • The CHARSET value of the contentType attribute in the page directive.

If none of these encodings are mentioned, then the default encoding of ISO-8859-1 is used.

Response encoding is the encoding of the text response sent by a servlet or a JSP page. This encoding governs how the output is rendered in the client's browser and is based on the client's locale. The web container sets the response encoding from one of the following:

  • The CHARSET value of the contentType attribute in the page directive.

  • The encoding in the pageEncoding attribute of the page directive

  • The Page Encoding value of a JSP property group whose URL pattern matches the page

If none of these encodings are mentioned, then the default encoding of ISO-8859-1 is used.
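As an illustration, here is a minimal servlet sketch (the class name and markup are hypothetical) that sets the response encoding explicitly before writing non-Latin-1 output:

 import java.io.IOException;
 import java.io.PrintWriter;
 import javax.servlet.ServletException;
 import javax.servlet.http.HttpServlet;
 import javax.servlet.http.HttpServletRequest;
 import javax.servlet.http.HttpServletResponse;

 public class HelloServlet extends HttpServlet {
     protected void doGet(HttpServletRequest request, HttpServletResponse response)
             throws ServletException, IOException {
         // Must be set before getWriter() is called; otherwise ISO-8859-1 is used.
         response.setContentType("text/html; charset=UTF-8");
         PrintWriter out = response.getWriter();
         out.println("<html><body>");
         out.println("\u041F\u0440\u0438\u0432\u0435\u0442"); // non-Latin-1 text; renders correctly only if the charset above matches
         out.println("</body></html>");
     }
 }
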

Early on, when internationalization of computer applications became popular, there was a boom in the number of available encodings. Unfortunately, no single encoding could cover multiple languages. For instance, no one encoding in the European Union covered all the European languages, so several encodings had to be created to cover them. This made the problem worse, because different encodings could use the same number to represent different characters in different languages. The result: a higher chance of data corruption.

A big company had its applications working well with a default locale of US English, until it decided to go global. One of the requirements was to support Chinese characters. The application code was modified accordingly, but each time the application ran it could not produce meaningful output; the text appeared distorted. The culprit was the database encoding.

Chinese, like Korean and Japanese, has a writing system that cannot be represented by single-byte encodings such as ASCII and EBCDIC. These languages need at least a Double Byte Character Set (DBCS) encoding to handle their characters. Once the database was updated to support a DBCS encoding, the applications worked fine. Problems like these led to the creation of a universal character-encoding format called Unicode.
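A quick way to see the limitation (a minimal sketch; the class name is hypothetical):

 import java.nio.charset.Charset;

 public class DbcsCheck {
     public static void main(String[] args) throws Exception {
         // The word for "Chinese" written with Unicode escapes
         String chinese = "\u4E2D\u6587";

         // A single-byte encoding such as ISO-8859-1 simply has no code points for these characters.
         System.out.println(Charset.forName("ISO-8859-1").newEncoder().canEncode(chinese)); // false

         // A multi-byte encoding can represent them; UTF-8 uses three bytes per character here.
         System.out.println(chinese.getBytes("UTF-8").length); // 6
     }
 }
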

Unicode is a 16-bit character encoding that assigns a unique number to each character in the major languages of the world. Though it can officially support up to 65,536 characters, it also reserves some code points for mapping into additional 16-bit planes, giving it the potential to cope with over a million unique characters. Unicode is more efficient because it defines a standardized character set that represents most of the commonly used languages, and it can be extended to accommodate additions. When a Java program's source encoding is not Unicode compliant, Unicode characters are represented as escape sequences of the form \uXXXX, where XXXX is the character's 16-bit representation in hexadecimal.
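For example, the following sketch (the class name is hypothetical) keeps the source file pure ASCII by writing an accented character as a Unicode escape, so it compiles the same way regardless of the encoding the file is saved in:

 public class UnicodeEscapes {
     public static void main(String[] args) {
         char eAcute = '\u00E9';            // "e" with an acute accent
         String cafe = "caf\u00E9";

         System.out.println(cafe);                         // the accented word (on a capable console)
         System.out.println((int) eAcute);                 // 233 -- the character's code point
         System.out.println(Integer.toHexString(eAcute));  // e9
     }
 }
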

Struts and character encoding

Setting the character encoding in the web application requires the following steps:

  1. Configure the servlet container to support the desired encoding. For instance, you have to set the servlet container to interpret the input as UTF-8 for Unicode. This configuration is vendor dependent; a container-independent filter sketch is shown after this list.

  2. Set the response content type to the required encoding (e.g. UTF-8). In Struts 1.1, this information is specified in the <controller> element in struts-config.xml using the contentType attribute.

  3. This can also be set in the JSPs with the page directive as follows:

     <%@ page contentType="text/html; charset=UTF-8" %>
  4. Next add the following line in the HTML <head>:

     <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  5. Make sure you are using the I18N version rather than the US version of the JRE. (If you are using the JDK, this problem may not arise.)

  6. Make sure that the database encoding is also set to Unicode.
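The container-independent complement to step 1 mentioned above is a servlet filter that forces the request encoding before Struts populates its form beans. This is a common technique rather than something prescribed by Struts itself; the class name is hypothetical, and the filter would still need a filter mapping in web.xml:

 import java.io.IOException;
 import javax.servlet.Filter;
 import javax.servlet.FilterChain;
 import javax.servlet.FilterConfig;
 import javax.servlet.ServletException;
 import javax.servlet.ServletRequest;
 import javax.servlet.ServletResponse;

 public class EncodingFilter implements Filter {
     private String encoding = "UTF-8";

     public void init(FilterConfig config) throws ServletException {
         // Optionally read the encoding from an init parameter in web.xml.
         String param = config.getInitParameter("encoding");
         if (param != null) {
             encoding = param;
         }
     }

     public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
             throws IOException, ServletException {
         // Only override when the browser did not declare a charset itself.
         if (request.getCharacterEncoding() == null) {
             request.setCharacterEncoding(encoding);
         }
         chain.doFilter(request, response);
     }

     public void destroy() {
     }
 }
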

Note

Setting <html:html locale="true"> doesn't set the encoding of the output stream. It is only a signal to Struts to use the locale-specific resource bundle.

native2ascii conversion

Java programs can process only those files that are in the Latin-1 (ISO 8859-1) encoding or in Unicode; files in any other encoding will not be processed correctly. The native2ascii tool converts such non-Latin-1, non-Unicode files into files that Java can read: any character that is not in ISO 8859-1 is written as a Unicode escape. For example, if you have a file encoded in another language, say myCyrillicFile in Cyrillic, you can use the native2ascii tool to convert it as follows:

 native2ascii -encoding UTF-8 myCyrillicFile myUnicodeFile 

You can use other encodings besides UTF-8 too. Use this tool on the Struts properties files (message resource bundles) that contain non-Latin-1 text. Without this conversion, the Struts application (or Java, for that matter) will not be able to interpret the encoded text, and <bean:message> and <html:errors/> will display garbage.
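To see why the conversion matters, consider this minimal sketch (the bundle name and key are hypothetical). ResourceBundle reads .properties files as ISO 8859-1, so non-Latin-1 text survives only as the Unicode escapes that native2ascii produces:

 import java.util.Locale;
 import java.util.ResourceBundle;

 public class MessageCheck {
     public static void main(String[] args) {
         // Hypothetical bundle: ApplicationResources_ru.properties produced by native2ascii,
         // containing a line such as  welcome.title=\u041F\u0440\u0438\u0432\u0435\u0442
         ResourceBundle bundle =
                 ResourceBundle.getBundle("ApplicationResources", new Locale("ru"));

         // The Cyrillic text is recoverable only because it was stored as Unicode escapes.
         System.out.println(bundle.getString("welcome.title"));
     }
 }
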



