13.1 Western European Languages

Let's begin with a look at how a servlet outputs a page written in a Western European language such as English, Spanish, German, French, Italian, Dutch, Norwegian, Finnish, or Swedish. As our example, we'll say "Hello World!" in Spanish, generating a page similar to the one shown in Figure 13-1.

Figure 13-1. En Español: ¡Hola Mundo!

Notice the use of the special characters ñ and ¡. Characters such as these, while scarce in English, are prevalent in Western European languages. Servlets have two ways to generate these characters: with HTML character entities or Unicode escape sequences.

13.1.1 HTML Character Entities

HTML 2.0 introduced the ability for specific sequences of characters in an HTML page to be displayed as a single character. The sequences, called character entities, begin with an ampersand (&) and end with a semicolon (;). Character entities can be either named or numbered. For example, the named character entity &ntilde; represents ñ, while &iexcl; represents ¡. A complete listing of special characters and their names is given in Appendix E. Example 13-1 shows a servlet that uses named entities to say "Hello World" in Spanish.

Example 13-1. Hello to Spanish Speakers, Using Named Character Entities

import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class HelloSpain extends HttpServlet {

  public void doGet(HttpServletRequest req, HttpServletResponse res)
                               throws ServletException, IOException {
    res.setContentType("text/html");
    PrintWriter out = res.getWriter();

    // Tell the client the page is written in Spanish
    res.setHeader("Content-Language", "es");

    // &ntilde; and &iexcl; are named character entities
    out.println("<HTML><HEAD><TITLE>En Espa&ntilde;ol</TITLE></HEAD>");
    out.println("<BODY>");
    out.println("<H3>En Espa&ntilde;ol:</H3>");
    out.println("&iexcl;Hola Mundo!");
    out.println("</BODY></HTML>");
  }
}

You may have noticed that, in addition to using character entities, this servlet sets its Content-Language header to the value es. The Content-Language header is used to specify the language of the following entity body. In this case, the servlet uses the header to indicate to the client that the page is written in Spanish (Español). Most clients ignore this information, but it's polite to send it anyway. Languages are always represented using two-character lowercase abbreviations. For a complete listing, see the ISO-639 standard at http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt.
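As a small illustration of these language codes, the following sketch derives a Content-Language value from a java.util.Locale. The class name LanguageHeader and the method contentLanguageFor are hypothetical names chosen for this example; only Locale and its getLanguage() method come from the JDK.

```java
import java.util.Locale;

// Hypothetical helper, not part of the Servlet API.
class LanguageHeader {

  // Returns the lowercase language code held by the locale, such as "es".
  static String contentLanguageFor(Locale locale) {
    return locale.getLanguage();
  }

  public static void main(String[] args) {
    Locale spanish = Locale.forLanguageTag("es");
    System.out.println(contentLanguageFor(spanish));
    System.out.println(spanish.getDisplayLanguage(Locale.ENGLISH));
  }
}
```

In a servlet, the returned value could be passed straight to res.setHeader("Content-Language", ...), as HelloSpain does with the literal "es".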

Character entities can also be referenced by number. For example, &#241; represents ñ, and &#161; represents ¡. The number corresponds to the character's ISO-8859-1 (Latin-1) decimal value, which will be discussed later in this chapter. A complete listing of the numeric values for character entities can also be found in Appendix E. Example 13-2 shows HelloSpain rewritten using numeric entities.

Example 13-2. Hello to Spanish Speakers, Using Numbered Character Entities

import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class HelloSpain extends HttpServlet {

  public void doGet(HttpServletRequest req, HttpServletResponse res)
                               throws ServletException, IOException {
    res.setContentType("text/html");
    PrintWriter out = res.getWriter();

    // Tell the client the page is written in Spanish
    res.setHeader("Content-Language", "es");

    // &#241; and &#161; are numbered character entities
    out.println("<HTML><HEAD><TITLE>En Espa&#241;ol</TITLE></HEAD>");
    out.println("<BODY>");
    out.println("<H3>En Espa&#241;ol:</H3>");
    out.println("&#161;Hola Mundo!");
    out.println("</BODY></HTML>");
  }
}
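Numbered entities are tedious to write by hand, so a servlet might generate them. The following is a minimal sketch, assuming it is enough to replace every character outside 7-bit US-ASCII with its decimal code. EntityEncoder and toNumericEntities are hypothetical names, not part of the Servlet API, and the helper does not escape markup characters such as & or <.

```java
// Hypothetical helper, not part of the Servlet API.
class EntityEncoder {

  // Replaces each non-ASCII character with a numbered character entity.
  // For the Latin-1 range the number is the character's decimal value.
  static String toNumericEntities(String text) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      if (cp > 127) {
        sb.append("&#").append(cp).append(';');
      } else {
        sb.appendCodePoint(cp);
      }
      i += Character.charCount(cp);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // prints &#161;Hola Mundo!
    System.out.println(toNumericEntities("\u00a1Hola Mundo!"));
  }
}
```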

Unfortunately, there's one major problem with the use of character entities: they work only for HTML pages. If the servlet's output isn't HTML, the page looks something like Figure 13-2. To handle non-HTML output, we need to use Unicode escapes.

Figure 13-2. Not quite Spanish

13.1.2 Unicode Escapes

In Java, characters, strings, and identifiers are internally composed of 16-bit (2-byte) Unicode 2.0 characters. Unicode was established by the Unicode Consortium, which describes the standard as follows (see http://www.unicode.org):

The Unicode Worldwide Character Standard is a character coding system designed to support the interchange, processing, and display of the written texts of the diverse languages of the modern world. In addition, it supports classical and historical texts of many written languages.

In its current version (2.0), the Unicode standard contains 38,885 distinct coded characters derived from the Supported Scripts. These characters cover the principal written languages of the Americas, Europe, the Middle East, Africa, India, Asia, and Pacifica.

For more information on Unicode see http://www.unicode.org. Also see The Unicode Standard, Version 2.0 (Addison-Wesley). Note that although Unicode 3.0 has become available, Java continues to support Version 2.0.

Java's use of Unicode is very important to this chapter because it means a servlet can internally represent essentially any character in any commonly used written language. We can represent 16-bit Unicode characters in 7-bit US-ASCII source code using Unicode escapes of the form \uxxxx, where xxxx is a sequence of four hexadecimal digits. The Java compiler interprets each Unicode escape sequence as a single character.
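To make the single-character claim concrete, the sketch below checks that an escape inside a string literal counts as one character and prints that character's numeric value. The class name UnicodeEscapeDemo and the helper toEscape are hypothetical names used only for illustration.

```java
// Hypothetical demonstration class.
class UnicodeEscapeDemo {

  // The escape in this literal stands for a single character (n with tilde).
  static final String ESCAPED = "Espa\u00f1ol";

  // Formats a character as a backslash-u escape with four hex digits.
  static String toEscape(char c) {
    return String.format("\\u%04x", (int) c);
  }

  public static void main(String[] args) {
    System.out.println(ESCAPED.length());          // prints 7
    System.out.println((int) ESCAPED.charAt(4));   // prints 241
    System.out.println(toEscape(ESCAPED.charAt(4)));
  }
}
```

The value 241 is the same number used by the numbered entity &#241;, written in decimal rather than hexadecimal.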

Conveniently, and not coincidentally, the first 256 characters of Unicode (\u0000 to \u00ff) correspond to the 256 characters of ISO-8859-1 (Latin-1). Thus, the ñ character can be written as \u00f1 and the ¡ character can be written as \u00a1. A complete listing of the Unicode escape sequences for ISO-8859-1 characters is also included in Appendix E. Example 13-3 shows HelloSpain rewritten using Unicode escapes.

Example 13-3. Hello to Spanish Speakers, Using Unicode Escapes

import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class HelloSpain extends HttpServlet {

  public void doGet(HttpServletRequest req, HttpServletResponse res)
                               throws ServletException, IOException {
    res.setContentType("text/plain");
    PrintWriter out = res.getWriter();

    // Tell the client the page is written in Spanish
    res.setHeader("Content-Language", "es");

    // The escapes below stand for the Spanish special characters
    out.println("En Espa\u00f1ol:");
    out.println("\u00a1Hola Mundo!");
  }
}

The output from this servlet displays correctly when used as part of an HTML page or when used for plain-text output.
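The correspondence between the first 256 Unicode characters and ISO-8859-1 can be checked outside a servlet. The following sketch, with the hypothetical class name Latin1Demo, encodes the string from Example 13-3 as Latin-1 and shows that each character becomes a single byte whose value matches its Unicode escape.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class.
class Latin1Demo {

  // Encodes the text as ISO-8859-1 (Latin-1), one byte per character.
  static byte[] toLatin1(String text) {
    return text.getBytes(StandardCharsets.ISO_8859_1);
  }

  public static void main(String[] args) {
    byte[] bytes = toLatin1("Espa\u00f1ol");
    System.out.println(bytes.length);                          // prints 7
    System.out.println(Integer.toHexString(bytes[4] & 0xff));  // prints f1
  }
}
```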
