Translation Issues | Lessons Learned in Software Testing

Although translation is just a part of the overall localization effort, it's an important one from a test standpoint. The most obvious problem is how to test something that's in another language. Well, you or someone on your test team will need to be at least semi-fluent in the language you're testing, being able to navigate the software, read any text it displays, and type the necessary commands to run your tests. It might be time to sign up for the community college course in Slovenian you always wanted to take.

NOTE

It's important that you or someone on your test team be at least a little familiar with the language you're testing. Of course, if you're shipping your program in 32 different languages, that may be difficult. The solution is to contract out this work to a localization testing company. Numerous such companies worldwide can perform testing in nearly any language. For more information, search the Internet for "localization testing."

It's not a requirement that everyone on the test team speak the language that the software is being localized into; you probably need just one person. Many things can be checked without knowing what the words say. It would be helpful, sure, to know a bit of the language, but you'll see that you might be able to do a fair amount of the testing without being completely fluent.

Text Expansion

The most straightforward example of a translation problem that can occur is due to something called text expansion. Although English may appear at times to be wordy, it turns out that when English is translated into other languages, often more characters are necessary to say the same thing. Figure 10.1 shows how the size of a button needs to expand to hold the translated text of two common computer words. A good rule of thumb is to expect up to a 100 percent increase in size for individual words (on a button, for example). Expect a 50 percent increase in size for sentences and short paragraphs, the typical phrases you would see in dialog boxes and error messages.

Figure 10.1. When translated into other languages, the words Minimize and Maximize can vary greatly in size, often forcing the UI to be redesigned to accommodate them.

Because of this expansion, you need to carefully test areas of the software that could be affected by longer text. Look for text that doesn't wrap correctly, is truncated, or is hyphenated incorrectly. This could occur anywhere: onscreen, in windows, boxes, buttons, and so on. Also look for cases where the text had enough room to expand, but did so by pushing something else out of the way.
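The rule of thumb above can be turned into a rough automated check. The following is a minimal sketch, assuming that character count is an acceptable stand-in for on-screen width (it often isn't, so treat a flag as a hint to look at the real UI). The function name, the growth limits as constants, and the sample string pairs are all illustrative, not taken from any real product.

```python
# Growth limits from the rule of thumb: single words may double in
# length, sentences and short paragraphs may grow by half.
WORD_GROWTH = 2.0
SENTENCE_GROWTH = 1.5


def exceeds_expansion_budget(english: str, translated: str) -> bool:
    """Return True if the translation is longer than the rule of thumb allows."""
    is_single_word = len(english.split()) == 1
    limit = WORD_GROWTH if is_single_word else SENTENCE_GROWTH
    return len(translated) > len(english) * limit


# Illustrative string pairs (hypothetical, for demonstration only).
pairs = [
    ("Minimize", "Minimieren"),
    ("OK", "Aceptar"),
    ("File not found.", "Datei nicht gefunden."),
]

for english, translated in pairs:
    if exceeds_expansion_budget(english, translated):
        print(f"Check layout: {english!r} -> {translated!r}")
```

Very short English strings such as OK are the ones most likely to be flagged, which matches where buttons tend to break in practice.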

Another possibility is that this longer text can cause a major program failure or even a system crash. A programmer could have allocated enough internal memory for the English text messages, but not enough for the translated strings. The English version of the software will work fine but the German version will crash when the message is displayed. A white-box tester could catch this problem without knowing a single word of the language.
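A white-box check for this kind of failure might look like the sketch below. It assumes a hypothetical fixed buffer of 32 bytes and made-up message text; the point is only that the comparison is done on encoded byte length, which needs no knowledge of what the words mean.

```python
BUFFER_SIZE = 32  # hypothetical fixed buffer, in bytes, sized for the English text

# Illustrative messages; a real test would read them from the resource files.
messages = {
    "en": "Disk full.",
    "de": "Der Datenträger ist voll. Bitte Speicherplatz freigeben.",
}


def fits_in_buffer(text: str, size: int = BUFFER_SIZE, encoding: str = "utf-8") -> bool:
    """Return True if the encoded text plus a terminating NUL fits in the buffer."""
    return len(text.encode(encoding)) + 1 <= size


for language, text in messages.items():
    if not fits_in_buffer(text):
        print(f"{language}: message is too long for a {BUFFER_SIZE}-byte buffer")
```

Note that accented characters can take more than one byte when encoded, so counting characters instead of bytes would understate the problem.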

ASCII, DBCS, and Unicode

Chapter 5, "Testing the Software with Blinders On," briefly discussed the ASCII character set. ASCII, even in its extended 8-bit form, can represent only 256 different characters, not nearly enough to represent all the possible characters in all languages. When software started being developed for different languages, solutions needed to be found to overcome this limitation. An approach common in the days of MS-DOS, but still in use today, is to use a technique called code pages. Essentially, a code page is a replacement ASCII table, with a different code page for each language. If your software runs in Quebec on a French PC, it could load and use a code page that supports French characters. Russian uses a different code page for its Cyrillic characters, and so on.

This solution is fine, although a bit clunky, for languages with fewer than 256 characters, but Japanese, Chinese, and other languages with thousands of symbols cause problems. A system called DBCS (for Double-Byte Character Set) is used by some software to provide more than 256 characters. Using 2 bytes instead of 1 byte allows for up to 65,536 different characters.

Code pages and DBCS are sufficient in many situations but suffer from a few problems. Most important is the issue of compatibility. If a Hebrew document is loaded onto a German computer running a British word processor, the result can be gibberish. Without the proper code pages or the proper conversion from one to the other, the characters can't be interpreted correctly, or even at all.
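The gibberish effect is easy to reproduce. This sketch uses two Windows code pages that ship with Python's standard codecs, cp1255 (Hebrew) and cp1252 (Western European); the sample word and the scenario of which machine uses which code page are assumptions made for illustration.

```python
original = "\u05e9\u05dc\u05d5\u05dd"  # a short Hebrew sample word

# Saved on a machine using the Hebrew code page.
raw_bytes = original.encode("cp1255")

# Opened on a machine that assumes the Western European code page.
garbled = raw_bytes.decode("cp1252")

# Opened with the code page the file was actually written in.
recovered = raw_bytes.decode("cp1255")

print(garbled)                # accented Latin letters, not Hebrew
print(recovered == original)  # the bytes themselves were never damaged
```

The bytes are identical in both cases; only the table used to interpret them differs, which is why the problem shows up when documents move between machines.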

The solution to this mess is the Unicode standard.

     Unicode provides a unique number for every character,
              no matter what the platform,
              no matter what the program,
              no matter what the language.
"What is Unicode?" from the Unicode Consortium website, www.unicode.org

Because Unicode is a worldwide standard supported by the major software companies, hardware manufacturers, and other standards groups, it's becoming more commonplace. Most major software applications support it. Figure 10.2 shows many of the different characters supported. If it's at all possible that your software will ever be localized, you and the programmers on your project should cut your ties to "ol' ASCII" and switch to Unicode to save yourself time, aggravation, and bugs.

Figure 10.2. This Microsoft Word dialog shows support for the Unicode standard.
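The "unique number for every character" idea can be inspected directly. A minimal sketch, assuming Python's built-in `ord` and the UTF-8 encoding as one possible byte representation; the helper name `describe` and the sample characters are illustrative.

```python
def describe(ch: str) -> str:
    """Return the Unicode code point and UTF-8 bytes of one character."""
    return f"U+{ord(ch):04X} utf-8={ch.encode('utf-8').hex()}"


# A Latin letter, two accented/extended letters, and a CJK character.
for ch in ["A", "\u00e9", "\u00df", "\u65e5"]:
    print(ch, describe(ch))
```

The code point is the same on every platform; how many bytes it occupies depends on the encoding chosen to store it.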

Hot Keys and Shortcuts

In English, it's Search. In French, it's Rechercher. If the hotkey for selecting Search in the English version of your software is Alt+S, that will need to change in the French version, where the word no longer begins with S.

In localized versions of your software, you'll need to test that all the hotkeys and shortcuts work properly and aren't too difficult to use (for example, by requiring a third keypress). And don't forget to check that the English hotkeys and shortcuts are disabled.
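Part of this checking can be scripted if the menu labels are available as text. The sketch below assumes the common convention of marking the hotkey letter with an ampersand (as in "&Search"); the French labels are hypothetical, and the function names are invented for this example. It reports letters assigned to more than one item and items with no hotkey at all.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def hotkey_of(label: str) -> Optional[str]:
    """Return the lowercase letter marked by '&', or None if none is marked."""
    index = label.find("&")
    if index == -1 or index + 1 >= len(label):
        return None
    return label[index + 1].lower()


def find_conflicts(labels: List[str]) -> Dict[str, List[str]]:
    """Map each hotkey letter used more than once to the labels that use it."""
    by_key = defaultdict(list)
    for label in labels:
        key = hotkey_of(label)
        if key is not None:
            by_key[key].append(label)
    return {key: names for key, names in by_key.items() if len(names) > 1}


# Hypothetical localized menu, for illustration only.
french_menu = ["&Rechercher", "&Remplacer", "&Fichier", "Aide"]

print(find_conflicts(french_menu))
print([label for label in french_menu if hotkey_of(label) is None])
```

A script like this finds collisions and omissions; it cannot tell you whether the chosen keys are comfortable to press on the target keyboard layout, which still needs a hands-on test.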

Extended Characters

A common problem with localized software, and even non-localized software, is in its handling of extended characters. Referring back to that ancient ASCII table, extended characters are the ones that fall outside the normal English alphabet of A-Z and a-z. Examples of these would be the accented characters such as the é in José or the ñ in El Niño. They also include the many symbol characters that aren't on your typical keyboard. If your software is properly written to use Unicode or even if it correctly manages code pages or DBCS, this shouldn't be an issue, but a tester should never assume anything, so it's worthwhile to check.

The way to test this is to look for all the places that your software can accept character input or send output. In each place, try to use extended characters to see if they work just as regular characters would. Dialog boxes, logins, and any text fields are fair game. Can you send and receive extended characters through a modem? Can you name your files with them or even have the characters in the files? Will they print out properly? What happens if you cut, copy, and paste them between your program and another one?

TIP

The simplest way to ensure that you test for proper handling of extended characters is to add them to your equivalence partition of the standard characters that you test. Along with those bug-prone characters sitting on the ASCII table boundaries, throw in an Æ, an Ø and a ß.
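One of the checks described above, whether extended characters survive being written to a file and read back, can be sketched as follows. The sample list, the function name, and the choice of UTF-8 are assumptions for illustration; a real test would exercise the product's own save and load paths.

```python
import os
import tempfile

# Extended-character samples: accented names plus a few single letters.
EXTENDED_SAMPLES = ["Jos\u00e9", "El Ni\u00f1o", "\u00c6", "\u00d8", "\u00df"]


def survives_file_round_trip(text: str, encoding: str = "utf-8") -> bool:
    """Write text to a temporary file, read it back, and compare."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "sample.txt")
        with open(path, "w", encoding=encoding) as handle:
            handle.write(text)
        with open(path, "r", encoding=encoding) as handle:
            return handle.read() == text


for sample in EXTENDED_SAMPLES:
    print(sample, survives_file_round_trip(sample))
```

The same pattern (send the characters in, get them back out, compare) applies to the clipboard, printing, and network paths, although those need product-specific harnesses.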

Computations on Characters

Related to extended characters are problems with how they're interpreted by software that performs calculations on them. Two examples of this are word sorting and upper- and lowercase conversion.

Does your software sort or alphabetize word lists? Maybe in a list box of selectable items such as filenames or website addresses? If so, how would you sort the following words?

Kopiëren	Reiste
Ärmlich	Arg
Reiskorn	résumé
Reißaus	kopieën
reiten	Reisschnaps
reißen	resume

If you're testing software to be sold to one of the many Asian cultures, are you aware that the sort order is based on the order of the brush strokes used to paint the character? The preceding list would likely have a completely different sort order if written in Mandarin Chinese. Find out what the sorting rules are for the language you're testing and develop tests to specifically check that the proper sort order occurs.
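To see why a default sort is not enough, compare a plain code-point sort of the word list with a crude accent- and case-insensitive one. This is only a sketch: the key function below is an approximation invented for this example, not the collation rules of German, Dutch, or any other language, and real software should use the collation support of its platform or a library built for the purpose.

```python
import unicodedata

words = ["Kopiëren", "Reiste", "Ärmlich", "Arg", "Reiskorn", "résumé",
         "Reißaus", "kopieën", "reiten", "Reisschnaps", "reißen", "resume"]


def crude_sort_key(word: str) -> str:
    """Fold case, expand ß to ss, and strip accents (an approximation only)."""
    folded = word.casefold()
    decomposed = unicodedata.normalize("NFKD", folded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Plain sort compares raw code points, so capitalized words come before
# lowercase ones and accented letters land far from their base letters.
naive = sorted(words)

# Sorting on the folded key keeps accented and unaccented spellings together.
approximate = sorted(words, key=crude_sort_key)

print(naive)
print(approximate)
```

A tester can use the difference between the two lists as a quick signal that the software is sorting on raw character codes.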

The other area where calculation on extended characters breaks down is with upper- and lowercase conversion. It's a problem because the "trick" solution that many programmers learn in school is to simply add or subtract 32 to the ASCII value of the letter to convert it between cases. Add 32 to the ASCII value of A and you get the ASCII value of a. Unfortunately, that doesn't work for extended characters. If you tried this technique using the Apple Mac extended character set, you'd convert Ñ (ASCII 132) to § (ASCII 164) instead of ñ (ASCII 150), which is not exactly what you'd expect.
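The add-32 trick and its failure can be reproduced with Python's `mac_roman` codec, which implements the classic Mac character set. A minimal sketch; the function name is invented for this example.

```python
MAC = "mac_roman"  # Python's codec for the classic Mac OS character set


def naive_to_lower(code: int) -> int:
    """The classroom trick: add 32 to the character code."""
    return code + 32


# Works for plain English letters.
print(chr(naive_to_lower(ord("A"))))

# Fails for an extended character in the Mac character set.
upper_code = "\u00d1".encode(MAC)[0]                      # code of Ñ
trick_result = bytes([naive_to_lower(upper_code)]).decode(MAC)
correct_result = "\u00d1".lower()                         # Unicode-aware conversion

print(upper_code, trick_result, correct_result)
```

The language's own Unicode-aware conversion (`str.lower` here) gives the right answer because it looks the mapping up rather than computing it from the character code.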

Sorting and case conversion are just two examples. Carefully look at your software to determine if there are other situations where calculations are performed on letters or words. Spell-checking perhaps?

Reading Left to Right and Right to Left

A huge issue for translation is that some languages, such as Hebrew and Arabic, read from right to left, not left to right. Imagine flipping your entire user interface into a mirror image of itself.

Thankfully, most major operating systems provide built-in support for handling these languages. Without this, it would be a nearly impossible task. Even so, it's still not a simple matter of translating the text. It requires a great deal of programming to make use of the OS's features to do the job. From a testing standpoint, it's probably safe to consider it a completely new product, not just a localization.

Text in Graphics

Another translation problem occurs when text is used in graphics. See Figure 10.3 for several examples.

Figure 10.3. Word 2000 has examples of text in bitmaps that would be difficult to translate.

The icons in Figure 10.3 are the standard ones for selecting Bold, Italic, Underline, and Font Color. Since they use the English letters B, I, U, and A, they'll mean nothing to someone from Japan who doesn't read English. They might pick up on the meaning based on their look (the B is a bit dark, the I is leaning, and the U has a line under it), but software isn't supposed to be a puzzle.

The impact of this is that when the software is localized, each icon will have to be changed to reflect the new languages. If there were many of these icons, it could get prohibitively expensive to localize the program. Look for text-in-graphic bugs early in the development cycle so they don't make it through to the end.

Keep the Text out of the Code

The final translation problem to cover is a white-box testing issue: keep the text out of the code. What this means is that all text strings, error messages, and really anything that could possibly be translated should be stored in a separate file independent of the source code. You should never see a line of code such as:

 Print "Hello World"

Most localizers are not programmers, nor do they need to be. It's risky and inefficient to have them modifying the source code to translate it from one language to another. What they should modify is a simple text file, called a resource file, that contains all the messages the software can display. When the software runs, it references the messages by looking them up, not knowing or caring what they say. If the message is in English or Dutch, it gets displayed just the same.
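The lookup idea can be sketched in a few lines. Here two JSON strings stand in for the contents of per-language resource files; the message IDs, the wording, and the function name are all hypothetical, and real projects typically use whatever resource format their platform provides.

```python
import json

# Stand-ins for the contents of two resource files that translators would edit.
EN_RESOURCES = '{"greeting": "Hello World", "disk_full": "The disk is full."}'
NL_RESOURCES = '{"greeting": "Hallo wereld", "disk_full": "De schijf is vol."}'

CATALOGS = {"en": json.loads(EN_RESOURCES), "nl": json.loads(NL_RESOURCES)}


def message(message_id: str, language: str) -> str:
    """Look a message up by ID; the code never contains the text itself."""
    return CATALOGS[language][message_id]


print(message("greeting", "en"))
print(message("greeting", "nl"))
```

The calling code is identical for every language; only the catalog that gets loaded changes.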

That said, it's important for white-box testers to search the code to make sure there are no embedded strings that weren't placed in the external file. It would be pretty embarrassing to have an important error message in a Spanish program appear in English.
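Searching for embedded strings can be partly automated. The sketch below assumes the code under test is Python and uses the standard `ast` module to list every string literal with its line number; for other languages a similar scan would need that language's parser or a text search. It reports all literals, including harmless ones such as message IDs, so the output is a list of candidates to review, not a list of bugs.

```python
import ast
from typing import List, Tuple


def embedded_strings(source: str) -> List[Tuple[int, str]]:
    """Return (line number, text) for every string literal in the source."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            found.append((node.lineno, node.value))
    return sorted(found)


# Hypothetical code under test: one hard-coded message, one resource lookup.
sample = 'print("Hello World")\nprint(message("greeting"))\n'

for line_number, text in embedded_strings(sample):
    print(line_number, repr(text))
```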

Another variation of this problem is when the code dynamically generates a text message. For example, it might piece together snippets of text to create a larger message. The code could take three strings:

"You pressed the"
a variable string containing the name of the key just pressed
"key just in time!"

and put them together to create a message. If the variable string had the value "stop nuclear reaction," the total message would read:

You pressed the stop nuclear reaction key just in time!

The problem is that the word order is not the same in all languages. Although it pieces together nicely in English, with each phrase translated separately, it could be gibberish when stuck together in Mandarin Chinese or even German. Don't let strings creep into the code and don't let them be built up into larger strings by the code.
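One common way to avoid piecing fragments together is to store the whole sentence as a single translatable string with a named placeholder, so the translator decides where the variable part goes. The sketch below assumes that approach; the German wording is illustrative only and would come from a localizer in practice, and the names `TEMPLATES` and `key_pressed_message` are invented for this example.

```python
# Each language gets one complete sentence with a named placeholder.
TEMPLATES = {
    "en": "You pressed the {key} key just in time!",
    # Illustrative German wording; note the placeholder sits in a different spot.
    "de": "Sie haben die Taste {key} gerade noch rechtzeitig gedrückt!",
}


def key_pressed_message(language: str, key_name: str) -> str:
    """Fill the placeholder in the full-sentence template for one language."""
    return TEMPLATES[language].format(key=key_name)


print(key_pressed_message("en", "stop nuclear reaction"))
```

The name of the key itself would also need to come from the resource file in a localized build; it is passed in as plain text here to keep the example short.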