| 7.5. Languages and Fonts
Although languages using the same script may have different typographic conventions and practices, setting the language of text (in markup or with a word processor command) will usually not affect the visible rendering of text. When language-specific rendering is desired, such typographic issues need to be handled at the font level. This typically means that you select the font so that its typographic features are suitable for the main language of your text. However, most widely used fonts are intentionally rather "neutral." In OpenType technology, it is possible to have language-dependent variants of a character as different glyphs within the same font. Software that makes use of such possibilities is still rare.
7.5.1. Example: Shape of the Acute Accent
In Polish typography, the acute accent is more vertical than the acute accent used, for example, in French. It is also positioned differently: more to the right. Commonly used fonts typically make compromises but are closer to French typography. Compare, for example, the French-style é of Times with the mixed-style é of Georgia and the rather Polish-style é of Arial. In practice, selecting a font on such grounds is seldom possible, since there are so many other issues to consider. However, it is one criterion to be considered.
The situation is so frustrating to some people that they claim that the Polish diacritic is not an acute accent at all but a separate diacritic, "kreska." See, for example, the illustrated description at http://www.twardoch.com/download/polishhowto/kreska.html. However, it is unlikely that Unicode will be amended to include the kreska as a separate character. This means that the difference cannot be made at the character level.
In some distant future, we might be able to use fonts that have acute accents of different shapes (e.g., Western, Polish, and Greek) simply by setting the language of the text. At present, don't expect anything like that to happen.
There are, however, issues in East Asian languages that can sometimes be handled, to some extent, by making language-dependent font choices.
7.5.2. Chinese Characters and Language Information
Due to the nature of the Chinese writing system and its unification in Unicode, it is in principle useful to indicate the language of text containing Chinese-Japanese-Korean (CJK) ideographs. This allows the selection of appropriate glyphs as intended, either by the choice of a font or by the choice of glyphs within a font that supports variation by language.
For Chinese, there are two major writing systems, called "Traditional" and "Simplified." The latter is much more common, especially in mainland China. In addition to simplifying the shapes of many characters, it removes some distinctions made in the Traditional script by mapping two or more characters into one. For an illustrated explanation, see http://people.w3.org/rishida/scripts/chinese/.
The CJK ideographs share a common origin but may differ between the languages. The unification process recognized some differences as so essential that different code points were assigned for characters that originate from one old Chinese character. In such cases, it is of course the author's responsibility to use the correct code points; font settings will not help. However, most differences were deemed typographic only. In such cases, a reader is expected to recognize a character in any of the variants (most importantly, Japanese, Chinese Traditional, and Chinese Simplified). Despite this, it is natural to try to make the text appearance correspond to the user's expectations.
In practice, authors mostly decide on the representation of CJK ideographs by selecting a specific font. In particular, when setting the style of some element in a word processor, you might see separate settings for "Asian" or "East-Asian" text and for other text. There you can select a font that is suitable for the language you are using.
In web authoring, you could similarly set a specific font, or a list of alternative fonts, in a stylesheet. Although explicit font settings are still often the most effective way, there are some problems with them. The author's font choice might be ignored or overridden, e.g., because the document has been sent to a computer that lacks the chosen font. The user may dislike a font and may wish to override author-supplied font settings. Moreover, setting a specific font normally means setting it by name, and many fonts exist in different versions (with different character coverage) under the same name.
In web authoring, you can set the language of text in markup, instead of or in addition to suggesting specific fonts. The idea is that browsers may then map different languages to different fonts. For Japanese and Korean, there is no fundamental problem: you would use language codes "ja" and "ko," respectively. For Chinese (code "zh"), things are different, since it is relevant to indicate the difference between the writing systems, "Traditional" and "Simplified." Usually a font that contains CJK characters draws them according to one of these systems, or in the Japanese style, or in the Korean style.
The language codes "zh-CN," "zh-TW," and "zh-HK" have often been used to specify the version of Chinese used. The real purpose has usually been to specify Simplified Chinese when using "zh-CN" and Traditional Chinese in the other cases. The reason is that in mainland China (code CN), the Simplified system is normally used, whereas in Taiwan (code TW) and Hong Kong (code HK), the Traditional system is more common.
Figure 7-3. Effect of language markup on CJK characters on Firefox
It is in principle more appropriate to use script codes, since the issue is really about scripts, not territories. The codes "zh-Hans" and "zh-Hant" denote Simplified and Traditional Chinese, respectively.
Modern software often recognizes them, though some programs might recognize only the previously mentioned notations with territory codes. As you can guess, "s" stands for Simplified and "t" stands for Traditional; "Han" is one of the names of the Chinese writing system.
The potential effect of language markup is illustrated in Figure 7-3, which shows how language markup alone (with no font settings on a web page) may affect the display of CJK ideographs. You may need to take a close look at the ideograph glyphs to see how they differ. In this case, the browser, Mozilla Firefox, uses Japanese glyphs by default. The actual fonts used depend on the settings of the browser.
The effect of language markup on the rendering of CJK ideographs depends on several things, including the browser, its font settings, and its internal logic in selecting glyphs. See some data on this at http://www.w3.org/International/tests/results/langandcjkfont.
At the end of Chapter 1, we mentioned how browsers may let the user select different fonts for different scripts. The script concept used there is not the same as the script concept described in this chapter. Rather, it involves the script proper, the encoding of the page, and other factors. One of the factors might be the script as declared in HTML markup, using, for example, the attribute lang="zh-Hans". Such markup may enable the automatic selection of different fonts for different parts of a document. It is, however, questionable whether this is useful. For example, a Japanese user might prefer seeing even Chinese text written using Japanese glyphs. |
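The relationship between the territory-based codes and the script-based codes discussed in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how any browser actually selects glyphs: the function name chinese_script and the table TERRITORY_TO_SCRIPT are hypothetical, and the territory table contains only the three codes mentioned in the text.

```python
# Territory-to-script convention taken from the text: mainland China (CN)
# normally uses Simplified; Taiwan (TW) and Hong Kong (HK) mostly Traditional.
# Assumption: only these three territories are covered in this sketch.
TERRITORY_TO_SCRIPT = {"CN": "Hans", "TW": "Hant", "HK": "Hant"}


def chinese_script(lang_tag):
    """Return "Hans", "Hant", or None for a language tag such as "zh-TW".

    Hypothetical helper for illustration; not part of any browser or library.
    """
    subtags = lang_tag.split("-")
    if subtags[0].lower() != "zh":
        return None  # not Chinese, so the distinction does not apply
    rest = subtags[1:]
    # An explicit script subtag ("Hans" or "Hant") takes precedence.
    for subtag in rest:
        script = subtag.title()
        if script in ("Hans", "Hant"):
            return script
    # Otherwise fall back to the territory convention.
    for subtag in rest:
        script = TERRITORY_TO_SCRIPT.get(subtag.upper())
        if script:
            return script
    return None  # plain "zh": the tag alone does not say which script


if __name__ == "__main__":
    for tag in ("zh-CN", "zh-TW", "zh-HK", "zh-Hans", "zh-Hant-HK", "zh", "ja"):
        print(tag, "->", chinese_script(tag))
```

Checking the script subtag first reflects the point made above that the issue is really about scripts, not territories; the territory lookup is only a fallback for the older notations. A plain "zh" yields None here, since the tag alone gives no basis for choosing between the two systems.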