5.5. Case Properties

Some writing systems, such as Latin, Greek, and Cyrillic, make a distinction between cases of letters. Historically, uppercase letters, also known as capital letters or majuscules, reflect the original shapes of letters. In the Middle Ages, lowercase letters, also known as small letters or minuscules, were invented to make writing by hand faster. Uppercase letters were preserved for special use, e.g., for emphasis, for abbreviations, and for use as initials in proper names and in the first word of a sentence.

Usually an uppercase letter is larger than the corresponding lowercase letter. In some cases, this is the only essential difference; e.g., compare "O" with "o." Usually there is also a shape difference, which can be considerable; e.g., between "E" and "e." If you see letters of a script unknown to you, you might have difficulty recognizing their case. For example, which of and is uppercase? (Hint: in most fonts, uppercase letters usually do not extend below the baseline of text.)

Not all writing systems make a case distinction, even if they use letters. For example, in the Arabic script there is no such distinction, even though the shape of a letter may vary considerably for other reasons (by position within a word).

The use of uppercase letters varies by language. For example, German writes all nouns with initial capitals, and most European languages write names of months in all lowercase, unlike English. There is also considerable stylistic variation; in some styles, headings and even entire paragraphs are written in all uppercase. The Unicode standard does not try to describe such variation. Instead, it describes properties that can be used to deal with the variation, e.g., to recognize or convert the case of a letter.

5.5.1. Recognizing Uppercase, Lowercase, and Titlecase

The Unicode names of letters generally contain the word "capital" for uppercase letters and the word "small" for lowercase letters. However, there are exceptions to this, and there is no reason to rely on the names. Instead, you can use several defined properties of characters, such as the General Category property values, listed in Table 5-1. The value of the property is Lu for uppercase letters, Ll for lowercase letters, Lt for the few letters that are of a special titlecase form, and Lm or Lo for letters that make no case distinction.

"Titlecase" refers to a character used at the start of a word written with a capital initial, as is common for most words in titles of books, articles, etc., in English. Note that the capitalization conventions of English do not apply to some words like prepositions; thus, not all words in a title begin with a titlecase letter. For most characters, titlecase is the same as uppercase. However, for some letters that are originally ligatures, only the first component is uppercase in the titlecase form. For example, if you have the letter ǆ (U+01C6), converting it to uppercase gives Ǆ (U+01C4), but conversion to titlecase gives ǅ (U+01C5).

If you find it more convenient, you can also use the derived Boolean (yes/no) properties Uppercase and Lowercase. There is no derived property for detecting titlecase, though.
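
As an illustration of recognizing case from the General Category, here is a minimal Python sketch using the standard `unicodedata` module. The function name `case_kind` and the labels it returns are made up for this example.

```python
import unicodedata

# General Category values that indicate a case, as described in the text.
CASE_BY_CATEGORY = {"Lu": "uppercase", "Ll": "lowercase", "Lt": "titlecase"}

def case_kind(ch):
    """Classify one character by its General Category (hypothetical helper)."""
    category = unicodedata.category(ch)
    if category in CASE_BY_CATEGORY:
        return CASE_BY_CATEGORY[category]
    if category in ("Lm", "Lo"):
        return "caseless letter"
    return "not a letter"

# The three forms of the DZ-with-caron letter discussed above.
for ch in ("\u01C4", "\u01C5", "\u01C6"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), case_kind(ch))
```

Looking up the category rather than parsing the character name avoids the exceptions in naming that the text warns about.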

5.5.2. Case Mappings

Suppose that you have a file or database containing character data and you wish to create a program for searching data from it using simple searches by keywords. If your data contains the word "Newton," you would probably like to make a search find it even if the user enters the word as "newton" or "NEWTON." In effect, you wish to perform a case-insensitive match in the search. That is what people intuitively expect from a search.

You could use case folding, converting all your data to uppercase, or to lowercase, and doing the same for any user input. This would usually be awkward, since you normally want to display the data normally, in mixed case. Therefore, you might wish to perform delayed case folding: keep both the data and the user input in mixed case but convert them to a single case just before performing a comparison (matching) in the search. You might also avoid any case folding and just use a routine that performs a case-insensitive search (although it might internally perform case folding for the purpose).
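
The delayed approach might be sketched as follows in Python. The function name `find_matches` and the sample records are hypothetical, and `str.lower()` stands in for "convert to a single case"; a real application might prefer a dedicated folding routine such as `str.casefold()`.

```python
def find_matches(records, keyword):
    """Return the records containing keyword, ignoring case.

    The stored records are never modified; conversion to a single
    case happens only for the duration of the comparison.
    """
    key = keyword.lower()
    return [record for record in records if key in record.lower()]

records = ["Isaac Newton", "Gottfried Leibniz"]
print(find_matches(records, "NEWTON"))
print(records)  # still in mixed case, ready for display
```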

Mapping (converting) characters from lowercase to uppercase or vice versa is more complex than you might expect. The Unicode database contains, in the basic file UnicodeData.txt (described in Chapter 4), values for the properties Simple Uppercase Mapping, Simple Lowercase Mapping, and Simple Titlecase Mapping. The word "Simple" is there for a reason. The properties are intentionally limited to character-to-character mappings. For example, the Latin small letter sharp "s" ß (U+00DF) has no Simple Uppercase Mapping defined, i.e., it remains invariant in such a mapping. However, such behavior violates the rules of the only language where the character is used (German): the rules say that the uppercase equivalent is the character pair "SS" (e.g., "Fuß" becomes "FUSS").

Simple case mappings are meant to be used only when it is not possible to perform the correct case mappings, e.g., because the length of a string cannot be changed in the mapping. In practice, however, existing software often performs simple case mappings only.
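
To see why string length matters, consider a small Python sketch. Python 3's built-in `str.upper()` is used here as one example of software that does not restrict itself to character-to-character mappings; other environments may behave differently.

```python
word = "Fuß"
upper = word.upper()

# The uppercase string is one character longer than the original,
# so a fixed-length buffer could not hold the result.
print(upper, len(word), len(upper))
```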

There are additional mapping rules in the SpecialCasing.txt file. They are meant to be used in order to override and augment the simple mapping rules. For example, the Latin small ligature "fi" (U+FB01) has no simple uppercase or titlecase mapping, since it is not possible to present them as single characters. However, the SpecialCasing.txt file contains:

 FB01; FB01; 0046 0069; 0046 0049; # LATIN SMALL LIGATURE FI

This line specifies that for U+FB01, the lowercase form is the character itself, the titlecase form is U+0046 U+0069 (i.e., "F" followed by "i"), and the uppercase form is U+0046 U+0049 (i.e., "F" followed by "I").
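
The field layout just described can be read with a few lines of Python. This is only a sketch for a single line of the file: the function name `parse_special_casing` and the dictionary keys are invented for the example, and the optional condition field is collected but not interpreted.

```python
def parse_special_casing(line):
    """Split one SpecialCasing.txt data line into its fields."""
    data, _, comment = line.partition("#")
    fields = [field.strip() for field in data.split(";")]
    code, lower, title, upper = fields[:4]

    def to_text(field):
        # A field is a space-separated list of hexadecimal code numbers.
        return "".join(chr(int(h, 16)) for h in field.split())

    return {
        "char": to_text(code),
        "lower": to_text(lower),
        "title": to_text(title),
        "upper": to_text(upper),
        # Conditions, if any, appear in a fifth field before the comment.
        "conditions": fields[4].split() if len(fields) > 4 else [],
        "comment": comment.strip(),
    }

entry = parse_special_casing(
    " FB01; FB01; 0046 0069; 0046 0049; # LATIN SMALL LIGATURE FI")
print(entry["title"], entry["upper"])
```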

In addition to letters like ß and ligature characters with no single-character uppercase mappings, the additional mapping rules cover letters with diacritic marks, in situations where the uppercase form does not exist as a precomposed character. There are also conditional mappings, such as mapping Greek capital letter sigma Σ to lowercase by a special rule for its use at the end of a word: there lowercase sigma is ς as opposed to the normal σ. Some mappings are language-dependent (for Lithuanian, Turkish, and Azerbaijani). Of course, they can be applied only in situations where the language of text is known.
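
The following Python sketch illustrates both kinds of rule. Python's `str.lower()` applies the final-sigma rule on its own, but it has no language-dependent mappings, so the helper `turkish_upper` below is a hypothetical workaround written for this example, not a library function.

```python
# Conditional mapping: capital sigma at the end of a word becomes
# final sigma (U+03C2); elsewhere it becomes ordinary sigma (U+03C3).
word = "ΟΔΥΣΣΕΥΣ"
lowered = word.lower()
print(lowered)

def turkish_upper(text):
    """Uppercase text by Turkish rules (hypothetical helper).

    Dotted i maps to dotted capital İ (U+0130) and dotless ı (U+0131)
    maps to plain I; everything else uses the default mapping.
    """
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()

print("iyi".upper(), turkish_upper("iyi"))
```

The caller has to know that the text is Turkish or Azerbaijani before choosing `turkish_upper`; nothing in the string itself reveals that.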

In different languages, styles, and applications there are other deviations from the general principles, and they need to be handled separately. It is rather common (though perhaps disapproved of by language authorities) to omit diacritic marks from uppercase letters, especially when writing words in all uppercase. To follow such a convention, you would post-process the result of conversion to uppercase by removing the marks.
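
One way to do such post-processing, sketched in Python under the assumption that the marks to be removed are those that separate out in canonical decomposition (letters such as Ø, which have no decomposition, are left alone). The function name `upper_without_marks` is made up for the example.

```python
import unicodedata

def upper_without_marks(text):
    """Convert to uppercase, then drop combining marks."""
    # Decompose so that each diacritic becomes a separate combining mark.
    decomposed = unicodedata.normalize("NFD", text.upper())
    # General Category values starting with "M" denote marks.
    kept = [ch for ch in decomposed
            if not unicodedata.category(ch).startswith("M")]
    return unicodedata.normalize("NFC", "".join(kept))

print(upper_without_marks("École"))
```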

5.5.3. Case Folding in Unicode

In Unicode, case folding mostly maps everything to lowercase, but there are some complications. The case folding mapping is separately defined in the Unicode database file CaseFolding.txt by explicitly giving the case folded form for each character that changes in case folding. This mapping is defined formally as independent of other mappings, but in practice, there is logic behind it, connecting the mappings.

We can conceptually think of the case folding mapping as mapping everything to uppercase, and then to lowercase. The reason for this apparently absurd complexity is that otherwise the case folded form would not do its job in removing case distinctions. For example, the sharp "s" ß has "SS" as its uppercase equivalent in full case mapping. Therefore, it is mapped to "ss" in full case folding. Otherwise, full case folding would not map "Fuß" and "FUSS" (which differ in case only) to the same string.
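
A short Python sketch shows the effect. Python's `str.casefold()` is used here as an example of a full case folding routine; plain lowercasing is shown for contrast.

```python
a, b = "Fuß", "FUSS"

# Lowercasing alone leaves the sharp s in place, so the strings differ.
print(a.lower() == b.lower())

# Full case folding maps the sharp s to "ss", so the strings match.
print(a.casefold() == b.casefold())
```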

The CaseFolding.txt file contains rules for both simple and full mappings, as opposed to the use of two distinct files as for uppercase, lowercase, and titlecase mappings. The file contains lines like the following:

 00DE; C; 00FE; # LATIN CAPITAL LETTER THORN
 00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S

Here, as usual in the Unicode database, the first item on a line is the code number of the character to which the mapping applies, and anything from # onward is a comment. The lines say that U+00DE is case folded to U+00FE (which is Latin small letter thorn) and U+00DF is case folded to U+0073 U+0073 (which is "ss"). The letter in the second field, here "C" or "F," specifies the applicability of the rule as follows:

"C" means "Common," i.e., the rule is always applied in case folding.
"F" means "Full," i.e., the rule is applied in full case folding only.
"S" means "Simple," i.e., the rule is applied in simple case folding only.
"T" means "Turkic," which means that the rule is optionally selectable for use in case folding by the principles on handling dotted and undotted "i" ("i" versus "ı") in Turkish and Azerbaijani.

Figure 5-5. Viewing Case Charts for some Greek letters

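The status letters can be used to build either a full or a simple folding table from the file. The Python sketch below works on a few sample lines in the CaseFolding.txt format; the first two are the lines quoted above, and the two lines for U+1E9E are included on the assumption that they match the actual file. The names `load_folding` and `fold` are invented for the example, and the optional "T" rules are simply skipped.

```python
SAMPLE = """\
00DE; C; 00FE; # LATIN CAPITAL LETTER THORN
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""

def load_folding(text, full=True):
    """Build a folding table: C + F rules if full, else C + S rules."""
    wanted = {"C", "F"} if full else {"C", "S"}
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the comment
        if not data:
            continue
        code, status, mapping = [f.strip() for f in data.split(";")[:3]]
        if status in wanted:
            table[chr(int(code, 16))] = "".join(
                chr(int(h, 16)) for h in mapping.split())
    return table

def fold(text, table):
    """Apply a folding table; unlisted characters stay unchanged."""
    return "".join(table.get(ch, ch) for ch in text)

full_table = load_folding(SAMPLE, full=True)
simple_table = load_folding(SAMPLE, full=False)
print(fold("\u00DE\u00DF", full_table))
```

With the simple table, every character folds to exactly one character, so string length is preserved; with the full table it may grow.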
5.5.4. Viewing the Mappings

If you just want to view the mappings for different characters, the Unicode Case Charts at http://www.unicode.org/charts/case/ are very handy, as illustrated in Figure 5-5. They show the uppercase, lowercase, titlecase, and case folded form for each character that has any difference between the forms. As usual in such matters, the rendering of glyphs can be problematic due to font problems, especially on Internet Explorer.

5.5.5. Character Case Mappings Versus Visual Mappings

The mappings discussed in the previous section need to be distinguished from purely visual mappings. You could store and process character data as such in mixed case and perform mapping to uppercase, lowercase, or titlecase in visual rendering only. Usually you would map to uppercase in order to highlight a piece of text as a heading or just for emphasis.

The difference between character-level mappings and visual mappings is illustrated by two functions in MS Word:

If you select a piece of text and then use the command Format → Change Case, you can have the text case mapped to uppercase, lowercase, titlecase, or "sentence case," which means that the first word is in titlecase and the other words are in lowercase. Such operations are irreversible, i.e., there is no general way to get the original form back, except naturally in the sense that you might do the Undo operation next.
If you select a piece of text, use the command Format → Font, and check (on the Font pane) the checkbox "All caps" (under "Effects"), then the text will be displayed in all uppercase. The character data is preserved as such, however, so if you later select the text again and uncheck the checkbox, the original form becomes visible. You can also use this approach when defining a style in MS Word, since the style settings have font formatting options, too.

Both of these mappings might perform simple mapping only, so they should be used with caution; e.g., for texts in German and Turkish. Also note that mapping to titlecase does not produce grammatically correct results for English, since it capitalizes every word, but by English rules, words like "a" and "to" should be left lowercase.
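
The titlecase problem is easy to reproduce. In the Python sketch below, `str.title()` capitalizes every word, while the hypothetical helper `english_title` leaves a few short words in lowercase; the word list `SMALL_WORDS` is an arbitrary assumption for illustration, not a complete statement of English rules.

```python
# Illustrative only: real style guides differ on which words stay lowercase.
SMALL_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "on", "for"}

def english_title(text):
    """Capitalize words except small words that are not first."""
    result = []
    for index, word in enumerate(text.lower().split()):
        if index > 0 and word in SMALL_WORDS:
            result.append(word)
        else:
            # Titlecase the first character only.
            result.append(word[:1].title() + word[1:])
    return " ".join(result)

heading = "a guide to unicode"
print(heading.title())
print(english_title(heading))
```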

In HTML or XML authoring, you might use a Cascading Style Sheet (CSS) declaration like text-transform: uppercase. Applied to a string, it performs a conversion to uppercase when selecting glyphs for rendering the characters. The other values of the property are lowercase, capitalize (= titlecase), and none.

Such operations can be a better choice than conversions at the character level, since keeping the data itself in mixed case helps in editing, spellchecking, etc. Moreover, character-level case mappings are irreversible: there is no way to deduce the original form from the case-mapped string.
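
The difference can be sketched in Python. The function `render` is a hypothetical stand-in for a rendering step that applies the mapping for display only, leaving the stored string untouched.

```python
def render(text, all_caps=False):
    """Return the display form; the stored text is not changed."""
    return text.upper() if all_caps else text

stored = "Fuß und McDonald"
shown = render(stored, all_caps=True)

# Trying to recover the original from the uppercase string fails:
# the sharp s and the internal capital are both lost.
round_trip = shown.title()
print(shown)
print(round_trip)
print(stored)  # the original remains available for editing
```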

Such an approach also lets you use different stylesheets for the same data, using conversion to uppercase only when it is judged to be the best way, e.g., for headings (typically, due to lack of better typographic possibilities). However, beware that such transformations might not work by Unicode rules for all characters and that they might apply simple mappings. CSS specifications do not specify how the mappings are performed. In practice, if you write <h1>Fuß</h1> in HTML and have the rule h1 { text-transform: uppercase } in CSS, you probably get "FUß" or even "FUS" (incorrect) depending on the browser, instead of "FUSS," the result of the full uppercase mapping.