Section 4.4. Unicode Terms | Unicode Explained

4.4. Unicode Terms

The Unicode standard and related documents contain a large number of special terms, often common words used in highly specialized senses. In this book, the presentation of the terms has been spread across the material, into contexts where the terms can be illustrated and exemplified. To check the meaning of a particular term, it is therefore simplest to consult the index.

In this section, some special Unicode terms are presented. The terms refer to concepts that don't quite belong to the core of Unicode and don't belong to special sections either.

4.4.1. Deprecated and Obsolete Characters

A deprecated character is a character that has been included in Unicode but declared as deprecated in the Unicode standard. This indicates a strong recommendation that the character not be used. It remains in Unicode, though, due to the stability principle. For example, a character may be declared deprecated if it turns out that it was introduced into Unicode in error. There is a machine-readable list of deprecated characters in the document http://www.unicode.org/Public/UNIDATA/PropList.txt. In Unicode 4.1, there are few deprecated characters: the combining marks U+0340 and U+0341, the Khmer characters U+17A3 and U+17D3, and the formatting characters U+206A..206F.
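As a minimal sketch, a program could flag deprecated characters in text by checking code points against the list above. The set below is hardcoded from the Unicode 4.1 list given in this section; the names DEPRECATED_4_1, is_deprecated, and find_deprecated are made up for this illustration. A real application would rather read the list from PropList.txt, since later Unicode versions may list more characters.

```python
# Deprecated code points as listed in this section for Unicode 4.1.
# Assumption: hardcoded for illustration; PropList.txt is the real source.
DEPRECATED_4_1 = (
    {0x0340, 0x0341}              # combining marks
    | {0x17A3, 0x17D3}            # Khmer characters
    | set(range(0x206A, 0x2070))  # formatting characters U+206A..U+206F
)


def is_deprecated(ch: str) -> bool:
    """Return True if the single character ch is in the hardcoded list."""
    return ord(ch) in DEPRECATED_4_1


def find_deprecated(text: str) -> list:
    """Return (index, 'U+XXXX') pairs for deprecated characters in text."""
    return [
        (i, f"U+{ord(c):04X}")
        for i, c in enumerate(text)
        if is_deprecated(c)
    ]
```

Such a check only reports the characters; since they remain valid Unicode, whether to reject, replace, or just warn about them is up to the application.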

In many other standards, the term "deprecated" contains a warning that deprecated constructs may, and probably will, be removed in future versions of the standard. In the Unicode standard, there is no such idea; on the contrary, deprecated characters are guaranteed to remain in the standard.

An obsolete character is a character that is no longer used in new texts but has been included in Unicode due to its historical usage. Obsolete characters are not deprecated, as a rule. It is quite appropriate to use an obsolete character when writing text that discusses old texts that contain the character. For example, Latin small letter long "s," ſ (U+017F), is an obsolete character that was used in Gothic (blackletter) writing instead of "s" in some positions. In a broader sense, a character can be regarded as obsolete if it is no longer used in some language, even if other languages may use it.

These concepts are quite different from concepts like "noncharacter" or "unassigned code point." Deprecated and obsolete characters fall into the category of graphic characters in the basic classification of code points, but they are pragmatically different from normal characters.

4.4.2. Digraphs

A digraph is a combination of two successive characters treated as a unit in some sense, such as "ch" in many languages (e.g., when used to indicate one sound) or "ll" in Spanish, where it denotes a particular sound and might be treated in sorting as if it were a single character. Thus, a digraph is a pragmatic concept, not a formal one, and it is an example of a text element (see next subsection).

Speakers of a language may intuitively understand a digraph as "one character," especially if the ordering rules of the language treat it that way. This is especially true for digraphs that are used as replacements for characters, such as "ae" for ä when writing German (under conditions where one cannot or dare not use ä). From the Unicode perspective, it's still two characters.

However, many Unicode characters that originated as digraphs are now treated as single characters. Examples include characters that are completely independent in Unicode, such as Latin small letter ae, æ (U+00E6) (even though an English reader may well see it just as a way of writing "ae" together), as well as compatibility characters such as Latin small ligature ij, ĳ (U+0133), which decomposes into "i" followed by "j."
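The difference between the two kinds of characters can be observed with Python's standard unicodedata module. This is only an illustrative sketch; the variable names are arbitrary.

```python
import unicodedata

ae = "\u00E6"  # LATIN SMALL LETTER AE
ij = "\u0133"  # LATIN SMALL LIGATURE IJ

# An independent character has no decomposition mapping at all.
ae_decomposition = unicodedata.decomposition(ae)  # empty string

# A compatibility character carries a <compat> mapping to other characters.
ij_decomposition = unicodedata.decomposition(ij)  # '<compat> 0069 006A'

# Compatibility normalization (NFKD) replaces the ligature by "i" + "j"
# but leaves the independent letter untouched.
ij_normalized = unicodedata.normalize("NFKD", ij)
ae_normalized = unicodedata.normalize("NFKD", ae)

print(ij_decomposition)
print(ij_normalized, len(ij_normalized))
print(ae_normalized == ae)
```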

Thus, a digraph is normally written as two separate characters in Unicode. Treating it as a unit is up to an application. A digraph may or may not be presented visually as a ligature, i.e., as a single glyph that contains the two characters "melted together."
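To illustrate what "up to an application" can mean, here is a rough sketch of sorting that treats the Spanish digraphs "ch" and "ll" as single letters, as in the traditional ordering. The alphabet is simplified (accented letters are ignored), and the names ALPHABET, split_units, and sort_key are invented for this example. Real software would normally rely on a collation library instead.

```python
# Simplified traditional Spanish alphabet with "ch" and "ll" as letters.
# Assumption: accented vowels and uppercase-specific rules are ignored.
ALPHABET = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {unit: i for i, unit in enumerate(ALPHABET)}
DIGRAPHS = {"ch", "ll"}


def split_units(word):
    """Split a word into sorting units, keeping digraphs together."""
    w = word.lower()
    units = []
    i = 0
    while i < len(w):
        if w[i:i + 2] in DIGRAPHS:
            units.append(w[i:i + 2])
            i += 2
        else:
            units.append(w[i])
            i += 1
    return units


def sort_key(word):
    """Map each unit to its rank; unknown units sort after the alphabet."""
    return [RANK.get(u, len(ALPHABET)) for u in split_units(word)]


words = ["luz", "llama", "cura", "chico", "loma"]
by_code_point = sorted(words)
by_digraph = sorted(words, key=sort_key)
```

With plain code point ordering, "chico" precedes "cura" and "llama" precedes "loma"; with the digraph-aware key, each digraph sorts after all words that begin with the plain "c" or "l." The strings themselves are unchanged: "llama" still consists of five characters.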

Similarly, a trigraph is a sequence of three characters treated as a unit, as "sch" in German, where it denotes roughly the same sound as the digraph "sh" in English.

4.4.3. Text Elements

The concept of text element is informal: it means a sequence of characters (including the special case of one character) that is treated as a unit in some processing. In typical character input and output, characters are text elements. In layout processes, syllables might be treated as text elements, since line breaks are usually allowed between syllables but not within them. When you form a text concordance (a list of occurrences of words, e.g., in alphabetic order), a word is a text element.

The concept is sometimes confused with a combining character sequence, i.e., a sequence consisting of a base character and one or more combining characters (such as combining accents). Although a combining character sequence could also be a text element, that is incidental. A text element is whatever an application regards as a text element.
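The following sketch divides the same text into text elements in three different ways: by character, by combining character sequence, and by word. The function combining_sequences is a simplified illustration written for this example; it treats any character with a nonzero canonical combining class as a combining character, which does not cover all combining marks.

```python
import unicodedata


def combining_sequences(text):
    """Group each base character with the combining marks that follow it.

    Simplification: a character counts as combining only if its canonical
    combining class is nonzero.
    """
    sequences = []
    for ch in text:
        if sequences and unicodedata.combining(ch) != 0:
            sequences[-1] += ch
        else:
            sequences.append(ch)
    return sequences


text = "cafe\u0301 noir"  # "e" followed by COMBINING ACUTE ACCENT

as_characters = list(text)
as_sequences = combining_sequences(text)
as_words = text.split()

print(len(as_characters), len(as_sequences), len(as_words))
```

None of the three divisions is "the" division into text elements; each is appropriate for some kind of processing.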

4.4.4. Unicode Strings

The term "Unicode string" has a more technical meaning than you might expect. It does not refer to a string (sequence) of Unicode characters (code points) but to a sequence of code units. Thus, the components of the string are of fixed size in bits (in practice, 8, 16, or 32 bits). In many programming languages, Unicode strings have a code unit size of 16 bits. This does not limit the range of characters, since such a string could be interpreted according to UTF-16.

Thus, a component of a Unicode string need not correspond to a character. A code unit could be part of the representation of a character (say, the second octet of a two-octet representation in UTF-8). Even if a code unit as such represents a code point, it can be a noncharacter or an unassigned code point.
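As a small illustration, the code units of a character can be listed in Python by encoding it and splitting the result into units of the appropriate size. The helper names are made up for this sketch.

```python
def code_units_utf8(s):
    """Return the 8-bit code units (octets) of s in UTF-8."""
    return list(s.encode("utf-8"))


def code_units_utf16(s):
    """Return the 16-bit code units of s in UTF-16."""
    data = s.encode("utf-16-be")  # big-endian, no byte order mark
    return [
        int.from_bytes(data[i:i + 2], "big")
        for i in range(0, len(data), 2)
    ]


e_acute = "\u00E9"       # LATIN SMALL LETTER E WITH ACUTE
g_clef = "\U0001D11E"    # MUSICAL SYMBOL G CLEF, outside the 16-bit range

utf8_units = code_units_utf8(e_acute)    # two octets for one character
utf16_units = code_units_utf16(g_clef)   # a surrogate pair

print([hex(u) for u in utf8_units])
print([hex(u) for u in utf16_units])
```

In both cases one character corresponds to two code units, and neither unit alone represents the character.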

Although a Unicode string is often in some encoding, this is not a requirement. It is possible to consider any sequence of octets as a Unicode string, even if the sequence does not correspond to the rules of any Unicode encoding (in practice, UTF-8 in this case). You could also have a sequence of 16-bit (double byte) code units containing isolated surrogates.
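A program that needs to know whether such a sequence is well-formed has to check it explicitly. The following is a minimal sketch of such checks, with function names invented for this example: the 16-bit check looks only for properly paired surrogates, and the octet check simply delegates to Python's strict UTF-8 decoder.

```python
def is_well_formed_utf16(units):
    """Return True if a list of 16-bit code units has no isolated surrogates."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate without a preceding high surrogate.
            return False
        i += 1
    return True


def is_well_formed_utf8(octets):
    """Return True if the octets decode as UTF-8 without errors."""
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

For example, the code unit sequence 0041 D834 is a perfectly acceptable Unicode string in the sense described here, but is_well_formed_utf16([0x0041, 0xD834]) returns False, since the high surrogate is not followed by a low surrogate.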

The point here is that "Unicode string" is a technical concept for use in programming, and it is intended to be very simple for such use. A program or function that accepts a Unicode string as input need not check its internal structure and may process it in any suitable way. If the output is a Unicode string, it need not correspond to any encoding.

The reason behind this is efficiency. Software designers can make programs check for the integrity of a Unicode string as representing a sequence of characters, but they can do it at the point they prefer.
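The following sketch, with hypothetical function names, shows the idea: the intermediate functions handle octet strings without any checks, and integrity is verified only once, at the point where the data is turned into characters.

```python
def truncate(units, max_len):
    """Cut an octet string; no attempt is made to keep characters whole."""
    return units[:max_len]


def concat(a, b):
    """Join two octet strings without inspecting their contents."""
    return a + b


def to_text(units):
    """The single checkpoint: raises UnicodeDecodeError if ill-formed."""
    return units.decode("utf-8")


data = "n\u00E9".encode("utf-8")  # three octets: 6E C3 A9
head = truncate(data, 2)          # 6E C3, ends in the middle of a character
tail = data[2:]                   # A9, a lone trailing octet
whole = concat(head, tail)        # well-formed again

print(to_text(whole))
```

Here head and tail are not valid UTF-8 on their own, yet they can be stored, passed around, and joined freely. Only to_text would reject them.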