5.10. Unicode Conformance Requirements

As mentioned in Chapter 4, a full presentation of the conformance requirements relies on concepts related to character properties; it was therefore postponed until this chapter.

Conformance to the Unicode standard is voluntary. The motivation for making software conformant is that it can then be honestly marketed as Unicode conformant and it can be expected to cooperate with other Unicode conformant software in a predictable manner.

Note that wording like "this program supports Unicode" does not really make a claim on conformance. In practice, this often means just that the software internally operates on Unicode representations of characters. Conformance to the Unicode standard means more: several rules on the interpretation and processing of characters must be satisfied.

On the other hand, conformance does not require the ability to deal with all Unicode characters. You could write a program that conforms to the Unicode standard but processes and displays just a small repertoire of characters, say, ASCII characters or Thai letters. If such a program interfaces with other software, participating in a chain of programs where it receives input from a previous program in the chain and sends output to the next one, it must correctly pass forward all Unicode characters it receives, unless, of course, its defined task includes acting as a filter.

There is currently no mechanism for officially certifying a claim of conformance. The conformance requirements are rather exact, though, so in most cases it can be determined objectively whether some software conforms or not.

5.10.1. An Informal Summary

Before presenting the conformance requirements, let's list their essentials in an informal manner. The Unicode FAQ contains a brief summary of the requirements at http://www.unicode.org/faq/basic_q.html; the following list is a somewhat different formulation. Here "you" refers to software that is meant to be conforming, although intuitively you can read it as referring to people who create or modify such software:

  • You don't need to support all Unicode characters.

  • You may be ignorant of a character, but not plain wrong about it.

  • You can modify characters if that's part of your job, but not arbitrarily.

  • Don't just garble what you don't understand.

  • Treat unassigned code points as taboo: don't generate, don't change.

  • Surrogates are unassigned as code points, but you must recognize surrogates as code units in UTF-16.

  • Noncharacters, including U+FFFE and U+FFFF, are not characters. If you get one, pass it forward, or drop it.

  • Canonical equivalents should normally be treated as the same character, but they may be treated as technically different.

  • Interpret and generate UTF-8 & Co. according to specifications.

  • Treat ill-formed input (violating UTF-8 & Co. rules) as errors.

  • Recognize the byte order mark (BOM) on input, and assume big-endian if there is no BOM.

  • If you include Arabic or Hebrew, you need to implement the bidirectional algorithm.

  • If you include normalization, apply it by the standard.

  • If you do things with the case of letters, follow Unicode rules.

5.10.2. Notations and Terms Used in the Requirements

The conformance requirements will be presented here as annotated quotations from the Unicode standard. The quotations contain somewhat difficult language, but due to their authoritative role, they have been preserved verbatim. The annotations (explanations) use simpler and more common terms. The numbering (C4, C5, ...) is the same as in the standard, which preserves the numbering of previous versions of the standard. Therefore, some numbers are missing, since some old requirements have been superseded. This is why the first requirement is currently C4. Some numbers have letters attached to them, since requirements have been inserted without changing the numbering of old requirements; e.g., C12a was added between C12 and C13.

The requirements use the term abstract character in an attempt to be exact, but actually this causes some vagueness. In Chapter 1, we discussed the various meanings of this term. An abstract character need not have a code number of its own in Unicode; it may consist of a character followed by one or more diacritic marks, for example.

The word process is used a lot in the conformance requirements, but it is not defined in the Unicode standard. It can mostly be understood as meaning software in a broad sense that covers applications, databases, etc.

5.10.3. Unassigned Code Points

C4 A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character.

Although Unicode contains two large blocks for so-called surrogates, the code points in those blocks are not meant to be used at all in character data. Instead, the corresponding code units may be used in the UTF-16 encoding. This sounds confusing, but the gist is that the idea of representing some Unicode characters as "surrogate pairs" consisting of two values operates at the encoding level only. If surrogate code points are detected at the character level (e.g., after an encoding has been interpreted as a sequence of code points, and thereby characters), an error of some kind has occurred.

The conformance requirements do not specify any particular error processing in such a situation, but they disallow the treatment of surrogate code points as characters.
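As an illustrative sketch in Python (not a requirement of the standard itself, just one conforming behavior): a lone surrogate code point exists as a code point, but a strict UTF-8 encoder refuses to emit it as a character.

```python
# A lone surrogate code point is not an abstract character,
# so a strict UTF-8 encoder must treat it as an error.
lone_surrogate = "\ud800"   # high-surrogate code point U+D800

try:
    lone_surrogate.encode("utf-8")
    outcome = "encoded"      # would violate C4
except UnicodeEncodeError:
    outcome = "rejected"     # the conforming behavior

print(outcome)               # rejected
```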

C5 A process shall not interpret a noncharacter code point as an abstract character.

Noncharacter code points (e.g., U+FFFF) are code points in the Unicode coding space that are permanently defined as not denoting any characters ever. They are thus logically impossible in character data. In practice, they may appear in data as indicators (e.g., indicating, upon return from an input routine, that no input was obtained), sentinel values, or structural delimiters between strings.
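The set of noncharacters is fixed by the standard: U+FDD0 through U+FDEF, plus the last two code points of every plane (U+xxFFFE and U+xxFFFF), 66 in all. A small sketch of a test for them (the function name is ours, not from any library):

```python
def is_noncharacter(cp: int) -> bool:
    """True if cp is one of the 66 permanent noncharacter code points."""
    # U+FDD0..U+FDEF, plus U+xxFFFE and U+xxFFFF in every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

print(is_noncharacter(0xFFFF), is_noncharacter(0x0041))  # True False
```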

C6 A process shall not interpret an unassigned code point as an abstract character.

This is similar to the previous requirement but applies to code points in the Unicode coding space that have not (yet) been assigned in any way. They are free locations that may later be filled with something, in an update to the standard, and this is the reason for disallowing their use for now.

A conforming program may use code points in "private" meanings, e.g., to represent characters that have not yet been included in Unicode. But it must not use unassigned code points for that; instead, private use characters should be used.
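Python's unicodedata module makes the distinction visible: unassigned code points report General_Category "Cn" and private-use code points report "Co". Note that which points are unassigned depends on the Unicode version the interpreter ships with.

```python
import unicodedata

# U+E000 is in the BMP private-use area: category "Co"
# (Private_Use) - acceptable for private agreements.
print(unicodedata.category("\ue000"))   # Co

# U+0378 has no assignment (as of this writing): category "Cn"
# (Unassigned) - a conforming process must not interpret it.
print(unicodedata.category("\u0378"))   # Cn
```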

5.10.4. Interpretation

C7 A process shall interpret a coded character representation according to the character semantics established by this standard, if that process does interpret that coded character representation.

"Coded character representation" means a sequence of code points, say U+0041 U+0301. A program is not required to interpret it, but if it does, it must do so in accordance with the normative properties of U+0041 and U+0301.

C8 A process shall not assume that it is required to interpret any particular coded character representation.

This effectively means that software need not understand all Unicode characters. Thus, this is not really a requirement, but permission to implement software that supports just a subset of Unicode characters. The software need not even document that subset, although it is often wise to do so, to help users as well as future developers.

C9 A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct.

This requirement does not mean that software has to treat canonical equivalent sequences (such as ä and its decomposition, "a" followed by combining dieresis) as the same. It is allowed to treat them differently. The general idea in the Unicode standard is that canonical equivalent sequences should be treated identically and as denoting the same abstract character. The standard mentions, however, in this context, that "there are practical circumstances under which implementations may reasonably distinguish them."

For example, it is permissible, though usually not wise, to treat a character differently from its canonical decomposition on display. A program might render ä using a glyph for the character in the current font but the canonical equivalent decomposition by displaying "a" and putting a dieresis over it, using some algorithm for the placement. You may actually see such things happen; it's a bit simpler to implement things that way.

It is allowable, for a program that conforms to the Unicode standard, to fail to interpret combining diacritic marks, i.e., to treat them as unknown characters. Such a program would probably render ä well when represented in precomposed form, but as "a" followed by some indication of an unknown character when in decomposed form.

Conforming software must not rely on having the distinction made in other conforming software. A program that prepares data to be sent to another program for further processing shall not assume that the other program treats, for example, ä and its decomposition as different.
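The equivalence in question can be seen in Python's unicodedata module: the two representations of ä are distinct code point sequences, yet each normalizes to the other.

```python
import unicodedata

precomposed = "\u00e4"    # ä as one code point (U+00E4)
decomposed = "a\u0308"    # "a" followed by U+0308 COMBINING DIAERESIS

# Distinct sequences, so a program may tell them apart (C9) ...
assert precomposed != decomposed

# ... but they are canonical equivalents: each normalizes to the other.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```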

5.10.5. Modification

C10 When a process purports not to modify the interpretation of a valid coded character representation, it shall make no change to that coded character representation other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points.

A conforming program may interpret character data in many ways, of course. It might even be a decipherment program! The requirement, however, discusses a situation in which a program makes a claim that it does not modify the interpretation of character data. In that case, the data itself must not be modified except perhaps by:

  • Replacing a string with a canonical equivalent string (e.g., by replacing a precomposed character like ä with its decomposition, or vice versa)

  • Removing code points that are defined as not denoting any characters, such as U+FFFF

This means that (under the given condition) a program must not remove any characters, such as characters that it does not recognize. For example, if a program "collapses" consecutive space characters into a single space (as web browsers do), this constitutes a modification of the interpretation of character data.

On the other hand, transcoding is allowed. That is, the representation of data may be changed from one encoding to another, perhaps changing the byte order.

5.10.6. Character Encoding Forms

C11 When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall interpret that code unit sequence according to the corresponding code point sequence.

Code units are storage units used for the low-level representation of character data, and their size varies by encoding. The size is 8, 16, or 32 bits for UTF-8, UTF-16, and UTF-32, respectively. The requirement says that conforming software must be able to deal with Unicode encodings and must do so according to the specification of each encoding.
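A Python illustration of code unit sizes: a string's encoded length counts bytes, so the number of code units falls out by dividing by the unit size (the -be suffix just fixes the byte order and suppresses the BOM).

```python
s = "\u00e4\U0001F600"  # ä plus an emoji outside the BMP

print(len(s.encode("utf-8")))     # 6 bytes = 6 8-bit code units (2 + 4)
print(len(s.encode("utf-16-be"))) # 6 bytes = 3 16-bit code units
                                  # (1, plus a surrogate pair of 2)
print(len(s.encode("utf-32-be"))) # 8 bytes = 2 32-bit code units
```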

C12 When a process generates a code unit sequence which purports to be in a Unicode character encoding form, it shall not emit ill-formed code unit sequences.

Here "ill-formed" means a sequence that is prohibited by the encoding used, as defined in the specification of the encoding. This is, of course, part of generating data as correctly encoded.

C12a When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition, and shall not interpret such sequences as characters.

This corresponds to the previous requirement but relates to input. Note that although ill-formed data is to be treated as an error, there are no requirements on error processing, except that such data must not be treated as characters. A program is not required to issue an error message. It may just ignore the data. The standard explicitly permits representing an ill-formed code unit with a marker such as U+FFFD, though this seems unnatural, since that special character is defined to indicate an unrepresentable character rather than ill-formed data.
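As a sketch of both behaviors in Python: an overlong UTF-8 sequence is ill-formed, the strict decoder treats it as an error, and the replace handler substitutes U+FFFD as the standard permits.

```python
ill_formed = b"\xc0\xaf"   # overlong encoding of "/": prohibited in UTF-8

try:
    ill_formed.decode("utf-8")        # must be treated as an error (C12a)
    result = "interpreted"
except UnicodeDecodeError:
    result = "error"
print(result)                          # error

# Substituting U+FFFD for the offending units is explicitly permitted:
print(ill_formed.decode("utf-8", errors="replace"))  # two U+FFFD characters
```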

Conformance clauses C12 and C12a do not mean that programs should never process ill-formed code units. The phrase "purports to be" is interpreted freely in the standard. A conforming program may read data "as such," i.e., as a sequence of octets or other storage units, without paying any attention to its internal structure. For example, copying Unicode data as such, preserving its internal representation, can most efficiently be performed as raw copying. This could mean copying octets in a loop, or a block copy instruction, depending on the computer or communication architecture. The point is that the copying software need not check the data, if it does not try to interpret it according to some encoding.

5.10.7. Character Encoding Schemes

C12b When a process interprets a byte sequence which purports to be in a Unicode character encoding scheme, it shall interpret that byte sequence according to the byte order and specifications for the use of the byte order mark established by this standard for that character encoding scheme.

The requirement is a verbose way of saying that byte order rules must be observed. This means that a program, when reading UTF-16 encoded data, must recognize the byte order as defined in the specification of UTF-16 and apply it, instead of assuming some particular fixed byte order.

Byte order specifies whether the most significant byte (octet) or the least significant byte comes first in a 2-byte quantity. If the most significant byte comes first ("big end first"), the order is called "big-endian"; otherwise, it is "little-endian." A conforming program must be able to handle both, no matter which byte order is used in the "native" data format of the system where the program runs.
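Python's codecs illustrate the scheme-level behavior: the generic utf-16 codec writes a BOM and the decoder infers byte order from it, while the -be and -le variants fix the order and use no BOM.

```python
text = "\u00e4"   # ä, code point U+00E4

data = text.encode("utf-16")                   # BOM + code units
assert data[:2] in (b"\xff\xfe", b"\xfe\xff")  # a BOM, in either order
assert data.decode("utf-16") == text           # order recovered from the BOM

# Explicit-order variants: no BOM, fixed byte order.
assert text.encode("utf-16-be") == b"\x00\xe4"   # big-endian
assert text.encode("utf-16-le") == b"\xe4\x00"   # little-endian
```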

5.10.8. Bidirectional Text

C13 A process that displays text containing supported right-to-left characters or embedding codes shall display all visible representations of characters (excluding format characters) in the same order as if the bidirectional algorithm had been applied to the text, in the absence of higher-level protocols.

This requirement relates to the display of characters that belong to writing systems that are written right to left (e.g., Arabic), as well as to the use of explicit codes (control characters) for setting the writing direction. Conforming programs that perform such operations are effectively required to implement the Unicode bidirectional algorithm, which is defined in Unicode Standard Annex #9. Technically, the formulation of the requirement is more abstract: it is sufficient that the program behaves as if it used that algorithm.
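The algorithm is driven by each character's Bidi_Class property, which Python exposes as unicodedata.bidirectional:

```python
import unicodedata

# Bidi_Class values that feed the bidirectional algorithm:
print(unicodedata.bidirectional("A"))       # L  (Left-to-Right)
print(unicodedata.bidirectional("\u05d0"))  # R  (Hebrew letter alef)
print(unicodedata.bidirectional("1"))       # EN (European Number)
```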

5.10.9. Normalization Forms

C14 A process that produces Unicode text that purports to be in a Normalization Form shall do so in accordance with the specifications in Unicode Standard Annex #15, "Unicode Normalization Forms."

C15 A process that tests Unicode text to determine whether it is in a Normalization Form shall do so in accordance with the specifications in Unicode Standard Annex #15, "Unicode Normalization Forms."

C16 A process that purports to transform text into a Normalization Form must be able to produce the results of the conformance test specified in Unicode Standard Annex #15, "Unicode Normalization Forms."

This is a way of requiring conformance to the specification of Unicode normalization forms. The formulation is somewhat complex, since a conforming program need not understand normalization at all. The requirement says that if it plays with normalization (in the sense of producing normalized data, testing for data being normalized, and transforming to normalized form), it must play by the rules in the annex.
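In Python terms, producing a normalization form and testing for one (the subjects of C14 and C15) correspond to unicodedata.normalize and, on Python 3.8 and later, unicodedata.is_normalized:

```python
import unicodedata

s = "a\u0308"                               # decomposed ä
nfc = unicodedata.normalize("NFC", s)       # produce NFC (cf. C14)
assert nfc == "\u00e4"

# Test whether text is in a normalization form (cf. C15); Python 3.8+.
assert unicodedata.is_normalized("NFC", nfc)
assert not unicodedata.is_normalized("NFC", s)
```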

5.10.10. Normative References

C17 Normative references to the Standard itself, to property aliases, to property value aliases, or to Unicode algorithms shall follow the formats specified in Section 3.1, Versions of the Unicode Standard.

Informally, for example, when saying "I ♡ Unicode," you can use whatever style you prefer to refer to Unicode. The same applies even to official documents, as long as you are not making a normative reference. A normative reference claims or requires conformance. For example, in a contract on building some software, you might wish to specify that the product will conform to the Unicode standard, and then you should be exact. This means that you refer to a specific version, and do that unambiguously. For safety, you may wish to use the exact citation format specified in the standard, such as the following:

The Unicode Consortium. The Unicode Standard, Version 4.1.0, defined by: The Unicode Standard, Version 4.0 (Boston, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1), as amended by Unicode 4.0.1 (http://www.unicode.org/versions/Unicode4.0.1/) and Unicode 4.1.0 (http://www.unicode.org/versions/Unicode4.1.0/).

Note that a conformance claim does not imply support for all characters defined in a particular version of Unicode. There is not even a formal requirement to specify the supported repertoire. In order to know what a software vendor really promises when it claims conformance to Unicode, you need to check what it says about the repertoire.

References to properties should use long names (aliases), not abbreviations, and should cite the standard version as well. The example given is the following, followed by an exact reference to the standard version:

The property value Uppercase_Letter from the General_Category property, as defined in Unicode 3.2.0

This makes the references rather verbose, of course. What is important in practice is to use long names, not abbreviations like "Lu" and "gc," which can be rather cryptic. Specifying the Unicode version is important for definiteness, since properties and their values may change, though they usually don't.

References to Unicode algorithms (see below for a definition) should specify the name of the algorithm or its abbreviation and cite the version of the standard, as in this example:

The Unicode Bidirectional Algorithm, as specified in Version 4.1.0 of the Unicode Standard.

See Unicode Standard Annex #9, "The Bidirectional Algorithm," (http://www.unicode.org/reports/tr9/).

Where algorithms allow tailoring, the reference must state whether any such tailorings were applied or are applicable.

C18 Higher-level protocols shall not make normative references to provisional properties.

A property may be designated as provisional in the standard. This means that it has been included as potentially useful but immature. Officially, it is a "property whose values are unapproved and tentative, and which may be incomplete or otherwise not in a usable state."

The phrase "higher-level protocol" means any agreement on the interpretation of Unicode characters that extends beyond the scope of the Unicode standard.

For data, there is no defined format for claiming or requiring conformance. When you say, for example, that an application accepts Unicode data as input, the meaning of this statement depends on what Unicode version is implied or expressed.

5.10.11. Unicode Algorithms

C19 If a process purports to implement a Unicode algorithm, it shall conform to the specification of that algorithm in the standard, unless tailored by a higher-level protocol.

The term Unicode algorithm is defined as "the logical description of a process used to achieve a specified result involving Unicode characters." Despite the broad definition, it is meant to refer only to algorithms defined in the Unicode standard.

Although the word "algorithm" is used, the essential meaning is the result, not the execution of specific steps in a specific manner. This means that an implementation may use some other approach, as long as the results are always the same.

The term tailoring refers to a different kind of allowed variation. Even the logical description, i.e., the relationship between input and output data, may differ from the one specified in the algorithm, if the algorithm is defined to be tailorable. For example, the algorithms for normalization and canonical ordering are not tailorable, whereas the bidirectional algorithm allows some tailoring.

5.10.12. Default Casing Operations

C20 An implementation that purports to support the default casing operations of case conversion, case detection, and caseless mapping shall do so in accordance with the definitions and specifications in Section 3.13, Default Case Operations.

The basics of casing were described in the section "Case Properties" earlier in this chapter. The casing may be simple or full, and it must be based on the Unicode case mappings. Conformance to the standard does not exclude language-specific tailoring of the rules. Testing the case of a string must be logically based on normalizing the string to NFD and then case mapping it. However, an implementation may perform the test more efficiently, if the results are the same. Similarly, caseless (case insensitive) comparison of strings must logically involve mapping both strings to lowercase.
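Python's str.casefold applies Unicode full case folding, which is the usual basis for caseless matching (closely related to, but stronger than, plain lowercasing). The difference shows with German ß:

```python
# Full case folding: ß folds to "ss", so caseless comparison succeeds.
assert "Straße".casefold() == "strasse".casefold()

# Plain lowercasing leaves ß intact, so it is not enough by itself.
assert "Straße".lower() == "straße"
assert "Straße".lower() != "strasse"
```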

5.10.13. Unicode Standard Annexes

Conformance to the Unicode standard requires conformance to the specifications contained in the following annexes of the standard. The annexes contain both descriptive (informative) and normative material; only the normative parts are relevant to conformance.

  • UAX #9: "The Bidirectional Algorithm"

  • UAX #11: "East Asian Width"

  • UAX #14: "Line Breaking Properties"

  • UAX #15: "Unicode Normalization Forms"

  • UAX #24: "Script Names"

  • UAX #29: "Text Boundaries"

The annexes are available via http://www.unicode.org/reports/. The page contains links to other Unicode Technical Reports (UTR), too. However, only a UTR designated as UAX is part of the Unicode standard. There are also Unicode Technical Standards (UTS), which are normative documents issued by the Unicode Consortium, but separate from the Unicode standard. Conformance to them is not required for conformance to the Unicode standard. Moreover, there are UTR documents labeled simply as UTR! Such documents are informative (descriptive), not normative. Thus, we can loosely describe the relationships between these types of documents by the formula UTR = UAX + UTS + UTR, which reflects the two meanings, broader and narrower, of "UTR."

There are also Unicode Technical Notes (UTN), at http://www.unicode.org/notes/, but they have no normative or otherwise official status whatsoever. In contrast with a UTR, which is produced by the Unicode Technical Committee even if the UTR is not normative, a UTN can be one person's product, which is just made available through the Unicode web site. In practice, the author of a UTN is an expert, and a UTN can be a helpful tutorial, an interesting proposal, an in-depth treatise of a special topic, or otherwise useful.



Unicode Explained
ISBN: 059610121X
Year: 2006