Section 5.7. Text Boundaries


In text processing, we often need to work with text elements larger than individual characters. For example, we might need operations like "delete next word" or "move one sentence forward in the text." Therefore, we need to recognize boundaries between elements of text, collectively called text boundaries.

Text boundary principles are defined in a separate document, Unicode Standard Annex (UAX) #29, "Text Boundaries." It specifies boundaries for three types of text elements:

  • Grapheme cluster, which is characterized informally as "user character"

  • Word, which partly corresponds to the word concept in natural languages

  • Sentence, which is recognized from punctuation by some coarse rules

The concept "grapheme cluster" is the most obscure of the three. It is meant to correspond to the idea of a character as a user sees it, on the basis of her cultural background. For example, it could be a digraph (two-character combination) like "ch" if understood as a single letter in some language, or a combination of a base character and one or more diacritic marks, or a sequence of Unicode characters that represent one syllable and are displayed as a unit. Therefore, "grapheme cluster" depends on the writing system and on conventions, but UAX #29 still tries to specify how to recognize it in general. Expressions like "grapheme" and "logical character" have been used too, but all names seem to be prone to misunderstanding. In particular, this is not a matter of graphemes in the linguistic sense.
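One of the cases mentioned above, a base character followed by diacritic marks, can be sketched in a few lines of Python. This is only a rough illustration, not the UAX #29 algorithm: the function name rough_clusters is hypothetical, and the sketch assumes that any character with a General Category of M (mark) simply attaches to the character before it. It ignores digraphs, syllable sequences, and everything else that depends on the writing system.

```python
import unicodedata

def rough_clusters(text):
    """Group each base character with the combining marks that follow it.

    Assumption: General Category M* (Mn, Mc, Me) identifies combining marks.
    This covers only the "base character plus diacritics" case.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # attach the mark to the preceding cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

# "résumé" written with combining acute accents (U+0301)
sample = "re\u0301sume\u0301"
print(len(sample))                  # 8 Unicode characters
print(len(rough_clusters(sample)))  # 6 "user characters"
```

The difference between the two counts is the point: an operation like "delete next character" would normally be expected to act on the six units, not on the eight code points.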

The default, or language-independent, boundary rules are specified in UAX #29 at the general level, referring to certain special properties of characters. The values of these properties are defined in the Unicode database files GraphemeBreakProperty.txt, WordBreakProperty.txt, and SentenceBreakProperty.txt. The general approach in the definitions is the same as for line breaks, as described in the section "Line-Breaking Properties" later in this chapter.
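The property files use the usual layout of Unicode database files: a code point or a range of code points, a semicolon, the property value, and an optional comment after "#". As a minimal sketch, such lines could be read into a lookup table as follows. The function name parse_property_lines is hypothetical, and the sample lines are written in the style of WordBreakProperty.txt for illustration; they are not copied from any particular version of the file.

```python
def parse_property_lines(lines):
    """Map code points to property values from lines like '0041..005A ; ALetter # ...'."""
    table = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop the comment part
        if not data:
            continue                           # blank or comment-only line
        span, value = (part.strip() for part in data.split(";"))
        first, _, last = span.partition("..")  # single code point or a range
        start = int(first, 16)
        end = int(last, 16) if last else start
        for cp in range(start, end + 1):
            table[cp] = value
    return table

# Illustrative sample lines (an assumption, not verbatim file content)
sample_lines = [
    "# Sample data in the style of WordBreakProperty.txt",
    "0030..0039    ; Numeric   # DIGIT ZERO..DIGIT NINE",
    "003A          ; MidLetter # COLON",
    "0041..005A    ; ALetter   # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
]

table = parse_property_lines(sample_lines)
print(table[ord("A")])   # ALetter
print(table[ord("7")])   # Numeric
```

Storing one entry per code point keeps the sketch short; a real implementation would more likely keep the ranges and search them.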

To illustrate the nature of boundary rules, which are not yet widely implemented along these lines in existing software, we will consider the word boundaries. Described informally and somewhat loosely, the principles are:

  • Treat consecutive alphabetic characters as belonging to the same word. This applies to characters for which the Alphabetic property has the value "yes" (True), except characters belonging to the Thai, Lao, or Hiragana writing systems. It also applies, somewhat surprisingly, to the no-break space character.

  • Treat digits and other numeric characters as comparable to alphabetic characters (e.g., treat "3A" as one word).

  • Do not break a numeric string at a character that has a LineBreak property value of IS = Infix numeric separator (except for ":"). For example, treat "1.000,00" as one word.

  • Treat connector punctuation such as "_" (with General Category value of Pc = Punctuation, connector) as comparable to alphabetic characters (e.g., treat "foo_bar" as one word).

  • Treat a grapheme cluster as if it were one character.

  • Regard the following as part of a word when they appear between alphabetic characters: apostrophe ' (U+0027), right single quotation mark ’ (U+2019), middle dot · (U+00B7), hyphenation point ‧ (U+2027), colon : (U+003A), and Hebrew punctuation gershayim ״ (U+05F4).

For example, the principle mentioned last in the list works well for some strings that need to be treated as words, such as "cat's" and "c:a" (an abbreviation of a word in Swedish). On the other hand, the principle also joins strings that should be treated as separate words, as in the Italian expression "dell'arte" (where "dell'" is a contraction of a preposition).
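The informal principles listed above can be tried out with a small Python sketch. It is deliberately loose and is not an implementation of UAX #29. The names rough_words, is_wordlike, MID_LETTER, and MID_NUM are hypothetical, str.isalnum() stands in for the Alphabetic and numeric tests, combining marks are simply treated as word characters, and the set of infix numeric separators is reduced to the full stop and the comma as an assumption.

```python
import unicodedata

# Characters kept inside a word when they stand between letters
MID_LETTER = {"\u0027", "\u2019", "\u00B7", "\u2027", "\u003A", "\u05F4"}
# Assumption: a small stand-in for the infix numeric separators
MID_NUM = {".", ","}

def is_wordlike(ch):
    """Letters, digits, connector punctuation (Pc), and combining marks."""
    cat = unicodedata.category(ch)
    return ch.isalnum() or cat == "Pc" or cat.startswith("M")

def rough_words(text):
    """Split text into 'words' using the loosely stated principles."""
    words, current = [], ""
    for i, ch in enumerate(text):
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        joins_letters = ch in MID_LETTER and prev.isalpha() and nxt.isalpha()
        joins_digits = ch in MID_NUM and prev.isdigit() and nxt.isdigit()
        if is_wordlike(ch) or joins_letters or joins_digits:
            current += ch
        elif current:
            words.append(current)   # a boundary: close the current word
            current = ""
    if current:
        words.append(current)
    return words

print(rough_words("the cat's price: 1.000,00 for foo_bar 3A"))
# ['the', "cat's", 'price', '1.000,00', 'for', 'foo_bar', '3A']
print(rough_words("commedia dell'arte"))
# ['commedia', "dell'arte"]
```

The second call shows the drawback discussed above: the apostrophe rule keeps "dell'arte" together, although it consists of two words in Italian.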

The default boundary rules in UAX #29 are not meant to work as a basis of advanced processing of natural languages, such as syntactic analysis. Rather, they are meant to help in implementing useful operations in editing. For example, you can typically double-click in a word processor to select a "word," and perhaps triple-click to select a "sentence." The text boundary rules are meant to define what is selected that way.



Unicode Explained
ISBN: 059610121X
Year: 2006
Pages: 139
