5.9. Line-Breaking Properties

The need for line breaking arises from the simple fact that horizontal space is usually limited by factors such as paper or screen width. In addition, for readability, line length should be kept within reasonable limits. Typographic recommendations usually suggest a maximum of 80 or 90 characters and an optimum of 55 to 60 characters. On small devices (e.g., when displaying text messages on a mobile phone), the line length can be very small, for example, 13 characters.

Plain text is often preformatted so that it is divided into lines with explicit line breaks, typically making lines shorter than 80 characters. Such text may need to be reformatted, though, due to changing requirements on line length. Moreover, it is nowadays very common to avoid preformatting. In word processors, web authoring, and even in plain text, explicit line breaks are often omitted, marking just paragraph boundaries. Therefore, paragraphs need to be dynamically formatted into lines.

In old manual typography, line breaks were decided by professional typesetters with years of experience. In the modern world, line breaking of prose text is mostly automated, though sometimes people inspect and check the results. This may mean preventing undesirable line breaks, suggesting line-breaking opportunities, or adding forced line breaks. It might also be possible to do such things even in the writing phase, and this is the only feasible way if the author cannot see the formatted results (e.g., in normal web authoring).

Preformatted text is still used in many contexts, such as poetry. Typically, forced line breaks of some kind are used to create the lines, which are kept so short that they need not be broken further in any normal circumstances.

5.9.1. Conformance Criteria

Line breaking is a very complex issue, and the Unicode standard deals with it at a rather technical and low level only. Unicode has, in addition to explicit line break characters, some special control characters for suggesting or prohibiting line breaks at specific points and some general line-breaking rules. The rules mostly operate very locally only, for a single character or a pair of consecutive characters. Therefore, these rules constitute just a coarse technical basis, which often needs to be augmented and overridden by higher-level rules, such as language-specific hyphenation. Moreover, they are just an optional basis. Some features in line breaking are normative (i.e., must be obeyed if line breaking is performed at all), but the algorithm as a whole is not.

Conformance to the Unicode standard does not require conformance to the Line Breaking Algorithm. The algorithm specifies default rules that software designers may wish to use as a basis.


A program can separately claim conformance to the Unicode Line Breaking Algorithm. Even then, the algorithm may partly be overridden by higher-level rules, or "tailored," provided that the existence of such rules is mentioned. Detailed documentation is not required; a statement like the following is acceptable: "This program uses the Unicode Line Breaking Algorithm as specified in Version 4.1.0 of the Unicode Standard, as tailored to the Vogon language."

5.9.2. Characters for Special Control over Line Breaking

Let us first look at some Unicode characters that are meant for controlling line breaks in text, either by preventing an undesired break or by suggesting a break. The general idea is that programs divide text into lines according to some general principles, such as the default Unicode rules for line-breaking or application-specific rules. Characters that can be used to override general rules are presented in Table 5-8. Some of the characters are control characters; some are graphic characters with special line-breaking properties. Note that here we do not discuss forced line breaks, generated with line break characters such as Carriage Return; they will be discussed in Chapter 8.

Table 5-8. Characters for special control over line breaking

Code     Unicode name         Description
-------  -------------------  ----------------------------------------------
U+00A0   No-break space       Like space U+0020, but prevents line break
U+2011   Non-breaking hyphen  Like hyphen U+2010, but prevents line break
U+00AD   Soft hyphen          Invisible; indicates allowed hyphenation point
U+200B   Zero-width space     Invisible; indicates allowed line break
U+2060   Word joiner          Invisible; prevents line break

5.9.2.1. Preventing line breaks

The word joiner (WJ) character exists for the sole purpose of preventing a line break. It can be used when general line-breaking rules would allow a line break but a break is considered inappropriate. Do not confuse it with the ZWJ U+200D and the ZWNJ U+200C, which relate to ligature behavior. Unfortunately, there is such confusion in several versions of MS Word. When you select Insert Symbol and select the Special characters pane, there are options like "No-Width Optional Break" and "No-Width Non Break." Selecting them actually inserts ZWJ and ZWNJ, respectively. MS Word treats them the way it describes, but if the text is transferred to a program that conforms to the Unicode standard in this respect, their effect changes essentially.

Previously, the zero-width no-break space (ZWNBSP) U+FEFF was defined as an invisible character that prevents a line break. However, such usage is not recommended anymore; instead, the word joiner should be used. The zero-width no-break space retains its use for an unrelated purpose, as a byte order mark.

Only a few characters, such as the space, have "non-breaking clones." Technically, the "clones" are characters with a compatibility decomposition containing the <noBreak> tag, indicating that a line break is prohibited after the character. In practice, there can be other differences, too, between a character and its "non-breaking clone" (see notes on no-break space in Chapter 8).

For other characters, different techniques must be used. To disallow a line break after a solidus /, for example, you cannot use a non-breaking version of that character. Instead, you can use the normal character and add a word joiner character after it.
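As a minimal sketch of this technique in Python: the helper name forbid_break_after is hypothetical, and the result has an effect only in software that honors U+2060 as defined in the Unicode standard.

```python
WORD_JOINER = "\u2060"  # invisible; prevents a line break where it stands


def forbid_break_after(text: str, chars: str = "/") -> str:
    """Insert a word joiner after each occurrence of the given characters."""
    out = []
    for ch in text:
        out.append(ch)
        if ch in chars:
            out.append(WORD_JOINER)
    return "".join(out)


glued = forbid_break_after("and/or")
# Show the code points, since the word joiner itself is invisible.
print([f"U+{ord(c):04X}" for c in glued])
```

The visible text is unchanged; only the sequence of code points differs, so the effect can be checked only by inspecting code points or by testing the rendering.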

5.9.2.2. Suggesting line break opportunities

The zero-width space (ZWSP) allows a simple line break, without adding any hyphen. It is typically used in strings that are not words, although in the Thai script, it can be used to separate words. It may be useful, for example, when a long URL is mentioned in text. You can add ZWSP in places where a line break is acceptable.

The soft hyphen is supported by some programs, ignored by some, and treated as a visible hyphen by some other software. When supported according to the Unicode standard, it suggests (allows) hyphenation, i.e., division of a word so that a hyphen is placed at the end of the first line.

For example, if you insert a soft hyphen U+00AD between "c" and "d" in "abcdef," you allow the string to be divided so that "abc-" appears at the end of a line. If you insert a zero-width space U+200B instead, you allow the string to be divided so that "abc" without a hyphen appears at the end of line. In the latter case, the reader cannot really know (except from an explicit explanation or by guessing right) that the text contains "abcdef" and not "abc def."
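The difference between the two characters can be illustrated with a small Python sketch. The function break_at_hint is a hypothetical helper that simulates what a conforming renderer would do when it decides to break at the marked point; it is not part of any library.

```python
SOFT_HYPHEN = "\u00ad"
ZWSP = "\u200b"


def break_at_hint(text: str) -> tuple[str, str]:
    """Split text at its first soft hyphen or zero-width space.

    A soft hyphen turns into a visible hyphen at the end of the first
    line; a zero-width space leaves no trace at all.
    """
    for i, ch in enumerate(text):
        if ch == SOFT_HYPHEN:
            return text[:i] + "-", text[i + 1:]
        if ch == ZWSP:
            return text[:i], text[i + 1:]
    return text, ""  # no break opportunity was marked


print(break_at_hint("abc\u00addef"))  # ('abc-', 'def')
print(break_at_hint("abc\u200bdef"))  # ('abc', 'def')
```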

5.9.2.3. Limited support

Software like Microsoft Word may not interpret all line-breaking control characters as defined in the Unicode standard. Program-specific tools, which operate at levels other than character level, can be more effective in practice. Some techniques are mentioned later in this chapter, and Chapter 9 presents some ways to prevent line breaks in markup languages like HTML.

Characters in Table 5-8 are inconsistently supported in popular software such as word processors. Check the software documentation, or run some tests, before taking them into use, and stay alert to problems in data transfer between programs. The no-break space is well supported, though.


5.9.3. Principles of Line Breaking

In very simple processing of English texts, line breaking consists of breaking between words. Technically, this means the principle that a line break is allowed after a space but not elsewhere. Conceptually, this means that there is a space at the end of a line but it is ignored (not even counted as lengthening the line). Many text editors, browsers, and other programs still apply such a simple model. They may treat a hyphen (or hyphen-minus) as an allowed break point, too.
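This simple model can be sketched in a few lines of Python. The function name wrap_after_spaces is hypothetical, and the sketch assumes that line width is measured in characters, whereas real rendering measures glyph widths.

```python
def wrap_after_spaces(text: str, width: int) -> list[str]:
    """Greedy line filling: a break is allowed only after a space.

    The space at a break is dropped, so it does not count toward the
    line length. A word longer than `width` is simply left to overflow.
    """
    lines: list[str] = []
    current = ""
    for word in text.split(" "):
        candidate = word if not current else current + " " + word
        if current and len(candidate) > width:
            lines.append(current)   # the line is full; start a new one
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


print(wrap_after_spaces("the quick brown fox jumps", 10))
```

Treating a hyphen as an additional break point would require splitting words further, which this sketch deliberately leaves out.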

In the Unicode context, the problem is more complex, since not all writing systems use spaces between words. Moreover, technical or otherwise special texts can contain long strings of symbols with no spaces. There are six basic modes of line breaking (although the Unicode standard lists only the first, the fourth, and the fifth as "principal styles"):


Western, i.e., word-oriented (without hyphenation)

A line break may be introduced after a space, and possibly after a hyphen as well. This can be applied to Latin, Greek, Cyrillic, and many other scripts that have a concept of written word that consists of letters, with words separated by spaces.


Western with hyphenation

Additional line breaks may be introduced on the basis of hyphenation of words so that a word is broken to two lines and a hyphen is added at the end of a line to indicate this. Hyphenation may be based on language-specific algorithms, on hyphenation dictionaries, or on invisible hyphenation hints.


Symbolic

Line breaks should generally be avoided, but when necessary, line breaks can be introduced between major components of a construct, such as a mathematical expression, a chemical formula, a pathname of a file, or a URL. Quite often, simpler methods are needed, and they are often based on allowing breaks after certain characters.


East Asian

Line breaks are allowed everywhere except after or before certain characters. This is used for East Asian languages written using an ideographic or syllabic system. Korean, however, uses spaces between words.


South East Asian

Line breaks are allowed at syllable boundaries, to be detected in a morphological analysis. This is used for languages like Thai, written without spaces between words and allowing a line break between syllables in general.


Emergency breaks

A line break is made as required by an imposed maximum line length when the limit is reached, irrespectively of any line-breaking rules. Emergency breaks are normally applied as the last resort only, when there is a long string that cannot be broken at all by the line-breaking rules being applied.

If you use a program that applies Western-style line breaking, it is clear that East Asian or South East Asian texts won't work well, even if you could type them in that program. In such a case, short fragments of symbolic text work reasonably, but long expressions can be difficult to handle, especially if you cannot use spaces in them.

In all modes, special characters for explicit line break control can be used as an additional device, if supported as defined in the Unicode standard. Sometimes they might be entered automatically, e.g., by processing URLs with programs that insert invisible line break control characters.

Unicode line-breaking rules are mostly oriented toward handling Western, Symbolic, and East Asian modes. Thus, the rules address just a relatively small part of the problem. On the other hand, being defined in a uniform manner, they let you work with documents containing a mixture of scripts and languages. This has partly been achieved by somewhat artificial decisions. It is easy to start defining line-breaking rules so that Latin letters and other characters used in Western scripts have properties suitable for the Western mode, etc., but there are many borderline cases, especially since many characters are used in several scripts, or for several essentially different purposes within a script.

5.9.4. Emergency Breaks

The oldest forms of alphabetic writing can be described as using emergency breaks as the normal mode. Words were written with no space between them, and the entire available writing width was used, with no regard to word boundaries. This saved writing material, which was very expensive (e.g., parchment). We can see such things (although with spaces between words) happen again, for example, on small devices where the line length is small. Applying emergency break mode throughout makes things unambiguous, if readers know about it and if a space is written even when a line break occurs between words. Applying emergency breaks as the last resort may introduce ambiguity. In particular, it means breaking a word without displaying a hyphen, even when the normal mode is (or the user may think it is) Western with hyphenation.

When a program applies some line-breaking rules and observes that a line would exceed the allowed width, since there is no line-breaking opportunity within a long string, there are several ways to handle the situation, such as the following:


Emergency break

Break the line so that the maximum width is used and continue at the start of the next line, with no indication that a break has occurred.


Visible overflow

Let the line extend past the allowed width (to a page margin, on paper).


Horizontal scrolling (on screen)

Only the part of the line that fits is visible, but a scrollbar can be used to see the entire line.


Invisible overflow

Make the part of text that exceeds the width invisible, perhaps thereby making the last visible character appear in part only.


Truncation

Similar to invisible overflow, but indicated with some symbol like "..." at the end of a line.


Negative kerning

Reduce the spacing between characters or words or both, making the text fit within the width at the cost of reduced legibility.
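Two of the strategies listed above, the emergency break and truncation, can be sketched in Python as follows. The function names and the example URL are hypothetical, and widths are assumed to be positive and counted in characters rather than measured from glyphs.

```python
def emergency_break(text: str, width: int) -> list[str]:
    """Cut the text every `width` characters, ignoring all break rules."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def truncate(text: str, width: int, marker: str = "...") -> str:
    """Keep what fits and flag the omission with a marker."""
    if len(text) <= width:
        return text
    return text[:max(0, width - len(marker))] + marker


url = "http://www.example.com/a/very/long/path"
print(emergency_break(url, 16))
print(truncate(url, 16))
```

The emergency break loses nothing but gives no hint of where the break fell; truncation keeps the layout intact but discards part of the data.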

All these approaches have considerable drawbacks, so a choice between them is usually about the lesser of evils. The problem is particularly difficult when the data comes from an unpredictable source. For example, discussion forums on the Web can be sabotaged by entering a message with very long strings that are unbreakable by the rules that web browsers apply. On the other hand, a message could meaningfully contain such a string (e.g., a URL). Therefore, displaying the message in an area that has a horizontal scrollbar when needed is probably the best option, as a rule. (You would use overflow: auto in CSS for this; overflow: scroll shows the scrollbar even when it is not needed.)

When processing data to be presented on paper, it is best to perform at least some preprocessing to decide whether there will be long unbreakable strings. For example, if you know that the rendering engine does not perform any hyphenation and does not observe Unicode line-breaking rules (as a whole), you can estimate that any string that has no spaces in it and is longer than, say, 30 characters will probably cause serious problems. You could then modify the string by adding an explicit line-breaking opportunity (i.e., zero-width space U+200B) after some special characters that you expect to be common in such strings. Of course, you would first need to make sure that the rendering engine understands U+200B. (Otherwise, you might insert something that has a similar effect in a particular situation, perhaps the nonstandard tag <wbr> in HTML authoring.) The special characters should be chosen so that a break after them is not too disturbing and could be understood by the users. If you expect the long strings to be usually URLs, you could insert a break opportunity after any /, ?, and &, for example.
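The preprocessing described above might look like the following Python sketch. The function name, the constants, and the example URL are hypothetical; the threshold of 30 characters and the characters /, ?, and & are taken from the text above.

```python
import re

ZWSP = "\u200b"
BREAK_AFTER = "/?&"   # characters after which a break is tolerable
MAX_UNBROKEN = 30     # longest space-free string left untouched


def add_break_opportunities(text: str) -> str:
    """Insert ZWSP after /, ? and & inside long space-free strings."""
    def fix(match: re.Match) -> str:
        token = match.group(0)
        if len(token) <= MAX_UNBROKEN:
            return token  # short strings are left alone
        return "".join(ch + ZWSP if ch in BREAK_AFTER else ch for ch in token)

    # \S+ matches each maximal run of non-whitespace characters.
    return re.sub(r"\S+", fix, text)


long_url = "http://www.example.com/some/long/path?x=1&y=2"
marked = add_break_opportunities("See " + long_url + " for details.")
print(marked.count(ZWSP))
```

Because the inserted characters are invisible, removing every U+200B from the result restores the original text exactly, which is useful if the string is later copied as data.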

5.9.5. Unicode Line-Breaking Rules

The Unicode standard specifies "line-breaking behavior" of characters in an apparently complex way, but the rules really operate at the level of individual characters and character pairs. The rules answer questions like "is it permissible to break between these two characters" with no regard to what appears before or after them.

Previously, there were different descriptions of "line-breaking behavior" in different parts of the Unicode standard. The assignments of line-breaking property values to characters, too, have changed between Unicode versions. This is one reason why you should not expect to find complete implementation of the rules even in layout software. However, for new software, a designer might decide to use a subroutine library that implements the rules (such as Unicode::Wrap in the CPAN archive for the Perl language). In that case, care should be taken to check which version of line-breaking rules it implements.

Unicode line-breaking rules have not been widely implemented yet. Programs typically implement at most a part of them, possibly according to some older version of the rules.


The definitions have now been collected into Unicode Standard Annex (UAX) #14, "Line Breaking Properties." Despite being issued as a separate document, it is an integral part of the standard. It discusses the line-breaking rules in different ways. It is not obvious which parts are the ultimate definitions. The longish Chapter 5, "Line Breaking Properties," is explanatory, or "narrative" as it calls itself, and Chapter 7, "Pair-table Based Implementation," with a tabular presentation of some of the rules, is descriptive, too: it explains a possible implementation.

The authoritative specification of line-breaking properties (both normative and informative) consists of the first part (before Table 1) of Chapter 2, "Definitions," and the formalized rules in Chapter 6, "Line Breaking Algorithm," of UAX #14 and the LineBreak.txt file. The former describes the rules in terms of LineBreak properties; the latter assigns a LineBreak property to each character. All the rest is attempted explanations or illustrations, and might be just confusing.

5.9.5.1. Values of the LineBreak property

Although the values of the LineBreak property are meant to be somewhat mnemonic (e.g., PR stands for "Prefix (Numeric)"), they are not meant to constitute a classification in a general meaning (like the General Category property). For example, the dollar sign $ has the LineBreak value of PR, but this just reflects its treatment as a prefix character in line breaking, due to common use in front of a number in English usage. The mnemonic names should be read with the implied text "treated as ... in the context of line breaking" around them.

The values are briefly described in Table 5-9. The descriptions are meant to be illustrative, not part of the formal definitions. What really constitutes the defined meaning of the values is the set of rules that use these values to describe line-breaking opportunities. The descriptive names in the second column, though taken from the standard, are not even defined synonyms of the values, just concise characterizations.

Table 5-9. LineBreak property values

Value  Descriptive name                         Example(s) of characters
-----  ---------------------------------------  --------------------------------
AI     Ambiguous (Alphabetic or Ideographic)    ½, ×, ¡
AL     Ordinary Alphabetic and Symbol           A, >
B2     Break Opportunity Before and After       em dash (U+2014)
BA     Break Opportunity After                  Thin space, soft hyphen
BB     Break Opportunity Before                 ´ (U+00B4), ˌ (U+02CC)
BK *   Mandatory Break                          LS (U+2028), PS (U+2029)
CB *   Contingent Break Opportunity             U+FFFC
CL     Closing Punctuation                      ), ]
CM *   Attached Characters and Combining Marks  Combining grave accent (U+0300)
CR *   Carriage Return                          CR (U+000D)
EX     Exclamation/Interrogation                !, ?
GL *   Non-breaking ("Glue")                    No-break space (U+00A0)
H2     Hangul LV Syllable                       가 (U+AC00)
H3     Hangul LVT Syllable                      각 (U+AC01)
HY     Hyphen                                   - (hyphen-minus)
ID     Ideographic                              Chinese ideographs
IN     Inseparable                              horizontal ellipsis (U+2026)
IS     Infix Separator                          , (comma), . (full stop)
JL     Hangul L Jamo                            ᄀ (U+1100)
JT     Hangul T Jamo                            ᆩ (U+11A9)
JV     Hangul V Jamo                            ᅡ (U+1161)
LF *   Line Feed                                LF (U+000A)
NL *   Next Line                                NL (U+0085)
NS     Non Starter                              Small kana letters (in Japanese)
NU     Numeric                                  0, 1
OP     Opening Punctuation                      (, [
PO     Postfix (Numeric)                        %, ¢
PR     Prefix (Numeric)                         $, +
QU     Ambiguous Quotation                      " and other quotation marks
SA     Complex Context (South East Asian)       ก (Thai character ko kai)
SG *   Surrogates                               (should not appear)
SP *   Space                                    " " (space)
SY     Symbols Allowing Breaks                  /
WJ *   Word Joiner                              WJ (U+2060)
XX     Unknown                                  Private use code points
ZW *   Zero Width Space                         ZWSP (U+200B)


Some of the descriptive names are slightly misleading. In particular, "Inseparable" does not mean that the character cannot be separated from other characters by a line break. It only means inseparability from some types of characters.

The property is normative for the following values: BK, CB, CM, CR, GL, H2, H3, JL, JT, JV, LF, NL, SG, SP, WJ, ZW, denoted by an asterisk * in Table 5-9. The property is informative for other values. Basically, normative values must be applied as defined by conforming implementations, if they do line breaking at all, whereas informative properties are just suggested defaults. However, even the normative values can be overridden at levels other than plain text (e.g., by explicit formatting instructions).

Characters with the same LineBreak property value are said to constitute a line-breaking class. Some classes are very small (e.g., the class corresponding to the value SP contains the space character only), because some characters need to have very specific line-breaking behavior.

5.9.5.2. The format of LineBreak.txt

The LineBreak.txt file is of rather simple format, explained on comment lines (starting with #) at the start of the file itself. Each entry consists of one line, containing three fields: Unicode value (code number, four hexadecimal digits); value of the LineBreak property, two characters; and Unicode name, which is purely a comment here, since the code number identifies the character uniquely. Consider the following line:

 00B0;PO # DEGREE SIGN

It says that for the Unicode character U+00B0 (which has the name degree sign), the value of the LineBreak property is PO; i.e., the character belongs to line-breaking class PO (which is, by the way, abbreviated from the word "postfix"; not very mnemonic, is it?).

Character ranges are denoted as in the Unicode database in general. Example:

 4E00..9FBB;ID # <CJK Ideograph, First>..<CJK Ideograph, Last>

This means that all characters between U+4E00 and U+9FBB, inclusively, have the value ID (ideographic) for the LineBreak property.

All code points, assigned and unassigned, that are not listed explicitly are given the value XX.
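A file in this format is easy to read programmatically. The following Python sketch parses such lines and looks up the class of a character; the function names are hypothetical, and the sample consists of the two lines quoted above rather than the real file.

```python
def parse_linebreak_lines(lines):
    """Return a list of (first, last, value) tuples from LineBreak.txt lines."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop the comment part
        if not data:
            continue                           # comment-only or empty line
        codes, value = (field.strip() for field in data.split(";"))
        first, _, last = codes.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), value))
    return ranges


def line_break_class(ch, ranges):
    """Look up the LineBreak value; unlisted code points get XX."""
    cp = ord(ch)
    for first, last, value in ranges:
        if first <= cp <= last:
            return value
    return "XX"


sample = [
    "# comment line",
    "00B0;PO # DEGREE SIGN",
    "4E00..9FBB;ID # <CJK Ideograph, First>..<CJK Ideograph, Last>",
]
table = parse_linebreak_lines(sample)
print(line_break_class("\u00b0", table))  # PO
print(line_break_class("\u4e2d", table))  # ID
print(line_break_class("\ue000", table))  # XX
```

A linear scan is enough for illustration; with the complete file, a sorted list searched by bisection would be a more reasonable choice.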

The following is an extract of LineBreak.txt, covering the printable characters in the ISO Latin 1 range (U+0020 to U+007E and U+00A0 to U+00FF) and excluding letters with AL as the value of the LineBreak property (e.g., basic Latin letters). Note that not all letters have that value and not all characters with that value are letters in a normal sense:

 0020;SP # SPACE
 0021;EX # EXCLAMATION MARK
 0022;QU # QUOTATION MARK
 0023;AL # NUMBER SIGN
 0024;PR # DOLLAR SIGN
 0025;PO # PERCENT SIGN
 0026;AL # AMPERSAND
 0027;QU # APOSTROPHE
 0028;OP # LEFT PARENTHESIS
 0029;CL # RIGHT PARENTHESIS
 002A;AL # ASTERISK
 002B;PR # PLUS SIGN
 002C;IS # COMMA
 002D;HY # HYPHEN-MINUS
 002E;IS # FULL STOP
 002F;SY # SOLIDUS
 0030;NU # DIGIT ZERO
 ...
 0039;NU # DIGIT NINE
 003A;IS # COLON
 003B;IS # SEMICOLON
 003C;AL # LESS-THAN SIGN
 003D;AL # EQUALS SIGN
 003E;AL # GREATER-THAN SIGN
 003F;EX # QUESTION MARK
 0040;AL # COMMERCIAL AT
 005B;OP # LEFT SQUARE BRACKET
 005C;PR # REVERSE SOLIDUS
 005D;CL # RIGHT SQUARE BRACKET
 005E;AL # CIRCUMFLEX ACCENT
 005F;AL # LOW LINE
 0060;AL # GRAVE ACCENT
 007B;OP # LEFT CURLY BRACKET
 007C;BA # VERTICAL LINE
 007D;CL # RIGHT CURLY BRACKET
 007E;AL # TILDE
 00A0;GL # NO-BREAK SPACE
 00A1;AI # INVERTED EXCLAMATION MARK
 00A2;PO # CENT SIGN
 00A3;PR # POUND SIGN
 00A4;PR # CURRENCY SIGN
 00A5;PR # YEN SIGN
 00A6;AL # BROKEN BAR
 00A7;AI # SECTION SIGN
 00A8;AI # DIAERESIS
 00A9;AL # COPYRIGHT SIGN
 00AA;AI # FEMININE ORDINAL INDICATOR
 00AB;QU # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
 00AC;AL # NOT SIGN
 00AD;BA # SOFT HYPHEN
 00AE;AL # REGISTERED SIGN
 00AF;AL # MACRON
 00B0;PO # DEGREE SIGN
 00B1;PR # PLUS-MINUS SIGN
 00B2;AI # SUPERSCRIPT TWO
 00B3;AI # SUPERSCRIPT THREE
 00B4;BB # ACUTE ACCENT
 00B5;AL # MICRO SIGN
 00B6;AI # PILCROW SIGN
 00B7;AI # MIDDLE DOT
 00B8;AI # CEDILLA
 00B9;AI # SUPERSCRIPT ONE
 00BA;AI # MASCULINE ORDINAL INDICATOR
 00BB;QU # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
 00BC;AI # VULGAR FRACTION ONE QUARTER
 00BD;AI # VULGAR FRACTION ONE HALF
 00BE;AI # VULGAR FRACTION THREE QUARTERS
 00BF;AI # INVERTED QUESTION MARK
 00D7;AI # MULTIPLICATION SIGN
 00F7;AI # DIVISION SIGN

5.9.5.3. The formal rules

The line-breaking rules themselves in the standard (in UAX #14) consist of formal rules accompanied by verbal notes. The notes try to explain the content of the rules as well as their motivation, though much of the motivation is explained in the descriptions of the values. The formal rules use values of the LineBreak property to indicate any character with that value and the symbols specified in Table 5-10. There is no particular reason for using these specific symbols, but you may think of ! as commanding (a break), × as joining characters together, and ÷ as permitting division (into lines).

Table 5-10. Operators used in line-breaking rules

Symbol                   Meaning           Example
-----------------------  ----------------  ---------------------------------------
! (exclamation mark)     Mandatory break   LF ! means: always break after linefeed
× (multiplication sign)  No break allowed  × QU means: never break before quote
÷ (division sign)        Break allowed     ZW ÷ means: allow break after ZW


The rules are numbered LB1, LB2, etc., but there are holes in the numbering. When rules are removed in an update to the Unicode standard, the numbering of other rules is kept the same. The rules are in order of priority. The general idea is to specify a set of rules that forbid some line breaks, and then allow everything else (using the special symbol ALL to refer to all characters). This may sound strange, but the rules explicitly forbid line breaks between alphabetic characters, for example. Thus, when checking whether a line break is allowed at a particular point, the final rule that says "break everything else" will be applied rather rarely (for English text, for example).

The rules are presented in Table 5-11. The first column contains the rule number, the second column contains the rule itself, and the third column explains the rule in loose prose, perhaps using just example characters. The symbol "sot" means start of text, and "eot" means end of text. For brevity, the verb "break" alone indicates that a line break is allowed, whereas "always break" means an obligatory line break. A notation of the form (A | B) is used to denote "A or B." An asterisk * after a value indicates that a character in the corresponding class may appear zero or more times.

Table 5-11. Line-breaking rules

Nr.   Formal rule                                                                    Informal description
----  -----------------------------------------------------------------------------  --------------------------------------------------------------
1     Resolve AI, CB, SA, SG, and XX into other line-breaking classes                Use external info to decide what to do with ambiguous classes.
2a    × sot                                                                          No break at the start of text.
2b    ! eot                                                                          Always break at the end of text.
3a    BK !                                                                           Always break after LS or PS.
3b    CR × LF                                                                        No break between CR and LF.
      (CR | LF | NL) !                                                               Always break after CR, LF, and NL.
3c    × (BK | CR | LF | NL) !                                                        No break before hard line break.
4     × (SP | ZW)                                                                    No break before space or ZWSP.
5     ZW ÷                                                                           Break after zero-width space.
7b    Treat X CM* as if it were X, for any class X except BK, CR, LF, NL, SP, or ZW  Bind combining marks with the preceding character.
7c    Treat any remaining CM as if it were AL                                        Treat an isolated combining mark as alphabetic.
8     × (CL | EX | IS | SY)                                                          No break before ), !, ;, /.
9     OP SP* ×                                                                       No break after (, even when it is followed by spaces.
10    QU SP* × OP                                                                    No break between a quote and (, even if spaces intervene.
11    CL SP* × NS                                                                    No break between ) and small kana, even if spaces intervene.
11a   B2 SP* × B2                                                                    No break between em dashes, even if spaces intervene.
11b   × WJ                                                                           No break before word joiner.
      WJ ×                                                                           No break after word joiner.
12    SP ÷                                                                           Break after a space.
13    × GL                                                                           No break before a Glue.
      GL ×                                                                           No break after a Glue.
14    × QU                                                                           No break before a quote.
      QU ×                                                                           No break after a quote.
14a   ÷ CB                                                                           Break before Contingent Break Opp.
      CB ÷                                                                           Break after Contingent Break Opp.
15    × (BA | HY)                                                                    No break before a hyphen.
      × NS                                                                           No break before small kana.
      BB ×                                                                           No break after an acute accent.
16    (AL | ID | IN | NU) × IN                                                       No break between alphabetic etc. and an ellipsis.
17    ID × PO                                                                        No break between ideograph and %.
      AL × NU                                                                        No break between letter and number.
      NU × AL                                                                        No break between number and letter.
18    CL × PO                                                                        No break in )%.
      (HY | IS | NU) × NU                                                            No break in -9 or .9 or 89.
      NU × PO                                                                        No break in 9%.
      PR × (AL | HY | ID | NU | OP)                                                  No break in +a or +-, etc.
      SY × NU                                                                        No break in /9.
18b   JL × (JL | JV | H2 | H3)                                                       No break inside a Korean syllable.
      (JV | H2) × (JV | JT)
      (JT | H3) × JT
18c   (JL | JV | JT | H2 | H3) × (IN | PO)                                           No break between Korean syllable block and "..." or %.
      PR × (JL | JV | JT | H2 | H3)                                                  No break between + and a Korean syllable block.
19    AL × AL                                                                        No break between letters.
19b   IS × AL                                                                        No break in :a.
20    ALL ÷                                                                          Break after anything else.
      ÷ ALL                                                                          Break before anything else.


Note that rules that appear before LB12 and prohibit a line break before a character (e.g., × CL) imply that a break is not allowed even if the character is preceded by a space. This means that if normal text contains a special notation starting with a special character like a period (e.g., "use the .htaccess file"), the rules forbid breaking the text so that the special character appears at the start of a line. The reason is that such rules take precedence over rule LB12, which allows a line break after a space.

As an example of the motivation behind the rules, explained to some extent in UAX #14, consider the following statements there in the description of the class IS:

Characters that usually occur inside a numerical expression may not be separated from the numeric characters that follow, unless a space character intervenes. For example, there is no break in "100.00" or "10,000", nor in "12:59".

Infix separators are sentence ending punctuation when not used in a numeric context. Therefore they always prevent breaks before.

The first statement explains the rule IS × NU, which is part of LB 18. It also explains the name IS, which is short for Infix Separator. The use of IS characters in a quite different meaning, in normal sentence punctuation, is the reason for the rule × IS, which is part of LB 8. There is also a comment saying that rule IS × AL (LB 19b) prevents abbreviations like "e.g." being broken. This may sound complicated, and it really is, because the rules deal with characters with multiple usage, with no way to differentiate between them except coarsely on the basis of adjacent characters.

5.9.5.4. Applying the rules

In principle, it is relatively straightforward to apply the Unicode line-breaking rules. Consider for example the question of whether line breaks are allowed within the string "/%7ej" (without the quotation marks). The LineBreak properties of the characters in the example string can be found in the LineBreak.txt file:

 002F;SY # SOLIDUS
 0025;PO # PERCENT SIGN
 0037;NU # DIGIT SEVEN
 0065;AL # LATIN SMALL LETTER E
 006A;AL # LATIN SMALL LETTER J

Then, taking the characters in order and applying the rules of the Line Breaking Algorithm in order (as they have been specified to apply), we find:

  • Between / and %, a line break is permitted, since no rule forbids it and the last rule LB 20 says "break everywhere else."

  • Between % and 7, a line break is permitted on the same grounds.

  • Between 7 and "e," no line break is allowed, according to rule LB 17: NU × AL.

  • Between "e" and "j," no line break is allowed, according to rule LB 19: AL × AL.

This explains, in part, why you might see printed matter containing a URL like http://www.cs.tut.fi/%7ejkorpela/ divided into lines in an odd way, http://www.cs.tut.fi/% and 7ejkorpela/.

Somewhat surprisingly, a line break is not allowed before / even after a space (rule LB 8). Thus, if you have text like "it's in directory /usr/spool" and you would like to allow a line break after the word "directory," you would need to insert a zero-width space U+200B before the first /. This works because the rule permitting a line break after a zero-width space appears before the other rules discussed here and hence has a higher priority. On the practical side, you may observe the problem in some web browsers, for example, and find that the cure does not work there. Instead of a zero-width space, you could then use the nonstandard HTML tag <wbr> to suggest a permitted line break.
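As a minimal sketch of the workaround just described, the following Python function inserts U+200B before a solidus that directly follows a space. The function name is hypothetical, and the sketch assumes that the software that later formats the text actually honors the zero-width space, which, as noted, is not always the case.

```python
import re

ZERO_WIDTH_SPACE = "\u200b"

def allow_break_before_slash(text):
    """Insert U+200B before every "/" that directly follows a space.

    Hypothetical helper: it only marks the break opportunity; whether the
    break is honored depends on the software that formats the text.
    """
    return re.sub(r"(?<= )/", ZERO_WIDTH_SPACE + "/", text)

sample = "it's in directory /usr/spool"
fixed = allow_break_before_slash(sample)

# Show the result with the invisible character made visible.
print(fixed.encode("unicode_escape").decode("ascii"))
```

Only the solidus after the space is affected; the solidus inside "/usr/spool" is left alone, since a break after / is already permitted there.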

At each step in considering whether a line break is permitted between two consecutive characters A and B, we need to consider the LineBreak properties of both characters. If there is no rule (formulated in terms of the LineBreak property values) that forbids a line break between A and B, or after A in general, or before B in general, then a line break is permitted between them.
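The walk through the string "/%7ej" can be sketched in Python as follows. This is an illustration only, not an implementation of the algorithm: it contains just the few forbidding rules quoted in this section, the rule labels follow the numbering used here (attaching "x SY" to LB 8 is an assumption based on the remark about the solidus), and the names LINE_BREAK_CLASS, FORBIDDING_RULES, break_decision, and explain are hypothetical.

```python
# LineBreak classes of the example characters, as found in LineBreak.txt.
LINE_BREAK_CLASS = {"/": "SY", "%": "PO", "7": "NU", "e": "AL", "j": "AL"}

# Forbidding rules in priority order: (class before, class after, label).
# None matches any class. Only rules quoted in this section are listed.
FORBIDDING_RULES = [
    (None, "IS", "LB 8: x IS"),
    (None, "SY", "LB 8: x SY"),
    ("NU", "AL", "LB 17: NU x AL"),
    ("IS", "NU", "LB 18: IS x NU"),
    ("AL", "AL", "LB 19: AL x AL"),
    ("IS", "AL", "LB 19b: IS x AL"),
]

def break_decision(a, b):
    """Return (allowed, reason) for a break between adjacent characters a and b."""
    before, after = LINE_BREAK_CLASS[a], LINE_BREAK_CLASS[b]
    for rule_before, rule_after, label in FORBIDDING_RULES:
        if rule_before in (None, before) and rule_after in (None, after):
            return False, label
    # No rule forbids the break, so the final rule applies.
    return True, "LB 20: break everywhere else"

def explain(text):
    """List the decision for every pair of consecutive characters in text."""
    return [(a, b) + break_decision(a, b) for a, b in zip(text, text[1:])]

for a, b, allowed, reason in explain("/%7ej"):
    verdict = "break permitted" if allowed else "no break"
    print(f"{a} | {b}: {verdict} ({reason})")
```

The first matching rule decides the outcome, which mirrors the principle that rules specified earlier take priority over later ones.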

5.9.5.5. Pair table implementation

For efficiency reasons, the line-breaking algorithm is usually implemented using table lookup techniques. UAX #14 describes how most of the rules can be implemented using a pair table that tells, for any pair of line-breaking classes, whether a break between two characters is allowed. The table covers not only pairs of adjacent characters but also situations where the two characters have one or more spaces between them. The pair table cannot express all aspects of line-breaking behavior, though.

A pair table, shown as Table 5-12, can also be used for quick checks.

Table 5-12. Pair table for line-breaking behavior
 

OP

CL

QU

GL

NS

EX

SY

IS

PR

PO

NU

AL

ID

IN

HY

BA

BB

B2

ZW

CM

WJ

OP

^

^

^

^

^

^

^

^

^

^

^

^

^

^

^

^

^

^

^

^

^

CL

÷

^

  

^

^

^

^

÷

 

÷

÷

÷

÷

  

÷

÷

^

 

^

QU

^

^

   

^

^

^

          

^

 

^

GL

 

^

   

^

^

^

          

^

 

^

NS

÷

^

   

^

^

^

÷

÷

÷

÷

÷

÷

  

÷

÷

^

 

^

EX

÷

^

   

^

^

^

÷

÷

÷

÷

÷

÷

  

÷

÷

^

 

^

SY

÷

^

   

^

^

^

÷

÷

 

÷

÷

÷

  

÷

÷

^

 

^

IS

÷

^

   

^

^

^

÷

÷

  

÷

÷

  

÷

÷

^

 

^

PR

 

^

   

^

^

^

÷

÷

   

÷

  

÷

÷

^

 

^

PO

÷

^

   

^

^

^

÷

÷

÷

÷

÷

÷

  

÷

÷

^

 

^

NU

÷

^

   

^

^

^

÷

   

÷

   

÷

÷

^

 

^

A L

÷

^

   

^

^

^

÷

÷

  

÷

   

÷

÷

^

 

^

ID

÷

^

   

^

^

^

÷

 

÷

÷

÷

   

÷

÷

^

 

^

IN

÷

^

   

^

^

^

÷

÷

÷

÷

÷

   

÷

÷

^

 

^

HY

÷

^

   

^

^

^

÷

÷

 

÷

÷

÷

  

÷

÷

^

 

^

BA

÷

^

   

^

^

^

÷

÷

÷

÷

÷

÷

  

÷

÷

^

 

^

BB

 

^

   

^

^

^

          

^

 

^

B2

÷

^

   

^

^

^

÷

÷

÷

÷

÷

÷

  

÷

^

^

 

^

ZW

÷

÷

÷

÷

÷

÷

÷

÷

÷

÷

÷

÷

÷

÷

÷

÷

÷

÷

^

÷

÷

CM

÷

^

   

^

^

^

÷

÷

  

÷

   

÷

÷

^

 

^

WJ

 

^

   

^

^

^

          

^

 

^


To check whether a line break is permitted between two characters, you first look up their line-breaking classes and then use the class of the first character as the row index and the class of the second character as the column index into the table. The symbols in the table have the following meanings (note that the notations here differ somewhat from those used in UAX #14):


" " (space, empty cell in the table)

Indirect break opportunity only; this means that no line break is allowed if the characters appear in succession, but a break is allowed if one or more spaces intervene.


÷

A direct break opportunity; i.e., a line break is allowed (even when no space intervenes).


^

A prohibited break, in the sense that no break is allowed even if spaces intervene; formally, A^B means A SP* x B, where SP is the space character.

Line-breaking behavior for pairs involving the following line-breaking classes must be resolved outside the pair table: AI, BK, CB, CR, LF, NL, SA, SG, SP, and XX.

For example, suppose you need to check whether a line break is permitted between an exclamation mark ! (U+0021) and a horizontal ellipsis … (U+2026). You would first find their line-breaking classes, EX and IN, from the Unicode database file or some other source. Then you would find the row EX and the cell in column IN on that row, and find ÷, which means that a line break is permitted.
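The lookup can be sketched in Python as follows. The rows are transcribed from Table 5-12, with "." standing in for an empty cell so that the cells stay visible in code; the names COLUMNS, PAIR_TABLE, pair_action, and break_allowed are hypothetical, and the sketch does not handle the classes that must be resolved outside the pair table.

```python
COLUMNS = "OP CL QU GL NS EX SY IS PR PO NU AL ID IN HY BA BB B2 ZW CM WJ".split()

# Rows of Table 5-12: "^" prohibited, "÷" direct, "." empty cell (indirect).
# Cells are grouped only for readability; the grouping has no meaning.
ROWS = """
OP ^^ ^^^ ^^^ ^^^^^^ ^^ ^^ ^^^
CL ÷^ ..^ ^^^ ÷.÷÷÷÷ .. ÷÷ ^.^
QU ^^ ... ^^^ ...... .. .. ^.^
GL .^ ... ^^^ ...... .. .. ^.^
NS ÷^ ... ^^^ ÷÷÷÷÷÷ .. ÷÷ ^.^
EX ÷^ ... ^^^ ÷÷÷÷÷÷ .. ÷÷ ^.^
SY ÷^ ... ^^^ ÷÷.÷÷÷ .. ÷÷ ^.^
IS ÷^ ... ^^^ ÷÷..÷÷ .. ÷÷ ^.^
PR .^ ... ^^^ ÷÷...÷ .. ÷÷ ^.^
PO ÷^ ... ^^^ ÷÷÷÷÷÷ .. ÷÷ ^.^
NU ÷^ ... ^^^ ÷...÷. .. ÷÷ ^.^
AL ÷^ ... ^^^ ÷÷..÷. .. ÷÷ ^.^
ID ÷^ ... ^^^ ÷.÷÷÷. .. ÷÷ ^.^
IN ÷^ ... ^^^ ÷÷÷÷÷. .. ÷÷ ^.^
HY ÷^ ... ^^^ ÷÷.÷÷÷ .. ÷÷ ^.^
BA ÷^ ... ^^^ ÷÷÷÷÷÷ .. ÷÷ ^.^
BB .^ ... ^^^ ...... .. .. ^.^
B2 ÷^ ... ^^^ ÷÷÷÷÷÷ .. ÷^ ^.^
ZW ÷÷ ÷÷÷ ÷÷÷ ÷÷÷÷÷÷ ÷÷ ÷÷ ^÷÷
CM ÷^ ... ^^^ ÷÷..÷. .. ÷÷ ^.^
WJ .^ ... ^^^ ...... .. .. ^.^
"""

PAIR_TABLE = {}
for line in ROWS.strip().splitlines():
    label, *groups = line.split()
    PAIR_TABLE[label] = "".join(groups)

# Sanity check on the transcription: one cell per column in every row.
assert all(len(cells) == len(COLUMNS) for cells in PAIR_TABLE.values())

MEANING = {"÷": "direct", ".": "indirect", "^": "prohibited"}

def pair_action(before, after):
    """Look up the cell in row `before` (first class), column `after` (second class)."""
    return MEANING[PAIR_TABLE[before][COLUMNS.index(after)]]

def break_allowed(before, after, spaces_between=False):
    """Apply the three cell meanings, given whether spaces intervene."""
    action = pair_action(before, after)
    if action == "direct":
        return True
    if action == "indirect":
        return spaces_between
    return False

print(pair_action("EX", "IN"))  # the exclamation mark and ellipsis example
```

Storing each row as a string keeps the data close to the printed table, which makes a transcription error easier to spot than in a nested list or dictionary.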

5.9.5.6. Tailoring

Unicode line-breaking rules can cause highly undesirable line breaks or prevent quite adequate line breaks. The rules have been written mainly with "normal" text in mind. For specialties like technical notations and mathematical expressions, the rules may even produce ridiculous results. Even relatively normal expressions often cause problems if they are short and contain nonalphabetic characters. For example, "c/o" can be broken into "c/" and "o" by the algorithm, since a break after / is generally permitted.

UAX #14 allows tailoring, as long as the normative values of LineBreak are implemented as defined. Examples of tailoring include changing the line-breaking class of some characters, adding rules, and modifying rules. It is also possible to try to recognize some special constructs, like URLs or mathematical formulas, and process them by quite different rules.

In a program designed for displaying text written in a script that uses spaces between words, probably the most useful simple tailoring would consist of a rule that prevents breaking a very short "word" or separating just one or two characters from a "word." Here "word" means just a string of characters not containing any space or other whitespace. Most people probably would not like to see "w/o" divided at all, or "Formula/X" divided after the "/" even if they might agree on treating "/" as a break opportunity in general. The implementation of such restrictions requires an approach that does not work just at the level of character pairs, of course.
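A tailoring of this kind might be sketched in Python as a filter applied to the break opportunities that the pair-level rules have already found inside one "word." The function name and both thresholds are assumptions chosen for illustration; they are not taken from UAX #14.

```python
def tailor_breaks(word, candidate_offsets, min_word_len=6, min_piece_len=3):
    """Filter break offsets inside one whitespace-free "word".

    candidate_offsets are positions where the pair-level rules would allow a
    break (offset i means a break between word[i-1] and word[i]).
    Hypothetical thresholds: words shorter than min_word_len are never
    broken, and no break may leave fewer than min_piece_len characters
    on either side.
    """
    if len(word) < min_word_len:
        return []
    return [
        offset
        for offset in candidate_offsets
        if min_piece_len <= offset <= len(word) - min_piece_len
    ]

# A break after "/" is a candidate in each of these words.
for word in ("w/o", "Formula/X", "input/output"):
    candidates = [word.index("/") + 1]
    print(word, candidates, "->", tailor_breaks(word, candidates))
```

With these thresholds, "w/o" is never broken and "Formula/X" is not broken after the solidus, whereas a break in the middle of "input/output" survives.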

5.9.5.7. Some background and criticism

Years ago, the Internet Explorer 4 browser was observed to use very strange line breaks, e.g., breaking a string like "a-b" to "a-" at the end of a line and "b" at the start of the next line, or even breaking "-b" to "-" and "b." Newer versions of the browser have exhibited the problems in varying forms, with no documentation. This has caused a lot of frustration. Attempts to use characters for explicit line break control, such as word joiner, were generally unsuccessful, due to lack of support for such characters. Other browsers have had similar problems to some extent. For details on the problems, please refer to the web page http://www.cs.tut.fi/~jkorpela/html/nobr.html.

The problem is that web browsers may apply Unicode line-breaking rules blindly, indiscriminately. Although breaking after a hyphen-minus "-" is very often suitable, perhaps even desirable, it is absurd to apply the rule permitting it to a very short string like "a-b." Unfortunately, the Unicode standard does not mention such considerations.

The Unicode line-breaking rules are largely based on estimates (or maybe just guesses) of the suitability of line breaks before or after a character or between two characters. It may well be that in most cases, a line break after the / character is acceptable whereas a line break before it is not. But rules formulated this way forbid quite a few perfectly reasonable line breaks, like before the solidus in "it's in directory /usr/spool," and allow some really absurd breaks, like after either solidus in our example or in "c/o."

It is very confusing to see a string broken into lines just because some mechanical rules have allowed it. A string that is a mixture of Latin letters, digits, and various symbols is most probably part of some special notation, such as a URL, or a variable in a programming language, or some code. It is unacceptable to have a string like foo%bar broken, especially when it occurs with no indication of what has happened. It can even distort information or corrupt data. For example, if you write about a programming language that uses the character % at the start of variable names, you will not be pleased by line-breaking rules that break "%foo" into "%" and "foo" on separate lines.

"Customization" or "tailoring" can be used to solve such problems, when you design or modify software. This does not help against more or less general purpose software that implements just the Unicode line-breaking algorithm and cannot realistically be modified to suit the needs of different types of text.

It would probably be best to remove all prohibitions against line breaks after spaces. After all, the no-break space can be used instead of a normal space in such cases, or language-specific higher-level protocols can be applied (e.g., to prevent line breaks in French text between a space and a question mark).

Rules for breaking things like URLs (when they appear as text) across lines, to avoid overlong lines, belong to higher protocol levels. It is probably best to override Unicode line-breaking rules in such situations, using a few carefully selected principles, such as breaking primarily after /, ?, or &, and never after - (to avoid ambiguity on whether the hyphen-minus is part of the URL).
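The principles just mentioned could be sketched in Python as follows. This is one possible reading, not an established algorithm: the function names are hypothetical, and skipping the position between two consecutive solidi (as in "//") is an extra assumption added here.

```python
def url_break_offsets(url):
    """Return offsets where a URL printed as text may be broken.

    Breaks are offered only after "/", "?", or "&", never after "-" or any
    other character. A position between two consecutive solidi is skipped
    (an added assumption), and no break is offered at the very end.
    """
    offsets = []
    for i, ch in enumerate(url[:-1]):
        if ch in "/?&" and url[i + 1] != "/":
            offsets.append(i + 1)
    return offsets

def split_url(url):
    """Cut the URL at every permitted offset, to show the possible pieces."""
    bounds = [0] + url_break_offsets(url) + [len(url)]
    return [url[a:b] for a, b in zip(bounds, bounds[1:])]

example = "http://example.com/a-b/c?x=1&y=2"
print(url_break_offsets(example))
print(split_url(example))
```

A formatter would choose among these offsets according to the available line width; the point of the sketch is only that no piece can end in a hyphen-minus.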

Any breaking of a URL across several lines should be accompanied by the use of suitable delimiters, as recommended in Appendix C of RFC 3986. It recommends surrounding a URL with whitespace (spaces or line breaks) or, when used in text, enclosing a URL in quotation marks or between the characters < and >. Many publishers use yet another method: they print a URL in a font that differs from the normal font. Similar considerations can be applied to strings other than URLs: avoid breaking a string without indicating somehow that the parts belong together.



Unicode Explained
ISBN: 059610121X
Year: 2006
Pages: 139
