Section 5.8. Directionality



You might never encounter problems with directionality, if all texts you work with are written exclusively left to right. But when you need to work with other texts, you may have considerable conceptual problems. Therefore, in order to be prepared to meet such problems, it is useful to gain some basic understanding of directionality. Moreover, the topic is culturally interesting in itself. People may think that left-to-right writing is the only possibility, or the only natural way, and consequently think that right-to-left writing is unnatural or wrong.

Directionality as discussed here deals with horizontal writing direction. Vertical writing (i.e., writing a text line from top to bottom or from bottom to top) is a different issue and is handled outside Unicode, although it has some implications for Unicode, as we'll see in Chapter 7.

5.8.1. Writing Direction of Text

Right-to-left writing is older than left-to-right writing. Hieroglyphs and the oldest Greek were mostly written right to left, though alternating direction was used, too. The Arabic and Hebrew scripts have preserved the original direction. The Greek script changed direction, and the Latin writing system was derived from a version of the Greek writing system that had already established the left-to-right writing direction.

Figure 5-7. Progress or regress?


The writing system that we learn in our childhood and use throughout our life makes us think that things progress in a particular way in the horizontal direction. If we write left to right, we think that movement rightward means progression in time: given "AB," we think that "A" is before "B" in a natural order of things.

Even if we consider purely graphic presentation, our built-in way of reading left to right or right to left will affect our understanding. If you think left to right, you probably see Figure 5-7 as indicating growth of some kind. If you are accustomed to reading right to left, you might see a decrease there. Naturally, the interpretation depends on the context, such as the presence of left-to-right or right-to-left text. Presented in isolation, graphic presentation can be thoroughly misunderstood: your attempt to describe "before" and "after" situations might be read the other way around.

When you read about writing direction, you probably read about it in English, or in another language written left to right. Therefore, the mental model of "natural" writing order is in front of your eyes even in texts that try to give you a broader view. When I say that, in Hebrew, the letters alef (aleph) א and bet (beth) ב in that order are written as אב, the explanation is confusing. The letters have already appeared in the English sentence in a particular order horizontally, so now it looks like the order was reversed.

I wrote אב in MS Word by using the Insert Characters command, first selecting א from the Character Map, and then selecting ב. Word displays the combination as אב, since it knows that these characters must be written right to left. Things would have gone all wrong if I had thought that

Using Unicode, you type characters in the logical order, i.e., in the order in which they would be mentioned if words were spelled out. In simple cases, properties of characters and the software you use take care of writing direction. When punctuation and special characters intervene, you may need to add control characters to make writing directions correct.
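As a small illustration of logical order, the following Python sketch stores the alef-bet example from above and lists the characters in the order they sit in memory. It uses only the standard unicodedata module; the variable names are chosen for illustration.

```python
import unicodedata

# Alef followed by bet, stored in logical (spelling) order.
word = "\u05D0\u05D1"

# Index 0 is alef, the letter read first, even though a bidi-aware
# display draws it at the right end of the word.
names = [unicodedata.name(ch) for ch in word]
print(names)  # ['HEBREW LETTER ALEF', 'HEBREW LETTER BET']
```

The string itself carries no direction information; where each letter is drawn is decided when the text is rendered.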


5.8.2. Bidirectionality

Using exclusively left-to-right writing, or exclusively right-to-left writing, is relatively simple to handle. When you have a document that contains, say, both English and Arabic, it becomes challenging to deal with changes in writing direction. Problems arise, among other things, from punctuation characters that are used in different writing systems and therefore need to have their directionality set by the context. The Unicode way of handling such things is described in Unicode Standard Annex #9, "The Bidirectional Algorithm," http://www.unicode.org/reports/tr9/.

5.8.3. Directionality and Character Codes

There are several ways to deal with directionality in an environment where both writing directions may appear:

  • Specify that the content of a file or string is always to be written left to right, and arrange things so that characters appear in an order suitable for that. This is sometimes applied to Hebrew texts: a file contains the characters in a completely reversed order, but when they are written left to right, the order becomes correct (for reading right to left). Confusingly, data is then said to be coded in "visual order," implying eye movement from left to right!

  • Indicate the direction explicitly with invisible control characters. At the simplest, one control character says that subsequent characters are to be written left to right, and another control character switches the direction to the opposite.

  • Assign inherent directionality to characters.

Although the first approach may look most natural and the third one most complicated, Unicode uses inherent directionality, enhanced with the possibility of using control characters for exceptions.
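To make the difference between "visual order" and logical order concrete, here is a minimal Python sketch. The function name visual_to_logical is hypothetical, and the sketch assumes a single line that consists solely of right-to-left letters; text with digits, Latin words, or combining marks would need considerably more care than a plain reversal.

```python
def visual_to_logical(line: str) -> str:
    """Reverse one line that was stored in left-to-right "visual order".

    Assumption: the line contains only right-to-left letters.
    """
    return line[::-1]


logical = "\u05D0\u05D1"  # alef, bet: the order in which the word is read
visual = "\u05D1\u05D0"   # the same word stored for strict left-to-right output

print(visual_to_logical(visual) == logical)  # True
```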

5.8.4. Directionality of Characters

The Unicode approach to writing direction is based on the inherent directionality defined for each character. For example, Latin and Greek letters have inherent left-to-right directionality; Hebrew and Arabic letters have inherent right-to-left directionality. Programs that display text need not know the language, at least not for directionality purposes. They use the much more technical information that is contained in the Unicode database: a table that assigns directionality to each character.

It would even be incorrect to deduce directionality from language information. The HTML specification explicitly says that browsers must not deduce the directionality of text from its declared language (the lang or xml:lang attribute, if present). This is natural if you think about transliterated Arabic, for example. When an Arabic word is written in Latin letters according to some transliteration scheme, it is still an Arabic word, but it is to be written left to right (except in some scientific contexts).
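The per-character table can be queried directly in many environments. The following Python sketch uses the standard unicodedata module, which exposes the directionality property as short codes; the helper name bidi_classes and the transliterated sample word are assumptions made for illustration.

```python
import unicodedata


def bidi_classes(text):
    """Return the directionality code of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Latin A, Hebrew alef, Arabic alef
print(bidi_classes("A\u05D0\u0627"))  # ['L', 'R', 'AL']

# A transliterated Arabic word consists of left-to-right letters only,
# whatever language the text is declared to be in.
print(set(bidi_classes("maghrib")))   # {'L'}
```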

Figure 5-8. English text with an Arabic word in it, with reading directions marked with arrows


The directionality issue is complicated by the use of directionally neutral characters. Some punctuation and other characters are used both in left-to-right and in right-to-left writing. Further complication is caused by merging texts written in different directions. For example, an English document may quote some Arabic in Arabic writing, or Arabic text may contain English words in Latin letters. The reader is assumed to be able to change the reading direction: when she sees some Arabic writing like مغرب, she jumps to the right end of that part of the text, reads leftward, and then jumps back to the right over the Arabic writing she had read. This is illustrated in Figure 5-8.

To deal with the complications, the directionality property of a character has several possible values, not just two. These values classify characters so that the bidirectional algorithm can handle most situations well. Officially, the directionality property is called BiDi Class, referring to bidirectionality.

The values are listed in Table 5-6, where the second column shows whether each value has strong (S), weak (W), or neutral (N) directionality.

Table 5-6. BiDi Class values (directionality property values)

Code | Type | Name                       | Characters that have this value
-----|------|----------------------------|------------------------------------------
AL   | S    | Arabic Letter              | Arabic, Syriac, and Thaana letters, etc.
AN   | W    | Arabic Number              | Arabic-Indic digits, etc.
B    | N    | Paragraph Separator        | Line feed, carriage return, etc.
BN   | W    | Boundary Neutral           | Most formatting and control characters
CS   | W    | Common Number Separator    | Comma, full stop, colon, NBSP, etc.
EN   | W    | European Number            | European and some other digits
ES   | W    | European Number Separator  | Plus sign, minus sign, hyphen-minus, etc.
ET   | W    | European Number Terminator | Degree sign, currency symbols, etc.
L    | S    | Left-to-Right              | Most letters, ideographs, etc., and LRM
LRE  | S    | Left-to-Right Embedding    | LRE
LRO  | S    | Left-to-Right Override     | LRO
NSM  | W    | Non-Spacing Mark           | Characters in General Category Mn or Me
ON   | N    | Other Neutrals             | Characters not belonging to other classes
PDF  | W    | Pop Directional Format     | PDF
R    | S    | Right-to-Left              | Hebrew letters etc., and RLM
RLE  | S    | Right-to-Left Embedding    | RLE
RLO  | S    | Right-to-Left Override     | RLO
S    | N    | Segment Separator          | Horizontal tab
WS   | N    | Whitespace                 | Spaces, form feed, etc.
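The strong/weak/neutral grouping of Table 5-6 can be expressed as a simple lookup. The sketch below is an illustration only: the names STRENGTH and strength are hypothetical, the mapping is copied from the table, and classes that are not in the table (later Unicode versions define more) are reported as "?".

```python
import unicodedata

# Strength of each BiDi Class value, copied from Table 5-6.
STRENGTH = {
    "L": "S", "R": "S", "AL": "S",
    "LRE": "S", "LRO": "S", "RLE": "S", "RLO": "S",
    "EN": "W", "ES": "W", "ET": "W", "AN": "W",
    "CS": "W", "NSM": "W", "BN": "W", "PDF": "W",
    "B": "N", "S": "N", "WS": "N", "ON": "N",
}


def strength(ch):
    """Return "S", "W", or "N" for one character, or "?" if not in the table."""
    return STRENGTH.get(unicodedata.bidirectional(ch), "?")


for ch in "a 1,!":
    print(repr(ch), unicodedata.bidirectional(ch), strength(ch))
```

For the sample string this shows one strong character (the letter), two weak ones (the digit and the comma), and two neutral ones (the space and the exclamation mark).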


5.8.5. Control Characters for Directionality

There is a small collection of control characters that affect directionality. They are not needed in pieces of text that contain only left-to-right characters (e.g., Latin letters) or only right-to-left characters (e.g., Arabic letters). These characters have been placed into the General Punctuation block, somewhat illogically, since they are not visible punctuation marks but invisible controls. Due to their meaning, these characters should be ignored in any processing except visual rendering.

The characters are presented in Table 5-7, along with their HTML and CSS equivalents, to be discussed shortly.

Table 5-7. Control characters for directionality

Code   | Name                       | Abbr. | Entity | HTML            | CSS
-------|----------------------------|-------|--------|-----------------|----------------------------------------------
U+200E | Left-to-right mark         | LRM   | &lrm;  |                 | direction: ltr;
U+200F | Right-to-left mark         | RLM   | &rlm;  |                 | direction: rtl;
U+202A | Left-to-right embedding    | LRE   |        | dir="ltr"       | unicode-bidi: embed; direction: ltr;
U+202B | Right-to-left embedding    | RLE   |        | dir="rtl"       | unicode-bidi: embed; direction: rtl;
U+202D | Left-to-right override     | LRO   |        | <bdo dir="ltr"> | unicode-bidi: bidi-override; direction: ltr;
U+202E | Right-to-left override     | RLO   |        | <bdo dir="rtl"> | unicode-bidi: bidi-override; direction: rtl;
U+202C | Pop directional formatting | PDF   |        | Suitable end tag |


The left-to-right mark and the right-to-left mark set the directionality for preceding and following characters with weak or neutral directionality. Thus, you cannot change the writing direction of a string like "ABC" with these marks. Technically, these marks are zero-width (i.e., invisible) characters with strong directionality.

The override characters, left-to-right override (LRO) and right-to-left override (RLO), have a stronger effect. They affect the directionality of all characters, up to the next override or embedding or pop directional formatting (PDF) character. Thus overriding any natural directionality, they can be used even to make normal English text run right to left. The LRO character or the corresponding markup is needed for "visual Hebrew" (i.e., Hebrew written backward) so that modern programs still show it the intended way.

The embedding characters, left-to-right embedding (LRE) and right-to-left embedding (RLE), start and end a new level in directionality, in the following sense: Text between an LRE and a PDF, or between an RLE and a PDF, is treated as embedded (as a whole) inside the surrounding text. Embedding can be nested: you can, for example, have English text with an embedded Arabic quotation, which contains an embedded English word.

The pop directional formatting (PDF) character acts as a closing symbol that terminates the effect of preceding and matching LRO, RLO, LRE, or RLE.
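The following Python sketch shows one way to work with these control characters in code: wrapping a string in an override pair, and removing the controls again for processing that is not visual rendering, as recommended at the start of this section. The constant and function names are hypothetical; only the code points and the unicodedata calls come from Unicode and the standard library.

```python
import unicodedata

LRM, RLM = "\u200E", "\u200F"
LRE, RLE, PDF = "\u202A", "\u202B", "\u202C"
LRO, RLO = "\u202D", "\u202E"
CONTROLS = {LRM, RLM, LRE, RLE, PDF, LRO, RLO}


def override_rtl(text):
    """Wrap text so that a bidi-aware renderer draws it right to left."""
    return RLO + text + PDF


def strip_controls(text):
    """Drop the directionality controls, e.g., before comparing strings."""
    return "".join(ch for ch in text if ch not in CONTROLS)


forced = override_rtl("ABC")
print([unicodedata.bidirectional(ch) for ch in forced])
# ['RLO', 'L', 'L', 'L', 'PDF']
print(strip_controls(forced))  # ABC
```

Note that the stored order of "ABC" is unchanged; the override only tells the renderer how to lay the characters out.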

5.8.6. Bidi Mirroring

Many characters can appear both in left-to-right and in right-to-left writing but with different glyphs depending on the writing direction. For example, the greater-than sign > points to the smaller of its operands, when seen as an arrowhead. To preserve this relationship in right-to-left writing, the character must be displayed mirrored. A glyph for the less-than sign < can be used for this.

For example, consider an expression like "a > b" when written in Hebrew text and using Hebrew letters instead of "a" and "b." If you type the Hebrew letter alef א, the greater-than sign >, and the Hebrew letter bet ב, you should get the visual appearance א>ב assuming that the program supports the Unicode bidirectional algorithm or the writing direction has been explicitly set to right to left. The character that looks like a less-than sign there is still the greater-than sign; it just has a mirrored glyph.

In many cases, mirroring can be described superficially as a character-level correspondence. We can say that > and < correspond to each other in mirroring, and so do "(" and ")." However, it is really a glyph-level correspondence: the rendering engine just uses a normal (i.e., normal in left-to-right writing) glyph for a character to render another character.

Not all mirrored characters can be rendered by "borrowing" a glyph from another character. For some characters, such as angle (U+2220), a separate mirrored glyph needs to be used. This means that an implementation that supports the character and supports both writing directions must have two different glyphs for it. However, this is not common in practice, since it requires both adequate programming and a suitable font using advanced font technology.

The Bidi Mirrored property is a normative binary (yes/no) property that simply tells whether a character is mirrored or not. The Bidi Mirroring Glyph property is an informative property that suggests, for many mirrored characters, a character whose glyph might be used to render the mirrored character in right-to-left writing. For example, the following lines in the file BidiMirroring.txt suggest that a glyph for ) can be used to render ( as mirrored, and vice versa:

 0028; 0029 # LEFT PARENTHESIS
 0029; 0028 # RIGHT PARENTHESIS
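As a sketch of how both properties might be read in a program, the Python code below parses lines in the format just shown and also queries the Bidi Mirrored property through the standard unicodedata module. The function name parse_mirroring and the embedded SAMPLE text are assumptions for illustration; a real program would read the full BidiMirroring.txt file instead.

```python
import unicodedata

# Two data lines in the BidiMirroring.txt format quoted above.
SAMPLE = """\
0028; 0029 # LEFT PARENTHESIS
0029; 0028 # RIGHT PARENTHESIS
"""


def parse_mirroring(text):
    """Map each character to the character whose glyph may be borrowed."""
    pairs = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        source, target = (field.strip() for field in data.split(";"))
        pairs[chr(int(source, 16))] = chr(int(target, 16))
    return pairs


pairs = parse_mirroring(SAMPLE)
print(pairs)  # {'(': ')', ')': '('}

# Bidi Mirrored is a yes/no property: 1 for mirrored characters, 0 otherwise.
print(unicodedata.mirrored("("), unicodedata.mirrored("A"))  # 1 0
print(unicodedata.mirrored("\u2220"))  # 1 (angle is mirrored, too)
```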

5.8.7. Directionality in HTML and CSS

We will briefly discuss directionality in HTML, since the topic is often neglected, or presented incorrectly, in HTML material. In HTML authoring, you have three ways to affect directionality:

  • Insert Unicode control characters; this is discouraged in UTR #20 (see Chapter 9) except for the left-to-right mark and the right-to-left mark (which you can write in different ways, including the entity references &lrm; and &rlm;).

  • Insert HTML markup to indicate directionality (dir attribute for setting directionality inside an element, and bdo element for bidirectional override).

  • Use Cascading Style Sheets (CSS) rules (direction and unicode-bidi properties) on a suitable markup element, introducing extra markup for that if needed.

For example, to make the string "ABC" written right to left, you could use any one of the following constructs in HTML (where the last one uses essentially CSS):

 &#x202e;ABC&#x202c;
 <bdo dir="rtl">ABC</bdo>
 <span style="unicode-bidi: bidi-override; direction: rtl;">ABC</span>

Setting the dir attribute in HTML for any element but bdo corresponds to using an embedding character (LRE or RLE), as shown in Table 5-7, and therefore does not affect characters with strong directionality. Thus, <span dir="rtl">ABC</span> would appear as ABC. Similarly, the direction property in CSS does not override natural strong directionality, unless the unicode-bidi property is set to bidi-override as well.

HTML specifications explicitly warn that the declaration of the language used in a document, via lang or xml:lang attribute, shall not set directionality. The overall default in HTML is left-to-right directionality. Thus, a document in Arabic should normally have <html dir="rtl"> as its first tag. Using the attribute lang="ar" there as well can be useful for other purposes, but it does not set directionality.

Web browsers, especially Internet Explorer, have flaws in directionality features. For example, text that contains only right-to-left characters and neutral characters should be displayed correctly without any extra markup, but this does not always happen. Using logically redundant markup with dir attributes may help.

HTML authors who create right-to-left or mixed-direction content should use dir attributes even in contexts where they are not required by the specifications.
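When markup is generated by a program, a dir attribute can be added automatically. The Python sketch below is one possible heuristic, not something prescribed by the HTML specifications discussed here: it picks the direction of the first strongly directional letter and falls back to left to right, the HTML default. The function names suggest_dir and wrap are hypothetical.

```python
import html
import unicodedata


def suggest_dir(text):
    """Return "rtl" or "ltr" based on the first strongly directional letter."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            return "rtl"
        if bidi == "L":
            return "ltr"
    return "ltr"  # no strong character found: use the overall HTML default


def wrap(text, tag="p"):
    """Emit an element with an explicit, logically redundant dir attribute."""
    return f'<{tag} dir="{suggest_dir(text)}">{html.escape(text)}</{tag}>'


print(suggest_dir("\u0645\u063A\u0631\u0628"))  # rtl
print(suggest_dir("123 abc"))                   # ltr
```

For mixed-direction content the first strong character may not reflect the intended base direction, so the result should be treated as a default that an author can override.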


For additional explanations, examples, and advice, please consult Andreas Prilop's "Bidirectional text" at http://www.unics.uni-hannover.de/nhtcapri/bidirectional-text.html.

5.8.8. Directionality of Formatting

The dir attribute in HTML and the direction property in CSS should be used with caution, since they affect more than the directionality of characters. They also affect the direction of document formatting. This is natural in many ways. If you have, say, a bulleted list, then the bullet should be placed near the start of each item. If the character directionality in the items is right to left, this means that the bullets should appear on the right and the text should be right-aligned. Setting directionality in HTML or CSS affects the following:

  • Writing direction of text as just described

  • The layout direction of blocks that appear side by side

  • The layout direction of columns in tables

  • The direction of horizontal overflow, when content does not fit into its block

  • The default value of horizontal alignment of text lines (the align attribute in HTML, the text-align property in CSS)

  • The alignment of the last line of a block of text that is justified on both sides



Unicode Explained
ISBN: 059610121X
Year: 2006
Pages: 139
