
8.3. Latin-1 Supplement (ISO 8859-1)

The Latin-1 Supplement block in Unicode is the same as the upper half of ISO 8859-1. In ISO 8859-1, these characters are those that have the most significant bit set, i.e., characters in code positions from 128 to 255 in decimal. This corresponds to the range U+0080 to U+00FF in Unicode.
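
The correspondence described above can be checked with a short Python sketch. It relies only on Python's built-in "latin-1" codec; the variable names are illustrative and not taken from any standard.

```python
# Octets with the most significant bit set: 128..255 in decimal.
upper_half = range(0x80, 0x100)

# Decode each octet as ISO 8859-1 and compare with the Unicode
# character that has the same number as its code point.
mismatches = [b for b in upper_half if bytes([b]).decode("latin-1") != chr(b)]

print(mismatches)                          # an empty list means no exceptions
print(all(b & 0x80 for b in upper_half))   # the high bit is set in every octet
```

An empty list of mismatches indicates that, in this codec at least, an ISO 8859-1 octet and the Unicode code point of the same number denote the same character.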

Like the ASCII repertoire, the Latin-1 Supplement contains a mixture of characters for historical reasons, sometimes for no good reason. This means that many of the characters in it would belong to other blocks if blocks were formed purely according to the meanings of characters. For example, the multiplication sign × would really belong to the Mathematical Operators block. However, Unicode was designed to preserve all the code point assignments in ISO 8859-1.

While all printable ASCII characters have found some widespread use, at least in specialized notations, many of the Latin-1 Supplement characters have very little use. There was less need for assigning arbitrary meanings to characters. You will hardly find any use for a character like the broken bar ¦, for example.

The Latin-1 Supplement was designed to cover the needs of most languages spoken in Western or Northern Europe. These languages use the Latin alphabet as the basis but also contain various diacritical marks and a few extra letters. In its repertoire of punctuation characters, the Latin-1 Supplement is illogical: it contains, for example, the chevrons (« and ») but not the "smart quotes" used in English. However, it can be argued that the ASCII quotation mark can be reasonably used as a substitute for smart quotes but not for chevrons.

8.3.1. Diacritic Marks and Letters with Them

The Latin-1 Supplement contains Latin letters with diacritic marks as used in languages of Western and Northern Europe. It covers only a small fraction of all such characters. As we can see from Table 8-4, the characters do not constitute a systematic grid. Even more unsystematically, the uppercase form of ÿ (Ÿ, U+0178) does not belong to the Latin-1 Supplement.

Table 8-4. The unsystematic grid of diacritic marks in the Latin-1 Supplement

Lowercase letters

Grave  Acute  Circumflex  Tilde  Dieresis  Ring  Cedilla
à      á      â           ã      ä         å
è      é      ê                  ë
ì      í      î                  ï
ò      ó      ô           õ      ö
ù      ú      û                  ü
       ý                         ÿ
                          ñ
                                                 ç
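
The remark about the uppercase form of ÿ can be checked with a small Python sketch. It assumes only the standard str.upper method; the helper in_latin1_supplement is a hypothetical name introduced here for the block test.

```python
def in_latin1_supplement(ch: str) -> bool:
    """Hypothetical helper: True if ch lies in the block U+0080..U+00FF."""
    return 0x80 <= ord(ch) <= 0xFF

# Lowercase letters with diacritic marks in the Latin-1 Supplement.
lower = "àáâãäåèéêëìíîïòóôõöùúûüýÿñç"

# Collect the letters whose uppercase form falls outside the block.
outside = [c for c in lower if not in_latin1_supplement(c.upper())]

print(outside)
print(f"U+{ord('ÿ'.upper()):04X}")
```

Under these assumptions, ÿ is the only letter reported, and its uppercase form is printed as U+0178.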


The use of diacritic marks is strongly language-dependent. It will be discussed later in the section "Diacritic Marks" (where we mention some additional marks, too).

The following characters are spacing clones of diacritic marks, and they have very little use as characters (see notes in the section "Diacritic Marks" later in this chapter):

  • Acute accent ´ (U+00B4), which is a clone of the combining acute accent U+0301.

  • Cedilla ¸ (U+00B8), which is a clone of the combining cedilla U+0327.

  • Dieresis ¨ (U+00A8), which is a clone of the combining dieresis U+0308.

  • Macron ¯ (U+00AF), which is a clone of the combining macron U+0304.
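
The relation between the spacing characters listed above and their combining counterparts is recorded in the Unicode character database as a compatibility decomposition. The following Python sketch inspects it with the standard unicodedata module; the dictionary name spacing_clones is illustrative.

```python
import unicodedata

# Spacing character -> the combining mark it is a clone of (from the list above).
spacing_clones = {
    "\u00b4": "\u0301",  # acute accent -> combining acute accent
    "\u00b8": "\u0327",  # cedilla      -> combining cedilla
    "\u00a8": "\u0308",  # dieresis     -> combining dieresis
    "\u00af": "\u0304",  # macron       -> combining macron
}

for spacing, combining in spacing_clones.items():
    decomposed = unicodedata.normalize("NFKD", spacing)
    # Each spacing clone decomposes to SPACE followed by the combining mark.
    print(unicodedata.name(spacing), decomposed == " " + combining)
```

One practical consequence is that compatibility normalization (NFKD or NFKC) replaces these characters with a space plus a combining mark, which may or may not be what an application wants.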

The acute accent is often used as an apostrophe (e.g., "John´s"), since it resembles a typographically correct apostrophe more than the ASCII apostrophe does. Such usage may confuse both human readers and computer programs.

The macron occasionally has some special uses. The Unicode standard mentions "overline" and "APL overbar" as synonyms for this character. Consecutive macrons connect in many fonts, so the character can be used to create a long line ( ¯¯¯¯¯ ).

8.3.2. Other Letters

The feminine ordinal indicator ª (U+00AA) and the masculine ordinal indicator º (U+00BA) can be regarded as letters, too. These characters are defined as compatibility characters that are equivalent to letters "a" and "o" in superscript style, but they are meant to be used in specific contexts only. They are used in Spanish after numbers to indicate an ordinal number of feminine or masculine gender, respectively. For example, 1ª = primera, 1º = primero, both meaning "first." The masculine ordinal indicator is very often confused with the degree sign (see "Mathematical, Logical, and Physical Symbols" later in this chapter).
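
As a rough illustration of what compatibility equivalence means for the ordinal indicators, the following Python sketch applies NFKC normalization with the standard unicodedata module. The variable names are illustrative.

```python
import unicodedata

feminine, masculine, degree = "\u00aa", "\u00ba", "\u00b0"

# NFKC folds the ordinal indicators into the plain letters "a" and "o".
print(unicodedata.normalize("NFKC", "1" + feminine))
print(unicodedata.normalize("NFKC", "1" + masculine))

# The degree sign has no decomposition and remains a distinct character.
print(unicodedata.normalize("NFKC", degree) == degree)
print(masculine == degree)
```

The superscript styling is lost in normalization, which is one reason to apply NFKC only where such distinctions do not matter.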

Characters in Table 8-5 are regarded as independent letters, although some of them are historically combinations of two letters or a letter and a diacritic. Only the short names are given here; full names are "Latin capital letter AE," "Latin small letter ae," etc.

Table 8-5. Special letters in Latin-1 Supplement

Glyphs  Codes           Name           Usage notes (not exhaustive)
Æ æ     U+00C6, U+00E6  Letter ae      Scandinavian languages, English, IPA
Ð ð     U+00D0, U+00F0  Eth            Icelandic (as voiced "th" in English)
Þ þ     U+00DE, U+00FE  Thorn          Icelandic (as unvoiced "th" in English)
Ø ø     U+00D8, U+00F8  O with stroke  Danish, Norwegian, Faroese, IPA
ß       U+00DF          Sharp s        German, denotes unvoiced "s" sound


In modern German orthography, the sharp "s," ß, is used only after a long vowel or a diphthong. It has no uppercase equivalent in the Latin-1 Supplement. When converting data to uppercase, ß is replaced by "SS."
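
The uppercase conversion can be observed in Python, whose str.upper method applies Unicode's full case mappings. This is only a sketch of one implementation's behavior; other environments may map ß differently.

```python
word = "Straße"
upper = word.upper()

print(upper)                  # ß has been replaced by "SS"
print(len(word), len(upper))  # the string grows by one character

# Case-insensitive comparison should use casefold, which maps ß to "ss".
print(word.casefold() == upper.casefold())
```

Because the conversion changes the length of the string, code that assumes a one-to-one mapping between characters before and after case conversion can fail on German text.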

The following characters are not regarded as letters, despite being historically formed from stylized letters: ¢, £, ¥, ©, ®, and µ (micro sign).

8.3.3. Superscript Digits (¹ ² ³) and Vulgar Fractions (¼ ½ ¾)

In Unicode, there are versions of digits used as superscripts or subscripts coded as separate characters. Only the superscripts corresponding to 1, 2, and 3 belong to Latin-1 Supplement. The first one is not used much, but the other two are in common use, e.g., in denoting square meter (m²) and cubic meter (m³). The remaining superscript and subscript digits are in the block Superscripts and Subscripts, discussed later. The Latin-1 Supplement contains two characters that may look like superscript 0: the degree sign (°) and the masculine ordinal indicator (º).
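
The split of the superscript digits across two blocks shows up in their code points. The following Python sketch uses the standard unicodedata module; the block labels are hard-coded assumptions for this example, since the module does not report block names.

```python
import unicodedata

# Superscript one, two, three, and (for contrast) superscript four.
for ch in "\u00b9\u00b2\u00b3\u2074":
    # Assumption for this sketch: anything up to U+00FF is labeled Latin-1 Supplement.
    block = "Latin-1 Supplement" if ord(ch) <= 0xFF else "Superscripts and Subscripts"
    print(f"U+{ord(ch):04X} digit value {unicodedata.digit(ch)}: {block}")

# Compatibility normalization turns the superscript into an ordinary digit.
area = "25 m\u00b2"
print(unicodedata.normalize("NFKC", area))
```

Note that normalizing "m²" with NFKC yields "m2", which changes the meaning of the notation, so such normalization should be applied with care to text containing units.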

The so-called vulgar fractions are characters denoting fractional numbers as single characters. In Latin-1 Supplement, there are such characters for the fractions 1/4, 1/2, and 3/4 (namely ¼, ½, and ¾). This reflects the character repertoire on many typewriters. Depending on the font, the bar (which corresponds to fraction slash) can be horizontal or slanted.
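
The numeric value and the structure of these fraction characters can be inspected with Python's standard unicodedata module, as in the following sketch. The loop variable names are illustrative.

```python
import unicodedata

FRACTION_SLASH = "\u2044"

for ch in "\u00bc\u00bd\u00be":
    value = unicodedata.numeric(ch)
    # The compatibility decomposition is digit + fraction slash + digit.
    parts = unicodedata.normalize("NFKD", ch)
    print(f"U+{ord(ch):04X} value={value} uses fraction slash: {FRACTION_SLASH in parts}")
```

The decomposition uses the fraction slash (U+2044) rather than the solidus, which matches the remark above about the bar in the glyph.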

For usage notes, see the section "Mathematical and Technical Symbols" later in this chapter.

8.3.4. Punctuation

Latin-1 Supplement has just a few punctuation characters:

  • Left-pointing angle quotation mark « (U+00AB) and right-pointing angle quotation mark » (U+00BB), often called guillemets or chevrons and used as normal quotation marks, e.g., in French, as in the following: Il a dit : « L'État, c'est moi. »

  • Inverted exclamation mark ¡ (U+00A1). It is used in Spanish and some other languages at the beginning of an exclamation. The exclamation is terminated by a normal exclamation mark, for example: ¡Buenos días, señor!

  • Inverted question mark ¿ (U+00BF). It is used in Spanish and some other languages at the beginning of a question. The question is terminated by a normal question mark, for example: ¿Cómo está usted?

  • Soft hyphen (U+00AD), which is either rendered as a normal hyphen-minus "-" or not rendered at all (and treated as an invisible hyphenation hint). It is discussed in conjunction with other hyphen-like characters in the section "General Punctuation" later in this chapter.
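
Because a soft hyphen is normally invisible, it can silently break string searches and comparisons. The following Python sketch shows the effect and one possible remedy; strip_soft_hyphens is a hypothetical helper written for this example, not a library function.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"

def strip_soft_hyphens(text: str) -> str:
    """Hypothetical helper: drop invisible hyphenation hints before comparing."""
    return text.replace(SOFT_HYPHEN, "")

hinted = "hy\u00adphen\u00adation"

print(unicodedata.category(SOFT_HYPHEN))            # general category of U+00AD
print("hyphenation" in hinted)                      # the hints break a plain search
print(strip_soft_hyphens(hinted) == "hyphenation")  # the search works after stripping
```

Whether stripping is appropriate depends on the application; a program that later renders the text may want to keep the hints.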

8.3.5. Currency Symbols

Cent sign ¢ (U+00A2) is used in many countries. It is most widely known as the symbol for "cent" as one hundredth of the U.S. dollar. In the English language, this character is written immediately after a number, e.g., 75¢. It is never used when writing a sum of money that begins with the dollar sign ($); in such cases, cents are indicated as fractions of a dollar, e.g., $0.75, $49.95.

The currency unit euro is divided into 100 cents, also known as eurocents. There is no recommendation on using the cent sign as a symbol for cent in that meaning. Different abbreviations like "c" and "ct" are used for the eurocent.

Currency sign ¤ (U+00A4) has no definite semantics. It is hardly ever used in normal text. Most naturally, it is used as a generic currency symbol: a placeholder for actual currency symbols. Localization settings in software may use the currency sign in patterns used to specify the formatting of monetary quantities. For example, in such settings, the string "1,1 ¤" might be the way to tell the system to put the currency symbol (to be specified in another setting) after the number and separated from it with a space.
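
The placeholder idea can be sketched in a few lines of Python. The function apply_pattern and its interpretation of "1,1" as the slot for the number are assumptions made purely for illustration; actual localization systems define their own pattern syntax.

```python
CURRENCY_SIGN = "\u00a4"

def apply_pattern(pattern: str, number_text: str, symbol: str) -> str:
    """Hypothetical formatter: '1,1' marks the number slot and the
    currency sign marks where the actual currency symbol goes."""
    return pattern.replace("1,1", number_text).replace(CURRENCY_SIGN, symbol)

# Symbol after the number, separated by a space.
print(apply_pattern("1,1 \u00a4", "49,95", "\u20ac"))
# Symbol immediately before the number.
print(apply_pattern("\u00a41,1", "49,95", "$"))
```

The point of the design is that the pattern stays the same when the currency changes; only the separately configured symbol is substituted.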

When data in ISO 8859-15 encoding is displayed by a program that does not support that encoding or does not properly recognize information about the encoding, the program typically defaults to displaying the data as if it were ISO 8859-1 encoded. Thus, an octet intended to represent the euro sign € would be displayed as the currency sign, ¤.
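
The effect is easy to reproduce in Python, which ships codecs for both encodings. This sketch simply decodes the same octet in two ways; the variable names are illustrative.

```python
octet = b"\xa4"

as_latin9 = octet.decode("iso-8859-15")  # the intended interpretation
as_latin1 = octet.decode("iso-8859-1")   # the mistaken default

print(as_latin9, f"U+{ord(as_latin9):04X}")
print(as_latin1, f"U+{ord(as_latin1):04X}")

# A price encoded as ISO 8859-15 but displayed as ISO 8859-1.
price = "9,99 \u20ac".encode("iso-8859-15")
print(price.decode("iso-8859-1"))
```

The octet 0xA4 is the same in both cases; only the interpretation differs, which is why the error goes undetected unless the encoding is declared and honored.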

Pound sign £ (U+00A3) is best known as denoting the pound as the currency unit of the United Kingdom. It may be used for other currencies as well. The Unicode standard distinguishes the pound sign from the lira sign ₤ (U+20A4), which has two crossbars, as opposed to one crossbar in the pound sign. On the other hand, the standard says that the lira sign is not used much and that the preferred sign for lira is £ (U+00A3).

Yen sign ¥ (U+00A5) has an alternative name "yuan," reflecting its dual use for the currencies of Japan and China. A glyph for the character may have one or two crossbars, with no difference in meaning.

The euro sign, €, does not belong to the Latin-1 Supplement block but to the Currency Symbols block, discussed in the section "Other Blocks" later in this chapter.

8.3.6. Mathematical, Logical, and Physical Symbols

There is a limited and rather haphazard set of mathematically oriented symbols in Latin-1 Supplement. Together with the characters in Basic Latin, such as +, -, and /, they let us write very simple arithmetic expressions.

Degree sign ° (U+00B0) denotes temperature in degrees (e.g., 100 °F, 38 °C) or degrees when expressing angles (e.g., 90° angle). Notice that when a temperature is expressed in kelvins, the degree sign is not used; the symbol of kelvin is simply K (e.g., 311 K).

According to the rules of the SI system of units, a space should be used between a numeric value and a unit symbol, with the exception of angle notations like 30°22'8". When the degree sign is used for temperatures, the normal rule applies (e.g., 42 °C). A no-break space can be used instead of a normal space to prevent undesired line breaks.

In practice, you may find the degree sign used for various other purposes, too. The Unicode standard even mentions (in 14.2: Letterlike Symbols): "Legacy data encoded in ISO/IEC 8859-1 (Latin-1) or other 8-bit character sets may also have represented the numero sign by a sequence of 'N' followed by the degree sign (U+00B0 DEGREE SIGN). Implementations interworking with legacy data should be aware of such alternative representations for the numero sign when converting data." This statement describes legacy data rather than adequate use of the degree sign.

The degree sign is not the same as masculine ordinal indicator (º), although the glyphs for the two characters may look similar. In Chapter 1, we discussed some of the reasons for being strict in such issues. The degree sign is not to be confused with superscript zero U+2070 (digit "0" in superscript style) either.

Division sign ÷ (U+00F7) is a mathematical symbol that mostly denotes division. Its intended scope of use is unclear. It has been used in school mathematics, as in "100 ÷ 5 makes 20." On some numeric keypads of computer keyboards, there is a key with the ÷ symbol, which means division in calculator usage but may generate the solidus / when used for character input.

It is probably best to avoid using the division sign, except in special cases where its meaning can be made clear. It has no tangible benefits over using the solidus /. Moreover, the symbol ÷ is also used to denote subtraction in Denmark and elsewhere in Europe.

Micro sign µ (U+00B5) corresponds to the prefix "micro-" and denotes division by one million when used as a prefix of a unit. For example, "µm" is micrometer, i.e., one millionth of a meter (previously called "micron" and denoted by "µ" alone).

This character is historically based on the Greek letter mu. In Unicode, however, the two characters are distinct. On the other hand, Unicode defines the micro sign as a compatibility character that has Greek small letter mu (U+03BC) as its compatibility decomposition.
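
The difference between the two characters, and the effect of the compatibility decomposition, can be seen in the following Python sketch using the standard unicodedata module. The variable names are illustrative.

```python
import unicodedata

micro, mu = "\u00b5", "\u03bc"

print(micro == mu)                                   # distinct characters
print(unicodedata.normalize("NFC", micro) == micro)  # canonical forms keep the micro sign
print(unicodedata.normalize("NFKC", micro) == mu)    # compatibility forms map it to mu

# A search for "µm" typed with the Greek letter succeeds only after NFKC.
text = "thickness 5 \u00b5m"
print("\u03bcm" in text, "\u03bcm" in unicodedata.normalize("NFKC", text))
```

This suggests that applications comparing unit symbols may want to normalize both sides with NFKC, or otherwise treat the two characters as equivalent.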

In Unicode Version 4, the sample glyphs for the micro sign and the letter mu look very similar, if not identical. In many fonts, however, there are differences, which vary from hardly noticeable to substantial. In Times New Roman, for example, the glyphs are µ (micro) and μ (mu).

Multiplication sign × (U+00D7) is a mathematical symbol denoting multiplication. Examples: "2×2 makes 4," where × can be read as "times"; "a 5×10 metres area," where × can be read as "by." In biology, this character is used when naming hybrids, e.g., Salix ×capreola indicates that the species results from hybridization, and Agrostis stolonifera × Polypogon monspeliensis is a "hybrid formula" that indicates the hybrid of two named species. The Unicode standard mentions an alternative name "z notation Cartesian product," reflecting the usage for Cartesian (direct) product of sets. Compare with the middle dot (·), discussed in "Specialized Characters" later in the chapter.

Not sign ¬ (U+00AC) denotes logical negation, though mostly in formal logic texts only, not in programming languages. Even logic texts often use the Basic Latin character ~ (tilde) instead. The Unicode standard also mentions that in typography, this character is called an "angled dash."

MS Word displays an "optional hyphen" (i.e., an invisible hyphenation hint) as ¬ when in "show formatting" (Show ¶) mode. It was probably chosen partly because the not sign looks like a hyphen with a special mark on it, and partly just because it is a conveniently available character that rarely appears in running text.

Plus-minus sign ± (U+00B1) means "plus or minus." It has different uses:

  • It is sometimes used to refer to two quantities at the same time, as in "the solutions of the equation x² - 4 = 0 are ±2," meaning that the solutions are +2 and -2.

  • It is also used to indicate an interval of uncertainty in measurements and estimates, as in "according to the measurements, the weight is 42.4 kg ± 0.5 kg." This means that the weight is expected to be between 42.4 - 0.5 and 42.4 + 0.5 kilograms. Typically, this does not specify absolute limits; the quantity after the ± sign is often some statistical measure like standard deviation. According to rules for using the SI, notations like 42.4 ± 0.5 kg should not be used; you should either repeat the unit as above or use parentheses: (42.4 ± 0.5) kg to make it "completely clear to which unit symbols the numerical values of the quantities belong."

  • Yet another (informal) usage seems to be to let ± denote "about, circa" (e.g., "he is ±50 years old"), which can be quite confusing.

When the character repertoire is limited to Basic Latin, the string "+/-" is commonly used instead of ±.

8.3.7. Specialized Characters

Broken bar ¦ (U+00A6) has no specific meaning. In some old fonts (and keyboards), the vertical line | character appears as a broken line. For no apparent reason, this variant has been coded as a separate character in Latin-1. The Unicode standard mentions that an alternative name for the character in typography is "parted rule."

Copyright sign © (U+00A9) consists of letter C in a circle, and it is used in copyright statements, such as "© 2006 Jukka K. Korpela." The character can be used instead of or in addition to the word "copyright," partly because the character is, in principle, language-neutral and universal.

Middle dot · (U+00B7) is a multi-purpose character, which was originally included in Latin-1 due to its use as punctuation in the Catalan language. It is more often used as a special character, usually as a multiplication sign of a kind. Uses include the following:

  • In the SI system of units, a middle dot, called "half-high dot" or "raised dot" in that context, can be used when denoting the product of two or more units, e.g., "N·m" (newton multiplied by meter). An alternative is to use a space (e.g., "N m"). See notes on multiplication symbols in "Mathematical and Technical Symbols" later in this chapter.

  • In mathematics, a middle dot is often used as a multiplication symbol. If such a symbol is needed (note that in algebra it is often implied: ab means a multiplied by b), then it is usually better to use the multiplication sign (×).

  • In chemistry, a middle dot is used in some cases to separate major parts of a complex formula such as components of a double salt. Example: K₂SO₄·Al₂(SO₄)₃.

  • In Catalan, the middle dot is used to distinguish between "ll" and "l·l," which are pronounced differently. In Unicode, there are separate characters Latin capital letter "L" with middle dot (U+013F) and Latin small letter "l" with middle dot (U+0140), but they are compatibility equivalent to letter "L" or "l" followed by the middle dot. However, typographers have differing views on Catalan middle dots.

  • In dictionaries, the middle dot is used as a surrogate for hyphenation point U+2027, i.e., to indicate correct word breaking as in dic·tion·ar·ies.

  • In Greek, the middle dot is often used for a punctuation character "ano teleia," which should actually appear higher than the middle dot. Unicode has Greek ano teleia (U+0387) as a separate character, but it has the middle dot as its canonical mapping. However, in several fonts, Greek ano teleia is an upper dot, not a middle dot, so it is a better punctuation character for Greek texts when it is available.
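
The Catalan and Greek items above describe two different kinds of equivalence, and the difference has practical consequences. The following Python sketch, using the standard unicodedata module, shows that the Catalan letters change only under compatibility normalization, whereas the Greek ano teleia is replaced by the middle dot even under canonical normalization. The variable names are illustrative.

```python
import unicodedata

MIDDLE_DOT = "\u00b7"
l_middle_dot, L_middle_dot, ano_teleia = "\u0140", "\u013f", "\u0387"

# Compatibility equivalence: unchanged by NFC, decomposed by NFKC.
print(unicodedata.normalize("NFC", l_middle_dot) == l_middle_dot)
print(unicodedata.normalize("NFKC", l_middle_dot) == "l" + MIDDLE_DOT)
print(unicodedata.normalize("NFKC", L_middle_dot) == "L" + MIDDLE_DOT)

# Canonical mapping: ano teleia becomes the middle dot already under NFC.
print(unicodedata.normalize("NFC", ano_teleia) == MIDDLE_DOT)
```

As a result, text that passes through any normalization step is likely to lose the ano teleia character, regardless of which one the author originally typed.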

Note that a raised decimal point should not be interpreted as a middle dot but as a full stop "." character in particular usage and style.

The middle dot is distinct from the following characters: bullet (U+2022), one dot leader (U+2024), bullet operator (U+2219), dot operator (U+22C5), and hyphenation point (U+2027). However, it is often used as a surrogate for them, e.g., as a small list bullet, although it is not visually suitable for such use, since the glyph for middle dot is typically rather small.

No-break space " " (U+00A0) is used in place of a normal space character as a "binding space," to prevent a line break between words or other expressions. It will be discussed in detail in "General Punctuation" later in this chapter.
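
The binding effect can be illustrated with Python's textwrap module, which breaks lines only at ASCII whitespace and therefore leaves a no-break space intact. This is a sketch of one library's behavior, not a statement about line breaking in general; the sample strings are made up for the example.

```python
import textwrap

NBSP = "\u00a0"

plain = "weight 42 kg"             # normal spaces throughout
bound = "weight 42" + NBSP + "kg"  # number and unit bound together

# With a narrow width, the normal space allows a break before the unit.
print(textwrap.wrap(plain, width=9))
# The no-break space keeps the number and the unit on the same line.
print(textwrap.wrap(bound, width=9))
```

Note that some other Python operations, such as str.split without arguments, do treat the no-break space as whitespace, so the "binding" applies to line breaking rather than to tokenization.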

Pilcrow sign ¶ (U+00B6) is a "section sign in some European usage," as the Unicode standard puts it. In old manuscripts, there was a tendency to present a new paragraph by writing a pilcrow sign and continuing in-line, due to the considerable cost of the recording media in those days. However, such usage is now largely outdated, and the character is used as a marker for special notes.

The pilcrow sign appears as a paragraph sign (and is typically called that) in some U.S. usage, in much the same way as the paragraph sign (§) is often used in Europe. For example, clause 6 of an agreement or verdict is referred to by "¶ 6" and clauses from 20 to 28 are referred to by "¶¶ 20–28."

Many word processors display paragraph breaks as ¶ when requested to "show formatting." This does not mean that the data itself (e.g., as saved onto disk) would contain such characters; it is usually just a visual indication on the screen.

Registered sign ® (U+00AE) consists of letter R in a circle. It is written after a name or other expression to indicate it as a registered trademark (at least in some country). There is considerable variation in glyphs for this character. The letter R inside the circle may have different shapes, but in addition to that, the size and position may vary. For example, in the Lucida Sans Unicode font, ® is a small superscript, whereas in Verdana, ® extends below baseline (making the R in the symbol line up with the baseline), and the symbol is relatively large.

Section sign § (U+00A7) is used as a section sign especially in the U.S., and as a paragraph sign in some European usage, especially when referring to paragraphs in laws, contracts, rules, etc. For that reason, § is often used to symbolize law in general. Reflecting the variation, the character is called paragraph sign in many standards.



Unicode Explained
ISBN: 059610121X
EAN: 2147483647
Year: 2006
Pages: 139
