
3.3. Windows Latin 1 and Other Windows Codes

The ISO 8859 character codes, which have been defined by international standards, have Microsoft-specific counterparts, which are here called "Windows codes." The main difference is that some code positions are reserved for control characters (and mostly unused) in ISO 8859 but assigned to various printable characters, especially punctuation marks, in Windows codes. Although defined only by a software vendor, the Windows codes are very important due to the market share of Microsoft.

3.3.1. Windows Latin 1

Microsoft defined its own Latin 1 encoding, which differs from ISO Latin 1 only in that some positions reserved for control codes in ISO Latin 1 (codes 128–159 decimal) are used for printable characters in Windows Latin 1. The main reason was very understandable: the inclusion of typographically correct quotation marks, as in “foo” and ‘foo’, and of the em dash (—) and en dash (–). The right single quote is also the typographically correct apostrophe. Some other characters were added as well.

Windows Latin 1 is one of the most commonly used encodings in the world. In most contexts where the default is said to be ISO Latin 1, it's really Windows Latin 1 (sometimes called WinLatin1). For example, if a web document is labeled as ISO-8859-1 but contains octets with values 128–159, browsers will generally display them according to Windows Latin 1. The practical reason is that most often this is what the document's author really meant.
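The browser behavior described above can be sketched in a few lines of Python. This is only an illustration: the helper name decode_like_browser and the sample bytes are made up for this example, and cp1252 is the name of Python's codec for Windows Latin 1.

```python
# Bytes as they might appear in a page labeled ISO-8859-1 but authored on Windows.
raw = b"\x93foo\x94 \x96 bar"


def decode_like_browser(data):
    """Hypothetical helper: treat data labeled ISO-8859-1 as Windows Latin 1."""
    # errors="replace" covers the few octets that Python's cp1252 codec
    # leaves unassigned; they become U+FFFD instead of raising an error.
    return data.decode("cp1252", errors="replace")


strict_iso = raw.decode("iso-8859-1")  # octet 147 becomes the control character U+0093
lenient = decode_like_browser(raw)     # octet 147 becomes U+201C, a left double quote

print(repr(strict_iso))
print(repr(lenient))
```

The strict decoding succeeds without any error, which is why such mislabeled data can go unnoticed until the text is displayed.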

However, the use of octets in the range 128–159 in any data to be processed by a program that expects ISO 8859-1 encoded data is an error, and it might cause problems. The octets might for example be ignored, or be processed in a manner that looks meaningful, or (in rare cases) be interpreted as control characters.
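A program that expects ISO 8859-1 data could check for such octets before processing. The following is a minimal sketch; the function name find_c1_octets and the sample data are assumptions introduced for illustration.

```python
def find_c1_octets(data):
    """Return (offset, value) pairs for every octet in the range 128-159."""
    return [(i, b) for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


# Octet 146 is the Windows Latin 1 apostrophe; it has no printable meaning in ISO 8859-1.
sample = b"It\x92s here"
problems = find_c1_octets(sample)

for offset, value in problems:
    print(f"offset {offset}: octet {value} is outside printable ISO 8859-1")
```

What to do with the offending octets (reject the data, reinterpret it as Windows Latin 1, or drop them) is a policy decision that depends on the application.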

The encoding has been registered under the name windows-1252. In practice, the name cp-1252, or cp1252, was widely used before the registration, and it can still be seen.
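Software libraries usually accept several labels for the same encoding. As an illustration, the sketch below asks Python's standard codecs module how it resolves the registered name and the older one; the helper canonical_codec is a name made up for this example. Other spellings may or may not be recognized by a given library, so it is safer to check than to assume.

```python
import codecs


def canonical_codec(label):
    """Return Python's canonical codec name for a label, or None if it is unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


registered = canonical_codec("windows-1252")
legacy = canonical_codec("cp1252")

print(registered, legacy)
```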

Windows Latin 1 is often referred to as the ANSI character set, but this is completely misleading. ANSI, the American National Standards Institute, never adopted the set as a standard. Microsoft started using the name because they based the design on a draft for an ANSI standard. Other Windows character codes have also been called "ANSI."

The Windows Latin 1 encoding has existed in somewhat different variants. The main difference in practice is that early versions did not include the euro sign, €. Table 3-4 presents the modern version of the characters in Windows Latin 1 that do not belong to ISO Latin 1. The table is grouped by character semantics and uses Unicode names for the characters. The names used in Microsoft documentation are partly different and vary by document.

Table 3-4. Additional characters in Windows Latin 1

| Glyph | Unicode name of character | Unicode code point | Windows code (decimal) | Comments |
| --- | --- | --- | --- | --- |
| – | En dash | U+2013 | 150 | |
| — | Em dash | U+2014 | 151 | |
| “ | Left double quotation mark | U+201C | 147 | |
| ” | Right double quotation mark | U+201D | 148 | |
| ‘ | Left single quotation mark | U+2018 | 145 | |
| ’ | Right single quotation mark | U+2019 | 146 | Also apostrophe |
| ‹ | Single left-pointing angle quotation mark | U+2039 | 139 | Left guillemet |
| › | Single right-pointing angle quotation mark | U+203A | 155 | Right guillemet |
| „ | Double low-9 quotation mark | U+201E | 132 | Baseline quote |
| ‚ | Single low-9 quotation mark | U+201A | 130 | |
| … | Horizontal ellipsis | U+2026 | 133 | |
| • | Bullet | U+2022 | 149 | |
| † | Dagger | U+2020 | 134 | |
| ‡ | Double dagger | U+2021 | 135 | |
| ˜ | Small tilde | U+02DC | 152 | Diacritic-like |
| ˆ | Modifier letter circumflex accent | U+02C6 | 136 | Diacritic-like |
| ‰ | Per mille sign | U+2030 | 137 | One thousandth |
| ™ | Trademark sign | U+2122 | 153 | |
| ƒ | Latin small letter "f" with hook | U+0192 | 131 | "Florin" |
| š | Latin small letter "s" with caron | U+0161 | 154 | |
| Š | Latin capital letter "S" with caron | U+0160 | 138 | |
| ž | Latin small letter "z" with caron | U+017E | 158 | Added with euro |
| Ž | Latin capital letter "Z" with caron | U+017D | 142 | Added with euro |
| œ | Latin small ligature oe | U+0153 | 156 | In French |
| Œ | Latin capital ligature OE | U+0152 | 140 | In French |
| Ÿ | Latin capital letter "Y" with dieresis | U+0178 | 159 | In French |
| € | Euro sign | U+20AC | 128 | Added later |
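A listing of this kind can be regenerated from any implementation of the encoding. The sketch below uses Python's cp1252 codec as the source of truth, which is an assumption: other implementations may treat the unassigned positions differently. The function name windows_latin1_additions is made up for this example.

```python
import unicodedata


def windows_latin1_additions():
    """Map each octet 128-159 to the character Python's cp1252 codec assigns to it."""
    table = {}
    for octet in range(0x80, 0xA0):
        try:
            ch = bytes([octet]).decode("cp1252")
        except UnicodeDecodeError:
            continue  # position left unassigned by this codec
        table[octet] = ch
    return table


# Print the decimal octet, the Unicode code point, and the Unicode name.
for octet, ch in sorted(windows_latin1_additions().items()):
    print(f"{octet:3d}  U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

The output is ordered by code number rather than grouped by character semantics, so it complements the table rather than reproducing its layout.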


3.3.2. Other Windows Character Codes

Microsoft has also defined other Windows-specific 8-bit character codes that resemble ISO 8859 encodings, such as Windows Latin 2, also known as Windows Central European or Windows East European. They, too, use the range of control codes (128–159) for added punctuation and other characters. In addition to this, the encodings may differ from the corresponding ISO 8859 encoding in other positions. In particular, Windows Latin 2 differs from ISO 8859-2 in several positions.
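Both kinds of difference can be seen by decoding the same octets under the two encodings. The sketch below uses Python's cp1250 and iso-8859-2 codecs; the four sample octets were chosen for illustration only.

```python
# Three octets from the range 128-159, plus octet 161 from the upper half.
sample = bytes([0x84, 0x93, 0x96, 0xA1])

as_windows = sample.decode("cp1250")   # Windows Latin 2 (windows-1250)
as_iso = sample.decode("iso-8859-2")   # ISO Latin 2

for octet, w, i in zip(sample, as_windows, as_iso):
    print(f"{octet:3d}  windows-1250: U+{ord(w):04X}  ISO 8859-2: U+{ord(i):04X}")
```

The first three octets yield punctuation marks in the Windows code and control characters in ISO 8859-2, while octet 161 yields two different printable characters.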

The Windows codes are widely used as de facto standards in many environments. If you travel to Central/Eastern Europe and use computers there, you will find that they very often have Windows Latin 2 as the default encoding.

The Windows codes are known as windows-1250 through windows-1258 in the official registry of character encodings; these names are often called MIME names of encodings, for reasons explained in Chapter 10. Moreover, there is windows-874, which has not been officially registered. In practice, somewhat different names are used, as shown in Table 3-5. Note that the numbering of windows-1250 etc. differs from the numbering of the corresponding ISO 8859 standards. The table also compares the codes with ISO 8859 codes; differences in the range 128–159 are not mentioned here.

Table 3-5. Widely used Windows character codes

| MIME name | Common name | Compare to | Differences |
| --- | --- | --- | --- |
| windows-1250 | Windows Central/East Eur. | ISO 8859-2 | Differ in some positions |
| windows-1251 | Windows Cyrillic | ISO 8859-5 | Different ordering |
| windows-1252 | Windows Latin 1 (West Eur.) | ISO 8859-1 | |
| windows-1253 | Windows Greek | ISO 8859-7 | Differ in some positions |
| windows-1254 | Windows Turkish | ISO 8859-9 | |
| windows-1255 | Windows Hebrew | ISO 8859-8 | Some differences |
| windows-1256 | Windows Arabic | ISO 8859-6 | Major differences |
| windows-1257 | Windows Baltic | ISO 8859-13 | A few differences |
| windows-1258 | Windows Vietnamese | (ISO 8859-1) | Separate design |
| windows-874 | Windows Thai | ISO 8859-11 | |


The windows-1258 encoding has no direct ISO 8859 counterpart, but its overall design is the same as in ISO 8859-1, with the added characters as in windows-1252 and with some modifications made to meet some needs of the Vietnamese language.
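The comparisons in Table 3-5, including the windows-1258 case, can be explored programmatically. The following sketch counts, for each pair, the octets in the range 160–255 that decode differently. It relies on Python's bundled codecs, whose mappings are assumed here to match the definitions the table is based on; the names PAIRS and differing_positions are made up for this example.

```python
# (MIME name, Python codec for it, Python codec for the ISO 8859 code it is compared to)
PAIRS = [
    ("windows-1250", "cp1250", "iso8859_2"),
    ("windows-1251", "cp1251", "iso8859_5"),
    ("windows-1252", "cp1252", "iso8859_1"),
    ("windows-1253", "cp1253", "iso8859_7"),
    ("windows-1254", "cp1254", "iso8859_9"),
    ("windows-1255", "cp1255", "iso8859_8"),
    ("windows-1256", "cp1256", "iso8859_6"),
    ("windows-1257", "cp1257", "iso8859_13"),
    ("windows-1258", "cp1258", "iso8859_1"),
    ("windows-874", "cp874", "iso8859_11"),
]


def differing_positions(win_codec, iso_codec):
    """List the octets in 160-255 that the two codecs decode to different characters."""
    diffs = []
    for octet in range(0xA0, 0x100):
        b = bytes([octet])
        # errors="replace" maps unassigned positions to U+FFFD in either codec,
        # so a position unassigned in both counts as "same".
        if b.decode(win_codec, errors="replace") != b.decode(iso_codec, errors="replace"):
            diffs.append(octet)
    return diffs


counts = {name: len(differing_positions(w, i)) for name, w, i in PAIRS}

for name, n in counts.items():
    print(f"{name}: {n} differing positions in 160-255")
```

A count of zero means that the two codes differ only in the range 128–159, which the table deliberately leaves out.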

Names like cp1250 or cp-1250 (instead of windows-1250) are often used, but they are not official (registered).

For detailed information, consult Microsoft's documentation "Code pages supported by Windows," http://www.microsoft.com/globaldev/reference/wincp.mspx.



Unicode Explained
ISBN: 059610121X
EAN: 2147483647
Year: 2006
Pages: 139