Encoding


A computer stores text as a series of numbers. Each letter, punctuation mark, or space has a corresponding number, called a code point, that represents it. An encoding is simply a system for identifying characters using numbers. Binary data can be encoded as well, and later in this chapter you will learn how to encode binary data using REALbasic's Base64 functions.

Encoding has come up previously in this book, but only superficially. It is a crucial part of developing programs, especially ones that run on multiple platforms, because you will no doubt be confronted with the full range of encoding possibilities. As it turns out, understanding character encoding is also an important part of making effective use of XML. In this section, I will discuss REALbasic's encoding classes and related tools and share a sample utility application I have used to explore encoding in REALbasic.

There are a lot of different encodings, and if you do an appreciable amount of work with text, you will inevitably run into encoding headaches. One of the earliest encodings is ASCII (American Standard Code for Information Interchange, first standardized in 1963), which is limited to a 7-bit character set, or 128 characters. The first 32 characters, 0 to 31, are control characters, as is 127 (delete); 32 to 126 are the printable characters: the space, the digits, punctuation, and the unaccented letters of the American-English alphabet.

Following is a list of all the printable ASCII characters, in numeric order:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
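The same list can be rebuilt from the code points themselves. The following is only a sketch: it assumes the Encodings.ASCII object and the TextEncoding.Chr method described later in this section, and the variable names are made up for the example.

```realbasic
// Build the visible ASCII characters from their code points.
// 32 is the space, so the visible characters run from 33 to 126.
Dim printable As String
Dim i As Integer
For i = 33 To 126
  printable = printable + Encodings.ASCII.Chr(i)
Next
MsgBox printable
```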


ASCII is fine if you speak English and need only 128 different characters, but not everyone in the world is English and not everyone in the world speaks English, so it quickly became evident that either additional encodings were required or a new approach to encoding altogether would be needed. At first, a range of encodings emerged. Both Macintosh and Windows computers used 1-byte character encodings, which provided space for 256 characters. Although both systems shared a common ASCII heritage, the code points above 127 represented different characters on each platform. Macintosh used MacRoman (plus many variants for different languages) and Windows used Windows-1252, a close variant of Latin-1 (ISO 8859-1).

Since that time, a new standard has emerged (or perhaps more accurately, is emerging) that rationalizes character encoding and that allocates a large enough pool of code points to represent all languages. The standard is called Unicode, and it is managed by the Unicode Consortium. The first version of the Unicode standard was published in 1991.

Unicode uses up to four bytes to represent a character, and there is room in the standard for 1,114,112 code points. These code points are organized into 17 planes, each one containing 65,536 code points (2^16). The first plane, plane "0", is called the Basic Multilingual Plane (BMP); this is the most commonly used plane, where code points have been assigned to a large portion of modern languages.

There is a standard approach to representing Unicode code points, which consists of "U+" followed by a hexadecimal number of at least four digits. The range of code points in the BMP is U+0000 to U+FFFF (Unicode assigns code points beyond the BMP up to U+10FFFF, and the original specification allowed for ranges up to U+7FFFFFFF). The first 256 code points in the BMP are identical to Latin-1, which also means that the first 128 code points are identical to ASCII.
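As a rough illustration of the notation and of how planes divide the code space, here is a small sketch. CodePointLabel is a hypothetical helper, not part of REALbasic; it relies only on the built-in Hex and Str functions and on integer division.

```realbasic
// Hypothetical helper: formats a code point in U+ notation
// and reports which of the 17 planes it falls in.
Function CodePointLabel(cp As Integer) As String
  Dim h As String
  h = Hex(cp)
  // Pad to at least four hexadecimal digits
  While Len(h) < 4
    h = "0" + h
  Wend
  // Each plane holds &h10000 (65,536) code points
  Return "U+" + h + " (plane " + Str(cp \ &h10000) + ")"
End Function
```

Passing &hE9 should produce a label for U+00E9 in plane 0, and any value from &h10000 upward lands in plane 1 or higher, outside the BMP.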

The Unicode standard uses several formats to encode code points. They come in two camps: UCS, the Universal Character Set, and UTF, the Unicode Transformation Format.

UCS: Universal Character Set (UCS-2, UCS-4)

The Universal Character Set uses either 2 bytes or 4 bytes to represent characters. In UCS-2, every character is represented by a 2-byte code point (which, of necessity, limits the number of characters available and does not represent the complete range of Unicode). UCS-4, on the other hand, uses 4 bytes. There are two problems with UCS ("problems" is probably not the right word, but these are the reasons that UCS is not used in practice as much as UTF). First off, early non-Unicode character sets (ASCII and Latin-1) used 8 bits, or 1 byte, per character. Using UCS means that legacy documents would have to be converted to either 2-byte or 4-byte formats to be viewed. Second, using either 2 bytes or 4 bytes for all characters means that your text will take up a lot more space than it would if you were able to use 1 byte for some characters, 2 bytes for others, and so on, which is exactly what UTF does.

UTF: Unicode Transformation Format (UTF-8, UTF-16, UTF-32)

REALbasic supports UTF-8 and UTF-16 as native formats. There are also two additional UTF formats, UTF-7 and UTF-32. In UTF-8 and UTF-16, a character is represented by a sequence of bytes whose length depends on the code point, and this is what differentiates them from UCS. UTF-8, for example, identifies some characters with 1 byte, others with 2, all the way up to 4 bytes. UTF-16 uses 2 bytes for every character in the BMP and 4 bytes for everything beyond it. UTF-32 always uses 4 bytes per character.
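To make the 4-byte case of UTF-16 concrete, the sketch below splits a code point above U+FFFF into the two 2-byte values that represent it. SurrogatePair is a hypothetical name, and the routine assumes its caller only passes code points from &h10000 to &h10FFFF.

```realbasic
// Hypothetical helper: computes the two 16-bit units UTF-16 uses
// for a code point beyond the BMP (&h10000 to &h10FFFF).
Sub SurrogatePair(cp As Integer, ByRef hi As Integer, ByRef lo As Integer)
  Dim v As Integer
  // Subtracting &h10000 leaves a 20-bit value
  v = cp - &h10000
  // The upper 10 bits go into the first unit, which starts at &hD800
  hi = &hD800 + (v \ &h400)
  // The lower 10 bits go into the second unit, which starts at &hDC00
  lo = &hDC00 + (v Mod &h400)
End Sub
```

For U+1F600, for instance, the arithmetic gives &hD83D followed by &hDE00.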

The following table shows the ranges of code points and the bit patterns that UTF-16 and UTF-8 use to represent them. A dash stands for a bit taken from the code point. The first three rows represent the BMP.

Table 6.1. UTF-8: Native Strings

Code Range (hex) | UTF-16 (Binary) | UTF-8 (Binary) | Comments
000000 to 00007F | 00000000 0------- | 0------- | The UTF-8 values in this range are the same as ASCII. The first bit of the UTF-8 byte is always 0.
000080 to 0007FF | 00000--- -------- | 110----- 10------ | UTF-8 uses 2 bytes to represent this range. The first byte begins with 110 and the second byte begins with 10. This range includes the accented Latin characters, among others.
000800 to 00FFFF | -------- -------- | 1110---- 10------ 10------ | UTF-8 uses 3 bytes to represent this range, which is the upper limit of the BMP. The first byte begins with 1110, and the second and third begin with 10.
010000 to 10FFFF | 110110-- -------- 110111-- -------- | 11110--- 10------ 10------ 10------ | For code points beyond the BMP, both encodings use 4 bytes. However, note the difference in the prefixes. UTF-16 uses a "surrogate pair" to represent values over U+FFFF: &h10000 is subtracted from the code point, and the remaining 20 bits are split between a first 2-byte unit that begins with 110110 and a second that begins with 110111.
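The UTF-8 column of the table can be turned into code fairly directly. What follows is a sketch rather than anything REALbasic requires you to write, because ConvertEncoding does this work for you; UTF8Hex and HexByte are hypothetical helper names, and the division and Mod operations stand in for bit shifts and masks.

```realbasic
// Hypothetical helper: two-digit hexadecimal form of one byte
Function HexByte(b As Integer) As String
  Return Right("0" + Hex(b), 2)
End Function

// Hypothetical helper: the UTF-8 bytes for a code point,
// written as space-separated hexadecimal values
Function UTF8Hex(cp As Integer) As String
  If cp < &h80 Then
    // 1 byte: 0 followed by 7 bits
    Return HexByte(cp)
  ElseIf cp < &h800 Then
    // 2 bytes: 110 + 5 bits, then 10 + 6 bits
    Return HexByte(&hC0 + (cp \ 64)) + " " + HexByte(&h80 + (cp Mod 64))
  ElseIf cp < &h10000 Then
    // 3 bytes: 1110 + 4 bits, then two bytes of 10 + 6 bits
    Return HexByte(&hE0 + (cp \ 4096)) + " " + _
      HexByte(&h80 + ((cp \ 64) Mod 64)) + " " + _
      HexByte(&h80 + (cp Mod 64))
  Else
    // 4 bytes: 11110 + 3 bits, then three bytes of 10 + 6 bits
    Return HexByte(&hF0 + (cp \ 262144)) + " " + _
      HexByte(&h80 + ((cp \ 4096) Mod 64)) + " " + _
      HexByte(&h80 + ((cp \ 64) Mod 64)) + " " + _
      HexByte(&h80 + (cp Mod 64))
  End If
End Function
```

Working through U+00E9 by hand gives the bytes C3 and A9, and U+20AC gives E2, 82, and AC, which is what the function should return for &hE9 and &h20AC.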


Byte Order Mark

The byte order mark (BOM) is a character placed at the beginning of a file that can be used to identify the byte order of the document; byte order is simply a reference to endianness, which is one of those topics that keeps popping up in any kind of cross-platform development. The character in question is the zero-width non-breaking space, U+FEFF. UCS-2 and UTF-16 are the two Unicode formats that use the BOM to determine endianness. When used with UTF-8 and others, it identifies the encoding of the file itself. In other words, based on the value of the first four bytes, you can determine the encoding of the string, provided that the BOM has been set. REALbasic usually makes this determination for you, but in case you want to check directly, here are the values and what they mean:

00 00 FE FF | UCS-4, big-endian machine (1234 order)
FF FE 00 00 | UCS-4, little-endian machine (4321 order)
00 00 FF FE | UCS-4, unusual octet order (2143)
FE FF 00 00 | UCS-4, unusual octet order (3412)
FE FF -- -- | UTF-16, big-endian
FF FE -- -- | UTF-16, little-endian
EF BB BF | UTF-8
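If you do want to check directly, the values above translate into a chain of comparisons. This is a minimal sketch, assuming the raw bytes of the file have been read into a string without any conversion; ByteAt and EncodingFromBOM are hypothetical helpers built on the byte-oriented LenB, MidB, and AscB functions.

```realbasic
// Hypothetical helper: the byte at a 1-based position,
// or -1 if the position is past the end of the data
Function ByteAt(data As String, position As Integer) As Integer
  If position > LenB(data) Then
    Return -1
  End If
  Return AscB(MidB(data, position, 1))
End Function

// Hypothetical helper: names the encoding suggested by the BOM
Function EncodingFromBOM(data As String) As String
  Dim b1, b2, b3, b4 As Integer
  b1 = ByteAt(data, 1)
  b2 = ByteAt(data, 2)
  b3 = ByteAt(data, 3)
  b4 = ByteAt(data, 4)

  // Test the 4-byte marks first: FF FE 00 00 also begins
  // with the UTF-16 little-endian mark FF FE
  If b1 = 0 And b2 = 0 And b3 = &hFE And b4 = &hFF Then
    Return "UCS-4, big-endian"
  ElseIf b1 = &hFF And b2 = &hFE And b3 = 0 And b4 = 0 Then
    Return "UCS-4, little-endian"
  ElseIf b1 = 0 And b2 = 0 And b3 = &hFF And b4 = &hFE Then
    Return "UCS-4, unusual octet order (2143)"
  ElseIf b1 = &hFE And b2 = &hFF And b3 = 0 And b4 = 0 Then
    Return "UCS-4, unusual octet order (3412)"
  ElseIf b1 = &hEF And b2 = &hBB And b3 = &hBF Then
    Return "UTF-8"
  ElseIf b1 = &hFE And b2 = &hFF Then
    Return "UTF-16, big-endian"
  ElseIf b1 = &hFF And b2 = &hFE Then
    Return "UTF-16, little-endian"
  Else
    Return "No BOM found"
  End If
End Function
```

The order of the tests is the design choice that matters here. A UTF-16 little-endian file whose first real character is U+0000 would be misread as UCS-4, but that case is rare enough that the sketch ignores it.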


Converting Encodings

Now that the basics of Unicode have been reviewed, it's time to turn to REALbasic's encoding classes and learn how to use the tools provided to effectively manage character encoding in your application.

TextEncoding Class

The TextEncoding class represents a particular encoding, whether Unicode or some native encoding like MacRoman.

TextEncoding.Base as Integer
TextEncoding.Code as Integer
TextEncoding.Variant as Integer
TextEncoding.Format as Integer
TextEncoding.InternetName as String


The TextEncoding class offers an alternative to the global Chr function discussed earlier in the book. The global function assumes that you are using UTF-8, but that may not be what you want. If, for whatever reason, you want to get a character using a code point from another encoding, you can use this method on a TextEncoding instance that represents the encoding you want to use:

TextEncoding.Chr(codepoint as Integer) as String


You can test whether one encoding is equal to another this way:

TextEncoding.Equals(otherEncoding as TextEncoding) as Boolean
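The sketch below puts Chr and Equals together to show why the platform-specific encodings mentioned earlier cause trouble. It assumes that code point 142 in MacRoman and code point 233 in Windows-1252 both stand for the letter é, and it uses the global Encoding function, which is not covered above, to ask a string for its TextEncoding; the variable names are made up for the example.

```realbasic
Dim fromMac, fromWin As String
fromMac = Encodings.MacRoman.Chr(142)      // &h8E is é in MacRoman
fromWin = Encodings.WindowsANSI.Chr(233)   // &hE9 is é in Windows-1252

// The two strings carry different encodings
If Not Encoding(fromMac).Equals(Encoding(fromWin)) Then
  MsgBox Encoding(fromMac).InternetName + " is not " + Encoding(fromWin).InternetName
End If

// Converted to a common encoding, they should hold the same bytes.
// StrComp with mode 0 compares byte by byte.
If StrComp(ConvertEncoding(fromMac, Encodings.UTF8), ConvertEncoding(fromWin, Encodings.UTF8), 0) = 0 Then
  MsgBox "Same character once both strings are UTF-8"
End If
```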


Encodings Object

The Encodings object is always available, and it is used to get a reference to a particular TextEncoding object. You can get a reference to the encoding using the encoding's name:

Encodings.EncodingName as TextEncoding


You can also get a reference to a particular encoding using the code for that encoding. The code is an integer that represents a particular encoding. This method lets you get a reference to a TextEncoding object by passing the code as an argument.

Encodings.GetFromCode(aCode as Integer) as TextEncoding


You can get the character for a code point in a particular encoding by calling the Chr method on that encoding:

Encodings.UTF8.Chr(aCodePoint as Integer) as String


To use the Encodings object, you need to know the names of the available encodings and, optionally, their codes. The following table provides a list of all the encodings recognized by REALbasic's Encodings object and the values associated with each encoding.

Table 6.2. Encodings Available from the Encodings Object

Encodings Object | Internet Name | Base | Variant | Format | Code
Encodings.SystemDefault (Mac) | macintosh | 0 | 2 | 0 | 131072
Encodings.SystemDefault (Windows) | windows-1252 | 1280 | 0 | 0 | 1280
Encodings.UTF8 | UTF-8 | 256 | 0 | 2 | 134217984
Encodings.UTF16 | UTF-16 | 256 | 0 | 0 | 256
Encodings.UCS4 | UTF-32 | 256 | 0 | 3 | 201326848
Encodings.ASCII | US-ASCII | 1536 | 0 | 0 | 1536
Encodings.WindowsLatin1 | windows-1252 | 1280 | 0 | 0 | 1280
Encodings.WindowsLatin2 | windows-1250 | 1281 | 0 | 0 | 1281
Encodings.WindowsLatin5 | windows-1254 | 1284 | 0 | 0 | 1284
Encodings.WindowsKoreanJohab | Johab | 1296 | 0 | 0 | 1296
Encodings.WindowsHebrew | windows-1255 | 1285 | 0 | 0 | 1285
Encodings.WindowsGreek | windows-1253 | 1283 | 0 | 0 | 1283
Encodings.WindowsCyrillic | windows-1251 | 1282 | 0 | 0 | 1282
Encodings.WindowsBalticRim | windows-1257 | 1287 | 0 | 0 | 1287
Encodings.WindowsArabic | windows-1256 | 1286 | 0 | 0 | 1286
Encodings.WindowsANSI | windows-1252 | 1280 | 0 | 0 | 1280
Encodings.WindowsVietnamese | windows-1258 | 1288 | 0 | 0 | 1288
Encodings.MacRoman | macintosh | 0 | 0 | 0 | 0
Encodings.MacVietnamese | X-MAC-VIETNAMESE | 30 | 0 | 0 | 30
Encodings.MacTurkish | X-MAC-TURKISH | 35 | 0 | 0 | 35
Encodings.MacTibetan | X-MAC-TIBETAN | 26 | 0 | 0 | 26
Encodings.MacThai | TIS-620 | 21 | 0 | 0 | 21
Encodings.MacTelugu | X-MAC-TELUGU | 15 | 0 | 0 | 15
Encodings.MacTamil | X-MAC-TAMIL | 14 | 0 | 0 | 14
Encodings.MacSymbol | Adobe-Symbol-Encoding | 33 | 0 | 0 | 33
Encodings.MacSinhalese | X-MAC-SINHALESE | 18 | 0 | 0 | 18
Encodings.MacRomanLatin1 | ISO-8859-1 | 2564 | 0 | 0 | 2564
Encodings.MacRomanian | X-MAC-ROMANIAN | 38 | 0 | 0 | 38
Encodings.MacOriya | X-MAC-ORIYA | 12 | 0 | 0 | 12
Encodings.MacMongolian | X-MAC-MONGOLIAN | 27 | 0 | 0 | 27
Encodings.MacMalayalam | X-MAC-MALAYALAM | 17 | 0 | 0 | 17
Encodings.MacLaotian | X-MAC-LAOTIAN | 22 | 0 | 0 | 22
Encodings.MacKorean | EUC-KR | 3 | 0 | 0 | 3
Encodings.MacKhmer | X-MAC-KHMER | 20 | 0 | 0 | 20
Encodings.MacKannada | X-MAC-KANNADA | 16 | 0 | 0 | 16
Encodings.MacJapanese | Shift_JIS | 1 | 0 | 0 | 1
Encodings.MacIcelandic | X-MAC-ICELANDIC | 37 | 0 | 0 | 37
Encodings.MacHebrew | X-MAC-HEBREW | 5 | 0 | 0 | 5
Encodings.MacGurmukhi | X-MAC-GURMUKHI | 10 | 0 | 0 | 10
Encodings.MacGujarati | X-MAC-GUJARATI | 11 | 0 | 0 | 11
Encodings.MacGreek | X-MAC-GREEK | 6 | 0 | 0 | 6
Encodings.MacGeorgian | X-MAC-GEORGIAN | 23 | 0 | 0 | 23
Encodings.MacGaelic | (none) | 40 | 0 | 0 | 40
Encodings.MacExtArabic | X-MAC-EXTARABIC | 31 | 0 | 0 | 31
Encodings.MacEthiopic | X-MAC-ETHIOPIC | 28 | 0 | 0 | 28
Encodings.MacDingbats | X-MAC-DINGBATS | 34 | 0 | 0 | 34
Encodings.MacDevanagari | X-MAC-DEVANAGARI | 9 | 0 | 0 | 9
Encodings.MacCyrillic | X-MAC-CYRILLIC | 7 | 0 | 0 | 7
Encodings.MacCroatian | X-MAC-CROATIAN | 36 | 0 | 0 | 36
Encodings.MacChineseTrad | Big5 | 2 | 0 | 0 | 2
Encodings.MacChineseSimp | GB2312 | 25 | 0 | 0 | 25
Encodings.MacCentralEurRoman | X-MAC-CE | 29 | 0 | 0 | 29
Encodings.MacCeltic | (none) | 39 | 0 | 0 | 39
Encodings.MacBurmese | X-MAC-BURMESE | 19 | 0 | 0 | 19
Encodings.MacBengali | X-MAC-BENGALI | 13 | 0 | 0 | 13
Encodings.MacArmenian | X-MAC-ARMENIAN | 24 | 0 | 0 | 24
Encodings.MacArabic | X-MAC-ARABIC | 4 | 0 | 0 | 4
Encodings.ISOLatin1 | ISO-8859-1 | 513 | 0 | 0 | 513
Encodings.ISOLatin2 | ISO-8859-2 | 514 | 0 | 0 | 514
Encodings.ISOLatin3 | ISO-8859-3 | 515 | 0 | 0 | 515
Encodings.ISOLatin4 | ISO-8859-4 | 516 | 0 | 0 | 516
Encodings.ISOLatin5 | ISO-8859-9 | 521 | 0 | 0 | 521
Encodings.ISOLatin6 | ISO-8859-10 | 522 | 0 | 0 | 522
Encodings.ISOLatin7 | ISO-8859-13 | 525 | 0 | 0 | 525
Encodings.ISOLatin8 | ISO-8859-14 | 526 | 0 | 0 | 526
Encodings.ISOLatin9 | ISO-8859-15 | 527 | 0 | 0 | 527
Encodings.ISOLatinHebrew | ISO-8859-8-I | 520 | 0 | 0 | 520
Encodings.ISOLatinGreek | ISO-8859-7 | 519 | 0 | 0 | 519
Encodings.ISOLatinCyrillic | ISO-8859-5 | 517 | 0 | 0 | 517
Encodings.ISOLatinArabic | ISO-8859-6-I | 518 | 0 | 0 | 518
Encodings.DOSTurkish | cp857 | 1044 | 0 | 0 | 1044
Encodings.DOSThai | TIS-620 | 1053 | 0 | 0 | 1053
Encodings.DOSRussian | cp866 | 1051 | 0 | 0 | 1051
Encodings.DOSPortuguese | (none) | 1045 | 0 | 0 | 1045
Encodings.DOSNordic | (none) | 1050 | 0 | 0 | 1050
Encodings.DOSLatinUS | cp437 | 1024 | 0 | 0 | 1024
Encodings.DOSLatin2 | cp852 | 1042 | 0 | 0 | 1042
Encodings.DOSLatin1 | cp850 | 1040 | 0 | 0 | 1040
Encodings.DOSKorean | EUC-KR | 1058 | 0 | 0 | 1058
Encodings.DOSJapanese | Shift_JIS | 1056 | 0 | 0 | 1056
Encodings.DOSIcelandic | cp861 | 1046 | 0 | 0 | 1046
Encodings.DOSHebrew | DOS-862 | 1047 | 0 | 0 | 1047
Encodings.DOSGreek2 | IBM869 | 1052 | 0 | 0 | 1052
Encodings.DOSGreek1 | (none) | 1041 | 0 | 0 | 1041
Encodings.DOSGreek | cp737 | 1029 | 0 | 0 | 1029
Encodings.DOSCyrillic | (none) | 1043 | 0 | 0 | 1043
Encodings.DOSChineseTrad | Big5 | 1059 | 0 | 0 | 1059
Encodings.DOSChineseSimplif | GBK | 1057 | 0 | 0 | 1057
Encodings.DOSCanadianFrench | (none) | 1048 | 0 | 0 | 1048
Encodings.DOSBalticRim | cp775 | 1030 | 0 | 0 | 1030
Encodings.DOSArabic | cp864 | 1049 | 0 | 0 | 1049
Encodings.ShiftJIS | Shift_JIS | 2561 | 0 | 0 | 2561
Encodings.KOI8_R | KOI8-R | 2562 | 0 | 0 | 2562
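The Code column in Table 6.2 appears to be built from the other three values: in every row, the code equals the base plus the variant multiplied by 65,536 plus the format multiplied by 67,108,864. That relationship is inferred from the listed values rather than taken from REALbasic's documentation, so treat the sketch below as a way of checking the table, not as a definition. EncodingCode is a hypothetical helper; the lines after it are meant to run inside a method or event handler.

```realbasic
// Hypothetical helper: combines base, variant, and format the way the
// values in Table 6.2 suggest (an inference from the table)
Function EncodingCode(baseValue As Integer, variantValue As Integer, formatValue As Integer) As Integer
  Return baseValue + variantValue * 65536 + formatValue * 67108864
End Function

// Checked by hand against two rows of the table:
//   EncodingCode(256, 0, 2) gives 134217984, the code listed for Encodings.UTF8
//   EncodingCode(0, 2, 0)   gives 131072, the code listed for the Mac system default
Dim enc As TextEncoding
enc = Encodings.GetFromCode(EncodingCode(256, 0, 2))
If enc <> Nil Then
  MsgBox enc.InternetName
End If
```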


After you have a reference to a particular encoding, you can use it to define the encoding of a string or to convert a string from one encoding to another.

DefineEncoding Function

The DefineEncoding global function is used to apply an encoding to a string whose encoding you already know. It does not change the bytes of the string; it only labels them, so that REALbasic knows how to interpret them when the string is displayed, compared, or converted.

DefineEncoding(aString as String, enc as TextEncoding) as String


ConvertEncoding Function

This global function lets you convert a string from one encoding to another. This is a more recent addition to the language than the TextConverter class and is generally easier to use. It assumes that REALbasic is able to figure out the encoding of the string that is being passed so that it can accurately convert it into the encoding passed in the second argument.

ConvertEncoding(aString as String, enc as TextEncoding) as String
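The difference between the two functions is easiest to see side by side. In this sketch, a single byte stands in for data that arrived from a file or socket and that you know to be MacRoman; the variable names are made up, and the claim that &h8E is é in MacRoman is an assumption of the example.

```realbasic
Dim raw, utf As String

// One raw byte, &h8E, built with ChrB for the sake of the example
raw = ChrB(142)

// DefineEncoding labels the byte as MacRoman; the byte itself is unchanged
raw = DefineEncoding(raw, Encodings.MacRoman)

// ConvertEncoding rewrites the bytes: the same character takes 2 bytes in UTF-8
utf = ConvertEncoding(raw, Encodings.UTF8)

MsgBox Str(LenB(raw)) + " byte before, " + Str(LenB(utf)) + " bytes after"
```

Calling ConvertEncoding on a string that has no encoding defined gives it nothing to convert from, which is why the DefineEncoding step comes first.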


Some additional global functions can be used, but they have largely been replaced by the methods outlined previously. There is a GetTextConverter function and a GetTextEncoding function, but it is generally simpler to use the Encodings object and ConvertEncoding function outlined earlier.

Base64 Encoding

Base64 encoding is unrelated to Unicode, but it is quite relevant to our discussion of XML and Internet applications in general. Basically, Base64 is a means of encoding arbitrary binary information into a string of ASCII characters that can then be emailed or otherwise transferred through the Internet.

REALbasic provides global functions for your encoding and decoding pleasure:

REALbasic.EncodeBase64(aString as String, [linewrap as Integer]) as String
REALbasic.DecodeBase64(aString as String, [linewrap as Integer]) as String


Consider the following code snippet:

Dim s as String
s = "ABCs are fun to learn."
MsgBox EncodeBase64(s)


The MsgBox will display the following:

QUJDcyBhcmUgZnVuIHRvIGxlYXJuLg==
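Decoding reverses the process. The following sketch round-trips the same sentence and compares the result with the original byte by byte, using StrComp in binary mode (mode 0); the variable names are made up for the example.

```realbasic
Dim original, encoded, decoded As String
original = "ABCs are fun to learn."
encoded = EncodeBase64(original)
decoded = DecodeBase64(encoded)

// Mode 0 makes StrComp compare the raw bytes
If StrComp(decoded, original, 0) = 0 Then
  MsgBox "Round trip OK"
End If
```

Because Base64 carries bytes rather than characters, the decoded string may come back without an encoding attached; if it holds text, DefineEncoding can be used to label it.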


XML and HTML Entity Encoding

XML documents are encoded in UTF-8 or UTF-16 by default; if another encoding is used, it must be declared at the beginning of the document. However, because some Internet-related protocols can work only with 8-bit character sets, you sometimes need to encode characters outside that range in a special format. You can refer to specific characters using an XML entity reference. Entity references always start with "&" and end with ";" and every time an XML parser encounters an entity, it tries to expand it, which means that it tries to replace it with the appropriate replacement text for that entity. An entity reference can refer to a long string of text, either declared within the document or residing outside the document, but the references used for character encoding are automatically expanded.

The generic way to refer to a particular character is to use a numeric entity reference, like so:

&#39;


This particular reference refers to the apostrophe character. The ampersand is followed by "#", which identifies it as a numeric character reference and the numbers that follow it are decimal representations of a particular code point. In addition to using entities to refer to characters that are outside the 1-byte range, there are also some characters that need to be encoded when you want to refer to them literally because they are used to mark up an XML document. The following table specifies those entities and what they represent:

&amp; | &
&gt; | >
&lt; | <
&apos; | '
&quot; | "

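Both kinds of reference are easy to produce in code. The sketch below uses the global ReplaceAll and Str functions; EscapeXML and NumericEntity are hypothetical helper names, not part of REALbasic.

```realbasic
// Hypothetical helper: replaces the five markup characters
// with their predefined entities
Function EscapeXML(s As String) As String
  Dim r As String
  // The ampersand must be replaced first; otherwise the "&" that
  // begins each entity added below would itself be escaped
  r = ReplaceAll(s, "&", "&amp;")
  r = ReplaceAll(r, "<", "&lt;")
  r = ReplaceAll(r, ">", "&gt;")
  r = ReplaceAll(r, "'", "&apos;")
  r = ReplaceAll(r, """", "&quot;")
  Return r
End Function

// Hypothetical helper: decimal numeric character reference
// for a code point, so 39 becomes &#39;
Function NumericEntity(codePoint As Integer) As String
  Return "&#" + Str(codePoint) + ";"
End Function
```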

Detecting Encoding in XML Documents

If an XML document does not declare an encoding, it is supposed to be either UTF-8 or UTF-16. (The BOM is optional.) However, although documents in any other encoding are supposed to declare it, some do not, and others declare an encoding that is different from the one the text is actually encoded in. This is surprisingly common (in my experience, at least), so checking is helpful. An XML document normally starts with an XML declaration:

<?xml version="1.0" encoding="UTF-8"?>


Regardless of the version or encoding, a document that carries this declaration starts with <?xml. Because you know what those characters have to be, you can deduce the encoding being used by examining the data in the first four positions. In some cases, you are just narrowing down the options, so you need to check the encoding declaration in the XML document as well to make sure it is consistent with what you have found.

00 00 00 3C | UCS-4, big-endian, or other 32-bit format
3C 00 00 00 | UCS-4, little-endian
00 00 3C 00 | UCS-4, unusual byte order
00 3C 00 00 | UCS-4, unusual byte order
00 3C 00 3F | UTF-16BE, big-endian ISO-10646-UCS-2
3C 00 3F 00 | UTF-16LE, little-endian ISO-10646-UCS-2
3C 3F 78 6D | UTF-8, ISO 646, ASCII, ISO 8859, Shift-JIS, EUC.
4C 6F A7 94 | EBCDIC
Other? | Unknown
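The lookup can be sketched as a function that packs the first four bytes into a hexadecimal signature and compares it with the values listed above. ByteAt and GuessXMLEncoding are hypothetical helpers, and the sketch assumes that the document has been read as raw bytes and does not begin with a BOM.

```realbasic
// Hypothetical helper: the byte at a 1-based position,
// or -1 if the position is past the end of the data
Function ByteAt(data As String, position As Integer) As Integer
  If position > LenB(data) Then
    Return -1
  End If
  Return AscB(MidB(data, position, 1))
End Function

// Hypothetical helper: narrows down the encoding family
// from the first four bytes of an XML document
Function GuessXMLEncoding(data As String) As String
  Dim sig As String
  Dim i, b As Integer

  // Build a signature such as "3C3F786D" from the first four bytes
  For i = 1 To 4
    b = ByteAt(data, i)
    If b < 0 Then
      Return "Unknown"
    End If
    sig = sig + Right("0" + Hex(b), 2)
  Next

  If sig = "0000003C" Then
    Return "UCS-4, big-endian, or other 32-bit format"
  ElseIf sig = "3C000000" Then
    Return "UCS-4, little-endian"
  ElseIf sig = "00003C00" Or sig = "003C0000" Then
    Return "UCS-4, unusual byte order"
  ElseIf sig = "003C003F" Then
    Return "UTF-16BE"
  ElseIf sig = "3C003F00" Then
    Return "UTF-16LE"
  ElseIf sig = "3C3F786D" Then
    Return "UTF-8 or another ASCII-compatible encoding"
  ElseIf sig = "4C6FA794" Then
    Return "EBCDIC"
  Else
    Return "Unknown"
  End If
End Function
```

A result in the ASCII-compatible family only narrows things down; the encoding named in the declaration still has to be read to choose between UTF-8, ISO 8859, Shift-JIS, and the rest.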





REALbasic Cross-Platform Application Development
ISBN: 0672328135
EAN: 2147483647
Year: 2004
Pages: 149
