Encoding

A computer stores text as a series of numbers. For each letter, punctuation mark, or space there is a corresponding number, called a code point, which represents that character. An encoding is simply a system that is used to identify characters using numbers. Binary data can be encoded as well, and later on in this chapter you will learn how to encode binary data using REALbasic's Base64 functions.

Encoding has been encountered previously in this book, but only superficially; however, it is a crucial part of developing programs, especially ones that run on multiple platforms, because you will no doubt be confronted with the full range of encoding possibilities. As it turns out, understanding character encoding is an important part of being able to make effective use of XML. In this section, I will discuss REALbasic's encoding classes and related tools and share a sample utility application I have used to explore encoding on REALbasic.

There are a lot of different encodings, and if you do an appreciable amount of work with text, you will inevitably run into encoding headaches. One of the earliest encodings is ASCII (American Standard Code for Information Interchange, first standardized in 1963), which was limited to a 7-bit character set, or 128 characters. The first 32 characters, 0 to 31, are control characters (as is 127, the delete character), and 32 to 126 are the printable characters that make up the basic American-English alphabet, digits, and punctuation.

Following is a list of all the printable ASCII characters, in numeric order:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
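The relationship between the numbers and the characters can be checked directly. The following is a minimal sketch in Python rather than REALbasic (chosen only so that it can be run on its own); it rebuilds the printable range from the code points 32 through 126.

```python
# Build the printable ASCII range (code points 32 through 126) from the numbers alone.
printable = "".join(chr(code) for code in range(32, 127))

print(len(printable))      # 95 characters, the first of which is the space
print(printable)
print(ord("A"), chr(97))   # 65 a
```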

ASCII is fine if you speak English and need only 128 different characters, but not everyone in the world speaks English, so it quickly became evident that either additional encodings were required, or a new approach to encoding altogether would be needed. At first, a range of encodings emerged. Both Macintosh and Windows computers used 1-byte character encodings, which provided space for 256 characters. Although both systems shared a common ASCII heritage, the code points above 127 represented different characters on each platform. Macintosh used MacRoman (plus many variants for different languages) and Windows used Windows-1252, a superset of Latin-1 (ISO 8859-1).
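To make the platform difference concrete, here is a small Python sketch (not REALbasic code) that reads the same byte under two 1-byte encodings. The codec names "mac_roman" and "cp1252" are Python's names for MacRoman and Windows-1252; the sample byte is an arbitrary choice for illustration.

```python
raw = bytes([0x8E])  # a single byte above the ASCII range

as_mac = raw.decode("mac_roman")   # how a Macintosh would read it
as_win = raw.decode("cp1252")      # how Windows would read it
print(f"MacRoman: U+{ord(as_mac):04X}  Windows-1252: U+{ord(as_win):04X}")

# Going the other way, the same character is stored as a different byte on each platform.
e_acute = "\u00e9"  # e with an acute accent
mac_byte = e_acute.encode("mac_roman")
win_byte = e_acute.encode("cp1252")
print(mac_byte.hex(), win_byte.hex())   # 8e e9
```

The ASCII range, by contrast, comes out the same under both encodings, which is why plain English text usually survives a trip between platforms.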

Since that time, a new standard has emerged (or perhaps more accurately, is emerging) that rationalizes character encoding and that allocates a large enough pool of code points to represent all languages. The standard is called Unicode, and it is managed by the Unicode Consortium. The first version of the Unicode standard was published in 1991.

Unicode uses up to four bytes to represent a character, and there is room in the standard for 1,114,112 code points. These code points are organized into 17 planes, each one representing 65536 code points (2¹⁶). The first plane, plane "0", is called the Basic Multilingual Plane (BMP); this is the most commonly used plane, where code points have been assigned to a large portion of modern languages.
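The arithmetic behind those figures is easy to verify. In this Python sketch (used here only as a calculator, not as REALbasic code), `plane_of` is a hypothetical helper that reports which plane a code point falls in.

```python
PLANE_SIZE = 2 ** 16     # 65,536 code points per plane
PLANE_COUNT = 17

total = PLANE_SIZE * PLANE_COUNT
print(total)             # 1114112

def plane_of(code_point):
    """Return the plane number (0 is the BMP) that a code point falls in."""
    return code_point >> 16

print(plane_of(ord("A")))    # 0
print(plane_of(0x1F600))     # 1
print(plane_of(0x10FFFF))    # 16
```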

There is a standard approach to representing Unicode code points, which consists of "U+", followed by a hexadecimal number of at least four digits. The range of code points in the BMP is U+0000 to U+FFFF (Unicode assigns code points beyond the BMP up to U+10FFFF, and the original specification allowed for ranges up to U+7FFFFFFF). The first 256 code points in the BMP are identical to Latin-1, which also means that the first 128 code points are identical to ASCII.
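As an illustration of the notation, the following Python sketch (not REALbasic) formats code points in the "U+" style and checks the claim about Latin-1. The helper name `u_plus` is made up for this example.

```python
def u_plus(ch):
    """Format a character's code point as U+ followed by at least four hex digits."""
    return f"U+{ord(ch):04X}"

print(u_plus("A"))             # U+0041
print(u_plus("\u20ac"))        # U+20AC (the euro sign)
print(u_plus("\U0001F600"))    # U+1F600 (beyond the BMP)

# Every Latin-1 byte value decodes to the Unicode code point with the same number.
latin1_matches = all(bytes([n]).decode("latin_1") == chr(n) for n in range(256))
print(latin1_matches)          # True
```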

The Unicode standard uses several formats to encode code points. They come in two camps: UCS, the Universal Character Set, and UTF, the Unicode Transformation Format.

UCS: Universal Character Set (UCS-2, UCS-4)

The Universal Character Set uses either 2 bytes or 4 bytes to represent characters. In UCS-2, every character is represented by a 2-byte code point (which, of necessity, limits the number of characters available and does not represent the complete range of Unicode). UCS-4, on the other hand, uses 4 bytes. There are two problems with UCS ("problems" is probably not the right word, but these are the reasons that UCS is not used in practice as much as UTF). First off, early non-Unicode character sets (ASCII and Latin-1) used 1 byte per character. Using UCS means that legacy documents would have to be converted to either 2-byte or 4-byte formats to be viewed. Second, using either 2 bytes or 4 bytes for all characters means that your text will take up a lot more space than it would if you were able to use 1 byte for some characters, 2 bytes for others, and so on, which is exactly what UTF does.

UTF: Unicode Transformation Format (UTF-8, UTF-16, UTF-32)

REALbasic supports UTF-8 and UTF-16 as native formats. There are also two additional UTF formats, UTF-7 and UTF-32. In UTF-8 and UTF-16, characters are represented by byte sequences of varying lengths, and this is what differentiates them from UCS. UTF-8, for example, identifies some characters with 1 byte, others with 2, all the way up to 4 bytes. UTF-16 starts with 2-byte code units but represents some characters with 4 bytes, and UTF-32 always uses 4 bytes.
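The size differences are easiest to see by measuring them. This Python sketch (an illustration outside REALbasic) encodes four sample characters, picked arbitrarily to cover each size class, and counts the bytes. The "-be" codec variants are used so that Python does not prepend a byte order mark to the result.

```python
# A (ASCII), e-acute (Latin-1), euro sign (BMP), musical G clef (beyond the BMP)
samples = ["A", "\u00e9", "\u20ac", "\U0001D11E"]

sizes = {}
for ch in samples:
    sizes[ord(ch)] = [len(ch.encode(name)) for name in ("utf-8", "utf-16-be", "utf-32-be")]
    print(f"U+{ord(ch):04X}", sizes[ord(ch)])

# U+0041 [1, 2, 4]
# U+00E9 [2, 2, 4]
# U+20AC [3, 2, 4]
# U+1D11E [4, 4, 4]
```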

The following table shows the range of code points and the number of bytes used by UTF-16 and UTF-8 to represent it. The first three rows represent the BMP.

Table 6.1. UTF-8: Native Strings
Code Range (hex)	UTF-16 (Binary)	UTF-8 (Binary)	Comments
000000–00007F	00000000 0-------	0-------	The UTF-8 values in this range are the same as ASCII. The first bit in UTF-8 is always 0.
000080–0007FF	00000--- --------	110----- 10------	UTF-8 uses 2 bytes to represent this range. The first byte begins with 110 and the second byte begins with 10. This range includes the Latin-1 characters above ASCII.
000800–00FFFF	-------- --------	1110---- 10------ 10------	UTF-8 uses 3 bytes to represent this range, which is the upper limit to the BMP. The first byte begins with 1110, and the second and third begin with 10.
010000–10FFFF	110110-- -------- 110111-- --------	11110--- 10------ 10------ 10------	For code points beyond the BMP, both encodings use 4 bytes. However, note the difference in the prefixed values. UTF-16 uses a "surrogate pair" to represent values over U+FFFF. &h10000 is subtracted from the code point, and the remaining 20 bits are split between the two halves of the pair; the 110110 and 110111 prefixes distinguish the first half from the second.
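The bit patterns in the table can be turned into code fairly directly. What follows is a hand-rolled sketch in Python (not REALbasic, and not how any particular library implements it); the function names `utf8_bytes` and `utf16_surrogates` are invented for this example, and no validation is done for out-of-range values or for code points reserved for surrogates.

```python
def utf8_bytes(cp):
    """Encode one code point by hand, following the UTF-8 column of the table."""
    if cp <= 0x7F:                       # 0-------
        return bytes([cp])
    if cp <= 0x7FF:                      # 110----- 10------
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110---- 10------ 10------
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    return bytes([0b11110000 | (cp >> 18),   # 11110--- 10------ 10------ 10------
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])

def utf16_surrogates(cp):
    """Split a code point above U+FFFF into its high and low surrogate."""
    offset = cp - 0x10000                # leaves a 20-bit value
    high = 0xD800 | (offset >> 10)       # 110110 followed by the top 10 bits
    low = 0xDC00 | (offset & 0x3FF)      # 110111 followed by the bottom 10 bits
    return high, low

print(utf8_bytes(0x20AC).hex())                       # e282ac
print([hex(unit) for unit in utf16_surrogates(0x1D11E)])  # ['0xd834', '0xdd1e']
```

Comparing the results against a built-in encoder, as the usage lines hint at, is a quick way to confirm that the bit shuffling is right.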

Byte Order Mark

The byte order mark (BOM) is a character placed at the beginning of a file that can be used to identify the byte order of the document; byte order is simply a reference to endianness, which is one of those topics that keeps popping up in any kind of cross-platform development. The character in question is supposed to be a zero-width non-breaking space. UCS-2 and UTF-16 are the two Unicode formats that use the BOM for the determination of endianness. When used with UTF-8 and others, it's used to identify the encoding of the file itself. In other words, based on the value of the first four bytes, you can determine the encoding of the string (that is, if the BOM has been set). REALbasic usually makes this determination for you, but in case you want to check directly, here are the values and what they mean:

00 00 FE FF	UCS-4, big-endian machine (1234 order)
FF FE 00 00	UCS-4, little-endian machine (4321 order)
00 00 FF FE	UCS-4, unusual octet order (2143)
FE FF 00 00	UCS-4, unusual octet order (3412)
FE FF -- --	UTF-16, big-endian
FF FE -- --	UTF-16, little-endian
EF BB BF	UTF-8
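A check against these values might look like the following Python sketch (an illustration, not REALbasic's own detection logic). The name `sniff_bom` is hypothetical. The 4-byte marks are tested first, because FF FE 00 00 would otherwise be mistaken for a UTF-16 little-endian mark; a UTF-16 little-endian file whose first character is U+0000 would still be misread, which is an accepted limitation of this sketch.

```python
# Longest marks first, so a UCS-4 BOM is not mistaken for a UTF-16 one.
BOMS = [
    (b"\x00\x00\xfe\xff", "UCS-4, big-endian"),
    (b"\xff\xfe\x00\x00", "UCS-4, little-endian"),
    (b"\x00\x00\xff\xfe", "UCS-4, unusual octet order (2143)"),
    (b"\xfe\xff\x00\x00", "UCS-4, unusual octet order (3412)"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16, big-endian"),
    (b"\xff\xfe", "UTF-16, little-endian"),
]

def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(b"\xef\xbb\xbfhello"))   # UTF-8
print(sniff_bom(b"hello"))               # None
```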

Converting Encodings

Now that the basics of Unicode have been reviewed, it's time to turn to REALbasic's encoding classes and learn how to use the tools provided to effectively manage character encoding in your application.

TextEncoding Class

The TextEncoding class represents a particular encoding, whether Unicode or some native encoding like MacRoman.

TextEncoding.Base as Integer
TextEncoding.Code as Integer
TextEncoding.Variant as Integer
TextEncoding.Format as Integer
TextEncoding.InternetName as String

The TextEncoding class offers an alternative to the global Chr function discussed earlier in the book. When using the global function, it is assumed that you are using UTF-8, but that may not be what you want. If, for whatever reason, you want to get a character using a codepoint for another encoding, you can use this method on a TextEncoding instance that represents the encoding you want to use:

TextEncoding.Chr(codepoint as Integer) as String

You can test whether one encoding (for example, the encoding of one string) is equal to another this way:

TextEncoding.Equals(otherEncoding as TextEncoding) as Boolean

Encodings Object

The Encodings object is always available, and it is used to get a reference to a particular TextEncoding object. You can get a reference to the encoding using the encoding's name:

Encodings.EncodingName as TextEncoding

You can also get a reference to a particular encoding using the code for that encoding. The code is an integer that represents a particular encoding. This method lets you get a reference to a TextEncoding object by passing the code as an argument.

Encodings.GetFromCode(aCode as Integer) as TextEncoding

You can also get the character for a code point in a particular encoding by calling the Chr method on that encoding:

Encodings.UTF8.Chr(aCodePoint as Integer)

To use the Encodings object, you need to know the names of the available encodings and, optionally, their codes. The following table provides a list of all the encodings recognized by REALbasic's Encodings object and the values associated with each encoding.

Table 6.2. Encodings Available from the Encodings Object
Encodings Object	Internet Name	Base	Variant	Format	Code
Encodings.SystemDefault (Mac)	macintosh	0	2	0	131072
Encodings.SystemDefault (Windows)	windows-1252	1280	0	0	1280
Encodings.UTF8	UTF-8	256	0	2	134217984
Encodings.UTF16	UTF-16	256	0	0	256
Encodings.UCS4	UTF-32	256	0	3	201326848
Encodings.ASCII	US-ASCII	1536	0	0	1536
Encodings.WindowsLatin1	windows-1252	1280	0	0	1280
Encodings.WindowsLatin2	windows-1250	1281	0	0	1281
Encodings.WindowsLatin5	windows-1254	1284	0	0	1284
Encodings.WindowsKoreanJohab	Johab	1296	0	0	1296
Encodings.WindowsHebrew	windows-1255	1285	0	0	1285
Encodings.WindowsGreek	windows-1253	1283	0	0	1283
Encodings.WindowsCyrillic	windows-1251	1282	0	0	1282
Encodings.WindowsBalticRim	windows-1257	1287	0	0	1287
Encodings.WindowsArabic	windows-1256	1286	0	0	1286
Encodings.WindowsANSI	windows-1252	1280	0	0	1280
Encodings.WindowsVietnamese	windows-1258	1288	0	0	1288
Encodings.MacRoman	macintosh	0	0	0	0
Encodings.MacVietnamese	X-MAC-VIETNAMESE	30	0	0	30
Encodings.MacTurkish	X-MAC-TURKISH	35	0	0	35
Encodings.MacTibetan	X-MAC-TIBETAN	26	0	0	26
Encodings.MacThai	TIS-620	21	0	0	21
Encodings.MacTelugu	X-MAC-TELUGU	15	0	0	15
Encodings.MacTamil	X-MAC-TAMIL	14	0	0	14
Encodings.MacSymbol	Adobe-Symbol-Encoding	33	0	0	33
Encodings.MacSinhalese	X-MAC-SINHALESE	18	0	0	18
Encodings.MacRomanLatin1	ISO-8859-1	2564	0	0	2564
Encodings.MacRomanian	X-MAC-ROMANIAN	38	0	0	38
Encodings.MacOriya	X-MAC-ORIYA	12	0	0	12
Encodings.MacMongolian	X-MAC-MONGOLIAN	27	0	0	27
Encodings.MacMalayalam	X-MAC-MALAYALAM	17	0	0	17
Encodings.MacLaotian	X-MAC-LAOTIAN	22	0	0	22
Encodings.MacKorean	EUC-KR	3	0	0	3
Encodings.MacKhmer	X-MAC-KHMER	20	0	0	20
Encodings.MacKannada	X-MAC-KANNADA	16	0	0	16
Encodings.MacJapanese	Shift_JIS	1	0	0	1
Encodings.MacIcelandic	X-MAC-ICELANDIC	37	0	0	37
Encodings.MacHebrew	X-MAC-HEBREW	5	0	0	5
Encodings.MacGurmukhi	X-MAC-GURMUKHI	10	0	0	10
Encodings.MacGujarati	X-MAC-GUJARATI	11	0	0	11
Encodings.MacGreek	X-MAC-GREEK	6	0	0	6
Encodings.MacGeorgian	X-MAC-GEORGIAN	23	0	0	23
Encodings.MacGaelic		40	0	0	40
Encodings.MacExtArabic	X-MAC-EXTARABIC	31	0	0	31
Encodings.MacEthiopic	X-MAC-ETHIOPIC	28	0	0	28
Encodings.MacDingbats	X-MAC-DINGBATS	34	0	0	34
Encodings.MacDevanagari	X-MAC-DEVANAGARI	9	0	0	9
Encodings.MacCyrillic	X-MAC-CYRILLIC	7	0	0	7
Encodings.MacCroatian	X-MAC-CROATIAN	36	0	0	36
Encodings.MacChineseTrad	Big5	2	0	0	2
Encodings.MacChineseSimp	GB2312	25	0	0	25
Encodings.MacCentralEurRoman	X-MAC-CE	29	0	0	29
Encodings.MacCeltic		39	0	0	39
Encodings.MacBurmese	X-MAC-BURMESE	19	0	0	19
Encodings.MacBengali	X-MAC-BENGALI	13	0	0	13
Encodings.MacArmenian	X-MAC-ARMENIAN	24	0	0	24
Encodings.MacArabic	X-MAC-ARABIC	4	0	0	4
Encodings.ISOLatin1	ISO-8859-1	513	0	0	513
Encodings.ISOLatin2	ISO-8859-2	514	0	0	514
Encodings.ISOLatin3	ISO-8859-3	515	0	0	515
Encodings.ISOLatin4	ISO-8859-4	516	0	0	516
Encodings.ISOLatin5	ISO-8859-9	521	0	0	521
Encodings.ISOLatin6	ISO-8859-10	522	0	0	522
Encodings.ISOLatin7	ISO-8859-13	525	0	0	525
Encodings.ISOLatin8	ISO-8859-14	526	0	0	526
Encodings.ISOLatin9	ISO-8859-15	527	0	0	527
Encodings.ISOLatinHebrew	ISO-8859-8-I	520	0	0	520
Encodings.ISOLatinGreek	ISO-8859-7	519	0	0	519
Encodings.ISOLatinCyrillic	ISO-8859-5	517	0	0	517
Encodings.ISOLatinArabic	ISO-8859-6-I	518	0	0	518
Encodings.DOSTurkish	cp857	1044	0	0	1044
Encodings.DOSThai	TIS-620	1053	0	0	1053
Encodings.DOSRussian	cp866	1051	0	0	1051
Encodings.DOSPortuguese		1045	0	0	1045
Encodings.DOSNordic		1050	0	0	1050
Encodings.DOSLatinUS	cp437	1024	0	0	1024
Encodings.DOSLatin2	cp852	1042	0	0	1042
Encodings.DOSLatin1	cp850	1040	0	0	1040
Encodings.DOSKorean	EUC-KR	1058	0	0	1058
Encodings.DOSJapanese	Shift_JIS	1056	0	0	1056
Encodings.DOSIcelandic	cp861	1046	0	0	1046
Encodings.DOSHebrew	DOS-862	1047	0	0	1047
Encodings.DOSGreek2	IBM869	1052	0	0	1052
Encodings.DOSGreek1		1041	0	0	1041
Encodings.DOSGreek	cp737	1029	0	0	1029
Encodings.DOSCyrillic		1043	0	0	1043
Encodings.DOSChineseTrad	Big5	1059	0	0	1059
Encodings.DOSChineseSimplif	GBK	1057	0	0	1057
Encodings.DOSCanadianFrench		1048	0	0	1048
Encodings.DOSBalticRim	cp775	1030	0	0	1030
Encodings.DOSArabic	cp864	1049	0	0	1049
Encodings.ShiftJIS	Shift_JIS	2561	0	0	2561
Encodings.KOI8_R	KOI8-R	2562	0	0	2562

After you have a reference to a particular encoding, you can use it to define the encoding of a string or to convert a string from one encoding to another.

DefineEncoding Function

The DefineEncoding global function is used to apply an encoding to a string whose encoding you already know. In simpler terms, it labels the string with that encoding so that REALbasic knows how to interpret its bytes; the bytes themselves are not changed.

DefineEncoding(aString as String, enc as TextEncoding) as String

ConvertEncoding Function

This global function lets you convert a string from one encoding to another. This is a more recent addition to the language than the TextConverter class and is generally easier to use. It assumes that REALbasic is able to figure out the encoding of the string that is being passed so that it can accurately convert it into the encoding passed in the second argument.

ConvertEncoding(aString as String, enc as TextEncoding) as String
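One way to picture the difference between defining and converting an encoding is to look at what happens to the underlying bytes. The sketch below uses Python's decode and encode as a rough analogy only; it does not call DefineEncoding or ConvertEncoding, and the sample bytes are invented for illustration.

```python
raw = b"caf\xe9"   # four bytes read from a file known to be Latin-1

# "Defining" the encoding: the bytes are untouched; we only state how to read them.
text = raw.decode("latin_1")

# "Converting" the encoding: the same characters, written as a new byte sequence.
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-be")
print(raw.hex(), as_utf8.hex(), as_utf16.hex())
# 636166e9 636166c3a9 00630061006600e9

# Defining the wrong encoding does not fail; it silently yields the wrong character.
wrong = raw.decode("mac_roman")
print(f"U+{ord(wrong[-1]):04X}")   # U+00C8 rather than U+00E9
```

This is why a conversion is only as good as the encoding the string was defined with in the first place.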

Some additional global functions can be used, but they have largely been replaced by the methods outlined previously. There is a GetTextConverter function and a GetTextEncoding function, but it is generally simpler to use the Encodings object and ConvertEncoding function outlined earlier.

Base64 Encoding

Base64 encoding is unrelated to Unicode, but it is quite relevant to our discussion of XML and Internet applications in general. Basically, Base64 is a means of encoding arbitrary binary information into a string of ASCII characters that can then be emailed or otherwise transferred through the Internet.

REALbasic provides global functions for your encoding and decoding pleasure:

REALbasic.EncodeBase64(aString as String, [linewrap as Integer]) as String
REALbasic.DecodeBase64(aString as String, [linewrap as Integer]) as String

Consider the following code snippet:

Dim s as String
s = "ABCs are fun to learn."
MsgBox EncodeBase64(s)

The MsgBox will display the following:

QUJDcyBhcmUgZnVuIHRvIGxlYXJuLg==
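The same result can be reproduced outside REALbasic, which is a handy cross-check when exchanging data with other systems. This sketch uses the base64 module from Python's standard library; the 256-byte `blob` is an arbitrary stand-in for binary data.

```python
import base64

data = b"ABCs are fun to learn."
encoded = base64.b64encode(data).decode("ascii")
print(encoded)                     # QUJDcyBhcmUgZnVuIHRvIGxlYXJuLg==

# Decoding returns the original bytes.
decoded = base64.b64decode(encoded)
print(decoded == data)             # True

# Arbitrary binary data survives the round trip as well.
blob = bytes(range(256))
round_trip = base64.b64decode(base64.b64encode(blob))
print(round_trip == blob)          # True
```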

XML and HTML Entity Encoding

XML and HTML are encoded in UTF-8 or UTF-16 by default; if another encoding is used, it must be declared at the beginning of the document. However, because some Internet-related protocols can work only with 8-bit character sets, you sometimes need to encode characters outside that range in a special format. You can refer to specific characters using an XML entity reference. Entity references always start with "&" and end with ";" and every time an XML parser encounters an entity, it tries to expand it, which means that it tries to replace it with the appropriate replacement text for that entity. An entity reference can refer to a long string of text, either declared within the document or residing outside the document, but the references used for character encoding are automatically expanded.

The generic way to refer to a particular character is to use a numeric entity reference, like so:

&#39;

This particular reference refers to the apostrophe character. The ampersand is followed by "#", which identifies it as a numeric character reference and the numbers that follow it are decimal representations of a particular code point. In addition to using entities to refer to characters that are outside the 1-byte range, there are also some characters that need to be encoded when you want to refer to them literally because they are used to mark up an XML document. The following table specifies those entities and what they represent:

&amp;	&
&gt;	>
&lt;	<
&apos;	'
&quot;	"
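To see both kinds of reference at work, here is a short Python sketch (not REALbasic) built on standard library helpers. `escape` from xml.sax.saxutils handles "&", "<", and ">" on its own; the two quote characters are passed in explicitly. The sample strings are made up for the example.

```python
import html
from xml.sax.saxutils import escape

# escape() always handles &, < and >; the quote characters are supplied as extras.
quotes = {"'": "&apos;", '"': "&quot;"}
escaped = escape("Tom & \"Jerry\" <cat's>", quotes)
print(escaped)    # Tom &amp; &quot;Jerry&quot; &lt;cat&apos;s&gt;

# Characters outside ASCII can be written as decimal numeric references.
numeric = "caf\u00e9".encode("ascii", "xmlcharrefreplace").decode("ascii")
print(numeric)    # caf&#233;

# Expanding a numeric reference gives back the character it stands for.
apostrophe = html.unescape("&#39;")
print(apostrophe)  # '
```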

Detecting Encoding in XML Documents

If an XML document does not declare an encoding, it is supposed to be either UTF-8 or UTF-16. (The BOM is optional.) However, despite the fact that they are supposed to declare their encoding, some do not, and others may declare an encoding that is different from what the text is actually encoded in. This is surprisingly common (in my experience, at least), so checking is helpful. XML documents have to start with an XML declaration:

<?xml version="1.0" encoding="UTF-8"?>

Regardless of the version or encoding, you definitely know that each XML document will start with <?xml. Because you know what those characters have to be, you can deduce the encoding being used by examining the data in the first four positions. In some cases, you are just narrowing down the options, so you need to check the encoding declaration in the XML document as well to make sure it is consistent with what you have found.

00 00 00 3C	UCS-4, big-endian, or other 32-bit format
3C 00 00 00	UCS-4, little-endian
00 00 3C 00	UCS-4, unusual byte order
00 3C 00 00	UCS-4, unusual byte order
00 3C 00 3F	UTF-16BE, big-endian ISO-10646-UCS-2
3C 00 3F 00	UTF-16LE, little-endian ISO-10646-UCS-2
3C 3F 78 6D	UTF-8, ISO 646, ASCII, ISO 8859, Shift-JIS, EUC.
4C 6F A7 94	EBCDIC
Other?	Unknown
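The lookup described above amounts to a small table keyed on the first four bytes. The following Python sketch shows one way to write it; `sniff_xml` is a hypothetical name, and a real implementation would also check for a BOM first and then read the encoding declaration to confirm the guess.

```python
# The first four bytes of "<?xml" under each family of encodings.
SIGNATURES = {
    bytes.fromhex("0000003C"): "UCS-4, big-endian",
    bytes.fromhex("3C000000"): "UCS-4, little-endian",
    bytes.fromhex("00003C00"): "UCS-4, unusual byte order",
    bytes.fromhex("003C0000"): "UCS-4, unusual byte order",
    bytes.fromhex("003C003F"): "UTF-16BE",
    bytes.fromhex("3C003F00"): "UTF-16LE",
    bytes.fromhex("3C3F786D"): "UTF-8 or another ASCII-compatible encoding",
    bytes.fromhex("4C6FA794"): "EBCDIC",
}

def sniff_xml(data):
    """Guess the encoding family of an XML document from its first four bytes."""
    return SIGNATURES.get(data[:4], "Unknown")

declaration = '<?xml version="1.0"?>'
print(sniff_xml(declaration.encode("utf-8")))      # UTF-8 or another ASCII-compatible encoding
print(sniff_xml(declaration.encode("utf-16-le")))  # UTF-16LE
print(sniff_xml(b"not xml"))                       # Unknown
```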