4.5 The String Type

	ActionScript for Flash MX: The Definitive Guide, 2nd Edition By Colin Moock

	Chapter 4. Primitive Datatypes

String is the datatype used for textual data (letters, punctuation marks, and other characters). A string literal is any combination of characters enclosed in single or double quotation marks:

"asdfksldfsdfeoif"  // A frustrated string
"greetings"         // A friendly string
"unity@moock.org"   // A self-promotional string
"123"               // It may look like a number, but it's a string
'singles'           // Single quotes are acceptable too

Before we see how to form string literals, let's examine which characters are permitted in strings.

4.5.1 Character Encoding

Like all computer data, text characters are stored internally using a numeric code. They are encoded for storage and decoded for display using a character set, which maps (i.e., relates) characters to their numeric codes. Character sets vary for different languages and alphabets. Originally, most Western applications used some derivative of ASCII, a standard character set that includes only 128 characters: the English alphabet, numbers, basic punctuation marks, and special control characters. In time, Western applications supported a family of character sets known collectively as ISO-8859, an extension of ASCII. Each of the ISO-8859 character sets encodes the standard Latin alphabet (A to Z), plus a varying set of letters needed in the target languages. Flash Player 5, for example, uses ISO-8859-1, also known as Latin 1, as its primary character map (it also supports a second character set, called Shift-JIS, for Japanese characters).

The Latin 1 character set accommodates most Western European languages (French, German, Italian, Spanish, Portuguese, and so on) but not languages such as Greek, Turkish, Slavic, and Russian. In order to support the full range of the world's languages, Flash Player 6 uses Unicode, the preferred international standard for character encoding, which can map a million or more characters. Flash Player 6 can display most characters supported by Unicode, regardless of the regional settings, provided that the appropriate font is available on the end user's system. However, Flash Player 6 does not support right-to-left or bidirectional scripts, such as Arabic and Hebrew.
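The way these character sets nest inside one another can be sketched in a few lines. The sketch below is plain ECMAScript (the standard on which ActionScript is based), so it can be tried outside Flash. The function name smallestCharset is hypothetical, and the code assumes a Unicode-aware environment such as Flash Player 6, not Flash Player 5.

```javascript
// Hypothetical helper (not from the book): names the smallest character set,
// among those discussed above, that contains a given code point.
function smallestCharset(codePoint) {
  if (codePoint < 128) {
    return "ASCII";      // Code points 0-127
  }
  if (codePoint < 256) {
    return "Latin 1";    // Code points 128-255 (ISO-8859-1)
  }
  return "Unicode";      // Code points 256 and above
}

var ascii = smallestCharset("A".charCodeAt(0));             // "A" is code point 65
var latin1 = smallestCharset("\u00E9".charCodeAt(0));       // e with acute accent, code point 233
var unicodeOnly = smallestCharset("\u03A9".charCodeAt(0));  // Greek capital omega, code point 937
```

Here ascii holds "ASCII", latin1 holds "Latin 1", and unicodeOnly holds "Unicode", which matches the observation that Greek falls outside Latin 1.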

Appendix B lists each character's Unicode code point, which is the character's numeric position in the Unicode set. Later, we'll see how to use code points to manipulate characters in our scripts and to display Unicode characters. For more information on Unicode, see the resources cited in Appendix A.

4.5.2 String Literals

The most common way to create a string is to put either single or double quotation marks around a group of characters:

"hello"
'Nice night for a walk.'
"The equation is 12 + 4 = 16, which programmers see as 12 + 4 == 16."

If we use a double quotation mark to start a string, we must end it with a double quotation mark as well. Likewise, if we use a single quotation mark to start a string, we must end that string with a single quotation mark. However, a double-quoted string may contain single-quoted characters and vice versa. These strings, for example, contain legal uses of single and double quotes:

"Nice night, isn't it?"               // Single apostrophe inside double quotes
'I said, "What a pleasant evening!"'  // Double quotes inside single quotes

4.5.2.1 The empty string

The shortest possible string is the empty string, a string with no characters:

""
''

The empty string is handy when we're trying to detect whether a variable contains a usable string value or not. Here we check whether the input text field, firstName_txt, has a usable value:

if (firstName_txt.text == "") {
  trace("You forgot to enter your name!");
}

The empty string is not considered equal to one or more space characters (e.g., "" is not the same as " "). To check whether a string has at least one usable character, we revise our code to detect any character above Unicode code point 32 (code point 32 is the space character, and everything below it is a control character, such as a tab or a line break), as shown in Example 4-2.

Example 4-2. A custom String.isEmpty( ) method

// Checks whether the string contains any characters above Unicode 32.
String.prototype.isEmpty = function ( ) {
  // If a useful character is found...
  for (var i=0; i < this.length; i++) {
    if (this.charCodeAt(i) > 32) {
      // The string is not "empty."
      return false;
    }
  }

  // No useful characters were found. The string is "empty."
  return true;
}

// Now use the new method to test if our firstName_txt text field has a useful value.
if (firstName_txt.text.isEmpty( )) {
  trace("You forgot to enter your name!");
} else {
  trace("Welcome, " + firstName_txt.text);
}
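For readers who would rather not modify String.prototype, the same test can be sketched as a standalone function. The name isBlank is hypothetical, and the sample values stand in for the contents of a text field. The code is plain ECMAScript, so it can be tried outside Flash.

```javascript
// Hypothetical standalone variant of the test in Example 4-2.
// Returns true when str contains no character above code point 32.
function isBlank(str) {
  for (var i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 32) {
      return false;   // Found a usable character
    }
  }
  return true;        // Only spaces, control characters, or nothing at all
}

var emptyResult = isBlank("");         // true: no characters
var spacesResult = isBlank("   ");     // true: spaces only (code point 32)
var controlResult = isBlank("\t\n");   // true: tab and newline are below 32
var nameResult = isBlank(" Colin ");   // false: contains letters
```

Unlike a comparison with "", this treats a field containing only spaces or line breaks as blank.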

4.5.2.2 Escape sequences

We have seen that single quotes (' ) may be used inside double-quoted literals, and double quotes (" ) may be used inside single-quoted literals. But what if we want to use both? For example:

'I remarked "Nice night, isn't it?"'

As is, this line of code causes an error because the ActionScript compiler thinks that the string literal ends with the apostrophe in the word "isn't." The compiler reads it as:

'I remarked "Nice night, isn'  // The rest is considered unintelligible garbage

To use the single quote inside a string literal delimited by single quotes, we must use an escape sequence. An escape sequence represents a literal string value using a backslash (\), followed by the desired character or a code that represents the character. The escape sequences for single and double quotes are:

\'
\"

So, our cordial evening greeting, properly expressed as a string literal, should be:

'I remarked "Nice night, isn\'t it?"'  // Escape the apostrophe!
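As a quick check, the same sentence can be written with either kind of delimiter, provided that the matching quote character is escaped. In this sketch (plain ECMAScript, with illustrative variable names), both literals produce an identical string value, and the backslashes do not become part of the string.

```javascript
var singleQuoted = 'I remarked "Nice night, isn\'t it?"';   // Escape the apostrophe
var doubleQuoted = "I remarked \"Nice night, isn't it?\"";  // Escape the double quotes

// Both literals describe the same sequence of characters.
var sameText = (singleQuoted == doubleQuoted);

// The backslash belongs to the escape sequence, not to the string itself.
var hasNoBackslash = (singleQuoted.indexOf("\\") == -1);
```

Both sameText and hasNoBackslash are true.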

Other escape sequences, which can be used to represent various special or reserved characters, are listed in Table 4-1.

Table 4-1. ActionScript escape sequences
Escape sequence	Meaning
`\b`	Backspace character (ASCII 8)
`\f`	Form feed character (ASCII 12)
`\n`	Newline character; causes a line break (ASCII 10)
`\r`	Carriage return (CR) character; causes a line break (ASCII 13)
`\t`	Tab character (ASCII 9)
`\'`	Single quotation mark
`\"`	Double quotation mark
`\\`	Backslash character; necessary when using backslash as a literal character, to prevent `\` from being interpreted as the beginning of an escape sequence
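Each escape sequence in Table 4-1 stands for exactly one character. A short sketch in plain ECMAScript can confirm this by reading back the character codes with charCodeAt( ); the variable names are illustrative only.

```javascript
// Each escape sequence yields a single character with the code listed in Table 4-1.
var backspaceCode = "\b".charCodeAt(0);  // 8
var tabCode = "\t".charCodeAt(0);        // 9
var newlineCode = "\n".charCodeAt(0);    // 10
var formFeedCode = "\f".charCodeAt(0);   // 12
var returnCode = "\r".charCodeAt(0);     // 13
var backslashCode = "\\".charCodeAt(0);  // 92

// A doubled backslash in the source counts as one character in the string.
var pathLength = "c:\\temp".length;      // 7: c : \ t e m p
```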

4.5.2.3 Unicode escape sequences

Not all Unicode characters are accessible from a keyboard. In order to include inaccessible characters in a string, we use Unicode escape sequences.

A Unicode escape sequence starts with a backslash and a lowercase u (i.e., \u) followed by a four-digit hex number that corresponds to the Unicode character's code point, such as:

\u0040  // The @ sign
\u00A9  // The copyright symbol
\u0041  // The capital letter "A"
\u2014  // The em dash

A code point is a unique identification number that is assigned to each character in the Unicode character set. See Appendix B for a list of the Unicode code points for Latin 1. Code points for other languages can be found in the character charts at the Unicode Consortium site:

http://www.unicode.org/charts

If you have trouble finding a character amongst the thousands of code points, consult the Unicode Consortium's helpful suggestions:

http://www.unicode.org/unicode/standard/where/

To escape characters from the Latin 1 character set only, we can use a short form for the standard Unicode escape sequence. The short form consists of the prefix \x followed by a two-digit hexadecimal number that represents the Latin 1 code point of the character. Since Latin 1 code points are the same as the first 256 Unicode code points, you can still use the reference chart in Appendix B, but simply remove the u00, as in the following examples:

\u0040  // Unicode escape sequence
\x40    // \x shortcut form
\u00A9  // Unicode...
\xA9    // ...you get the idea
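The equivalence of the two forms can be verified directly. The following sketch is plain ECMAScript and assumes a Unicode-aware environment (Flash Player 6 rather than Flash Player 5); the variable names are illustrative.

```javascript
var atSign = "\u0040";      // Unicode escape sequence
var atShort = "\x40";       // \x shortcut form
var copyright = "\u00A9";
var copyShort = "\xA9";

// The long and short forms produce the same one-character strings.
var atMatches = (atSign == "@") && (atSign == atShort);
var copyMatches = (copyright == copyShort);

// The hex digits in the escape are the code point itself.
var copyCode = copyright.charCodeAt(0);   // 169, i.e., hex A9
var emDashCode = "\u2014".charCodeAt(0);  // 8212, i.e., hex 2014 (no \x form exists)
```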

In addition to using Unicode escape sequences, we can insert any character into a string via the built-in fromCharCode( ) function, described later in this chapter under Section 4.6.9. The fromCharCode( ) function accepts a code point as any number-yielding expression, whereas an escape sequence requires a numeric literal.
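To illustrate the difference, the sketch below computes code points in a loop, something an escape sequence cannot do because its digits must be written literally in the source. It is plain ECMAScript with illustrative variable names, not code from this book.

```javascript
// Build the uppercase Latin alphabet from computed code points.
var alphabet = "";
for (var i = 0; i < 26; i++) {
  alphabet += String.fromCharCode(65 + i);   // 65 is the code point of "A"
}

// A numeric literal works too, so this is equivalent to "\u00A9".
var copyrightSign = String.fromCharCode(0xA9);
```

After the loop, alphabet contains "ABCDEFGHIJKLMNOPQRSTUVWXYZ".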

Flash Player 5 supported Unicode-style escape sequences, but it did not support Unicode. A Unicode escape sequence in Flash 5 could be used to specify code points in the Latin 1 or Shift-JIS character sets only, not Unicode code points. Inserting Unicode code points will not yield the correct character in Flash Player 5 or in .swf files exported in Flash 5 format.

4.5.3 Entering Multilingual and Special Characters

Though Flash Player 6 fully supports Unicode, the Flash MX authoring tool does not. The Flash MX authoring tool allows entry of characters from the Latin 1, Shift-JIS, and MacRoman character sets. Other character sets, in combination with international keyboards and operating systems, may pose display problems. Hence, the most flexible way to enter Unicode text in Flash MX is to load it from an external Unicode-formatted text file. On Windows, Notepad, Word, and WordPad all support saving as Unicode. On Macintosh OS X, TextEdit supports saving as Unicode.

The simplest way to import a Unicode text file is to use the #include directive in concert with the special //!-- UTF8 comment, which effectively copies and pastes the contents of the file into a Flash movie at compile time. For information on using #include with Unicode, see the #include directive in the Language Reference. External Unicode text can also be stored as XML or URL-encoded variables. For information on loading external XML or variable files, see the XML.load( ) and LoadVars.load( ) methods in the ActionScript Language Reference.

Note that the specific encoding format supported by #include and LoadVars is UTF-8, while the encoding formats supported by XML are UTF-8, UTF-16BE, and UTF-16LE (i.e., either big-endian or little-endian). When a file is UTF-16 encoded, it is expected to start with a byte order mark (BOM) indicating whether the encoding is big-endian or little-endian. Most text editors add the BOM automatically. When no BOM is present in a file, the encoding is assumed to be UTF-8. When in doubt, you should use UTF-8 encoding, where byte order is not an issue. For more information on UTFs and BOMs, see:
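The BOM rule just described can be expressed as a small decision function. This is only an illustration of the rule, not a description of how Flash Player is implemented; Flash performs this check internally when loading a file. The function name detectEncoding and the representation of the file as an array of byte values are assumptions made for the sketch, which is written in plain ECMAScript.

```javascript
// Hypothetical helper: guesses a file's encoding from its first two bytes.
// bytes is an array of numbers in the range 0-255.
function detectEncoding(bytes) {
  if (bytes.length >= 2) {
    if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
      return "UTF-16BE";   // BOM stored most significant byte first
    }
    if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
      return "UTF-16LE";   // BOM stored least significant byte first
    }
  }
  return "UTF-8";          // No UTF-16 BOM found: assume UTF-8
}

// The euro sign (code point 20AC) in three encodings:
var bigEndian = detectEncoding([0xFE, 0xFF, 0x20, 0xAC]);     // "UTF-16BE"
var littleEndian = detectEncoding([0xFF, 0xFE, 0xAC, 0x20]);  // "UTF-16LE"
var utf8 = detectEncoding([0xE2, 0x82, 0xAC]);                // "UTF-8"
```

Note how the two data bytes of the euro sign swap places between the big-endian and little-endian files, which is exactly the ambiguity the BOM resolves.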

http://www.unicode.org/unicode/faq/utf_bom.html

To include small amounts of Unicode directly within the Flash MX authoring tool, use either the standard Unicode hex escape sequence (described earlier in this chapter) or String.fromCharCode( ) (described later in this chapter).

For example, the following code creates a global variable, euro, that contains the euro sign character, and then inserts that character into the price_txt text field for display.

_global.euro = "\u20AC";
this.createTextField("price_txt", 1, 100, 100, 200, 20);
price_txt.text = "99 " + euro;

In Flash Player 5, all Unicode (\u) escape sequences must reference a code point in the Latin 1 or Shift-JIS character sets; other Unicode characters will not display correctly. For example, in Flash Player 5, the code in our example would not display the euro sign character correctly.
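The euro example relies on an escape sequence, whose code point must be written literally. When the symbol is chosen at runtime, String.fromCharCode( ) is the alternative. Here is a minimal sketch in plain ECMAScript, assuming Flash Player 6 or another Unicode-aware environment; the function name formatPrice is hypothetical.

```javascript
// Hypothetical helper: appends a currency symbol, given by code point, to an amount.
function formatPrice(amount, symbolCodePoint) {
  return amount + " " + String.fromCharCode(symbolCodePoint);
}

var euroPrice = formatPrice(99, 0x20AC);          // Euro sign, code point 20AC
var matchesEscape = (euroPrice == "99 \u20AC");   // Same result as the escape sequence
```

In a Flash movie, the result could be assigned to a text field in the same way as before, for example price_txt.text = formatPrice(99, 0x20AC). The Flash Player 5 limitation applies equally to this approach.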