19.13.1. ProblemYou want to work with UTF-8-encoded text in your programs. For example, you want to properly calculate the length of multibyte strings and make sure that all text is output as proper UTF-8-encoded characters. 19.13.2. SolutionUse a combination of PHP functions for the variety of tasks that UTF-8 compliance demands. If the mbstring extension is available, use its string functions for UTF-8-aware string manipulation. Example 19-26 uses the mb_strlen( ) function to compute the number of characters in each of two UTF-8-encoded strings. Using mb_strlen( )
Example 19-26 prints: Kurt Gödel is 11 bytes and 10 chars is 9 bytes and 3 chars The iconv extension, which is available by default in PHP 5, also offers a few multibyte-aware string manipulation functions, as shown in Example 19-27. Using iconv
Use the optional third argument to functions such as htmlentities( ) and htmlspecialchars( ) that instructs them to treat input as UTF-8 encoded, as shown in Example 19-28. UTF-8 HTML encoding
19.13.3. DiscussionEternal vigilance is the price of proper character encoding, at least until PHP 6 is released. If you've followed the instructions in Recipes 19.11 and 19.12, data coming into your program should be UTF-8 encoded and browsers will properly handle data coming out of your program as UTF-8 encoded. This leaves you with two responsibilities: to operate on strings in a UTF-8-aware manner and to generate text that is UTF-8 encoded. Fulfulling the first responsibility is made easier once you have adopted the fundamental credo of internationalization awareness: a character is not a byte. The PHP-specific correlary to this axiom is that PHP's string functions only know about bytes, not characters. For example, the strlen( ) function counts the number of bytes in a string, not the number of characters. In the prelapsarian days of ISO-8859-1 encoding, this wasn't a problem'each of the 256 characters in the character set took up one byte. A UTF-8-encoded character, on the other hand, uses between one and four bytes. The mbstring and iconv extensions provide alternatives for some string functions that operate on a character-by-character basis, not a byte-by-byte basis. These functions are listed in Table 20-3.
For mbstring to work properly, it needs to be told to use the UTF-8 encoding scheme. As in Example 19-26, you can do this in script with the mb_internal_encoding( ) function. Or to set this value system-wide, set the mbstring.internal_encoding configuration directive to UTF-8. iconv has similar needs. Use the iconv_set_encoding( ) function as in Example 19-27 or set the iconv.internal_encoding configuration directive. mbstring provides alternatives for the ereg family of regular expression functions. However, you can always use UTF-8 strings with the PCRE (preg_*( )) regular expression functions. The u modifier tells a preg function that the pattern string is UTF-8 encoded and enables the use of various Unicode properties in patterns. Example 19-29 uses the "lowercase letter" Unicode property to count the number of lowercase letters in each of two strings. UTF-8 regular expression matching
Example 19-29 prints: There are 7 lowercase letters in Kurt Gödel. There are 3 lowercase letters in . Other functions help you translate between other character encodings and UTF-8. The utf8_encode( ) and utf8_decode( ) functions move strings between the ISO-8859-1 encoding and UTF-8. Because ISO-8859-1 is the default encoding in many situations, these functions are a handy way to bring non-UTF-8-aware data into compliance. For example, the dictionaries that the pspell extension uses often have their entries encoded in ISO-8859-1. In Example 19-30, the utf8_encode( ) function is necessary to turn the output of pspell_suggest( ) into a proper UTF-8-encoded string. Applying UTF-8 encoding to ISO-8859-1 strings
It may ease the cognitive burden of proper character encoding to think of it as a task similar to HTML entity encoding. In each case, text must be processed so that it is appropriately formatted for a particular context. With entity encoding, that usually means running data retrieved from an external source through htmlentities( ) or htmlspecialchars( ). With character encoding, it means turning everything into UTF-8 before you process it, using a character-aware function for string operations, and ensuring strings are UTF-8 encoded before outputting them. 19.13.4. See AlsoRecipes 19.11 and 19.12 for setting up your programs for receiving and sending UTF-8-encoded strings; documentation on mbstring at http://www.php.net/mbstring, on iconv at http://www.php.net/iconv, on htmlentities( ) at http://www.php.net/htmlentities, on htmlspecialchars( ) at http://www.php.net/htmlspecialchars, on PCRE pattern syntax at http://www.php.net/reference.pcre.pattern.syntax, on utf8_encode( ) at http://www.php.net/utf8_encode, and on utf8_decode( ) at http://www.php.net/utf8_decode. Good background resources on managing PHP and character set issues include:
|