Recipe 19.13. Manipulating UTF-8 Text

19.13.1. Problem

You want to work with UTF-8-encoded text in your programs. For example, you want to properly calculate the length of multibyte strings and make sure that all text is output as proper UTF-8-encoded characters.

19.13.2. Solution

Use a combination of PHP functions for the variety of tasks that UTF-8 compliance demands.

If the mbstring extension is available, use its string functions for UTF-8-aware string manipulation. Example 19-26 uses the mb_strlen( ) function to compute the number of characters in each of two UTF-8-encoded strings.

Example 19-26. Using mb_strlen( )

<?php
// Set the encoding properly
mb_internal_encoding('UTF-8');

// ö is two bytes
$name = 'Kurt Gödel';
// Each of these Hangul characters is three bytes
$dinner = '불고기';

$name_len_bytes = strlen($name);
$name_len_chars = mb_strlen($name);

$dinner_len_bytes = strlen($dinner);
$dinner_len_chars = mb_strlen($dinner);

print "$name is $name_len_bytes bytes and $name_len_chars chars\n";
print "$dinner is $dinner_len_bytes bytes and $dinner_len_chars chars\n";
?>

Example 19-26 prints:

Kurt Gödel is 11 bytes and 10 chars
불고기 is 9 bytes and 3 chars

The iconv extension, which is available by default in PHP 5, also offers a few multibyte-aware string manipulation functions, as shown in Example 19-27.

Example 19-27. Using iconv

<?php
// Set the encoding properly
iconv_set_encoding('internal_encoding','UTF-8');

// ö is two bytes
$name = 'Kurt Gödel';
// Each of these Hangul characters is three bytes
$dinner = '불고기';

$name_len_bytes = strlen($name);
$name_len_chars = iconv_strlen($name);

$dinner_len_bytes = strlen($dinner);
$dinner_len_chars = iconv_strlen($dinner);

print "$name is $name_len_bytes bytes and $name_len_chars chars\n";
print "$dinner is $dinner_len_bytes bytes and $dinner_len_chars chars\n <br/>";

print "The seventh character of $name is " . iconv_substr($name,6,1) . "\n";
print "The last two characters of $dinner are " . iconv_substr($dinner,-2);
?>

Use the optional third argument to functions such as htmlentities( ) and htmlspecialchars( ) that instructs them to treat input as UTF-8 encoded, as shown in Example 19-28.

Example 19-28. UTF-8 HTML encoding

<?php
$encoded_name = htmlspecialchars($_POST['name'], ENT_QUOTES, 'UTF-8');
$encoded_dinner = htmlentities($_POST['dinner'], ENT_QUOTES, 'UTF-8');
?>
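To see how the two functions differ on UTF-8 input, the following sketch escapes a hard-coded string rather than form data. The variable names are illustrative and not part of the recipe; the string is written with byte escapes so the result does not depend on how the source file is saved.

```php
<?php
// "<b>Kurt Gödel</b>" as UTF-8 bytes; ö is the two bytes 0xC3 0xB6.
$name = "<b>Kurt G\xC3\xB6del</b>";

// htmlspecialchars() escapes only the characters that are special in HTML,
// so the ö passes through unchanged.
$special = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

// htmlentities() also replaces characters that have a named entity,
// so the ö becomes &ouml;.
$entities = htmlentities($name, ENT_QUOTES, 'UTF-8');

print "$special\n";
print "$entities\n";
```

Because the encoding is passed explicitly, both calls read the two bytes of ö as one character instead of escaping each byte separately.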

19.13.3. Discussion

Eternal vigilance is the price of proper character encoding, at least until PHP 6 is released. If you've followed the instructions in Recipes 19.11 and 19.12, data coming into your program should be UTF-8 encoded and browsers will properly handle data coming out of your program as UTF-8 encoded. This leaves you with two responsibilities: to operate on strings in a UTF-8-aware manner and to generate text that is UTF-8 encoded.

Fulfilling the first responsibility is made easier once you have adopted the fundamental credo of internationalization awareness: a character is not a byte. The PHP-specific corollary to this axiom is that PHP's string functions only know about bytes, not characters. For example, the strlen( ) function counts the number of bytes in a string, not the number of characters. In the prelapsarian days of ISO-8859-1 encoding, this wasn't a problem: each of the 256 characters in the character set took up one byte. A UTF-8-encoded character, on the other hand, uses between one and four bytes. The mbstring and iconv extensions provide alternatives for some string functions that operate on a character-by-character basis, not a byte-by-byte basis. These functions are listed in Table 19-3.

Table 19-3. Character-Based Functions
Regular function	mbstring function	iconv function
strlen( )	mb_strlen( )	iconv_strlen( )
strpos( )	mb_strpos( )	iconv_strpos( )
strrpos( )	mb_strrpos( )	iconv_strrpos( )
substr( )	mb_substr( )	iconv_substr( )
strtolower( )	mb_strtolower( )	-
strtoupper( )	mb_strtoupper( )	-
substr_count( )	mb_substr_count( )	-
ereg( )	mb_ereg( )	-
eregi( )	mb_eregi( )	-
ereg_replace( )	mb_ereg_replace( )	-
eregi_replace( )	mb_eregi_replace( )	-
split( )	mb_split( )	-
mail( )	mb_send_mail( )	-
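The difference between the byte-based and character-based columns is easiest to see side by side. This is a minimal sketch that assumes the mbstring extension is loaded; the variable names are made up for illustration.

```php
<?php
mb_internal_encoding('UTF-8');

// "Kurt Gödel" as UTF-8 bytes; ö occupies bytes 6 and 7.
$name = "Kurt G\xC3\xB6del";

// substr() counts bytes, so taking 7 bytes cuts the ö in half
// and leaves a string that is no longer valid UTF-8.
$byte_cut = substr($name, 0, 7);
$byte_cut_valid = mb_check_encoding($byte_cut, 'UTF-8');

// mb_substr() counts characters, so the first 7 characters stay intact.
$char_cut = mb_substr($name, 0, 7);

// The offset of "del" differs depending on the unit being counted.
$byte_pos = strpos($name, 'del');     // offset in bytes
$char_pos = mb_strpos($name, 'del');  // offset in characters

// mb_strtoupper() knows that the uppercase of ö is Ö.
$upper = mb_strtoupper($name);

print 'Byte cut is valid UTF-8: ' . ($byte_cut_valid ? 'yes' : 'no') . "\n";
print "Character cut: $char_cut\n";
print "Byte offset: $byte_pos, character offset: $char_pos\n";
print "Uppercase: $upper\n";
```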

For mbstring to work properly, it needs to be told to use the UTF-8 encoding scheme. As in Example 19-26, you can do this in a script with the mb_internal_encoding( ) function. To set this value system-wide instead, set the mbstring.internal_encoding configuration directive to UTF-8.

iconv has similar needs. Use the iconv_set_encoding( ) function as in Example 19-27 or set the iconv.internal_encoding configuration directive.
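Both extensions also accept the encoding as an optional final argument on most of these functions, which avoids depending on a global setting. The following is a small sketch of that approach, assuming both extensions are available; the variable names are illustrative.

```php
<?php
// "Kurt Gödel" as UTF-8 bytes; no internal encoding has been set.
$name = "Kurt G\xC3\xB6del";

// Pass the encoding explicitly to each call.
$mb_len    = mb_strlen($name, 'UTF-8');
$iconv_len = iconv_strlen($name, 'UTF-8');

// The seventh character (offset 6), extracted with each extension.
$mb_seventh    = mb_substr($name, 6, 1, 'UTF-8');
$iconv_seventh = iconv_substr($name, 6, 1, 'UTF-8');

print "mbstring: $mb_len chars, seventh is $mb_seventh\n";
print "iconv: $iconv_len chars, seventh is $iconv_seventh\n";
```

Passing the encoding on each call is more verbose, but it keeps a function correct even when it runs in a script that has configured a different internal encoding.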

mbstring provides alternatives for the ereg family of regular expression functions. However, you can always use UTF-8 strings with the PCRE (preg_*( )) regular expression functions. The u modifier tells a preg function to treat both the pattern and the subject string as UTF-8 encoded and enables the use of various Unicode properties in patterns. Example 19-29 uses the "lowercase letter" Unicode property to count the number of lowercase letters in each of two strings.

Example 19-29. UTF-8 regular expression matching

<?php
$name = 'Kurt Gödel';
$dinner = '불고기';

$name_lower = preg_match_all('/\p{Ll}/u',$name,$match);
$dinner_lower = preg_match_all('/\p{Ll}/u',$dinner,$match);

print "There are $name_lower lowercase letters in $name. \n";
print "There are $dinner_lower lowercase letters in $dinner. \n";
?>

Example 19-29 prints:

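There are 7 lowercase letters in Kurt Gödel.
There are 0 lowercase letters in 불고기.

Hangul has no letter case, so its characters belong to the "other letter" (Lo) Unicode category and \p{Ll} matches none of them.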
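The u modifier also changes what the dot matches. The sketch below is an illustration with made-up variable names: it compares a byte-wise match against a character-wise match and then uses an empty pattern to split a string into characters.

```php
<?php
// "Kurt Gödel" as UTF-8 bytes: 11 bytes, 10 characters.
$name = "Kurt G\xC3\xB6del";

// Without /u the dot matches one byte, so ten dots cannot cover 11 bytes.
$matches_bytes = preg_match('/^.{10}$/', $name);

// With /u the dot matches one character, and the string has ten of them.
$matches_chars = preg_match('/^.{10}$/u', $name);

// Splitting on the empty pattern with /u yields one element per character.
$chars = preg_split('//u', $name, -1, PREG_SPLIT_NO_EMPTY);

print "Byte-wise match: $matches_bytes\n";
print "Character-wise match: $matches_chars\n";
print 'Characters found: ' . count($chars) . "\n";
```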

Other functions help you translate between other character encodings and UTF-8. The utf8_encode( ) and utf8_decode( ) functions move strings between the ISO-8859-1 encoding and UTF-8. Because ISO-8859-1 is the default encoding in many situations, these functions are a handy way to bring non-UTF-8-aware data into compliance. For example, the dictionaries that the pspell extension uses often have their entries encoded in ISO-8859-1. In Example 19-30, the utf8_encode( ) function is necessary to turn the output of pspell_suggest( ) into a proper UTF-8-encoded string.

Example 19-30. Applying UTF-8 encoding to ISO-8859-1 strings

<?php
$lang = isset($_GET['lang']) ? $_GET['lang'] : 'en';
$word = isset($_GET['word']) ? $_GET['word'] : 'asparagus';

$ps = pspell_new($lang);
$check = pspell_check($ps, $word);

print htmlspecialchars($word,ENT_QUOTES,'UTF-8');
print $check ? ' is ' : ' is not ';
print ' found in the dictionary.';
print '<hr/>';

if (! $check) {
    $suggestions = pspell_suggest($ps, $word);
    if (count($suggestions)) {
        print 'Suggestions: <ul>';
        foreach ($suggestions as $suggestion) {
            // Convert each ISO-8859-1 suggestion to UTF-8 before escaping it
            $utf8suggestion = utf8_encode($suggestion);
            $safesuggestion = htmlspecialchars($utf8suggestion,
                                               ENT_QUOTES,'UTF-8');
            print "<li>$safesuggestion</li>";
        }
        print '</ul>';
    }
}
?>
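The utf8_encode( ) function handles only ISO-8859-1 input. When the source encoding may vary, the iconv( ) and mb_convert_encoding( ) functions take the encodings as arguments. The following sketch uses ISO-8859-1 input so that the result is comparable to the example above; it assumes both extensions are installed, and the variable names are illustrative.

```php
<?php
// "Gödel" in ISO-8859-1: ö is the single byte 0xF6.
$latin1 = "G\xF6del";

// iconv() takes the source encoding first, then the target encoding.
$via_iconv = iconv('ISO-8859-1', 'UTF-8', $latin1);

// mb_convert_encoding() takes the target encoding first, then the source.
$via_mb = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

print 'Before: ' . strlen($latin1) . " bytes\n";
print 'After:  ' . strlen($via_iconv) . " bytes\n";
print 'Same result from both: ' . ($via_iconv === $via_mb ? 'yes' : 'no') . "\n";
```

Note that the two functions take their encoding arguments in opposite orders, which is an easy mistake to make when switching between them.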

It may ease the cognitive burden of proper character encoding to think of it as a task similar to HTML entity encoding. In each case, text must be processed so that it is appropriately formatted for a particular context. With entity encoding, that usually means running data retrieved from an external source through htmlentities( ) or htmlspecialchars( ). With character encoding, it means turning everything into UTF-8 before you process it, using a character-aware function for string operations, and ensuring strings are UTF-8 encoded before outputting them.
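The three steps in that last sentence can be sketched as follows. The accept_utf8( ) helper is hypothetical and exists only for this illustration; it relies on mb_check_encoding( ) from the mbstring extension to reject input that is not valid UTF-8.

```php
<?php
// Hypothetical helper: return the string if it is valid UTF-8, else null.
function accept_utf8($input) {
    return mb_check_encoding($input, 'UTF-8') ? $input : null;
}

// Step 1: make sure incoming data really is UTF-8.
$good = accept_utf8("Kurt G\xC3\xB6del");  // valid UTF-8
$bad  = accept_utf8("Kurt G\xF6del");      // ISO-8859-1 bytes, rejected

$output = '';
if ($good !== null) {
    // Step 2: operate on characters, not bytes.
    $short = mb_substr($good, 0, 7, 'UTF-8');

    // Step 3: tell the output-escaping function the encoding too.
    $output = htmlspecialchars($short, ENT_QUOTES, 'UTF-8');
}

print "$output\n";
```

A real program would more likely convert rejected input from its actual encoding than discard it, but that requires knowing or guessing the source encoding.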

19.13.4. See Also

Recipes 19.11 and 19.12 for setting up your programs for receiving and sending UTF-8-encoded strings; documentation on mbstring at http://www.php.net/mbstring, on iconv at http://www.php.net/iconv, on htmlentities( ) at http://www.php.net/htmlentities, on htmlspecialchars( ) at http://www.php.net/htmlspecialchars, on PCRE pattern syntax at http://www.php.net/reference.pcre.pattern.syntax, on utf8_encode( ) at http://www.php.net/utf8_encode, and on utf8_decode( ) at http://www.php.net/utf8_decode.

Good background resources on managing PHP and character set issues include:

"An Overview on Globalizing Oracle PHP Applications" by Peter Linsley (http://www.oracle.com/technology/tech/php/pdf/globalizing_oracle_php_applications.pdf)
Character Sets/Character Encoding Issues on the PHP WACT Wiki (http://www.phpwact.org/php/i18n/charsets)
"Characters vs. Bytes" by Tim Bray (http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF)
"A Tutorial on Character Code Issues" by Jukka Korpela (http://www.cs.tut.fi/~jkorpela/chars.html)