Recipe 12.11. Handling Content Encoding


12.11.1. Problem

PHP XML extensions use UTF-8, but your data is in a different content encoding.

12.11.2. Solution

Use the iconv library to convert your data to UTF-8 before passing it into an XML extension:

$utf_8 = iconv('ISO-8859-1', 'UTF-8', $iso_8859_1); 

Then convert it back when you are finished:

$iso_8859_1 = iconv('UTF-8', 'ISO-8859-1', $utf_8); 

12.11.3. Discussion

Character encoding is a major weakness of PHP 5. Fortunately, Unicode support is the main driver behind PHP 6. Because PHP 6 is still under development, in the meantime you can run into problems when you use the XML extensions with data in arbitrary encodings.

For simplicity, the XML extensions all use the UTF-8 character encoding exclusively: they expect input in UTF-8 and produce output in UTF-8. If your data is ASCII, you don't need to worry, because UTF-8 is a superset of ASCII. However, if you're using other encodings, you will run into trouble sooner or later.

To work around this issue, use the iconv extension to convert data manually between your character encoding and UTF-8. For example, to convert from ISO-8859-1 to UTF-8:

$utf_8 = iconv('ISO-8859-1', 'UTF-8', $iso_8859_1); 
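
As a rough illustration of the full round trip, the sketch below converts an ISO-8859-1 string to UTF-8, passes it through DOM, and converts the text back afterward. It assumes the DOM extension is available; the sample string and the element name book are made up for this example.

```php
<?php
// Hypothetical input: an ISO-8859-1 string (byte 0xF6 is "ö" in that encoding).
$iso_8859_1 = "G\xF6del, Escher, Bach";

// Convert to UTF-8 before handing the text to an XML extension.
$utf_8 = iconv('ISO-8859-1', 'UTF-8', $iso_8859_1);
if ($utf_8 === false) {
    throw new RuntimeException('Conversion to UTF-8 failed');
}

// Build a small document with DOM, which expects UTF-8 input.
$dom = new DOMDocument('1.0', 'UTF-8');
$book = $dom->createElement('book'); // element name chosen for illustration
$book->appendChild($dom->createTextNode($utf_8));
$dom->appendChild($book);

// Text read back from DOM is UTF-8, so convert it to the original encoding.
$from_dom = $dom->documentElement->textContent;
$round_trip = iconv('UTF-8', 'ISO-8859-1', $from_dom);

// True when the bytes survived the trip unchanged.
var_dump($round_trip === $iso_8859_1);
```

Every ISO-8859-1 character has a UTF-8 equivalent, so the first conversion should not lose data. The reverse direction is only safe when the XML processing has not introduced characters that fall outside ISO-8859-1.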

The iconv function supports two special modifiers for the destination encoding: //TRANSLIT and //IGNORE. The first option tells iconv that whenever it cannot exactly duplicate a character in the destination encoding, it should try to approximate it using a series of other characters. The other option makes iconv silently ignore any unconvertible characters.

For example, the string $geb holds the text Gödel, Escher, Bach. A straight conversion to ASCII produces an error:

echo iconv('UTF-8', 'ASCII', $geb);
PHP Notice:  iconv(): Detected an illegal character in input string...
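
Besides raising a notice, iconv() returns false when a conversion fails, so a script can test the return value instead of relying on the notice being displayed. The following is a minimal sketch, assuming PHP 7 or later for the type declarations; convert_or_fail() is a hypothetical helper name, not part of the iconv extension.

```php
<?php
// convert_or_fail() is a hypothetical helper, not part of the iconv extension.
function convert_or_fail(string $text, string $from, string $to): string
{
    // @ hides the notice; the false return value still reports the failure.
    $converted = @iconv($from, $to, $text);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert text from $from to $to");
    }
    return $converted;
}

$geb = "G\xC3\xB6del, Escher, Bach"; // UTF-8 bytes for "Gödel, Escher, Bach"

try {
    echo convert_or_fail($geb, 'UTF-8', 'ASCII'), "\n";
} catch (RuntimeException $e) {
    echo 'Conversion failed: ', $e->getMessage(), "\n";
}
```

Whether the ö actually triggers a failure depends on the iconv implementation PHP was built against, so the sketch handles both outcomes rather than assuming one.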

Enabling the //IGNORE feature allows the conversion to occur:

echo iconv('UTF-8', 'ASCII//IGNORE', $geb);

However, the output isn't nice, because the ö is missing:

Gdel, Escher, Bach

The best solution is to use //TRANSLIT:

echo iconv('UTF-8', 'ASCII//TRANSLIT', $geb);

This produces a better-looking string:

G"odel, Escher, Bach

However, be careful when you use //TRANSLIT, as it can increase the number of characters. For example, the single character ö becomes two characters: " and o.
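
One way to see the growth is to compare character counts before and after the conversion. This is only a sketch: the exact replacement that //TRANSLIT picks (and whether it succeeds at all) varies with the underlying iconv implementation and the current locale, so the code prints what it finds rather than assuming a particular result.

```php
<?php
$geb = "G\xC3\xB6del, Escher, Bach"; // UTF-8 bytes for "Gödel, Escher, Bach"

// Count characters in the UTF-8 source; strlen() would count bytes instead.
$before = iconv_strlen($geb, 'UTF-8');

// @ hides any notice; a false return value still reports the failure.
$ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $geb);

if ($ascii === false) {
    echo "This iconv build could not transliterate the string\n";
} else {
    $after = strlen($ascii); // ASCII uses one byte per character
    echo "$before characters before, $after after: $ascii\n";
}
```

If the converted text has to fit a fixed-width field or database column, check its length after the conversion rather than before.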

12.11.4. See Also

More information about working with UTF-8 text is in Recipe 19.13; documentation on iconv at http://www.php.net/iconv; the GNU libiconv home page at http://www.gnu.org/software/libiconv/.




PHP Cookbook, 2nd Edition
PHP Cookbook: Solutions and Examples for PHP Programmers
ISBN: 0596101015
Year: 2006
Pages: 445
