With our newfound (or refreshed) knowledge of character sets, it is now time to address how all this fits into PHP5. As we said earlier, PHP is largely an 8-bit English (ISO-8859-1) language engine. It, along with the web server engine, sends any character data it sees to the output stream and leaves the output to define itself, or the receiving browser to interpret the data.
Working with Other Character Sets
There are two places we can indicate our character set: in the encoding we use when saving our files, and in the output we send to the client.
To save text in a certain character set, you need a text editor that supports this functionality. For various 8-bit character sets (such as ISO-8859-XX and Windows-YYYY), no special effort is needed, as all formats are still streams of bytes; you only need to worry about how programs interpret and display them.
If one were to write a blurb in Turkish, it would look like this if displayed in the proper ISO-8859-9 character set:
Benim adım Mark. Kanadalıyım. Nasılsiniz? (Turkçe çok zor)
(Turkish is famous for the small letter i without the dot on top, ı.) The same file looks as follows if displayed in the default ISO-8859-1 character set (used on most English-language computers):
Benim adým Mark. Kanadalýyým. Nasýlsiniz? (Turkçe çok zor)
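The difference comes down to how a single byte is interpreted. As a minimal sketch, assuming the mbstring extension is available (the sample word and variable names are illustrative only), the following converts the same bytes to UTF-8 twice, once treating them as ISO-8859-9 and once as ISO-8859-1:

```php
<?php
// Illustrative sample: the byte 0xFD is the dotless "ı" in ISO-8859-9
// but "ý" in ISO-8859-1. The bytes themselves never change.
$bytes = "ad\xFDm";

// Interpret the same bytes under each character set and convert to UTF-8.
$asTurkish = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-9');
$asLatin1  = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

echo $asTurkish, "\n"; // the Turkish reading of the bytes
echo $asLatin1, "\n";  // the Western European reading of the same bytes
```

Only the interpretation differs between the two calls; the input string is identical in both.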
As for other formats where characters are split into multiple bytes, you need to work with an editor that understands how to save the characters into the various formats. Fortunately, most modern text editors support the various Unicode encodings, including the choice of whether or not to include a signature at the beginning of the file. Many editors also support some, if not all, of the Asian character sets.
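To make the "signature" concrete: for UTF-8, it is the three-byte sequence 0xEF 0xBB 0xBF at the very start of the file. A minimal sketch of checking for it follows; the function name has_utf8_signature is hypothetical and not part of PHP:

```php
<?php
// Hypothetical helper (not a built-in): reports whether a string of file
// contents begins with the UTF-8 signature bytes EF BB BF.
function has_utf8_signature(string $contents): bool
{
    return strncmp($contents, "\xEF\xBB\xBF", 3) === 0;
}

// Two illustrative inputs: one saved with a signature, one without.
$withSignature    = "\xEF\xBB\xBFhello";
$withoutSignature = "hello";

var_dump(has_utf8_signature($withSignature));
var_dump(has_utf8_signature($withoutSignature));
```

If a PHP script itself is saved with a signature, those bytes sit before the opening <?php tag and are sent to the client as ordinary output.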
In a pinch, the NOTEPAD.EXE that ships with the latest versions of Microsoft Windows supports loading and saving files in Unicode format. This is done via the "Save" dialog box you see when saving a file for the first time or when selecting "Save As" from the "File" menu (shown in Figure 6-1).
Figure 6-1. Saving files as Unicode in Notepad under Microsoft Windows.
Telling a client in which character set the current output stream is encoded depends on what you are sending them. For HTML, you will want to include a directive along the following lines in the <head> section of your output:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
By changing what is in the charset portion of the content attribute, you can influence how the end client displays your output. (Most web browsers honor this field.)
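The same declaration can also be sent as an HTTP header from PHP with the built-in header function. The following is a sketch only: the helper name content_type_for and the choice of utf-8 are assumptions made for illustration, and the charset you declare should match the one your files and data are really in.

```php
<?php
// Hypothetical helper: builds the Content-Type value for a given charset.
function content_type_for(string $charset): string
{
    return 'text/html; charset=' . $charset;
}

$charset     = 'utf-8'; // assumption: our output really is UTF-8
$contentType = content_type_for($charset);

// header() must be called before any output has been sent.
if (!headers_sent()) {
    header('Content-Type: ' . $contentType);
}

// Repeat the same declaration inside the <head> section of the page.
$metaTag = '<meta http-equiv="Content-Type" content="' . $contentType . '">';
echo "<html><head>", $metaTag, "</head><body></body></html>\n";
```

Building both declarations from one variable keeps the header and the meta tag from disagreeing with each other.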
Trouble Rears Its Ugly Head
While we have said that PHP is not particularly concerned with which character set you are using, the fact remains that most multi-byte character sets do require some conscious effort on the program developer's part to ensure that they operate correctly.
Problems will arise when you use built-in functions that are not aware of multi-byte character sets, such as many of the string functions we will see later in this chapter, and the routines for manipulating files on the file system of your computer. (See Chapter 24, "Files and Directories" for more detail.)
If we consider the following Japanese text (which merely professes the author's love of sushi) encoded as a UTF-8 string
we would expect the strlen function, which returns the length of the string passed to it as a parameter (see the following code), to return 9 (one for each character). In fact, strlen counts bytes rather than characters, so it will return the value 27, meaning that the characters average three bytes each!
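The gap between bytes and characters can be seen with a short sketch. The string below is a stand-in chosen for illustration (it is not the sample text discussed above), and the example assumes the mbstring extension is available and that the script file is saved as UTF-8:

```php
<?php
// Stand-in sample text: six Japanese characters, each of which
// occupies three bytes in UTF-8.
$text = '寿司が大好き';

$byteCount = strlen($text);             // counts bytes
$charCount = mb_strlen($text, 'UTF-8'); // counts characters

echo "bytes: $byteCount, characters: $charCount\n";
// prints "bytes: 18, characters: 6"
```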
What is even more frustrating is that some of these multi-byte characters can contain individual bytes that match up with other individual characters in a single-byte character set, such as ISO-8859-1. Although we only see 9 characters in the previous string, searching for ¿ (used for questions in Spanish) with the strpos function will yield some unexpected results. The ¿ character is represented in the ISO-8859-1 character set by the byte value 0xbf (191):
<?php
//
// chr() takes a number and returns a one-character string
// containing the byte with that value.
//
echo strpos('', chr(0xbf));
?>
We would expect strpos in this snippet to return FALSE, indicating that it could not find the character. Instead, it returns 8, because the byte 0xbf happens to occur inside one of the multi-byte characters!
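A sketch of the same pitfall with a stand-in string (again, not the sample text above) shows how a character-aware search differs. It assumes the mbstring extension and a script file saved as UTF-8; the first character of the stand-in, 寿, is encoded as the bytes E5 AF BF, so the byte 0xbf sits at offset 2:

```php
<?php
// Stand-in sample text; its first character ends in the byte 0xBF.
$text = '寿司が大好き';

// Byte-oriented search: finds the stray 0xBF inside the first character.
$bytePos = strpos($text, chr(0xbf));

// Character-oriented search for "¿" (bytes C2 BF in UTF-8): no match.
$charPos = mb_strpos($text, "\xC2\xBF", 0, 'UTF-8');

var_dump($bytePos); // int(2)
var_dump($charPos); // bool(false)
```

Because a valid result can be offset 0, compare the return value of either function against FALSE with the === operator.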
In short, if we are using multi-byte character set (MBCS) strings, we have to be very careful of which functions we pass them to and take appropriate measures for cases when we call a function that is not MBCS-safe. Throughout this book, we will point out which functions are safe or not safe to call with multi-byte strings.
How We Will Work with Characters
With all the stress associated with modern character sets, particularly Unicode, you may ask why we cannot just stick with a single-byte character set and avoid the whole mess in the first place. If your web application were to deal only with English-speaking customers with simple names who live in places with names that are easily represented in ASCII characters, all would be well. However, you would be doing yourself and your customers a disservice.
In an increasingly globalized world (or even within English-speaking countries), you will have customers with names requiring many characters, and international customers with addresses such as İzmir, Turkey or Łódź, Poland. For many of these, you might get away with single-byte character sets, but as the variety increases, you will start to see problems and limits arising from your choice. For the small web log application we will write in Chapter 32, "A Blogging Engine," we will want to encourage users to write entries and comments in any language they choose.
We will therefore do the following as we use PHP: