Making Sense of It All in PHP


With our newfound (or refreshed) knowledge of character sets, it is now time to address how all this fits into PHP5. As we said earlier, PHP is largely an 8-bit English (ISO-8859-1) language engine. It, along with the web server engine, sends any character data it sees to the output stream and leaves the output to define itself, or the receiving browser to interpret the data.

Working with Other Character Sets

There are two places we can indicate our character set:

  • Via the encoding in which we save our scripts and associated HTML content files: Files can be saved in any encoding, since files are just a collection of bytes that sometimes contain only text characters along with appropriate special characters (TABs, spaces, newlines, and so on). Some character encodings even support the placement of a few bytes at the beginning of the saved text file (often called a file signature) to indicate that the file is of the given format. (The Unicode encodings, such as UTF-8 and UTF-16, are the most notable examples of this.)

  • In an output stream: For cases when we are using a format in which we can indicate what encoding our output stream is using (for example, HTML or XHTML), we can put a directive in our output to tell receiving clients (in this case, web browsers) how to interpret what they receive.
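As a rough illustration of the file signature idea, the following sketch checks whether a string of bytes begins with the three-byte UTF-8 signature (0xEF 0xBB 0xBF). The function name hasUtf8Signature is our own invention for this example and is not a built-in PHP function.

```php
<?php
// Hypothetical helper (not part of PHP): returns TRUE when $bytes
// begins with the UTF-8 file signature, the bytes EF BB BF.
function hasUtf8Signature($bytes)
{
    return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
}

// A file's contents saved with a signature versus saved without one.
$withSignature    = "\xEF\xBB\xBF<html></html>";
$withoutSignature = "<html></html>";

var_dump(hasUtf8Signature($withSignature));    // bool(true)
var_dump(hasUtf8Signature($withoutSignature)); // bool(false)
```

For a file on disk, the same check could be applied to the string returned by file_get_contents.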

To save text in a certain character set, you need a text editor that supports this functionality. For various 8-bit character sets (such as ISO-8859-XX and Windows-YYYY), no special effort is needed, as all formats are still streams of bytes; you only need to worry about how programs interpret and display them.

If one were to write a blurb in Turkish, it would look like this if displayed in the proper iso-8859-9 character set:

 Benim adım Mark. Kanadalıyım. Nasılsiniz? (Turkçe çok zor) 

(Turkish is famous for the small letter i without the dot on top, ı.) The same file looks as follows if displayed in the default iso-8859-1 character set (used on most English language computers):

 Benim adým Mark. Kanadalýyým. Nasýlsiniz? (Turkçe çok zor) 
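The difference comes down to a single byte value, 0xFD, which ISO-8859-9 assigns to the dotless i and ISO-8859-1 assigns to y with an acute accent. A minimal sketch of this, assuming the mbstring extension is available, converts that one byte to UTF-8 under each interpretation and prints the resulting bytes in hexadecimal:

```php
<?php
// Requires the mbstring extension.
// 0xFD is the byte that the two character sets disagree about.
$byte = "\xFD";

// Interpreted as ISO-8859-9 (Turkish), it is the dotless i (U+0131).
$asTurkish = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-9');

// Interpreted as ISO-8859-1, it is y with an acute accent (U+00FD).
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($asTurkish), "\n"; // c4b1
echo bin2hex($asLatin1), "\n";  // c3bd
```

The byte in the file never changes; only the table used to interpret it does.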

As for other formats where characters are split into multiple bytes, you need to work with an editor that understands how to save the characters into the various formats. Fortunately, most modern text editors support the various Unicode encodings, including the choice of whether or not to include a signature at the beginning of the file. Many editors also support some, if not all, of the Asian character sets.

In a pinch, the NOTEPAD.EXE that ships with the latest versions of Microsoft Windows supports loading and saving files in Unicode format. This is done via the "Save" dialog box you see when saving a file for the first time or when selecting "Save As" from the "File" menu (shown in Figure 6-1).

Figure 6-1. Saving files as Unicode in Notepad under Microsoft Windows.


Telling a client in which character set the current output stream is encoded depends on what you are sending them. For HTML, you will want to include a directive along the following lines in the <head> section of your output:

 <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> 

By changing what is in the charset portion of the content attribute, you can influence how the end client displays your output. (Most web browsers honor this field.)
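From a PHP script, the same character set can also be announced in the HTTP Content-Type header by calling the built-in header function before any output is sent. The sketch below does both; charsetMetaTag is a hypothetical helper written for this example, and the choice of utf-8 is only an assumption.

```php
<?php
// Hypothetical helper (our own name): builds the <meta> directive
// for a given character set name.
function charsetMetaTag($charset)
{
    $safe = htmlspecialchars($charset, ENT_QUOTES);
    return '<meta http-equiv="Content-Type" content="text/html; charset='
         . $safe . '">';
}

$charset = 'utf-8';

// Headers must be sent before any output, so check first.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=' . $charset);
}

echo "<html>\n<head>\n", charsetMetaTag($charset), "\n</head>\n";
echo "<body></body>\n</html>\n";
```

Keeping the character set name in one variable helps ensure that the header and the meta directive do not contradict each other.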

Trouble Rears Its Ugly Head

While we have said that PHP is not particularly concerned with which character set you are using, the fact remains that most multi-byte character sets do require some conscious effort on the program developer's part to ensure that they operate correctly.

Problems will arise when you use built-in functions that are not aware of multi-byte character sets, such as many of the string functions we will see later in this chapter, and the routines for manipulating files on the file system of your computer. (See Chapter 24, "Files and Directories" for more detail.)

If we consider the following Japanese text (which merely professes the author's love of sushi) encoded as a UTF-8 string

we would expect the strlen function, which returns the length of the string passed to it as a parameter (see the following code), to return 9 (one for each character). In fact, the strlen function will return the value 27, because it counts bytes rather than characters: the characters average three bytes each!
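The gap between bytes and characters can be seen with any multi-byte string. The following sketch uses a stand-in string of three Japanese characters (an assumption for illustration, not the sentence discussed above), written as raw UTF-8 bytes so that the result does not depend on how the script file itself is saved. It assumes the mbstring extension is available for mb_strlen.

```php
<?php
// Stand-in text (an assumption, not the sentence from the chapter):
// three kanji, U+65E5 U+672C U+8A9E, written as raw UTF-8 bytes.
$text = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

$byteCount = strlen($text);             // counts bytes
$charCount = mb_strlen($text, 'UTF-8'); // counts characters (mbstring)

echo $byteCount, "\n"; // 9
echo $charCount, "\n"; // 3
```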

What is even more frustrating is that some of these multi-byte characters can contain individual bytes that match up with other individual characters in a single-byte character set, such as ISO-8859-1. Although we only see 9 characters in the previous string, searching for ¿ (used for questions in Spanish) with the strpos function will yield some unexpected results. The ¿ character is represented in the ISO-8859-1 character set by the byte value 0xbf (191), which lies outside the 7-bit ASCII range:

 <?php
   //
   // chr() takes a number and returns the single-byte character
   // for that number.
   //
   echo strpos('', chr(0xbf));
 ?> 

We would expect strpos in this snippet to return FALSE, indicating that it could not find the character. Instead, it returns 8!
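The same false match can be reproduced with a much shorter stand-in string. The sketch below uses a single hiragana character whose UTF-8 encoding happens to end in the byte 0xBF (our own choice for illustration, not the text from the chapter) and compares the byte-oriented strpos with the character-aware mb_strpos from the mbstring extension.

```php
<?php
// Stand-in text: the single hiragana character "mi" (U+307F),
// whose UTF-8 encoding is the three bytes E3 81 BF.
$text = "\xE3\x81\xBF";

// Byte-oriented search for the ISO-8859-1 byte of the inverted
// question mark: it matches the last byte of the hiragana character.
$bytePos = strpos($text, chr(0xBF));

// Character-oriented search for the inverted question mark as it is
// actually encoded in UTF-8 (C2 BF): no match.
$charPos = mb_strpos($text, "\xC2\xBF", 0, 'UTF-8');

var_dump($bytePos); // int(2)
var_dump($charPos); // bool(false)
```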

In short, if we are using multi-byte character set (MBCS) strings, we have to be very careful of which functions we pass them to and take appropriate measures for cases when we call a function that is not MBCS-safe. Throughout this book, we will point out which functions are safe or not safe to call with multi-byte strings.

How We Will Work with Characters

With all the stress associated with modern character sets, particularly Unicode, you may ask why we cannot just stick with a single-byte character set and avoid the whole mess in the first place. If your web application were to deal only with English-speaking customers with simple names who live in places with names that are easily represented in ASCII characters, all would be well. However, you would be doing yourself and your customers a disservice.

In an increasingly globalized world (or even within English-speaking countries), you will have customers with names requiring many characters, and international customers with addresses such as İzmir, Turkey or Łódź, Poland. For many of these, you might get away with single-byte character sets, but as the variety increases, you will start to see problems and limits arising from your choice. For the small web log application we will write in Chapter 32, "A Blogging Engine," we will want to encourage users to write entries and comments in any language they choose.

We will therefore do the following as we use PHP:

  • Use UTF-8 as our primary character set for saving code and HTML text and transmitting output to client browsers (an overwhelming majority of which have supported Unicode for some time). All .php and .html files we write will be saved in Unicode format.

  • Configure PHP as much as possible to support Unicode. We will discuss this more later in the chapter.

  • Note and pay special attention to which functions are not multi-byte safe and try to use the safe versions whenever possible. For the string and regular expression functions we will learn about later in this chapter, we will enable and use the available multi-byte versions.

  • Convert between the various character sets whenever we must use a function that does not support the required one. We will often use the functions utf8_encode and utf8_decode, which convert between ISO-8859-1 and UTF-8 (more on these later in this chapter), to facilitate our efforts.
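The last three points can be sketched together in a few lines. This is only an illustration under stated assumptions: it relies on the mbstring extension, the sample word is hypothetical input, and it uses mb_convert_encoding for the conversion because utf8_encode and utf8_decode are deprecated as of PHP 8.2.

```php
<?php
// Make UTF-8 the default for the mb_* functions in this script.
mb_internal_encoding('UTF-8');

// Hypothetical input from an ISO-8859-1 source: "café",
// with the final character stored as the single byte 0xE9.
$latin1 = "caf\xE9";

// Convert to UTF-8 before mixing it with the rest of our data.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // 636166c3a9
echo mb_strlen($utf8), "\n"; // 4 (characters)
echo strlen($utf8), "\n";    // 5 (bytes)

// Convert back when a single-byte-only function needs the data.
$roundTrip = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
var_dump($roundTrip === $latin1); // bool(true)
```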




Core Web Application Development with PHP and MySQL
ISBN: 0131867164
Year: 2005
Pages: 255
