4.7.1 The Character Set Used for Data and Sorting

By default, MySQL uses the ISO-8859-1 (Latin1) character set with sorting according to Swedish/Finnish rules. These defaults are suitable for the United States and most of western Europe.

All MySQL binary distributions are compiled with --with-extra-charsets=complex. This adds code to all standard programs that enables them to handle latin1 and all multi-byte character sets within the binary. Other character sets will be loaded from a character-set definition file when needed.

The character set determines what characters are allowed in names. It also determines how strings are sorted by the ORDER BY and GROUP BY clauses of the SELECT statement.

You can change the character set with the --default-character-set option when you start the server. The character sets available depend on the --with-charset=charset and --with-extra-charsets=list-of-charsets | complex | all | none options to configure, and the character set configuration files listed in SHAREDIR/charsets/Index. See Section 2.3.2, "Typical configure Options."

As of MySQL 4.1.1, you can also change the character set collation with the --default-collation option when you start the server. The collation must be a legal collation for the default character set. (Use the SHOW COLLATION statement to determine which collations are available for each character set.) See Section 2.3.2, "Typical configure Options."

If you change the character set when running MySQL, that may also change the sort order. Consequently, you must run myisamchk -r -q --set-character-set=charset on all tables, or your indexes may not be ordered correctly.

When a client connects to a MySQL server, the server indicates to the client what the server's default character set is. The client will switch to use this character set for this connection.

You should use mysql_real_escape_string() when escaping strings for an SQL query.
mysql_real_escape_string() is identical to the old mysql_escape_string() function, except that it takes the MYSQL connection handle as the first parameter so that the appropriate character set can be taken into account when escaping characters.

If the client was compiled with paths that differ from where the server is installed, and the user who configured MySQL didn't include all character sets in the MySQL binary, you must tell the client where to find the additional character sets it needs when the server runs with a different character set than the client. You can do this by specifying a --character-sets-dir option to indicate the path to the directory in which the dynamic MySQL character sets are stored. For example, you can put the following in an option file:

    [client]
    character-sets-dir=/usr/local/mysql/share/mysql/charsets

You can force the client to use a specific character set as follows:

    [client]
    default-character-set=charset

This is normally unnecessary, however.

4.7.1.1 Using the German Character Set

To get German sorting order, you should start mysqld with a --default-character-set=latin1_de option. This affects server behavior in several ways:
4.7.2 Setting the Error Message Language

By default, mysqld produces error messages in English, but they can also be displayed in any of these other languages: Czech, Danish, Dutch, Estonian, French, German, Greek, Hungarian, Italian, Japanese, Korean, Norwegian, Norwegian-ny, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, or Swedish.

To start mysqld with a particular language for error messages, use the --language or -L option. The option value can be a language name or the full path to the error message file. For example:

    shell> mysqld --language=swedish

Or:

    shell> mysqld --language=/usr/local/share/swedish

The language name should be specified in lowercase.

The language files are located (by default) in the share/LANGUAGE directory under the MySQL base directory.

To change the error message file, you should edit the errmsg.txt file, and then execute the following command to generate the errmsg.sys file:

    shell> comp_err errmsg.txt errmsg.sys

If you upgrade to a newer version of MySQL, remember to repeat your changes with the new errmsg.txt file.

4.7.3 Adding a New Character Set

This section discusses the procedure for adding another character set to MySQL. You must have a MySQL source distribution to use these instructions.

To choose the proper procedure, decide whether the character set is simple or complex:
For example, latin1 and danish are simple character sets, whereas big5 and czech are complex character sets.

In the following procedures, the name of your character set is represented by MYSET.

For a simple character set, do the following:
For a complex character set, do the following:
The sql/share/charsets/README file includes additional instructions.

If you want to have the character set included in the MySQL distribution, mail a patch to the MySQL internals mailing list. See Section 1.7.1.1, "The MySQL Mailing Lists."

4.7.4 The Character Definition Arrays

to_lower[] and to_upper[] are simple arrays that hold the lowercase and uppercase characters corresponding to each member of the character set. For example:

    to_lower['A'] should contain 'a'
    to_upper['a'] should contain 'A'

sort_order[] is a map indicating how characters should be ordered for comparison and sorting purposes. Quite often (but not for all character sets) this is the same as to_upper[], which means that sorting will be case-insensitive. MySQL will sort characters based on the values of sort_order[] elements. For more complicated sorting rules, see the discussion of string collating in Section 4.7.5, "String Collating Support."

ctype[] is an array of bit values, with one element per character. (Note that to_lower[], to_upper[], and sort_order[] are indexed by character value, but ctype[] is indexed by character value + 1. This is a legacy convention that makes it possible to handle EOF.)

You can find the following bitmask definitions in m_ctype.h:

    #define _U      01      /* Uppercase */
    #define _L      02      /* Lowercase */
    #define _N      04      /* Numeral (digit) */
    #define _S      010     /* Spacing character */
    #define _P      020     /* Punctuation */
    #define _C      040     /* Control character */
    #define _B      0100    /* Blank */
    #define _X      0200    /* heXadecimal digit */

The ctype[] entry for each character should be the union of the applicable bitmask values that describe the character. For example, 'A' is an uppercase character (_U) as well as a hexadecimal digit (_X), so ctype['A'+1] should contain the value:

    _U + _X = 01 + 0200 = 0201

4.7.5 String Collating Support

If the sorting rules for your language are too complex to be handled with the simple sort_order[] table, you need to use the string collating functions.
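
To make concrete what the simple tables from Section 4.7.4 can express, here is a minimal sketch in C. It is not taken from the MySQL source: every demo_ name is hypothetical, the CT_ macros are renamed copies of the _U, _L, _N, and _X values listed in m_ctype.h (renamed because identifiers that begin with an underscore and an uppercase letter are reserved in C), and it assumes an ASCII-compatible character set and fills in only letters and digits. It builds to_lower[], to_upper[], sort_order[], and ctype[] style tables and compares two strings byte by byte through the sort_order[] style table.

```c
#include <assert.h>
#include <string.h>

/* Renamed copies of the bitmask values from m_ctype.h (hypothetical names). */
#define CT_U 01    /* Uppercase */
#define CT_L 02    /* Lowercase */
#define CT_N 04    /* Numeral (digit) */
#define CT_X 0200  /* heXadecimal digit */

/* Hypothetical demo tables; only ASCII letters and digits are classified. */
static unsigned char demo_ctype[257];      /* indexed by character value + 1 */
static unsigned char demo_to_lower[256];   /* indexed by character value */
static unsigned char demo_to_upper[256];
static unsigned char demo_sort_order[256];

/* Fill the four tables; sort_order is made identical to to_upper,
   which gives case-insensitive ordering. */
void demo_init_tables(void)
{
    int c;

    memset(demo_ctype, 0, sizeof demo_ctype);
    for (c = 0; c < 256; c++) {
        unsigned char flags = 0;

        demo_to_lower[c] = (unsigned char)c;
        demo_to_upper[c] = (unsigned char)c;
        if (c >= 'A' && c <= 'Z') {
            flags |= CT_U;
            demo_to_lower[c] = (unsigned char)(c + ('a' - 'A'));
        }
        if (c >= 'a' && c <= 'z') {
            flags |= CT_L;
            demo_to_upper[c] = (unsigned char)(c - ('a' - 'A'));
        }
        if (c >= '0' && c <= '9')
            flags |= CT_N | CT_X;
        if ((c >= 'A' && c <= 'F') || (c >= 'a' && c <= 'f'))
            flags |= CT_X;

        demo_ctype[c + 1] = flags;              /* note the + 1 offset */
        demo_sort_order[c] = demo_to_upper[c];
    }
    /* The worked example from Section 4.7.4: _U + _X = 0201. */
    assert(demo_ctype['A' + 1] == 0201);
}

/* Compare two strings using one sort_order weight per byte.
   Returns <0, 0, or >0 like strcmp(). */
int demo_compare(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    while (*p && *q) {
        int diff = demo_sort_order[*p] - demo_sort_order[*q];
        if (diff != 0)
            return diff;
        p++;
        q++;
    }
    return demo_sort_order[*p] - demo_sort_order[*q];
}
```

After calling demo_init_tables(), demo_compare("abc", "ABC") returns 0 because both strings map to the same weights. The sketch also shows the limit of this approach: each byte contributes exactly one weight, so a rule in which one character must sort as a sequence of several characters cannot be written as a sort_order[] table.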
Right now the best documentation for this is the character sets that are already implemented. Look at the big5, czech, gbk, sjis, and tis620 character sets for examples.

You must specify the strxfrm_multiply_MYSET=N value in the special comment at the top of the file. N should be set to the maximum ratio by which strings may grow during my_strxfrm_MYSET (it must be a positive integer).

4.7.6 Multi-Byte Character Support

If you want to add support for a new character set that includes multi-byte characters, you need to use the multi-byte character functions.

Right now the best documentation on this consists of the character sets that are already implemented. Look at the euc_kr, gb2312, gbk, sjis, and ujis character sets for examples. These are implemented in the ctype-charset.c files in the strings directory.

You must specify the mbmaxlen_MYSET=N value in the special comment at the top of the source file. N should be set to the size in bytes of the largest character in the set.

4.7.7 Problems with Character Sets

If you try to use a character set that is not compiled into your binary, you might run into the following problems:
For MyISAM tables, you can check the character set name and number for a table with myisamchk -dvv tbl_name.