5.4. String Data Types


The following table lists the string data types provided in MySQL.

Type

Description

CHAR

Fixed-length non-binary string

VARCHAR

Variable-length non-binary string

TEXT

Variable-length non-binary string

BINARY

Fixed-length binary string

VARBINARY

Variable-length binary string

BLOB

Variable-length binary string

ENUM

Enumeration consisting of a fixed set of legal values

SET

Set consisting of a fixed set of legal values


When you choose a string data type, consider the following factors:

  • Whether you need to store non-binary or binary strings; are character set and collation important?

  • The maximum length of values you need to store.

  • Whether to use a fixed or variable amount of storage.

  • How trailing spaces are handled for comparison, storage, and retrieval.

  • The number of distinct values required; ENUM or SET may be useful if the set of values is fixed.

The following discussion first describes the general differences between non-binary and binary strings, and then the specific characteristics of each of the string data types.

5.4.1. Character Set Support

Strings in MySQL may be treated as non-binary or binary. The differences between these types of strings make them suited to different purposes. The most general difference is that non-binary strings have a character set and consist of characters in that character set, whereas binary strings consist simply of bytes that are distinguished only by their numeric values. This section explores the implications of this difference.

Non-binary strings have the following characteristics:

  • A non-binary string is a sequence of characters that belong to a specific character set. Characters may consist of a single byte, or multiple bytes if the character set allows it. For example, MySQL's default character set is latin1 (also known as ISO-8859-1). The latin1 character set uses one byte per character. In contrast, sjis (the Japanese SJIS character set), contains so many characters that they cannot all be represented in a single byte, so each character requires multiple bytes to store.

  • Multi-byte character sets may require a fixed or variable number of bytes per character. The ucs2 Unicode character set uses two bytes per character, whereas the utf8 Unicode character set uses from one to three bytes per character.

  • Non-binary string comparisons are based on the collation (sorting order) of the character set associated with the string. A given character set may have one or more collations, but a given string has only one of those collations.

  • Multi-byte character comparisons are performed in character units, not in byte units.

  • The collation determines whether uppercase and lowercase versions of a given character are equivalent. If the collation is not case sensitive, strings such as 'ABC', 'Abc', and 'abc' are all considered equal. If the collation is case sensitive, the strings are all considered different.

  • The collation also determines whether to treat instances of a given character with different accent marks as equivalent. The result is that comparisons of non-binary strings may not be accent sensitive. For example, an 'a' with no accent may be considered the same as the 'á' and 'à' characters. A given collation may be case or accent sensitive, or both.

  • A collation can be a binary collation. In this case, comparisons are based on numeric character values. One effect of this is that for character sets with uppercase and lowercase characters or accented characters, the collation is case sensitive and accent sensitive because each of these characters has a different numeric value. Comparison based on a binary collation differs from comparison of binary strings: A binary collation is performed per character, and characters might consist of multiple bytes. Comparisons for binary strings are always byte-based.

A given character set may have several collations to choose from. This enables you to select different sort orders for the same character set. For example, with the latin1 character set, you can choose from any of the following collations, many of which correspond to the sorting order rules of specific languages:

 mysql> SHOW COLLATION LIKE 'latin1%'; +-------------------+---------+----+---------+----------+---------+ | Collation         | Charset | Id | Default | Compiled | Sortlen | +-------------------+---------+----+---------+----------+---------+ | latin1_german1_ci | latin1  |  5 |         |          |       0 | | latin1_swedish_ci | latin1  |  8 | Yes     | Yes      |       1 | | latin1_danish_ci  | latin1  | 15 |         |          |       0 | | latin1_german2_ci | latin1  | 31 |         | Yes      |       2 | | latin1_bin        | latin1  | 47 |         | Yes      |       1 | | latin1_general_ci | latin1  | 48 |         |          |       0 | | latin1_general_cs | latin1  | 49 |         |          |       0 | | latin1_spanish_ci | latin1  | 94 |         |          |       0 | +-------------------+---------+----+---------+----------+---------+ 

Each collation name ends with _ci, _cs, or _bin, signifying that the collation is case insensitive, case sensitive, or binary.

Binary strings have the following characteristics:

  • A binary string is treated as a sequence of byte values. It might appear to contain characters, because you can write a binary string value as a quoted string, but it "really" contains binary data as far as MySQL is concerned.

  • Because binary strings contain bytes, not characters, comparisons of binary strings are performed on the basis of the byte values in the string. This has the implication that the concept of lettercase does not apply the same way as for non-binary strings. Binary strings may appear to be case sensitive, but that is because uppercase and lowercase versions of a given character have different numeric byte values. A binary string also may appear to be accent sensitive, but that is because versions of a character with different accents have different byte values.

    The following example shows the difference in how non-binary and binary strings are treated with respect to lettercase. The non-binary string is converted to uppercase by UPPER() because it contains characters for which lettercase applies. The binary string remains unchanged because it consists of byte values that have no lettercase.

     mysql> SELECT UPPER('AaBb'), UPPER(BINARY 'AaBb'); +---------------+----------------------+ | UPPER('AaBb') | UPPER(BINARY 'AaBb') | +---------------+----------------------+ | AABB          | AaBb                 | +---------------+----------------------+ 

  • A multi-byte character, if stored in a binary string, is treated simply as multiple individual bytes. Character boundaries of the original data no longer apply.

String comparison rules are addressed in more detail in Section 10.3.1, "Case Sensitivity in String Comparisons."

The different treatment of non-binary and binary strings in MySQL is important when it comes to choosing data types for table columns. You normally base the decision on whether you want to treat column values as containing characters or raw bytes. Thus, non-binary columns are more suitable for character strings such as textual descriptions, and binary columns are more suitable for raw data such as images or compressed data.

Three data types store non-binary strings: CHAR, VARCHAR, and TEXT. THRee data types store binary strings: BINARY, VARBINARY, and BLOB. They're each described further in the following sections.

You can mix non-binary and binary string columns within a single table. Also, for non-binary string columns, different columns can use different character sets and collations. For example, assume that you want to create a table named auth_info, to store login name and password authorization information for users of an application, as well as a picture to associate with each user. You want login names to match in any lettercase, passwords to be case sensitive, and the picture column must store binary image data. The following table definition satisfies these requirements:

 CREATE TABLE auth_info (     login    CHAR(32) CHARACTER SET latin1,     password CHAR(32) CHARACTER SET latin1 COLLATE latin1_general_cs,     picture  MEDIUMBLOB ); 

5.4.2. Non-Binary String Data Types: CHAR, VARCHAR, TEXT

The CHAR, VARCHAR, and TEXT data types store non-binary strings (that is, strings of characters that have a character set and collation). The types differ in terms of their maximum allowable length and in how trailing spaces are handled.

The CHAR data type is a fixed-length type. To define a CHAR column, provide the column name, the keyword CHAR, and the maximum length of acceptable values in parentheses. The length should be a number from 0 to 255.

The CHAR data type holds strings up to the length specified in the column definition. Values in a CHAR column always take the same amount of storage. For example, a column defined as CHAR(30) requires 30 characters for each value, even empty values. Values shorter than the designated length are padded with spaces to that length when they are stored. Trailing spaces are removed from CHAR values when they are retrieved, so retrieved values might not be the same length as when stored.

VARCHAR is a variable-length data type. VARCHAR columns are defined similarly to CHAR columns, but the maximum length can be a number up to 65,535. (The actual allowable maximum length is a few characters less due to internal restrictions imposed by storage engines.) A string stored into a VARCHAR column takes only the number of characters required to store it, plus one or two bytes to record the string's length. (One byte for columns declared with a length less than 256, two bytes otherwise.)

Values in a VARCHAR column are stored as given. Trailing spaces are not removed or added for storage or retrieval.

The TEXT data type comes in four different sizes, differing in the maximum length of values they can store. All are variable-length types, so an individual value requires storage equal to the length (in characters) of the value, plus 1 to 4 bytes to record the length of the value. Trailing spaces are not removed or added for storage or retrieval.

The following table summarizes the non-binary string data types. For the storage requirement values, M represents the maximum length of a column. L represents the actual length of a given value, which may be 0 to M.

Type

Storage Required

Maximum Length

CHAR(M)

M characters

255 characters

VARCHAR(M)

L characters plus 1 or 2 bytes

65,535 characters (subject to limitations)

TINYTEXT

L characters + 1 byte

255 characters

TEXT

L characters + 2 bytes

65,535 characters

MEDIUMTEXT

L characters + 3 bytes

16,777,215 characters

LONGTEXT

L characters + 4 bytes

4,294,967,295 characters


For fixed-length (CHAR) columns, MySQL must allocate enough space to store any value containing up to as many characters allowed by the column declaration. For CHAR(10), 10 bytes are required if the column has a single-byte character set. If the column has a multi-byte character set, MySQL must allocate 10 times the width of the widest allowed character. For utf8, each character takes from one to three bytes, so MySQL must allocate three bytes per character, or 30 bytes per column value. This amount of storage is required even for storing an empty string.

For variable-length (VARCHAR, TEXT) columns, MySQL allocates only the required amount of space for each stored value. A 10-character utf8 VARCHAR column requires 10 bytes (plus a length byte) for a value that contains only single-byte characters, but 30 bytes (plus a length byte) if it contains only triple-byte characters.

Non-binary strings have a character set and collation, and non-binary string columns by default are assigned the character set and collation of the table that contains them. The CHARACTER SET and COLLATE attributes can be used to designate specific values for a column, as described in Section 5.6, "Column Attributes."

5.4.3. Binary String Data Types: BINARY, VARBINARY, BLOB

The BINARY, VARBINARY, and BLOB data types are the binary string equivalents of the non-binary CHAR, VARCHAR, and TEXT data types. That is, they store strings that consist of bytes rather than characters, and they have no character set or collation. Like the corresponding non-binary types, binary string types differ in terms of their maximum allowable length and in how trailing spaces are handled.

BINARY is a fixed-length data type. The length should be a number from 0 to 255. Values shorter than the designated length are padded with spaces to that length when they are stored. Trailing spaces are removed from BINARY values when they are retrieved, so retrieved values might not be the same length as when stored. For this reason, BINARY may not be suited for applications that store binary data if stored values can have trailing spaces. For example, if an encrypted value happens to end with spaces, the retrieved value will be different from the value that was stored.

VARBINARY is a variable-length data type. The maximum length can be a number up to 65,535. (The actual allowable maximum length is a few bytes less due to internal restrictions imposed by storage engines.) Values in a VARBINARY column are stored as given. Trailing spaces are not removed or added for storage or retrieval.

The BLOB data type comes in four different sizes, differing in the maximum length of values they can store. All are variable-length types, so an individual value requires storage equal to the length (in bytes) of the value, plus 1 to 4 bytes to record the length of the value.

The following table summarizes the binary string data types. For the storage requirement values, M represents the maximum length of a column. L represents the actual length of a given value, which may be 0 to M.

Type

Storage Required

Maximum Length

BINARY(M)

M bytes

255 bytes

VARBINARY(M)

L bytes plus 1 or 2 bytes

65,535 bytes (subject to limitations)

TINYBLOB

L + 1 bytes

255 bytes

BLOB

L + 2 bytes

65,535 bytes

MEDIUMBLOB

L + 3 bytes

16,777,215 bytes

LONGBLOB

L + 4 bytes

4,294,967,295 bytes


5.4.4. The ENUM and SET Data Types

The ENUM and SET string data types are used when the values to be stored in a column are chosen from a fixed set of values. You define columns for both types in terms of string values, but MySQL represents them internally as integers. This leads to very efficient storage, but can have some results that are unintuitive unless you keep this string/integer duality in mind.

ENUM is an enumeration type. An ENUM column definition includes a list of allowable values; each value in the list is called a "member" of the list. Every value stored in the column must equal one of the values in the list. A simple (and very common) use for ENUM is to create a two-element list for columns that store yes/no or true/false choices. The following table shows how to declare such columns:

 CREATE TABLE booleans (     yesno     ENUM('Y','N'),     truefalse ENUM('T','F') ); 

Enumeration values aren't limited to being single letters or uppercase. The columns could also be defined like this:

 CREATE TABLE booleans (     yesno     ENUM('yes','no'),     truefalse ENUM('true','false') ); 

An ENUM column definition may list up to 65,535 members. Enumerations with up to 255 members require one byte of storage per value. Enumerations with 256 to 65,535 members require two bytes per value. The following table contains an enumeration column continent that lists continent names as valid enumeration members:

 CREATE TABLE Countries (     name char(30),     continent ENUM ('Asia','Europe','North America','Africa',                     'Oceania','Antarctica','South America') ); 

The values in an ENUM column definition are given as a comma-separated list of quoted strings. Internally, MySQL stores the strings as integers, using the values 1 through n for a column with n enumeration members. The following statement assigns the enumeration value 'Africa' to the continent column; MySQL actually stores the value 4 because 'Africa' is the fourth continent name listed in the enumeration definition:

 INSERT INTO Countries (name,continent) VALUES('Kenya','Africa'); 

MySQL reserves the internal value 0 as an implicit member of all ENUM columns. It's used to represent illegal values assigned to an enumeration column. For example, if you assign 'USA' to the continent column, MySQL will store the value 0, rather than any of the values 1 through 7, because 'USA' is not a valid enumeration member. If you select the column later, MySQL displays 0 values as the empty string ''. (In strict SQL mode, an error occurs if you try to store an illegal ENUM value.)

The SET data type, like ENUM, is declared using a comma-separated list of quoted strings that define its valid members. But unlike ENUM, a given SET column may be assigned a value consisting of any combination of those members. The following definition contains a list of symptoms exhibited by allergy sufferers:

 CREATE TABLE allergy (     symptom SET('sneezing','runny nose','stuffy head','red eyes') ); 

A patient may have any or all (or none) of these symptoms, and symptom values therefore might contain zero to four individual SET members, separated by commas. The following statements set the symptom column to the empty string (no SET members), a single SET member, and multiple SET members, respectively:

 INSERT INTO allergy (symptom) VALUES(''); INSERT INTO allergy (symptom) VALUES('stuffy head'); INSERT INTO allergy (symptom) VALUES('sneezing,red eyes'); 

MySQL represents SET columns as a bitmap using one bit per member, so the elements in the symptom definition have internal values of 1, 2, 4, and 8 (that is, they have the values of bits 0 through 3 in a byte). Internally, MySQL stores the values shown in the preceding INSERT statements as 0 (no bits set), 4 (bit 2 set), and 9 (bits 0 and 3 set; that is, 1 plus 8).

A SET definition may contain up to 64 members. The internal storage required for set values varies depending on the number of SET elements (1, 2, 3, 4, or 8 bytes for sets of up to 8, 16, 24, 32, or 64 members).

If you try to store an invalid list member into a SET column, it's ignored because it does not correspond to any bit in the column definition. For example, setting a symptom value to 'coughing,sneezing,wheezing' results in an internal value of 1 ('sneezing'). The 'coughing' and 'wheezing' elements are ignored because they aren't listed in the column definition as legal set members. (In strict SQL mode, an error occurs if you try to store an illegal SET value.)

As mentioned earlier in this section, the conversion between string and numeric representations of ENUM and SET values can have unintuitive results. For example, although you would normally refer to an enumeration column using the string forms of its values, you can also use the internal numeric values. The effect of this can be very subtle if the string values look like numbers. Suppose that you define a table t like this:

 CREATE TABLE t (age INT, siblings ENUM('0','1','2','3','>3')); 

In this case, the enumeration values are the strings '0', '1', '2', '3', and '>3', and the matching internal numeric values are 1, 2, 3, 4, and 5, respectively. Now suppose that you issue the following statement:

 INSERT INTO t (age,siblings) VALUES(14,'3'); 

The siblings value is specified here as the string '3', and that is the value assigned to the column in the new record. However, you can also specify the siblings value as a number, as follows:

 INSERT INTO t (age,siblings) VALUES(14,3); 

But in this case, 3 is interpreted as the internal value, which corresponds to the enumeration value '2'! The same principle applies to retrievals. Consider the following two statements:

 SELECT * FROM t WHERE siblings = '3'; SELECT * FROM t WHERE siblings = 3; 

In the first case, you get records that have an enumeration value of '3'. In the second case, you get records where the internal value is 3; that is, records with an enumeration value of '2'.



MySQL 5 Certification Study Guide
MySQL 5.0 Certification Study Guide
ISBN: 0672328127
EAN: 2147483647
Year: 2006
Pages: 312

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net