MySQL Cookbook
Authors: DuBois P.
Published year: 2004
Pages: 75-77/375
Buy this book on amazon.com >>

Chapter 5. Working with Strings

Section 5.0.  Introduction

Recipe 5.1.  String Properties

Recipe 5.2.  Choosing a String Data Type

Recipe 5.3.  Setting the Client Connection Character Set Properly

Recipe 5.4.  Writing String Literals

Recipe 5.5.  Checking a String's Character Set or Collation

Recipe 5.6.  Changing a String's Character Set or Collation

Recipe 5.7.  Converting the Lettercase of a String

Recipe 5.8.  Converting the Lettercase of a Stubborn String

Recipe 5.9.  Controlling Case Sensitivity in String Comparisons

Recipe 5.10.  Pattern Matching with SQL Patterns

Recipe 5.11.  Pattern Matching with Regular Expressions

Recipe 5.12.  Controlling Case Sensitivity in Pattern Matching

Recipe 5.13.  Breaking Apart or Combining Strings

Recipe 5.14.  Searching for Substrings

Recipe 5.15.  Using FULLTEXT Searches

Recipe 5.16.  Using a FULLTEXT Search with Short Words

Recipe 5.17.  Requiring or Excluding FULLTEXT Search Words

Recipe 5.18.  Performing Phrase Searches with a FULLTEXT Index



5.0. Introduction

Like most types of data, string values can be compared for equality or inequality or relative ordering. However, strings have some additional features to consider:

  • A string can be binary or nonbinary. Binary strings are used for raw data such as images, music files, or encrypted values. Nonbinary strings are used for character data such as text and are associated with a character set and collation (sorting order).

  • A character set determines which characters are legal in a string. Collations can be chosen according to whether you need comparisons to be case-sensitive or case-insensitive, or to use the rules of a particular language.

  • Data types for binary strings are BINARY , VARBINARY , and BLOB . Data types for nonbinary strings are CHAR , VARCHAR , and TEXT , each of which allows CHARACTER SET and COLLATE attributes. See Section 5.2 for information about choosing data types for string columns .

  • You can convert a binary string to a nonbinary string and vice versa, or convert a nonbinary string from one character set or collation to another.

  • You can use a string in its entirety or extract substrings from it. Strings can be combined with other strings.

  • You can apply pattern-matching operations to strings.

  • FULLTEXT searching is available for efficient queries on large collections of text.

This chapter discusses how to use all those features, so that you can store, retrieve, and manipulate strings according to whatever requirements your applications have.

Scripts to create the tables used in this chapter can be found in the tables directory of the recipes distribution.



Recipe 5.1. String Properties

One property of a string is whether it is binary or nonbinary:

  • A binary string is a sequence of bytes. It can contain any type of information, such as images, MP3 files, or compressed or encrypted data. A binary string is not associated with a character set, even if you store a value such as abc that looks like ordinary text. Binary strings are compared byte by byte using numeric byte values.

  • A nonbinary string is a sequence of characters. It stores text that has a particular character set and collation. The character set defines which characters can be stored in the string. The collation defines the comparison and sorting properties of the characters .

A characteristic of nonbinary strings is that they have a character set. To see which character sets are available, use this statement:

mysql>

SHOW CHARACTER SET;

+----------+-----------------------------+---------------------+--------+
 Charset   Description                  Default collation    Maxlen 
+----------+-----------------------------+---------------------+--------+
 big5      Big5 Traditional Chinese     big5_chinese_ci           2 
 dec8      DEC West European            dec8_swedish_ci           1 
 cp850     DOS West European            cp850_general_ci          1 
 hp8       HP West European             hp8_english_ci            1 
 koi8r     KOI8-R Relcom Russian        koi8r_general_ci          1 
 latin1    cp1252 West European         latin1_swedish_ci         1 
 latin2    ISO 8859-2 Central European  latin2_general_ci         1 
...
 utf8      UTF-8 Unicode                utf8_general_ci           3 
 ucs2      UCS-2 Unicode                ucs2_general_ci           2 
...

The default character set in MySQL is latin1 . If you need to store characters from several languages in a single column, consider using one of the Unicode character sets ( utf8 or ucs2 ) because they can represent characters from multiple languages.

Some character sets contain only single-byte characters, whereas others allow multibyte characters. For some multibyte character sets, all characters have a fixed length. Others contain characters of varying lengths. For example, Unicode data can be stored using the ucs2 character set in which all characters take two bytes or the utf8 character set in which characters take from one to three bytes.

You can determine whether a given string contains multibyte characters using the LENGTH⁠(⁠ ⁠ ⁠) and CHAR_LENGTH⁠(⁠ ⁠ ⁠) functions, which return the length of a string in bytes and characters, respectively. If LENGTH⁠(⁠ ⁠ ⁠) is greater than CHAR_LENGTH⁠(⁠ ⁠ ⁠) for a given string, multibyte characters are present.

  • For the ucs2 Unicode character set, all characters are encoded using two bytes, even if they might be single-byte characters in another character set such as latin1 . Thus, every ucs2 string contains multibyte characters:

    mysql>
    
    SET @s = CONVERT('abc' USING ucs2);
    
    mysql>
    
    SELECT LENGTH(@s), CHAR_LENGTH(@s);
    
    +------------+-----------------+
     LENGTH(@s)  CHAR_LENGTH(@s) 
    +------------+-----------------+
              6                3 
    +------------+-----------------+
    

  • The utf8 Unicode character set has multibyte characters, but a given utf8 string might contain only single-byte characters, as in the following example:

    mysql>
    
    SET @s = CONVERT('abc' USING utf8);
    
    mysql>
    
    SELECT LENGTH(@s), CHAR_LENGTH(@s);
    
    +------------+-----------------+
     LENGTH(@s)  CHAR_LENGTH(@s) 
    +------------+-----------------+
              3                3 
    +------------+-----------------+
    

Another characteristic of nonbinary strings is collation , which determines the sort order of characters in the character set. Use SHOW COLLATION to see which collations are available; add a LIKE clause to see the collations for a particular character set:

mysql>

SHOW COLLATION LIKE 'latin1%';

+-------------------+---------+----+---------+----------+---------+
 Collation          Charset  Id  Default  Compiled  Sortlen 
+-------------------+---------+----+---------+----------+---------+
 latin1_german1_ci  latin1    5           Yes             1 
 latin1_swedish_ci  latin1    8  Yes      Yes             1 
 latin1_danish_ci   latin1   15           Yes             1 
 latin1_german2_ci  latin1   31           Yes             2 
 latin1_bin         latin1   47           Yes             1 
 latin1_general_ci  latin1   48           Yes             1 
 latin1_general_cs  latin1   49           Yes             1 
 latin1_spanish_ci  latin1   94           Yes             1 
+-------------------+---------+----+---------+----------+---------+

In contexts where no collation is indicated, the collation with Yes in the Default column is the default collation used for strings in the given character set. As shown, the default collation for latin1 is latin1_swedish_ci . (Default collations are also displayed by SHOW CHARACTER SET .)

A collation can be case-sensitive ( a and A are different), case-insensitive ( a and A are the same), or binary (two characters are the same or different based on whether their numeric values are equal). A collation name ending in ci , cs , or bin is case-insensitive, case-sensitive, or binary, respectively.

A binary collation provides a sort order for nonbinary strings that is something like the order for binary strings, in the sense that comparisons for binary strings and binary collations both use numeric values. However, there is a difference: binary string comparisons are always based on single-byte units, whereas a binary collation compares nonbinary strings using character numeric values; depending on the character set, some of these might be multibyte values.

The following example illustrates how collation affects sort order. Suppose that a table contains a latin1 string column and has the following rows:

mysql>

CREATE TABLE t (c CHAR(3) CHARACTER SET latin1);

mysql>

INSERT INTO t (c) VALUES('AAA'),('bbb'),('aaa'),('BBB');

mysql>

SELECT c FROM t;

+------+
 c    
+------+
 AAA  
 bbb  
 aaa  
 BBB  
+------+

By applying the COLLATE operator to the column, you can choose which collation to use for sorting and thus affect the order of the result:

  • A case-insensitive collation sorts a and A together, placing them before b and B . However, for a given letter, it does not necessarily order one lettercase before another, as shown by the following result:

    mysql>
    
    SELECT c FROM t ORDER BY c COLLATE latin1_swedish_ci;
    
    +------+
     c    
    +------+
     AAA  
     aaa  
     bbb  
     BBB  
    +------+
    

  • A case-sensitive collation puts A and a before B and b , and sorts uppercase before lowercase:

    mysql>
    
    SELECT c FROM t ORDER BY c COLLATE latin1_general_cs;
    
    +------+
     c    
    +------+
     AAA  
     aaa  
     BBB  
     bbb  
    +------+
    

  • A binary collation sorts characters using their numeric values. Assuming that uppercase letters have numeric values less than those of lowercase letters , a binary collation results in the following ordering:

    mysql>
    
    SELECT c FROM t ORDER BY c COLLATE latin1_bin;
    
    +------+
     c    
    +------+
     AAA  
     BBB  
     aaa  
     bbb  
    +------+
    

    Note that, because characters in different lettercases have different numeric values, a binary collation produces a case-sensitive ordering. However, the order is different than that for the case-sensitive collation.

You can choose a language-specific collation if you require that comparison and sorting operations use the sorting rules of a particular language. For example, if you store strings using the utf8 character set, the default collation ( utf8_general_ci ) treats ch and ll as two-character strings. If you need the traditional Spanish ordering that treats ch and ll as single characters that follow c and l , respectively, use the utf8_spanish2_ci collation. The two collations produce different results, as shown here:

mysql>

CREATE TABLE t (c CHAR(2) CHARACTER SET utf8);

mysql>

INSERT INTO t (c) VALUES('cg'),('ch'),('ci'),('lk'),('ll'),('lm');

mysql>

SELECT c FROM t ORDER BY c COLLATE utf8_general_ci;

+------+
 c    
+------+
 cg   
 ch   
 ci   
 lk   
 ll   
 lm   
+------+
mysql>

SELECT c FROM t ORDER BY c COLLATE utf8_spanish2_ci;

+------+
 c    
+------+
 cg   
 ci   
 ch   
 lk   
 lm   
 ll   
+------+


MySQL Cookbook
Authors: DuBois P.
Published year: 2004
Pages: 75-77/375
Buy this book on amazon.com >>

Similar books on Amazon