Choosing Column Types

The "MySQL Column Types" section earlier in this chapter describes the various MySQL column types from which you can choose and the general properties of those types, such as the kind of values they can contain, how much storage space they take, and so on. But how do you actually decide which types to use when you create a table? This section discusses issues to consider that will help you choose.

The most "generic" column types are the string types. You can store anything in them because numbers and dates can be represented in string form. So why not just declare all your columns as strings and be done with it? Let's consider a simple example. Suppose you have values that look like numbers. You can represent these as strings, but should you? What happens if you do?

For one thing, you'll probably use more space because numbers can be stored more efficiently using numeric columns than string columns. You'll also notice some differences in query results due to the different ways that numbers and strings are handled. For example, the sort order for numbers is not the same as for strings. The number 2 is less than the number 11, but the string '2' is lexically greater than the string '11'. You can work around this by using the column in a numeric context as follows:

 SELECT col_name + 0 as num ... ORDER BY num;

Adding zero to the column forces a numeric sort, but is that a reasonable thing to do? It's a useful technique sometimes, but you don't want to have to use it every time you want a numeric sort. Causing MySQL to treat the column as a number rather than a string has a couple of significant implications. It forces a string-to-number conversion for each column value, which is inefficient. Also, using the column in a calculation prevents MySQL from using any index on the column, which slows down the query further. Neither of these performance degradations occur if you store the values as numbers in the first place. The simple choice of using one representation rather than another has implications for storage requirements, query handling, and processing performance.

The preceding example illustrates that several issues come into play when you choose column types. The following list gives a quick rundown of factors to think about when picking a type for a column.

What kind of values will the column hold? Numbers? Strings? Dates? This is an obvious question, but you must ask it. You can represent any type of value as a string, but as we've just seen, it's likely that you'll get better performance if you use other more appropriate types for numeric values. (This is also true for date and time values.) However, assessing the type of values you're working with isn't necessarily trivial, particularly for other people's data. It's especially important to ask what kind of values the column will hold if you're setting up a table for someone else, and you must be sure to ask enough questions to get sufficient information for making a good decision.
Do your values lie within some particular range? If they are integers, will they always be non-negative? If so, you can use UNSIGNED. If they are strings, will they always be chosen from among a fixed set of values? If so, you may find ENUM or SET a useful type.

There is a tradeoff between the range of a type and the amount of storage it uses. How "big" a type do you need? For numbers, you can choose small types with a limited range of values, or large types that are essentially unlimited. For strings, you can make them short or long, so you wouldn't choose CHAR(255) if all the values you want to store contain fewer than 10 characters.
What are the performance and efficiency issues? Some types can be processed more efficiently than others. Numeric operations generally can be performed more quickly than string operations. Short strings can be compared more quickly than long strings and also involve less disk overhead. For ISAM and MyISAM tables, performance is better for fixed-length types than for variable-length types.
How do you want your values to be compared? For strings, comparisons can be case sensitive or not case sensitive. You choices here also affect sorting and grouping operations, which are based on comparisons.
Do you plan to index a column? If you do, it affects your choice of table type and column type because indexing properties are not the same for all table handlers. For example, with ISAM tables, you cannot index BLOB and TEXT columns, and indexed columns must be defined as NOT NULL (which affects your ability to use NULL values).

Now let's consider each of these issues in more detail. But before we do, allow me to point something out. You want to make the best column type choices you can when you create a table, but if you make a choice that turns out to be non-optimal, it's not the end of the world. You can use ALTER TABLE to change the type to a better one. This can be as simple as changing a SMALLINT to MEDIUMINT after finding out your data contain values larger than you originally thought. Or it can be more complex, such as changing a CHAR to an ENUM with a specific set of allowed values. In MySQL 3.23 and later, you can use PROCEDURE ANALYSE() to obtain information about your table's columns, such as the minimum and maximum values as well as a suggested optimal type to cover the range of values in a column:

 SELECT * FROM tbl_name PROCEDURE ANALYSE();

The output from this query can help you determine that a smaller type can be used, which can improve the performance of queries that involve the table and reduce the amount of space required for table storage.

What Kind of Values Will the Column Hold?

The first thing you think of when you're trying to decide on a column type is the kind of values the column will be used for, because this has the most evident implications for the type you choose. In general, you do the obvious thing you store numbers in numeric columns, strings in string columns, and dates and times in date and time columns. If your numbers have a fractional part, you use a floating-point column type rather than an integer type, and so on. But sometimes there are exceptions. The principle here is that you need to understand the nature of your data to be able to choose the type in an informed manner. If you're going to store your own data, you probably have a good idea of how to characterize it. On the other hand, if others ask you to set up a table for them, it's sometimes a different story. It may not be so easy to know just what you're working with. Be sure to ask enough questions to find out what kind of values the table really should contain.

Suppose you're told that a table needs a column to record "amount of precipitation." Is that a number? Or is it "mostly" numeric that is, typically but not always coded as a number? For example, when you watch the news on tele vision, the weather report generally includes a measure of precipitation. Sometimes this is a number (as in "0.25 inches of rain"), but sometimes it's a "trace" of precipitation, meaning "not much at all." That's fine for the weather report, but what does it mean for storage in a database? You either need to quantify "trace" as a number so that you can use a numeric column type to record precipitation amounts, or you need to use a string so that you can record the word "trace." Or you could come up with some more complicated arrangement, using a number column and a string column where you fill in one column and leave the other one NULL. It should be obvious that you want to avoid that option, if possible; it makes the table harder to understand and it makes query-writing much more difficult.

I would probably try to store all rows in numeric form, and then convert them as necessary for display purposes. For example, if any non-zero amount of precipitation less than .01 inches is considered a trace amount, you could display values from the column as follows:

 SELECT IF(precip>0 AND precip<.01,'trace',precip) FROM ... ;

Some values are obviously numeric, but you must determine whether to use a floating-point or integer type. You should ask what your units are and what accuracy you require. Is whole-unit accuracy sufficient, or do you need to represent fractional units? This may help you distinguish between integer and floating-point column types. For example, if you're representing weights, you can use an integer column if you record values to the nearest pound. You'd use a floating-point column if you want to record fractional units. In some cases, you might even use multiple fields for example, if you want to record weight in terms of pounds and ounces.

Height is a numeric type of information for which there are several representational possibilities:

As a string such as '6-2' for a value like "6 feet, 2 inches". This has the advantage of having a form that's easy to look at and understand (certainly more so than "74 inches"), but it's difficult to use this kind of value for mathematical operations such as summation or averaging.
As one numeric field for feet and another for inches. This would be a little easier to work with for numerical operations, but two fields are more difficult to use than one.
As one numeric field representing inches. This is easiest for a database to work with, and least meaningful for humans. But remember that you don't have to present values in the same format that you use to work with them. You can reformat values for meaningful display using MySQL's many functions. That means this might be the best way to represent height.

Another type of numeric information is money. For monetary calculations, you're working with values that have dollars and cents parts. These look like floating-point values, but FLOAT and DOUBLE are subject to rounding error and may not be suitable except for records in which you need only approximate accuracy. Because people tend to be touchy about their money, it's more likely you need a type that affords perfect accuracy. You have a couple of choices:

You can represent money as a DECIMAL(M,2) type, choosing M as the maximum width appropriate for the range of values you need. This gives you floating point values with two decimal places of accuracy. The advantage of DECIMAL is that values are represented as strings and are not subject to roundoff error. The disadvantage is that string operations are less efficient than operations on values represented internally as numbers.
You can represent all monetary values internally as cents using an integer type. The advantage is that calculations are done internally using integers, which is very fast. The disadvantage is that you will need to convert values on input or output by multiplying or dividing by 100.

If you need to store date information, do the values include a time? That is, will they ever need to include a time? MySQL doesn't provide a date type that has an optional time part; DATE never has a time and DATETIME must have a time. If the time really is optional, use a DATE column to record the date and a separate TIME column to record the time. Then allow the TIME column to be NULL and interpret that as "no time:"

 CREATE TABLE mytbl  (     date DATE NOT NULL,    # date is required     time TIME NULL         # time is optional (may be NULL) );

One type of situation in which it's especially important to determine whether you need a time value occurs when you're joining two tables with a master-detail relationship that are "linked" based on date information.

Suppose you're conducting research involving subjects who come in to your office to be tested. Following a standard initial set of tests, you may run several additional tests, with the choice of tests varying according to the results of the initial tests. You might represent this information using a master-detail relationship in which the subject identification information and the standard initial tests are stored in a master record and any additional tests are stored as rows in a secondary detail table. Then you link together the two tables based on subject ID and the date on which the tests are given.

The question you must answer in this situation is whether you can use just the date or whether you need both date and time. This depends on whether or not a subject may go through the testing procedure more than once during the same day. If so, record the time (say, the time that the procedure begins) using either a DATETIME column or separate DATE and TIME columns that both must be filled in. Without the time value, you will not be able to associate a subject's detail records with the proper master records if the subject is tested twice in a day.

I've heard people claim "I don't need a time; I will never test a subject twice on the same day." Sometimes they're correct, but I have also seen some of these same people turn up later wondering how to prevent detail records from being mixed up with the wrong master record after entering data for subjects who were tested multiple times in a day. Sorry, then it's too late!

Sometimes you can deal with this problem by retrofitting a TIME column into the tables. Unfortunately, it's difficult to fix existing records unless you have some independent data source, such as the original paper records. Otherwise, you have no way to disambiguate detail records to associate them to the proper master record. Even if you have an independent source of information, this is very messy and likely to cause problems for applications you've already written to use the tables. It's best to explain the issues to the table owners and make sure you've got a good characterization of the data values before creating their tables.

Sometimes you may have incomplete data, and this will influence your choice of column types. You may be collecting birth and death dates for genealogical research, and sometimes all you can find out is the year or year and month someone was born or died not the exact date. If you use a DATE column, you can't enter a date unless you have the full date. If you want to be able to record whatever information you have, even if it's incomplete, you may have to keep separate year, month, and day fields. Then you can enter the parts of the date that you have and leave the rest NULL. Another possibility is available in MySQL 3.23 and later, which allows the day or month and day parts of DATE values to be 0. Such "fuzzy" dates can be used to represent incomplete date values.

Do Your Values Lie Within Some Particular Range?

If you've decided on the general category from which to pick a type for a column, thinking about the range of values you want to represent will help you narrow down your choices to a particular type within that category. Suppose you want to store integer values. The range of your values determines the types you can use. If you need values in the range from 0 to 1000, you can use anything from a SMALLINT up to a BIGINT. If your values range up to 2 million, you can't use SMALLINT, and your choices range from MEDIUMINT to BIGINT. Then you need to pick one type from among the possibilities.

Of course, you could simply use the largest type for the kind of value you want to store (BIGINT for the examples in the previous paragraph). Generally, however, you should use the smallest type that is large enough for your purposes. By doing so, you'll minimize the amount of storage used by your tables, and they will give you better performance because smaller columns usually can be processed more quickly than larger ones. (Reading smaller values requires less disk activity, and more key values fit into the index cache, allowing indexed searches to be performed faster.)

If you don't know the range of values you'll need to be able to represent, you either must guess or use BIGINT to accommodate the worst possible case. (If you guess and the type you choose does turn out to be too small, all is not lost; you can use ALTER TABLE later to make the column bigger.)

In Chapter 1, we created a score table for the grade-keeping project that had a score column for recording quiz and test scores. The table was created using INT to keep the discussion simpler, but you can see now that if scores are in the range from 0 to 100, a better choice would be TINYINT UNSIGNED because that would use less storage.

The range of values in your data also affects the attributes you can use with your column type. If values are never negative, you can use UNSIGNED; otherwise, you can't.

String types don't have a "range" in the same way numeric columns do, but they have a length, and the maximum length you need affects the column types you can use. If your strings are shorter than 256 characters, you can use CHAR, VARCHAR, TINYTEXT, or TINYBLOB. If you want longer strings, you can use a TEXT or BLOB type, but CHAR and VARCHAR are no longer options.

For string columns that you will use to represent a fixed set of values, you might consider using an ENUM or SET column type. These can be good choices because they are represented internally as numbers. Operations on them are performed numerically, which makes them more efficient than other string types. They can also be more compact than other string types, which saves space.

When characterizing the range of values you have to deal with, the best terms are "always" and "never" (as in "always less than 1000" or "never negative") because they allow you to constrain your column type choices more tightly. But be wary of using these terms when they're not really justified. Be especially wary if you're consulting with other people about their data and they start throwing around those two terms. When people say "always" or "never," be sure they really mean it. Sometimes people say their data always have a particular characteristic when they really mean "almost always."

For example, suppose you're designing a table for a group of investigators who tell you, "Our test scores are always 0 to 100." Based on that statement, you choose TINYINT and you make it UNSIGNED because the values are always non-negative. Then you find out that the people who code the data for entry into the database sometimes use 1 to mean "student was absent due to illness." Oops. They didn't tell you that. It may be acceptable to use NULL to represent such values, but if not, you'll have to record a 1 and then you can't use an UNSIGNED column. (This is an instance where ALTER TABLE comes to your rescue!)

Sometimes decisions about these types of cases can be made more easily by asking a simple question: Are there ever exceptions? If an exceptional case ever occurs, even just once, you must allow for it. You will find that people who talk to you about designing a database invariably think that if exceptions don't occur very often, they don't matter. When you're creating a table, you can't think that way. The question you need to ask isn't "How often do exceptions occur?" It's "Do exceptions ever occur?" If they do, you must take them into account.

What Are the Performance and Efficiency Issues?

Your choice of column type can influence query performance in several ways. If you keep the general guidelines discussed in the following sections in mind, you'll be able to choose types that will help MySQL process your tables more efficiently.

Numeric Versus String Operations

Numeric operations are generally faster than string operations. Consider comparison operations. Numbers can be compared in a single operation. String comparisons may involve several byte-by-byte or character-by-character comparisons, more so as the strings become longer.

If a string column has a limited number of values, use an ENUM or SET type to get the advantages of numeric operations. These types are represented internally as numbers and can be processed more efficiently.

Consider alternative representations for strings. Sometimes you can improve performance by representing string values as numbers. For example, to represent IP numbers in dotted-quad notation, such as 192.168.0.4, you might use a string. But as an alternative, you could convert the IP numbers to integer form by storing each part of the dotted-quad form in one byte of a four-byte INT UNSIGNED type. Storing integers would both save space and speed lookups. On the other hand, representing IP numbers as INT values might make it difficult to perform pattern matches, such as you might do if you wanted to look for numbers in a given subnet. So you cannot consider only space issues; you must decide which representation is more appropriate based on what you want to do with the values. (Whichever way you choose, the INET_ATON() and INET_NTOA() functions can help convert between the two representations.)

Smaller Types Versus Bigger Types

Smaller types can be processed more quickly than larger types. A general principle is that they take less space and involve less overhead for disk activity. For strings in particular, processing time is in direct relationship to string length.

For columns that use fixed-size types, choose the smallest type that will hold the required range of values. For example, don't use BIGINT if MEDIUMINT will do. Don't use DOUBLE if you only need FLOAT precision. For variable-size types, you may still be able to save space. A BLOB uses 2 bytes to record the length of the value, a LONGBLOB uses 4 bytes. If you're storing values that are never as long as 64KB, using BLOB saves you 2 bytes per value. (Similar considerations apply for TEXT types, of course.)

Fixed-Length Versus Variable-Length Types

Fixed-length and variable-length types have different performance implications, although the particular effects of each depends on the table type.

For MyISAM and ISAM tables, fixed-length types generally can be processed more quickly than variable-length types:

With variable-length columns, you get more fragmentation of a table on which you perform many deletes or updates due to the differing sizes of the records. You'll need to run OPTIMIZE TABLE periodically to maintain performance. This is not an issue with fixed-length rows.
Tables with fixed-length rows are easier to reconstruct if you have a table crash. The beginning of each record can be determined because they all are at positions that are multiples of the fixed record size, something that is not true with variable-length rows. This is not a performance issue with respect to query processing, but it can certainly speed up the table repair process.

If you have variable-length columns in a MyISAM or ISAM table, converting them to fixed-length columns will improve performance because fixed-length records are easier to process. Before you attempt to do this, though, consider the following:

Fixed-length columns are faster but take more space. CHAR(n) columns always take n bytes per value (even empty ones) because values are padded with trailing spaces when stored in the table. VARCHAR(n) columns take less space because only as much space is allocated as is necessary to store each value, plus one byte per value to record the length. Thus, if you are choosing between CHAR and VARCHAR columns, the tradeoff is one of time versus space. If speed is your primary concern, use CHAR columns to get the performance benefits of fixed-length columns. If space is at a premium, use VARCHAR columns. As a rule of thumb, you can assume that fixed-length rows will improve performance even though more space is used. But for an especially critical application, you may want to implement a table both ways and run some tests to determine which alternative actually is better for your particular application.
You cannot convert just one variable-length column; you must convert them all. Additionally, you must convert them all at the same time using a single ALTER TABLE statement or the attempt will have no effect.
Sometimes you cannot use a fixed-length type, even if you want to. There is no fixed-length type for strings longer than 255 bytes, for example.

For InnoDB tables, fixed-length and variable-length rows are both stored the same way (as a row header containing pointers to individual column values, plus storage for the values). This means that fixed-length rows aren't any simpler to process. Consequently, the primary performance factor is the amount of storage used for rows. The implication is that variable-length rows will usually be faster for InnoDB tables because they require less space and thus less disk I/O to process.

Indexable Types

Indexes speed up queries, so choose types you can index, at least for columns that you plan to use for comparisons in searches. See the "Do You Plan to Index a Column?" section later in this chapter for more information.

NULL Versus NOT NULL Types

If you declare a column NOT NULL, it can be handled more quickly because MySQL doesn't have to check the column's values during query processing to see whether they are NULL. It also saves one bit per row in the table. Avoiding NULL in columns may make your queries simpler (because you don't have to think about NULL as a special case), and simpler queries generally are processed more quickly.

How Do You Want Your Values to be Compared?

You can often control case sensitivity of string values for comparison and sorting purposes by the type of column you use to store them. The determining factor is whether the column contains binary strings (case sensitive) or non-binary strings (not case sensitive). Table 2.15 shows each binary string type and the corresponding non-binary type. Some types (CHAR, VARCHAR) are binary or not binary according to the presence or absence of the keyword BINARY in the column declaration. The "binary-ness" of other types (BLOB, TEXT) is implicit in the type name.

Table 2.15. Binary and Non-Binary String Types
Binary Type	Non-Binary Type
`CHAR(M) BINARY`	`CHAR(M)`
`VARCHAR(M) BINARY`	`VARCHAR(M)`
`TINYBLOB`	`TINYTEXT`
`BLOB`	`TEXT`
`MEDIUMBLOB`	`MEDIUMTEXT`
`LONGBLOB`	`LONGTEXT`

If you want to use a column for both case-sensitive and not case-sensitive comparisons, use a non-binary type. Then, whenever you want a case-sensitive comparison, use the BINARY keyword to force a string to be treated as a binary string value. For example, if mycol is a CHAR column, you can compare it different ways. The following comparison is not case sensitive:

 mycol = 'ABC'

But the following comparisons are both case sensitive (note that it doesn't matter which string the BINARY operator is applied to):

 BINARY mycol = 'ABC'  mycol = BINARY 'ABC'

If you have string values that you want to sort in some non-lexical order, consider using an ENUM column. Sorting of ENUM values occurs according to the order in which you list the enumeration values in the column declaration, so you can make the values sort in any order you want.

Do You Plan to Index a Column?

Indexes allow MySQL to process queries more efficiently. Choosing indexes is a topic covered in more detail in Chapter 4, but a general principle is that columns you commonly use in WHERE clauses to select rows are good candidates for indexing.

If you want to index a column or include it in a multiple-column index, there may be constraints on the types you can choose. For example, some table types (InnoDB and ISAM) do not allow indexes on BLOB or TEXT columns, and prior to MySQL 3.23.2, all indexed columns must be declared as NOT NULL. If you find yourself bumping up against these restrictions, you may be able to work around them:

If you want to use a BLOB or TEXT column but your table type does not allow them to be indexed, check whether your values ever exceed 255 bytes. If not, use a similar VARCHAR column type instead and index that. You can use VARCHAR(255) BINARY for BLOB values and VARCHAR(255) for TEXT values.
To work around a NOT NULL restriction, if you can designate some value as special, you might be able to treat it as though it means the same thing as NULL. For a DATE column, you might designate '0000-00-00' to mean "no date." In a string column that normally holds only non-empty values, you might designate that the empty string means "missing value." In a numeric column, you might use 1 if the column normally would hold only non-negative values. (Of course, you could not declare the column as UNSIGNED in this case.)

Inter-Relatedness of Column Type Choice Issues

You can't always consider the issues involved in choosing column types as though they are independent of one another. For example, range is related to storage size for numeric types; as you increase the range, you require more storage, which affects performance. Or consider the implications of choosing to use AUTO_INCREMENT to create a column for holding unique sequence numbers. That single choice has several consequences involving the column type, indexing, and the use of NULL:

AUTO_INCREMENT is a column attribute that should be used only with integer types. That immediately limits your choices to TINYINT through BIGINT.
AUTO_INCREMENT columns are intended only for generating sequences of positive values, so you should declare them as UNSIGNED.
AUTO_INCREMENT columns must be indexed. Furthermore, to prevent sequence numbers from being reused, the index must be unique. This means you must declare the column as a PRIMARY KEY or as a UNIQUE index.
AUTO_INCREMENT columns must be NOT NULL.

All of this means you do not just declare an AUTO_INCREMENT column like this:

 mycol arbitrary_type AUTO_INCREMENT

You declare it like this:

 mycol integer_type UNSIGNED AUTO_INCREMENT NOT NULL,  PRIMARY KEY (mycol)

Or like like this:

 mycol integer_type UNSIGNED AUTO_INCREMENT NOT NULL,  UNIQUE (mycol)