The "MySQL Column Types" section earlier in this chapter describes the various MySQL column types from which you can choose and the general properties of those types, such as the kind of values they can contain, how much storage space they take, and so on. But how do you actually decide which types to use when you create a table? This section discusses issues to consider that will help you choose. The most "generic" column types are the string types. You can store anything in them because numbers and dates can be represented in string form. So why not just declare all your columns as strings and be done with it? Let's consider a simple example. Suppose you have values that look like numbers. You can represent these as strings, but should you? What happens if you do? For one thing, you'll probably use more space because numbers can be stored more efficiently using numeric columns than string columns. You'll also notice some differences in query results due to the different ways that numbers and strings are handled. For example, the sort order for numbers is not the same as for strings. The number 2 is less than the number 11, but the string '2' is lexically greater than the string '11'. You can work around this by using the column in a numeric context as follows: SELECT col_name + 0 as num ... ORDER BY num; Adding zero to the column forces a numeric sort, but is that a reasonable thing to do? It's a useful technique sometimes, but you don't want to have to use it every time you want a numeric sort. Causing MySQL to treat the column as a number rather than a string has a couple of significant implications. It forces a string-to-number conversion for each column value, which is inefficient. Also, using the column in a calculation prevents MySQL from using any index on the column, which slows down the query further. Neither of these performance degradations occur if you store the values as numbers in the first place. The simple choice of using one representation rather than another has implications for storage requirements, query handling, and processing performance. The preceding example illustrates that several issues come into play when you choose column types. The following list gives a quick rundown of factors to think about when picking a type for a column.
Now let's consider each of these issues in more detail. But before we do, allow me to point something out. You want to make the best column type choices you can when you create a table, but if you make a choice that turns out to be non-optimal, it's not the end of the world. You can use ALTER TABLE to change the type to a better one. This can be as simple as changing a SMALLINT to MEDIUMINT after finding out your data contain values larger than you originally thought. Or it can be more complex, such as changing a CHAR to an ENUM with a specific set of allowed values. In MySQL 3.23 and later, you can use PROCEDURE ANALYSE() to obtain information about your table's columns, such as the minimum and maximum values as well as a suggested optimal type to cover the range of values in a column: SELECT * FROM tbl_name PROCEDURE ANALYSE(); The output from this query can help you determine that a smaller type can be used, which can improve the performance of queries that involve the table and reduce the amount of space required for table storage. What Kind of Values Will the Column Hold?The first thing you think of when you're trying to decide on a column type is the kind of values the column will be used for, because this has the most evident implications for the type you choose. In general, you do the obvious thing you store numbers in numeric columns, strings in string columns, and dates and times in date and time columns. If your numbers have a fractional part, you use a floating-point column type rather than an integer type, and so on. But sometimes there are exceptions. The principle here is that you need to understand the nature of your data to be able to choose the type in an informed manner. If you're going to store your own data, you probably have a good idea of how to characterize it. On the other hand, if others ask you to set up a table for them, it's sometimes a different story. It may not be so easy to know just what you're working with. Be sure to ask enough questions to find out what kind of values the table really should contain. Suppose you're told that a table needs a column to record "amount of precipitation." Is that a number? Or is it "mostly" numeric that is, typically but not always coded as a number? For example, when you watch the news on tele vision, the weather report generally includes a measure of precipitation. Sometimes this is a number (as in "0.25 inches of rain"), but sometimes it's a "trace" of precipitation, meaning "not much at all." That's fine for the weather report, but what does it mean for storage in a database? You either need to quantify "trace" as a number so that you can use a numeric column type to record precipitation amounts, or you need to use a string so that you can record the word "trace." Or you could come up with some more complicated arrangement, using a number column and a string column where you fill in one column and leave the other one NULL. It should be obvious that you want to avoid that option, if possible; it makes the table harder to understand and it makes query-writing much more difficult. I would probably try to store all rows in numeric form, and then convert them as necessary for display purposes. For example, if any non-zero amount of precipitation less than .01 inches is considered a trace amount, you could display values from the column as follows: SELECT IF(precip>0 AND precip<.01,'trace',precip) FROM ... ; Some values are obviously numeric, but you must determine whether to use a floating-point or integer type. You should ask what your units are and what accuracy you require. Is whole-unit accuracy sufficient, or do you need to represent fractional units? This may help you distinguish between integer and floating-point column types. For example, if you're representing weights, you can use an integer column if you record values to the nearest pound. You'd use a floating-point column if you want to record fractional units. In some cases, you might even use multiple fields for example, if you want to record weight in terms of pounds and ounces. Height is a numeric type of information for which there are several representational possibilities:
Another type of numeric information is money. For monetary calculations, you're working with values that have dollars and cents parts. These look like floating-point values, but FLOAT and DOUBLE are subject to rounding error and may not be suitable except for records in which you need only approximate accuracy. Because people tend to be touchy about their money, it's more likely you need a type that affords perfect accuracy. You have a couple of choices:
If you need to store date information, do the values include a time? That is, will they ever need to include a time? MySQL doesn't provide a date type that has an optional time part; DATE never has a time and DATETIME must have a time. If the time really is optional, use a DATE column to record the date and a separate TIME column to record the time. Then allow the TIME column to be NULL and interpret that as "no time:" CREATE TABLE mytbl ( date DATE NOT NULL, # date is required time TIME NULL # time is optional (may be NULL) ); One type of situation in which it's especially important to determine whether you need a time value occurs when you're joining two tables with a master-detail relationship that are "linked" based on date information. Suppose you're conducting research involving subjects who come in to your office to be tested. Following a standard initial set of tests, you may run several additional tests, with the choice of tests varying according to the results of the initial tests. You might represent this information using a master-detail relationship in which the subject identification information and the standard initial tests are stored in a master record and any additional tests are stored as rows in a secondary detail table. Then you link together the two tables based on subject ID and the date on which the tests are given. The question you must answer in this situation is whether you can use just the date or whether you need both date and time. This depends on whether or not a subject may go through the testing procedure more than once during the same day. If so, record the time (say, the time that the procedure begins) using either a DATETIME column or separate DATE and TIME columns that both must be filled in. Without the time value, you will not be able to associate a subject's detail records with the proper master records if the subject is tested twice in a day. I've heard people claim "I don't need a time; I will never test a subject twice on the same day." Sometimes they're correct, but I have also seen some of these same people turn up later wondering how to prevent detail records from being mixed up with the wrong master record after entering data for subjects who were tested multiple times in a day. Sorry, then it's too late! Sometimes you can deal with this problem by retrofitting a TIME column into the tables. Unfortunately, it's difficult to fix existing records unless you have some independent data source, such as the original paper records. Otherwise, you have no way to disambiguate detail records to associate them to the proper master record. Even if you have an independent source of information, this is very messy and likely to cause problems for applications you've already written to use the tables. It's best to explain the issues to the table owners and make sure you've got a good characterization of the data values before creating their tables. Sometimes you may have incomplete data, and this will influence your choice of column types. You may be collecting birth and death dates for genealogical research, and sometimes all you can find out is the year or year and month someone was born or died not the exact date. If you use a DATE column, you can't enter a date unless you have the full date. If you want to be able to record whatever information you have, even if it's incomplete, you may have to keep separate year, month, and day fields. Then you can enter the parts of the date that you have and leave the rest NULL. Another possibility is available in MySQL 3.23 and later, which allows the day or month and day parts of DATE values to be 0. Such "fuzzy" dates can be used to represent incomplete date values. Do Your Values Lie Within Some Particular Range?If you've decided on the general category from which to pick a type for a column, thinking about the range of values you want to represent will help you narrow down your choices to a particular type within that category. Suppose you want to store integer values. The range of your values determines the types you can use. If you need values in the range from 0 to 1000, you can use anything from a SMALLINT up to a BIGINT. If your values range up to 2 million, you can't use SMALLINT, and your choices range from MEDIUMINT to BIGINT. Then you need to pick one type from among the possibilities. Of course, you could simply use the largest type for the kind of value you want to store (BIGINT for the examples in the previous paragraph). Generally, however, you should use the smallest type that is large enough for your purposes. By doing so, you'll minimize the amount of storage used by your tables, and they will give you better performance because smaller columns usually can be processed more quickly than larger ones. (Reading smaller values requires less disk activity, and more key values fit into the index cache, allowing indexed searches to be performed faster.) If you don't know the range of values you'll need to be able to represent, you either must guess or use BIGINT to accommodate the worst possible case. (If you guess and the type you choose does turn out to be too small, all is not lost; you can use ALTER TABLE later to make the column bigger.) In Chapter 1, we created a score table for the grade-keeping project that had a score column for recording quiz and test scores. The table was created using INT to keep the discussion simpler, but you can see now that if scores are in the range from 0 to 100, a better choice would be TINYINT UNSIGNED because that would use less storage. The range of values in your data also affects the attributes you can use with your column type. If values are never negative, you can use UNSIGNED; otherwise, you can't. String types don't have a "range" in the same way numeric columns do, but they have a length, and the maximum length you need affects the column types you can use. If your strings are shorter than 256 characters, you can use CHAR, VARCHAR, TINYTEXT, or TINYBLOB. If you want longer strings, you can use a TEXT or BLOB type, but CHAR and VARCHAR are no longer options. For string columns that you will use to represent a fixed set of values, you might consider using an ENUM or SET column type. These can be good choices because they are represented internally as numbers. Operations on them are performed numerically, which makes them more efficient than other string types. They can also be more compact than other string types, which saves space. When characterizing the range of values you have to deal with, the best terms are "always" and "never" (as in "always less than 1000" or "never negative") because they allow you to constrain your column type choices more tightly. But be wary of using these terms when they're not really justified. Be especially wary if you're consulting with other people about their data and they start throwing around those two terms. When people say "always" or "never," be sure they really mean it. Sometimes people say their data always have a particular characteristic when they really mean "almost always." For example, suppose you're designing a table for a group of investigators who tell you, "Our test scores are always 0 to 100." Based on that statement, you choose TINYINT and you make it UNSIGNED because the values are always non-negative. Then you find out that the people who code the data for entry into the database sometimes use 1 to mean "student was absent due to illness." Oops. They didn't tell you that. It may be acceptable to use NULL to represent such values, but if not, you'll have to record a 1 and then you can't use an UNSIGNED column. (This is an instance where ALTER TABLE comes to your rescue!) Sometimes decisions about these types of cases can be made more easily by asking a simple question: Are there ever exceptions? If an exceptional case ever occurs, even just once, you must allow for it. You will find that people who talk to you about designing a database invariably think that if exceptions don't occur very often, they don't matter. When you're creating a table, you can't think that way. The question you need to ask isn't "How often do exceptions occur?" It's "Do exceptions ever occur?" If they do, you must take them into account. What Are the Performance and Efficiency Issues?Your choice of column type can influence query performance in several ways. If you keep the general guidelines discussed in the following sections in mind, you'll be able to choose types that will help MySQL process your tables more efficiently. Numeric Versus String OperationsNumeric operations are generally faster than string operations. Consider comparison operations. Numbers can be compared in a single operation. String comparisons may involve several byte-by-byte or character-by-character comparisons, more so as the strings become longer. If a string column has a limited number of values, use an ENUM or SET type to get the advantages of numeric operations. These types are represented internally as numbers and can be processed more efficiently. Consider alternative representations for strings. Sometimes you can improve performance by representing string values as numbers. For example, to represent IP numbers in dotted-quad notation, such as 192.168.0.4, you might use a string. But as an alternative, you could convert the IP numbers to integer form by storing each part of the dotted-quad form in one byte of a four-byte INT UNSIGNED type. Storing integers would both save space and speed lookups. On the other hand, representing IP numbers as INT values might make it difficult to perform pattern matches, such as you might do if you wanted to look for numbers in a given subnet. So you cannot consider only space issues; you must decide which representation is more appropriate based on what you want to do with the values. (Whichever way you choose, the INET_ATON() and INET_NTOA() functions can help convert between the two representations.) Smaller Types Versus Bigger TypesSmaller types can be processed more quickly than larger types. A general principle is that they take less space and involve less overhead for disk activity. For strings in particular, processing time is in direct relationship to string length. For columns that use fixed-size types, choose the smallest type that will hold the required range of values. For example, don't use BIGINT if MEDIUMINT will do. Don't use DOUBLE if you only need FLOAT precision. For variable-size types, you may still be able to save space. A BLOB uses 2 bytes to record the length of the value, a LONGBLOB uses 4 bytes. If you're storing values that are never as long as 64KB, using BLOB saves you 2 bytes per value. (Similar considerations apply for TEXT types, of course.) Fixed-Length Versus Variable-Length TypesFixed-length and variable-length types have different performance implications, although the particular effects of each depends on the table type. For MyISAM and ISAM tables, fixed-length types generally can be processed more quickly than variable-length types:
If you have variable-length columns in a MyISAM or ISAM table, converting them to fixed-length columns will improve performance because fixed-length records are easier to process. Before you attempt to do this, though, consider the following:
For InnoDB tables, fixed-length and variable-length rows are both stored the same way (as a row header containing pointers to individual column values, plus storage for the values). This means that fixed-length rows aren't any simpler to process. Consequently, the primary performance factor is the amount of storage used for rows. The implication is that variable-length rows will usually be faster for InnoDB tables because they require less space and thus less disk I/O to process. Indexable TypesIndexes speed up queries, so choose types you can index, at least for columns that you plan to use for comparisons in searches. See the "Do You Plan to Index a Column?" section later in this chapter for more information. NULL Versus NOT NULL TypesIf you declare a column NOT NULL, it can be handled more quickly because MySQL doesn't have to check the column's values during query processing to see whether they are NULL. It also saves one bit per row in the table. Avoiding NULL in columns may make your queries simpler (because you don't have to think about NULL as a special case), and simpler queries generally are processed more quickly. How Do You Want Your Values to be Compared?You can often control case sensitivity of string values for comparison and sorting purposes by the type of column you use to store them. The determining factor is whether the column contains binary strings (case sensitive) or non-binary strings (not case sensitive). Table 2.15 shows each binary string type and the corresponding non-binary type. Some types (CHAR, VARCHAR) are binary or not binary according to the presence or absence of the keyword BINARY in the column declaration. The "binary-ness" of other types (BLOB, TEXT) is implicit in the type name.
If you want to use a column for both case-sensitive and not case-sensitive comparisons, use a non-binary type. Then, whenever you want a case-sensitive comparison, use the BINARY keyword to force a string to be treated as a binary string value. For example, if mycol is a CHAR column, you can compare it different ways. The following comparison is not case sensitive: mycol = 'ABC' But the following comparisons are both case sensitive (note that it doesn't matter which string the BINARY operator is applied to): BINARY mycol = 'ABC' mycol = BINARY 'ABC' If you have string values that you want to sort in some non-lexical order, consider using an ENUM column. Sorting of ENUM values occurs according to the order in which you list the enumeration values in the column declaration, so you can make the values sort in any order you want. Do You Plan to Index a Column?Indexes allow MySQL to process queries more efficiently. Choosing indexes is a topic covered in more detail in Chapter 4, but a general principle is that columns you commonly use in WHERE clauses to select rows are good candidates for indexing. If you want to index a column or include it in a multiple-column index, there may be constraints on the types you can choose. For example, some table types (InnoDB and ISAM) do not allow indexes on BLOB or TEXT columns, and prior to MySQL 3.23.2, all indexed columns must be declared as NOT NULL. If you find yourself bumping up against these restrictions, you may be able to work around them:
Inter-Relatedness of Column Type Choice IssuesYou can't always consider the issues involved in choosing column types as though they are independent of one another. For example, range is related to storage size for numeric types; as you increase the range, you require more storage, which affects performance. Or consider the implications of choosing to use AUTO_INCREMENT to create a column for holding unique sequence numbers. That single choice has several consequences involving the column type, indexing, and the use of NULL:
All of this means you do not just declare an AUTO_INCREMENT column like this: mycol arbitrary_type AUTO_INCREMENT You declare it like this: mycol integer_type UNSIGNED AUTO_INCREMENT NOT NULL, PRIMARY KEY (mycol) Or like like this: mycol integer_type UNSIGNED AUTO_INCREMENT NOT NULL, UNIQUE (mycol) |