How Big Is the Size Factor?

"Drake's formula looks scientific, but the result can be anywhere from 0% to 100%."

Peter D. Ward and Donald Brownlee, Rare Earth: Why Complex Life Is Uncommon in the Universe, Copernicus Springer-Verlag, New York

These days, disk drives are cheap, and reducing disk space is not in itself a performance issue anymore. What matters is the answer to this question: If I make every column smaller, will every application get faster, and if so, by how much? Alas, the answer depends so much on the application that we have to hedge and hum: yes, but, if.

The most important factor in performance is the number of reads and writes the DBMS makes. The minimum input/output unit is the page. Rows of data never overlap page boundaries (except with InterBase and Oracle), so a 10% reduction in row size won't always cause a 10% reduction in the number of pages. Still, if rows are small relative to the page and the average is taken over many tables, a 10% row-size reduction should translate to roughly a 10% page reduction. Given this, if you reduce row sizes by 10%, a 1,000-page table will be reduced to 900 pages. Extrapolating, let's further assume that an index with 1,000 pages will be reduced to 910 pages (more keys fit in an index page, but the reduction is a bit less than 10% because the ROWID part of an index key never changes in size). The effect on performance is good:

  • Sequential scans will read 10% fewer pages.

  • Indexed scans will read log2(9)% = 3.17% fewer pages, or one less page 3.17% of the time on average.

  • Multiple-row fetches will read fewer pages because the probability of two rows being in the same page is greater.

  • A data-change statement (INSERT, UPDATE, or DELETE) will write the same number of pages. Fewer pages don't translate into fewer writes, because a typical OLTP transaction causes a single page write, and that fact is unaffected by the number of rows in a page.

So reducing column sizes might cause the number of I/Os to go down 10%, or 0%. But if all four of the preceding scenarios are equally probable, the number of I/Os goes down about 6% on average. [1]

[1] We haven't taken caching into account for these calculations because, although the existence of a cache reduces the number of reads and writes, the probability that a page is in cache is slightly greater if files are smaller. (Did you know that a "seek" takes seven milliseconds on MS Windows NT even if the disk head doesn't move?) We did not take no-wait writing into account for similar reasons.
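To show where that 6% figure comes from (our assumption: the multiple-row-fetch case saves roughly the same 10% as a sequential scan, and the data-change case saves nothing), average the four scenarios:

$$\frac{10\% + 3.17\% + 10\% + 0\%}{4} \approx 5.8\%, \text{ or about } 6\%$$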

What about non-I/O? Let's assume that about 50% of CPU time is devoted to fixed-overhead calculations like stack pushes or conditional jumps. The rest of the time is spent doing in-memory copying (which will go down 10%), sorting (which will go down 10% because more keys fit in temporary memory), and operations on columns (which will go down 10% because of the on-chip cache, and more than 10% for operations on character columns).
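Under those assumptions, the arithmetic for the conclusion that follows is simply:

$$\underbrace{50\% \times 0\%}_{\text{fixed overhead}} + \underbrace{50\% \times 10\%}_{\text{size-dependent work}} = 5\% \text{ less CPU time}$$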

In simplified terms, then, let's conclude that for an example scenario where the row size goes down 10%, the CPU time goes down 5%. Yes, many assumptions went into this conclusion; as we said at the start, hedge and hum: yes, but, if! But we can draw this generalization from our conclusion: If opportunities exist to reduce row size by 10%, you now have plausible reasons to conjecture that CPU time will correspondingly be reduced by 5%. (We ran a very simple 10-character versus 9-character test on the Big Eight and got an average speed gain of 6.5%.) Of course you will have to measure real data afterward to know for sure, but the conjecture will help you decide whether it's worth trying to reduce row size at all.

Fixed or Variable?

"And if the bed was too large, Procrustes stretched his guest on a rack until he fitted ."

Plutarch, "Theseus The Athenian Adventurer"

When should a column be fixed-size? When should it be variable-length? Does the distinction make any difference?

The best-known example of a fixed-size SQL data type is CHARACTER (or CHAR), as opposed to CHARACTER VARYING (or VARCHAR), which may vary in size. In fact, several other data types have fixed- and variable-length equivalents, such as ANSI SQL's BIT versus BIT VARYING, IBM's GRAPHIC versus VARGRAPHIC, Sybase's BINARY versus VARBINARY, and so on. But sometimes the storage method for a column depends not so much on the defined data type as on the column's nullability attribute, or even on the data type of other columns. To be specific (a sketch after this list illustrates):

  • For Microsoft 6.5 and Sybase, if a column can contain NULL, then it is also variable-length. That is, a column defined as VARCHAR(15) NOT NULL is variable-length, but so is a column defined as CHAR(15) NULL.

  • For MySQL, if any column in a table is defined as variable-length, then all columns are. So if you define one column as CHAR(15) and another as VARCHAR(15), you end up with two variable-length columns. (On the other hand, if all character columns are defined with sizes of four bytes or fewer, you end up with all fixed-size columns regardless of the data type definition.)

  • For all other DBMSs or situations, the column's data type declaration is definitive.
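To make those rules concrete, here is a hypothetical table definition (the table and column names are ours), with comments noting how each column would be stored:

 CREATE TABLE Table1 (
   column1 CHAR(15) NOT NULL,    /* fixed-size by declaration */
   column2 CHAR(15) NULL,        /* variable-length in Microsoft 6.5 and
                                    Sybase, because it can contain NULL */
   column3 VARCHAR(15) NOT NULL  /* variable-length by declaration; in MySQL
                                    its mere presence makes column1 and
                                    column2 variable-length too */
 )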

All DBMSs store the size of a variable-length column in a separate value that is between one and four bytes long. Because of this overhead, it is pointless to ask for variable-length if the average size of the column's data will be less than four bytes, or if the average size doesn't vary. The result would be a swelling of row size, not a shrinking. (For example, assuming a two-byte size value, a VARCHAR(32) column whose data averages 10 bytes takes about 12 bytes per row, versus a constant 32 bytes for CHAR(32).) However, in all other circumstances, variable-length columns are shorter, sometimes by a large factor. They are also more accurate. For example, in a CHAR(8) column, it's not possible to tell whether this string:

 'NANCY   ' 

contained trailing spaces originally, or whether the spaces were added as filler. With a VARCHAR(8) column, though, you would know that the trailing spaces were not added by the DBMS. [2]

[2] This applies to all DBMSs except InterBase, which uses Run-Length Encoding (RLE) for compression. RLE has the effect of reducing lengths for columns that have a large number of repeating bytes, such as spaces or NULLs. However, InterBase will expand to full defined sizes when reading or transmitting.

The argument against a variable-length column is that it takes longer to update. Let's be clear here. It does not take longer to read a row with variable-length columns, because all DBMSs use size values rather than null-terminators (as C does). It does take longer to update a row with variable-length columns, though, and the reason is shifts. (A shift is the movement of rows caused by a change in row size. When the length of a row changes due to an UPDATE or DELETE operation, that row and all subsequent rows on the page may have to be moved, or shifted.)

Consider a table with three columns defined as CHAR(3), VARCHAR(10), and VARCHAR(10), respectively. Figure 7-1 shows what a page for such a table might look like.

Figure 7-1. A page


Suppose the second column of the first row shown in Figure 7-1 is updated from ABCDE to ABCDEF. How will the DBMS fit the larger string in? Well, first it shifts the next column (containing WOMBATS) forward one byte to make room for the new string. This causes the WOMBATS column to overflow into the next row, so the DBMS then shifts that row down, and so on all the way down the page. If there is no free space at the end of the page, there's a page overflow, and the DBMS has to make room by taking an entire row of the page and migrating it to another page. So by adding one character to a column value, you can cause thousands of bytes to move, and can even cause an extra page write (or more, where logs are involved).
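In SQL terms, the statement that sets off all that shifting is as simple as this (a sketch; we're assuming the Figure 7-1 table is named Table1 and its second column is named column2, names the figure doesn't actually give):

 UPDATE Table1
   SET column2 = 'ABCDEF'   /* was 'ABCDE'; one extra byte, so the row grows
                               and later rows on the page must shift */
   WHERE column2 = 'ABCDE'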

Shifts, particularly forward shifts, are so severe that the time lost by one UPDATE is greater than the time saved by making the column a few bytes smaller. Don't focus on one column when you think about this! It's not really a matter of whether the column can change size; it's a matter of whether the row can change size. So even if you define a column (say, column1) to be fixed-size solely because you're worried about UPDATE speed, shifts can still occur if you define a later column (say, column2) to be variable-length.

How can you prevent shifts? Obviously, one way is to use fixed-size data types exclusively. But even that won't always work. For example, in Microsoft, rows affected by UPDATE triggers or replicated by certain options will cause a "DELETE followed by an INSERT (elsewhere)" instead of an "UPDATE in place"; that is, shifts happen anyway.

Another way, which won't reduce the number of shifts within a page but will at least reduce the overflows, is to declare at the outset that a certain amount of the page should be left free initially, in case rows expand (this option is variously called PCTFREE or FILLFACTOR; see the section "Free Page Space" in Chapter 8, "Tables").
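For example (a sketch only; keyword placement and defaults vary by DBMS), the Oracle-style and Microsoft-style declarations look like this:

 CREATE TABLE Table1 (
   column1 VARCHAR(32)
 ) PCTFREE 20                              /* Oracle style: leave 20% of each
                                              data page free for expansion */

 CREATE INDEX Index1 ON Table1 (column1)
   WITH FILLFACTOR = 80                    /* Microsoft style: fill index
                                              pages only to 80% initially */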

A third way to prevent shifts, which works if values tend to be small at first and grow later, is to start by putting a filler in a variable-length column. For example:

 INSERT INTO Table1 (variable_column)
   VALUES ('plain       -')                 /* do this initially */

 ... time passes ...

 UPDATE Table1
   SET variable_column = 'Rain in Spain'    /* no forward shift occurs */

Unfortunately, the filler characters can't be all spaces because of a bug, which we'll talk about later.

Finally, to prevent shifts, some DBMS vendors recommend that you not index variable-length columns.

The Bottom Line: The Size Factor

If (maximum size > (mean size + 4 bytes)) and there are many reads for every UPDATE, use variable-length data types because you'll save time and space.
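For example, a column whose defined maximum is 32 bytes but whose mean data size is 10 bytes passes the test (32 > 10 + 4), so if it's read far more often than it's updated, variable-length is the right choice. (The 32-byte and 10-byte figures are ours, purely for illustration.)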

If a column can contain NULL and the DBMS is Microsoft or Sybase, use variable-length data types because you have no choice.

Other considerations to take into account are whether the column is indexed, whether all rows are equally important, and whether triggers are used. We'll discuss these matters in other chapters.

If you use variable-length columns, set aside a small amount of free space per page to allow for swelling.

Once you decide to make a column variable-length, you could also decide to give it an absurd size because the defined length doesn't affect storage size. Don't do that; the DBMS might have to allot the full defined length in temporary uses. For example, when InterBase sends rows from the server to the client, the packet size is determined by the defined column sizes, not the average or actual sizes. So never define column lengths greater than you could possibly need. Remember that you can always ALTER TABLE later; you won't have fun, but you can do it.
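For instance (Microsoft-style syntax; other DBMSs spell this differently, and the names are hypothetical), widening a column later is a one-statement job:

 ALTER TABLE Table1
   ALTER COLUMN variable_column VARCHAR(100)   /* widen from a smaller
                                                  VARCHAR definition */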

   

