Section 5.4. Forcing Row Ordering

5.4. Forcing Row Ordering

There is another aspect to an index-organized table than just finding all required data in the index itself without requiring an additional access to the table. Because IOTs, being indexes, are, first and foremost, strongly ordered structures, their rows are internally ordered. Although the notion of order is totally foreign to the relational theory, from a practical point of view whenever a query refers to a range of values, it helps to find them together instead of having to gather data scattered all over the table. The most common example of this sort of application is range searching on time series data, when you are looking for events that occurred between two particular dates.

Most database systems manage to force such an ordering of rows by assigning to an index the role of defining the order of rows in the table. SQL Server and Sybase call such an index a clustered index . DB2 calls it a clustering index , and it has much the same effect in practice as an Oracle IOT. Some queries benefit greatly from this type of organization. But similar to index organized tables, updates to columns pertaining to the index that defines the order are obviously more costly because they entail a physical movement of the data to a new position corresponding to the "rank" of the new values. The ordering of rows inevitably favors one type of range-scan query at the expense of range scans on alternative criteria.

As with IOTs that are defined by the primary key, it is safer to use the primary key index as the clustering index, since primary keys are never updated (and if your application needs to update your primary key, there is something very seriously wrong indeed with your design, and it won't take long before there is something seriously wrong with the integrity of your data). In contrast to IOTs, an index other than the one that enforces the primary key constraint can be chosen as the clustering index. But remember that any ordering unduly favors some processes at the expense of others. The primary key, if it is a natural key, has a logical significance; the associated index is more equal than all the other indexes that may be defined on the table, even unique ones. If some columns must be given some particular prominence through the physical implementation, these are the ones.

Figure 5-4 illustrates the kind of differences you may expect between clustered and non-clustered index performance in practice. If we take the same table as was used for the index-organized table in Figure 5-3's example (a three-column primary key plus nine numeric columns), and if we insert rows in a totally random way, the cost of insertion into a table where the primary key index is clustered is quite high, since tests show that our insertion rate is about half the insertion rate obtained with a non-clustered primary key. But when we run a range scan test on about 50,000 rows, this clustered index provides really excellent performance. In this particular case, the clustered index allows us to outperform the non-clustered approach by a factor of 20. We should, of course, see no difference when fetching a single row.

A structural optimization, such as a clustered index or an IOT, necessarily has some drawbacks. For one thing, such structures apply some strong, tree-based, and therefore hierarchical ordering to tables. This approach resurrects many of the flaws that saw hierarchical databases replaced by relational databases in the corporate world. Any hierarchical structure favors one vision of the data and one access path over all the others. One particular access path will be better than anything you could get with a non-clustered table, but most other access paths are likely to be significantly worse. Updates may prove more costly. The initial tidy disposition of the data inside the database files may deteriorate faster at the physical level, due to chaining, overflow pages, and similar constructs, which take a heavy toll on performance. Clustered structures are excellent in

Figure 5-4. How clustered indexes perform

some cases, boosting performance by an impressive factor. But they always need to be carefully tested, because there is a high probability that they will make many other processes run slower. One must judge their suitability while looking at the global pictureand not on the basis of one particular query.

Range scanning on clustered data can give impressive performance, but other queries will suffer as a consequence.