Section 3.7. System-Generated Keys

3.7. System-Generated Keys

System-generated keys (whether through a special number column defined as self-incrementing or through the use of system-generated counters such as Oracle's sequences) require special care. Some inexperienced designers just love system-generated keys even when they have perfectly valid natural identifiers at their disposal. System-generated sequential numbers are certainly a far better solution than looking for the greatest current value and incrementing it by one (a certain recipe for generating duplicates in an environment with some degree of concurrency), or storing a "next value" that has to be locked and updated into a dedicated table (a mechanism that serializes and dramatically slows down accesses). Nevertheless, when many concurrent insertions are running against the same table in which these automatic keys are being generated, some very serious contention can occur at the creation point of the primary key index level. The purpose of the primary key index is primarily to ensure the uniqueness of the primary key columns.

The problem is usually that if there is one unique generator (as opposed to as many generators as there are concurrent processes, hitting totally disjoint ranges of values) we are going to rapidly generate numbers that are in close proximity to each other. As a result, when trying to insert key values into the primary key index, all processes are going to converge on the same index page, and the DBMS engine will have to serializethrough locks, latches, semaphores, or whichever locking mechanism is at its disposalthe various processes so that each one does not try to overwrite the bytes that another one is writing to. This is a typical example of contention that leads to some severe underuse of the hardware. Processes that could and should work in parallel have to wait in order, one behind the other. This bottleneck can be particularly severe on multi-processor machines, the very environment in which parallelism should be operating.

Some database systems provide some means to reduce the impact of system-generated keys; for instance, Oracle allows you to define reverse indexes, indexes in which the sequence of bits making up the key is inversed before being stored into the index. To indicate a very approximate idea of what such an index looks like, let's simply take the same names of marshals as we did in Figure 3-5 and reverse the letters instead of bits. We get something looking like Figure 3-10.

Figure 3-10. A simplified representation of reverse indexing

It is easy to understand that even when we insert names that are alphabetically very close to one another, they are spread all over the various branches of the index tree: look for instance at the respective positions of MASSENA (a.k.a. ANESSAM), MORTIER (REITROM) and MURAT (TARUM). Therefore, we hit different places in the index and have much less contention than with a normally organized index. Close grouping of the index values would lead to a very high write activity that is very localized within the index. Before searching, Oracle simply applies the same reversing to the value we want to search against, and then proceeds as usual to traverse the index tree.

Of course, every silver lining has its cloud: when our search condition attempts to use a leading string search like this:

     where name like 'M%'

which is a typical range search, the reverse index is no help at all. By contrast, a regular index can be used to quickly identify the range of values beginning with a certain string that we are interested in. The inability of reverse indexes to be used for range searches is, of course, a very minor inconvenience with system-generated keys, which are often unknown to end users and therefore unlikely to be the object of range scans. But it becomes a major hindrance when rows contain timestamps. Timestamped rows might arrive in close succession, making the timestamp column a potentially good candidate for reverse indexing, but then a timestamp is also the type of column against which we are quite likely to be looking for ranges of values.

The hash index used in some database systems represents a different way to avoid bottlenecking index updates all on one index page. With hash indexing, the actual key is transformed into a meaningless, randomly distributed numeric key generated by the system, which is based on the column value being indexed. Although it is not impossible for two values to be transformed into similar, meaningless keys, two originally "close" keys will normally hash into two totally disconnected values. Once again, hash indexes represent a trick to avoid having a hot spot inside an index tree, but that benefit too, comes with certain restrictions. The use of a hash index is on an "equality or nothing" basis; in other words, range searching, or indeed any query against a part of the index key, is out of the question. Nevertheless, direct access based on one particular value of the key can be very fast.

Even when there are solutions to alleviate contention risks, you should not create too many system-generated identifiers. I have sometimes found system-generated keys in each and every table (for instance a special, single detail_id for the type of order_details table mentioned above, instead of the more natural combination of order_id and some sequential number within the order). This blunderbuss approach is something that, by any standard, simply cannot be justifiedespecially for tables that are referenced by no other table.

System-generated keys can provide benefit in the right circumstances, but beware of their indiscriminate use!