Section 3.4. Indexes with Functions and Conversions

3.4. Indexes with Functions and Conversions

Indexes are usually implemented as tree structuresmostly complex treesto avoid a fast decay of indexes on heavily inserted, updated, and deleted tables. To find the physical location of a row, the address of which is stored in the index, one must compare the key value to the value stored in the current node of the tree to be able to determine which sub-tree must be recursively searched. Let's now suppose that the value that drives our search doesn't exactly match an actual column value but can be compared to the result of a function f( ) applied to the column value. In that case we may be tempted to express a condition as follows:

     where f(indexed_column) = 'some value'

This kind of condition will typically torpedo the index, making it useless. The problem is that nothing guarantees that the function f( ) will keep the same order as the index data; in fact, in most cases it will not. For instance, let's suppose that our tree-index looks like Figure 3-6.

Figure 3-6. A simplistic representation of how names might be stored in an index

(If the names look familiar, it is because they are those of some of Napoleon's marshals.) Figure 3-6 is of course an outrageously simplified representation, just for the purpose of explaining a particular point; the actual indexes do not look exactly like the binary tree shown in Figure 3-6. If we look for the MASSENA key, with this search condition:

     where name = 'MASSENA'

then the index search is simple enough. We hit LANNES at the root of the tree and compare MASSENA to LANNES. We find MASSENA to be greater, based on the alphabetical order. We therefore recursively search the right-hand sub-tree, the root of which is MORTIER. Our search key is smaller than MORTIER, so we search the left-hand sub-tree and immediately hit MASSENA. Bingosuccess.

Now, let's say that we have a condition such as:

     where substr(name, 3, 1) = 'R'

The third letter is an uppercase R--which should return BERNADOTTE, MORTIER, and MURAT. When we make the first visit to the index, we hit LANNES, which doesn't satisfy the condition. Not only that, the value that is associated with the current tree node gives us no indication whatsoever as to which branch we should continue our search into. We are at a loss: the fact that the third letter is R is of no help in deciding whether we should search the left sub-tree or the right sub-tree (in fact, we find elements belonging to our result set in both sub-trees), and we are unable to descend the tree in the usual way, by selecting a branch thanks to the comparison of the search key to the value stored in the current node.

Given the index represented in Figure 3-6, selecting names with an R in the third position is going to require a sequential data scan, but here another question arises. If the optimizer is sufficiently sophisticated, it may be able to judge whether the most efficient execution path is a scan of the actual data table or inspecting, in turn, each and every node in the index on the column in question. In the latter case, the search would lead to an index-based retrieval, but not as envisaged in the original model design since we would be using the index in a rather inefficient way.

Recall the discussion on atomicity in Chapter 1. Our performance issue stems from a very simple fact: if we need to apply a function to a column, it means that the atomicity of data in the table isn't suitable for our business requirements. We are not in 1NF!

Atomicity, though, isn't a simple notion. The ultra-classic example is a search condition on dates. Oracle, for instance, uses the date type to store not only the date information, but also the time information , down to the second (this type is actually known as datetime to most other database systems). However, to test the unwary, the default date format doesn't display the time information. If you enter something such as:

     where date_entered = to_date('18-JUN-1815',                                  'DD-MON-YYYY')

then only the rows for which the date (and time!) happens to be exactly the 18th of June 1815 at 00:00 (i.e., at midnight) are returned. Everyone gets caught out by this issue the very first time that they query datetime data. Quite naturally, the first impulse is to suppress the time information from date_entered, which the junior practitioners usually do in the following way:

     where trunc(date_entered) = to_date('18-JUN-1815',                                         'DD-MON-YYYY')

Despite the joy of seeing the query "work," many people fail to realize (before the first performance issues begin to arise) that by writing their query in such a way they have waved goodbye to using the index on date_entered, assuming there was one. Does all this mean that you cannot be in 1NF if you are using datetime columns? Fortunately, no. In Chapter 1, I defined an atomic attribute as an attribute in which a where clause can always be referred to in full. You can refer in full to a date if you are using a range condition. An index on date_entered is usable if the preceding condition is written as:

     where date_entered >= to_date('18-JUN-1815',                                      'DD-MON-YYYY')       and date_entered < to_date('19-JUN-1815',                                     'DD-MON-YYYY')

Finding rows with a given date in this way makes an index on date_entered usable, because the very first condition allows us to descend the tree and reach a sorted list of all keys at the bottom of the index hierarchy (we may envision the index as a sorted list of keys and associated addresses, above which is plugged a tree allowing us to get direct access to every item in the list). Therefore, once the first condition has taken us to the bottom layer of the index and to the very first item of interest in the list, all we have to do is scan the list as long as the second condition is true. This type of access is known as an index range scan .

The trap of functions preventing the use of indexes is often even worse if the DBMS engine is able to perform implicit conversions when a column of a given type is equated or compared to a constant of another type in a where conditiona logical error and yet one that is allowed by SQL. Once again Oracle provides an excellent example of such behavior. For instance, dangers arise when a character column is compared to a number. Instead of immediately generating a run-time error, Oracle implicitly converts the column to a number to enable the comparison to take place. The conversion may indeed generate a run-time error if there is an alpha character in that numerical string, but in many cases when a string of digits without any true numerical meaning is stored as characters (social security numbers or a date of birth shown as mmddyy, or ddmmyy, both meaning the same, but having very different numerical values), the conversion and subsequent comparison will "work"--except that the conversion will have rendered any index on the character column almost useless.

In the light of the neutralization of indexes by functions, Oracle's design choice to apply the conversion to the column rather than to the constant may at first look surprising. However, that decision does make some sense. First of all, comparing potatoes to carrots is a logical error. By applying the conversion to the column, the DBMS is more likely (depending on the execution path) to encounter a value to which the conversion does not apply, and therefore the DBMS is more likely to generate a runtime error. An error at this stage of the process will prove a healthy reminder to the developer, doubtless prompting for a correction in the actual data field and raising agonizing questions about the quality of the data. Second, assuming that no error is generated, the very last thing we want is to return incorrect information. If we encounter:

     where account_number = 12345

it is quite possible, and in fact most likely, that the person who wrote the query was expecting the account 0000012345 to be returnedwhich will be the case if account_number (the alpha string) is converted to number, but not if the query 12345 is converted to a string without any special format specification.

One may think that implicit conversions are a rare occurrence, akin to bugs. There is much truth in the latter point, but implicit conversions are in fact pretty common, especially when such things come into play as a parameters table holding in a column named parameter_value string representations of numbers and dates, as well as filenames or any other regular character string. Always make conversions explicit by using conversion functions.

It is sometimes possible to index the result of a function applied to one or more columns. This facility is available with most products under various names (functional index, function-based index, index extension, and so on, or, more simply, index on a computed column). In my view, one should be careful with this type of feature and use it only as a standby for those cases in which the code cannot be modified.

I have already mentioned the heavy overhead added to data modifications as a result of the presence of indexes. Calling a function in addition to the normal index load each time an index needs to be modified cannot improve the situation: indeed it only adds to the total index maintenance cost. As the date_entered example given earlier demonstrates, creating a function-based index may be the lazy solution to something that can easily be remedied by writing the query in a different way. Furthermore, nothing guarantees that a function applied to a column retains the same degree of precision that a query against the raw column will achieve. Suppose that we store five years of sales online and that the sales_date column is indexed. On the face of it, such an index looks like an efficient one. But indexing with a function that is the month part of the date is not necessarily very selective, especially if every year the bulk of sales occurs in the run up to Christmas. Evaluating whether the resulting functional index will really bring any benefit is not necessarily easy without very careful study.

From a purely design point of view, one can argue that a function is an implicit recognition that the column in question may be storing two or more discrete items of data. Use of a functional index is, in most cases, a way to extract some part of the data from a column. As pointed out earlier, we are violating the famous first normal form, which requires data to be "atomic." Not using strictly "atomic" data in the select list is a forgivable sin. Repeatedly using "subatomic" search criteria is a deadly one.

There are some cases, though, when a function-based index may be justified. Case-insensitive searches are probably the best example; indexing a column converted to upper- or lowercase will allow us to perform case-insensitive searches on that column efficiently. That said, forcing the case during inserts and updates is not a bad solution either. In any event, if data is stored in lowercase, then required in uppercase, one has to question the thoroughness with which the original data design was carried out.

Another tricky conundrum is the matter of duration in the absence of a dedicated interval data type. Given three time fields, a start date, a completion date, and a duration, one value can be determined from any existing twobut only by either building a functional index or by storing redundant data. Whichever solution is followed, redundancy will be the inevitable consequence: in the final analysis, you must weigh the benefits and disadvantages of the issues surrounding function-based indexes so that you can make informed decisions about using them.

Use of functional indexes is often implicit recognition that your data analysis has not even resolved basic data item atomicity.