6.8. Simple or Range Searching on Dates

Among search criteria, dates (and times) hold a place that is all their own. Dates are extremely common, and more likely than other types of data to be subjected to range conditions, whether bounded ("between this date and that date") or only partially bounded ("before this date"). Very often, and this is the situation described here, the result set is derived from searches against date values relative to the current date (e.g., "six months earlier than the current date").

The example in the previous section, "Result Set Obtained by Aggregation," refers to a sales_history table; our condition was on an amount, but with this type of table it is much more common to have conditions on dates, especially to get a snapshot of the data either at a given date or between two dates. When you are looking for a value on a given date in a table containing historical data, you must pay particular attention to the way you identify current data. The way you handle current data may happen to be a special case of data predicated on an aggregate condition.

I have already pointed out in Chapter 1 that the design of a table destined to store historical data is a tricky affair and that there is no easy, ready-made solution. Much depends on what you plan to do with your data, whether you are primarily interested in current values or in values as of a particular date. The best solution also depends on how fast data becomes outdated. If you are a retailer and wish to keep track of the wares you sell, it is likely that, unless your country suffers severe hyper-inflation, the rate of change of your prices will be pretty slow. The rate of change will be higher, possibly much higher, if you are recording the price of financial instruments or monitoring network traffic.

To a large extent, what matters most with history tables is how much historical data you keep on average per item: you may store a lot of historical information for very few items, or have few historical records for a very large number of items, or anything in between. The point here is that the selectivity of any item depends on the number of items being tracked, the frequency of sampling (e.g., once per day, or every change during the day), and the total time period over which the tracking takes place (unbounded, the current year only, etc.). We shall therefore first consider the case in which we have many items with few historical values, then the opposite case of few items with a rich history, and finally the problem of how to represent the current value.

6.8.1. Many Items, Few Historical Values

If we don't keep an enormous amount of historical data per item, the identification of an item is quite selective by itself. Specifying the item under study restricts our "working set" to just a few historical rows, and it then becomes fairly easy to identify the value at a given reference date (the current or a previous date) as the value recorded at the closest date prior to the reference date. In this case, we are dealing once again with aggregate values.

Unless some artificial, surrogate key has been created (and this is a case where there is no real need for a surrogate key), the primary key will generally be a composite key on the identifier of items (item_id) and the date associated with the historical value (record_date). We mostly have two ways of identifying the rows that store values that were current as of a given reference date: subqueries and OLAP functions.
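
A minimal sketch of such a table may help picture the discussion that follows; the table name and key columns match the queries used in this section, while the item_value column is purely hypothetical:

     create table hist_data
            (item_id      integer        not null,
             record_date  date           not null,
             item_value   numeric(10,2)  not null,  -- hypothetical payload column
             constraint hist_data_pk
                primary key (item_id, record_date))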

6.8.1.1. Using subqueries

If we are looking for the value of one particular item as of a given date, then the situation is relatively simple. In fact, the situation is deceptively simple, and you'll often encounter a reference to the value that was current for a given item at a given date coded as:

      select whatever
      from hist_data as outer
      where outer.item_id = somevalue
        and outer.record_date = (select max(inner.record_date)
                                 from hist_data as inner
                                 where inner.item_id = outer.item_id
                                   and inner.record_date <= reference_date)

It is interesting to consider what this type of construct implies in terms of execution path. First of all, the inner query is correlated to the outer one, since the inner query references the item_id of the current row returned by the outer query. Our starting point is therefore the outer query.

Logically, from a theoretical point of view, the order of the columns in a composite primary key shouldn't matter much. In practice, it is critical. If we have made the mistake of defining the primary key as (record_date, item_id) instead of (item_id, record_date), we desperately need an additional index on item_id for the inner query; otherwise, we will be unable to efficiently descend the tree-structured index. And we know how costly each additional index can be.
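
To make the point concrete, here is a sketch of the extra index that would become necessary with the wrongly ordered key (the index name is, of course, made up):

     -- Needed only if the primary key was defined as (record_date, item_id):
     -- without it, the correlated subquery cannot descend an index
     -- on item_id efficiently.
     create index hist_data_item_idx
        on hist_data (item_id)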

Starting with our outer query and finding the various rows that store the history of item_id, we will then use the current value of item_id to execute the subquery each time. Wait! This inner query depends only on item_id, which is, by definition, the same for all the rows we check! The logical conclusion: we are going to execute exactly the same query, returning exactly the same result for each historical row for item_id. Will the optimizer notice that the query always returns the same value? The answer may vary. It is better not to take the chance.

There is no point in using a correlated subquery if it always returns the same value for all the rows for which it is evaluated. We can easily uncorrelate it:

     select whatever
     from hist_data as outer
     where outer.item_id = somevalue
       and outer.record_date = (select max(inner.record_date)
                                from hist_data as inner
                                where inner.item_id = somevalue
                                  and inner.record_date <= reference_date)

Now the subquery can be executed without accessing the table: it finds everything it requires inside the primary key index.

It may be a matter of personal taste, but a construct that emphasizes the primary key is arguably preferable to the preceding approach, if the DBMS allows comparing several columns to the output of a subquery (a feature that isn't supported by all products):

     select whatever
     from hist_data as outer
     where (outer.item_id, outer.record_date) in
                               (select inner.item_id, max(inner.record_date)
                                from hist_data as inner
                                where inner.item_id = somevalue
                                  and inner.record_date <= reference_date
                                group by inner.item_id)

The choice of a subquery that returns precisely the columns matching the composite primary key is not totally gratuitous. If we now need to return values for a list of items, possibly the result of another subquery, this version of the query naturally suggests a good execution path: replace somevalue in the inner query with an in( ) list or a subquery, and the overall query will go on performing efficiently under the very same assumption that each item has a relatively short history. We have also replaced the equality condition with an in clause: in most cases the behavior will be exactly the same. As usual, it is at the fringes that you encounter differences. What happens if, for instance, the user mistyped the identification of the item? The in( ) will simply report that no data was found, while the equality may report the error differently.
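
As a sketch of what this might look like, assume the items of interest are provided as a simple list of placeholder values in the same spirit as somevalue:

     select whatever
     from hist_data as outer
     where (outer.item_id, outer.record_date) in
                               (select inner.item_id, max(inner.record_date)
                                from hist_data as inner
                                where inner.item_id in (value1, value2, value3)
                                  and inner.record_date <= reference_date
                                group by inner.item_id)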

6.8.1.2. Using OLAP functions

With databases that support them, OLAP functions such as row_number( ), which we have already used in the self-joins situation, can provide a satisfactory, and sometimes even more efficient, way to answer the same question, "what was the current value for one particular item at a given date?" (remember that OLAP functionality does, however, introduce a distinctly non-relational aspect to the proceedings[*]).

[*] ...even if the term OLAP was coined by Dr. E.F. Codd himself in a 1993 paper.

OLAP functions belong to the non-relational layer of SQL. They represent the final, or almost final, step in query execution, since they have to operate on the post-retrieval result set after the filtering has completed.

With a function such as row_number( ) we can assign a degree of freshness (one meaning most recent) to the data by ranking on date:

     select row_number( ) over (partition by item_id
                                order by record_date desc) as freshness,
            whatever
     from hist_data
     where item_id = somevalue
       and record_date <= reference_date

Selecting the freshest data is then simply a matter of only retaining the rows with a value of one for freshness:

     select x.<suitable_columns>
     from (select row_number( ) over (partition by item_id
                                      order by record_date desc) as freshness,
                  whatever
           from hist_data
           where item_id = somevalue
             and record_date <= reference_date) as x
     where x.freshness = 1

In theory, there should be hardly any difference between the OLAP function approach and the use of subqueries. In practice, an OLAP function hits the table only once, even if the usual sorting happens behind the scenes. There is no need for additional access to the table, not even a fast one that uses the primary key. The OLAP function approach may therefore be faster (albeit only slightly so).

6.8.2. Many Historical Values Per Item

The picture may be different when we have a very large number of historical values, for instance in a monitoring system in which metrics are collected at a rather high frequency. The difficulty here lies in the fact that all the intermediate sorting required for identifying the value at or nearest a given date may have to operate on a really large amount of data.

Sorting is a costly operation. If we apply the principles of Chapter 4, the only way to reduce the thickness of the non-relational layer is to do a bit more work at the relational level, by increasing the amount of filtering. In such a case, it is very important to narrow our scope by bracketing more precisely the date (or time) for which we want the data. If we only provide an upper boundary, we shall have to scan and sort the full history since the beginning of time. If data is collected at a high frequency, it is reasonable to also give a lower limit. If we succeed in restricting the "working set" of rows to a manageable size, we are back to the case in which we have relatively few historical values per item. If specifying both an upper boundary (such as the current date) and a lower boundary isn't an option, our only hope is partitioning per item; operating on a single partition will take us closer to the "large result set" situation.
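
In terms of the uncorrelated subquery seen earlier, this simply means adding a lower bound; a sketch, assuming the sampling frequency makes lower_bound_date a safe limit below which no relevant row can hide:

     select whatever
     from hist_data as outer
     where outer.item_id = somevalue
       and outer.record_date = (select max(inner.record_date)
                                from hist_data as inner
                                where inner.item_id = somevalue
                                  and inner.record_date <= reference_date
                                  and inner.record_date >= lower_bound_date)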

6.8.3. Current Values

When we are predominantly interested in the most recent or current values, it is very tempting to design a way to avoid either the nested subquery or the OLAP function (both of which entail a sort) and hit the proper values directly. We mentioned in Chapter 1 that one solution to this problem is to associate each value with some "end date" (the kind of "best before" date you find on cereal boxes) and to say that for current values this end date is far, far away in the future (let's say December 31, 2999). We also mentioned that there were some practical issues associated with such a design, and the time has now come to explore these issues.

With a fixed date, it certainly becomes extremely easy to find the current value. Our query simply becomes:

     select whatever
     from hist_data
     where item_id = somevalue
       and record_date = fixed_date_in_the_future

We then hit the right row, spot on, through the primary key. And of course, nothing prevents us from using either subqueries or OLAP functions whenever we need to refer to a date other than the current one. There are, however, two main drawbacks to this approach, an obvious one and a more subtle one:

  • The obvious drawback is that each insertion of a new historical value first requires updating what used to be the current value with, for example, today's date, to mean that it was the current value until today. Then the new value can be inserted with the far-future date, to mean that it is now the current value until further notice (a sketch of these two steps follows this list). This process doubles the amount of work, which is bad enough. Moreover, since in relational theory the primary key is what identifies a row, the combination (item_id, record_date) can be unique but cannot be the primary key, since we have to partially update it. We therefore need a surrogate key to be referenced by foreign keys (an identity column or a sequence), which further complicates programs. The trouble with big historical tables is that, usually, to grow that big they also undergo a high rate of insertion. Does the benefit of faster querying offset the disadvantage of slower insertion? It's difficult to say, but it is definitely a question worth asking.

  • The subtle drawback has to do with the optimizer. The optimizer relies on statistics that may be of variable detail, and it is not unusual for it to check the lowest and highest values in a column to try to assess the spread of values. Let us say that our historical table contains values since January 1, 2000. Our data will therefore consist of perhaps 99.9% historical data, spread over relatively few years, and 0.1% current data, officially dated December 31, 2999. To the optimizer, the data appears to be spread over a full millennium: its view of the range is skewed by the far-future date used to flag current values ("record_date = fixed_date_in_the_future"). The problem is that when you search for something other than current values (for instance, if you want to collect variations over time for statistical purposes), the optimizer may well decide that, since you are accessing such a tiny fraction of the millennium, using indexes is the thing to do, when what you really need is to scan the data. Skewness can lead to totally wrong execution plans, which are not easy to correct.
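
As an illustration of the first drawback, here is a sketch of the double work involved when a new value arrives, in the same placeholder style as the previous queries (item_value is again a hypothetical payload column, and the surrogate key is omitted for brevity):

     -- Close the row that used to be the current value...
     update hist_data
     set    record_date = current_date
     where  item_id = somevalue
       and  record_date = fixed_date_in_the_future;

     -- ...then insert the new current value with the far-future date.
     insert into hist_data (item_id, record_date, item_value)
     values (somevalue, fixed_date_in_the_future, new_value);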

You must understand your data and your data distributions if you are to understand how the optimizer views your system.



