6.8. Simple or Range Searching on Dates

Among search criteria, dates (and times) hold a place that is all their own. Dates are extremely common, and more likely than other types of data to be subjected to range conditions, whether bounded ("between this date and that date") or only partially bounded ("before this date"). Very often, as in the situation described here, the result set is derived from searches against date values relative to the current date (e.g., "six months earlier than the current date"). The example in the previous section, "Result Set Obtained by Aggregation," refers to a sales_history table; our condition was on an amount, but it is much more common with this type of table to have conditions on date, especially to get a snapshot of the data either at a given date or between two dates. When you are looking for a value on a given date in a table containing historical data, you must pay particular attention to the way you identify current data. The handling of current data may happen to be a special case of data predicated on an aggregate condition. I have already pointed out in Chapter 1 that the design of a table destined to store historical data is a tricky affair and that there is no easy, ready-made solution. Much depends on what you plan to do with your data, whether you are primarily interested in current values or in values as of a particular date. The best solution also depends on how fast data becomes outdated. If you are a retailer and wish to keep track of the wares you sell, it is likely that, unless your country suffers severe hyperinflation, the rate of change of your prices will be pretty slow. The rate of change will be higher, possibly much higher, if you are recording the price of financial instruments or monitoring network traffic.
To a large extent, what matters most with history tables is how much historical data you keep on average per item: you may store a lot of historical information for very few items, or have few historical records for a very large number of items, or anything in between. The point here is that the selectivity of any item depends on the number of items being tracked, the frequency of sampling (e.g., either once per day or every change during the day), and the total time period over which the tracking takes place (infinite, purely annual, etc.). We shall therefore first consider the case when we have many items with few historical values, then the opposite case of few items with a rich history, and finally the problem of how to represent the current value.

6.8.1. Many Items, Few Historical Values

If we don't keep an enormous amount of historical data per item, the identification of an item is quite selective by itself. Specifying the item under study restricts our "working set" to just a few historical rows, and it then becomes fairly easy to identify the value at a given reference date (the current or a previous date) as the value recorded at the closest date prior to the reference date. In this case, we are dealing once again with aggregate values. Unless some artificial, surrogate key has been created (and this is a case where there is no real need for a surrogate key), the primary key will generally be a composite key on the identifier of items (item_id) and the date associated with the historical value (record_date). We mostly have two ways of identifying the rows that store values that were current as of a given reference date: subqueries and OLAP functions.

6.8.1.1. Using subqueries

If we are looking for the value of one particular item as of a given date, then the situation is relatively simple.
In fact, the situation is deceptively simple, and you'll often encounter a reference to the value that was current for a given item at a given date coded as:

    select whatever
    from hist_data as outer
    where outer.item_id = somevalue
      and outer.record_date = (select max(inner.record_date)
                               from hist_data as inner
                               where inner.item_id = outer.item_id
                                 and inner.record_date <= reference_date)

It is interesting to consider what this type of construct implies in terms of the execution path. First of all, the inner query is correlated to the outer one, since the inner query references the item_id of the current row returned by the outer query. Our starting point is therefore the outer query. Logically, from a theoretical point of view, the order of the columns in a composite primary key shouldn't matter much. In practice, it is critical. If we have made the mistake of defining the primary key as (record_date, item_id) instead of (item_id, record_date), we desperately need an additional index on item_id for the inner query; otherwise, we will be unable to efficiently descend the tree-structured index. And we know how costly each additional index can be. Starting with our outer query and finding the various rows that store the history of item_id, we will then use the current value of item_id to execute the subquery each time. Wait! This inner query depends only on item_id, which is, by definition, the same for all the rows we check. The logical conclusion: we are going to execute exactly the same query, returning exactly the same result, for each historical row of item_id. Will the optimizer notice that the query always returns the same value? The answer may vary, and it is better not to take the chance. There is no point in using a correlated subquery if it always returns the same value for all the rows for which it is evaluated.
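As a quick way to experiment with this pattern, here is a minimal sketch using SQLite from Python; the hist_data table, its price column, and the sample data are hypothetical stand-ins for the schema discussed above (SQLite reserves the words outer and inner, so the aliases become o and i):

```python
import sqlite3

# In-memory database with the composite primary key in the
# recommended (item_id, record_date) order.
conn = sqlite3.connect(":memory:")
conn.execute("""
    create table hist_data (
        item_id     integer not null,
        record_date text    not null,   -- ISO dates compare correctly as text
        price       real    not null,
        primary key (item_id, record_date)
    )""")
conn.executemany("insert into hist_data values (?, ?, ?)",
                 [(1, "2006-01-10", 9.99),
                  (1, "2006-03-05", 10.49),
                  (1, "2006-06-20", 10.99),   # latest row before the reference date
                  (2, "2006-02-01", 4.25)])

# The correlated form discussed above: the subquery is re-evaluated
# for each candidate row of the outer query.
row = conn.execute("""
    select o.price
    from hist_data o
    where o.item_id = ?
      and o.record_date = (select max(i.record_date)
                           from hist_data i
                           where i.item_id = o.item_id
                             and i.record_date <= ?)""",
    (1, "2006-07-01")).fetchone()
print(row[0])   # → 10.99, the value current as of 2006-07-01
```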
We can easily uncorrelate it:

    select whatever
    from hist_data as outer
    where outer.item_id = somevalue
      and outer.record_date = (select max(inner.record_date)
                               from hist_data as inner
                               where inner.item_id = somevalue
                                 and inner.record_date <= reference_date)

Now the subquery can be executed without accessing the table: it finds everything it requires inside the primary key index. It may be a matter of personal taste, but a construct that emphasizes the primary key is arguably preferable to the preceding approach, if the DBMS allows comparing several columns to the output of a subquery (a feature that isn't supported by all products):

    select whatever
    from hist_data as outer
    where (outer.item_id, outer.record_date) in
          (select inner.item_id, max(inner.record_date)
           from hist_data as inner
           where inner.item_id = somevalue
             and inner.record_date <= reference_date
           group by inner.item_id)

The choice of a subquery that precisely returns the columns matching a composite primary key is not totally gratuitous. If we now need to return values for a list of items, possibly the result of another subquery, this version of the query naturally suggests a good execution path. Replace somevalue in the inner query by an in( ) list or a subquery, and the overall query will go on performing efficiently under the very same assumption that each item has a relatively short history. We have also replaced the equality condition with an in clause: in most cases the behavior will be exactly the same. As usual, it is at the fringes that you encounter differences. What happens if, for instance, the user mistypes the identification of the item? The in( ) will simply find no data, while the equality may return a different error.

6.8.1.2. Using OLAP functions

With databases, OLAP functions such as row_number( ), which we have already used in the self-joins situation, can provide a satisfactory and sometimes even more efficient way to answer the same question: "what was the current value for one particular item at a given date?" (remember that OLAP functionality does, however, introduce a distinctly non-relational aspect to the proceedings[*]).
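Picking up the composite-key in( ) variant from the subquery discussion above, the following sketch (again a hypothetical schema; SQLite 3.15 or later is assumed for row-value support) shows how it extends naturally from one item to a list of items:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table hist_data (
        item_id     integer not null,
        record_date text    not null,
        price       real    not null,
        primary key (item_id, record_date)
    )""")
conn.executemany("insert into hist_data values (?, ?, ?)",
                 [(1, "2006-01-10", 9.99),
                  (1, "2006-06-20", 10.99),
                  (2, "2006-02-01", 4.25),
                  (2, "2006-05-15", 4.75),
                  (3, "2006-04-01", 7.00)])

# somevalue replaced by an in ( ) list: one "current as of" row per item,
# the subquery returning exactly the composite primary key columns.
rows = conn.execute("""
    select o.item_id, o.price
    from hist_data o
    where (o.item_id, o.record_date) in
          (select i.item_id, max(i.record_date)
           from hist_data i
           where i.item_id in (1, 2)
             and i.record_date <= '2006-07-01'
           group by i.item_id)
    order by o.item_id""").fetchall()
print(rows)   # → [(1, 10.99), (2, 4.75)]
```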
With a function such as row_number( ) we can assign a degree of freshness (one meaning most recent) to the data by ranking on date:

    select row_number( ) over (partition by item_id
                               order by record_date desc) as freshness,
           whatever
    from hist_data
    where item_id = somevalue
      and record_date <= reference_date

Selecting the freshest data is then simply a matter of retaining only the rows with a freshness value of one:

    select x.<suitable_columns>
    from (select row_number( ) over (partition by item_id
                                     order by record_date desc) as freshness,
                 whatever
          from hist_data
          where item_id = somevalue
            and record_date <= reference_date) as x
    where x.freshness = 1

In theory, there should be hardly any difference between the OLAP function approach and the use of subqueries. In practice, an OLAP function hits the table only once, even if the usual sorting happens behind the scenes. There is no need for additional access to the table, not even a fast one that uses the primary key. The OLAP function approach may therefore be faster (albeit only slightly so).

6.8.2. Many Historical Values Per Item

The picture may be different when we have a very large number of historical values: for instance, a monitoring system in which metrics are collected at a rather high frequency. The difficulty here lies in the fact that all the intermediate sorting required for identifying the value at or nearest a given date may have to operate on a really large amount of data. Sorting is a costly operation. If we apply the principles of Chapter 4, the only way to reduce the thickness of the non-relational layer is to do a bit more work at the relational level, by increasing the amount of filtering. In such a case, it is very important to narrow our scope by bracketing more precisely the date (or time) for which we want the data. If we only provide an upper boundary, then we shall have to scan and sort the full history since the beginning of ages.
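The freshness ranking from the OLAP subsection above can be tried directly; this sketch assumes SQLite 3.25 or later (the first release with window functions), and the table and sample data are again hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table hist_data (
        item_id     integer not null,
        record_date text    not null,
        price       real    not null,
        primary key (item_id, record_date)
    )""")
conn.executemany("insert into hist_data values (?, ?, ?)",
                 [(1, "2006-01-10", 9.99),
                  (1, "2006-03-05", 10.49),
                  (1, "2006-06-20", 10.99)])

# Rank rows by decreasing record_date and keep only the freshest one;
# the table is hit once, the sort being done by the window function.
row = conn.execute("""
    select x.record_date, x.price
    from (select row_number() over (partition by item_id
                                    order by record_date desc) as freshness,
                 record_date, price
          from hist_data
          where item_id = 1
            and record_date <= '2006-07-01') as x
    where x.freshness = 1""").fetchone()
print(row)   # → ('2006-06-20', 10.99)
```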
If data is collected at a high frequency, it is then reasonable to give a lower limit as well. If we succeed in restraining the "working set" of rows to a manageable size, we are back to the case in which we have relatively few historical values per item. If specifying both an upper boundary (such as the current date) and a lower boundary isn't an option, our only hope is partitioning per item; operating on a single partition will take us closer to the "large result set" situation.

6.8.3. Current Values

When we are predominantly interested in the most recent or current values, it is very tempting to design a way to avoid either the nested subquery or the OLAP function (both of which entail a sort), and hit the proper values directly. We mentioned in Chapter 1 that one solution to this problem is to associate each value with some "end date" (the kind of "best before" date you find on cereal boxes) and to say that for current values that end date is far, far away in the future (say, December 31, 2999). We also mentioned that there are some practical issues associated with such a design, and the time has now come to explore these issues. With a fixed date, it certainly becomes extremely easy to find the current value. Our query simply becomes:

    select whatever
    from hist_data
    where item_id = somevalue
      and record_date = fixed_date_in_the_future

We then hit the right row, spot on, through the primary key. And of course, nothing prevents us from using either subqueries or OLAP functions whenever we need to refer to a date other than the current one. There are, however, two main drawbacks to this approach, an obvious one and a more subtle one:
You must understand your data and your data distributions if you are to understand how the optimizer views your system.
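The fixed far-future "end date" lookup described in the Current Values section can be sketched as follows; the '2999-12-31' sentinel and the schema are illustrative assumptions:

```python
import sqlite3

CURRENT = "2999-12-31"   # sentinel "end date" marking the current row

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table hist_data (
        item_id     integer not null,
        record_date text    not null,
        price       real    not null,
        primary key (item_id, record_date)
    )""")
# Historical rows keep their real dates; the current value is stored
# under the far-future sentinel date.
conn.executemany("insert into hist_data values (?, ?, ?)",
                 [(1, "2006-01-10", 9.99),
                  (1, "2006-06-20", 10.49),
                  (1, CURRENT,      10.99)])

# Direct hit through the primary key: no subquery, no sort.
row = conn.execute("""
    select price from hist_data
    where item_id = ? and record_date = ?""", (1, CURRENT)).fetchone()
print(row[0])   # → 10.99
```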