

1.8. The Difficulties of Historical Data

Working with historical data is an extremely common condition (the process of valuation, or specifying the price of goods or a service at a particular point in time, is based on historical data), but one of the really difficult issues of relational design is the handling of data that is associated with some period (as opposed to point) of time.

There are several ways to model historical data. Let's assume that we want to record the successive prices of goods identified by some article_id. An obvious way to do so is to store the following items:

     (article_id, effective_from_date, price)

where effective_from_date is the date when the new price takes effect, and the primary key of the historical table is (article_id, effective_from_date).
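As a concrete illustration, here is a minimal sketch of this design in SQLite (table and column names follow the text; the sample article and prices are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table articles (
    article_id   integer primary key,
    article_name text not null
);
create table price_history (
    article_id          integer not null references articles,
    effective_from_date text    not null,  -- ISO dates compare correctly as text
    price               numeric not null,
    primary key (article_id, effective_from_date)
);
-- Invented sample data: one article, two successive prices.
insert into articles values (1, 'widget');
insert into price_history values (1, '2024-01-01', 9.90);
insert into price_history values (1, '2024-07-01', 10.50);
""")
```

The composite primary key allows any number of prices per article, but at most one per effective date.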

Logically correct, this type of model is rather clumsy to use when working with current data, which in many cases will be our main concern. How are we going to identify the current value? It's the one associated with the highest effective_from_date, and it will be retrieved by running a query looking like:

     select a.article_name, h.price
     from articles a,
          price_history h
     where a.article_name = some_name
       and h.article_id = a.article_id
       and h.effective_from_date =
          (select max(b.effective_from_date)
           from price_history b
           where b.article_id = h.article_id)

Executing this query requires two passes over the same data: one in the inner query to identify the most recent date we have for a given article, and one in the outer query to return the price from a row that we have necessarily hit in the inner query (Chapter 6 talks about special functions implemented by some DBMS products that can avoid, to some extent, multiple passes). Executing repeated queries following this pattern can prove very costly.
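The double scan is easy to see in a small runnable sketch (SQLite here; the article and prices are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table articles (article_id integer primary key, article_name text);
create table price_history (
    article_id integer, effective_from_date text, price numeric,
    primary key (article_id, effective_from_date));
insert into articles values (1, 'widget');
insert into price_history values (1, '2023-01-01', 9.90);
insert into price_history values (1, '2024-01-01', 10.50);
""")

# The two-pass query from the text: the inner query scans price_history
# to find the latest date, and the outer query hits that same row again.
row = conn.execute("""
    select a.article_name, h.price
    from articles a, price_history h
    where a.article_name = ?
      and h.article_id = a.article_id
      and h.effective_from_date =
          (select max(b.effective_from_date)
           from price_history b
           where b.article_id = h.article_id)
""", ("widget",)).fetchone()
# row -> ('widget', 10.5)
```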

However, the choice of how to register the validity period for a price is arbitrary. Instead of storing the effective date from which the price applies, why not store the "end date" (i.e., the last date on which the current price prevails), identifying the time intervals by their upper bound instead of by their lower bound?

This new approach may look like an attractive solution. You have two ways to define current values: either the end date is undefined, which looks neat but isn't necessarily a good idea, or the end date is something like December 31, 3000.
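A minimal sketch of this convention in SQLite (names and data invented), with December 31, 3000 standing in for "still current":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table price_history (
    article_id integer, end_date text, price numeric,
    primary key (article_id, end_date));
insert into price_history values (1, '2023-12-31', 9.90);   -- expired price
insert into price_history values (1, '3000-12-31', 10.50);  -- current price
""")

# Single pass: the "doomsday" date takes us straight to the current row.
price = conn.execute(
    "select price from price_history where article_id = ? and end_date = '3000-12-31'",
    (1,)).fetchone()[0]
```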

It's quite obvious that looking for the price of an article as of December 31, 3000 will take you directly to the row you want, in a single pass. Definitely attractive. Is this the perfect solution? Not quite. There may be some practical worries with the optimizer, which I discuss in Chapter 6, but there is also a major logical issue: prices, as any consumer knows, rarely stay constant, and price increases are not usually decided instantly (financial environments may be a different story). What happens when, for example, in October, new prices are decided for the next year and duly recorded in the database?

What we get in our valuation table are two records for each item: one stating the current price, valid until December 31, and one giving the price that will be applied from January 1. If we store the first date when the price applies we will have one row with an effective_from_date in the past (for instance January 1 of the current year) and another one in the future (say, the next January 1). In effect, what will define the current price is not the highest date, but the highest date before today (returned in Oracle by the system function sysdate). The preceding query needs to be modified only slightly:

     select a.article_name, h.price
     from articles a,
          price_history h
     where a.article_name = some_name
       and h.article_id = a.article_id
       and h.effective_from_date =
          (select max(b.effective_from_date)
           from price_history b
           where b.article_id = h.article_id
             and b.effective_from_date <= sysdate)
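A runnable sketch of this situation (SQLite, where `date('now')` plays the role of Oracle's sysdate; the article and prices are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table articles (article_id integer primary key, article_name text);
create table price_history (
    article_id integer, effective_from_date text, price numeric,
    primary key (article_id, effective_from_date));
insert into articles values (1, 'widget');
insert into price_history values (1, '2020-01-01', 9.90);   -- current price
insert into price_history values (1, '2999-01-01', 12.00);  -- pre-recorded future price
""")

# The future row must be filtered out before taking the max,
# otherwise we would return the price that is not yet in force.
row = conn.execute("""
    select a.article_name, h.price
    from articles a, price_history h
    where a.article_name = 'widget'
      and h.article_id = a.article_id
      and h.effective_from_date =
          (select max(b.effective_from_date)
           from price_history b
           where b.article_id = h.article_id
             and b.effective_from_date <= date('now'))
""").fetchone()
```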

If we store the last day when the price applies, we will have one row with an end_date set to December 31 and another with end_date set either to null or doomsday. Expressing that we want the price for which the end_date is the smallest date after the current date is no obvious improvement on the query just shown.
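The mirror-image query under the end-date scheme might look like this sketch (SQLite, invented data); it needs the same kind of subquery, just with min and the opposite inequality:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table price_history (
    article_id integer, end_date text, price numeric,
    primary key (article_id, end_date));
insert into price_history values (1, '2999-12-31', 9.90);   -- current price, ends at year-end 2999
insert into price_history values (1, '3000-12-31', 10.50);  -- pre-recorded next price
""")

# The current price is the one whose end_date is the smallest
# date that has not yet passed.
price = conn.execute("""
    select h.price
    from price_history h
    where h.article_id = 1
      and h.end_date =
          (select min(b.end_date)
           from price_history b
           where b.article_id = h.article_id
             and b.end_date >= date('now'))
""").fetchone()[0]
```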

Denormalization is of course a possible solution: one can imagine storing both the date when a price becomes effective and the date when it ceases to be, or one could also argue for storing the effective_from_date and the number of days for which that price applies. This would allow using either the start or the end of the period, as best suits the query.
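With both bounds stored, the current-price lookup becomes a simple range predicate, as in this sketch (SQLite, invented names and data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table price_history (
    article_id integer, effective_from_date text, end_date text, price numeric,
    primary key (article_id, effective_from_date));
insert into price_history values (1, '2020-01-01', '2999-12-31', 9.90);
insert into price_history values (1, '3000-01-01', '3000-12-31', 12.00);
""")

# No subquery at all: the current price is the row whose period
# brackets today's date.
price = conn.execute("""
    select price
    from price_history
    where article_id = 1
      and date('now') between effective_from_date and end_date
""").fetchone()[0]
```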

Denormalization always implies taking a risk with data integrity: a minor date entry error can leave black holes during which no price is defined. You can of course minimize the risk by adding more checks when data is inserted or updated, but there is always a performance penalty associated with such checks.
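As an illustration, here is a hypothetical consistency check over such a denormalized table, flagging gaps between consecutive price periods (SQLite; all names and data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table price_history (
    article_id integer, effective_from_date text, end_date text, price numeric);
-- Deliberate entry error: the second period starts two days
-- after the first one ends, leaving July 1 without a price.
insert into price_history values (1, '2024-01-01', '2024-06-30', 9.90);
insert into price_history values (1, '2024-07-02', '2024-12-31', 10.50);
""")

# Pair each period with the one that follows it, and flag cases where
# the next period does not start the day after the previous one ends.
gaps = conn.execute("""
    select h.article_id, h.end_date, n.effective_from_date
    from price_history h, price_history n
    where n.article_id = h.article_id
      and n.effective_from_date =
          (select min(b.effective_from_date)
           from price_history b
           where b.article_id = h.article_id
             and b.effective_from_date > h.effective_from_date)
      and n.effective_from_date <> date(h.end_date, '+1 day')
""").fetchall()
```

Running such a check on every insert or update is exactly the kind of safeguard whose cost has to be weighed against the risk.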

Another possible solution is to have a current table and a historical table and plan a migration of rows from current to historical when prices change. This approach can suit some kinds of applications, but may be complicated to maintain. Moreover, the "pre-recording" of future prices fits rather badly into the picture.
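A sketch of the two-table approach (SQLite; the `change_price` helper and all names are invented): when a price changes, the superseded price is copied into the history table together with the date on which it was superseded:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table current_prices (
    article_id integer primary key,
    price      numeric not null
);
create table price_history (
    article_id      integer,
    superseded_date text,
    price           numeric,
    primary key (article_id, superseded_date)
);
insert into current_prices values (1, 9.90);
""")

def change_price(conn, article_id, new_price, change_date):
    # Move the old current price into history, then overwrite it.
    (old_price,) = conn.execute(
        "select price from current_prices where article_id = ?",
        (article_id,)).fetchone()
    conn.execute("insert into price_history values (?, ?, ?)",
                 (article_id, change_date, old_price))
    conn.execute("update current_prices set price = ? where article_id = ?",
                 (new_price, article_id))
    conn.commit()

change_price(conn, 1, 10.50, '2024-07-01')
```

Current-price lookups are now single-row primary-key accesses, but there is no natural place in this scheme to park a price that has been decided and not yet applied.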

In practice, particular storage techniques such as partitioning, which I discuss in Chapter 5, will come to the rescue, making constructs such as the one using the effective_from_date less painful than they might otherwise have been, especially for mass processing.

But before settling for one solution, we must acknowledge that valuation tables come in all shapes and sizes. For instance, those of telecom companies, which handle tremendous amounts of data, have a relatively short price list that doesn't change very often. By contrast, an investment bank stores new prices for all the securities, derivatives, and any type of financial product it may be dealing with almost continuously. A good solution in one case will not necessarily be a good solution in another.

Handling data that both accumulates and changes requires very careful design and tactics that vary according to the rate of change.




The Art of SQL
ISBN: 0596008945