PROBLEMS WHEN USING RELATIONAL DATABASES

only for RuBoard - do not distribute or recompile

It has been stated that the relational model supports the requirements of data warehousing, and it does. There are, however, a number of areas where the relational model struggles to cope. As we are coming to the end of our introduction to data warehousing, we'll conclude with a brief look at some of these issues.

Problems Involving Time

Time variance is one of the most important characteristics of data warehouses. In the section on Building the Data Warehouse, we commented on the fact that there appeared to be a certain amount of data redundancy in the warehouse because we were duplicating some of the information, for example, Customers' details, which existed in the operational systems. The reason we have to do this is because of the need to record information over time.

As an example, when a customer changes address we would expect that change to be recorded in the operational database. When we do that we lose the old address. So when a query is next executed where that customer's details are included, any sales of wine, for that customer, will automatically be attributed to the new address. If we are investigating sales by area, the results will be incorrect ( assuming the customer moved to a different area) because many of the sales were made when the customer was in another area.

That's also the reason why we don't delete customers' details from the data warehouse simply because they are no longer customers. If they have placed any orders at all, then they have to remain within the system.

True temporal models are very complex and are not well supported at the moment. We have to introduce techniques such as start dates and end dates to ensure that the data warehouse returns accurate results.

The problems surrounding the representation of time in data warehousing are many. They are fully explored in Chapter 4.

Problems With SQL

SQL is based on set theory. It treats tables as sets and returns its results in the form of a set. There are cases where the use of procedural logic would improve functionality and performance.

Ranking/Top (n)

While it is possible to get ranked output from SQL, it is difficult to do. It involves a correlated subquery, which is beyond the capability of most SQL users. It is also very time-consuming to execute. Some individual RDBMS vendors provide additional features to enable these types of queries to be executed, but they are not standardized. So what works on one RDBMS probably won't work on others.

Top n Percent

It is not practically possible, for instance, to get a list of the top 10 percent of customers who place the most orders.

Running Balances

It is impossible , in practical terms, to get a report containing a running balance using standard SQL. If you are not clear what a running balance is, it's like a bank statement that lists the payments in one column, receipts in a second column, and the balance, as modified by the receipts and payments, in a third or subsequent column.

Complex Arithmetic

Standard SQL provides basic arithmetic functions but does not support more complex functions. The different RDBMS vendors supply their own augmentations but these vary. For instance, if it is required to raise a number by a power, in some systems the power has to be an integer, while in others it can be a decimal. Although data warehouses are used for the production of statistics, standard statistical formulas such as deviations and quartiles, as well as standard mathematical modeling techniques such as integral and differential calculus, are not available in SQL.

Variables

Variables cannot be included in a standard SQL query.

Almost all of these, and other deficiencies can be resolved by writing 3GL programs such as C or COBOL with embedded SQL. Also most RDBMS vendors provide a procedural extension to their standard SQL product to assist in resolving the problems. However, the standard interface between the products that are available at the presentation layer and the RDBMS is a standard called ODBC, which stands for open database connectivity. ODBC, and the more recent JDBC (Java database connectivity) is very useful because it has forced the industry to adopt a standard approach. It does not, at the time of this writing, support the procedural extensions that the RDBMS vendors have provided.

It is worth noting that some of these issues are being tackled. We explore the future in Chapter 11.

only for RuBoard - do not distribute or recompile