Page 1346
It is never easy to gather user requirements. This is especially true when designing a data warehouse for people who have never seen a data warehouse. They usually can describe what they need for their OLTP systems because they generally match some defined workflow scheme. A data warehouse, on the other hand, is a nebulous subject. Many people can think of the questions they asked this morning, but they cannot extend that thinking to the questions they might be asking tomorrow or the next day. Therefore, if you can afford the disk space, be liberal with the information you include in your data warehouse. I have seen data warehouses that ran through massive back-processing tasks that took many weeks to complete because they did not bring down a certain piece of information from the OLTP system that a user decided was needed after the system went into production.
After you have a feel for the kind of information users want to see in your data warehouse, you need to start looking for quality sources of that information. Quality is a very important word here. Many organizations find that they have grown up with a variety of information systems performing similar functions. There may be dozens of lists of valid departments in these systems or even multiple calculations of profitability, each using separate algorithms and source data. Usually, the end users have to be the judges of which source is best. Be aware that even the best source of information may require you to develop clean-up algorithms to make the data useful (to eliminate duplication, for example).
After you map out your sources of information, you might want to look at additional information available in these systems that might be useful to your users. There is always a trade-off between storage capacity and providing information that might be needed in the future. If there are a few extra fields of data in the OLTP system records, though, it might be worth your effort. But remember that many OLTP systems store a lot of information that can be derived from other information it stores. This dates from the era when processing capacity was limited and programmers often stored extra information for easy retrieval while running large batch jobs. You do not need to bring down information that you can derive later.
The next issue to consider is that it might be quite a challenge to find information on some of your older OLTP systems. This is the most common problem I have run across when implementing
Page 1347
data warehouses. Many of these systems evolved over time. Many programmers and users involved in the development of the various pieces of these systems are no longer around. I have found problems such as fields being used for different purposes over the years to code value schemes changing to people deciding to reuse primary key values in the middle of the year (the controller decided to restart order numbers at 1 in the middle of the year). These types of problems can play havoc with your upload processing routines if you are trying to capture several years of past data into your data warehouse. What you wind up doing is modifying the scripts in the base table population routines to have a lot of if...then loops to handle the anomalies. My advice in this situation is to find whatever information you can about the OLTP systems and leave some time in the project schedule for the things that no one knows about.
At this point in the design process, you have a fairly decent handle on the information available in the source systems and the information you want to store in the data warehouse. You need to spend some time coming up with an efficient design for the information in the data warehouse. I have included a section later in this chapter called "Star Join Schema." This is one of the most common data storage formats used in data warehouses. If you are implementing a large data warehouse, you will have to ignore some of the normalization rules you have been taught. These rules were designed for efficient use of disk space, but they are not the fastest means of storing data for warehouse applications, and they result in a schema that is far too complex for average users to deal with when they write their ad-hoc queries. Although you can create views that give users a simpler presentation of data that is stored in a normalized schema, speed is a critical factor in larger data warehouses. Therefore, you need to spend a bit of time designing the tables in a manner optimized for data warehousing.
After you design the source and target databases, you need to start working on the data-transfer processes. You need to choose the technologies you will use to move around the information. On your OLTP systems, you often are limited in the tools that can access the information stored using some proprietary data-storage technology. In other areas, however, you have choices. You usually will want to use the tools you already are using and with which you are familiar. You might find cases in which you have to consider alternative technologies to get the processing speed required for larger data loads, however. Instead of using a COBOL program to load information into the Oracle database, for example, you may use Oracle's Loader product. The key factor in these decisions is balancing speed of processing against using the tools you want to have as part of your supported IS architecture.
At first, data extraction from the online system might seem to be one of the easiest parts of the overall data warehouse system. After all, it is something that you already are familiar with, and you know the development tools used on that platform. You already have identified the key data elements you want to extract. However, you might have to deal with a few issues, such as the following: