General Data Warehouse Guidelines

 <  Day Day Up  >  

Be sure to understand and apply the following general guidelines when you work on a data warehouse project. These guidelines, though not specific to DB2, are beneficial for building effective data warehouses.

Do Not Implement a Data Warehouse As a Panacea

Many data warehouse development projects begin with "pie in the sky" expectations. One of the biggest problems with a data warehouse project is a situation in which the data warehouse is viewed as a "magic bullet" that will solve all of management's information problems.

To alleviate these types of problems, you should manage expectations by securing an executive sponsor, limiting the scope of the project, and implementing the data warehouse in stages (or possibly by implementing multiple data marts for each department).

Incorporate All Three Sides of the Pyramid

When you're developing a data warehouse, be sure to include tools, people, and methods in your warehouse blueprint. Refer to Figure 45.7. A successful data warehouse requires

  • People, including full-time employees , subject matter expert consultants , long-term partners , and short- term product support personnel.

  • Methods, well-developed policies and procedures for data warehouse planning, architecture, construction, management, and on-going production support.

  • Tools, to facilitate data movement, transformation, replication, propagation, data cleansing, business intelligence and querying, and metadata management.

Figure 45.7. The data warehousing pyramid of success.

graphics/45fig07.gif


Too often, the focus is solely on the tools component. To be successful, a data warehouse project requires more than just tools. You need careful planning and implementation (methods) as well as a means to learn from the efforts of others (people) through mentoring, consulting, education, seminars , and user groups.

Do Not Mix Operational Needs into the Data Warehouse Project

When a data warehousing project is first initiated, it may have a mixture of operational and analytical/informational objectives. This mixture is a recipe for disaster. Redefine the project to concentrate on non-operational, informational needs only. The primary reason for the existence of the data warehouse in the first place is to segregate operational processing from reporting.

Do Not Underestimate the Complexity of Implementing a Data Warehouse

Moving data into a data warehouse is a complex task. Detailed knowledge of the applications accessing the source databases that feed the data warehouse must be available. Be sure to allot development time for learning the complexities of the source systems. Frequently, the systems documentation for the production system is inadequate or non-existent.

Additionally, be sure to analyze the source data to determine what level of data scrubbing is required. As I mentioned earlier, this process can be an immense, time-consuming task.

Prepare to Manage Data Quality Issues Constantly

Maintaining data quality will be an ongoing concern. Both the end users and the data warehouse construction and maintenance team are responsible for promoting and fostering data quality. Data problems will be discovered not only throughout the development phase of the data warehouse, but throughout the useful life of the data warehouse.

Be sure to establish a policy for how data anomalies are to be reported and corrected before the data warehouse is made generally available to its end users. Additionally, be sure to involve the end users in the creation and support of this policy; otherwise , it is doomed to fail. The end users understand the data better than anyone else in the organization, including the data warehouse developers and DBAs.

Do Not Operate in a Vacuum

As business needs change, operational systems change. When operational data stores change, the data warehouse will be affected as well. When a data warehouse is involved, however, both the operational database and the data warehouse must be analyzed for the impact of changing any data formats. This is true because the data warehouse stores historical data that you might not be able to change to the new format. Before the change is made to the operational system, the data warehouse team must be prepared first to accept the new format as input to the data warehouse, and second, to either maintain multiple data formats for the changed data element or to implement a conversion mechanism as part of the data transformation process. Conversion, however, can result in lost or confusing data.

Prepare to Tackle Operational Problems During the Data Warehousing Project

You will encounter problems in operational systems that feed the data warehouse. These problems may have been in production for years , running undetected. The data warehousing project will uncover many such errors. Be prepared to find them and have a plan for handling them.

Only three options are available:

  • Ignore the problem with the understanding that the problem will exist in the data warehouse if not corrected.

  • Fix the problem in the operational system.

  • If possible, fix the problem during the data transformation phase of data warehouse population.

Of course, the second and third options are the favored approaches.

Determine When Data Is to Be Purged

Even in the data warehouse environment, when certain thresholds are reached, maintaining certain data in the data warehouse does not make sense. This situation may occur because of technology reasons (such as reaching a capacity limit), regulatory reasons (change in regulations or laws), or business reasons (restructuring data, instituting different processes and so on).

Plan to arrange for methods of purging data from the data warehouse without dropping the data forever. A good tactic is to prepare a generic plan for offloading warehouse data to tape or optical disk.

If you create a data warehouse without a data purging plan, be prepared to manage very large databases as the data warehouse grows uncontrollably. Several vendors (Princeton Softech and IBM) offer products that assist with purging and archiving data from DB2 databases. You might want to consider such a product if you have a significant need to archive DB2 data.

Consider Using Denormalization Strategies

Experiment with denormalized tables. Because the data warehouse is a read-only database, you should optimize query at the expense of update. Denormalization takes care of this situation. Analyze the data access requirements of the most frequent queries, and plan to denormalize to optimize those queries.

Refer to Chapter 5, "Data Definition Guidelines," for an in-depth discussion on the types of denormalization.

Be Generous with Indexes

The use of indexes is a major factor in creating efficient data retrieval. You usually can use indexes more liberally in the read-only setting of the data warehouse. Remember, though, you must make a trade-off between data loading and modification and the number of indexes, as shown in Figure 45.8.

Figure 45.8. Indexes and the performance of query versus modification.
graphics/45fig08.gif

These indexes do not have to be the same indexes that exist in the operational system, even if the data warehouse is nothing more than an exact replica or snapshot of the operational databases. You should optimize the indexes based on the access patterns and query needs of the decision support environment of the data warehouse.

Back Up the Data Warehouse

Putting in place a backup and recovery plan for data warehouses is imperative. Even though most of the data comes from operational systems originally, you cannot always rebuild data warehouses in the event of a media failure (or a disaster). As operational data ages, it is removed from the operational databases, but it may still exist in the data warehouse. Furthermore, data warehouses often contain external data that, if lost, may have to be purchased again (creating a financial drain).

Follow "The 10 Steps to Clean Data"

The following list is a short compendium of the top 10 things you can do to ensure data quality in your data warehouse environment:

  1. Foster an understanding for the value of data and information within the organization. In short, treat data as a corporate asset. What does this mean? Consider the other assets of your organization. The capital assets ($) are modeled using a chart of accounts. Human resources (personnel) are modeled using management structures, reporting hierarchies, and personnel files. From building blueprints to item bills of material, every asset that is truly treated as an asset is modeled . If your corporation does not model data, it does not treat data as an asset and is at a disadvantage .

    Acceptance of these ideals can be accomplished through lobbying the users and managers you know, starting an internal newsletter, circulating relevant articles and books throughout your company, and treating data as a corporate asset yourself. A great deal of salesmanship, patience, politics, and good luck will be required, so be prepared.

  2. Never cover up data integrity problems. Document them and bring them to the attention of your manager and the users who rely on the data. Usually, the business units using the data are empowered to make changes to it.

  3. Do not underestimate the amount of time and effort that will be required to clean up dirty data. Understand the scope of the problem and the process required to rectify it. Take into account the politics of your organization and the automated tools that are available. The more political the battle, the longer the task will take. The fewer tools available, the longer the task will be. Even if you have tools, if no one understands them properly, the situation will probably be worse than having no tools at all as people struggle to use what they do not understand.

  4. Understand what is meant by "data warehouse" within the context of your projects. What is the scope of the "warehouse": enterprise or departmental? What technology is used? If OLAP is a component of the environment, is it ROLAP or MOLAP?

  5. Educate those people implementing the data warehouse by sending them to courses and industry conferences, purchasing books, and encouraging them to read periodicals. A lack of education has killed many potentially rewarding projects.

  6. Physically design the data stores for the data warehouse differently than the similar, corresponding production data stores. For example, the file and table structures, indexes, and clustering sequence should be different in the warehouse because the data access requirements are different.

  7. You will often hear that denormalization is desirable in the data warehouse environment, but proceed with caution. Because denormalized data is optimized for data access, and the data warehouse is "read-only", you might think that denormalization is a natural for this environment. However, the data must be populated into the data warehouse at some point. Denormalized data is still difficult to maintain and should be avoided if performance is acceptable.

  8. Understand the enabling technologies for data warehousing. Replication and propagation are different technologies with different availability and performance effects on both the production (OLTP) and the warehouse (OLAP) systems.

  9. Only after you understand the basics should you delve into the more complex aspects of data warehousing such as implementing an ODS, very large databases, or multidimensional databases.

  10. Reread steps 1 through 9 whenever you think you are overworked, underpaid, or both!

Data in the warehouse is only as good as the sources from which it was gleaned. Failure to clean dirty data can result in the creation of a data outhouse instead of a data warehouse.

 <  Day Day Up  >  


DB2 Developers Guide
DB2 Developers Guide (5th Edition)
ISBN: 0672326132
EAN: 2147483647
Year: 2004
Pages: 388

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net