WHAT ASSUMPTIONS AND RISKS SHOULD I INCLUDE?

only for RuBoard - do not distribute or recompile

Every project plan contains assumptions, or it should. We never know everything we need to know at the start of a project, and so we have to make a guess. These guesses we call assumptions. Different projects require different assumptions, as do different customers. The difference between assumptions and risks is not at all clear in the context of a project plan, since we can mitigate a risk by making an assumption. For instance, we might regard as a risk the possibility that our senior solution architect may get killed by a passing truck, so we mitigate this by making the assumption that it won't happen. If you prefer to use a risk register, then this section is still relevant.

Here are some assumptions we should consider when planning a data warehouse project:

Data quality. If the project has a task that analyzes the data quality, all well and good. If it doesn't, then we have only the customer's word that the data quality is adequate. Experience shows that customers tend to overstate the quality of their data. This is not usually an attempt to hoodwink us but simply underlines the fact that they just don't know how bad their data is. Examples of poor data quality are missing data, duplicated data, and referential integrity errors. It is often the case, for instance, that there are several customer databases (23 is the most I have encountered). These have grown up over time, and each is there for a different purpose. Each database yields a little of the information that we need in the warehouse. Unfortunately, they are never consistent. Sometimes the different databases adopt different encoding systems, and the customer identifier in one system may be entirely different from the identifier in another. You might have thought that, if each system yields a piece of the puzzle, all we have to do is make a big table join and, presto, we've got all the information that we need. Sometimes this is true and sometimes it is not. It is not unusual to find that you need a mapping table like the one below to arrive at a consistent customer identifier in the data warehouse.

Customer Identifier Map
	Warehouse customer ID char(8)
	Order processing customer ID number(6)
	Sales rep customer ID char(6)
	Accounts payable customer ID char(6)

This kind of approach works, as long as there is a one-for-one relationship between the systems but, often, there isn't. The sales representative customer database might define a customer at a different level so that the daughter companies appear as individual customers, whereas the accounts payable system is only interested in the corporate customer, the one that pays the invoices. There is a further issue, which is that these disparate sources of information will not be consistent. Addresses will be different, information will be more up to date in some systems than it is in others, etc. These oddities contribute to the general issue surrounding data quality. Our customer is unlikely to recognize these inconsistencies. Where we have not put aside time in the project to analyze the data, then it pays to include an assumption about data quality that includes these points. Poor data quality can be a real show stopper in a data warehouse project.
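As a rough illustration of the mapping table and of the one-for-one problem described above, here is a minimal Python sketch. The row values, the `CUSTOMER_MAP` and `COLUMNS` names, and both helper functions are hypothetical and exist only for this example. It looks up warehouse customer IDs from a source-system ID and reports source IDs that map to more than one warehouse customer.

```python
from collections import defaultdict

# Hypothetical rows of the customer identifier map:
# (warehouse_id, order_processing_id, sales_rep_id, accounts_payable_id)
CUSTOMER_MAP = [
    ("W0000001", 100001, "SR0001", "AP0001"),
    ("W0000002", 100002, "SR0002", "AP0002"),
    # Two daughter companies known to the sales reps share one paying customer.
    ("W0000003", 100003, "SR0003", "AP0003"),
    ("W0000004", 100004, "SR0004", "AP0003"),
]

# Position of each source system's identifier within a row (assumed layout).
COLUMNS = {"order_processing": 1, "sales_rep": 2, "accounts_payable": 3}


def warehouse_ids_for(system, source_id, rows=CUSTOMER_MAP):
    """Return every warehouse customer ID mapped to a source-system ID."""
    col = COLUMNS[system]
    return sorted({row[0] for row in rows if row[col] == source_id})


def ambiguous_ids(system, rows=CUSTOMER_MAP):
    """Return source IDs that map to more than one warehouse customer ID."""
    col = COLUMNS[system]
    seen = defaultdict(set)
    for row in rows:
        seen[row[col]].add(row[0])
    return {src: sorted(ids) for src, ids in seen.items() if len(ids) > 1}
```

Running `ambiguous_ids("accounts_payable")` on this sample data flags `AP0003`, which is exactly the corporate-customer case in which a simple join would not give a single warehouse identifier.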

Data availability. We need to be able to get at the data we need, and sometimes that can be difficult. We've discussed the big issue of how to get changed data when describing the temporal problems in Chapter 4 but, oftentimes, the behavioral data can also be problematic. It is quite common for the behavioral data to be made available at some point during the overnight batch processing. A point in the batch cycle is identified as being appropriate for the data to be placed in a file so that it can be passed on to the data warehouse for processing. If the batch processing fails to deliver the data for any reason, then the warehouse cannot be updated. Although this sounds like an operational rather than a development issue, our customer will not accept the system unless the availability of data can be relied upon. Ensuring that this kind of problem does not happen is very difficult. It is better to include an assumption in the plan that places the responsibility for timely delivery of the data on the customer.
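Even with such an assumption in the plan, the warehouse load can at least detect a missing or stale extract before it starts. The following is a minimal sketch, assuming the extract arrives as a single file. The function name, the `max_age_seconds` parameter, and the freshness rule are illustrative assumptions, not part of any particular tool.

```python
import os
import time


def extract_is_ready(path, max_age_seconds, now=None):
    """Return True if the extract file exists, is non-empty, and is recent."""
    now = time.time() if now is None else now
    try:
        info = os.stat(path)
    except FileNotFoundError:
        # The batch cycle never delivered the file.
        return False
    if info.st_size == 0:
        # A zero-length file usually means the extract step failed part way.
        return False
    # Reject a file left over from a previous night's run.
    return (now - info.st_mtime) <= max_age_seconds
```

A load job could call this before processing and raise an alert instead of silently loading yesterday's data.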

Overnight processing window. This is a similar problem to the previous one. We have a responsibility to make the data warehouse available for a certain period each day, say, from 8:00 a.m. to 8:00 p.m. We need a certain number of processing cycles, in the form of an overnight window, to do all our loading and housekeeping outside those hours. Most of the time there may be no problem, but there are occasions in all operational data centers when time becomes very short. For instance, at month end, quarter end, and, especially, year end, extra data processing has to be done as part of the normal course of doing business. When this happens, we get squeezed. Also, when things go wrong in the mission-critical overnight processing and suites of software have to be rerun, preceded by lengthy database restores, the batch processing overruns into our time window and we get squeezed again. As before, it is extremely difficult to code for this eventuality, and the best thing is to make a broad assumption at the outset which states that our overnight processing window will be kept open.
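The squeeze can be made concrete with a little arithmetic. The sketch below is only an illustration. The function name and the example times are assumptions, with the 8:00 a.m. opening time taken from the example above. It checks whether a load of a given duration still finishes before the warehouse has to open.

```python
from datetime import datetime, timedelta


def load_fits_window(batch_finished, load_duration, warehouse_opens):
    """Return (fits, slack): whether the load ends by opening time, and the margin."""
    load_ends = batch_finished + load_duration
    slack = warehouse_opens - load_ends
    return slack >= timedelta(0), slack


# Hypothetical figures: the warehouse opens at 8:00 a.m. and the load takes 4 hours.
opens = datetime(2024, 1, 2, 8, 0)
load = timedelta(hours=4)

# A normal night: the source batch hands over at 2:00 a.m.
normal = load_fits_window(datetime(2024, 1, 2, 2, 0), load, opens)

# A month-end night: the source batch overruns until 5:30 a.m.
month_end = load_fits_window(datetime(2024, 1, 2, 5, 30), load, opens)
```

On the normal night there are two hours to spare. On the month-end night the slack is negative by ninety minutes, which is the situation the assumption is meant to cover.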

Business sponsor. It is vital that there is a prominent sponsor, or project champion, for the project within the business community. This person must own the project on behalf of the customer. It is not possible to overstate the importance of this role to the ultimate success of the project. This person must be enthusiastic about the project and also must be empowered to make decisions on its behalf. The sponsor must continue to be the project champion throughout the life of the project. If the project sponsor leaves the project for any reason, this poses a serious threat to the success of the project. New sponsors coming in partway through a project rarely have the same level of commitment as those who have been involved from the beginning. It is well worth making an assumption that the customer's project sponsor will remain in place throughout the life of the project.

Source system knowledge. This can be another big issue. It is especially a problem where the source systems are getting a bit old and where they were originally developed in-house. Most of the original designers and developers will have moved on. The system will have been enhanced and tweaked over its years of service, and the documentation, such as it was, will not have been rigorously kept up to date. In short, no one knows much about it anymore. There are several extract files, or places in the processing cycle where extract files may be obtained, but there is no one who can provide definitive descriptions of the data fields in the records. Incidentally, this can be equally true of quite new systems, especially where there is a large packaged systems component. Finding out just what is in these systems can be a nightmare. It's a good idea to make the assumption that the customer can provide people who fully understand its systems.
