REQUIREMENTS OF THE CONCEPTUAL MODEL

only for RuBoard - do not distribute or recompile

Before proceeding, the general requirements of a conceptual data model for data warehousing are restated succinctly. The model must provide for the following:

Be simple to understand and use by nontechnical people
Support the GCM
Support time

The most widely used data models are entity relationship (ER) or sometimes, extended entity relationship (EER) methods . These models are widely understood by IT professionals but are not easy for non-IT people to understand. A properly produced ER model contains quite a lot of syntax when you think about it. We have rules regarding cardinality, participation conditions, inclusive and exclusive relationships, n -ary relationships, and entity supertypes and subtypes . These syntax rules are necessary for models for operational systems but, in many respects, the diagrammatic requirements for dimensional data warehouses are simpler than traditional ER models. Take for the moment the rules for dimensional models: These are what we will use to model customer behavior in the GCM.

The structure of a dimensional model is predictable. There is a single fact table at the center of the model. The fact table has one or more dimension tables related to it. Each dimension will have zero, one, or more hierarchical tables related to it.
The relationships are not usually complex. Relationships are always one to many in a configuration where the dimension is at the one end of the relationship and the fact table is the many end. Where dimensional hierarchies exist, the outer entity (farthest from the fact table) is the one end and the inner entity ( nearest to the fact table) is the many end.
One-to-one and many-to-many relationships are rare, although this changes when time is introduced. There is no real need to model the cardinality (degree) of the relationships.
The participation conditions do not need to be specified. The dimension at the one end of the relationship always has optional participation. The participation condition for the dimension, or fact, at the many end is always mandatory.
Entity super/subtypes do not feature in dimensional models.
There is no requirement to name or describe the relationships as their meaning is implicit. It is important to show how the dimensional hierarchies are structured, but that is the only information that is needed to describe relationships.
There is no requirement for the fact table rows to have a unique identifier.
There is no requirement to model inclusive nor exclusive relationships.

The additional rules and notations that are required to support the features in the list above are, therefore, not appropriate for dimensional data warehouses.

There is a further consideration. We know that data warehouses are not designed to support operational applications such as order processing or stock control. They are designed to assist business people in decision making. It is important, therefore, that the data warehouse contains the right information. Often, the business people are unable to express clearly their requirements in information terms. It is frequently the case that they feel they have a problem but are unsure where the problem lies. This issue was brought out in the introduction to the Wine Club case study in Chapter 2.

Most business managers have a set of business objectives. These can be formally defined key performance indicators (KPIs), against which their performance is measured, or they can be more informal, self-imposed, objectives. A data warehouse can be designed to help them to achieve their business objectives if they are able to express them clearly and to describe the kind of information they need to help make better decisions in pursuit of their objectives. One method that leads business managers through a process of defining objectives and subsequent information requirements is described later on in this chapter.

What is needed is an abstraction that allows for the business requirements to be focused upon in a participative fashion. The business people must be able to build, validate, modify, or even replace the model themselves . However, in addition, the model must be powerful enough to enable the technical requirements of each data object to be specified so that the data warehouse designers can go on to develop the logical model.

Later on we will introduce the dot modeling notation as a model for capturing information requirements in a way that business people can understand. There exists a fundamental requirement that the people who use the data warehouse must understand how it is structured.

Some business applications need to be supported by complex data models. Data modeling specialists are comfortable with this complexity. Their role in the organization, to some extent, depends on their being able to understand and navigate complex models. The systems are usually highly parametric in nature, and users tend to be shielded from the underlying complexity by the human computer interface. The users of data warehouses are business people such as marketing consultants . Their usage of the warehouse is often general and, therefore, unpredictable, and it is neither necessary nor desirable to shield them from the structure of it. The conceptual model should be easy to understand by nontechnical people to the extent that, with very little training, such people could produce their own models. If it is accepted that the dimensional model, due to its simplicity, is an appropriate method to describe the information held in a data warehouse, then it would be sensible to ensure that the simplicity is maintained and that the model does not add complexity.

Another requirement of the conceptual model is that it should retain its dimensional shape. Again, having achieved a model that is readily understood, we should try to ensure that the essential radial shape is retained even in relatively complex examples.

Also, there is a need within the model to record business semantic information about each of the attributes. This means that some additional, supporting information about attributes will have to be recorded in any case. For the purpose of keeping the conceptual model simple, it seems sensible to incorporate the temporal aspects into the same supporting documents.

The Treatment of Behavior

The temporal requirements of behavioral data are very straightforward. Once recorded, they do not change. They do not, therefore, have any form of lifespan. The behavioral facts are usually derived from an entity that has attained a particular state. The attainment of the state is caused by an event, and the event will occur at a particular point in time. So one of the attributes of the fact will be a time attribute. The time attribute records the time of the occurrence of the event. The time attribute will be used in two ways:

As part of the selection constraint in a query
As an aid to joining the dimensions to higher-level groupings

For each event or change there is, as has been said, an associated time. Different applications have different requirements with respect to the grain of time. Some applications require only a fairly coarse grain such as day. These might include:

Companies selling car insurance
Banks analyzing balances
Supermarkets analyzing product-line sales

Some other applications might wish to go to a finer grain, perhaps hours. These might include:

Supermarkets analyzing customer behavior
Transport organizations monitoring traffic volumes
Meteorological centers studying weather patterns

Other applications would require a still more fine grain of time to say, seconds. An example of this is the telecommunication industry monitoring telephone calls. Another requirement of the data model, therefore, is that it needs to show clearly the granularity of time pertaining to the model.

Some applications require more than a single grain of time. Consider the examples above. The supermarket example shows a different requirement with respect to time for product analysis as distinct from customer behavior analysis. In any case, almost all organizations conducting dimensional analysis will require the ability to summarize information, from whatever the granularity of the base event, to a higher (coarser) level in order to extract trend information.

The Treatment of Circumstances ”Retrospection

In a first-generation dimensional data warehouse, the way in which an attribute of a dimension is treated, with respect to historical values, depends entirely upon the requirements to partition the facts in historical terms. The key role of dimensional attributes is to serve as the source of constraints in a query. The requirements for historical partitioning of dimensional attributes for, say, dimension browsing have been regarded as secondary considerations. Principally, the dimensional attributes exist to constrain queries about the facts. Even though some 80 percent of queries executed are dimension browsing, the main business purpose for this is as a process of refinement of the fact table constraints. It is only comparatively recently, since the advent of CRM, that the dimensions themselves have been shown to yield valuable information about the business such as the growth in customers, etc. In some industries ( especially telecommunications and retail banking), this is seen as the largest business imperative at the present time. Our GCM, being a customer-centric model, enables amazingly rich and complex queries to be asked about customers' circumstances, far beyond a traditional dimensional model. The GCM, therefore, imposes the need for a much greater emphasis on the nondimensional aspects of the customer, that is, the circumstances and derived segments.

In recognition of the need to place more emphasis on the treatment of history within the GCM, we have to examine the model in detail in order to assess how each of the various elements should be classified .

Each component of the model that is subject to change will be evaluated to assess the way in which past (historical) values should be treated. When we refer to a component we mean:

Entity: a set of circumstances or a dimension (e.g., customer details or product dimension)

Relationship: for example, a hierarchy

Attribute: for example, the customer's address

Each component will then be given a classification. This is called the retrospection ^[1] of the component. Retrospection has three possible values:

^[1] Retrospection means literally looking back into the past.

True
False
Permanent

True retrospection means that the object will reflect the past faithfully. It enables queries to return temporal subsets of the data reflecting the partitioning of historical values. Each dimension, relationship, and attribute value will, in effect, have a lifespan that describes the existence of the object. An object may have a discontinuous lifespan, that is, many periods of activity, punctuated by periods of inactivity. True retrospection is the most accurate portrayal of the life of a data warehouse object.

False retrospection means that the view of history will be altered when the object's value changes. In simple terms, when changes occur, the old values will be overwritten and are, therefore, lost. It is as though the old values had never existed.

Permanent retrospection means that the value of the object will not change over time.

Let us now explore how the various values for retrospection apply to dimensions, relationships, and attributes.

Retrospection in Entities

So far as entities are concerned , the value for retrospection relates to the existence of the dimension. For instance, the existence of a customer starts when the customer first orders a product from the Wine Club, and the existence ends when the customer becomes inactive.

Retrospection = true for entities means that the lifespan of the entity consists of one or more time intervals. A single customer may not have a single, continuous lifespan. It is also true of other entities such as the wine dimension. A wine may be available for intervals of time spanning many years , or the entire lifespan may be punctuated by periods when the wine is not available. An example of this would be the Beaujolais Nouveau, which, for some reason, is very popular when first available in November each year but must be consumed quickly, as it soon deteriorates. As this wine is not available for 10 months out of each year, it is reasonable to say that the lifespan of this wine is discontinuous.

Retrospection = false for entities means that the current state only, of the existence of the entities, is recorded. An example from the Wine Club would be the supplier dimension. There may be a need to be able to distinguish between current suppliers and previous suppliers. There is no requirement to record the intervals of time when a supplier was actually supplying wine to the Wine Club as distinct from the intervals of time when they were not.

Retrospection = permanent for entities means that the entity exists forever. The concept of existence does not, therefore, apply. An example from the Wine Club would be the region dimension. Regions, which represent the wine-growing areas of the world, are unlikely to disappear once created.

Retrospection in Relationships

In a dimensional model, the degree of snapshot relationships is always one to many. When the relationship becomes temporal, due to the need for true retrospection on the relationship, the degree of the relationship may change to that of many to many. This has been described in detail in Chapter 4.

It is important that this information is recorded in the model without introducing significant complexity into the model. The essential simplicity of the model must not be dominated by time, while at the same time it needs to be straightforward for designers to determine, quickly and precisely, the degree to which support for temporal relationships is required.

The situation with relationships is similar to that of entities. It is the requirement with respect to the existence, and lifespan, of a relationship that determines the value for retrospection in relationships.

Retrospection = true for relationships means that the lifespan of each relationship must be recorded and kept so that the results from queries will faithfully reflect history. An example of this in the Wine Club is the relationship between customer and sales area. If a customer moves from one sales area to another, it is important that the previous relationships of that customer with sales areas are retained.

Retrospection = false for relationships means that only the current relationship needs to be recorded. There is no need for the system to record previous relationships. A true view of history is not required. An example of this, within the Wine Club, is the relationship between a hobby and a customer. If Lucie Jones informs the club, through the periodic data update process, that her hobby has changed from, say, horse riding to choral singing , then the new hobby replaces the old hobby and all record of Lucie's old hobby is lost.

Retrospection = permanent for relationships means that the relationship is never expected to change. No change procedures have to be considered . An example of this kind of relationship, in the Wine Club, is the relationship between a wine and a growing region. A wine is produced in a region. A particular wine will always be produced in the same region, so it is reasonable to take the view that this relationship will not change.

Retrospection in Attributes

Each attribute in the model must be assessed to establish whether or not it needs temporal support. Therefore, one of the properties of an attribute is its value for retrospection. As with the other data objects, the recording of the temporal requirements for attributes should not introduce significant complexity into the model.

The situation with respect to attributes and retrospection appears, at first, to be somewhat different to the requirements for other objects. In reality, the situation is very similar to that of relationships. If we consider an attribute to be engaged in a relationship with a set of values from a domain, it becomes easy to use precisely the same approach with attributes as with relationships.

Retrospection = true for attributes means that we need to record faithfully the values associated with the attribute over time. An example of this is the cost of a bottle of wine. As this cost changes over time, we need to record the new cost price without losing any of the previous cost prices.

Retrospection = false for attributes means that only the latest value for the attribute should be kept. When the value of the attribute changes, the new value replaces the old value such that the old value is lost permanently. An example of this in the Wine Club is the alcohol by volume (ABV) value for the wine. If the ABV changes, then the previous ABV is replaced by the new ABV.

Retrospection = permanent for attributes means that the value is not expected to change at all. Changes to the values of these type of attributes do not have to be considered. An example of this type of attribute in the Wine Club is the hobby name. Once a hobby has been given a name, it is never expected to change.

There is a rule that can be applied to identifying attributes. With great originality it is called the identifying attribute rule and simply states that all identifying attributes have a value for retrospection of permanent. This is because the identifying attributes should not change.

There is an implicit inclusion of an existence attribute for entities, relationships, and attributes where the value for retrospection is true. The status of true retrospection will direct the logical database designers to provide some special treatment, with respect to time, to the affected object. The type of treatment will vary on a per-case basis and will, to some extent, depend on the type of database management system that is to be deployed.

The inclusion of an existence attribute is also implicit for entities, but not relationships and attributes, where the value for retrospection is false. The provision of time support, where retrospection is false, is usually simpler to implement than where retrospection is true. For relationships and attributes, it is simply a case of replacing the previous value with a new value ”in other words, a simple update.

Retrospection and the Case Study

Table 5.1 now lists a handful of the data elements, entities, attributes, and relationships for the Wine Club. For each, the value for retrospection is given that satisfies the requirements of the Wine Club with regard to the representation of time.

In accordance with the previous point about their implicit nature, there is no explicit mention of existence attributes in Table 5.1. A complete list can be found in Appendix A.

So far, we have analyzed the requirements with respect to time and have identified three cases that can occur. The Wine Club data model has been examined and each object classified accordingly .

The requirement that follows on from this is to develop a method that enables the classification to be incorporated into the model so that a solution can be designed. It is important that the requirements, at this conceptual level, do not prescribe a solution. The designers will have additional points to consider such as the volumes of data, frequency of access, and overall performance. As always, some compromises are likely to have to be made.

Table 5.1. Wine Club Entities, Relationships, and Attributes

Object Name	Type	Retrospection	Reason
Customer	Entity	True	Customers may have more than one interval of activity. It is important to the Wine Club that it monitors the behavior of customers over time. There is a requirement, therefore, to record the full details of the existence of customers.
Sales_Area	Entity	False	Latest existence only, is required. Sales areas may be combined, or split. Only the latest structure is of interest. Note that some organizations might wish to compare old regional structures to the new one.
Hobby	Entity	Permanent	The hobby details, once entered, will exist forever.
Sales Area?Customer	Relationship	True	There is a requirement to monitor the performance of sales areas. As customers move from one area to another, therefore, we need to retain the historical record of where they lived previously, so that sales made to those customers can be attributed to the area in which they lived at the time.
Hobby?Customer	Relationship	False	A customer's hobby is of interest to the Wine Club. Only the current hobby is required to be kept.
Customer?Sales	Relationship	Permanent	The relationship of a particular sale to the customer involved in the sale will never change.
Customer.Customer_Code	Attribute	Permanent	Identifying attribute rule.
Customer.Customer_Name	Attribute	False	The latest value only is sufficient.
Customer.Customer_Address	Attribute	True	Requirement to analyze by detailed area down to town/city level.
Customer.Date_Joined	Attribute	False	The latest value only is sufficient.

only for RuBoard - do not distribute or recompile