| only for RuBoard - do not distribute or recompile | 
Data warehouses are a special type of database that are built for the specific purpose of getting information out rather than putting data in, which is the purpose of most application databases.
The emphasis is on supporting questions of a strategic nature, to assist the managers of organizations in planning for the future.
A data warehouse is:
Subject Oriented
Non Volatile
Integrated
Time Variant
Dimensional analysis is a technique used in identifying the requirements of a data warehouse and this is often depicted using a star schema. The star schema identifies the facts and the dimensions of analysis. A fact is an attribute, such as sales value, or call duration, which is analyzed across dimensions. Dimensions are things like customers and products over which the facts are analyzed . A typical query might be:
Show me the sales value of products by customer for this month and last month.
Time is always a dimension of analysis.
The data warehouse is almost always kept separate from the application databases because:
Application databases are optimized to execute insert and update type queries, whereas data warehouses are optimized to execute select type queries.
Application databases are constantly changing, whereas data warehouses are quiet (nonvolatile).
Application databases have large and complex schemas whereas data warehouses are simplified, often denormalized , structures.
Data warehouses need historical information and this is usually missing from application databases.
There are five main components to a first generation data warehouse.
Extraction of the source data from a variety of application databases. These source applications are often using very different technology.
Integration of the data. There are two types of integration. First there is format integration, where logically similar data types (e.g., dates) are converted so that they have the same physical data type. Second, semantic integration so that the meaning of the information is consistent.
The database itself. The data warehouse database can become enormous as a new layer of fact data is added each day. The star schema is implemented as a series of tables. The fact table (the center of the star ) is long and thin in that it usually has a large number of rows and a small number of columns. The fact columns must be summable. The dimension tables (the points of the star) are joined to the fact table through foreign keys. Where a dimension participates in a hierarchy, the model is sometimes referred to as a snowflake .
Aggregate navigation is a technique which enables the users to have their queries automatically directed at aggregate tables without them being aware that it is happening. This is very important for query performance.
Presentation of information. This is how the information is presented to the users of the data warehouse. Most implementations opt for a client-server approach, which gives them the capability to view their information in a variety of tabular or graphical formats.
Data warehouses are also useful data sources for applications such as data mining, which are software products that scan large databases searching for patterns and reporting the results back to the users. We review products in Chapter 10.
There are some problems that have to be overcome , such as the use of time. Care has to be taken to ensure that the facts in the data warehouse are correctly reported with respect to time. We explore the problems surrounding time in Chapter 4.
Also, many of the queries that users typically like to ask of a data warehouse cannot easily be translated into standard SQL queries, and work-arounds have to be used, such as procedural programs with embedded SQL.
| only for RuBoard - do not distribute or recompile | 
