OLAP, Data Warehousing, Business Intelligence, and Data Mining

A relational database can either be designed for real-time business operations, such as order processing, or optimized to support the decision-making process. The former is referred as an online transaction processing (OLTP) database; the later is called online analytical processing (OLAP). Data in an OLAP database is organized to support analysis rather than to process real-time transactions, as in an OLTP database. An ETL tool such as DTS is often used to extract, transform, and load data from various sources, including OLTP databases, into an OLAP database. OLTP databases are generally designed to support a much higher number of concurrent users than OLAP databases.

An OLAP database, also known as a data mart, consolidates and organizes data from varied sources, including the operational data from OLTP databases. This data is processed and pre-aggregated to provide superior performance for ad hoc queries submitted to provide actionable business insights and to aid in the business decision-making process. In other words, a data warehouse is a data store that is built using a more systematic approach to combine data from various sources, cleanse it for accuracy and consistency, and organize it in a way that favors queries that request thousands or millions of rows at a time versus one that requests limited rowsets, like the ones found in an OLTP system. A data warehouse may also help in segregating expensive reporting queries from the OLTP system. A data warehouse often contains historical and summarized data that supports business decisions at many levels.

Let's say you wanted to find out about local, regional, national, and worldwide sales per year for the past five years for a particular set of products. To obtain this information from an OLTP database, the query might have to process millions of records and could take a long time to come back with results. In contrast, because an OLAP data warehouse already stores the summarized data in a multidimensional fashion, it yields the desired results a lot faster than the relational database.

Business intelligence is a generic term that means different things to different people. Business intelligence refers to sets of tools and applications that query OLAP data and provide reports and information to enterprise decision makers. Business intelligence tools and applications allow the leveraging of the organization's internal and external information assets for making better business decisions. Business intelligence capabilities include data transformation (ETL), OLAP, data mining, and reporting.

Data mining is an activity that operates on a data warehouse to ferret out trending patterns in the data. A common example of data mining is exhibited on the Amazon.com website. If you have ever received an email from Amazon that recommends a new book or DVD based on your past purchases, you have experienced data mining at work. Data mining involves exploring large quantities of data, using patterns and rules, in order to discover meaningful information about the data. A data mining component allows you to define data mining models based on existing data to discover trends and predict outcomes.

OLAP Terminology

Before you read the rest of this chapter, it's important that you be familiar with some terms related to OLAP.

A table that contains columns and rows is a basic object in a two-dimensional database system. The equivalent to this in a multidimensional OLAP database is a cube. A cube is a conceptual container of detailed values consisting of dimensions and measures. The term measures refers to facts available in any data store that are interesting to the business user. Some common measures are sales amount, sales quantity, hourly rate, current inventory amount, total expenses, and cost. In other words, a measure is a summarizable numerical value used to monitor business activity. A dimension is an aspect of the business through which measures are analyzed. In other words, a dimension is a descriptive category, such as date, product, location, or customer. Dimensions may have multiple hierarchies, each hierarchy may have multiple levels, and each level may have multiple members. For instance, the location dimension may have country, states, and city as the hierarchy.

Measures are aggregated based on the members of a dimension. A member is a particular instance of a dimension. Examples of members of a date dimension might be 2005, Quarter 1, January, and 1/1/2005. A level provides a grouping of members at a level of a hierarchy. In the preceding example, the levels that correspond to the example members might be Year, Quarter, Month, and Day. Dimension attributes are pieces of information about a member that are interesting for reporting or analysis, such as customer name, day of week, SKU, store hours, and so on.

The term fact table refers to a relational database table that is a central table in a data warehouse schema. A fact table usually contains a large number of rows, sometimes in the hundreds of millions of records. A fact table's columns include numeric values for one or more measures and primary keys from dimension tables as the foreign key columns. An example of a fact table could be a table with the columns product_id, time_id, customer_id, promotion_id, store_id, store_sales, store_cost, and unit_sales. Note that a fact table contains primary keys from all the dimension tables as foreign keys, and the rest of the columns are the measures.

The two common schema design strategies for data warehouses include the star schema and the snowflake schema. The entity relationship diagram of the star schema resembles a star, with one fact table at the center and several dimension tables as the points of the star. The snowflake schema is an extension of a star schema such that one or more dimensions are defined by multiple tables. In a snowflake schema, only primary dimension tables are joined to the fact table. Additional dimension tables are joined to primary dimension tables. Snowflake schemas are more normalized than star schemas.

OLAP Storage Modes

There are three modes in which you can store dimensional data. Each choice has its own characteristic data storage requirement and can greatly affect both processing and querying performance:

  • Multidimensional OLAP (MOLAP) MOLAP stores both the underlying data used to create the cube and the aggregations created for the cube in a multidimensional structure maintained on the OLAP engine, such as an Analysis Services server. Because the OLAP engine can answer all queries directly from this multidimensional structure without further accessing the relational or any other store, the MOLAP storage mode presents the best overall querying performance of all three storage modes. However, because the data used to create the cube is copied from the relational database to this multidimensional structure, even though the OLAP engine might compress this data, the storage requirement needed to support MOLAP is still the largest of the three storage modes. Another important consideration with the MOLAP strategy is the time it takes to process the cubes.

  • Relational OLAP (ROLAP) With ROLAP storage mode, both the underlying data for the cube and the aggregates created for the cube are stored in a common relational database. Because the OLAP engine (the Analysis Server) must query the relational database to retrieve all cube data, this storage mode provides the slowest querying performance of the three storage modes.

  • Hybrid OLAP (HOLAP) HOLAP combines elements from both MOLAP and ROLAP. With HOLAP, the aggregations are stored on the OLAP engine (the Analysis Services server) in multidimensional storage. However, the underlying data for the cube remains in relational database and is not copied into multidimensional storage. HOLAP provides the best processing performance and uses the least additional storage. Query performance is typically better in HOLAP than in ROLAP but not good as in MOLAP, depending on the aggregation design.

Analysis Services 2005 supports all three of these storage modes.

Microsoft SQL Server 2005(c) Changing the Paradigm
Microsoft SQL Server 2005: Changing the Paradigm (SQL Server 2005 Public Beta Edition)
ISBN: 0672327783
EAN: 2147483647
Year: 2005
Pages: 150

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net