A historical perspective | Oracle9iR2 Data Warehousing

1.3 A historical perspective

In the 1970s the first commercial applications were built to computerize the day-to-day operations of a business. These systems were built on mainframe computers, which were very expensive. Only large businesses could afford the hardware, the programmers to program them, and the operations staff to keep them running. These systems were focused on inserting new data and reading this data sequentially using magnetic tapes.

With the invention of disk storage, the data could be accessed directly. This led to the first database management systems, which organized the data either hierarchically or in a network. These database systems were very complex. Programmers had to understand how the data was stored on the disk and navigate through the data to generate reports. Application programmers used COBOL to create custom reports. It took several days or even weeks to write the program for each new report. Reports were printed on computer paper and manually distributed to the users. There were never enough programmers, so there was always an application backlog. Once data could be accessed directly, the first on-line transaction processing (OLTP) systems were built.

In the late 1970s and early 1980s, minicomputers such as Digital's PDP-11 and VAX 11/780 brought the hardware costs down. Data was often stored in the complex CODAYSL database, which was extremely difficult to change and hard to understand and design. All that changed with the introduction of the relational database. In 1979, the Oracle database became the first commercially available relational system.

With the relational model, data is organized into tables with columns and rows. Rather than using pointers to maintain relationships among the data, a unique value, such as customer number, is stored in multiple tables to identify the row. The relational model was much easier to understand, and SQL, the language used to access the database, did not require knowledge of how the underlying data was physically stored. It became much easier to build applications, which led to widespread use of database management systems. After the initial release of relational systems, many companies began developing products used to access relational databases, including ad hoc query, reporting, and analysis tools.

With the introduction of the PC, computing moved from mainframes to client/server systems. Oracle applications were introduced in the late 1980s. Companies no longer had to build their own custom applications; they could now purchase software that provided basic functionality from vendors such as Oracle, PeopleSoft, and SAP.

As relational databases matured in the 1980s OLTP systems were built using relational systems to automate the operational aspects of the business. These included systems such as order processing, order entry, inventory, general ledger, and accounting. OLTP systems automate processes and represent the state of a system at a current point in time. In an inventory application, there are transactions to insert new items into the inventory, delete items when sold, and update the quantity on hand, which is always maintaining the balance on hand. A limited amount of history is retained. It is easy to determine how many of product "111-45-222" is on hand, for example, or on what date order number "45321" was shipped. During this time, the relational database vendors focused on improving performance for OLTP applications and competed with each other using industry standard TPC-C benchmarks.

1.3.1 The rise of the data warehouse

Once the OLTP systems were built to efficiently collect data, the challenge became how to best interpret this data. In the late 1980s and early 1990s, in an effort to take a broader view across the entire business, the first enterprise data warehouses (a term invented by Bill Inmon, the "father of data ware-housing") were built. Data was brought together from the many operational systems used to run the day-to-day business operations in order to provide a corporatewide view.

Data warehouses were built to view the business over time and spot trends. Many decisions require being able to look beyond the details of today's operations and take a broader view of the business. Typical ware-house queries involve reporting on product sales over the last two years or looking at the impact of a major snowstorm on retail sales versus Internet sales. Queries involve looking at how values have changed over time and what else also changed, and possibly discovering connections.

In order to perform this type of analysis, data in the warehouse needs to be retained for long periods of time, often five to ten years.

The data warehouse looks beyond data to find information

In a data warehouse, the primary activity is querying, or reading, the data. The only update activity occurs when new data is loaded. Decision support systems (DSS), such as Oracle Discoverer, provide interactive querying, charting, graphs, and reporting capabilities. Oracle added new types of access structures, such as bitmap indexes, to improve query performance. Materialized views were added in Oracle 8i to improve performance when aggregating and summarizing data.

Oracle Express Server became the analytical front end to the data ware-house to provide on-line analytical processing (OLAP). OLAP software is used to analyze business data in a top-down hierarchical fashion. It assumes queries will be posed iteratively, where the results of asking one question leads to asking many more questions.

It's not enough to know just the profit made this year; analysts also need to know profit over time for each product for each geographic region. This is a three-dimensional query—the dimensions are product, time, and geographical region. Or an analyst may need to compare this month's sales to the same month last year for each store versus the Internet site. He may drill down to a more detailed level in the hierarchy to get the sales for individual stores to determine which ones are most profitable and which have lost money.

The data warehouse requires a different database design

Data warehouses are designed for quick retrieval, when the access path is not known in advance. Information is often derived from other data, by rolling up data into summaries, drilling down to get more detail, or looking for patterns and trends.

In an OLTP system, entity relationship diagramming techniques (E-R) are used to design the database schema. Each entity becomes a table, attributes become columns, and relationships are represented by joining the primary-key and foreign-key columns together at run time.

A normalized design provides optimal performance for OLTP systems, which support high volumes of transactions that frequently update data. Normalization ensures the tables are correctly formed by putting related data together in one table and eliminating redundancy. By having only one copy of the data, update anomalies are avoided. By only updating the data in one place, consistency is maintained. After normalizing the data, some redundancy may have been reintroduced on columns that were not updated, to improve performance.

In order to optimize performance for a warehouse, where the primary activity is querying the data, a new data model was needed. Ralph Kimball, the primary industry spokesperson for dimensional modeling and author of The Data Warehouse Toolkit, introduced the star schema, a new way of designing the database to facilitate OLAP processing. In order to optimize performance for a warehouse, dimensional modeling techniques are used.

The dimensional approach to modeling organizes data into fact and dimension tables. It represents data in a way that is easily understood by users. Users often ask for reports of sales results on a quarterly basis broken down by store and geographical region. The sales numbers are the facts. Store, region, and quarter are the dimensions the data is analyzed by, and are used to organize the data. With dimensional modeling, denormalization and redundancy is introduced. In Chapter 2, we will see how to actually create a design for Easy Shopping Inc., the example that will be used throughout this book.

The logical design is converted to a physical representation that will best optimize performance and manageability. Tables, constraints, indexes, and partitions are defined.

Oracle has added several features to support dimensional designs. The optimizer can recognize a star schema and performs special optimizations, which avoid the more costly full Cartesian product. In addition to creating tables and columns, you can also define dimensions.

1.3.2 Data warehouses evolved as separate systems

You may already be wondering why you can't use your operational production systems for your data warehouse. You already have databases that are accessible through your corporate network, so why can't you just use those to get the information you need to run your business more efficiently? Why do you need to copy data from one system to another to build a warehouse? Since operational systems may not contain historical data, the information may not be available to analyze. Also, the schema is not designed and the data is not structured for business intelligence queries.

In addition, queries in the warehouse typically access large amounts of data, which require a great deal of memory, CPU, and I/O resources. Running decision-support queries that require a large amount of computing power to sort and summarize millions of rows will have an impact on the performance of your operational systems.

To make it even more difficult, the many different operational systems often are running on different hardware platforms, with different operating systems and database-management systems. Some applications may have been purchased and others built in house. On some, the source code may no longer even be available to make changes to, and many of the systems could be several years old.

It was once believed that distributed databases were going to allow a user to issue a query and that global query optimizers would locate the data transparently, returning the data to the user fast enough so that he or she never realized the data was located on a machine in a different geographical location. But systems of this type never materialized.

As a result, data needs to be moved from the operational systems into a separate data warehouse, where it is stored in a common format on a different machine for analysis.

The data warehouse is built from the operational systems

Building a warehouse involves extracting data from operational systems, sometimes combining the data with additional information from third parties, transforming it into a uniform format and loading it into the database.

Once data is entered in the warehouse, it almost never changes, since the data records the facts of an event or state that existed at some moment in time, such as a particular sale of a product that happened on December 23, 1998. If there were another sale of a similar product on December 24, 1998, it would generally be recorded as a separate event.

Often up to 80 percent of the work in building a data warehouse is devoted to the extraction/transformation/load (ETL) process: locating the data; writing programs to extract, filter, and cleanse the data; transforming it into a common encoding scheme; and loading it into the data warehouse.

Operational data must be extracted from the source operational systems and copied to the staging area—a temporary location where the data is cleansed, transformed, and prepared for the warehouse. Sometimes you have direct access to the source systems; however, often access is severely restricted, and you can only get files of data that have been extracted for you. The operational systems frequently must be running on a 24/7/52 basis, and performance cannot be impacted in any way.

Data from multiple systems needs to be transformed into a common format for use in the warehouse. A knowledge of the meaning of the data in the operational system is required, including the following types of examples:

Each operational system may refer to the same item by a different name or key. For example, a tool company in the United States might call product_id "1234" a "wrench," while a company in another country might call the same product "1234" a "spanner."
Each system might use a different encoding scheme. The product_id may be represented as three characters separated by dashes (xxx-xx-xxx) in one system and characters separated by spaces (xxx xx xxx) in another. The data must be transformed to a common encoding scheme in the warehouse.
An attribute of a table may have different names. One system might refer to a column in the customer table as gender, represented by values "0" or "1." Another system may call it sex, represented as "M" or "F."
Different systems may use different units of measure. The sales amount might be in dollars in the United States and the euro in Europe. The data must be transformed to a common measure in the warehouse.

In designing the transformation process, these different column names from the operational systems are mapped into the common name chosen in the warehouse and transformed into a common encoding scheme.

Once the data is transformed, it is ready to be loaded into the ware-house. Often the transformation area is on a machine separate from the warehouse; thus, the data will need to be transported or moved to the machine where the warehouse is located.

If the transformed data is stored in a flat file, it may be transported using FTP and then loaded using the SQL*Loader utility or 9i external tables. If the data has been transformed in an Oracle database, transportable tablespaces may be used to move a tablespace from one database to another.

Oracle Warehouse Builder, a tool in the Oracle Internet Developer Suite, helps automate the extraction, transformation, transport, and load aspects of the process. We'll take a further look at the extraction and transformation process in Chapter 5.

New data is generally added to the warehouse on a periodic basis. It is usually loaded in batch in the evenings or at another time when the ware-house is not being used heavily by the analysts.

In addition to the data you already own, you can purchase data from external data providers to add to your warehouse. For example, you can buy information about the weather, demographics, and socioeconomic data.

Examples of use could include the following:

Adding data that tracks regional weather events on a daily basis; this way, you can determine which products show an increase or decrease in sales when there is a snowstorm.
Adding customer demographic data; selective marketing can be performed, targeting those customers most likely to respond to a sales promotion.
Knowing what types of customers buy what types of products, you can anticipate demand and increase profitability. Demographic data can be used to help choose a location to place a new retail store.
Adding Dun and Bradstreet data containing information on companies and products.

1.3.3 The data mart

Building a warehouse can be very complex and often takes anywhere from 18 months to three years to deploy. Because a warehouse contains many subject areas and crosses multiple organizations, it can also be highly political. Many early data warehousing projects failed.

It was discovered that many of the same benefits of a warehouse could be scaled down to the department or line of business, solving a particular business problem. Data warehouses contain multiple subjects that provide a corporate view across all lines of business. Data marts are subject-specific or application-specific data warehouses and contain data for only one line of business such as sales or marketing. The major difference between a data mart and a data warehouse is the scope of the information it contains. Because the scope of a data mart is much smaller, the data is obtained from fewer sources, and the typical time to implement a data mart is shorter.

Independent data marts

Data marts may be dependent or independent, based on the source of information. The source of information for a dependent data mart is an existing data warehouse. A data mart is considered independent when no enterprise data warehouse exists, and the data is extracted directly from the operational systems.

Because independent data marts can be constructed very quickly, they became quite popular in the mid to late 1990s as each department in a company created its own data mart for its own needs. Unfortunately, after creating a few data marts, problems began to arise. Each data mart is its own "island of information." It's obviously a problem when two reports with different answers to the same question occur.

One of the reasons independent data marts can be deployed so quickly is that they postpone some of the critical decisions that later become necessary as the number of marts grows. Only the data needed by an individual department needs to be identified and understood. A complete understanding of all the corporate data is not necessary. Creating independent data marts avoids political issues related to the creation of common naming and encoding standards.

Other problems arose from the fact that the individual data marts were often built independently of one another by different autonomous teams. These teams will often select different hardware, software, and tools to use.

Each independent data mart gets its data directly from the operational system. If a company had five different data marts, each needing customer information, there would be five separate programs running to extract data from the customer table in the operational system. You probably don't have enough time in the batch window to run five extract programs today, and you certainly won't be able to run more extract programs in the future as you add more data marts.

Each does its own data cleansing and transformations, possibly each in a slightly different way. It is very easy for the data to become inconsistent. Redundant and inconsistent data leads to different answers, making it difficult to make decisions. Imagine trying to merge these different views at a later point into a common data warehouse.

Operational data stores consolidate reporting of recent information

As discussed previously, there is a major distinction between the data in the operational systems and the data in the warehouse. Operational data is about the current state of the company and is used to manage the daily operations. Data in the warehouse is informational, containing a historical record about the past.

If there is a need to provide information about the current state of the business to make tactical decisions to support day-to-day operations, an operational data store (ODS) may be built as part of the information management architecture. An ODS contains subject-oriented, integrated data that has been validated and cleansed. It is used to support operational queries and reports. One example is customer service organizations that need to access current account balances and billing information.

The ODS may be updated in near real time, so that it reflects the current state in time. The ODS may serve as the source of data for the ware-house or data marts.

Incrementally building the data warehouse with dependent data marts

To solve these problems, and still provide a timely return on investment, rather than building a warehouse representing the entire corporation, people began building the warehouse a functional area at a time using a phased approach.

Figure 1.1 shows the most common architecture used today. Data is extracted from OTLP systems and external sources, loaded into operational data stores and enterprise data warehouses, and loaded into dependent data marts.

click to expand
Figure 1.1: Enterprise data warehouse with dependent data marts.

Building a data warehouse is just like building software. You cannot do everything in one release, and you will never be able to anticipate all of the possible uses. It is much better to develop an overall architecture, and build a framework with components that allow the warehouse to be built in phases. Limit the scope, and plan for enhancements every three to six months.

Reporting, query, and analysis tools

In 1995 Larry Ellison, the founder and CEO of Oracle, first introduced his vision of the network computer: a small, inexpensive device that makes it easy to run applications that access information via the Internet. Although the network computer never gained significant market share, the vision of Internet-centric business computing accelerated the rapid price decline of PCs, meeting the demand for cheaper, simpler desktop computing.

In 1998 Oracle released its entire applications suite enabled for use on the Web, and stopped developing applications for use in a client/server environment. Oracle8i, the Internet database, was released with many new features designed to support Internet-based activities and applications.

The ability to publish reports on the Web can make information available to virtually anyone. It gives employees, partners, and customers realtime access to critical information. No longer do you have to be in the office to view a report. Just pop into the local Internet cafe or connect to the Internet from your hotel room. Placing information on the Web (either your internal web or the World Wide Web) means that your office can truly be almost anywhere.