Data warehouses provide strategic and tactical business information used to drive business planning and forecasting analysis. They are built from subsets of operational databases, typically fed by OLTP workloads, that together form a historical record of various business activities. Users query these large databases with what-if questions that range from the simple to the complex. The answer sets of these queries can be extremely large and complex, and they often form the basis for additional database subsets that warrant further analysis.
A spin-off of the data warehouse is the data mart. Theoretically, data warehouses are based upon corporate information and are used to drive corporate business decisions. Data marts are subsets of data warehouses that are used to drive lower-level business decisions at the divisional or departmental level. This distinction is important because the size and I/O content of the warehouse will be driven by whether it is used at the corporate level or at the divisional or departmental level. Given that the true nature of data warehousing is tied to the functions of the relational database, there are many database rules, guidelines, and practices that determine its design, operation, and usage. Most of these are outside the scope of this book, however, and although important, only those that provide clues to I/O workload analysis will be mentioned.
As you may have experienced or surmised, data warehouses can be very large. Because their sources are the continuing updates from daily business transactions, their storage utilization is among the largest of any workload. The capture of operational database records, known as the extraction, transformation, and loading (ETL) activities, is driven by the complexity of the data analysis application. In other words, the more comprehensive the analysis application, the more data is required. Many data warehouse configurations are moving into the multiple-terabyte range and, consequently, provide the foundation for why data warehouses and data marts are good candidates for SANs. Of all the workloads examined so far, the data warehouse exhibits the most complex I/O activity we will discuss.
Data warehouses are built around relational databases; therefore, they are under the control of the user's database system of choice. This in turn defines the need for large storage capacities that scale well in terms of both pure size and I/O content, a consequence of the data warehouse transaction itself. The DW transaction in its simplest form accesses information from multiple database tables, each spanning many records. For example, analyzing the number of customers who carried account balances over $1,000 for an entire year and who also hold credit card balances requires the database system to access a customer table, a deposit accounts table, a credit card accounts table, and so on. Without going into the specifics of the database design, we can see that even a simple query generates multiple I/Os, each returning a large set of records.
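To put a rough number on this, the sketch below estimates the block reads such a multi-table query might generate. The table names, row counts, row sizes, and the 8 KB block size are all illustrative assumptions, not figures from the text.

```python
# Hypothetical sizing sketch: block reads for a query touching
# three tables. All row counts and sizes are assumed for illustration.

BLOCK_SIZE = 8 * 1024  # 8 KB database block, a common default

# (rows scanned, average row size in bytes) for each table the query touches
tables = {
    "customer":        (1_000_000, 200),
    "deposit_account": (2_500_000, 150),
    "credit_card":     (1_200_000, 180),
}

def blocks_read(rows, row_size, block_size=BLOCK_SIZE):
    """Blocks needed to read the rows touched (ceiling division)."""
    return -(-rows * row_size // block_size)

total_blocks = sum(blocks_read(r, s) for r, s in tables.values())
print(f"Estimated block reads for one query: {total_blocks:,}")
```

Even with modest row counts, a single "simple" query turns into tens of thousands of block-sized I/Os, which is the point the text is making.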
DW transactions are serviced on both a synchronous and an asynchronous basis. Consequently, it's important to understand the mix of transactions and their relative service levels. This raises an interesting phenomenon: online and batch-oriented operations can process simultaneously within these environments. Given that these are disparate workloads in terms of their resource requirements, configuring an I/O system capable of handling this set of complexities becomes a challenge.
This is where SANs come in. SANs provide the most comprehensive solution for data warehousing since the MPP database machines. Through their increased addressing capacity, SANs provide an extensible and scalable foundation for configuring a very large database. Moving well beyond the limitations of SCSI bus configurations, which top out at 16 addressable storage devices, has long been a requirement in DW environments. SANs support a comprehensive workload with considerable I/O content, given the increased FC frame payload capacity available for each I/O. Finally, the flexibility to isolate disparate I/O through SAN operations and configurations can sustain a disparate and complex workload.
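The payload point can be made concrete. A Fibre Channel frame carries at most a 2112-byte data field, so the number of frames per I/O grows with the I/O size; the I/O sizes below are illustrative assumptions.

```python
# Sketch: FC frames needed to carry one I/O of a given size.
# 2112 bytes is the maximum FC frame data field; I/O sizes are assumed.

FC_MAX_PAYLOAD = 2112  # bytes per frame data field

def frames_per_io(io_bytes, payload=FC_MAX_PAYLOAD):
    """Frames required to carry io_bytes (ceiling division)."""
    return -(-io_bytes // payload)

for io_size in (4 * 1024, 64 * 1024, 1024 * 1024):
    print(f"{io_size:>8} B I/O -> {frames_per_io(io_size)} frames")
```

A 64 KB I/O needs 32 frames, and a 1 MB answer-set transfer needs nearly 500, which illustrates why large-payload efficiency matters for DW traffic.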
Consequently, taking into account this mix of disparate workload characteristics, a balance must be struck among transactional activity, update activity, and data currency. These should be identified in the workload analysis. The subsequent resource requirements can then be estimated in the major areas of I/O activity: the data organization method, the data paths, and user access.
The use of a relational database system as the data organization model for the data warehouse is a given. This not only defines a set of metrics that allows for greater detail in determining byte transfer rates, but also informs the physical placement of data within the storage devices. Working closely with a database administrator greatly enhances the SAN design and operational factors. As discussed earlier, there are three key database metrics to consider: database size, multiple-table access, and updates.
Database size determines the number of storage devices necessary for pure capacity. Although capacity is important, physical placement becomes the critical factor. Partitioning within the database, meanwhile, is keyed to the transaction type and mix. Consequently, databases that span storage arrays require consideration when determining port capacities and access at the switch level.
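A first-pass capacity calculation might look like the following sketch. The warehouse size, growth headroom, and device size are assumptions chosen for illustration, not figures from the text.

```python
# Illustrative capacity sizing: devices needed for raw capacity alone,
# before physical placement and partitioning are considered.
# All three inputs are assumptions.

db_size_tb = 4.0          # assumed current warehouse size
growth_factor = 1.5       # headroom for indexes, temp space, and growth
device_size_gb = 73       # e.g. a 73 GB FC disk of the era

needed_gb = db_size_tb * 1024 * growth_factor
devices = -(-int(needed_gb) // device_size_gb)  # ceiling division
print(f"{needed_gb:.0f} GB raw -> {devices} devices")
```

As the text notes, this raw device count is only a starting point; where those devices sit relative to switch ports and array boundaries matters more.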
Depending on the DW transactional mix, adequate performance for concurrent I/O access across the database is required. Databases partitioned within storage arrays (a given), and those that span storage arrays (a good possibility), need to be analyzed in order to provide sufficient data paths to meet required service levels. Just as important is the ability to isolate those data paths from batch operations that may be taking place simultaneously.
The most significant challenge lies in the configuration attributes that handle updates. DW sources are the operational databases that handle a company's day-to-day operations. Copies of subsets of data from these databases serve as updates to the DW databases. The operational databases may or may not be part of the SAN configuration; in most cases they aren't, so care should be taken in how the update process is executed. Usually, this is a batch process performed during an off shift. If so, analyzing the options for getting the data into the DW is key to the success of SAN operations.
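One way to frame that analysis is to check whether the nightly load fits the batch window. The extract volume, window length, and per-path throughput below are all illustrative assumptions.

```python
# Sketch: data paths needed to land the nightly extract within the
# batch window. Every input figure here is an assumption.

update_gb = 2000           # assumed nightly extract volume
window_hours = 2           # assumed off-shift batch window
path_mb_s = 80             # assumed effective MB/s per data path

def paths_needed(gb, hours, mb_s):
    """Ceiling of (data to move) / (what one path moves in the window)."""
    per_path_mb = int(mb_s * hours * 3600)
    return -(-gb * 1024 // per_path_mb)

print(f"Data paths needed for the load: {paths_needed(update_gb, window_hours, path_mb_s)}")
```

If the result exceeds the paths you can dedicate, the options are a longer window, faster paths, or a smaller (incremental) extract.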
The one break you can expect from DW configurations is that the user community supported is generally limited, since only a small number of company personnel are assigned to strategic planning. Even so, the user base is likely to grow significantly as the success of DW applications demonstrates their value. And although user traffic may be limited, keep in mind that the I/O content of answer sets can be very large. This is one area where applications like DW push the boundaries of FC transfer rates.
If we return to our previous example, now analyzing the number of customers with account balances over $1,000 for the year, with credit cards, and who have lived at the same address for more than three years, we may find that the answer set exceeds a gigabyte of data. For such a large transaction, the number of I/Os generated requires sufficient resources to meet the expected service level, which may be a real-time requirement. However, if only two users generate this requirement per day, the total byte transfer rate to be analyzed begins to fall into perspective.
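That contrast between burst rate and daily volume can be sketched directly. The 1 GB answer set and two users per day come from the text's example; the five-minute delivery target is an assumption.

```python
# Sketch: burst throughput vs. daily volume for the text's example.
# The service-level target (5 minutes) is an assumed figure.

answer_set_gb = 1.0       # answer set size from the text's example
users_per_day = 2         # users generating it per day, per the text
target_minutes = 5        # assumed service level for delivering the set

rate_needed = answer_set_gb * 1024 / (target_minutes * 60)  # MB/s burst
daily_total_gb = answer_set_gb * users_per_day

print(f"Burst rate needed: {rate_needed:.1f} MB/s; daily total: {daily_total_gb} GB")
```

The daily average is trivial, but the burst rate during the query is what sizes the data paths, which is why frequency alone does not tell the whole story.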
Processing transactions with gigabyte-sized answer sets requires multiple data paths. In addition, the need to physically partition the database across storage arrays requires still more data paths. If we augment the requirement further with database updates that may need to be performed in isolation, the number of data paths becomes significant.
Although the number of data paths needed depends on the service level, the ability to access multiple terabytes of data with complex queries in a reasonable time period is what drives the number of data paths a SAN requires. The alternative is a highly complex MPP environment using sophisticated parallel processing functionality.
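The driver described above can be sketched as a simple ratio: data to be scanned over what one path can deliver in the target time. The scan volume, response target, and per-path rate are all illustrative assumptions.

```python
# Sketch of the data-path count driver: scanning terabytes within a
# target query time. All figures below are assumed for illustration.

scan_tb = 3.0              # assumed data a complex query must touch
target_hours = 1.0         # assumed acceptable response time
path_mb_s = 90             # assumed sustained MB/s per FC data path

total_mb = scan_tb * 1024 * 1024
per_path_mb = int(path_mb_s * target_hours * 3600)
paths = -(-int(total_mb) // per_path_mb)  # ceiling division
print(f"Data paths required: {paths}")
```

Tighten the response-time target or grow the scan volume and the path count climbs quickly, which is the configuration pressure the text describes.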
Figure 18-6 shows an example of our cascading configuration supporting the data warehouse application. This configuration comprises three FC switches, 18 disk arrays, inter-switch link (ISL) ports, and an integrated FC-to-SCSI bridge into a tape library. It supports our workload estimate using data warehouse metrics and considerations, and it serves the five servers attached to the end-user LAN. It assumes three relational database systems supporting two warehouses: the Financial Analysis System and the Financial Planning System. Note that the Financial Analysis data is large enough that it is partitioned into North American activity and European activity. The outstanding issue in this configuration is the update process involving operational databases that are not part of this SAN. Considering our initial example of the banking application from Figure 18-3, an interesting problem to contemplate is the operational activity necessary to update the data warehouse from SAN to SAN. This will be discussed in Chapter 20, which covers integration issues.