System Sizing Considerations

We're sure that all readers of this book are hoping for a simple chart that will specify what kind of server machine they should buy. Sorry, it's not going to happen: The problem is too difficult to reduce to a simple matrix or tool. The best we can do is describe the different options and parameters, and the kinds of operations that will push you to require bigger and more expensive hardware.

There are four main factors that will push your project to more expensive hardware: data volumes, usage complexity, number of simultaneous users, and system availability requirements. These factors are illustrated in Figure 4.2, and discussed in the following sections.

Overall system size: Small

  • Range of configurations: one commodity 32-bit or 64-bit 4-way server
  • Data volume: < 500 million fact rows, < 50 GB atomic fact data
  • Usage complexity: > 60% simple use, 30% medium complexity, < 10% demanding use
  • Number of users: fewer than a dozen simultaneous users
  • Availability: several hours of downtime each night acceptable

Overall system size: Medium

  • Range of configurations: one high-end (8 or more processors) 64-bit system with maximum memory, or components distributed across multiple commodity servers
  • Data volume: < 5 billion fact rows, < 500 GB atomic fact data
  • Usage complexity: 50% simple use, 35% medium complexity, 15% demanding use
  • Number of users: < 200 simultaneous users
  • Availability: < 1 hour of downtime or query slowdown each night

Overall system size: Large

  • Range of configurations: distributed system on high-end 64-bit servers; some clusters
  • Data volume: billions and billions of fact rows
  • Usage complexity: 35% simple use, 40% medium complexity, 25% demanding use
  • Number of users: thousands of simultaneous users
  • Availability: 24x7 with excellent query performance

Figure 4.2: Factors influencing DW/BI system hardware requirements

Calculating Data Volumes

The first and most obvious characteristic of your DW/BI system that will affect your hardware purchases is data volume. By the time you've finished your logical model and initial data profiling, you should have enough information to estimate how big your DW/BI system is going to be. Later in the project your DBAs will calculate database sizes in great detail, but for now you can just think about fact table row counts. Unless you have a monster dimension of 50 to 100 million rows, dimension sizes are insignificant.

For starters, just figure out what order of magnitude number of rows you'll have for your initial historical load of fact data. Multiply that number by 100 bytes, which is our generous rule of thumb for fact row sizes. You can get more precise if you like, but we like arithmetic that we can do on our fingers. One hundred million fact rows require about 10GB as stored in the relational database, measured as atomic data only, no indexes. It might be 7GB; it might be 12GB; it won't be 100MB or 100GB. When you add relational indexes and Analysis Services indexes and MOLAP storage, multiply that base number by 2 to 4. In the early stages, before we have a specific design, we use a factor of 3. The requirements for the ETL staging area could conceivably add another factor, although most often the staging area is small.

The incremental daily or monthly load volume is important, too. From the incremental volumes, you can compute expected data volumes for each fact table one, three, and five years out, which will help you decide what class of storage system to buy today.
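To make this finger arithmetic concrete, here is a minimal sketch of the calculation in Python. The 100-byte row size and the factor-of-3 multiplier for indexes and OLAP storage are the rules of thumb described above; the fact table names, historical row counts, and daily growth rates are purely illustrative assumptions, not numbers from this book.

```python
# Back-of-the-envelope DW/BI sizing using the rules of thumb from this section.
# The fact tables, row counts, and growth rates below are illustrative only.

BYTES_PER_FACT_ROW = 100        # generous rule of thumb for an atomic fact row
INDEX_AND_OLAP_MULTIPLIER = 3   # relational indexes plus Analysis Services storage (2 to 4)

fact_tables = {
    # name: (historical rows at initial load, incremental rows per day)
    "orders_fact": (400_000_000, 600_000),
    "shipments_fact": (150_000_000, 250_000),
}

def to_gb(num_bytes: float) -> float:
    """Convert a byte count to gigabytes."""
    return num_bytes / 1024**3

for years_out in (1, 3, 5):
    total_rows = sum(
        historical + daily * 365 * years_out
        for historical, daily in fact_tables.values()
    )
    atomic_gb = to_gb(total_rows * BYTES_PER_FACT_ROW)
    print(
        f"Year {years_out}: {total_rows:,} fact rows, "
        f"~{atomic_gb:.0f} GB atomic data, "
        f"~{atomic_gb * INDEX_AND_OLAP_MULTIPLIER:.0f} GB with indexes and OLAP storage"
    )
```

It is worth printing the per-table row counts as well as the grand total, for the reason discussed next.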

Although we've talked in this section about counting all the fact rows in your system, not all fact tables are created equal. A single billion-row fact table is going to be more demanding of system resources than ten 100 million-row fact tables would be. We don't have a scientific way to quantify this difference, but you should list the sizes by fact table, in addition to an overall count.

In Figure 4.2, a small system is characterized by less than 500 million rows of fact data, or 50GB using our simple multiplier. A large system is more than 5 billion fact rows, into the terabyte range and above.

Determining Usage Complexity

The next key factor in deciding how big your hardware needs to be is to consider how your business users are going to use the data. There are two main questions: how many users will be working simultaneously, and what will they be doing? The usage patterns of a business intelligence system are quite different from the familiar workload of a transaction system. Even the simplest DW/BI query, for example a query browsing a dimension, is more complex than most transactional queries. And the most complex DW/BI query is several orders of magnitude more complex, and touches more data, than any operational transaction.

COMPARING DATABASE SIZES

We wish all vendors would use fact table row counts and row sizes as their metrics for reporting the size of their customers' DW/BI systems. There's so much variety in size reporting that the information is effectively useless. We've seen reports ranging from the smallest possible number as we're discussing here (atomic data only, no indexes), all the way to the total storage attached to the system. The latter is fine for marketing purposes, but not useful for education or planning. Our measure reduces the likelihood that your system will put you in the Terabyte Club, but trust us when we tell you that you don't really want to be there.


You can't expect to already have a great understanding of system usage during the design phase of the DW/BI system. But you do need to think about different kinds of usage, and the approximate volume of use in each category. The data model tests we described in Chapter 2 are a good place to start. Your business requirements document should contain additional information.

Simple or Controlled Access

The simpler or more predictable the users' queries, the more simultaneous users can be supported on the same size system. Examples of simple use include:

  • Predefined queries and reports based on a relatively small set of relational or Analysis Services data: Because these queries are relatively simple and predefined, the system can easily be tuned to support them. It's particularly hard to judge what's simple and what's challenging for Analysis Services OLAP databases; please refer to the related discussion in the following two subsections.

  • Reporting Services scheduled and cached reports: As described in Chapter 9, scheduled reports may be extremely complex. But because they are run at night and cached, the load placed by users of these reports during business hours is relatively light.

  • Data mining forecasting queries: Another kind of predefined query. As we discuss later, training a data mining model is most definitely not simple access. But performing a forecasting query on new information about a customer is straightforward.

Moderate Complexity

Examples of moderately complex use include:

  • Predefined reports based on a broad set of relational data, such as all sales for the past year. On the positive side, the report is predefined and so it can be tuned. On the negative side, the underlying query touches a lot of data and so is expensive. Consider scheduling and caching these reports in Reporting Services to move them into the Simple category, or using Analysis Services as the report's data source.

  • Ad hoc query and analysis using Analysis Services, where the analysis does not need to look at a large portion of the atomic data. If a lot of business users are performing ad hoc queries, the odds are good that they're hitting different parts of the OLAP database. In this case, the server's data cache will be of limited use (unless you have a lot of memory). Contrast this moderately complex ad hoc use of Analysis Services with the highly complex situation described following.

Highly Demanding Use

Examples of highly complex use include:

  • Ad hoc query and analysis using the relational data warehouse database: The business users aren't experts, so they may make mistakes. It's not really feasible for them to use query hints. Queries typically join many tables and often access a large volume of data.

  • Ad hoc query and analysis on Analysis Services: The analysis requires wide queries that access a large portion of the atomic data. There is a class of analytic problems that by definition must touch very detailed data. For example, a query that counts the unique values in a large set is unavoidably expensive because it must touch detailed rows. Similarly, a query that returns a median or top N percent must also touch many more rows than are handed back in the result set. The sketch after this list illustrates why such queries resist summarization.

  • Training of a data mining model: As we discuss in Chapter 10, creating the case sets for a data mining model often involves several full table scans of the fact tables.
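As a small illustration of why those distinct-count and median-style questions are so demanding, the following Python sketch uses a tiny, made-up stand-in for a fact table. An additive total can be answered from compact per-day summaries, but a distinct count has to visit every detail row, which is exactly what makes such queries expensive at fact-table scale.

```python
# Tiny illustrative "fact table," partitioned by day: (customer, sales amount).
daily_partitions = [
    [("cust_1", 10.0), ("cust_2", 5.0), ("cust_1", 2.0)],  # day 1
    [("cust_2", 7.0), ("cust_3", 1.0)],                    # day 2
]

# Additive measure: the grand total can be built from small per-day summaries,
# which is what aggregate tables and OLAP cubes exploit.
daily_totals = [sum(amount for _, amount in day) for day in daily_partitions]
total_sales = sum(daily_totals)

# Distinct count: the per-day counts (2 and 2) cannot simply be added, because
# cust_2 appears on both days. The correct answer requires touching every detail row.
unique_customers = len({customer for day in daily_partitions for customer, _ in day})

print(total_sales)        # 25.0, computed from two summary values
print(unique_customers)   # 3, required scanning all five detail rows
```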

Most DW/BI systems will be used in all these ways. A normal DW/BI system will have 60 percent simple, 30 percent moderately complex, and 10 percent demanding usage, as illustrated in Figure 4.2. A challenging usage profile has 35 percent simple, 40 percent moderate, and 25 percent demanding usage.

Estimating Simultaneous Users

The number of potential users of the DW/BI system provides only the roughest possible estimate of how many people are using the system at the same time. One analyst doing very complex work can use as many resources as dozens of users who are accessing simple reports. It's as important to know the system usage characteristics as it is to know how many people access the system.

If you currently have no DW/BI system in place, it will be difficult for you to forecast usage frequency and timing. Even our old standby recommendation, that you interview the business users, will be of little value. During the design and development phase, business users cannot predict how much they'll use the system, or at what times of day.

A few broad patterns are easy to predict. There is usually a surge of demand in the morning, when people arrive at work and check the reports run on yesterday's data. Other cycles may be based on obvious work patterns, like month-end, quarterly, or year-end financial reporting calendars. Even with these patterns, however, remember that a DW/BI workload is quite different from the fairly constant stream associated with a transaction system. In most cases, a business user executes a query or report, and then examines and thinks about the information before executing a follow-up query. You should be careful to incorporate this think time into your understanding of simultaneous usage. If you buy or develop a performance testing suite, make sure it uses randomly generated think times of between 30 seconds and several minutes.
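If you do build or buy such a testing suite, the heart of a simulated user is a loop along these lines. Only the 30-second-to-several-minutes think time comes from this section; the report names and the run_report placeholder are hypothetical stand-ins for issuing real queries or reports.

```python
import random
import time

# Hypothetical report names; in a real test these would be your own queries or reports.
REPORTS = ["daily_sales_summary", "regional_orders", "returns_by_product"]

def run_report(report_name: str) -> None:
    """Stand-in for executing a real query or rendering a real report."""
    print(f"running {report_name}")

def simulated_user(requests_to_issue: int) -> None:
    """One simulated business user: run a report, then 'think' before the next one."""
    for _ in range(requests_to_issue):
        run_report(random.choice(REPORTS))
        # Randomized think time between 30 seconds and several minutes,
        # so the workload resembles real usage rather than a constant stream.
        time.sleep(random.uniform(30, 300))

simulated_user(requests_to_issue=3)
```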

If you have an operational DW/BI system, the current usage statistics will provide a more certain estimate of future use. But if the current DW/BI system performs poorly, expect usage to increase with a new, higher-performing system. A lot of DW/BI queries and reports are somewhat optional for business users; if the current system is painful to use, they won't use it.

A small system may have only a dozen simultaneous users: people who are issuing queries and reports at more or less the same time. A challenging system, by contrast, will have hundreds or even thousands.

Assessing System Availability Requirements

The final factor affecting system size and configuration is the business requirements for system availability. These requirements can range from an 8-hour load window (midnight to 8 a.m.) during which the DW/BI system can be offline, to the opposite extreme of 24 hours a day, 7 days a week. If your business users require high availability, you may need to purchase a substantially larger and more complex system than you'd otherwise need.

The stronger the business need for a high availability system, the more likely you will be to cluster some of the components, notably the relational database and Analysis Services database, and to set up a web farm for Reporting Services.


