Now we look in detail at our proposed solution. We start with how to design, create, and manage the partitioned tables in the data warehouse. Next, we look at the ETL processes we need to create to populate the tables in the data warehouse. Finally, we look at how to design, create, and manage the partitions that make up the cubes.
You use partitioned tables to optimize both queries against, and management of, a large relational table. Our partitions will be based on date because our queries typically target a specific date range.
Designing Your Partitions
To design partitions, you need to decide on a basis for the partitioning. What you are looking for is one or more attributes in the table that will group your data in a way that maximizes the specific benefits you are looking for. In our example, we are looking for fast queries over recent date ranges, reduced backup time and space, and efficient purging or archiving of old data.
A partitioning scheme is frequently based on time because it can support all of these benefits. Each partition would hold data from a specific time period. If the data in a partition does not change, you can mark the filegroup read-only, back it up once, and not have it take time and space in future backup operations. If you need to purge or archive old data, you can easily remove the partition without having to perform a costly delete operation.
Your partition sizes normally are determined by the amount of data estimated to be in a partition and a convenient time period. Larger partitions make management a bit easier, but may contain too much data to provide the efficiency we are seeking. Daily or weekly partitions are common. In this example, we will create uniform weekly partitions. Because it can take up to four weeks to receive all the data, we will keep the four most recent partitions writeable and mark older partitions read-only. This means we only need to back up four weeks of data.
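Marking older partitions read-only can be sketched as follows, assuming the CDR database and the filegroup naming convention used later in this section:

```sql
-- Once a week's partition will no longer change, mark its filegroup
-- read-only so it can be excluded from routine backups.
ALTER DATABASE CDR MODIFY FILEGROUP FG2005WK52 READ_ONLY;

-- Back the filegroup up once; afterward it no longer needs to be
-- part of regular backup operations.
BACKUP DATABASE CDR
FILEGROUP = 'FG2005WK52'
TO DISK = 'D:\CDR\Backup\FG2005WK52.bak';
```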
Creating Partitioned Tables
No GUI exists for creating the partition function or the partition scheme, but even if there were, we still recommend that you use scripts to create partitioned tables so that you have a repeatable process. Here is an overview of the steps you take:
Creating the Partition Function
The partition function defines the boundary values that divide the table between partitions. You specify a data type that will be passed to the function, and then a number of values (up to 999), which are the boundary points. The name of the partition function is used later when you create the partition scheme. To illustrate the idea, the example below defines six weekly boundary values, creating seven partitions: the first holds everything before December 25, each of the following five holds one week of data, and the last holds everything from January 29 onward.
CREATE PARTITION FUNCTION PartitionByWeek (smalldatetime)
AS RANGE RIGHT FOR VALUES
('2005-DEC-25', '2006-JAN-01', '2006-JAN-08',
 '2006-JAN-15', '2006-JAN-22', '2006-JAN-29')
The partition function that you create determines which partition a row belongs in. Partitions don't have to be uniform in size. You can have partitions for the current four weeks, previous month, and another for everything else in a history partition if that suits your query profile. You can also specify more filegroups than you have partitions. The extra ones will be used as you add partitions.
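To see where a given value would land, you can query the partition function directly with the $PARTITION function. This sketch uses the PartitionByWeek function defined above:

```sql
-- Ask which partition a given CallDate would be assigned to.
SELECT $PARTITION.PartitionByWeek('2006-01-10') AS PartitionNumber;
-- With RANGE RIGHT, each boundary value belongs to the partition on
-- its right, so January 10 falls in partition 4: [Jan 8, Jan 15).
```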
Creating the Filegroups and Adding Files
A filegroup is simply a set of one or more files where data or indexes can be placed. A filegroup can support more than one partition, and a partition can be distributed over more than one filegroup. In our example, we use one filegroup for each partition. The partitions were defined by the partition function we created previously. We will create one filegroup for each of the ranges defined in the function and add one file to each filegroup:
-- Create filegroup FG2005WK52
ALTER DATABASE CDR ADD FILEGROUP FG2005WK52;
GO
-- Add file CD2005WK52 to filegroup FG2005WK52
ALTER DATABASE CDR
ADD FILE
( NAME = CD2005WK52,
  FILENAME = 'D:\CDR\Data\CD2005WK52.ndf',
  SIZE = 3000MB,
  FILEGROWTH = 30MB )
TO FILEGROUP FG2005WK52;
GO
ALTER DATABASE CDR ADD FILEGROUP FG2006WK01;
GO
ALTER DATABASE CDR
ADD FILE
( NAME = CD2006WK01,
  FILENAME = 'D:\CDR\Data\CD2006WK01.ndf',
  SIZE = 3000MB,
  FILEGROWTH = 30MB )
TO FILEGROUP FG2006WK01;
GO
-- (repeat for the remaining partitions)
We expect that 3,000MB is sufficient space for one week's data. If that's not enough, the file will grow by 30MB whenever it runs out of space. In SQL Server 2005, the new space is not formatted when it is allocated, so this operation is quick. Having multiple files in each filegroup would allow for some parallelism in query execution. If you have enough drives, you could spread the load across multiple drives. With high data volumes, such as we anticipate here, you will likely be using SAN storage with many drives and a large cache.
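If you do want to spread a filegroup over multiple drives, you simply add more files to it. This sketch assumes a second drive (E:) and a hypothetical file name:

```sql
-- Adding a second file on another drive to an existing filegroup;
-- SQL Server stripes the partition's data across both files,
-- enabling some parallel I/O.
ALTER DATABASE CDR
ADD FILE
( NAME = CD2006WK01b,
  FILENAME = 'E:\CDR\Data\CD2006WK01b.ndf',
  SIZE = 3000MB,
  FILEGROWTH = 30MB )
TO FILEGROUP FG2006WK01;
```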
Creating the Partition Scheme
Now, you need to map the partitions to filegroups. You can map them all to one filegroup or one partition to one filegroup. We will assign each partition to its own filegroup because that will give us the performance and manageability we are seeking. The partitions and filegroups are associated one by one, in the order specified in the filegroup list and in the partition scheme. You can specify more filegroups than you have partitions to prepare for future partitions:
CREATE PARTITION SCHEME WeeklyPartitionScheme
AS PARTITION PartitionByWeek
TO (FG2005WK52, FG2006WK01, FG2006WK02, FG2006WK03,
    FG2006WK04, FG2006WK05, FG2006WK06)
Creating the Partitioned Table
With the partition scheme now created, you can finally create a partitioned table. This is straightforward and no different from creating any other table, other than directing SQL Server to create the table on a partition scheme:
CREATE TABLE [Call Detail]
(TicketNumber BigInt NOT NULL,
 CallDate SmallDateTime NOT NULL,
 ... )
ON WeeklyPartitionScheme (CallDate)
In Figure 11-5, you can see how all the pieces fit together to store the data in a table. When a row arrives for insert, the database engine takes care of figuring out where it should go without any further work on your part. The partition column is used by the partition function to determine a partition number. The partition number is mapped by position to one of the partitions in the partition scheme. The row is then placed in a file in the filegroup assigned to the selected partition. To reduce clutter in the diagram, we haven't shown all the files or filegroups.
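After loading some data, you can confirm how rows were distributed across the partitions by querying the catalog views. A minimal sketch:

```sql
-- Row counts per partition for the Call Detail table.
SELECT p.partition_number, p.rows
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID('[Call Detail]')
  AND p.index_id IN (0, 1)   -- heap or clustered index only
ORDER BY p.partition_number;
```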
Figure 11-5. Assigning a row to a partition in a table
You can also partition indexes. By default, indexes are partitioned so that they are aligned with the data partitions you created. This is the optimal configuration. If your data and index partitions are not aligned, you cannot take advantage of the efficiencies offered by filegroup backups.
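An aligned index is one created on the same partition scheme (or an equivalent one) as the table. This sketch creates a nonclustered index on the partitioning column; naming the scheme explicitly is optional, because indexes on a partitioned table are aligned by default:

```sql
-- Index on the partitioning column, explicitly placed on the same
-- partition scheme as the table so it stays aligned.
CREATE NONCLUSTERED INDEX IX_CallDetail_CallDate
ON [Call Detail] (CallDate)
ON WeeklyPartitionScheme (CallDate);
```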
You need to be aware of some restrictions when designing the data model for your application. We have outlined the important ones here.
Consistent Table Schemas
Tables that are currently separate that you are planning to merge into a partitioned table should be essentially identical, with few exceptions. For example, if you have two tables, current and historical, and you plan to merge current data into a partition in the historical data table, you can't add additional columns to one table to support some functionality for current data and not carry the columns over into the historical data table.
Foreign Key Constraints
There can't be any foreign key constraints between tables you are planning to combine into the same partition. This restricts you from having self-referencing relationships in the model.
There can't be any foreign key constraints from other tables to a table that could be merged into another partition. An example of this would be a dimension table that is referenced in a constraint by a fact table. You are unlikely to have this situation in the portion of a data warehouse that is supporting a dimensional model. The fact table is usually the table you would be partitioning, and the foreign key relationships are usually from the fact table into a dimension table. Dimension tables generally are small enough that you do not need to partition them.
Indexed Views Not Supported
Indexed views are not allowed in either the source or the target, because schema binding is not allowed. If indexed views were implemented to improve the performance of an application, a cube built based on the data warehouse should be able to provide the same if not better performance gain.
When you add a table to a partition, the indexes must be the same on both source and target.
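Adding a table to a partition is done with ALTER TABLE ... SWITCH, which is a metadata-only operation and therefore nearly instantaneous. This sketch assumes a hypothetical staging table named CallDetailStaging:

```sql
-- Move a fully loaded staging table into an empty partition.
-- The staging table must match the target's schema and indexes,
-- live on the partition's filegroup, and carry a CHECK constraint
-- proving its rows fall within the partition's date range.
ALTER TABLE CallDetailStaging
SWITCH TO [Call Detail] PARTITION 4;
```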
Loading Large Volumes of Data
The pipeline architecture of an SSIS data flow is extremely efficient in loading and transforming data. After we have covered how the package is put together, we look at ways we can optimize it for high-volume data flows.
When talking about large volumes of data, bulk load or bcp utilities always come to mind. You can also use a Bulk Load task in your control flow. However, we want to transform the data, so these approaches won't help us.
Let's look at an SSIS package that can be used to import data from a text file and insert it into a data warehouse using the Fast Load option. This isn't strictly a VLDB solution, but it's an opportunity to show you another way of loading data. Flat files are a common way of recording or delivering large volumes of data.
The first thing you need to do is to create a connection manager that points to your CDR input files. In the Connection Manager panel at the bottom of the window, right-click and then choose New Flat File Connection. Give the connection manager a name such as CDR Input. Browse to a file containing CDR data and select the file. Set the format to match your file. In our example, the columns have fixed width, and we just need to define the width of each column. If your data has a separator between columns, such as a tab or comma, choose delimited for the format. On the Columns tab, drag the red line to the right until you reach the end of the row. To define a column break, click the ruler.
Try to get the column breaks set correctly before assigning names to the columns. After that point, you need to use the Advanced tab to create new columns and set their widths, as shown in Figure 11-6.
Figure 11-6. Configuring the extraction from the source files
When you are configuring the Advanced column properties, take care to select the right data type. The default is DT_WSTR, which is a Unicode or nvarchar data type. Clicking Suggest Types helps if you have many columns that are not character types. The name of the column is also set on the advanced properties. If you don't set the name, you will have a list of columns named Column 0 to Column n, which won't make it easy to maintain your package, so we strongly encourage you to supply names here.
Setting Up the Data Flow
Now you need to create the Data Flow task to extract the data from the CDR flat file, apply some transforms to create the keys in the fact table to link to the dimensions, and then insert the rows into the fact table in SQL Server.
We want to transform the originating and terminating country and area code into surrogate keys that reference our Geography dimension, and also transform the Call Type to a surrogate key that references our Time dimension, Date dimension, and Call Type dimension. We use a Lookup data flow transformation to do this.
We start the data flow by defining the flat file source to reference the CDR Input connection we defined previously. We connect its output to the first Lookup transform. Each Lookup transform performs one translation. The updated rows are passed on to the next Lookup transform, until all transforms have been completed. The last Lookup transform is connected to a SQL Server destination connection, which writes the rows to the data warehouse. Figure 11-7 shows the data flow.
Figure 11-7. Call fact table processing data flow
In a Lookup transform, SSIS tries to read as many rows as possible and perform its operations in memory. In the data flow shown above, each batch is processed by the first Lookup transform and then passed on to the next one. The first Lookup transform loads another batch, and the pipeline begins to fill as each Lookup transform passes its results to the next.
The Lookup data flow transform will try to use as much memory as is available, to enhance performance. You can restrict the amount of memory used for cache, but full caching (the default) is normally recommended if you have the memory.
If the reference table is very large, you may run out of memory and the lookup transform will fail. You can reduce the memory required by using a query instead of a table name for the reference table. In the query, select only the columns you are mapping to and the ones you want to look up. If that doesn't help, you can set the caching mode to Partial. Partial caching loads the rows from the reference table on demand (that is, when they are first referenced). In the Lookup transform that determines the PlanType, we have to reference the Customer table by the Origin Area Code and the Origin Local Number. Because the fact data is loaded so frequently, only a small number of customers show up in the CDR data, but those who do often make many calls. This is the kind of situation where partial caching may prove to be very useful. You set the caching mode either through the Advanced Editor or in the Properties window of the Lookup transform.
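Reducing the cache footprint with a query is simple: select only what the Lookup needs. A sketch, with assumed column names for the Customer reference table:

```sql
-- Lookup reference query: only the join keys and the one column we
-- retrieve, instead of the whole Customer table.
SELECT OriginAreaCode, OriginLocalNumber, PlanType
FROM dbo.Customer;
```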
To optimize the data flow, don't select columns that you do not need, either in the data source or in lookups. In a union all transform, you can drop columns that you no longer need.
Joins Versus Lookups
As discussed in Chapter 4, "Building a Data Integration Process," SSIS Lookup transformations provide many advantages over loading the new fact data into a staging area and then joining to the dimension table using SQL to fetch the surrogate keys. Lookups are easier to understand and maintain, and they enable you to handle errors when the reference table does not have a matching row. Joins are generally faster than lookups, but for smaller data volumes, we were willing to sacrifice some performance for the clarity and error handling, because the total difference in time was relatively small. For extremely large reference tables, a join may be so much more efficient that the time difference is substantial, and you have no choice but to implement a translation using a join. You should do some testing before committing to either solution. In the case of a join, implement the join as an outer join so that you don't lose any facts.
You could implement a join either by creating a view in the source database or by writing a query in the SSIS data source. Often, you don't have permission to create a view in the source database, so you must create the appropriate SQL statement in the SSIS data source. You create your own SQL statement by selecting SQL Command, instead of Table or View, in the connection manager. You can type in the SQL directly, or you can use a graphical wizard to help you create the query. But there are some issues with doing joins in the query in the OLE DB data source. You usually join incoming facts to the dimension tables in the data warehouse. You have to explicitly name the source database in the query, and this makes it difficult to migrate to other environments without changing your package.
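Such a join-based translation might look like the following sketch. The staging and dimension table names and key columns are assumptions for illustration; note the outer joins, which preserve facts that have no matching dimension row:

```sql
-- Fetch surrogate keys for staged facts with outer joins, so facts
-- without a matching dimension row come back with NULL keys and can
-- be routed to an error table rather than silently dropped.
SELECT s.TicketNumber,
       s.CallDate,
       og.GeographyKey AS OriginGeographyKey,
       tg.GeographyKey AS TerminatingGeographyKey,
       ct.CallTypeKey
FROM StagedCallDetail AS s
LEFT OUTER JOIN DimGeography AS og
  ON og.CountryCode = s.OriginCountryCode
 AND og.AreaCode    = s.OriginAreaCode
LEFT OUTER JOIN DimGeography AS tg
  ON tg.CountryCode = s.TerminatingCountryCode
 AND tg.AreaCode    = s.TerminatingAreaCode
LEFT OUTER JOIN DimCallType AS ct
  ON ct.CallTypeCode = s.CallTypeCode;
```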
In our case, we can't use a join because our data source is a flat file. We have the option of using the Bulk Insert task to load the flat file into a temporary table and then performing the joins to the dimensions using that table. You don't need to spend time to index the temporary table because every row will be read, so the indexes won't be used.
If you are using an OLE DB destination, choose the Data Access Mode Table with Fast Load option. This option offers a significant performance enhancement over the standard Table or View option. If you are running the package on the same server as SQL Server, you can choose a SQL Server destination, which will provide further performance improvements.
Partitioning Analysis Services Cubes
Cube partitions enhance performance and manageability. They are invisible to users of the cube, so you have complete freedom to design and manipulate partitions according to your needs. Partitions offer the following advantages when working with cubes:
Designing Cube Partitions
In our example, we want to be able to efficiently add new data, while minimizing the impact on cube users. Smaller partitions perform better than larger ones, although they may mean a bit of extra management overhead. Ten to 20 million facts per cube partition is a good rule to follow. A query that can be limited to one or a few partitions will run faster than one that has to search one large partition or many small ones. To improve performance, your objective is to create small partitions that cover the span of the most frequent queries. To improve maintainability, you want as few partitions as possible.
Weekly partitions are a good compromise between manageability and efficiency, so we'll implement uniform weekly partitions just as we did for the relational table.
Creating Cube Partitions
You can create cube partitions using the BI Development Studio or SQL Server Management Studio. You can also use the Integration Services Analysis Services Execute DDL task to execute commands to manage cube partitions. We'll use BI Development Studio while we're developing our solution so that the partition design is captured as part of our source code. You will first have to modify the default partition's source so that it doesn't reference the entire call detail table. Next, you create additional partitions as necessary according to your design.
Because you have multiple partitions, you need to have an independent table or view for each partition, or a query that is bound to the partition. The data returned by the tables, views, or queries must not overlap; otherwise, the cube will double-count those rows.
For all our partition sources, we are going to use Query Binding rather than Table Binding. Query Binding enables us to use a single fact table as the data source, and then a where clause to restrict the call detail rows to a date-range matching the partition. You would use Table Binding if each partition had its own separate relational table as a data source.
To make sure there is no data overlap in the queries for each partition, we use a where clause like this:
CallDate >= '1-Jan-2006' and CallDate < '8-Jan-2006'
The use of >= to start ensures we capture every call that begins after midnight on January 1. The use of < for the end date ensures we capture all the calls right up to, but not including, January 8. If we had used <= '7-JAN-2006', we might miss calls made on January 7 if the time component in the fact table wasn't 00:00. We don't expect the time to be anything else, but this is defensive design in case future developers modify (or ignore) the design assumptions. We also won't use clauses like
CallDate BETWEEN '1-Jan-2006' and '8-Jan-2006'
CallDate BETWEEN '8-Jan-2006' and '15-Jan-2006'
because the ranges overlap at the endpoints and the facts on January 8 would be duplicated in adjacent partitions (in this example, 8-Jan-2006 would be included by both of these where clauses).
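Putting this together, each partition's source query looks like the following sketch, with only the date range changing from one partition to the next:

```sql
-- Query binding for the partition covering the week of January 1.
-- The half-open range (>= start, < end) guarantees no gaps and no
-- overlap with the adjacent weekly partitions.
SELECT *
FROM dbo.[Call Detail]
WHERE CallDate >= '2006-01-01'
  AND CallDate <  '2006-01-08';
```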
Tip: Avoiding Gaps in Partition Ranges
A good practice to follow is to use the same value in the predicate for the boundary, but omit the equality comparison in one of the predicates. For the earlier partition, use less than (<), and for the later partition, use greater than or equal to (>=). The earlier partition includes facts up to but not including the boundary value; the later partition includes facts at the boundary value and greater. That way, there is absolutely no gap and absolutely no overlap. If there are gaps, you will have missing data; if there are overlaps, you will have duplicate data.
Tip: Create an Index on the Partitioning Column
To improve performance of cube partition processing, it is important to have an index on the partitioning column in the source table; otherwise, every row in the table will have to be read to process a cube partition.
Aggregations are used by Analysis Services to improve response time to queries that refer to a rollup of one or more measures. For example, Sales by month or quarter is logically rolled up by summing Sales by day. If Analysis Services precalculates the Sales by month, then to get Sales by quarter, only three values (for the three months in a quarter) need to be summed. If some aggregations are good, are more aggregations better for large cubes? Not necessarily!
The number of potential aggregations in your cube depends on the number of dimensions, attributes, and members in your dimensional model. It is not possible to precalculate all aggregations for most cubes because the number becomes astronomical quickly. The more aggregations that are stored, the better your performance, but it will also take longer to process new data, and partition size will increase.
Remember that you must specify the attribute relationships in your hierarchies for good aggregations to be created. This is particularly important in large databases. Refer back to Chapter 5, "Building an Analysis Services Database," to review how to set up the relationships.
Partitions can each have their own aggregation designs. This enables you to tailor the aggregations differently if you find, for example, that users pose different queries against historical data than for current data. However, a caveat applies here. When you merge partitions, they must have the same aggregation design. That is why we copied aggregation designs from other partitions when we created them.
In BI Development Studio, you can find the tool for designing aggregations on the Partitions tab of your cube. When you first design aggregations, Analysis Services creates a set of aggregations that it estimates will improve performance. You can specify how much space you are willing to provide for aggregations or how much performance improvement you want. You must limit the aggregations if the processing time becomes unacceptable. Consider starting with something like a 20 percent improvement.
Regardless of the initial aggregations you set up, when in production, you want to be sure the aggregations are appropriate for the queries that your users are sending. Analysis Services provides a means to do this, called Usage Based Optimization, or UBO. We describe how to use UBO in the section on "Managing the Solution" at the end of this chapter.
Large Dimension Support in SQL Server 2005
SQL Server 2005 Analysis Services has a much higher limit on the number of members that can belong to a node in a dimension compared with Analysis Services 2000. However, if you have a GUI interface that allows a user to drill into dimension levels, this might create a problem for the interface or the user, or both. You should consider very carefully whether you want to design such a large, flat dimension. If you can find a natural hierarchy for a dimension, it is usually better to implement it.
In our example, LocalNumber could be very large: seven digits means potentially ten million members. However, we don't need to use LocalNumber to solve our business problem, so we won't include it in a dimension. If it had been necessary, we could create a hierarchy that included the three-digit exchange number as one level and the four-digit subscriber number as a child of the exchange.