A crucial aspect when designing an MDC table is to choose the right set of dimensions for clustering the table and the right block size to minimize the space utilization. If the dimensions and block sizes are chosen appropriately, the benefits of the clustering can translate into significant performance and maintenance advantages. On the other hand, if the dimensions and/or block sizes are chosen incorrectly, performance can be degraded, and the space utilization could increase dramatically. Some alternatives that can be exploited to organize the table are:
One or more of the above techniques can be used in conjunction with detailed knowledge of the application and the data to identify the best organization for the MDC table. The following sections describe these aspects of the design of MDC tables in more detail.
Identify Candidate Dimension Attributes
The first step is to identify candidate dimension attributes for the table. The main criterion in choosing the candidates for the dimension column(s) is the need for clustering the data, based on a set of queries. The set of queries (either known or expected) should be examined for one or more of the following types of clauses or conditions that are good candidates for clustering:
A set of queries may have several candidate attributes that satisfy the criteria above. In this case, it is important to rank these attributes based on the workload characteristics to be able to choose an appropriate subset of the candidates for consideration. Once the number of candidates has been narrowed down, iterate through the following steps with variations of the set of dimension candidates before deciding on an appropriate selection.
Estimating the Number of Cells
Given a candidate dimension set, it is important to identify how many potential cells are likely to occur. The following query can be used to accurately determine the number of cells if a test table or an existing non-MDC table is available.
select distinct dimcol1, dimcol2,....dimcoln from <table_name>
If the table does not exist, this can be accomplished by estimating the number of unique combinations of the dimension attributes. An estimate for the number of cells can be made by multiplying the cardinalities of the dimension columns . However, this estimate can be inaccurate if there are correlations between the dimension columns.
Let C represent the estimated number of cells and R represent the estimated number of rows in the table. The number of cells estimated by itself is not a crucial performance indicator. If disk usage is a concern, then when C is a relatively large fraction of R, this indicates that there may be a large amount of disk space being used and/or wasted .
Excessive Space May Be Used and/or Wasted If C/R Is Large
Each cell may have one partially filled block assigned to it, and these partially empty blocks can occupy a relatively large set of pages if C is large. One solution to this is to reduce the block size (table space extent size) so that less space is used and potentially wasted. Another solution is to reduce the number of cells by either reducing the number of dimensions or by using a generated expression to increase the granularity of the cells for one or more dimensions.
In this case, there could potentially be up to 100,000 partially filled blocks or cells, using 3.2 million pages. This is a lot of disk space. If the extent size for the table space is instead set to four pages, the total space used would then be reduced to 400,000 pages.
The number of cells (C) could also be reduced to 10,000 (100,000 / 10 = 10,000) by using a generated column (GENSHIP) that is equal to SHIPPEDDATE/10, and the table could be organized using GENSHIP and PARTNUM.
The above discusses dividing SHIPPEDDATE by 10 instead of 12, or 20, etc., in order to keep the generated column monotonic. Monotonicity will be discussed in detail later in this chapter.
The number of cells (C) could also be reduced by not using the column PARTNUM for the table's dimension. This would reduce the values of C to 1,000.
Estimating the Actual Space Usage per Cell
Assuming that there is no skew in the underlying data, the average number of rows per cell (RPC) can be estimated as:
RPC = R / C
It is also possible to factor in the data skew and identify the average number of rows for the less frequent attribute values.
The space usage per cell (SPC) can be estimated as follows :
SPC = RPC * Average_Row_Size
Because the smallest page size supported by DB2 is 4 KB, having an SPC of less than 4,000 bytes times the extent size (or block size) will result in wasted space.
For a Low Value of SPC
Based on the example above, if C is reduced from 100,000 to 10,000 or even 1,000, this will increase the SPC to 10,000 or 100,000 bytes and allow for at least one reasonably sized block to be filled. If SPC is 100,000, an extent size of 16 and a page size of 4 KB will still utilize one full block.
Combining this with a relatively small value for C provides a good starting point for designing an MDC table.
For a High Value of SPC
When the value for SPC is high, each cell normally will require many blocks. In most situations, this will be acceptable from a performance perspective, as long as the choices of dimensions, granularity, and block size are appropriate. However, the performance may be further improved as follows:
This will require approximately 10,000,000/16,000, or 625 blocks per cell. Increasing the page size to 16 KB would decrease this to approximately 160 blocks per cell. Additionally, increasing the block size (table space extent size) to 32 can decrease this to 20 blocks per cell.
If the number of blocks per cell is too large, even after adjusting their sizes, then the number of cells (C) should be increased, if possible. This is best achieved by adding the next candidate dimension to the set.
The following sections describe the effect of the factors affecting MDC table layout in additional detail.
Varying the Number of Dimension Keys
The query workload analysis phase may identify a set of potential dimension keys. Note that a dimension key can be a composite set, i.e., (CITY, STATE) can be combined to form one dimension. In contrast, CITY, STATE implies that there are two dimensions. If city and state are always used together in the QUERY workload, using an aggregate dimension set is a good choice because this will automatically build the slices based on the combination of these values. However, if city or state is used independently, it is better to have two separate dimensions on CITY and STATE so that each slice can be examined independently.
Given a set of candidate dimension keys, the next step is to order them qualitatively, based on their priority. Next, identify the column cardinalities for each key by using the available statistics or by using a SELECT distinct query against a set of test data.
A method for choosing the table's dimensions is illustrated below:
Based on the example discussed previously:
If the SPC for SHIPPEDDATE is too large, add the PARTNUM column to the dimension. If the SPC is still too large, add the DISCOUNT column to the dimension.
If the SPC for SHIPPEDDATE is too low, a generated column can be built on the year and month of the SHIPPEDDATE, using the expression integer(SHIPPEDDATE) / 100. The function integer(DATE) returns the numeric form of the date as an integer data type, i.e., 2002-11-04 is returned as 20021104. Dividing this value by 100 yields an integer, 200211, that represents the date's year and month value.
Varying the Granularity of a Dimension Key
If one of the candidate dimension keys has too many possible unique values, a rollup technique can be used to decrease the granularity or cardinality of the dimension candidate in several cases. For example, the candidate dimension SHIPPEDDATE has a column cardinality that is quite high. A generated expression based on the YEARANDMONTH of the SHIPPEDDATE can be added to the table and used as a dimension for the table.
To add the column YEARANDMONTH to the table, add the following line to the create table statement:
YEARANDMONTH integer generated always as integer(SHIPPEDDATE)/100
The YEARANDMONTH column can then be used as the table's dimension, as follows:
organize by (YEARANDMONTH)
This will reduce the value of C by a factor of roughly 30. If this is not sufficient, the data can be reduced to a YEAR by using the Year function or by using the function integer(SHIPPEDDATE)/10000.
There are two main points to remember when using granularity rollup.
Varying the Block Size and Page Size Values
Given a value for SPC (the space occupied per cell), it is possible to adjust the page size and extent size for the table space to arrive at an optimal number of full blocks. In order to realize the full flexibility possible in the design of an MDC table, the table space's parameters may also need to be set accordingly , unlike regular tables. Consider the following scenario:
The value of SPC is relatively low and does not even fill a block, given the default values for the page size (4 KB) and extent size (16 or 32 pages). The extent size can be reduced to a lower number (i.e., 8 KB) and the SPC recalculated to verify whether this results in better block space occupancy . Although further reductions in the extent size could help, all the way to the smallest value of 2, it is important to consider the tradeoff of I/O overhead when using very small extent sizes. In these cases, any of the other techniques, such as varying the granularity or reducing the number of dimensions, should be considered , as well.
If the value of SPC is relatively large and can accommodate several blocks, consider increasing the page size and the block size parameters to arrive at an optimal value of blocks per cell. If the value of SPC is high, it might be possible to add another dimension to the table and obtain some additional performance benefits.
Using a Regular Table as a Baseline for the MDC Table Design
Because MDC is new in DB2 Version 8, regular tables in existing databases may be good candidates for organizing by dimensions. Therefore, a comparison can be made between an MDC table and a regular table while performing the conversion. While designing an MDC table, a regular table can provide a good indication of the lower bounds on the table's space utilization, especially if it has been reorganized or reloaded recently. In particular, the space utilization statistic FPAGES can be used to identify whether the clustering choices for an MDC table are reasonable.
Based on the example discussed previously, with the MDC version of the parts table named PART_MDC and the regular version of the table named PARTS, if the PARTS_MDC table is initially clustered based on the SHIPPEDDATE column, consider the following statistics after loading both tables with the same amount of data.
In this case, space is being underutilized in the MDC table PARTS_MDC. Following the tuning guidelines described above, utilize a granularity rollup on the dates to YEARANDMONTH. After rebuilding the PARTS_MDC table with the dimension based on YEARANDMONTH, the statistics would look like:
To save more space, the extent size for the table space can be reduced. After creating a new table space with a smaller extent size and recreating the table PARTS_MDC in the new table space, the statistics would look like:
The exercise of rebuilding a table and loading data is time-consuming ; therefore, it is more efficient and quicker to use the analytical techniques described above.
Summary of MDC Design Guidelines
The following list summarizes the design considerations for MDC tables: