Consider a database service facility supported by a single CPU and two disks. Two measurement tools, a DBMS Performance Monitor and an OS Performance Monitor, provide the necessary workload characterization parameters. The DBMS Performance Monitor generates a log of the resource activity of every transaction, as indicated in Table 5.1.
The CPU times measured by the DBMS Performance Monitor only account for time spent by transactions in user mode. The time spent by transactions executing in kernel mode (i.e., executing operating system code) is not accounted for by this monitor. The disk I/O counts reflect the number of disk I/Os (i.e., hits, visits) that each transaction makes. While the DBMS Performance Monitor provides measurement data at the user transaction level, the OS Performance Monitor provides measurement data at the system device level. Specifically, the OS Performance Monitor records the utilization of the CPU and of each disk, as indicated in Table 5.2.
5.2.1 Preliminary Analysis of the Workload

The data provided in the DBMSLog worksheet of the DBServiceExample.xls MS Excel workbook contains an excerpt of the DBMS log for 200 transactions executed during a period of 150 seconds. Thus, the throughput of the database service facility during this interval is X_0 = 200/150 = 1.33 transactions per second (tps). A preliminary analysis of the log data collected by the DBMS Performance Monitor is depicted in Fig. 5.1, which shows statistics of the CPU and disk usage by the transactions. The first section of the table in Fig. 5.1 shows basic statistics such as mean, standard deviation, sample variance, and coefficient of variation. The first three statistics are computed using Excel's Data Analysis tool. The coefficient of variation, CV, is defined as the ratio of the standard deviation to the mean:

CV = standard deviation / mean    (5.2.1)
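The same statistics can be reproduced outside Excel. The sketch below computes the mean, sample standard deviation, and CV for a small hypothetical sample of per-transaction CPU times; the actual values come from the DBMSLog worksheet, and these eight numbers are illustrative only.

```python
import statistics

# Hypothetical sample of per-transaction CPU times in msec
# (the real data is in the DBMSLog worksheet).
cpu_times = [104.4, 151.6, 418.1, 95.2, 230.7, 88.9, 310.5, 142.3]

mean = statistics.mean(cpu_times)
stdev = statistics.stdev(cpu_times)  # sample standard deviation
cv = stdev / mean                    # coefficient of variation (Eq. 5.2.1)

print(f"mean = {mean:.1f} msec, stdev = {stdev:.1f} msec, CV = {cv:.3f}")
```

Note that `statistics.stdev` computes the sample standard deviation (dividing by n-1), matching Excel's STDEV rather than the population formula.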
Figure 5.1. Basic statistics for the database service workload.
The CV is an important statistic that provides additional information about the variability within the data. The standard deviation provides an absolute measure of variability, whereas the CV provides a relative measure. As a small example, the data sets {1, 4, 5} and {1001, 1004, 1005} have the same standard deviation, but very different CVs. A higher coefficient of variation indicates that more variability exists in the data set. In the example of Fig. 5.1, the standard deviation of the CPU time is almost 70% of the mean (i.e., CV = 0.696). The CV is also a good "rule-of-thumb" indicator of the underlying distribution of the measurements within the data set. Without providing details here, data coming from an exponential distribution has a CV = 1. Hyperexponential distributions have a CV > 1 and hypoexponential distributions have a CV < 1.

The middle section of the table of Fig. 5.1 shows the minimum and maximum values, as well as quartile information for the data sets. The first quartile (Q1) is the 25th percentile (i.e., a value such that 25% of the data lies below that value). The second quartile (Q2), also called the median, is the 50th percentile. The third quartile (Q3) is the 75th percentile. For example, 25% of the CPU times are below 104.4 msec, 50% are below 151.6 msec, and 75% are below 418.1 msec. The fourth quartile corresponds to the maximum value, with 100% of the data lying below it. An easy way to compute the p-th percentile of a data set using MS Excel is through the function PERCENTILE (e.g., the 25th percentile of an array of values is obtained as PERCENTILE(<array>, 0.25)). As a better indicator than the single CV statistic, the minimum and maximum values along with the quartiles provide a good indication of the distribution of the data. A graphical depiction of this is given by the Box-and-Whisker plot [5]. This type of plot, illustrated in Fig. 5.2, shows the minimum and maximum values of a data set, as well as the three quartiles.
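The quartiles obtained from Excel's PERCENTILE function can be reproduced with Python's standard library. The sample below is hypothetical; `method='inclusive'` selects the same linear interpolation scheme that Excel's PERCENTILE uses.

```python
import statistics

# Hypothetical per-transaction I/O counts (the real data is in DBMSLog).
ios = [12, 33, 45, 51, 63, 67, 72, 80, 95, 110]

# n=4 splits the data at the three quartiles; method='inclusive'
# matches the interpolation used by Excel's PERCENTILE function.
q1, q2, q3 = statistics.quantiles(ios, n=4, method='inclusive')
print(q1, q2, q3)  # -> 46.5 65.0 78.0
```

The three returned values correspond to PERCENTILE(<array>, 0.25), PERCENTILE(<array>, 0.50), and PERCENTILE(<array>, 0.75), respectively.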
Quartiles Q1 through Q3 are delimited by a box. A line corresponding to Q2 divides the box into two parts. Fifty percent of the data lies within the box. For instance, in the case of the total number of I/Os on disk 1, 50% of the transactions made between 33 and 72 I/Os on this disk. Of these transactions, half performed between 33 and 63 I/Os and the other half made between 63 and 72 I/Os on disk 1.

Figure 5.2. Box-and-Whisker plots for CPU time and number of I/Os on disks 1 and 2.
5.2.2 Workload Clustering

Another visualization of the data is depicted in Fig. 5.3 as an XY scatter plot showing the total number of I/Os by a transaction (i.e., at both disks) versus its CPU time. Thus, each transaction is represented by a dot in the plot. The picture shows three clear "clouds" (i.e., clusters) of transactions, indicating that the transactions can be clustered in three groups labeled as low CPU/low disk (cluster 1), high CPU/high disk (cluster 2), and medium CPU/high disk (cluster 3). This example, though realistic and representative, is somewhat artificial. In many actual environments, data does not fall into such obvious clusters. Often, outlier points (i.e., transactions) are observed. Whether to ignore these points as renegades or to include them as special small clusters is a design decision. Determining the best number and the best clusters to most accurately represent the workload is far from a trivial task.

Figure 5.3. Total number of I/Os vs. CPU time (msec).
The means of the workload clusters, obtained using the k-means clustering algorithm described in Chapter 4 and implemented in the kmeans.xls MS Excel workbook, are shown in Table 5.3.
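For readers who prefer code to the kmeans.xls workbook, a minimal k-means sketch is shown below. It is not the book's implementation: the function name, the fixed random seed, and the choice of raw (unscaled) (CPU time, total I/Os) features are illustrative assumptions; in practice the features are often normalized first so that one dimension does not dominate the distance metric.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal k-means sketch; points are (cpu_time, total_ios) tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k points as initial centroids
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = tuple(sum(coord) / len(c) for coord in zip(*c))
    return centroids, clusters

# Hypothetical transactions forming three rough "clouds",
# mimicking the pattern of Fig. 5.3.
points = [(10, 5), (12, 6), (11, 5),        # low CPU / low disk
          (400, 150), (410, 160), (405, 155),  # high CPU / high disk
          (200, 140), (210, 150)]           # medium CPU / high disk
centroids, clusters = kmeans(points, k=3)
print(centroids)
```

Each returned centroid is the mean (CPU time, total I/Os) of one cluster, i.e., the kind of per-cluster means reported in Table 5.3.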
