Consider a database service facility supported by a single CPU and two disks. Two measurement tools, a DBMS Performance Monitor and an OS Performance Monitor, provide the necessary workload characterization parameters. The DBMS Performance Monitor generates a log of the resource activity of every transaction, as indicated in Table 5.1.
The CPU times measured by the DBMS Performance Monitor only account for time spent by transactions in user mode. The time spent by transactions executing in kernel mode (i.e., executing operating system code) is not accounted for by this monitor. The disk I/O counts reflect the number of disk I/Os (i.e., hits, visits) that each transaction makes. While the DBMS Performance Monitor provides measurement data at the user transaction level, the OS Performance Monitor provides measurement data at the system device level. Specifically, the OS Performance Monitor records the utilization of the CPU and of each disk, as indicated in Table 5.2.
5.2.1 Preliminary Analysis of the Workload

The data provided in the DBMSLog worksheet of the DBServiceExample.xls MS Excel workbook contains an excerpt of the DBMS log for 200 transactions executed during a period of 150 seconds. Thus, the throughput of the database service facility during this interval is X_0 = 200/150 = 1.33 transactions per second (tps). A preliminary analysis of the log data collected by the DBMS Performance Monitor is depicted in Fig. 5.1, which shows statistics of the CPU and disk usage by the transactions. The first section of the table in Fig. 5.1 shows basic statistics such as mean, standard deviation, sample variance, and coefficient of variation. The first three statistics are computed using Excel's Data Analysis tool. The coefficient of variation, CV, is defined as the ratio of the standard deviation to the mean:

CV = standard deviation / mean    (5.2.1)
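The same statistics can be reproduced outside Excel. The sketch below computes the mean, sample standard deviation, and CV for a small hypothetical sample of per-transaction CPU times; the actual values come from the DBMSLog worksheet, and these eight numbers are illustrative only.

```python
import statistics

# Hypothetical sample of per-transaction CPU times in msec
# (the real data is in the DBMSLog worksheet).
cpu_times = [104.4, 151.6, 418.1, 95.2, 230.7, 88.9, 310.5, 142.3]

mean = statistics.mean(cpu_times)
stdev = statistics.stdev(cpu_times)  # sample standard deviation
cv = stdev / mean                    # coefficient of variation (Eq. 5.2.1)

print(f"mean = {mean:.1f} msec, stdev = {stdev:.1f} msec, CV = {cv:.3f}")
```

Note that `statistics.stdev` computes the sample standard deviation (dividing by n-1), matching Excel's STDEV rather than the population formula.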
Figure 5.1. Basic statistics for the database service workload.
The CV is an important statistic that provides additional information about the variability within the data. The standard deviation provides an absolute measure of variability, whereas the CV provides a relative measure. As a small example, the data sets {1, 4, 5} and {1001, 1004, 1005} have the same standard deviation, but very different CVs. A higher coefficient of variation indicates that more variability exists in the data set. In the example of Fig. 5.1, the standard deviation of the CPU time is almost 70% of the mean (i.e., CV = 0.696). The CV is also a good "rule-of-thumb" indicator of the underlying distribution of the measurements within the data set. Without providing details here, data coming from an exponential distribution has a CV = 1. Hyperexponential distributions have a CV > 1 and hypoexponential distributions have a CV < 1.

The middle section of the table of Fig. 5.1 shows the minimum and maximum values, as well as quartile information for the data sets. The first quartile (Q1) is the 25th percentile (i.e., a value such that 25% of the data lies below that value). The second quartile (Q2), also called the median, is the 50th percentile. The third quartile (Q3) is the 75th percentile. For example, 25% of the CPU times are below 104.4 msec, 50% are below 151.6 msec, and 75% are below 418.1 msec. The fourth quartile corresponds to the maximum value, with 100% of the data lying below it. An easy way to compute the p-th percentile of a data set using MS Excel is through the function PERCENTILE (e.g., the 25th percentile of an array of values is obtained as PERCENTILE(<array>, 0.25)). As a better indicator than the single CV statistic, the minimum and maximum values along with the quartiles provide a good indication of the distribution of the data. A graphical depiction of this is given by the Box-and-Whisker plot [5]. This type of plot, illustrated in Fig. 5.2, shows the minimum and maximum values of a data set, as well as the three quartiles.
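The quartiles obtained from Excel's PERCENTILE function can be reproduced with Python's standard library. The sample below is hypothetical; `method='inclusive'` selects the same linear interpolation scheme that Excel's PERCENTILE uses.

```python
import statistics

# Hypothetical per-transaction I/O counts (the real data is in DBMSLog).
ios = [12, 33, 45, 51, 63, 67, 72, 80, 95, 110]

# n=4 splits the data at the three quartiles; method='inclusive'
# matches the interpolation used by Excel's PERCENTILE function.
q1, q2, q3 = statistics.quantiles(ios, n=4, method='inclusive')
print(q1, q2, q3)  # -> 46.5 65.0 78.0
```

The three returned values correspond to PERCENTILE(<array>, 0.25), PERCENTILE(<array>, 0.50), and PERCENTILE(<array>, 0.75), respectively.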
Quartiles Q1 through Q3 are delimited by a box. A line corresponding to Q2 divides the box into two parts. Fifty percent of the data lies within the box. For instance, in the case of the total number of I/Os on disk 1, 50% of the transactions made between 33 and 72 I/Os on this disk. Of these transactions, half performed between 33 and 63 I/Os and the other half made between 63 and 72 I/Os on disk 1.

Figure 5.2. Box-and-Whisker plots for CPU time and number of I/Os on disks 1 and 2.
5.2.2 Workload Clustering

Another visualization of the data is depicted in Fig. 5.3 as an XY scatter plot showing the total number of I/Os by a transaction (i.e., at both disks) versus its CPU time. Thus, each transaction is represented by a dot in the plot. The picture shows three clear "clouds" (i.e., clusters) of transactions, indicating that the transactions can be clustered in three groups labeled as low CPU/low disk (cluster 1), high CPU/high disk (cluster 2), and medium CPU/high disk (cluster 3). This example, though realistic and representative, is somewhat artificial. In many actual environments, data does not fall into such obvious clusters. Often, outlier points (i.e., transactions) are observed. Whether to ignore these points as renegades or to include them as special small clusters is a design decision. Determining the best number and the best clusters to most accurately represent the workload is far from a trivial task.

Figure 5.3. Total number of I/Os vs. CPU time (msec).
The means of the workload clusters, obtained using the k-means clustering algorithm described in Chapter 4 and implemented in the kmeans.xls MS Excel workbook, are shown in Table 5.3.
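For readers who prefer code to the kmeans.xls workbook, a minimal k-means sketch is shown below. It is not the book's implementation: the function name, the fixed random seed, and the choice of raw (unscaled) (CPU time, total I/Os) features are illustrative assumptions; in practice the features are often normalized first so that one dimension does not dominate the distance metric.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal k-means sketch; points are (cpu_time, total_ios) tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k points as initial centroids
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = tuple(sum(coord) / len(c) for coord in zip(*c))
    return centroids, clusters

# Hypothetical transactions forming three rough "clouds",
# mimicking the pattern of Fig. 5.3.
points = [(10, 5), (12, 6), (11, 5),        # low CPU / low disk
          (400, 150), (410, 160), (405, 155),  # high CPU / high disk
          (200, 140), (210, 150)]           # medium CPU / high disk
centroids, clusters = kmeans(points, k=3)
print(centroids)
```

Each returned centroid is the mean (CPU time, total I/Os) of one cluster, i.e., the kind of per-cluster means reported in Table 5.3.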
