The queuing network model of Fig. 5.4 is now completely parameterized using the service demands of Table 5.5 and the class arrival rates from Eq. (5.3.9). [Note: The transaction arrival rates must equal the transaction completion rates (i.e., X0,1, X0,2, and X0,3) since this is a lossless system and the flow into the database server must equal the flow out of the database server.] The parameterized model is solved using OpenQN-Chap5.xls. The residence times for all classes and queues as well as the response time per class are shown in Table 5.6.
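The computation performed when the parameterized model is solved can be sketched as follows. This is a minimal illustration, assuming the standard solution of an open multiclass model with load-independent queuing devices (device utilization is the sum over classes of arrival rate times service demand, and residence time is service demand divided by one minus utilization). The function name `solve_open_qn` and all numeric inputs are hypothetical; they are not the values of Table 5.5 or Eq. (5.3.9), and the sketch is not a description of how OpenQN-Chap5.xls is implemented.

```python
from typing import Dict, List, Tuple


def solve_open_qn(
    demands: Dict[str, List[float]], arrivals: List[float]
) -> Tuple[Dict[str, float], Dict[str, List[float]], List[float]]:
    """Solve an open multiclass QN with load-independent queuing devices.

    demands[device][r] is the service demand (seconds) of class r at device.
    arrivals[r] is the arrival rate (transactions per second) of class r.
    Returns (utilization per device, residence times per device and class,
    response time per class).
    """
    utilization: Dict[str, float] = {}
    residence: Dict[str, List[float]] = {}
    for device, d in demands.items():
        # Total utilization of the device over all classes.
        u = sum(lam * dr for lam, dr in zip(arrivals, d))
        if u >= 1.0:
            raise ValueError(f"{device} is saturated (U = {u:.3f})")
        utilization[device] = u
        # Residence time of each class at this device.
        residence[device] = [dr / (1.0 - u) for dr in d]
    # Response time of a class is the sum of its residence times.
    response = [
        sum(residence[device][r] for device in demands)
        for r in range(len(arrivals))
    ]
    return utilization, residence, response


# Hypothetical two-device, two-class example (NOT the case-study data).
demands = {"cpu": [0.10, 0.30], "disk1": [0.20, 0.40]}
arrivals = [1.0, 0.5]
utilization, residence, response = solve_open_qn(demands, arrivals)
print(utilization, response)
```

Raising an error on saturation is a design choice of this sketch: an open model has no steady-state solution once any device utilization reaches 100%.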
As indicated in Table 5.6, response times for classes 2 and 3 are relatively high. Moreover, disk 1 is the resource where all classes spend most of their time. This is not surprising since disk 1 has the highest overall utilization (see Table 5.2). Thus disk 1 is the system bottleneck. According to Table 5.6, transactions from classes 1, 2, and 3 spend over 40%, 44%, and 67% of their time, respectively, using or waiting to use disk 1. Note also that the ratios of the residence times at disk 1 to the corresponding service demands are quite high: 4.0 for each class. In other words, the time a transaction spends at disk 1 is four times its total service time on that disk. Stated equivalently, a transaction's total waiting time at disk 1 is three times its total service time. In comparison, the corresponding waiting times at the CPU and disk 2 are only 0.8 and 1.9 times the class service times, respectively. Therefore, to improve performance (i.e., to reduce response times), the most effective strategy is to reduce the time spent on I/O, particularly at disk 1.
This baseline model can be used to evaluate relevant "what-if" scenarios. It has already been noted that performance would improve by upgrading the disk subsystem. As the workload intensity changes over time, it is also important to understand the resulting predicted performance. For example, suppose that the workload intensity is predicted to change over the next ten months as indicated in Table 5.7. (Note that January's workload, i.e., X0 = 1.33 tps, is the baseline workload intensity assumed up to this point.)
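The residence-time ratios quoted above for disk 1, the CPU, and disk 2 are consistent with the standard expression for a load-independent queue in an open model. The relation is stated here as an assumption about the underlying formula; the notation (R' for residence time, D for service demand, U for total device utilization, i for the device, and r for the class) is introduced only for this illustration.

```latex
\frac{R'_{i,r}}{D_{i,r}} = \frac{1}{1 - U_i},
\qquad
\frac{R'_{i,r} - D_{i,r}}{D_{i,r}} = \frac{U_i}{1 - U_i}
```

Under this relation, a residence-to-demand ratio of 4.0 (waiting time equal to three times the service time) corresponds to a utilization of 0.75, while waiting ratios of 0.8 and 1.9 correspond to utilizations of roughly 0.44 and 0.66. Because the right-hand sides depend only on the device utilization, the ratio is the same for every class at a given device, which is why a single figure per device suffices.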
5.4.1 Adding a Third Disk
To cope with the increasing expected workload (i.e., Table 5.7) and to address the bottleneck, it is proposed to add to the database server a new disk equivalent to the two existing disks. The performance analyst also decides that the I/O activity should be balanced across the three disks to further improve I/O performance. Balancing the load is achieved by distributing the file requests across the three disks so that the service demand of any class is the same at all three disks. Thus, the new values of the service demands of each class r for disks 1 through 3 are computed as
$$D_{\text{disk } j,\, r}^{\text{new}} = \frac{D_{\text{disk 1},\, r}^{\text{old}} + D_{\text{disk 2},\, r}^{\text{old}}}{3}, \qquad j = 1, 2, 3 \tag{5.4.10}$$
These new values of the service demands are shown in Table 5.8.
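The balancing rule described above can be sketched in a few lines. The sketch assumes that the total I/O service demand of each class is unchanged by the move (the new disk is equivalent to the existing ones) and is simply split evenly. The function name `balance_disk_demands` and the demand values are hypothetical and are not taken from Table 5.5 or Table 5.8.

```python
from typing import List


def balance_disk_demands(
    old_disk_demands: List[List[float]], n_disks: int
) -> List[List[float]]:
    """Spread each class's total I/O demand evenly over n_disks identical disks.

    old_disk_demands[i][r] is the demand (seconds) of class r at old disk i.
    Returns a list with one row per new disk; all rows are equal.
    """
    n_classes = len(old_disk_demands[0])
    # Total I/O demand of each class over the existing disks.
    totals = [
        sum(disk[r] for disk in old_disk_demands) for r in range(n_classes)
    ]
    per_disk = [total / n_disks for total in totals]
    return [list(per_disk) for _ in range(n_disks)]


# Hypothetical demands in seconds (rows: disk 1, disk 2; columns: classes 1-3).
old = [[0.30, 0.60, 0.90], [0.15, 0.30, 0.60]]
new = balance_disk_demands(old, 3)
print(new)
```

After balancing, every disk has the same utilization, so no single disk saturates before the others as the workload grows.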
The new response time values for the three classes are obtained with the help of OpenQN-Chap5.xls. (Note: Remember to reinitialize the model to indicate that there are now four queues.) The results are shown in the top part of Table 5.9 for the first five months (January-May) of the predicted new workload intensity levels. Comparing the first line of Table 5.9 (i.e., the January intensity of 1.33 tps) against the baseline class response times in Table 5.6 shows that simply adding the third disk and balancing the I/O load across the disks yields a performance improvement (i.e., response time reduction) of over 35%. However, by May (i.e., a workload intensity of 2.68 tps) the predicted performance is unacceptably poor, with the anticipated class 2 response time exceeding 30 seconds.
By looking at the ResidenceTimes worksheet of OpenQN-Chap5.xls, it is seen that with the third disk, the CPU is the resource where class 1 and class 2 transactions spend most of their time. That is, for these classes, the disks are no longer the bottleneck; the CPU is now the resource that most limits performance. To maintain an acceptable response time from, say, April (i.e., 2.26 tps) on, it is necessary to reduce contention on the CPU. One alternative is to replace the current CPU with a faster processor. Another alternative is to upgrade the system to a multiprocessor by adding a second CPU. This second scenario is considered in the next subsection.
5.4.2 Using a Dual CPU System
To maintain an acceptable QoS level past April, an additional CPU is proposed, in a dual CPU configuration, on top of the extra disk and the balanced I/O load. To analyze the effects of this change using OpenQN-Chap5.xls, the CPU queue is specified as MP2 (i.e., a multiprocessor with two CPUs). The results are shown in the middle of Table 5.9 for the April and May workloads (i.e., 2.26 tps and 2.68 tps). The largest reduction in response time, as expected, is for classes 1 and 2, the ones that spend more time at the CPU. However, the improvements are relatively minor, and the response times are still very high for classes 2 and 3 for an arrival rate of 2.68 tps (i.e., May's workload). An analysis of the residence time breakdown indicates that with the dual CPU, all three classes spend most of their time at the disks. That is, adding the second CPU shifted the system bottleneck back to the disks. The next step is to improve disk access performance further.
5.4.3 Using Faster Disks
A more dramatic upgrade is considered here in the hope of coping with the increasing workload intensities. Each disk is replaced by one that is three times faster. This is reflected in the model parameters by simply dividing the service demands on all disks by a factor of three.
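Dividing the disk service demands by three has a larger effect on residence time than the factor of three alone suggests, because the disk utilization drops by the same factor and queuing delay shrinks with it. The sketch below illustrates this for one class at one disk, assuming the load-independent open-model relation (residence time equals demand divided by one minus utilization) and unchanged arrival rates. The function name `speed_up_disk` and the numbers are hypothetical and are not the case-study values.

```python
from typing import Tuple


def speed_up_disk(
    demand: float, utilization: float, factor: float
) -> Tuple[float, float, float]:
    """Return (new demand, new utilization, new residence time) at one disk.

    Assumes every class's demand at this disk is divided by `factor` and
    arrival rates are unchanged, so total utilization also divides by `factor`.
    """
    new_demand = demand / factor
    new_utilization = utilization / factor
    new_residence = new_demand / (1.0 - new_utilization)
    return new_demand, new_utilization, new_residence


# Hypothetical values: 0.30 s demand for one class, disk 60% utilized overall.
old_demand = 0.30
old_utilization = 0.60
old_residence = old_demand / (1.0 - old_utilization)

new_demand, new_utilization, new_residence = speed_up_disk(
    old_demand, old_utilization, 3.0
)
# Ratio of old to new residence time at this disk.
improvement = old_residence / new_residence
print(old_residence, new_residence, improvement)
```

With these illustrative inputs the residence time falls by more than the threefold reduction in service demand, since both the numerator and the congestion term improve.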
Solving the model with these new disk speeds yields the results shown in the lower middle part of Table 5.9. The results indicate that acceptable response times are obtained for arrival rates up to 4.78 tps (i.e., through August). However, for 5.74 tps (i.e., September's workload), the response time for class 2 exceeds 10 seconds. A look at the residence times for class 2 reveals that 87% of its response time is spent at the CPU. This indicates the need to further enhance the CPU.
5.4.4 Moving to a 4-CPU System
To reduce the response time of class 2 transactions at high arrival rates, the dual CPU is replaced by a quad CPU. This change is reflected in the model by specifying the type of the CPU as MP4. The results are shown in the lower portion of Table 5.9. With this final upgrade, the model indicates that the response times are at acceptable service levels for all classes throughout the ten-month period of concern. Figure 5.5 illustrates how the response time varies for class 2 transactions (i.e., those transactions that have the highest CPU and I/O demands, resulting in the highest response times) for each of the scenarios. If a 10-second service level is deemed "acceptable", then an appropriate capacity planning strategy becomes apparent: 1) in January, purchase an additional disk and balance the load across the three disks by moving files, 2) in May, exchange the disks for ones that are three times faster and purchase an additional CPU, 3) in September, purchase two additional CPUs. Though the parameters and time frames may change, this case study illustrates the usefulness of a quantitative analysis using a queuing network modeling approach.
Figure 5.5. Class 2 response times for various scenarios.
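The scenario-by-scenario search for the workload intensity at which a service level is first exceeded can be sketched as a simple sweep. This is a minimal illustration, assuming load-independent queuing devices and assuming that the class mix stays fixed as the total arrival rate grows (an assumption of this sketch, not a statement from the case study). The function names, demands, arrival rates, and the 2-second threshold are hypothetical; multiprocessor (MP2, MP4) queues are not modeled here.

```python
from typing import List, Optional


def class_response_times(
    demands: List[List[float]], arrivals: List[float]
) -> Optional[List[float]]:
    """Per-class response times, or None if any device is saturated.

    demands[i][r] is the service demand (seconds) of class r at device i.
    """
    response = [0.0] * len(arrivals)
    for d in demands:
        u = sum(lam * dr for lam, dr in zip(arrivals, d))
        if u >= 1.0:
            return None
        for r, dr in enumerate(d):
            response[r] += dr / (1.0 - u)
    return response


def first_violation(
    demands: List[List[float]],
    base_arrivals: List[float],
    total_rates: List[float],
    sla_seconds: float,
) -> Optional[float]:
    """First total arrival rate whose worst class response time exceeds the SLA.

    Arrival rates are scaled proportionally (fixed class mix, an assumption).
    Returns None if no listed rate violates the SLA.
    """
    base_total = sum(base_arrivals)
    for total in total_rates:
        scale = total / base_total
        arrivals = [lam * scale for lam in base_arrivals]
        response = class_response_times(demands, arrivals)
        if response is None or max(response) > sla_seconds:
            return total
    return None


# Hypothetical two-device, two-class system (NOT the case-study data).
demands = [[0.10, 0.30], [0.20, 0.40]]
base_arrivals = [1.0, 0.5]
violation = first_violation(demands, base_arrivals, [1.5, 2.0, 3.0, 3.6], 2.0)
print(violation)
```

Treating a saturated device as a violation is a design choice: once any utilization reaches 100%, the open model predicts unbounded response times, so the service level cannot be met.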
The remaining sections of this chapter discuss the important issue of monitoring. Monitoring tools are used to obtain input parameters for performance models from measurement data. They are also used to validate performance predictions made by the models.