6.7 Experimental Comparison of Two Servers

Performance engineering involves experimentation in addition to modeling. Different alternatives can be compared by designing experiments, conducting experiments, analyzing the results, and drawing conclusions.

Many factors may affect the performance of the computer systems being compared, and each factor may have several levels. Consider the purchase of a new Web server. Several factors directly influence the performance of the Web server, including processor speed, number of processors, and amount of main memory. Each of these factors may have more than one level, as indicated in Table 6.6.

Table 6.6. Example of Web Server Options
Factor	Levels
Processor Speed (GHz)	2.0, 2.4, 2.8, 3.1
Number of Processors	1, 2, 4, 8
Main Memory (GB)	1, 2, 4

An exhaustive evaluation of all options, considering all possible combinations of factors and levels, would require 48 (= 4 x 4 x 3) different experiments. This is called a full factorial design evaluation. The number of experiments in a full factorial design evaluation may be too large, making the experimental process time-consuming and expensive. A significant reduction in the number of experiments is achieved by reducing the number of levels of each factor and/or eliminating factors that do not make a significant contribution to overall performance.
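As a minimal sketch, the full factorial design for Table 6.6 can be enumerated by taking the Cartesian product of the factor levels. The dictionary keys and the variable names are illustrative choices, not names from the text.

```python
from itertools import product

# Factor levels copied from Table 6.6 (key names are illustrative).
levels = {
    "processor_speed_ghz": [2.0, 2.4, 2.8, 3.1],
    "num_processors": [1, 2, 4, 8],
    "main_memory_gb": [1, 2, 4],
}

# A full factorial design runs one experiment per combination of levels.
full_factorial = [
    dict(zip(levels, combo)) for combo in product(*levels.values())
]

print(len(full_factorial))  # 48
```

Each element of full_factorial describes one server configuration to be measured.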

A method for eliminating factors that are less relevant is a 2^k factorial design. The basic idea is to consider only two levels for each of the k factors, so that only 2^k experiments are needed. For the three factors of Table 6.6, this means 2^3 = 8 experiments instead of 48. When factors affect performance monotonically (e.g., performance improves monotonically as the processor speed increases), the minimum and maximum levels of each factor are evaluated to determine whether or not the factor has a significant performance impact. For example, increasing the processor speed from 2.0 GHz to 3.1 GHz improves performance. By conducting experiments for two levels only, 2.0 GHz and 3.1 GHz, the effect of this factor can be determined.
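The reduction from a full factorial design to a 2^k design can be sketched as follows, assuming the minimum and maximum of each factor in Table 6.6 are the two levels retained. The names are again illustrative.

```python
from itertools import product

# Factor levels copied from Table 6.6 (key names are illustrative).
levels = {
    "processor_speed_ghz": [2.0, 2.4, 2.8, 3.1],
    "num_processors": [1, 2, 4, 8],
    "main_memory_gb": [1, 2, 4],
}

# Keep only the minimum and maximum level of each factor.
two_levels = {name: (min(vals), max(vals)) for name, vals in levels.items()}

# The 2^k design is the Cartesian product of the reduced level sets.
design_2k = [
    dict(zip(two_levels, combo)) for combo in product(*two_levels.values())
]

print(len(design_2k))  # 8
```

This sketch only lists the configurations to run; judging whether a factor is significant still requires measuring each configuration and analyzing the results.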

An important aspect of an experiment is the workload and the initial conditions used. For example, an effective approach is to select a representative workload and replay it on the system under each configuration. Care must be taken to ensure that different initial conditions, such as the contents of various caches and buffers, do not distort the results.

When analyzing the results of experiments there is always some degree of experimental error. This error contributes to variation within the measured results. Experimental error may come from non-controllable factors that may affect the results. Examples include extraneous load on the network, caching activity by file systems, garbage collection activities, paging activities, and other operating system background management activities. Thus, the variation in the results is due to: 1) different levels of the design factors involved, 2) interaction between factors, and 3) experimental error.

A technique known as ANOVA (Analysis of Variance) can be used to separate the observed variation into two main components: variation that can be attributed to assignable causes (e.g., amount of main memory or number of processors) and uncontrollable variation (e.g., network load, operating system background activities) [3]. A detailed description of ANOVA is outside the scope of this book. The interested reader may refer to [3]. However, it is useful to mention that single-factor and two-factor ANOVA can be easily performed in MS Excel using the Data Analysis facility in the Tools menu.
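To make the split of variation concrete, the following is a rough sketch of a single-factor ANOVA in plain Python. It is not the procedure of [3] or of MS Excel; the function name one_way_anova and the sample response times are made up purely for illustration.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (ss_between, ss_within, f_statistic) for a list of samples,
    one sample per level of the single factor under study."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation attributed to the factor (assignable cause).
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation left over inside each level (experimental error).
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_stat

# Hypothetical response times (seconds) for three memory sizes.
samples = [
    [1.0, 1.2, 1.1],
    [0.8, 0.9, 1.0],
    [0.5, 0.6, 0.7],
]
ss_b, ss_w, f_stat = one_way_anova(samples)
print(ss_b, ss_w, f_stat)
```

The F statistic would then be compared with the critical value of the F distribution for the corresponding degrees of freedom to decide whether the factor is significant.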

Confidence intervals can be used as a simple method for comparing two alternatives [2] as explained via the following simple example. Suppose that management is interested in comparing the performance of their Web server with that of a new Web server. The performance analyst conducts an experiment to determine if the performance obtained from the two Web servers is different at a 95% confidence level.

The analyst carries out the following steps:

1. Select a representative workload, for example, a sample of 1,000 downloads of PDF and ZIP files at the peak period (see data in the Log worksheet of the MS Excel workbook ServerComparison.XLS).
2. Play the workload against the original and the new Web server. Record the download times for the files downloaded during the measurement interval (see measurement results in the Log worksheet of the MS Excel workbook ServerComparison.XLS).
3. Sort the results by file type (i.e., PDF and ZIP) and compute, for each file downloaded, the difference D_{new orig} between the download time using the new server and the download time using the original server (see SortedLog worksheet of the ServerComparison.XLS MS Excel workbook).
4. Compute 95% confidence intervals for the mean of the differences D_{new orig} for both PDF and ZIP file downloads. If the 95% confidence interval includes zero, then there is no significant difference between the two Web servers at a 95% confidence level. If zero falls outside the 95% confidence interval, the servers are deemed to be different at that confidence level (see results in the SortedLog worksheet of the ServerComparison.XLS MS Excel workbook).
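The last two steps can be sketched in Python as shown below. This is an illustration, not the computation performed in the ServerComparison.XLS workbook. It assumes a normal approximation for the confidence interval, which is reasonable for samples of several hundred downloads; for small samples a Student t quantile would be more appropriate. The function name and the download times are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_diff_confidence_interval(new_times, orig_times, confidence=0.95):
    """Return (mean, lower, upper) for the mean of the paired differences
    new - orig, using a normal approximation (an assumption of this sketch)."""
    if len(new_times) != len(orig_times):
        raise ValueError("each file needs one time per server")
    diffs = [n - o for n, o in zip(new_times, orig_times)]
    d_bar = mean(diffs)
    # Two-sided quantile, about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(diffs) / sqrt(len(diffs))
    return d_bar, d_bar - half_width, d_bar + half_width

# Made-up download times (seconds) for the same six files on each server.
orig = [0.50, 0.62, 0.48, 0.55, 0.60, 0.52]
new = [0.46, 0.59, 0.45, 0.50, 0.57, 0.49]

d_bar, lower, upper = mean_diff_confidence_interval(new, orig)
servers_differ = not (lower <= 0.0 <= upper)
print(d_bar, lower, upper, servers_differ)
```

In practice the function would be called once per file type, with the PDF measurements and the ZIP measurements kept separate.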

The above four steps are performed to determine if the two Web servers give significantly different performance. A negative value of D_{new orig} indicates that the new server downloads files faster than the original server. The results of running the experiments are shown in Table 6.7. As indicated by the table, the 95% confidence interval for the mean of the difference in PDF file download times is [-0.0380, -0.0334], which does not include zero. Similarly, for ZIP files, the 95% confidence interval for the mean of the difference in download times is [-0.1160, -0.1058], which also does not include zero.

Table 6.7. Results of Experimental Comparison of Two Web Servers
	PDF	ZIP
Mean of D_{new orig}	-0.0357	-0.1109
Lower bound 95% CI	-0.0380	-0.1160
Upper bound 95% CI	-0.0334	-0.1058

Thus, at the 95% confidence level, the new server outperforms the original server.
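As a quick check of this conclusion, the decision rule can be applied directly to the values of Table 6.7. The helper name significantly_different is an illustrative choice. The sketch also recovers the half-width of each interval, which shows that each reported mean sits at the midpoint of its interval.

```python
# Values copied from Table 6.7: (mean difference, lower bound, upper bound).
table_6_7 = {
    "PDF": (-0.0357, -0.0380, -0.0334),
    "ZIP": (-0.1109, -0.1160, -0.1058),
}

def significantly_different(lower, upper):
    """True when the confidence interval excludes zero."""
    return not (lower <= 0.0 <= upper)

results = {}
for file_type, (d_mean, lower, upper) in table_6_7.items():
    half_width = (upper - lower) / 2
    midpoint = (upper + lower) / 2
    results[file_type] = (
        significantly_different(lower, upper),
        round(half_width, 4),
        round(midpoint, 4),
    )
    print(file_type, results[file_type])
```

Both intervals lie entirely below zero, so the differences are significant and in favor of the new server for both file types.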
