8.2 Measuring performance

As with any venture, the first question to ask is why you want to measure performance. Perhaps it is to validate different system configurations so that you can make a choice between one system configuration and another. For example, should you use a two-CPU system or a four-CPU system? Or perhaps it is to test a particular variant of an Exchange server under realistic working conditions, such as a cluster (to test how long failovers take after a Store failure) or how front- and back-end servers work together with clients accessing the servers through firewalls, a DMZ, and so on. For whatever reason, it is wise to begin the exercise with four points in mind:

  • The "realistic" workload that you generate through simulation software is only representative of the workload that simulated users produce. Real users can do weird and wonderful things to upset system performance. In fact, they are rather good at doing this.

  • A balanced system is more important than the fastest system on earth. This means that you have to concentrate on achieving the best balance between the speed of the CPU, the number of CPUs, the amount of storage, the type of storage and controller, and the amount of memory.

  • Performance measured on a server that only runs Exchange is about as valid as a three-dollar bill in the real world of operations. No Exchange server simply runs Exchange. Instead, real servers run a mixture of Exchange and other software, including antivirus and antispam detectors, and so on, all of which steal some CPU cycles, memory, and I/O.

  • New advances, such as hyperthreading and 64-bit Windows, will continue to appear to drive performance envelopes upward. However, operational considerations often limit the number of mailboxes that you want to support on a single server. The old adage of not putting all of your eggs in one basket holds true today. Against this argument, it is generally true that organizations operate far too many servers today and server consolidation is a trend that will continue for the foreseeable future.

Even the best-balanced and most powerful server will not satisfy users all the time, because its ability to deliver great performance decreases the further users are from the datacenter. The reasons for this include:

  • Network speed and latency:   If users connect across slow or high-latency links, their perceived access and performance are gated by the amount of data transferred across the link. You can install faster computers, but it will not make much difference to the users at the end of such links.

  • Clients:   Each client differs in its requirements. A POP client makes minimal demand when compared with Outlook, but the latest version of Outlook can work in cached Exchange mode to speed perceived performance on the desktop. Outlook Web Access is highly sensitive to bandwidth.

  • User workload:   If users are busy with other applications, some of which also use network links, the performance of their Exchange client might suffer and they might blame Exchange.

All of this goes to prove that no matter how well you measure performance and then configure systems before you deploy Exchange, user perception remains the true test.

8.2.1 Performance measuring tools

Microsoft provides three tools to assist you in measuring the performance of an Exchange server:

  • LoadSim

  • Exchange Stress and Performance (ESP)

  • JetStress

LoadSim is the oldest tool, since Microsoft first engineered it for Exchange 4.0 to generate a measurable workload from MAPI clients. ESP serves roughly the same purpose for Internet clients (including Outlook Web Access), while JetStress generates low-level database calls to exercise the I/O subsystem. You can download these tools from Microsoft's Exchange site at www.microsoft.com/exchange.

LoadSim and ESP both work by following a script of common operations that you expect users to take (creating and sending messages, scheduling appointments, browsing the GAL, and so on). You can tweak the scripts to create a heavier or lighter workload. Usually, one or more workstations generate the workload to exercise a server, each of which follows the script and generates the function calls to perform the desired operations. The workstations do not have to be the latest and greatest hardware, since even a 700-MHz Pentium III-class machine is capable of generating the equivalent workload of 600 or so clients. Note that LoadSim does some things that make it very unsuitable for running on any production server. For example, when LoadSim creates accounts and mailboxes to use during the simulation, it gives the new accounts blank passwords. You can imagine the opinion of your security manager if you create hundreds of accounts with blank passwords in your production environment. For this reason, always run LoadSim on test servers, but equip those servers with hardware that is as close as possible, if not identical, to the configuration used in production.
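To make the idea of a script-driven workload concrete, the following sketch (in Python, purely for illustration) shows how a simulator might turn a weighted mix of operations into a per-user stream of actions. It is not the LoadSim or ESP script format; the operation names, weights, and rates are hypothetical.

# Conceptual sketch only: this is NOT the LoadSim or ESP script format,
# just an illustration of how a script-driven simulator turns a weighted
# mix of operations into a per-user workload. All names and weights here
# are hypothetical.
import random
import time

# Hypothetical operation mix, modeled loosely on the kinds of actions
# described above (sending mail, scheduling appointments, browsing the GAL).
OPERATION_WEIGHTS = {
    "send_message": 40,
    "read_message": 35,
    "schedule_appointment": 10,
    "browse_gal": 10,
    "manage_folders": 5,
}

def pick_operation():
    """Choose the next operation according to the weighted mix."""
    ops, weights = zip(*OPERATION_WEIGHTS.items())
    return random.choices(ops, weights=weights, k=1)[0]

def simulate_user(user_id, actions_per_hour=20, duration_hours=4):
    """Drive one simulated user: pick operations and pace them out.

    In a real tool, each operation would issue MAPI or protocol calls
    against the server under test; here we only log the intent.
    """
    interval = 3600 / actions_per_hour
    for _ in range(int(actions_per_hour * duration_hours)):
        op = pick_operation()
        print(f"user {user_id}: {op}")
        time.sleep(interval)   # real tools also randomize think time

if __name__ == "__main__":
    simulate_user(user_id=1, actions_per_hour=20, duration_hours=4)

A real load generator runs many of these simulated users concurrently from several workstations, which is why even modest client hardware can exercise a large server.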

JetStress falls into a different category, because you do not use this tool to measure the overall performance of a server. Instead, JetStress exercises the storage subsystem by generating calls to stress the physical disks, controllers, and cache to identify if a configuration is capable of handling a specified workload. Another way of thinking about JetStress is that it mimics the work done by the Store process, whereas the other tools aim to exercise a complete Exchange server. While the Store is central to Exchange, many other components affect the overall performance of an Exchange server, such as the Routing Engine. The Store places the heaviest load on the storage subsystem and that is what JetStress attempts to measure. Unlike the other tools, JetStress does not come with a pretty interface and does not generate nice reports. You have to be prepared to interrogate the system performance monitor to capture data that you later analyze. In addition, while LoadSim and ESP work on the basis of operations (such as sending a message to two recipients) that you can easily associate with time, JetStress requires detailed knowledge of Windows performance and storage fundamentals if you are to make sense of its results. It is probably fair to say that any Exchange system administrator can run and understand LoadSim, but JetStress requires you to do more work to understand how to change hardware configurations to improve performance based on the data it generates.
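As an illustration of the kind of counter capture involved, here is a minimal sketch, assuming a Windows host with the built-in typeperf utility and Python available for the post-processing. It samples the disk latency counters most relevant to a JetStress run and prints simple averages afterward; the counter instances, sample interval, and file name are examples you would adjust to match your Store volumes.

# A minimal sketch, assuming a Windows host with the built-in typeperf utility.
# It captures the disk counters most relevant to a JetStress run and prints
# simple averages afterwards. Counter paths, interval, and file names are only
# examples; pick the instances that match your Store volumes.
import csv
import subprocess

COUNTERS = [
    r"\LogicalDisk(_Total)\Avg. Disk sec/Read",
    r"\LogicalDisk(_Total)\Avg. Disk sec/Write",
    r"\LogicalDisk(_Total)\Disk Transfers/sec",
]
OUTPUT = "jetstress_counters.csv"

def capture(interval_sec=15, samples=1440):
    """Sample the counters every interval_sec for samples iterations
    (15 s x 1,440 = six hours, enough to span a full test run)."""
    subprocess.run(
        ["typeperf", *COUNTERS,
         "-si", str(interval_sec), "-sc", str(samples),
         "-f", "CSV", "-o", OUTPUT, "-y"],
        check=True,
    )

def summarize(path=OUTPUT):
    """Print the average of each captured counter."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        sums = [0.0] * (len(header) - 1)   # first column is the timestamp
        rows = 0
        for row in reader:
            for i, value in enumerate(row[1:]):
                try:
                    sums[i] += float(value)
                except ValueError:
                    pass   # typeperf writes blanks for missed samples
            rows += 1
        if rows == 0:
            return
        for name, total in zip(header[1:], sums):
            print(f"{name}: {total / rows:.4f}")

if __name__ == "__main__":
    capture()
    summarize()

Averages like these only become meaningful when you know what read and write latencies your storage configuration should deliver, which is exactly the Windows performance and storage knowledge that JetStress demands of you.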

8.2.2 The difference between vendor testing and your testing

Hardware vendors typically use a standard benchmark workload called MMB2 for Exchange 2000 and MMB3 for Exchange 2003[2] when they test new servers. MMB2 is a modification of the original MMB workload and represents the workload generated by average office workers, if you could ever find one of these strange beasts. MMB3 is an evolution of MMB2, but differs in that it attempts to reproduce the different load generated by Outlook 2003 clients that use cached Exchange mode. Client-side caching changes server workload and may affect overall system performance, but it is only one aspect of Exchange 2003 performance. Microsoft has incorporated other factors into MMB3 (such as the use of rules, query-based distribution groups, and search folders) that increase client demand on a server, so a typical MMB3 result (in terms of the number of mailboxes supported by a server) is lower than the equivalent MMB2 result. Therefore, you cannot take a server result for Exchange 2000 and compare it with a result reported for Exchange 2003, because it is not an apples-to-apples comparison. You need to use the LoadSim 2003 version to perform benchmarks based on the MMB3 workload. A similar situation occurred when Microsoft changed the original MMB benchmark to MMB2 with the introduction of Exchange 2000.

All benchmarks attempt to prove one thing: that a server can support many more Exchange mailboxes than any sane administrator would ever run in production. To some extent, the benchmarks are a game played out by hardware vendors in an attempt to capture the blue riband of Exchange performance. It is nice to know that a server will support 12,000 mailboxes, but you always have to realize that, despite Microsoft's best effort to refine the MMB workloads, real users generate workloads very different from simulations for the following reasons:

  • Real servers run inside networks and experience all of the different influences that can affect Exchange performance, such as losing connectivity to a GC.

  • Real servers run much more than Exchange. For example, antivirus detection software can absorb system resources that inevitably affect overall system performance. Some informal benchmarking of leading antivirus software shows that it can absorb 20 percent to 25 percent of CPU, as well as virtual memory, with an attendant reduction in the number of supported mailboxes. Multi-CPU systems tend to be less affected by add-on software, because the load is spread across multiple processors.

  • Benchmarks usually test the performance of single servers and ignore complex configurations such as clusters.

  • Benchmarks do not usually incorporate complex storage configurations such as SANs, but shared storage is a prerequisite for any server consolidation exercise. Storage can significantly affect server performance, especially for database applications such as Exchange, which is the reason why vendors avoid complex storage configurations in benchmarks. They also tend to use RAID 0 volumes to hold the Store databases. This maximizes raw I/O performance, but because RAID 0 offers no redundancy, you would never use it for Store databases on production servers.

  • The Store databases on real-world servers include a much wider variety of attachment types than test databases contain. For example, test databases do not typically include huge PowerPoint attachments, yet any corporate Exchange server is littered with these files.

  • The performance of all servers degrades over time due to factors such as disk fragmentation.

If you just read these points, you might conclude that there is no point in paying any attention to vendor benchmarks and that running your own benchmark tests may not deliver worthwhile results. This is an oversimplification of the situation. The results of a vendor benchmark performed using a standard workload (remember that the MMB3 workload is preferred for Exchange 2003) give you a baseline to measure different system configurations against each other. You can understand the impact of installing multiple CPUs in a server, the difference an increase in CPU speed makes, or how different storage controllers and disks contribute to overall system performance. All of this assumes that vendors perform the benchmarks according to the "rules" laid down by Microsoft and do not attempt anything sneaky to improve their results. Many consultants take the benchmark results reported by vendors and adjust them based on their own experience to create recommendations for production-quality server configurations. Other factors, such as the "keeping all your eggs in one basket" syndrome and the time required to perform backups (and, more important, restores) of large databases, reduce the tens of thousands of mailboxes that some benchmarks report to a more supportable number. For example, an HP quad-CPU (2 GHz) Proliant DL580 with 4 GB of memory benchmarks at 13,250 mailboxes, but you would never run this number in production. Experience from most corporate-style deployments indicates that 4,000 is closer to a long-term supportable number.
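The adjustment from a benchmark figure to a supportable number is mostly back-of-envelope arithmetic. The sketch below shows the kind of derating a consultant might apply; the antivirus figure echoes the 20 to 25 percent overhead mentioned earlier, while the other factors are purely illustrative assumptions.

# Back-of-envelope derating of a vendor benchmark result. The factors are
# illustrative assumptions, not published guidance: adjust them based on your
# own testing and operational requirements.

benchmark_mailboxes = 13250      # vendor MMB3 result for the server (example above)

# Hypothetical derating factors (fraction of capacity left after each effect).
antivirus_overhead = 0.75        # ~20-25 percent CPU absorbed by antivirus scanning
real_user_headroom = 0.60        # real users are burstier than simulated MMB users
operational_ceiling = 0.70       # backup/restore windows, "eggs in one basket" limits

supportable = benchmark_mailboxes
for factor in (antivirus_overhead, real_user_headroom, operational_ceiling):
    supportable *= factor

print(f"Benchmark figure  : {benchmark_mailboxes:,}")
print(f"Supportable figure: {int(supportable):,}")   # roughly 4,200 with these factors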

The decision to run your own benchmarks is harder to make because of the effort required. You can run LoadSim on a server just to see how it responds, but this will not generate a measurement that you can use in any serious sense. To create a proper benchmark you need:

  • Dedicated servers to host Exchange and the Active Directory (DC and GC), as well as the workstations to generate the workload.

  • A software configuration on the Server Under Test (SUT) similar to the one you intend to run in production. In other words, if you want to run a specific antivirus agent on your production servers, it should be installed and running on the SUT too. The same is true if you intend to host file and print services or any other application on the production servers; these have to be factored into the equation.

  • Access to the same storage configuration that you plan to deploy in production. If you want to use a SAN, then you must connect the SUT to the SAN and use the same controller and disk layout as planned for production. Because Exchange performance is so dependent on the Store, you can drastically affect overall performance by changing storage characteristics.

  • Apart from the basic disk layout of the Exchange files (database, logs, and binaries), you should place the SMTP and MTA work directories and the message tracking logs in the same locations that they occupy in production.

Beyond all of this, you need time to prepare the test, run the benchmark, and then analyze the captured data. Do not expect to get everything right on the first run, and be prepared to run several tests, each of which lasts six hours or more (one hour to normalize the load, four hours to measure the SUT being exercised, and one hour to complete the test).
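If you capture counters with a tool such as typeperf (as in the earlier sketch), it helps to trim the log to the steady-state window before you compute averages. The following sketch assumes the one-hour ramp-up and four-hour measurement phases described above, along with the CSV layout and file name used earlier; all of these are assumptions to adjust for your own runs.

# A minimal sketch that trims a captured counter log to the steady-state
# measurement window (skip the first hour of ramp-up, keep the four measured
# hours) before computing averages. The CSV layout matches the typeperf
# capture shown earlier; the file name and interval are assumptions.
import csv

SAMPLE_INTERVAL_SEC = 15
RAMP_UP_SEC = 3600        # one hour to normalize the load
MEASURE_SEC = 4 * 3600    # four hours of measured workload

def steady_state_averages(path="jetstress_counters.csv"):
    skip = RAMP_UP_SEC // SAMPLE_INTERVAL_SEC
    keep = MEASURE_SEC // SAMPLE_INTERVAL_SEC
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)[skip:skip + keep]   # drop ramp-up, keep the 4-hour window
    for col, name in enumerate(header[1:], start=1):
        values = [float(r[col]) for r in rows if r[col].strip()]
        if values:
            print(f"{name}: {sum(values) / len(values):.4f}")

if __name__ == "__main__":
    steady_state_averages()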

During the same time, you may want to take advantage of the realistic configuration you create for the test servers to validate that assumptions about backup and restore times are correct, that antivirus and archiving tools work, that operational procedures are viable, and so on. You may also want to use other tools, such as Intel's IOmeter, to measure the base performance of storage volumes or other components.

Of course, you can take the view that every server available today is easily capable of supporting thousands of Exchange mailboxes and ignore benchmarks completely. This is a viable option if you then buy high-quality server hardware based on attributes other than just speed, including:

  • Vendor support

  • Additional features, such as the ability to boot or otherwise manage the server remotely

  • Form factor (some sites prefer blade servers because they take up less rack space)

  • Server compatibility with the storage infrastructure

Even if you do ignore benchmarks, it is still worthwhile to build some test servers based on the desired configuration and validate it before proceeding to full deployment.

[2] Microsoft does not endorse any benchmark results gained by running MMB2 against Exchange 2003 servers.


