2.9 Metrics and reporting | Monitoring and Managing Microsoft Exchange Server 2003 (HP Technologies)

Effective management of the Exchange environment requires disciplined monitoring and reporting. Metrics are the standards by which the quality of service is measured. Reporting covers generating the measurement data, distributing it to the appropriate audience, and reviewing it. Depending on its type, a report may summarize the measurement data, illustrate trends, correlate multiple metrics, or compare measurements to past or desirable values.

Measurement of service quality is a key component in most of the processes involved in the administration of any professionally managed messaging implementation. Metrics and reporting can help the operations group, and the user community, understand the current performance of the e-mail environment and how it is being used. Metrics can identify system performance trends and changes in usage patterns. They are the primary way to validate that the e-mail system is providing the level of service specified in service level agreements (SLAs). After changes have been made to key system components, metrics can validate that the system continues to perform as expected. Changes in key metrics can help determine when key resources, such as the number of processors, CPU speed, memory, disk space, and network bandwidth, need to be upgraded. Operations reports generally fall into one of the categories described in the following sections. Later chapters of this book will help to explain where to collect the information for these types of reports.

2.9.1 Use and capacity reports

A use and capacity report supplies the data for analyzing the long-term changes in system and network usage. By tracking these use and capacity changes, it is possible to predict when system components will need to be upgraded.

Hardcopy use and capacity reports should be published and reviewed monthly. Typical report data should include metrics, such as the following:

Windows operating system metrics. These include processor use, physical memory use, virtual memory use, disk use, disk I/Os, NIC use, and page file use.
Exchange metrics. These include average queue length, Information Store size, messages per second, and average message size.
Network metrics. These include server segment LAN use, intersite WAN circuit use, and intrasite WAN circuit use.
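The idea of predicting an upgrade date from monthly capacity data can be illustrated with a simple trend calculation. The following Python sketch is not taken from the book; it assumes one disk-use sample per month, fits a least-squares straight line, and estimates how many months remain before a given capacity is reached. The function name months_until_full and the sample figures are hypothetical, and a straight-line fit is only one of several reasonable forecasting choices.

```python
from typing import Optional, Sequence


def months_until_full(monthly_used_gb: Sequence[float],
                      capacity_gb: float) -> Optional[float]:
    """Estimate months until usage reaches capacity (hypothetical helper).

    Fits a least-squares line through one sample per month. Returns None
    when usage is flat or shrinking, because no exhaustion is predicted.
    """
    n = len(monthly_used_gb)
    if n < 2:
        raise ValueError("need at least two monthly samples")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(monthly_used_gb) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(xs, monthly_used_gb))
    slope = sxy / sxx  # growth in GB per month
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Month index where the fitted line crosses capacity, measured
    # relative to the most recent sample (index n - 1).
    crossing = (capacity_gb - intercept) / slope
    return max(0.0, crossing - (n - 1))


if __name__ == "__main__":
    # Illustrative numbers only: Information Store size over four months.
    print(months_until_full([100, 110, 120, 130], capacity_gb=200))
```

A result of None signals that the trend does not point toward exhaustion; a result of 0.0 means the fitted line has already reached the stated capacity.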

2.9.2 Usage reports

Usage reports are designed to show how heavily the messaging system is being used and which users are using the most resources, such as disk space and network bandwidth. As with the use and capacity reports, tracking the usage changes over time will help to identify when resources will need to be increased.

Usage reports should be published to the intranet on a weekly basis. Summaries could be published for quarterly review meetings. Typical report data should include metrics, such as the following:

User metrics. These include average user mailbox size, largest five user mailboxes, and number of users per server.
Public folder metrics. These include average public folder size, largest five public folders, and number of public folders per server.
Exchange infrastructure metrics. These include connectors per server and servers per routing group.
Message processing metrics. These include messages sent and received within each routing group, messages sent between routing groups, and messages sent and received from the Internet.
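As an illustration of the user metrics listed above, the following Python sketch computes the average mailbox size, the five largest mailboxes, and the number of users per server from a list of mailbox records. It is a minimal sketch under stated assumptions: the Mailbox record, its field names, and the user_metrics function are hypothetical, and how the mailbox sizes are collected from Exchange is outside its scope.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class Mailbox(NamedTuple):
    """Hypothetical record for one user mailbox."""
    user: str
    server: str
    size_mb: float


def user_metrics(mailboxes: List[Mailbox]) -> Dict[str, object]:
    """Summarize the user metrics named in a usage report."""
    if not mailboxes:
        raise ValueError("no mailbox data supplied")
    average = sum(m.size_mb for m in mailboxes) / len(mailboxes)
    # Sort largest first and keep the top five.
    largest = sorted(mailboxes, key=lambda m: m.size_mb, reverse=True)[:5]
    per_server = Counter(m.server for m in mailboxes)
    return {
        "average_mailbox_mb": average,
        "largest_five": [(m.user, m.size_mb) for m in largest],
        "users_per_server": dict(per_server),
    }


if __name__ == "__main__":
    # Illustrative data only.
    sample = [
        Mailbox("ann", "EX01", 120.0),
        Mailbox("bob", "EX01", 80.0),
        Mailbox("cat", "EX02", 300.0),
    ]
    print(user_metrics(sample))
```

The same pattern applies to the public folder metrics, with folders in place of mailboxes.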

2.9.3 System health snapshots

System health snapshots are typically brief summaries that report the current performance level and recent behavior of the system. The primary purpose of these snapshots is to verify that the system is operating as expected. They can also be used to detect changes in performance because of problems, resource depletion, increased or decreased usage, or problems with underlying components such as the network.

System health snapshots should be published to the intranet each day. The operations group should review these reports carefully and consistently, checking for changes in performance that might be early indicators of a problem. Typical report data should include metrics, such as the following:

Windows operating system metrics. These include processor use, paging rate, disk use, memory use, and NIC use.
Exchange metrics. These include message transfer agent queue lengths and connectivity (percent availability).
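Reviewing a daily snapshot for changes in performance amounts to comparing today's values with a known baseline. The following Python sketch shows one way this check might be automated; it is an assumption for illustration, not a method described in the book. The metric names, the flag_changes function, and the 25 percent tolerance are all hypothetical, and a real deployment would likely set a separate tolerance for each metric.

```python
from typing import Dict, List


def flag_changes(baseline: Dict[str, float],
                 today: Dict[str, float],
                 tolerance: float = 0.25) -> List[str]:
    """List metrics whose value moved more than `tolerance` from baseline.

    `tolerance` is a relative change (0.25 means 25 percent), an
    arbitrary value chosen for this illustration.
    """
    flagged = []
    for name, base in sorted(baseline.items()):
        if name not in today:
            flagged.append(f"{name}: missing from today's snapshot")
            continue
        current = today[name]
        if base == 0:
            # A relative change is undefined; flag any nonzero value.
            changed = current != 0
        else:
            changed = abs(current - base) / abs(base) > tolerance
        if changed:
            flagged.append(f"{name}: baseline {base:g}, today {current:g}")
    return flagged


if __name__ == "__main__":
    # Illustrative values only.
    baseline = {"cpu_pct": 40, "mta_queue_length": 10}
    today = {"cpu_pct": 44, "mta_queue_length": 30}
    for line in flag_changes(baseline, today):
        print(line)
```

Because both increases and decreases are flagged, the check also catches an unexpected drop in usage, which the snapshot is meant to reveal just as much as a rise.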

2.9.4 Service level agreement compliance reports

SLA compliance reports are designed to monitor the messaging system's compliance with the SLAs that the operations group has established with the user community. Similar reports also may be used to monitor system performance against internal organizational service targets.

These reports will be used as a communication mechanism between the operations group and the user community. The publication schedule, publication mechanism, and metrics for these reports should be negotiated with the user groups as part of the SLAs. Potential metrics may include availability (percentage uptime during service window), reliability (percentage of correctly addressed messages that are successfully delivered), message delivery rate, message delivery time, and mean time to restore service in the event of service outage.
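Three of the potential metrics above reduce to simple arithmetic. The following Python sketch computes availability, reliability, and mean time to restore service from the definitions given in the text. The function names, the use of minutes as the unit, and the sample figures are assumptions for illustration; the actual definitions and measurement windows are whatever the SLA negotiates.

```python
from typing import Sequence


def availability_pct(service_window_minutes: float,
                     outage_minutes: Sequence[float]) -> float:
    """Percentage uptime during the service window."""
    downtime = sum(outage_minutes)
    return 100.0 * (service_window_minutes - downtime) / service_window_minutes


def reliability_pct(delivered: int, correctly_addressed: int) -> float:
    """Percentage of correctly addressed messages that were delivered."""
    return 100.0 * delivered / correctly_addressed


def mean_time_to_restore(outage_minutes: Sequence[float]) -> float:
    """Average outage duration in minutes; 0.0 when there were no outages."""
    if not outage_minutes:
        return 0.0
    return sum(outage_minutes) / len(outage_minutes)


if __name__ == "__main__":
    # Illustrative figures: a 30-day, 24-hour service window with
    # three outages inside that window.
    window = 30 * 24 * 60
    outages = [30, 60, 126]
    print(availability_pct(window, outages))
    print(reliability_pct(delivered=9990, correctly_addressed=10000))
    print(mean_time_to_restore(outages))
```

Only outages that fall inside the agreed service window should be passed in; planned maintenance outside the window would not count against availability under this definition.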

2.9.5 Problem reports

It is important to have a database of reported problems and their solutions. Problems that at first appear to be isolated may prove to be systemic. Recording and reporting problem information may provide clues for early identification of systemic problems. The problem reports should include information about the number of problems reported, the number of problems solved, the most commonly reported problems, and system availability during the reporting period.
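The counts a problem report calls for can be derived directly from the problem records. The following Python sketch is a minimal illustration, assuming each record is a (category, solved) pair exported from the problem database; the record layout, the category labels, and the problem_summary function are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def problem_summary(problems: List[Tuple[str, bool]],
                    top_n: int = 3) -> Dict[str, object]:
    """Summarize problem records given as (category, solved) pairs."""
    reported = len(problems)
    solved = sum(1 for _, is_solved in problems if is_solved)
    # Count problems per category and keep the most frequent ones.
    by_category = Counter(category for category, _ in problems)
    return {
        "reported": reported,
        "solved": solved,
        "most_common": by_category.most_common(top_n),
    }


if __name__ == "__main__":
    # Illustrative records only.
    records = [
        ("mailbox full", True),
        ("mailbox full", True),
        ("mailbox full", False),
        ("cannot connect", True),
        ("cannot connect", False),
        ("slow delivery", True),
    ]
    print(problem_summary(records))
```

A category that keeps reappearing at the top of the most_common list from one reporting period to the next is a candidate for the kind of systemic problem the text describes.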

2.9.6 Change control reports

Changes to any production environment need to be carefully considered, planned, and tested before being implemented. Changes also need to be communicated to other operational groups and to the user community. Change control reports provide an audit trail of configuration changes that can be useful for problem solving.

2.9.7 Design guidelines for operational reports

Operational reports should be designed with the target audience in mind. People generally suffer from an overload of information. A user should be able to quickly determine whether the information contained within a report warrants careful review. The following guidelines will help make reports more useful:

For lengthy reports, the first page should include a summary of the information contained in the report. The summary should highlight any exceptional conditions so that the reader is spared the time-consuming task of examining the detailed report for problems.
Tables often obscure valuable information. Graphs make trends easier to spot and data easier to correlate, and should be used where appropriate.
The information contained in the report should be meaningful and relevant to the target audience for the report. If you expect a manager or executive to read your report, keep it short and eliminate all unnecessary information.

Reports do not need to be published on paper. In fact, some types of reports definitely should be delivered using other methods. The distribution method depends primarily on the purpose of the report and the target audience. Operational reports that are to be formally reviewed in group meetings generally should be distributed as hardcopy reports. Reports for managers and executives should be brief and generally delivered as an e-mail message. The corporate intranet is a good place to publish reports, such as service level compliance reports, that are designed for users and groups outside of the operations team.
