Instrumentation of the Server Infrastructure


Individual servers are instrumented by their manufacturers or by other companies that build an agent for a specific type of server. As with other managed elements, the element-centric instrumentation provides insight into the following:

  • Current behavior, based on measures such as CPU load, memory usage, network activity, and disk activity

  • Usage details, such as the number of users, threads, or processes

  • Environmental monitoring of temperature, power, and enclosure integrity

However, in a service-delivery infrastructure, such instrumentation of individual components is usually useful only after troubleshooting has narrowed the origin of a problem to a specific component. Instrumenting server tiers requires a different perspective. Each tier handles a subset of the total transaction, and behavioral measurements must take that into account. For example, synthetic (virtual) transactions, or parts of them, can be run against every tier to understand its response.

Because the transaction is composed of a set of determinate end-user steps (which must each succeed, fail, or time out in a given sequence), the measurements must also be decomposed accordingly. Administrators must determine the acceptable delay thresholds for each measured step in the transaction. Synthetic transactions for each transaction step can be created and modified as applications and infrastructures change.
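The sketch below illustrates one way such a decomposed synthetic transaction might be scripted. The step names, URLs, and thresholds are hypothetical placeholders, not values from this chapter; real steps would mirror the application's actual transaction flow.

```python
# A minimal sketch of a decomposed synthetic transaction.
import time
import urllib.request

STEPS = [
    # (step name, URL to exercise, acceptable delay in seconds) -- all hypothetical
    ("home_page", "https://www.example.com/",           2.0),
    ("login",     "https://www.example.com/login",      3.0),
    ("search",    "https://www.example.com/search?q=x", 4.0),
]

def run_synthetic_transaction():
    results = []
    for name, url, threshold in STEPS:
        start = time.monotonic()
        try:
            urllib.request.urlopen(url, timeout=threshold * 2)
            elapsed = time.monotonic() - start
            status = "ok" if elapsed <= threshold else "slow"
        except Exception:
            elapsed = time.monotonic() - start
            status = "failed"
        results.append((name, elapsed, status))
        if status == "failed":
            break  # later steps depend on this one succeeding
    return results
```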

Partitioning of synthetic transactions into their component parts can help decrease the load on the server systems. There's little incremental benefit in all transactions testing the same component in the same way. When the common component fails, all the synthetic transactions fail simultaneously and create redundant artifacts that the management system must screen. At the same time, it's not always practical to segment every section of every synthetic transaction; some segments, such as a database connection that needs a session assigned in the application server before it can be exercised, cannot be operated independently.

Collectors run the synthetic transactions and measure the tier response against established thresholds. Usually the overall performance of the tier is measured, and no further steps are needed as long as it is acceptable. Other measurements become important when the transaction completion time for a tier begins to approach the threshold for unacceptable performance. The measurement intervals must be selected to balance the needed measurement granularity against the additional resources needed to process the synthetic transaction. Web architectures typically can absorb relatively frequent sampling by synthetic transactions (say, one HTTP request per minute) without much difficulty.
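A collector loop along these lines might look like the following sketch. It assumes a probe_tier() function that exercises one tier and returns its response time, and a trigger_detailed_measurements() callback; the interval and threshold values are illustrative only.

```python
# A minimal collector loop; names and thresholds are hypothetical.
import time

SAMPLE_INTERVAL = 60      # one probe per minute, as discussed above
THRESHOLD = 4.0           # unacceptable response time for this tier, in seconds
WARNING_FRACTION = 0.8    # start detailed measurement at 80% of the threshold

def collect(probe_tier, trigger_detailed_measurements):
    while True:
        response_time = probe_tier()
        if response_time >= THRESHOLD * WARNING_FRACTION:
            # Performance is approaching the unacceptable threshold, so
            # gather the finer-grained measurements before a disruption.
            trigger_detailed_measurements(response_time)
        time.sleep(SAMPLE_INTERVAL)
```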

By partitioning the measures of transaction performance across each successive level of the tier, administrators can more quickly isolate a problem when performance is degrading toward a service disruption. Using synthetic transactions for every tier quickly identifies the tier likely causing the problem. After a tier is identified as a likely source of the performance problem, element managers are used to pinpoint the specific elements and conditions causing the performance problem.

For example, end-user synthetic transactions might all be indicating a response-time problem with a certain step in a transaction. If all the end-user synthetic transactions are seeing the same problem, it's probably not related to the location of the end users, but it could be in the server farm or in the server farm's access to the Internet. If a synthetic transaction collector located within the server farm also sees the problem, the problem is inside the server farm, not with the Internet access link. A collector performing or monitoring database retrievals of the type used by the transaction step would then help operators see if slow database retrieval was a cause of the problem.
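The isolation logic in this example can be expressed as a simple decision procedure. The function below is a hypothetical sketch; the inputs are assumed to be boolean flags indicating whether each collector saw the transaction step exceed its threshold.

```python
# A sketch of localizing a slow transaction step from collector results.
def localize_problem(end_user_slow, farm_collector_slow, db_collector_slow):
    """Each argument is True if that collector sees the step exceeding its threshold."""
    if not end_user_slow:
        return "no problem visible to end users"
    if not farm_collector_slow:
        return ("end users affected but the server-farm collector is healthy: "
                "suspect the Internet access link or external network")
    if db_collector_slow:
        return ("server farm affected and database retrievals are slow: "
                "suspect the database tier")
    return ("server farm affected, database healthy: "
            "suspect the web or application tier")
```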

Measuring element performance and building baselines helps prevent problems when the inevitable component changes occur. The historical baselines become trip wires: thresholds that indicate whether the changes have actually improved the component performance or introduced additional delays. I recently visited a large online retailer that uses these strategies. They described a recent incident where a software change caused the application to issue three identical database queries each time data was needed. For whatever reason, the change slid through quality assurance and was placed into production. The database activity baseline immediately indicated a sudden abnormal jump in query volume. Administrators quickly determined that the changed application was the culprit and rolled back to an earlier version before customers noticed any degradation in service.
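A baseline trip wire of this kind can be as simple as comparing current activity against the historical mean and spread. The sketch below assumes a history of per-interval query counts; the multiplier is an arbitrary illustration, not a value from the incident described above.

```python
# A minimal baseline "trip wire" for database query volume.
from statistics import mean, stdev

def query_volume_alarm(history, current, multiplier=3.0):
    """Return True if the current query count jumps well above the
    historical baseline, as in the tripled-query incident described above."""
    if len(history) < 2:
        return False
    baseline = mean(history)
    spread = stdev(history)
    return current > baseline + multiplier * spread
```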

Each of the three server architecture components discussed in the first part of this chapter (load distribution, caching, and content distribution) has its own instrumentation characteristics that must be considered when building an integrated instrumentation system. These are discussed in the next three subsections.

Load Distribution Instrumentation

Vendors of load-distribution devices have included sophisticated management and instrumentation capabilities in their systems. F5 Networks, for example, has developed a network manager to monitor the status and performance of F5 load distribution devices. The F5 devices report to the manager using either Simple Network Management Protocol (SNMP) or XML. Available information includes workload volumes, the number of discarded connections, the number of times particular servers have been chosen, and more. The load-distribution system can then be tuned to handle performance situations as they occur. Detailed usage information can also be extracted for accurate billing and resource forecasting. Through a Web services XML interface, F5 network devices can be integrated directly with any third-party application.
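As a hedged illustration of SNMP-based polling, the sketch below uses the pysnmp library to read a single value from a device. The standard sysDescr OID is shown only as a placeholder; the device-specific counters mentioned above (discarded connections, server selections, and so on) would come from the vendor's own MIBs, which are not reproduced here.

```python
# A sketch of polling a management value over SNMP v2c with pysnmp.
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

def poll_device(host, community="public", oid="1.3.6.1.2.1.1.1.0"):
    # The default OID is the standard sysDescr, used here only as a placeholder.
    error_indication, error_status, _, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData(community, mpModel=1),   # SNMP v2c
        UdpTransportTarget((host, 161)),
        ContextData(),
        ObjectType(ObjectIdentity(oid))))
    if error_indication or error_status:
        raise RuntimeError(f"SNMP poll failed: {error_indication or error_status}")
    return {str(name): str(value) for name, value in var_binds}
```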

Cache Instrumentation

It's important for instrumentation design to consider the fact that caches absorb incoming requests from end users. Phantom objects (page bugs), discussed in Chapter 8, can be used to count web page downloads even when the entire page is cached. The phantom object's file is simply marked with a cache header as uncacheable, or it is given an attribute, such as a query string, that cache agents will avoid. That phantom object will always be fetched from the origin server, even when the entire rest of the page is fetched from cache. The relatively slow delivery of the phantom object won't interfere with the end users' perception of page performance, as it's usually an invisible, one-pixel object.
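The following sketch shows one way a phantom object might be served so that caches never store it. Flask, the route name, and the counter are illustrative assumptions, not the specific mechanism described in Chapter 8; the same effect could be achieved by appending a unique query string to the object's URL in the page markup.

```python
# A minimal sketch of serving an uncacheable phantom object (page bug).
import base64
from flask import Flask, Response

app = Flask(__name__)

# The standard 1x1 transparent GIF.
PIXEL = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

page_view_count = 0

@app.route("/pixel.gif")
def phantom_object():
    global page_view_count
    page_view_count += 1                      # one hit per page download
    resp = Response(PIXEL, mimetype="image/gif")
    # Mark the object uncacheable so intermediate caches always pass the
    # request through to the origin server.
    resp.headers["Cache-Control"] = "no-cache, no-store, must-revalidate"
    return resp
```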

If server-side caches are used, the cache's performance data is available for analysis. Information on cache hits and misses can be used to compute the bandwidth savings resulting from the cache. (For browser-side caches, such computations can also be made, but they're slightly more complex. The number of page views must be combined with knowledge of how many page elements were actually fetched from the server and compared to the number of elements designed into the page.) In any case, end-user measurements are needed to see the impact of caching on end-user performance.
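The hit-and-miss arithmetic can be sketched as follows, assuming the cache reports bytes served on hits and bytes fetched from the origin on misses; the figures in the usage example are invented.

```python
# A sketch of computing bandwidth savings from server-side cache statistics.
def bandwidth_savings(bytes_served_on_hits, bytes_fetched_on_misses):
    total_delivered = bytes_served_on_hits + bytes_fetched_on_misses
    if total_delivered == 0:
        return 0.0
    # The fraction of delivered bytes that never had to cross the link
    # back to the origin server.
    return bytes_served_on_hits / total_delivered

# Example: 8 GB served from cache, 2 GB fetched from origin -> 80% saved.
print(bandwidth_savings(8e9, 2e9))   # 0.8
```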

Content Distribution Instrumentation

A combination of active and passive measures is needed to monitor and manage the behavior of the content-delivery infrastructure. Active collectors must be distributed to match the content-server distribution so that they can provide representative measurements for the cluster of customers using each server.

Active measurements use a series of synthetic transactions to measure the performance of the content delivery infrastructure. The performance of object fetches evaluates the speed of the network and the efficiency of the caches. The measurements are also dependent upon the objects; for example, only the delay is usually important when fetching a web page. If the object is a content stream, jitter and packet loss are also important measurements.
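The stream-oriented measurements can be sketched from a packet trace, assuming each probe records a sequence number and arrival time. The loss and jitter calculations below are simplified illustrations, not a formal RTP-style computation.

```python
# A sketch of packet loss and a simple inter-arrival jitter estimate.
def stream_metrics(packets):
    """packets: list of (sequence_number, arrival_time_seconds) in arrival order."""
    if len(packets) < 2:
        return {"loss": 0.0, "jitter": 0.0}
    seqs = [s for s, _ in packets]
    expected = max(seqs) - min(seqs) + 1
    loss = 1.0 - len(set(seqs)) / expected          # fraction of packets never seen
    # Mean absolute variation in inter-arrival time, a simple jitter proxy.
    gaps = [packets[i + 1][1] - packets[i][1] for i in range(len(packets) - 1)]
    jitter = (sum(abs(gaps[i + 1] - gaps[i]) for i in range(len(gaps) - 1))
              / max(len(gaps) - 1, 1))
    return {"loss": loss, "jitter": jitter}
```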

The content manager takes advantage of passive information from the content servers and caches. The content servers provide information on the most popular content. This information is used to balance content across multiple servers, to identify content that should be preloaded, and to increase the replication of popular content. The content manager can schedule updates as needed from the origin server.
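One plausible way to turn that popularity data into preload and replication decisions is sketched below; the request-log input and the cutoff values are assumptions for illustration only.

```python
# A sketch of ranking content for preloading and wider replication.
from collections import Counter

def plan_replication(request_log, preload_top_n=20, extra_copies_top_n=5):
    popularity = Counter(request_log)          # object -> request count
    ranked = [obj for obj, _ in popularity.most_common()]
    return {
        "preload": ranked[:preload_top_n],              # push to edge content servers
        "replicate_more": ranked[:extra_copies_top_n],  # add extra copies of the hottest objects
    }
```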

Other metrics for the content-delivery infrastructure can be derived as well. One useful measurement assesses the real impact of the content-delivery infrastructure on the backbone. Bandwidth gain is a metric that compares the total content delivered to consumers to the backbone bandwidth needed for preloading objects, refreshing content, and accessing the origin server when an object is not resident in the cache. For example, the bandwidth gain is 5 if the cache delivers 100 Mbps of content while using 20 Mbps on the backbone for cache overhead. This shows the benefit of content servers at the edge versus upgrading the backbone for a centralized approach. The bandwidth gain becomes even more significant in the aggregate view. If there are 25 content servers at the edge, each with a bandwidth gain of 5, the backbone impact is very clear.
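The bandwidth-gain arithmetic from this example is simple enough to show directly; the 25-server aggregate figures below just restate the example's numbers.

```python
# Bandwidth gain = content delivered to consumers / backbone bandwidth consumed.
def bandwidth_gain(content_delivered_mbps, backbone_overhead_mbps):
    return content_delivered_mbps / backbone_overhead_mbps

single = bandwidth_gain(100, 20)          # 5.0, as in the example above
# With 25 edge servers each delivering 100 Mbps at a gain of 5, the edge
# delivers 2500 Mbps while drawing only 500 Mbps from the backbone.
aggregate_delivered = 25 * 100
aggregate_backbone = aggregate_delivered / single
print(single, aggregate_delivered, aggregate_backbone)
```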

Content delivery networks can supply detailed information, including not just workload volumes but also information about the geographic location of your end users and the particular web pages or other content that they request.



