Instrumenting Applications


Applications must provide the internal loading, customer behavior, and business metrics that are necessary to understand their functioning. These metrics can be collected by instrumenting the web-server systems and applications, the other server components, and the end user's environment. These approaches are discussed in the following sections.

Instrumenting Web Servers

Web analytics products are used to obtain customer behavior metrics. The earliest products focused on analyzing the logs produced by standard web servers, such as the Apache Software Foundation's HTTP Server and Microsoft's Internet Information Server (IIS).

Server logs provide the most basic information on the accessed pages and the time of each request. The information they provide is limited: they maintain no context, capture nothing about the customer's interactions with the content, and offer no visibility into cached activity (such as when the user navigates back to a previous page that is already stored in the browser cache). Because they do record the identity of the "referring" page (the previous web page), software can laboriously chain the referrals together and thereby obtain a picture of the user's progress through the web site.
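As a rough illustration of what that chaining involves, the following TypeScript sketch (assuming Node.js, the common "combined" log format, and a hypothetical access.log file) extracts the client host, timestamp, page, and referring page from each entry and links hits into per-host trails:

import { readFileSync } from "fs";

// "Combined"-style entry:
// host ident authuser [date] "METHOD path HTTP/x.x" status bytes "referrer" "agent"
const LINE = /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) \S+ "([^"]*)" "([^"]*)"/;

interface Hit { host: string; time: string; page: string; referrer: string; }

const hits: Hit[] = [];
for (const line of readFileSync("access.log", "utf8").split("\n")) {
  const m = LINE.exec(line);
  if (!m) continue;                               // skip malformed entries
  hits.push({ host: m[1], time: m[2], page: m[4], referrer: m[6] });
}

// Laboriously chain referrals: a hit continues a trail when its "referring"
// page matches the last page retrieved by the same client host.
const trails = new Map<string, string[]>();
for (const h of hits) {
  const refPath = h.referrer.startsWith("http") ? new URL(h.referrer).pathname : h.referrer;
  const trail = trails.get(h.host);
  if (trail && trail[trail.length - 1] === refPath) {
    trail.push(h.page);                           // continues an existing trail
  } else {
    trails.set(h.host, [h.page]);                 // entry page of a new trail
  }
}

for (const [host, pages] of trails) console.log(`${host}: ${pages.join(" -> ")}`);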

A typical example of a log-analysis product is the WebTrends Log Analyzer Series from NetIQ. It accepts server logs from all the major web servers and produces extensive reports on customer behavior. (Log files from all the major servers are very similar; there are standard formats and log entry definitions.) Because log files can be massive (each item on a web page creates a log entry each time it's downloaded) and because multiple web servers are usually involved in a single system, the amount of data that must be processed by the analysis engine is also massive.

Later web analytics products depend on special tags inserted into the pages. These tags activate instrumentation that captures more information than is available from pure log-analysis products.

One way of tagging a page is to insert an almost-invisible phantom object on each page. Usually this is a transparent, extremely small image; the technique is often called pixel-based tracking or page-bug tracking. When the page is loaded into a browser, the browser automatically makes a request for this invisible object, exactly as it requests all the other images on the page. It's just another image as far as the browser is concerned. The phantom object's tag is no different from a standard image tag, except that the tag references the data collection server. Because of that reference, the phantom object request is directed to a third-party recording site or tool that captures the activity. Using a single object to represent the page as a whole reduces the number of entries for each page retrieved: instead of having a log entry for each item on the page (and there may be 50 or more), there's only one entry per page.
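A minimal TypeScript sketch of the browser side of this technique follows; the collection server hostname is hypothetical, and a real product would add more parameters:

// Page-bug sketch: request a 1x1 transparent image from a third-party
// collection server. The browser treats it exactly like any other image,
// so one request per page is recorded at the collector.
function firePageBug(): void {
  const bug = new Image(1, 1);
  bug.src = "https://collector.example.com/pixel.gif?page=" +
            encodeURIComponent(window.location.pathname);
}

firePageBug();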

Unfortunately, the simplest version of this type of tagging can't see the interactions that the user makes with the web page. However, more complex versions of phantom object tagging enable the tag to contain a parameter string in the form of a query string in the image request. That parameter string can be constructed by JavaScript running in the browser, and it can therefore record user actions on the page along with any other information available to JavaScript, such as browser size and available plug-ins.
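A hedged sketch of that richer form of tagging is shown below; the collector URL, the parameter names, and the use of modern browser APIs are illustrative assumptions, not any vendor's actual tag:

const COLLECTOR = "https://collector.example.com/pixel.gif";

// Build a query string describing the page and the browser, then fire the
// phantom-object request with that string attached.
function report(params: Record<string, string>): void {
  const query = new URLSearchParams({
    page: window.location.pathname,
    width: String(window.innerWidth),          // browser window size
    height: String(window.innerHeight),
    plugins: String(navigator.plugins.length), // available plug-ins
    ...params,
  });
  new Image(1, 1).src = `${COLLECTOR}?${query.toString()}`;
}

report({ event: "pageview" });

// Record user actions on the page, something pure log analysis never sees.
document.addEventListener("click", (e) => {
  const element = (e.target as HTMLElement).tagName;
  report({ event: "click", element });
});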

Of course, use of phantom objects requires that each page to be measured include the phantom object and, probably, a piece of special JavaScript code. In contrast, log file analysis does not necessitate changes to the web pages.

Cookies can also be used by themselves or in conjunction with phantom-object tagging and JavaScript. A cookie is an object exchanged between the browser and a web application. It contains application and user information that applications can use for authentication, personalization of content, and identification of customers for differentiated treatment. It is stored in the browser at the request of the server, and a copy of it is returned to the server with any subsequent requests made to that server. (Many users have set their browsers to reject cookies automatically, unfortunately.)
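The following TypeScript sketch shows cookie-based visitor identification in its simplest form; the cookie name and expiry are assumptions, and a real application would typically set the cookie from the server with a Set-Cookie header:

// Identify a returning visitor from a cookie, or mint an identifier for a
// new visitor. The browser returns the cookie with every later request to
// this server, so the application can recognize the customer across visits.
function getOrSetVisitorId(): string {
  const match = document.cookie.match(/(?:^|; )visitor_id=([^;]+)/);
  if (match) return match[1];                       // returning visitor

  const id = Math.random().toString(36).slice(2);   // new visitor
  const oneYear = 60 * 60 * 24 * 365;
  document.cookie = `visitor_id=${id}; path=/; max-age=${oneYear}`;
  return id;
}

const visitorId = getOrSetVisitorId();              // usable for personalization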

WebSideStory's HitBox is one of the leading tools that uses phantom objects, usually in combination with JavaScript and sometimes with cookies.

Clickstream is a tool from Clickstream Technologies that uses cookies in combination with a tracking module installed on the web server. Each request for a page serves the page to the browser, accompanied by a page-side measurement algorithm that records page display times as well as any offline and cached browsing activities that occur. The information is recorded in a cookie and later sent to a server that records and analyzes all the request information, including the browser-side, cache-based activities that would not otherwise be seen by instrumentation because they did not result in any traffic on the communications link.
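The general idea can be sketched as follows; this is an illustrative approximation, not Clickstream's actual mechanism, and the cookie name, collector URL, and timing source are assumptions:

const BUFFER = "measurement_buffer";

function readCookie(name: string): string {
  const m = document.cookie.match(new RegExp(`(?:^|; )${name}=([^;]+)`));
  return m ? decodeURIComponent(m[1]) : "";
}

// Measure page display time in the browser and append it to a cookie-held
// buffer, so that cached or offline navigation is still captured.
function recordPageTiming(): void {
  const entry = `${location.pathname}:${Math.round(performance.now())}ms`;
  const buffer = [readCookie(BUFFER), entry].filter(Boolean).join("|");
  document.cookie = `${BUFFER}=${encodeURIComponent(buffer)}; path=/`;
}

// Later, when a request to the collector is possible, forward the buffer.
function flushBuffer(): void {
  const buffered = readCookie(BUFFER);
  if (!buffered) return;
  new Image(1, 1).src =
    `https://collector.example.com/record?data=${encodeURIComponent(buffered)}`;
  document.cookie = `${BUFFER}=; path=/; max-age=0`;   // clear after sending
}

window.addEventListener("load", () => { recordPageTiming(); flushBuffer(); });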

Keynote's WebEffective is a measurement service that's different from the phantom object services. To use WebEffective, one line of JavaScript is embedded in the web site's entry pages. That JavaScript redirects selected users (a sample of all users, specific users, and so on) to the WebEffective server. The WebEffective server then inserts itself between the end user's browser and the original web server systems. In that position, it records everything the end user does on the web page and everything that the original web server systems do in response. (For example, it can discover that an end user is not clicking on a particular button because that button is not displayed on the end user's small browser window.) If requested, WebEffective presents a pop-up window to the end user to ask permission to track activity. After permission is given, it can ask questions of end users at any time. It can even intercept end users who are abandoning the site to ask them why they're leaving, and it can track them to their next site.
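The entry-page technique of redirecting a sample of visitors through a measurement proxy can be sketched roughly as follows; this is not Keynote's actual script, and the proxy hostname and sampling rate are assumptions:

// Send roughly 1 in 100 visitors through a measurement proxy that sits
// between the browser and the original web servers and records everything
// that passes in both directions.
const SAMPLE_RATE = 0.01;

if (Math.random() < SAMPLE_RATE &&
    location.hostname !== "proxy.measurement.example.com") {
  location.replace(
    `https://proxy.measurement.example.com/relay?orig=${encodeURIComponent(location.href)}`
  );
}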

The integrity of web pages can also be evaluated by web analytics tools. I have spoken with several organizations that have written a simple application for periodically validating the integrity of the web application content. These applications improved service availability by ensuring that the correct content was correctly linked. The rapid, frequent changes on many sites might introduce a broken link, a pointer to a non-existent page, or other problems that result in poor customer experience, lost business, and reduced chances of future visits. Any problems with links or content are passed in an alert to the alarm manager. (Such tools are also available from commercial vendors and include the Keynote WebIntegrity tool and the Mercury Interactive Astra SiteManager tool.)

The integrity-testing tool can be used to exercise the embedded links in the web pages. The virtual transactions load a page, check for the correct content using a simple technique such as a checksum, and then initiate further virtual transactions based on links in the new page. Unfortunately, some manual intervention might be needed because of potential loops in the sequence of links; without safeguards, the tests never terminate. The virtual transactions would exercise selected trails, such as those leading to visitor purchases. Web analytics tools help identify the paths that have the heaviest visitor volume.
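A hedged sketch of such an integrity tester is shown below, written in TypeScript for Node.js 18 or later; the starting URL and expected checksums are assumptions, and a production tool would also pace its requests to avoid the cache disturbance described next:

import { createHash } from "crypto";

// Expected page checksums; entries here would be filled in for the real site.
const expected = new Map<string, string>();

async function checkSite(startUrl: string, maxPages = 100): Promise<void> {
  const visited = new Set<string>();        // loop protection for circular links
  const queue: string[] = [startUrl];

  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift()!;
    if (visited.has(url)) continue;
    visited.add(url);

    const res = await fetch(url);
    if (!res.ok) {
      console.error(`ALERT: broken link ${url} (status ${res.status})`);
      continue;                              // pass the problem to the alarm manager
    }

    const body = await res.text();
    const checksum = createHash("md5").update(body).digest("hex");
    const want = expected.get(url);
    if (want && want !== checksum) console.error(`ALERT: unexpected content at ${url}`);

    // Initiate further virtual transactions from links found in this page,
    // staying within the site being tested.
    for (const m of body.matchAll(/href="([^"#]+)"/g)) {
      const next = new URL(m[1], url).toString();
      if (next.startsWith(startUrl) && !visited.has(next)) queue.push(next);
    }
  }
}

checkSite("https://www.example.com/").catch(console.error);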

In an example I saw, the first deployment of an integrity-testing tool was at a site with a high number of objects within each page. Before the application was tuned by the inclusion of delays inside each testing transaction, the rapid sequence of links delivered large numbers of new objects to the server's cache. This would cause the cache to replace other content with these transitory objects and add some delays while the cache refreshed its normal content after the test. The active collector was attached in a new position so that the server's cache was not directly in line and was therefore not disturbed by the integrity testing.

Instrumenting Other Server Components

Many applications are organized into tiers of servers, as shown in Figure 3-1 of this book, for higher availability and distribution of transaction volumes. Chapter 9 discusses the use of collectors to obtain measurement data from these servers within the data center. For example, using active collectors at the edge of the data center provides direct application response-time measurements without the effects of the external network. Consider a scenario in which the WAN is slow, but the web front end and the back-end database have no performance problems. Taken alone, the end-to-end measurement from the end users' locations will show degradation. The data from the active collectors at the edge of the data center, however, shows that the real problem lies with the network rather than with the application as a whole.
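A simplified TypeScript sketch of how the two measurements combine is shown below; in practice the two timings come from separate collectors reporting to a common manager, and the URLs here are hypothetical:

// Time a full request/response cycle from the caller's vantage point.
async function timeRequest(url: string): Promise<number> {
  const start = Date.now();
  await fetch(url);
  return Date.now() - start;
}

async function localizeDelay(): Promise<void> {
  // End-to-end time as seen from a user location (includes the WAN)...
  const endToEndMs = await timeRequest("https://www.example.com/checkout");
  // ...versus the same transaction measured by a collector at the data-center
  // edge (application tiers only, no external network).
  const edgeMs = await timeRequest("https://edge-collector.example.com/checkout");

  const networkMs = endToEndMs - edgeMs;
  if (networkMs > edgeMs) {
    console.log(`Most of the delay (${networkMs} of ${endToEndMs} ms) is in the network.`);
  } else {
    console.log(`Most of the delay (${edgeMs} of ${endToEndMs} ms) is in the application tiers.`);
  }
}

localizeDelay().catch(console.error);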

Legacy applications are still very much in the mix for most organizations, although they are hidden behind better web interfaces or safely tucked away in the back-end areas. The challenge with legacy applications is that many were never initially instrumented for remote monitoring and management.

The relative opacity of legacy applications dictates less direct approaches to understanding application behavior. One approach, pioneered by BMC Software, is to treat an application as a black box and observe its behavior indirectly. BMC started by instrumenting mainframe applications through observation of their effects on system logs, disk activity, and memory usage, among other factors, and then uses experience to make an educated guess about the application's behavior based on inferences drawn from the factors that can be monitored.

Geodesic Systems offers a more direct approach to instrumenting applications with their Geodesic TraceBack tool. They actually embed instrumentation during the application build process. Instrumentation is incorporated into the application code at compile time, and it records application behavior at such a fine level of detail that application errors can be pinpointed to a specific line of code.

End-User Measurements

For end-user measurements, passive and active collectors are placed near concentrations of customers or at key infrastructure locations. The active collectors interact with the web applications, carrying out normal transactions and measuring the end-to-end performance of the application from various sites. Almost all measurement system vendors, such as Computer Associates, Tivoli, and HP, offer tools for running synthetic transactions or for passively observing end users.

Measurement services are also available from companies such as Keynote Systems and Mercury Interactive. Keynote Systems is the largest supplier, with over 1500 active measurement collectors at over 100 locations on all the major Internet backbones worldwide. They run synthetic web transactions over high bandwidth, dial-up, and wireless links, and they can also pull streaming media and evaluate the end-user experience. Use of measurement service suppliers makes the most sense when your customers are dispersed over the Internet or when you need a disinterested third party to provide your SLA metrics.

It's important to measure accurately when you want to evaluate the end-user experience. As mentioned, emulation of dial-up user experiences by using restricted-bandwidth devices fails miserably because of the impact of a real modem's hardware compression feature. A study in the Proceedings of the 27th Annual Conference of the Computer Measurement Group showed inaccuracies as high as 45 percent when using restricted-bandwidth emulation instead of real dial-up modem measurements.

Similarly, use of emulated browsers instead of the actual Microsoft Internet Explorer (IE), for example, can result in misleading page download times due to differences in the browser's handling of parallel connections and other functions. If full-page download times or transaction times aren't important, browser emulations are acceptable for fetching the first HTML file in a web page. Crucial measurements, such as DNS lookup time, TCP connect time (an excellent measure of round-trip network delay), and the time to obtain the first packet of a file (which may indicate growth in the server's backlog queue) are all obtainable from an emulated browser. However, more sophisticated measures, such as the total time to obtain the page, should use a real browser, not a simplistic emulation. That's because a real browser pulls images in parallel, uses plug-ins, and has other behaviors that simplistic emulators do not match. Especially when part of the page is being delivered by caches and third-party servers (ad servers, stock price servers, and content distribution networks), end-user measurement by a simple emulator is not satisfactory.
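The component timings mentioned above can be captured even by a very simple emulated client; the following TypeScript sketch for Node.js (hostname hypothetical) measures DNS lookup time, TCP connect time, and time to first byte of the base HTML file:

import { lookup } from "dns/promises";
import { connect } from "net";
import { get } from "https";

const HOST = "www.example.com";

// DNS lookup time.
async function dnsTime(): Promise<number> {
  const start = Date.now();
  await lookup(HOST);
  return Date.now() - start;
}

// TCP connect time, a good measure of round-trip network delay.
function tcpConnectTime(): Promise<number> {
  return new Promise((resolve, reject) => {
    const start = Date.now();
    const socket = connect(443, HOST, () => {
      socket.end();
      resolve(Date.now() - start);
    });
    socket.on("error", reject);
  });
}

// Time to the first packet of the base HTML file, which may indicate growth
// in the server's backlog queue.
function firstByteTime(): Promise<number> {
  return new Promise((resolve, reject) => {
    const start = Date.now();
    get(`https://${HOST}/`, (res) => {
      res.once("data", () => {
        resolve(Date.now() - start);
        res.destroy();
      });
    }).on("error", reject);
  });
}

async function main(): Promise<void> {
  console.log(`DNS lookup:  ${await dnsTime()} ms`);
  console.log(`TCP connect: ${await tcpConnectTime()} ms`);
  console.log(`First byte:  ${await firstByteTime()} ms`);
}

main().catch(console.error);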

Caches and other memories of previous measurements must also be discarded before the start of each new transaction cycle. This prevents misleading reuse of previously retrieved files.
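For an emulated client running under Node.js, that reset can be as simple as the following hedged sketch; the cache location is an assumption, and a browser-driven agent would clear its cookies and browser cache instead:

import { rmSync, mkdirSync } from "fs";

// Discard all files cached during the previous measurement cycle so the next
// synthetic transaction retrieves everything afresh.
function resetClientState(cachePath = "/tmp/agent-cache"): void {
  rmSync(cachePath, { recursive: true, force: true });
  mkdirSync(cachePath, { recursive: true });
}

resetClientState();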



