Extracting Knowledge from Data


If it's easy to measure, it's probably not useful. If it's useful, it won't be easy to measure.

—"A Web Statistics Primer" by Teresa Elms on builder.com

Once there are logs to analyze and questions to answer, the process of answering the questions with the contents of your logs begins. Regrettably, this is difficult.

Log analysis is generally too time consuming to do by hand. Therefore, the two options are to write analysis software yourself or buy an off-the-shelf solution (or, of course, to let someone else do it). Top-end analysis packages can cost tens (or even hundreds) of thousands of dollars, but cost-effective solutions exist that can satisfy the needs of many small- or medium-sized sites.

DIY

Starting completely from scratch is not recommended. There are a number of inexpensive or free products that can provide basic conglomerate statistics (path analysis or behavior segmentation software is complicated, and as of this writing, I don't know of any inexpensive products that do it with any degree of sophistication).

When working with low-end products, it's useful to prefilter the data to reduce noise in the results. There are four classes of hits that can be removed to improve the final statistics (check the documentation of the package you're using, because some packages do some of these for you); a rough filtering sketch follows the list.

  • Hits from search engines. Search engines regularly send out spiders to gather site content. These spiders don't act like users. They follow links in the order they appear on the page, not in terms of what's interesting or useful. Thus, they fill up logs with data that have nothing to do with how your site is used. Fortunately, most of them use a distinct user agent field so that it's possible to quickly extract them from a log file. A list is available from info.webcrawler.com/mak/projects/robots/robots.html.

  • Offline browsing agents. These are programs that download all the files from a site so that it can be viewed offline at a later time. They act like spiders and have the same problems. A list is available at www.jafsoft.com/misc/opinion/webbots.html#offline_browsers_and_other_agents.

  • Outliers. Outlier is the name given to a value that lies far outside the majority of collected values. Outliers can skew compound statistics, such as average times and path lengths, by distorting the mean (they're described as part of the "billionaire" problem in Chapter 12). To test whether your statistics are being skewed by outliers, compare the mean of your data to the median. If the two are drastically different, you may have a large number of outliers, which should then be removed and the results recalculated (again, the general ideas behind this are described in more detail in Chapter 12).

  • Your site. Since much of the traffic on your site comes from inside your own company, it can annoyingly skew your referrer statistics.
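
As a rough illustration of this kind of prefiltering, the sketch below (in Python) drops hits whose user-agent field contains one of a few robot signatures or whose client address falls inside an internal subnet. The log format, signature list, internal address prefix, and file names are all assumptions to be adapted to your own environment.

import re

# Substrings that identify spiders and offline-browsing agents in the
# user-agent field. These few are illustrative; build a real list from
# the robot databases mentioned above.
ROBOT_SIGNATURES = ("googlebot", "slurp", "crawler", "spider", "wget", "httrack")

# Requests coming from inside the company (an assumed internal subnet).
INTERNAL_PREFIX = "10.0."

# Combined Log Format: host, identity, user, [time], "request", status,
# bytes, "referrer", "user-agent".
LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

def keep(line):
    """Return True if a log line should survive prefiltering."""
    m = LINE.match(line)
    if not m:
        return False                     # malformed line
    host, agent = m.group(1), m.group(2).lower()
    if host.startswith(INTERNAL_PREFIX):
        return False                     # traffic from your own company
    return not any(sig in agent for sig in ROBOT_SIGNATURES)

with open("access.log") as src, open("filtered.log", "w") as dst:
    dst.writelines(line for line in src if keep(line))

The outlier check from the list above is just as mechanical: compute the mean and the median of, say, per-session page counts (Python's statistics.mean and statistics.median will do) and compare them; a large gap suggests outliers worth removing before recalculating.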

A list of free or cheap log analysis products is available from www.uu.se/Software/Analyzers/Access-analyzers.html and Yahoo!'s Computers and Internet > Software > Internet > World Wide Web > Servers > Log Analysis Tools category. The majority of these produce conglomerate statistics. Though they generally don't do much with session or user tracking, several of them allow drill-downs to the clickstream level.

Once the basic statistics have been examined, more usage information can be gleaned by looking at individual clickstreams. Even without automated path analysis, it's possible to get a gut feeling for how people use the site. By looking at a relatively large number (50–100) of random clickstreams and taking notes, it's possible to get an idea of how often people followed one path or another, how deep the paths were, and the entry and exit pages.
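
One way to pull such a sample, assuming your tool can export one row per hit with a visitor identifier, a sortable timestamp, and a URL (the file name and column names here are invented for illustration):

import csv
import random
from collections import defaultdict

# hits.csv is assumed to have columns visitor_id, timestamp, url, with
# timestamps in a sortable form such as ISO 8601.
streams = defaultdict(list)
with open("hits.csv") as f:
    for row in csv.DictReader(f):
        streams[row["visitor_id"]].append((row["timestamp"], row["url"]))

# Order each visitor's hits by time to get a clickstream.
clickstreams = [[url for _, url in sorted(hits)] for hits in streams.values()]

# Pull a manageable random sample to read by hand.
for path in random.sample(clickstreams, min(50, len(clickstreams))):
    print(f"{len(path):3d} pages  entry={path[0]}  exit={path[-1]}")
    print("    " + " -> ".join(path))

Reading the printed paths alongside the pages themselves is usually enough to form the gut feeling described above.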

However, manual log analysis is time consuming and inaccurate. Commercial tools provide deeper, more certain, and more consistent knowledge.

Big Tools

Kirk to computer: "Correlate all known data."

Computer: "Wor-king ...."

Industrial-grade analysis products work, at heart, much like the low-end tools, but they kick the process into a whole other realm. By correlating information from multiple databases, such tools produce a more complex picture of users and their "consumer life cycle" than is possible with less sophisticated, run-of-the-mill tools. One such product is described by the Aberdeen Group as being able to "process Web logs, customer buying histories, demographic information, and other data to create a view of a customer's visits and proclivities. It will then map that customer behavior to the business's categories and analyze the intersections, ultimately segmenting customers into groups and explaining the rules that created those segments." This, in effect, allows the company to create a highly personalized experience (and marketing plan) for the people who match its segments.

The kinds of products that do this level of data integration and analysis can generally be found when looking for Web "data mining" or CRM information. As of this writing, some companies that make these products include TeaLeaf, NetGenesis, Accrue, WebTrends, Coremetrics, Limelight, and Personify. There is a lot of overlap among them, though each has some unique features; several are geared toward experience personalization, so they tend to be more experience focused than advertising or strategy focused.

Most of these packages are primarily designed for assisting strategic marketing and business decisions, but their tools can also be used by developers and designers to understand user behavior and gauge the effectiveness of changes to the user experience. For example, knowing that a large proportion of people who look at forks also look at napkin rings can help adjust the information architecture of a site. Extending that kind of analysis to the relationships between all the categories can tune the whole site more finely than can qualitative research methods.
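
A first cut at that kind of relationship analysis is a simple overlap count: of the visitors who looked at one category, how many also looked at another? A minimal sketch, assuming each visitor's session has already been reduced to the set of categories viewed (the toy data here is made up):

from collections import Counter
from itertools import combinations

def category_overlap(visits):
    """visits: one set of categories per visitor. Returns, for each pair
    of categories, the share of the smaller category's viewers who also
    viewed the other category."""
    seen = Counter()
    both = Counter()
    for cats in visits:
        seen.update(cats)
        both.update(combinations(sorted(cats), 2))
    return {pair: both[pair] / min(seen[pair[0]], seen[pair[1]]) for pair in both}

# Toy per-visitor category sets standing in for real session data.
visits = [{"forks", "napkin rings"}, {"forks", "plates"},
          {"forks", "napkin rings", "plates"}, {"glasses"}]
for pair, share in sorted(category_overlap(visits).items(), key=lambda kv: -kv[1]):
    print(pair, round(share, 2))

Pairs with high overlap are candidates for cross-linking or for sitting closer together in the site's structure.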

Moreover, tools that integrate bottom-line financial information with user experience data can help resolve tensions between usability and business needs. For example, a search engine moved the banner ad on its results page so that it was displayed more prominently. This was intentionally more distracting to users and produced the expected higher click-throughs and higher revenues. However, it was important to know whether the additional revenue was a short-term effect that would be undermined by users growing annoyed at the new ad placement and never returning. Analyzing user retention rates before and after the change showed that retention did not perceptibly change—people were not running away any more than they had before—while revenue went up significantly. The search engine decided to keep the change. Later, the same search engine made the advertising even more prominent, but retention rates began falling while revenue did not significantly increase, and the change was rolled back.
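
The underlying comparison in that example is simple to sketch: measure what fraction of the visitors seen before the change returned later, do the same for the cohort seen after the change, and check whether the difference is bigger than chance alone would produce. The toy numbers and the bare two-proportion z statistic below are only an illustration; a real analysis would use the analytics package's own cohort and retention definitions.

from math import sqrt

def retention_rate(cohort, returners):
    """Fraction of a cohort of visitor ids that showed up again later."""
    return len(cohort & returners) / len(cohort)

def two_proportion_z(p1, n1, p2, n2):
    """Rough z statistic for comparing two retention rates."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Toy visitor-id sets standing in for real cohorts.
before_cohort = set(range(0, 1000))
before_returned = set(range(0, 420))        # 42% came back
after_cohort = set(range(1000, 2000))
after_returned = set(range(1000, 1410))     # 41% came back

before = retention_rate(before_cohort, before_returned)
after = retention_rate(after_cohort, after_returned)
z = two_proportion_z(before, len(before_cohort), after, len(after_cohort))
print(f"retention before={before:.1%}, after={after:.1%}, z={z:+.2f}")

A |z| well under 2 (as in this toy data) suggests the difference could easily be noise rather than a real change in retention.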

These large tools can be expensive (typically, it costs tens or hundreds of thousands of dollars to integrate one with your existing databases and then analyze the results), so they need to be chosen carefully and with specific research goals in mind. As you're evaluating them, often with a consultant from the company providing the service, ask which of your research questions the package can answer, and with what level of certainty.

Regardless of the level of sophistication of your analysis, people's existing experiences should not be ignored. Customer support and log analysis are powerful tools that can be used at almost any time in a product's life to expose valuable, immediately actionable information. They should be used in the context of a complete research program, but they should not be forgotten.

start sidebar
A Note About CRM

This chapter merely scratches the surface of customer relationship management (CRM), a rapidly growing field that attempts to track customers throughout all of their interactions with the company. It begins with things like customer comments and log files, but includes deep quantitative user/customer behavior analysis and attempts to relate user behavior to the company's products and revenue streams. For example, Mattel's Tickle Me Elmo doll was a popular and expensive toy when it came out. It worked reasonably well, but when it broke there was little support for its repair (at least from the perspective of the parents I spoke with). Many buyers became quite upset when they discovered their toys weren't fixable. Mattel was clearly not meeting customer expectations for a certain group of people, but how many? What effect did this have on these people's opinions of and behavior toward Mattel and Sesame Street? How did this affect Mattel's bottom line in the short term? The long term? The point of CRM is to understand the relationship between the customer and every facet of an organization.

Many of the ideas and quantitative techniques of CRM overlap with user experience research, and there will certainly be a lot of fruitful cross-pollination between the two fields in the future.

end sidebar



