Some Useful Metrics


Before formulating a plan to analyze a set of log files, it's useful to decide how the site is underperforming or where the development team believes it can be improved. A list of problems and their hypothetical causes helps constrain the questions asked of the data (see Chapter 5 for suggestions on collecting research goals and problems). This focuses the process on solving the most pressing problems and avoids the morass of data analysis that can accompany log mining.

Here are some typical open-ended questions.

  • Do the recent changes in the interface make it easier to buy?

  • How many people do we lose before they find what they're looking for? Where are we losing them?

  • Where do people spend the most time?

  • How many visitors eventually become purchasers?

  • How does the behavior of various groups of users differ?

A more systematic way of formulating these questions comes from looking at the kinds of analysis that can be done. Four different types of analysis are relevant here: aggregate measurement, session-based statistics, user-based statistics, and path analysis.

Aggregate Measurements

The easiest questions to answer are those that cover the largest windows of information. The crudest measure of site traffic, composite page view tallies, creates a kind of "50,000-foot" view of how people use the site. These tallies include the following:

  • The total number of pages viewed in a given period. "How many page views were there in December 2000 versus June 2000?" Be careful not to compare page views before and after a redesign since that's like comparing two different sites and tells you little.

  • The distribution of page views over a specific time. "Are there more page views when people are at work than when they're not?"

  • The distribution of page views over the whole site. This is a breakdown of total page views over a section of the site. It could be as fine as individual pages or clustered by content type.

These are crude measures and should not be taken literally. For example, when looking at a time-based distribution of pages, the page views may be consistently lower on Wednesday and Thursday than on other weekdays. Although this is interesting, it doesn't really provide any information about the cause of the drop-off. It may be because of users' behavior, but it could be that your ISP is busiest during those two days and some people can't even get through.
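These tallies can be computed with a single pass over the raw log. The sketch below counts page views per day from a handful of made-up lines in the Apache "combined" log format; the sample data and the minimal regular expression are illustrative assumptions, not a production-grade parser.

```python
import re
from collections import Counter

# Hypothetical sample lines in Apache "combined" log format.
LOG_LINES = [
    '1.2.3.4 - - [05/Dec/2000:10:01:00 -0800] "GET /forks/index.html HTTP/1.0" 200 512 "-" "Mozilla/4.0"',
    '1.2.3.4 - - [05/Dec/2000:10:02:30 -0800] "GET /spoons/index.html HTTP/1.0" 200 734 "-" "Mozilla/4.0"',
    '5.6.7.8 - - [06/Dec/2000:22:15:00 -0800] "GET /forks/index.html HTTP/1.0" 200 512 "-" "Mozilla/4.0"',
]

# Pull just the calendar day out of the bracketed timestamp.
LOG_RE = re.compile(r'\[(?P<day>\d{2}/\w{3}/\d{4}):')

def page_views_per_day(lines):
    """Tally page views by calendar day -- the crudest aggregate measure."""
    tally = Counter()
    for line in lines:
        match = LOG_RE.search(line)
        if match:
            tally[match.group('day')] += 1
    return tally

print(page_views_per_day(LOG_LINES))
# Counter({'05/Dec/2000': 2, '06/Dec/2000': 1})
```

The same tally, keyed on the requested URL instead of the date, yields the distribution of page views over the whole site.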

Other conglomerate statistics include the following:

  • Operating system and browser proportions. These reflect the popularity of operating systems and browsers in the site's user population. From a user experience perspective, this information is most useful when determining how much to tailor a site to browser idiosyncrasies.

  • Client domain. For most sites, the hits will come from either .net or .com sites. This says little since ISPs anywhere on earth can have addresses in those domains. Once those are removed from the equation, it's possible to use the proportions of remaining country domains to get an idea of the size of the site's international audience or the audience from a single ISP (such as Earthlink or AOL).

  • Referrer sites. These are the pages and sites outside yours that were visited immediately before a given page was requested. In most cases, these are the ones that linked to your site, and knowing how people get to the site can help in understanding the context in which they're using it.

  • New/repeat users. A problematic statistic when looking for absolute numbers (because of the fallibility of cookies), it's still possible to use new/repeat users to understand long-term user behavior. If the proportion of returning users to new users is consistently low, the users may not be finding sufficient value in the site to return.

  • Search engine referrals and key words. The referrer log contains the URL of the last page visited. Many search engines keep the keywords that generated a results page in the URL to that page, so it's possible to see which search engines directed traffic to the site and the keywords that brought up the site on the results page. That can help the marketing department target search engine advertising.
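Extracting those keywords amounts to parsing the query string of each referrer URL. The sketch below is a minimal illustration; the referrer URLs are invented, and the list of parameter names that carry keywords ("q", "query", "p") is an assumption that would need to be checked against the engines actually sending traffic.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical referrer URLs taken from a referrer log.
REFERRERS = [
    "http://www.google.com/search?q=stainless+forks",
    "http://search.yahoo.com/search?p=kitchen+spoons",
    "http://www.example.com/links.html",   # ordinary link, no keywords
]

# Query parameters that commonly carry search keywords (an assumption).
KEYWORD_PARAMS = ("q", "query", "p")

def search_keywords(referrer):
    """Return the search keywords embedded in a referrer URL, or None."""
    query = parse_qs(urlparse(referrer).query)
    for param in KEYWORD_PARAMS:
        if param in query:
            return query[param][0]
    return None

for url in REFERRERS:
    print(urlparse(url).netloc, "->", search_keywords(url))
```

Tallying the non-None results by search engine domain gives both halves of the statistic: which engines send traffic, and which keywords bring the site up.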

Session-Based Statistics

The most useful metrics are ones that can use session information when creating conglomerate results. These reveal a richer set of user behaviors than simple conglomerate stats.

Some of the more useful are as follows:

  • Average number of pages per session. The number of pages in a typical session is a measure of how broadly people explore the site. For example, if most sessions consist of only one or two page views (as is often the case for a search engine), then all navigation and the whole information architecture needs to be focused on these "short-hop" navigation elements. However, if people see five pages on average, then the navigation needs to support a different kind of usage. A slight variation on this is a measurement of the number of different pages per session: if there are 20 pages per session on average, but all the visits are confined to 3 pages, it says that people are having a different kind of experience than if all 20 pages were different.

  • Average duration of session. Page timing is an underused branch of log analysis, primarily thanks to the confusion created by caches and the back button. Even with these limitations, time-based metrics are still useful when compared to each other. Breaking out the time per page can also be quite fruitful when trying to see which pages are used as content pages and which are transitional.

  • First and last pages. These are sometimes called "entry" and "exit" pages, and they can tell you whether people are moving through the site as you anticipated. The first page people see determines a lot about how they relate to a site. If people enter the site in unexpected ways, you may want to reconsider how and where you present information.
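All of these statistics start from the same step: grouping individual hits into sessions. A common approach (an assumption here, since the book does not prescribe one) is to treat hits from the same visitor as one session until a gap longer than some timeout, often 30 minutes, appears. A sketch with invented data:

```python
from datetime import datetime, timedelta

# Hypothetical (visitor_id, timestamp, page) records parsed from a log.
HITS = [
    ("a", datetime(2000, 12, 5, 10, 0), "/index.html"),
    ("a", datetime(2000, 12, 5, 10, 3), "/forks/"),
    ("a", datetime(2000, 12, 5, 16, 0), "/index.html"),  # long gap: new session
    ("b", datetime(2000, 12, 5, 11, 0), "/spoons/"),
]

TIMEOUT = timedelta(minutes=30)  # assumed session-gap threshold

def sessionize(hits):
    """Group hits into sessions: same visitor, gaps no longer than TIMEOUT."""
    sessions = []
    open_sessions = {}  # visitor -> list of (time, page) in the open session
    for visitor, ts, page in sorted(hits, key=lambda h: (h[0], h[1])):
        current = open_sessions.get(visitor)
        if current and ts - current[-1][0] <= TIMEOUT:
            current.append((ts, page))
        else:
            if current:
                sessions.append(current)
            open_sessions[visitor] = [(ts, page)]
    sessions.extend(open_sessions.values())
    return sessions

sessions = sessionize(HITS)
pages_per_session = sum(len(s) for s in sessions) / len(sessions)
durations = [(s[-1][0] - s[0][0]).total_seconds() for s in sessions]
print(len(sessions), pages_per_session, durations)
```

From the session list, pages per session, session duration, and first/last pages (`s[0][1]` and `s[-1][1]`) all fall out directly.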

User-Based Statistics

Identity cookies can give you a further level of information: behavior aggregated per individual user rather than per session.

Some of these metrics are as follows:

  • Number of visits. The number of times users come to a site is an important measurement of their loyalty and their trust.

  • Frequency of visits. How often people visit can determine how often content needs to be updated. If the content is updated five times a day, but the majority of visitors visit only once a week, it may not be as important to have such frequent updates.

  • Total time spent on the site. The amount of time people spend on the site over the course of a week, a month, or a year is another indicator of loyalty or utility. If the average time on a site is 5 minutes a month, but competitors' users average 40 minutes, then your site is either much more efficient or much less interesting.

  • Retention rate. The number of people who come back after their first visit. This, too, can be measured over the period of weeks, months, or years, depending on the purpose of the site and the horizon of typical usage.

  • Conversion rates. The proportion of visitors who eventually become purchasers (or frequent users or whatever is important for the site's success). This is measured by calculating the proportion of all new users from a given point in time who eventually became purchasers and comparing it to users who weren't "converted." It's the best direct measurement of a site's success.
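Given per-user event histories keyed by identity cookie, the conversion rate is a straightforward proportion. The event names and histories below are invented for illustration; "purchase" stands in for whatever the site counts as success.

```python
# Hypothetical event histories keyed by identity cookie.
USER_EVENTS = {
    "cookie-1": ["visit", "visit", "purchase"],
    "cookie-2": ["visit"],
    "cookie-3": ["visit", "visit", "visit", "purchase"],
    "cookie-4": ["visit", "visit"],
}

def conversion_rate(user_events, goal="purchase"):
    """Proportion of users whose history ever contains the goal event."""
    converted = sum(1 for events in user_events.values() if goal in events)
    return converted / len(user_events)

print(conversion_rate(USER_EVENTS))  # 0.5
```

Restricting `user_events` to users whose first visit fell in a given window gives the cohort-based version described above; counting users with more than one "visit" gives a simple retention rate.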

In addition to measuring the behavior of the whole user population, it's possible to couple identity cookies with clickstream analysis and segment users by their behavior. For example, a person who goes more frequently to the fork section of a kitchen site than other sections can be labeled a "fork aficionado." Fork aficionados' behavior can then be compared to that of spoon lovers and cup fans. These, in turn, can be compared to the user profiles defined at the beginning of the development process. This is a powerful technique for understanding behavior and for personalization, though it is still in its infancy.

Clickstream Analysis

In addition to the general metrics, there are other measurements that get at the user experience. Of course, as with all indirect user research, none of these methods can tell you why people behave the way they do (for that you need the direct user contact that contextual inquiry, usability testing, and focus groups provide), but they can certainly narrow down the possibilities of how.

One of the most useful of these synthetic techniques is clickstream analysis. This is the process of analyzing the paths people followed through your site to uncover commonalities in the way people move through it. Clickstream analysis can produce such interesting results as

  • The average path. This is an analysis of a "typical" path through the site. Of course, as with all these metrics, what is considered to be typical depends on the quality of the data and the algorithm used to determine "typicality," but it can still provide useful insights.

  • "Next" pages. This statistic gives the proportion of pages that were the immediate successor to a given page. For example, you could look at the "next" links from a search engine results page and see how many people went to a page with specific information versus how many people went back to the search page. This could tell you how effective your search results were in helping people find what they were looking for.

More specialized analyses are possible for products that have a niche focus. For example, ecommerce sites can define several specialized types of clickstream analysis.

  • The purchase path. This is the average path people followed when buying something. It's typically calculated by figuring out the average path from entry pages to the "thank you for shopping with us" page.

  • Shopping cart abandonment. The common form of this finds where in the purchase path the process typically ends, when it doesn't end at the "Thank you" page. This bailout point can then be studied for usability issues (are people just changing their minds, or is there something that's preventing them from completing their purchase?). In addition, when abandoned paths are joined with the shopping cart database, it's possible to see how many people never buy what is in their cart and thereby determine the value of lost sales.
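Finding bailout points amounts to looking at the last page of every purchase path that never reached the "Thank you" page. The paths and page names below are invented stand-ins for a real checkout flow:

```python
from collections import Counter

# Hypothetical purchase paths; any path ending short of /thanks.html
# counts as abandoned, and its last page is the bailout point.
PURCHASE_PATHS = [
    ["/cart", "/checkout", "/thanks.html"],
    ["/cart", "/checkout"],   # abandoned at checkout
    ["/cart"],                # abandoned at the cart itself
    ["/cart", "/checkout", "/thanks.html"],
]

def bailout_points(paths, goal="/thanks.html"):
    """Tally where abandoned purchase paths end, by page."""
    return Counter(path[-1] for path in paths if path[-1] != goal)

print(bailout_points(PURCHASE_PATHS))
# Counter({'/checkout': 1, '/cart': 1})
```

A concentration of bailouts on one page is the signal to take that page into usability testing.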

Another good technique is content clustering. The pages of a site are abstracted into chunks by topic ("silverware," "dishes," "new products," etc.) or function ("navigation," "help," "shopping," etc.). These clusters are then used in place of pages to get results about classes of content rather than individual pages. For example, looking at the traffic to all the silverware on a kitchenware site may be more useful than the statistics for the spoon page. It's then possible to investigate how effectively the "Fork of the Week!" link drives silverware sales by comparing the proportion of silverware traffic to dish traffic before and after the link was added.
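Mechanically, content clustering is a roll-up: map each page to its cluster, then sum the per-page counts. The page-to-cluster map and view counts below are invented to match the kitchenware example:

```python
from collections import Counter

# Hypothetical page-to-cluster map, standing in for a real site map.
CLUSTERS = {
    "/forks/dinner.html": "silverware",
    "/spoons/soup.html": "silverware",
    "/plates/salad.html": "dishes",
    "/help/faq.html": "help",
}

# Hypothetical per-page view counts derived from the log.
PAGE_VIEWS = Counter({
    "/forks/dinner.html": 120,
    "/spoons/soup.html": 80,
    "/plates/salad.html": 150,
    "/help/faq.html": 30,
})

def cluster_views(page_views, clusters):
    """Roll individual page counts up into per-cluster totals."""
    totals = Counter()
    for page, views in page_views.items():
        totals[clusters.get(page, "other")] += views
    return totals

print(cluster_views(PAGE_VIEWS, CLUSTERS))
# Counter({'silverware': 200, 'dishes': 150, 'help': 30})
```

Comparing these cluster totals before and after a change (such as adding the "Fork of the Week!" link) is the comparison the paragraph above describes.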




Observing the User Experience: A Practitioner's Guide to User Research
ISBN: 1558609237