Introduction | Computing Information Technology: The Human Side

As the importance of the Internet rises, the need to create more adaptive and more usable Websites also grows. Most improvements to a Website require some knowledge of the site's users and how they are interacting with the pages. However, Web professionals today have relatively few good options for capturing this information. Certainly, there are software and services to help summarize the basic information from the Website logs. This could mean keeping track of the frequency of visits for the individual Web pages that make up a site, counting how many times the overall Website is visited from a specific Web location, or other basic statistics.

Web usage mining refers to the application of data mining techniques to the Web server log in order to recover patterns in the use of a Website. For example, Mobasher, Cooley and Srivastava (2000) describe an automated recommender system that dynamically suggests appropriate pages for a user based on the overall Website usage patterns. The system presented in Spiliopoulou (2000) answers questions about the Website usage, when asked in an SQL-like language. An experienced user could interactively use this system to identify the Web page sequences that meet any criterion that the user specifies. Such a general tool is very powerful, but requires considerable expertise from the user.

Perotti and Burke (2001) presented a technique and visualization that offers Web developers an opportunity to easily see the pattern of usage at a Website. Unlike earlier depictions, their Web Usage Plot emphasizes the relationships among the various pages at a Website by displaying them in a topographic organization: pages that are frequently visited together appear close together, while those that are seldom visited in the same session appear far apart. Their process for creating the Web Usage Plot visualization has several steps, as listed in Table 1.

Table 1: A Simple Process for Visualizing Web Usage (Adapted from Perotti & Burke, 2001)
1. Cleaning and organizing the Web server logs
2. Creating an aggregate representation of all users' Web page visits, the co-occurrence matrix
3. Visualizing this representation

The final visualization step relies on a multivariate statistical technique called Multidimensional Scaling (MDS). This technique reduces high-dimensional data to lower-dimensional coordinates that can be visualized more easily. The Web Usage Plot created with MDS has many advantages over earlier representations of Website usage patterns.
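To make the idea of dimensionality reduction concrete, the following is a minimal sketch of classical (Torgerson) MDS in NumPy. It is not the SPSS procedure; it is one well-known MDS variant swapped in for illustration. The function name `classical_mds` and the toy distance matrix are hypothetical, and how page dissimilarities would be derived from usage data is left open here.

```python
import numpy as np

def classical_mds(dissimilarity, n_components=2):
    """Embed items in n_components dimensions from pairwise dissimilarities."""
    d = np.asarray(dissimilarity, dtype=float)
    n = d.shape[0]
    # Double-center the squared dissimilarities to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Toy data (assumption): distances between the corners of a 3 x 4 rectangle.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

coords = classical_mds(D, n_components=2)
D_recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
print(np.round(D_recovered, 3))
```

Because the toy distances are exactly Euclidean in two dimensions, the recovered layout reproduces them up to rotation and reflection. Real usage data would generally be reproduced only approximately.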

Unfortunately, using MDS for Web usage visualization can be tedious because the algorithms for reducing the data dimensionality are computationally expensive. For example, the authors used the SPSS software package, which limits the user to visualizing no more than 100 Web pages. Clearly, many Websites have more Web pages than this arbitrary limit. The present research explores an alternative and potentially superior approach that uses a neural network to capture the usage patterns at a Website. In this technique, a neural network is trained with the patterns of usage at a Website and then automatically organizes a low-dimensional representation of these patterns.

Kohonen Self Organizing Map

Kohonen's self-organizing map (SOM) is a well-known neural network technique for reducing the dimensionality of data. In this technique, a neural network is created in the desired low dimensionality, say two dimensions for the sake of explanation. This network is then trained with a set of input patterns that correspond to the high-dimensional data to be reduced. As the network adapts, one of the network nodes becomes highly associated with each input pattern, so that when that input pattern is presented, the associated node is the most active node in the network. After training, the neural network represents a simple (two-dimensional) map in which nearby nodes represent similar input patterns in the multidimensional input data.
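The training procedure described above can be sketched in a few lines of NumPy. This is an illustrative textbook-style SOM, not the implementation used in the research: the function names, the linear decay of the learning rate and neighborhood width, the Gaussian neighborhood, and the toy input patterns are all assumptions.

```python
import numpy as np

def best_matching_unit(weights, x):
    """Return the (row, col) of the node whose weight vector is closest to x."""
    distances = np.linalg.norm(weights - x, axis=2)
    row, col = np.unravel_index(np.argmin(distances), distances.shape)
    return int(row), int(col)

def train_som(patterns, rows, cols, epochs=100, lr0=0.5, seed=0):
    """Train a rows x cols map on the given input patterns."""
    rng = np.random.default_rng(seed)
    patterns = np.asarray(patterns, dtype=float)
    n, dim = patterns.shape
    weights = rng.random((rows, cols, dim))  # random initial weights in [0, 1)
    # grid[r, c] holds the (r, c) coordinate of each node on the map.
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    sigma0 = max(rows, cols) / 2.0
    total_steps = epochs * n
    step = 0
    for _ in range(epochs):
        for x in patterns[rng.permutation(n)]:
            progress = step / total_steps
            lr = lr0 * (1.0 - progress)                  # assumed linear decay
            sigma = max(sigma0 * (1.0 - progress), 0.5)  # assumed shrinking radius
            bmu = best_matching_unit(weights, x)
            grid_dist2 = ((grid - np.array(bmu)) ** 2).sum(axis=2)
            influence = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            # Pull the winner and its map neighbors toward the input.
            weights += lr * influence[..., None] * (x - weights)
            step += 1
    return weights

# Toy input patterns (assumption): four items described by four binary features.
patterns = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]])
weights = train_som(patterns, rows=3, cols=3)
for i, p in enumerate(patterns):
    print(i, best_matching_unit(weights, p))
```

Identical input patterns necessarily share a winning node. Where dissimilar patterns end up depends on the random initialization and the decay schedule chosen.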

Self-organizing maps have already been used for a great variety of problems, including browsing a picture database, data exploration, representing large text collections and classifying Web documents based on their textual content (Kohonen et al., 2000).

The goal for the present research is to create and visualize a self-organizing map neural network representation of Website usage patterns. As in the Web Usage Plot, the self-organizing map visualization should be useful for Web page developers to identify clusters of Web pages that are visited together frequently. However, the new techniques go well beyond a simple substitution of the SOM for the Multidimensional Scaling in the procedure outlined above.

One of the key issues in using a SOM is how the data is represented for training. We have found that the co-occurrence matrix (in Table 1) is not well suited for training a neural network. To understand why, consider the structure of the co-occurrence matrix. For every Web page at the given Website, both a row and a column are created. So, if there were n total Web pages at the Website, the resulting co-occurrence matrix would have n × n cells. Each cell holds the number of times that the two pages (represented by its row and column) were visited together in the same session. So, for example, if Web page 16 was visited frequently with Web page 42, then we would see a high number in the cell for column 16 and row 42. Of course, only half of the matrix is really needed, since the co-occurrence of two pages in the same session is symmetric.
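As a concrete illustration of this structure, the sketch below builds a co-occurrence matrix from a handful of made-up sessions. The function name, the session data, and the choice to leave the diagonal at zero are assumptions; the text does not say how a page's co-occurrence with itself is treated.

```python
from itertools import combinations

import numpy as np

def cooccurrence_matrix(sessions, n_pages):
    """Count, for every pair of pages, the sessions in which both appear."""
    counts = np.zeros((n_pages, n_pages), dtype=int)
    for session in sessions:
        # A page counts once per session, however often it was requested.
        for a, b in combinations(sorted(set(session)), 2):
            counts[a, b] += 1
            counts[b, a] += 1  # symmetric: (a, b) and (b, a) are the same pair
    return counts

# Hypothetical sessions over a five-page site (page ids 0..4).
sessions = [[0, 1, 2], [1, 2], [0, 3], [1, 2, 4]]
C = cooccurrence_matrix(sessions, n_pages=5)
print(C)
```

Here pages 1 and 2 appear together in three sessions, so `C[1, 2]` and `C[2, 1]` both hold 3, which shows why only half of the matrix carries information.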

Using the co-occurrence matrix as input to the SOM simply requires treating each row in the matrix as an input pattern, since each row is a vector that describes the aggregate usage of one Web page with all other Web pages. The problem is that the goal for the SOM is to have pages that are frequently visited together map to nearby nodes in the two-dimensional network. Unfortunately, the vectors representing two highly associated pages may be very different. Consider the example given above, where the row for Web page 16 will have a high number in column 42, while the row for Web page 42 will have a high number in column 16. The two vectors thus have their largest values in different positions and can be very different.

A potentially superior representation of the same information for input to the SOM could be called the session membership matrix. As before, each row corresponds to a specific Web page. However, each column now corresponds to a particular user session that was recovered from the Web log file. For a given row, each column holds a one (1) if the Web page represented by the row was visited in the session corresponding to the column, and a zero (0) otherwise. Thus, to continue the example above, because Web pages 16 and 42 are visited together in the same sessions, they should have a similar pattern of ones and zeros along their corresponding rows. Because Web pages that are visited together in the same session will have similar vectors, the session membership matrix is more appropriate for training the SOM than the earlier co-occurrence matrix.
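The contrast between the two representations can be checked on a small example. The sketch below builds a session membership matrix from made-up sessions and compares the row vectors of two pages that are always visited together. The function name and the data are hypothetical, and the co-occurrence counts are derived here as `M @ M.T` with the diagonal set to zero, which is an assumption about how self-co-occurrence is handled.

```python
import numpy as np

def session_membership_matrix(sessions, n_pages):
    """Rows are pages, columns are sessions; 1 marks a page visited in a session."""
    membership = np.zeros((n_pages, len(sessions)), dtype=int)
    for j, session in enumerate(sessions):
        membership[list(set(session)), j] = 1
    return membership

# Hypothetical sessions over a five-page site (page ids 0..4).
# Pages 1 and 2 always appear in the same sessions.
sessions = [[0, 1, 2], [1, 2], [0, 3], [1, 2, 4]]
M = session_membership_matrix(sessions, n_pages=5)

# Co-occurrence counts follow from the membership matrix.
C = M @ M.T
np.fill_diagonal(C, 0)  # assumption: a page is not counted with itself

dist_membership = np.linalg.norm(M[1] - M[2])
dist_cooccurrence = np.linalg.norm(C[1] - C[2])
print(dist_membership, dist_cooccurrence)
```

With these sessions, the membership rows for pages 1 and 2 are identical, while their co-occurrence rows differ because each row carries its large count in the other page's column.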

Visualization of the Self Organizing Map

Another unique contribution of the present research is in the visualization of the SOM. While there are several existing techniques to create a depiction from the self-organizing map, the resulting pictures are often much more difficult to interpret than the simple map-like presentation in the Web Usage Plot. For example, a common SOM depiction requires the viewer to infer the presented relationships from a complex image of gray-scale or color levels. Since the goal of the research is to make an effective tool for Web administrators and developers, a simpler image is desirable.

One existing way to visualize a SOM is to simply note which node in the grid responds the most when presented with a given input pattern. The map can then be visualized by plotting a point at every grid location whose node responded the most during the presentation of an input. In our case, a point located at a grid location would represent each Web page. However, this approach has two problems. For one, multiple pages frequently map to the same network node. So, the viewer would only see one point, when in fact several associated Web pages may be represented there. A second problem is that the distance between points is somewhat arbitrary, since it simply corresponds to the regular distance between the SOM nodes in their grid.
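The first problem is easy to reproduce. The sketch below assigns each input pattern to its winning node and groups pages by grid location. The hand-set weights stand in for a trained map, and the "response" of a node is taken to be the Euclidean distance between its weight vector and the input (smaller is stronger); both are assumptions made for illustration.

```python
from collections import defaultdict

import numpy as np

def winning_nodes(weights, patterns):
    """Return the winning (row, col) and the matching error for each pattern."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(rows * cols, dim)
    # distances[i, k]: distance from pattern i to node k.
    distances = np.linalg.norm(patterns[:, None, :] - flat[None, :, :], axis=2)
    winners = distances.argmin(axis=1)
    coords = np.column_stack(np.unravel_index(winners, (rows, cols)))
    errors = distances[np.arange(len(patterns)), winners]
    return coords, errors

# Hand-set 2 x 2 map (assumption), standing in for a trained network.
weights = np.array([
    [[1, 1, 0, 0], [1, 0, 0, 0]],
    [[0, 0, 1, 0], [0, 0, 1, 1]],
], dtype=float)

# Session-membership style rows for three hypothetical pages.
patterns = np.array([[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1]], dtype=float)

coords, errors = winning_nodes(weights, patterns)
pages_at_node = defaultdict(list)
for page, (r, c) in enumerate(coords):
    pages_at_node[(int(r), int(c))].append(page)
print(dict(pages_at_node))
print(errors)
```

Pages 0 and 1 both win at node (0, 0), so a plain grid plot would draw a single point for them even though page 1 matches that node less well than page 0 does.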

Figure 1 demonstrates a novel "jittered" visualization of the self-organizing map neural network, which overcomes the two problems mentioned above. Using this procedure, the visualized location of each Web page is jittered by a small amount to displace it from the regular grid location. The displacement of each point is proportional to the error reported by the SOM network when responding to the specific input pattern. Thus, multiple Web pages can be visualized at the same node location, and the viewer can see the association among them as a cluster. Also, the distance of any point from the regular grid location is a measure of how well that grid location's node succeeded in distinguishing that input pattern from the rest. Such a depiction is easy to interpret, and a viewer can quickly get a sense for the primary usage patterns that are present at the Website.
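One possible reading of the jittering procedure is sketched below. The text specifies only that the displacement is proportional to the SOM error; the random direction of each offset, the scaling so that the largest error maps to `max_offset` grid units, the function name, and the sample values are all assumptions.

```python
import numpy as np

def jittered_positions(node_coords, errors, max_offset=0.4, seed=0):
    """Displace each page from its grid node by a distance proportional to its error."""
    rng = np.random.default_rng(seed)
    node_coords = np.asarray(node_coords, dtype=float)
    errors = np.asarray(errors, dtype=float)
    peak = errors.max()
    # Scale so the worst-matched page moves max_offset grid units (assumption).
    if peak > 0:
        radius = max_offset * errors / peak
    else:
        radius = np.zeros_like(errors)
    # The direction is not specified in the text; a random angle is assumed.
    angle = rng.uniform(0.0, 2.0 * np.pi, size=len(errors))
    offsets = np.column_stack((radius * np.cos(angle), radius * np.sin(angle)))
    return node_coords + offsets

# Hypothetical pages: two share node (0, 0), one sits alone at node (2, 1).
node_coords = np.array([[0, 0], [0, 0], [2, 1]])
errors = np.array([0.1, 0.3, 0.0])

positions = jittered_positions(node_coords, errors)
print(np.round(positions, 3))
```

Keeping `max_offset` below half a grid unit means points from neighboring nodes cannot overlap, so clusters stay visually tied to their node while a page with zero error stays exactly on the grid.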

Figure 1: A Jittered Grid Depicting Clusters of Web Pages