13.3 The Clickstream | Internet-Enabled Business Intelligence



	Internet-Enabled Business Intelligence By William A. Giovinazzo
	Table of Contents

	Chapter 13. Swimming in the Clickstream

13.3 The Clickstream

The Web server records every step of our customer's interaction with our Web site. Every time the user goes from one point in our site to the next, the Web server records every file or image that is downloaded and every Web page that is opened. Typically, a Web server uses four logs: transfer, error, referrer, and agent. These are ASCII files whose fields are tab-, space-, or comma-delimited. Figure 13.2 shows how the server records interactions with the users in each log.

Figure 13.2. Web server log files.

graphics/13fig02.gif

The process may not begin at our own site. Perhaps the customer comes to our site from a link on a partner's Web site. The referrer log records this action. When a page is downloaded from the Web server to the customer's browser, the transfer log records a hit. A hit is often mistaken for accessing a specific Web page. Hits actually refer to the number of files accessed from a site. In addition to Web pages, graphic images, banner ads, and hyperlinks are hits on a site and are recorded in the transfer log. While this may seem curious at first, consider how the Web server views these objects. Each object (Web page, image, or banner ad) is a file. When requested, they are all sent to the client via HTTP in the same way, so there is no reason for the server to distinguish between them.
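The difference between hits and page requests can be illustrated with a short sketch. It assumes, purely for illustration, that a page is any requested file with an .html or .htm extension or no extension at all; the function name, the extension list, and the sample paths are hypothetical and not taken from any particular Web server.

```python
from pathlib import PurePosixPath

# Assumption: files with these extensions (or none) are treated as Web pages.
PAGE_EXTENSIONS = {"", ".html", ".htm"}


def count_hits_and_pages(requested_paths):
    """Return (hits, page_requests) for an iterable of requested file paths."""
    hits = 0
    pages = 0
    for path in requested_paths:
        hits += 1  # every file sent to the client counts as a hit
        if PurePosixPath(path).suffix.lower() in PAGE_EXTENSIONS:
            pages += 1  # only some of those files are Web pages
    return hits, pages


# Hypothetical requests generated when a browser opens a single page.
requests = [
    "/index.html",
    "/images/logo.gif",
    "/images/banner.gif",
    "/styles/site.css",
]
print(count_hits_and_pages(requests))  # (4, 1)
```

One page opened by the customer produces four hits here, which is why hit counts overstate the number of pages actually viewed.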

The Web server records problems with the transfer in the error log. It also records the type of client making the request of the server in the agent log. Since so much of our clickstream information originates in the transfer log, we examine this log in more detail.

13.3.1 TRANSFER LOG FILE

The transfer log file records the transactions between the client and the Web server. This is the main source of clickstream data. From the entries in this log, we can link a domain name or IP address with a request for a specific page on our Web site. The NCSA (National Center for Supercomputing Applications) has described a standard transfer log file format. Figure 13.3 provides a sample transfer log file entry.

Figure 13.3. Sample transfer log file entry.

graphics/13fig03.gif

The entry presented in the figure is an example of a common log file entry. In addition, an extended format of the log file record adds referrer and agent fields. The following list describes each of the fields in the log file.

Host The IP address or the domain name of the client making the request of the server. In order to log a domain name, the Web server must translate the IP address into a hostname using the Domain Name System (DNS). Refer to Chapter 6 for more details.
Identification The identification field comes from RFC (Request for Comments) 931. The field's original purpose was to provide identification of the system making the request. This field is rarely used and is most often a simple dash (-).
Authuser This field records the user name supplied when a user authenticates to access a protected area. When this field is not in use, it contains a simple dash.
Time This field contains the server's date and time of the request. The format of the time field is

DD/MON/YYYY HH:MM:SEC +/- XXXX
Where:
DD/MON/YYYY is the two-digit day, three-character month, and four-digit year.
HH:MM:SEC is the Greenwich Mean Time (GMT) of the request.
+/- XXXX is the difference, in hours and minutes, between GMT and the time local to the server. The difference is expressed as either plus or minus GMT.
Request This field contains the request made of the server. The request can be a GET, POST, or HEAD command. The GET command asks the server to return the requested document. The POST command tells the Web server that the client is about to transmit data and identifies the program to receive the input. The HEAD command asks the server to return only the response headers for a document, without the document itself.

Status The status field records the status of the request. There are a variety of status codes (Table 13.1 lists several of the most commonly seen). Note that the codes fall into classes. For example, a status code within the 200 range indicates a successful transfer.

Table 13.1. Web Server Status Codes

Code	Description
200	Successful Transfers
201	Created
202	Accepted
204	No Content
300	Redirected Transfers
301	Permanently Moved
302	Temporarily Moved
304	No Change
400	Failed Transfers
401	Unauthorized
403	Forbidden
404	Not Found
500	Server Errors
501	Not Implemented
502	Bad Gateway
503	Service Unavailable

Bytes The number of bytes that were transferred to the client.
Referrer This field is part of the extended common transfer log format. It is a text string that the client can optionally send to indicate the origin of the request or link.
Agent This field is part of the extended common transfer log format. This field identifies the client program or browser making the request.
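The field list above can be turned into a small parser. The following is a minimal sketch rather than a definitive implementation: the sample entries, host, and URLs are hypothetical, and the pattern accepts either a colon or a space between the date and the time because real server logs commonly use a colon there. The class names returned by status_class follow Table 13.1.

```python
import re

# One named group per field described in the list above.
# The referrer and agent fields are optional (extended format only).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^:\s\]]+)[:\s](?P<time>\d{2}:\d{2}:\d{2}) '
    r'(?P<offset>[+-]\d{4})\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

# Status classes as named in Table 13.1.
STATUS_CLASSES = {
    2: "Successful Transfers",
    3: "Redirected Transfers",
    4: "Failed Transfers",
    5: "Server Errors",
}


def parse_entry(line):
    """Return a dict of fields for one log entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A dash in the bytes field means that nothing was transferred.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields


def status_class(status):
    """Map a numeric status code to its class from Table 13.1."""
    return STATUS_CLASSES.get(status // 100, "Unknown")


# Hypothetical entry in the common format.
common = ('192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] '
          '"GET /catalog/index.html HTTP/1.0" 200 2326')
# The same entry in the extended format, with referrer and agent fields.
extended = common + ' "http://partner.example/links.html" "Mozilla/4.0"'

entry = parse_entry(extended)
print(entry["host"], entry["request"], status_class(entry["status"]))
```

When the common format is parsed, the referrer and agent keys are present but set to None, so later processing can treat both formats uniformly.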

13.3.2 MINING THE TRANSFER LOG FILE

Now that we have the transfer log file, let's consider some of the things that we can do with it. Of course, there isn't a great deal of analysis that we can perform on the raw file as it is, so we have to massage it a bit. We start by importing the file into a database and parsing the individual fields into a workable format. The time is parsed into two fields: one for date and one for time. We might also want to transform the time to the time local to the server. The request is broken into two fields as well: one for the command and one for the requested page. We then sort the data by host, date, and time. This results in a table of data similar to the one shown in Figure 13.4, which presents just a portion of the data that we would expect to see in a transfer log file.
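The parsing and sorting steps just described might look like the following sketch. It assumes the log entries have already been split into host, date, time, and request strings; the tuple layout, the function name to_row, and the sample records are hypothetical. Month names are mapped by hand so that the result does not depend on the system locale.

```python
from datetime import date, time

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def to_row(host, date_text, time_text, request):
    """Turn raw log strings into a sortable row.

    The date and time become separate typed fields, and the request is
    broken into the command and the requested page.
    """
    day, month, year = date_text.split("/")
    hour, minute, second = time_text.split(":")
    parts = request.split()
    command = parts[0] if parts else ""
    page = parts[1] if len(parts) > 1 else ""
    return {
        "host": host,
        "date": date(int(year), MONTHS[month], int(day)),
        "time": time(int(hour), int(minute), int(second)),
        "command": command,
        "page": page,
    }


# Hypothetical records, deliberately out of order.
raw = [
    ("192.0.2.10", "10/Oct/2000", "14:10:02", "POST /cgi-bin/order HTTP/1.0"),
    ("192.0.2.7", "10/Oct/2000", "13:01:00", "GET /index.html HTTP/1.0"),
    ("192.0.2.10", "10/Oct/2000", "13:55:36", "GET /catalog/index.html HTTP/1.0"),
]

# Sort by host, then date, then time, as described above.
rows = sorted((to_row(*record) for record in raw),
              key=lambda row: (row["host"], row["date"], row["time"]))

for row in rows:
    print(row["host"], row["date"], row["time"], row["command"], row["page"])
```

Hosts are compared as plain strings here, which is enough to group each host's requests together; a database import would typically do the same grouping with an ORDER BY clause.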

Figure 13.4. Sorted and parsed transfer log records.

graphics/13fig04.gif

Examining the figure reveals some interesting aspects of trying to interpret customer behavior information from a transfer log. As we follow this stream of data, we see an interesting flow. The same host makes requests of our server. The stream continues somewhat steadily for a period of approximately 15 minutes. There is a gap of roughly an hour before the stream picks up again. When the host does make another request, it is for a completely unrelated page. After the user visits this unrelated page, the POST command is sent, which we conclude is a purchase. Shortly after the POST command is another entry in the transfer log that starts another sequence of pages.
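A gap like the one in this stream can be detected mechanically. The sketch below splits the requests from a single host into separate visits whenever the time between two consecutive requests exceeds a threshold. The 30-minute threshold, the function name, and the sample timestamps are assumptions made for illustration, and grouping by host says nothing about whether the same person made every request.

```python
from datetime import datetime, timedelta


def split_into_visits(timestamps, max_gap=timedelta(minutes=30)):
    """Group one host's request times into visits separated by long gaps."""
    visits = []
    for stamp in sorted(timestamps):
        if visits and stamp - visits[-1][-1] <= max_gap:
            visits[-1].append(stamp)  # close enough: same visit continues
        else:
            visits.append([stamp])    # first request, or gap too long
    return visits


# Hypothetical request times from one host: steady browsing for about
# 15 minutes, then a gap of roughly an hour before activity resumes.
day = datetime(2000, 10, 10)
stamps = [
    day.replace(hour=9, minute=0),
    day.replace(hour=9, minute=5),
    day.replace(hour=9, minute=14),
    day.replace(hour=10, minute=20),
    day.replace(hour=10, minute=22),
]

for number, visit in enumerate(split_into_visits(stamps), start=1):
    print(number, visit[0].time(), visit[-1].time(), len(visit))
```

With these sample times the function reports two visits, one of three requests and one of two. Choosing the threshold is a judgment call: too short a value breaks one visit into several, and too long a value merges unrelated activity.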

From this stream, we can assume that a user did some browsing on our Web site, then for some reason went away for some time. Upon his or her return, the user decided to make a purchase and then do some more browsing. This is one scenario that can be gathered from this clickstream, but not the only one. The trouble with the transfer log is that it only provides us with an IP address, which in the grand scheme of things really doesn't tell us a heck of a lot. An IP address is not a reliable means to identify an individual customer, which means that it is not a reliable mechanism for analyzing customer behavior. Some of the reasons for this are demonstrated in Figure 13.5.

Figure 13.5. Using IP addresses.

graphics/13fig05.gif

Let's begin by understanding that in the best of all circumstances, an IP address identifies only a specific system. It does not identify a person or even a client application. This is demonstrated in Figure 13.5 (a). We see that computer A at 11:00 is in use by user A. The system may have multiple accounts and multiple users. Later that day, user B works on the same system. In this situation, we have the same IP address and two different users. Typically, we would expect to see this in an office or educational environment in which it is common for different users to use the same system. In such environments, users can have drastically different demographics. Attempting to draw customer behavior information based on IP addresses in these situations would most certainly lead to incorrect conclusions.

An IP address can lead us to false conclusions in other situations as well. Figure 13.5 (b) shows a situation in which clients access the Internet via a proxy server. Businesses and ISPs use proxy servers to reduce Internet traffic as well as control access. For example, filtering software can be hosted on the proxy server to prevent employees from accessing objectionable material. The proxy server will also cache frequently accessed Web pages. Requests answered from the cache never reach our Web server, so the clickstream data miner faces endless problems with underreported hits for a particular page. Nor can we be certain whether the IP address reported to the Web server is that of the system actually making the request or that of the proxy server. While it is recommended that the proxy server notify us that a proxy is making the request, there is nothing at present that dictates that this must be the case.

For the data miner, the proxy server means that one IP address will have a mixture of activities. In Figure 13.5 (a), we had just one or two users on the same system. Their behavior was indistinguishable from one another. In the case of the proxy server, we have entire populations of users identified with one IP address. In all likelihood, this population would not have the same demographics. Think of your own work environment. In most cases, there are rather large gender, race, religion, and sexual preference differences as well as differences in income.

Finally, we are assuming that IP addresses are static entities. As we can see in Figure 13.5 (c), this is not the case. Quite often, organizations use Dynamic Host Configuration Protocol (DHCP). In these environments, the client system receives its IP address from a DHCP server. The server assigns IP addresses to client systems from a pool of IP addresses during the client's initialization. As we can see in the figure, when the system is first started in the morning, it receives from the DHCP server one IP address. Later that day, for one reason or another, the system is rebooted and is given a completely new IP address by the DHCP server. Meanwhile, the system's previous IP address may be in use by another system.

Any and all of these situations can apply to the clickstream data presented in Figure 13.4. The first group of requests could quite easily have come from a number of different users through a proxy server. We also noted that there was a rather large time gap between the initial browsing of the Web site and the actual purchase. Was this the result of the customer stopping in the middle of the process to have a conversation with someone? Perhaps a husband was discussing a purchase with his wife. We can't be certain. It could just as easily have been one prospective customer deciding that he or she would rather not purchase anything and another customer using the same client to purchase a product. A third alternative is that the original user had a system failure and, upon rebooting, received a new IP address, so that the purchase is recorded under another IP address.

Each instance is a result of the stateless nature of the Internet. As described in previous chapters, the Web server and the client do not maintain a persistent connection. In a client/server environment, the connection between the client and server has a state. If the client process terminates, the server is aware of that termination. A Web server, since it does not maintain a state, cannot tell one client process from another. If a client process with a specific IP address terminates, the server process is not aware of the termination. If a new client process is initiated with the IP address of the terminated client, the server cannot distinguish between the two. The design of the Internet is such that the server process doesn't care that there is a different client.

