Hack 23. Exclude Robots and Spiders from Your Analysis
One of the major complaints about web server logfiles is that they are often littered with activity from nonhuman user agents ("robots" and "spiders"). While they are not necessarily bad, you need to exclude robots and spiders from your "human" analysis or risk getting dramatically skewed results. Robots and spiders (also known as "crawlers" or "agents") are computer programs that scour the Web to collect information or take measurements. There are thousands of robots and spiders in use on the Web at any time, and their numbers increase every day. Common examples include:
While robots and spiders can confer great benefits when they visit your web site, a fundamental question remains: can you distinguish requests made by humans from those generated by nonhuman robots and spiders?

2.11.1. Strategies for Limiting the Impact of Robots and Spiders

The established practice in web analytics is to exclude such robots from your analysis and reporting. The Interactive Advertising Bureau (IAB) has published the Interactive Audience Measurement and Advertising Campaign Reporting and Audit Guidelines, which include minimum requirements for excluding robots and spiders based on "specific identification of nonhuman suspected activity" (known robots and spiders) and "pattern analysis." For more information on the IAB and its requirements, visit http://www.iab.net/standards/measurement.asp.

To exclude robots from your analysis and reporting, provide lists of known robots to your web analytics software and configure it to filter their activity out of your data before you produce your metrics and reports. Base your robot lists on both IP addresses and user agents, because different user agents may share the same IP address and many robots may report the same user agent name.

2.11.1.1 Identify known robots and spiders.

Start with a list of known robots and spiders; such a list is likely available from your web analytics vendor. The IAB, in conjunction with ABC Interactive (ABCi), maintains a list of robots and spiders that is available to IAB members free of charge. Next, supplement the list with the names of specific user agents identified by your company, such as the testing agents used for site monitoring by you or your vendors. The following is a list of just a few robots that have probably visited your site:
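As a rough sketch of the filtering step described above, the following Python fragment checks parsed log records against both a user agent list and an IP list, as the text recommends. The robot lists, record format, and function names here are illustrative assumptions, not part of any particular analytics product.

```python
# Hypothetical sketch: filter known robots out of parsed logfile records,
# matching on both user agent substrings and IP addresses. The lists and
# the record format are examples only.

KNOWN_ROBOT_AGENTS = ["googlebot", "slurp", "msnbot"]   # lowercase substrings
KNOWN_ROBOT_IPS = {"66.249.66.1"}                       # example addresses

def is_robot(record):
    """Return True if a log record matches either robot list."""
    agent = record.get("user_agent", "").lower()
    if any(name in agent for name in KNOWN_ROBOT_AGENTS):
        return True
    return record.get("ip") in KNOWN_ROBOT_IPS

def filter_human_traffic(records):
    """Keep only records that match neither robot list."""
    return [r for r in records if not is_robot(r)]

hits = [
    {"ip": "10.0.0.5", "user_agent": "Mozilla/5.0 (Windows NT 5.1)"},
    {"ip": "66.249.66.1", "user_agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
]
print(len(filter_human_traffic(hits)))  # 1 -- only the human hit survives
```

Matching on both dimensions matters because, as noted above, a single IP address can serve several user agents and several robots can share one user agent string.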
For more information on the IAB/ABCi list, see http://www.iab.net/standards/spiders.asp.

2.11.1.2 Be on the lookout for new robots and spiders.

Next, establish a regular process and procedure for detecting robots that are new to the Internet or specific to your site. When you find a new robot, add it to your robot lists and filter its activity from your web analytics data. Save all of your old robot lists along with the time range over which each was used: you'll need these versions to reproduce the numbers in your old reports if the need ever arises.

Regularly review your web server's access logs, starting with requests for the file robots.txt, which tells a robot which content on your web site may be indexed. Requests for this file almost always come from a spider or robot. Record the user agents and IP addresses from these requests and add them to your robot lists so they are filtered from your web data. Your web measurement application should also allow you to search your data for patterns that are common to robots. Such patterns include:
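The robots.txt review described above can be partly automated. This hedged sketch scans access log lines in the common "combined" format for requests to /robots.txt and collects the requesting IPs and user agents; the sample log lines and the regular expression are illustrative assumptions about your log layout.

```python
# Hypothetical sketch: find robots.txt requests in combined-format access
# logs and report the IP and user agent of each, so they can be added to
# your robot lists. The sample lines below are fabricated.
import re

LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d+ \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def suspected_robots(log_lines):
    """Yield (ip, user_agent) pairs for every robots.txt request."""
    for line in log_lines:
        m = LOG_PATTERN.match(line)
        if m and m.group("path") == "/robots.txt":
            yield m.group("ip"), m.group("agent")

sample = [
    '66.249.66.1 - - [01/Jan/2005:10:00:00 -0500] "GET /robots.txt HTTP/1.0" 200 68 "-" "Googlebot/2.1"',
    '10.0.0.5 - - [01/Jan/2005:10:00:01 -0500] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(list(suspected_robots(sample)))  # [('66.249.66.1', 'Googlebot/2.1')]
```

Running a script like this over each day's logs gives you a steady feed of candidate entries for your robot lists.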
You should perform this type of pattern analysis at least once per quarter.

2.11.1.3 Build and deploy a robots.txt file.

The robots.txt file, placed in the root directory of the web site, tells spiders which files they may download and index. Most search engines honor the robots.txt file, but nothing specifically requires them to do so. The robots.txt file contains two primary elements:
The User-agent line specifies which robots the rules that follow apply to. A wildcard indicates all robots, as in the following syntax:

User-agent: *

The Disallow lines specify particular files and/or directories that the identified user agents are not allowed to download. An exclusion statement is formatted as follows:

Disallow: /homepage.asp

This example instructs the specified user agents not to spider the /homepage.asp file. To allow the specified user agents to spider the entire web site, leave the Disallow value empty:

Disallow:

To prevent the specified user agents from spidering any file on the web site, form the Disallow statement as follows:

Disallow: /

The most common format for the robots.txt file combines these two rules:

User-agent: *
Disallow: /

Modifications are required if you want search engines to index only parts of your web site, or if other system visitors, such as account aggregators, need access to particular files or pages served from the web server. In that case, construct your robots.txt file to disallow only those parts of your site that you do not want indexed. The following is an example from a site:

# robots.txt for http://www.site.com
User-agent: *
Disallow: /feedback
Disallow: /images
Disallow: /cgi-bin
Disallow: /system
Disallow: /inetart
Disallow: /maps

You can view any web site's robots.txt file, if it has one, by requesting http://www.domain.com/robots.txt, the standard naming convention for this file. (Replace the domain variable with the name of the site you want to check.)

2.11.2. Remember That Some Spiders Are Good!

What if you are interested in analyzing robot and spider activity rather than filtering it out? For instance, you may want to track visits from Google's robot, Googlebot.
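Before deploying a robots.txt file like the examples shown earlier, you can sanity-check its rules. Python's standard urllib.robotparser module can parse a robots.txt and report what a crawler may fetch; this is an illustrative aside, not part of the original hack, and the rules below are examples only.

```python
# Hypothetical sketch: verify robots.txt rules with Python's standard
# urllib.robotparser. Here the rules are parsed from a string rather
# than fetched from a live site.
import urllib.robotparser

rules = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /images
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "/index.html"))       # True  -- not disallowed
print(rp.can_fetch("*", "/cgi-bin/form.pl"))  # False -- under /cgi-bin
```

A check like this helps confirm that the directories you intend to protect are actually excluded, and that the rest of the site remains open to well-behaved spiders.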
Many web measurement vendors offer solutions that can isolate robot activity within your web measurement data, letting you analyze robot traffic for purposes such as optimizing your pages for search engine indexing. The specifics vary by application, but most mature products allow you to analyze robots separately from human traffic, essentially doing the opposite of what this hack suggests above. Keep in mind that a purely client-side data collection model (page tags) may not capture all robot and spider traffic, because some robots and spiders do not execute JavaScript and generally do not accept cookies.

Jim MacIntyre and Eric T. Peterson