Hack 23. Exclude Robots and Spiders from Your Analysis
One of the major complaints about web server logfiles is that they are often littered with activity from nonhuman user agents ("robots" and "spiders"). While they are not necessarily bad, you need to exclude robots and spiders from your "human" analysis or risk getting dramatically skewed results. Robots and spiders (also known as "crawlers" or "agents") are computer programs that scour the Web to collect information or take measurements. There are thousands of robots and spiders in use on the Web at any time, and their numbers increase every day. Common examples include:
While robots and spiders can confer great benefits when they visit your web site, a fundamental question remains: can you distinguish requests made by humans from those generated by nonhuman robots and spiders?

2.11.1. Strategies for Limiting the Impact of Robots and Spiders

The established practice in web analytics is to exclude such robots from your analysis and reporting. The Interactive Advertising Bureau (IAB) has published the Interactive Audience Measurement and Advertising Campaign Reporting and Audit Guidelines, which include minimum requirements for excluding robots and spiders based on "specific identification of nonhuman suspected activity" (known robots and spiders) and "pattern analysis." For more information on the IAB and its requirements, visit http://www.iab.net/standards/measurement.asp.

To exclude robots from your analysis and reporting, provide lists of known robots to your web analytics software and configure it to filter their activity out of your data before you produce your metrics and reports. Base your robot lists on both IP addresses and user agents, because different user agents may share the same IP address and many robots may report the same user agent name.

2.11.1.1 Identify known robots and spiders.

Start with a list of known robots and spiders; such a list is likely available from your web analytics vendor. The IAB, in conjunction with ABC Interactive (ABCi), maintains a list of robots and spiders that is available to IAB members free of charge. Next, supplement the list with the names of specific user agents identified by your company, such as the testing agents used for site monitoring by you or your vendors. The following is a list of just a few robots that have probably visited your site:
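As a rough sketch of the filtering step described above, the following Python fragment checks parsed log records against both a user agent list and an IP list, as the text recommends. The robot lists, record format, and function names here are illustrative assumptions, not part of any particular analytics product.

```python
# Hypothetical sketch: filter known robots out of parsed logfile records,
# matching on both user agent substrings and IP addresses. The lists and
# the record format are examples only.

KNOWN_ROBOT_AGENTS = ["googlebot", "slurp", "msnbot"]   # lowercase substrings
KNOWN_ROBOT_IPS = {"66.249.66.1"}                       # example addresses

def is_robot(record):
    """Return True if a log record matches either robot list."""
    agent = record.get("user_agent", "").lower()
    if any(name in agent for name in KNOWN_ROBOT_AGENTS):
        return True
    return record.get("ip") in KNOWN_ROBOT_IPS

def filter_human_traffic(records):
    """Keep only records that match neither robot list."""
    return [r for r in records if not is_robot(r)]

hits = [
    {"ip": "10.0.0.5", "user_agent": "Mozilla/5.0 (Windows NT 5.1)"},
    {"ip": "66.249.66.1", "user_agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
]
print(len(filter_human_traffic(hits)))  # 1 -- only the human hit survives
```

Matching on both dimensions matters because, as noted above, a single IP address can serve several user agents and several robots can share one user agent string.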
For more information on the IAB/ABCi list, see http://www.iab.net/standards/spiders.asp.

2.11.1.2 Be on the lookout for new robots and spiders.

Next, establish a regular process and procedure for detecting robots that are new to the Internet or specific to your site. When you find a new robot, add it to your robot lists and filter its activity from your web analytics data. Save all of your old robot lists along with the time range over which each was used: you'll need these versions to reproduce the numbers in your old reports if the need ever arises.

Regularly review your web server's access logs, starting with requests for the file robots.txt, which tells a robot which content on your web site may be indexed. Requests for this file almost always come from a spider or robot. Record the user agents and IP addresses from these requests and add them to your robot lists so they are filtered from your web data. Your web measurement application should also allow you to search your data for patterns that are common to robots. Such patterns include:
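The robots.txt review described above can be partly automated. This hedged sketch scans access log lines in the common "combined" format for requests to /robots.txt and collects the requesting IPs and user agents; the sample log lines and the regular expression are illustrative assumptions about your log layout.

```python
# Hypothetical sketch: find robots.txt requests in combined-format access
# logs and report the IP and user agent of each, so they can be added to
# your robot lists. The sample lines below are fabricated.
import re

LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d+ \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def suspected_robots(log_lines):
    """Yield (ip, user_agent) pairs for every robots.txt request."""
    for line in log_lines:
        m = LOG_PATTERN.match(line)
        if m and m.group("path") == "/robots.txt":
            yield m.group("ip"), m.group("agent")

sample = [
    '66.249.66.1 - - [01/Jan/2005:10:00:00 -0500] "GET /robots.txt HTTP/1.0" 200 68 "-" "Googlebot/2.1"',
    '10.0.0.5 - - [01/Jan/2005:10:00:01 -0500] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(list(suspected_robots(sample)))  # [('66.249.66.1', 'Googlebot/2.1')]
```

Running a script like this over each day's logs gives you a steady feed of candidate entries for your robot lists.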
You should perform this type of pattern analysis at least once per quarter.

2.11.1.3 Build and deploy a robots.txt file.

The robots.txt file, placed in the root directory of the web site, tells spiders which files they may download and index. Most search engines honor the robots.txt file, but nothing specifically requires them to do so. The robots.txt file contains two primary elements:
The User-agent line specifies which robots the rules that follow apply to. A wildcard indicates all robots, as in the following syntax:

User-agent: *

The Disallow lines specify particular files and/or directories that the identified user agents are not allowed to download. An exclusion statement is formatted as follows:

Disallow: /homepage.asp

This example instructs the specified user agents not to spider the /homepage.asp file. To allow the specified user agents to spider the entire web site, leave the Disallow value empty:

Disallow:

To prevent the specified user agents from spidering any file on the web site, form the Disallow statement as follows:

Disallow: /

The most common format for the robots.txt file combines these two rules:

User-agent: *
Disallow: /

Modifications are required if you want search engines to index only parts of your web site, or if other system visitors, such as account aggregators, need access to particular files or pages served from the web server. In that case, construct your robots.txt file to disallow only those parts of your site that you do not want indexed. The following is an example from a site:

# robots.txt for http://www.site.com
User-agent: *
Disallow: /feedback
Disallow: /images
Disallow: /cgi-bin
Disallow: /system
Disallow: /inetart
Disallow: /maps

You can view any web site's robots.txt file, if it has one, by requesting http://www.domain.com/robots.txt, the standard naming convention for this file. (Replace the domain variable with the name of the site you want to check.)

2.11.2. Remember That Some Spiders Are Good!

What if you are interested in analyzing robot and spider activity rather than filtering it out? For instance, you may want to track visits from Google's robot, Googlebot.
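Before deploying a robots.txt file like the examples shown earlier, you can sanity-check its rules. Python's standard urllib.robotparser module can parse a robots.txt and report what a crawler may fetch; this is an illustrative aside, not part of the original hack, and the rules below are examples only.

```python
# Hypothetical sketch: verify robots.txt rules with Python's standard
# urllib.robotparser. Here the rules are parsed from a string rather
# than fetched from a live site.
import urllib.robotparser

rules = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /images
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "/index.html"))       # True  -- not disallowed
print(rp.can_fetch("*", "/cgi-bin/form.pl"))  # False -- under /cgi-bin
```

A check like this helps confirm that the directories you intend to protect are actually excluded, and that the rest of the site remains open to well-behaved spiders.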
Many web measurement vendors offer solutions that can isolate robot activity within your web measurement data, letting you analyze robot traffic for purposes such as optimizing your pages for search engine indexing. The specifics vary by application, but most mature products allow you to analyze robots separately from human traffic, essentially doing the opposite of what this hack suggests above. Keep in mind that a purely client-side data collection model (page tags) may not capture all robot and spider traffic, because some robots and spiders do not execute JavaScript and generally do not accept cookies.

Jim MacIntyre and Eric T. Peterson