3.2. Excluding the Bot

There are several reasons you might want to block robots, or bots, from all or part of your site. For example, if your site is incomplete, has broken links, or has not yet been prepared for a search engine visit, you probably don't want it indexed yet. You may also want to keep parts of your site out of search indexes if those parts contain sensitive information or pages that you know cannot be accurately traversed or parsed.

Figure 3-3. Compared with the identical page in a text-only view (Figure 3-2), it's hard to focus on just the text and links

If you need to, you can keep part of your site from being indexed by any search engine that honors exclusion requests.

Following the no-robots protocol is voluntary and based on the honor system. So all you can really be sure of is that a legitimate search engine that follows the protocol will not index the prohibited parts of your site.

3.2.1. The robots.txt File

To block bots from traversing your site, place a text file named robots.txt in your site's web root directory (where the HTML files for your site are placed). The following syntax in the robots.txt file blocks all compliant bots from traversing your entire site:

     User-agent: *
     Disallow: /

You can exercise more granular control over both which bots you ban and which parts of your site are off-limits as follows:

The User-agent line specifies the bot that is to be banished.
The Disallow line specifies a path relative to your root directory that is banned territory.

A single robots.txt file can contain multiple User-agent entries, each banning a different bot from a different set of paths.
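As a rough illustration of a file with several User-agent entries, the following Python sketch renders one from a mapping of bot identifiers to banned paths. The function name render_robots_txt, the bot name examplebot, and the /drafts and /private directories are hypothetical and chosen only for this example.

```python
def render_robots_txt(rules):
    """Render a {user_agent: [paths]} mapping as robots.txt text."""
    groups = []
    for agent, paths in rules.items():
        lines = [f"User-agent: {agent}"]
        # One Disallow line per banned path for this bot.
        lines.extend(f"Disallow: {path}" for path in paths)
        groups.append("\n".join(lines))
    # Entries for different bots are separated by a blank line.
    return "\n\n".join(groups) + "\n"


# Hypothetical bots and directories, for illustration only.
example = render_robots_txt({
    "googlebot": ["/images"],
    "examplebot": ["/drafts", "/private"],
})
print(example)
```

The printed text could be saved as robots.txt in the web root directory; each User-agent line starts a new entry, and the Disallow lines beneath it apply only to that bot.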

For example, you would tell the Google search bot not to look in your images directory (assuming the images directory is right beneath your web root directory) by placing the following two lines in your robots.txt file:

     User-agent: googlebot
     Disallow: /images
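To see how a compliant bot might interpret these rules, here is a minimal Python sketch that uses the standard library's urllib.robotparser module. The helper is_allowed and the bot name examplebot are assumptions made for this example, and real crawlers may apply their own matching logic.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, path):
    """Return True if user_agent may fetch path under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# The two rule sets shown in this section.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_GOOGLE_IMAGES = "User-agent: googlebot\nDisallow: /images\n"

# Every bot is banned from the whole site.
print(is_allowed(BLOCK_ALL, "examplebot", "/index.html"))                # False

# Only googlebot is banned, and only from paths starting with /images.
print(is_allowed(BLOCK_GOOGLE_IMAGES, "googlebot", "/images/logo.png"))  # False
print(is_allowed(BLOCK_GOOGLE_IMAGES, "googlebot", "/index.html"))       # True
print(is_allowed(BLOCK_GOOGLE_IMAGES, "examplebot", "/images/logo.png")) # True
```

This parser treats each Disallow value as a path prefix, so /images also covers a path such as /images-old/photo.png. Writing the rule as /images/ would restrict it to the directory itself.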

The robots.txt mechanism relies on the honor system. By definition, it is a text file that can be read by anyone with a browser. So don't absolutely rely on every bot honoring the request within a robots.txt file, and don't use robots.txt in an attempt to protect sensitive information from being uncovered on your site by humans (this is a different issue from using it to avoid publishing sensitive information in search engine indexes).

For more information about working with the robots.txt file, see the Web Robots FAQ, http://www.robotstxt.org/wc/faq.html. You can also find tools for generating custom robots.txt files and robot meta tags (explained below) at http://www.rietta.com/robogen/.

3.2.2. Meta Robot Tags

The Googlebot, and many other web robots, can be instructed not to index a specific page (rather than an entire directory), not to follow the links on a specific page, or to index a specific page without caching it. All of these instructions are given with an HTML meta tag placed inside the page's head tag.

Google maintains a cache of documents it has indexed. The Google search results provide a link to the cached version in addition to the version on the Web. The cached version can be useful when the Web version has changed and also because the cached version highlights the search terms (so you can easily find them).

The meta tag used to block a robot has two attributes: name and content. The name attribute is the name of the bot you are excluding. To exclude all robots, you'd include the attribute name="robots" in the meta tag.

To exclude a specific robot, the robot's identifier is used. The Googlebot's identifier is googlebot, and it is excluded by using the attribute name="googlebot". You can find the entire database of excludable robots and their identifiers (currently 298 with more swinging into action all the time) at http://www.robotstxt.org/wc/active/html/index.html.

The 298 robots in the official database are the tip of the iceberg. There are many more unidentified bots out there searching the Web.

The possible values of the content attribute are shown in Table 3-1. You can use multiple attribute values, separated by commas, but you should not use contradictory attribute values together (such as content="follow, nofollow").

Table 3-1. Content attribute values and their meanings
Attribute value	Meaning
follow	Bot can follow links on the page
index	Bot can index the page
noarchive	Only works with the Googlebot; tells the Googlebot not to cache the page
nofollow	Bot should not follow links on the page
noindex	Bot should not index the page
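The rules for combining content values can be sketched in Python as follows. This is only an illustration: the function robots_meta_tag and its validation behavior are assumptions made for this example, and the set of accepted values is limited to those listed in Table 3-1.

```python
# Values from Table 3-1, plus the pairs that contradict each other.
KNOWN_VALUES = {"follow", "index", "noarchive", "nofollow", "noindex"}
CONTRADICTORY_PAIRS = [("index", "noindex"), ("follow", "nofollow")]


def robots_meta_tag(values, bot="robots"):
    """Build a robots meta tag, rejecting unknown or contradictory values."""
    cleaned = [value.strip().lower() for value in values]
    unknown = [value for value in cleaned if value not in KNOWN_VALUES]
    if unknown:
        raise ValueError(f"unknown content values: {unknown}")
    for first, second in CONTRADICTORY_PAIRS:
        if first in cleaned and second in cleaned:
            raise ValueError(f"contradictory content values: {first}, {second}")
    content = ", ".join(cleaned)
    return f'<meta name="{bot}" content="{content}">'


print(robots_meta_tag(["noindex", "nofollow"]))
# <meta name="robots" content="noindex, nofollow">
print(robots_meta_tag(["noindex", "noarchive"], bot="googlebot"))
# <meta name="googlebot" content="noindex, noarchive">
```

Calling robots_meta_tag(["follow", "nofollow"]) raises a ValueError, which mirrors the advice not to combine contradictory values.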

For example, you can block Google from indexing a page, following the links on it, and caching it, all with this one meta tag:

     <meta name="googlebot" content="noindex, nofollow, noarchive">

More generally, the following tag tells legitimate bots (including the Googlebot) not to index a page or follow any of the links on the page:

     <meta name="robots" content="noindex, nofollow">

There's no general syntax for stopping every search engine from caching a page, because the noarchive value only works with the Googlebot.
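From the bot's side, honoring these tags means reading them out of the page before deciding what to do. The following Python sketch uses the standard library's html.parser module to show one way this might work. The class RobotsMetaParser, the helper may_index, and the bot name examplebot are hypothetical, and combining the generic robots values with the bot-specific ones is an assumption of this sketch rather than documented behavior of any particular search engine.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the content values of every named meta tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # meta name -> set of content values

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content") or ""
        if name:
            values = {v.strip().lower() for v in content.split(",") if v.strip()}
            self.directives.setdefault(name, set()).update(values)


def may_index(html, bot):
    """Return False if the page asks this bot (or all robots) not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # Assumption: values addressed to all robots and to this bot both apply.
    values = parser.directives.get("robots", set()) | parser.directives.get(bot, set())
    return "noindex" not in values


page = ('<html><head><meta name="googlebot" '
        'content="noindex, nofollow, noarchive"></head><body></body></html>')
print(may_index(page, "googlebot"))   # False
print(may_index(page, "examplebot"))  # True
```

A bot that ignores the tag is not stopped by it, so this check only describes what a cooperative bot would do.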

For more information about Google's page-specific tags that exclude bots, and about the Googlebot in general, see http://www.google.com/bot.html.
