3.2. Excluding the Bot
There are a number of reasons you might want to block robots, or bots, from all or part of your site. For example, if your site is not complete, if you have broken links, or if you haven't prepared your site for a search engine visit, you probably don't want to be indexed yet. You may also want to protect parts of your site from being indexed if those parts contain sensitive information or pages that you know cannot be accurately traversed or parsed.
Figure 3-3. Compared with the identical page in a text-only view (Figure 3-2), it's hard to focus on just the text and links
If you need to, you can make sure that part of your site does not get indexed by any search engine.
3.2.1. The robots.txt File
To block bots from traversing your site, place a text file named robots.txt in your site's web root directory (where the HTML files for your site are placed). The following syntax in the robots.txt file blocks all compliant bots from traversing your entire site:
User-agent: *
Disallow: /
You can exercise more granular control over both which bots you ban and which parts of your site are off-limits by naming a specific bot in the User-agent line and listing the directories it may not visit in Disallow lines.
For example, you would tell the Google search bot not to look in your images directory (assuming the images directory is right beneath your web root directory) by placing the following two lines in your robots.txt file:
User-agent: googlebot
Disallow: /images
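A single robots.txt file can combine several such records, separated by blank lines; a compliant bot obeys the record whose User-agent line matches it, falling back to the * record otherwise. The directory names below are hypothetical:

```
User-agent: googlebot
Disallow: /images

User-agent: *
Disallow: /drafts
Disallow: /private
```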
For more information about working with the robots.txt file, see the Web Robots FAQ, http://www.robotstxt.org/wc/faq.html. You can also find tools for generating custom robots.txt files and robot meta tags (explained below) at http://www.rietta.com/robogen/.
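If you want to verify how your rules will be interpreted before a bot visits, Python's standard library ships a robots.txt parser. This is a minimal sketch, assuming a hypothetical example.com site and the googlebot rule shown above:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, mirroring the example above
rules = """\
User-agent: googlebot
Disallow: /images
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The images directory is off-limits to googlebot...
print(rp.can_fetch("googlebot", "http://example.com/images/logo.gif"))  # False
# ...but the rest of the site is still fair game
print(rp.can_fetch("googlebot", "http://example.com/index.html"))       # True
```

In a live setting you would call set_url() and read() to fetch the file from your server instead of parsing a string.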
3.2.2. Meta Robot Tags
The Google bot, like many other web robots, can be instructed not to index specific pages (rather than entire directories), not to follow links on a specific page, or to index but not cache a specific page. All of these instructions are given via the HTML meta tag, placed inside the head tag.
The meta tag used to block a robot has two attributes: name and content. The name attribute is the name of the bot you are excluding. To exclude all robots, you'd include the attribute name="robots" in the meta tag.
To exclude a specific robot, use that robot's identifier. The Googlebot's identifier is googlebot, so it is excluded with the attribute name="googlebot". You can find the entire database of excludable robots and their identifiers (currently 298, with more swinging into action all the time) at http://www.robotstxt.org/wc/active/html/index.html.
The possible values of the content attribute are shown in Table 3-1. You can use multiple attribute values, separated by commas, but you should not use contradictory attribute values together (such as content="follow, nofollow").
For example, you can block Google from indexing a page, following its links, and caching it with this meta tag:
<meta name="googlebot" content="noindex, nofollow, noarchive">
More generally, the following tag tells legitimate bots (including the Googlebot) not to index a page or follow any of the links on the page:
<meta name="robots" content="noindex, nofollow">
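To illustrate how a compliant bot might read these directives, here is a minimal sketch using Python's standard html.parser module; the parser class and the one-line page are hypothetical:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            # content holds a comma-separated list of directives
            for value in attrs.get("content", "").split(","):
                self.directives.add(value.strip().lower())

page = ('<html><head>'
        '<meta name="robots" content="noindex, nofollow">'
        '</head><body></body></html>')
parser = RobotsMetaParser()
parser.feed(page)
print(sorted(parser.directives))  # ['nofollow', 'noindex']
```

A real bot would also check for its own identifier (such as name="googlebot") before falling back to the generic robots tag.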
For more information about Google's page-specific tags that exclude bots, and about the Googlebot in general, see http://www.google.com/bot.html.