Chapter 82. Blocking Parts of Your Site from Search Engines


It may seem counterintuitive to want to restrict search engines from any part of your site. You probably take great pains to get your site listed on as many search engines as possible. But follow the logic here: Maybe you want to control exactly how a visitor finds your site. You would rather new visitors come in by the front page, say, instead of some page three levels deep. Or maybe you don't want visitors to come in on a page that is supposed to be a popup window, where you may not offer the full range of navigation choices. The more you think about it, the more restricting certain areas from search engines makes good sense.

TIP

Some Web designers who prefer cheap solutions instead of good ones think that the methods in this topic give them a sure-fire way to secure sensitive information on their sites. These Web designers would do well to skip this topic entirely. The best solution for security is not to keep sensitive information on your Web server, period. If you can't avoid it, you need to research and implement actual security and authorization protocols, such as password-protected directories.


There's a relatively easy and reliable way to communicate your indexing preferences to robots, the software programs that search engines send out to catalog your site. You add a special text file called robots.txt to the top level of your remote site, right inside the remote root folder. The robots.txt file tells visiting search engines to ignore the specific directories or files that you list.

Here's the catch: For this to work, the robots have to follow the Robots Exclusion Standard, which is a little-known corollary to Asimov's Three Laws of Robotics. The Robots Exclusion Standard simply states that a robot must obey the instructions in robots.txt. But this standard isn't a law. It's more like good manners. The people who design robots for search engines don't have to program their creations to abide by the standard, and there are indeed renegade robots running amok on the Web, just as in I, Robot. Nevertheless, the robots for all the major search engines operate according to the guidelines.

GEEKSPEAK

A robot is a special piece of software that crawls (or spiders) your site, cataloging its pages for a search engine.


A simple robots.txt file looks something like this:

User-agent: *
Disallow: /popups/
Disallow: /popup.htm
Disallow: /images/
Disallow: /js/
Disallow: /css/

TIP

Make sure you use a text editor to create your site's robots.txt file, and save the result with the extension .txt. Don't create an HTML file and then change the extension to .txt.


The Disallow lines tell the robot which directories or files are off limits. In the preceding example, the popups, images, js, and css directories are blocked, as is the file called popup.htm.

The User-agent line indicates to which robot the Disallow lines apply. Giving the asterisk (*) as the value of User-agent means that the Disallow instructions apply to all robots. You may specify individual robots, too, and give different levels of access to each:

User-agent: googlebot
Disallow: /popups/
Disallow: /popup.htm

User-agent: Roverdog
Disallow: /popups/
Disallow: /popup.htm
Disallow: /images/
Disallow: /js/
Disallow: /css/

In this scenario, Google's Googlebot can't look at the popups directory or the popup.htm file, while Roverdog can't get at the images, js, or css directories in addition to the popups folder and the popup.htm file.
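If you want to sanity-check a rule set like this before uploading it, Python's standard-library urllib.robotparser applies the same matching rules that well-behaved robots do. The following sketch feeds it the two-robot example above; the example.com URLs are placeholders:

```python
from urllib.robotparser import RobotFileParser

# The two-robot rule set from the example above.
rules = """\
User-agent: googlebot
Disallow: /popups/
Disallow: /popup.htm

User-agent: Roverdog
Disallow: /popups/
Disallow: /popup.htm
Disallow: /images/
Disallow: /js/
Disallow: /css/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Googlebot is barred from the popups directory but may crawl images...
print(rp.can_fetch("googlebot", "http://example.com/popups/ad.htm"))    # False
print(rp.can_fetch("googlebot", "http://example.com/images/logo.gif"))  # True

# ...while Roverdog is barred from both.
print(rp.can_fetch("Roverdog", "http://example.com/images/logo.gif"))   # False
```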

The values in the Disallow lines are root-relative paths, by the way. So, if you want to hide a subfolder but not the top-level folder, make sure you give the entire path to the subfolder:

User-agent: Roverdog
Disallow: /swf/sourcefiles/

If you want to hide absolutely everything (in this case, from all robots), use:

User-agent: *
Disallow: /

TIP

The asterisk character in robots.txt is not a wildcard. For example, you can't disallow *.gif to bar search engines from all GIF image files; for that, you have to put all your GIFs in a folder and then disallow that folder. The asterisk only works in the User-agent line, and only then as shorthand for all robots.


The following example keeps Google out but permits all other robots:

User-agent: googlebot
Disallow: /

If you want to make everything on your site available to all robots, use:

User-agent: *
Disallow:

TIP

For more information about robots.txt and to look up the names of the various robots out there, see www.robotstxt.org/.


And if you want to permit only one robot (in this case, Google's), use:

User-agent: googlebot
Disallow:

User-agent: *
Disallow: /
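Python's urllib.robotparser confirms how this precedence works: a robot obeys the record that names it, and falls back to the * record only when no record matches. A quick sketch, with example.com standing in for your domain:

```python
from urllib.robotparser import RobotFileParser

# The "permit only one robot" rule set from the example above.
rules = """\
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Googlebot matches its own record, which disallows nothing...
print(rp.can_fetch("googlebot", "http://example.com/services/"))  # True

# ...while every other robot falls through to the catch-all record.
print(rp.can_fetch("Roverdog", "http://example.com/services/"))   # False
```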

Now go back to the example at the beginning of this topic, where you want to try to force new visitors to come in through the front page. Say your site has five top-level directories: products, services, aboutus, images, and apps, along with an HTML file called contact.htm. Your robots.txt file looks like this:

User-agent: *
Disallow: /products/
Disallow: /services/
Disallow: /aboutus/
Disallow: /images/
Disallow: /apps/
Disallow: /contact.htm

Put this file in the top-level directory of your remote site, and search engines will only index your home page (index.htm).
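As a final check, here's the same rule set replayed through Python's urllib.robotparser (example.com stands in for your domain; note the leading slash on /contact.htm, since Disallow values are matched against paths from the site root):

```python
from urllib.robotparser import RobotFileParser

# The front-page-only rule set, as uploaded to the remote root folder.
rules = """\
User-agent: *
Disallow: /products/
Disallow: /services/
Disallow: /aboutus/
Disallow: /images/
Disallow: /apps/
Disallow: /contact.htm
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Only the home page survives the Disallow rules.
print(rp.can_fetch("googlebot", "http://example.com/index.htm"))          # True
print(rp.can_fetch("googlebot", "http://example.com/products/list.htm"))  # False
print(rp.can_fetch("googlebot", "http://example.com/contact.htm"))        # False
```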



Web Design Garage
ISBN: 0131481991
Year: 2006
Pages: 202
Authors: Marc Campbell