It may seem counterintuitive to want to restrict search engines from any part of your site. You probably take great pains to get your site listed on as many search engines as possible. But follow the logic here: Maybe you want to control exactly how a visitor finds your site. You would rather new visitors come in by the front page, say, instead of some page three levels deep. Or maybe you don't want visitors to come in on a page that is supposed to be a popup window, where you may not offer the full range of navigation choices. The more you think about it, the more restricting certain areas from search engines makes good sense.
There's a relatively easy and reliable way to communicate your indexing preferences to robots, the software programs that search engines send out to catalog your site. You add a special text file called robots.txt to the top level of your remote site, right inside the remote root folder. The robots.txt file tells visiting search engines to ignore the specific directories or files that you list. Here's the catch: For this to work, the robots have to follow the Robots Exclusion Standard, which is a little-known corollary to Asimov's Three Laws of Robotics. The Robots Exclusion Standard simply states that a robot must obey the instructions in robots.txt. But this standard isn't a law. It's more like good manners. The people who design robots for search engines don't have to program their creations to abide by the standard, and there are indeed renegade robots running amok on the Web, just as in I, Robot. Nevertheless, the robots for all the major search engines operate according to the guidelines.
A simple robots.txt file looks something like this:

User-agent: *
Disallow: /popups/
Disallow: /popup.htm
Disallow: /images/
Disallow: /js/
Disallow: /css/
The Disallow lines tell the robot which directories or files are off limits. In the preceding example, the popups, images, js, and css directories are blocked, as is the file called popup.htm. The User-agent line indicates which robot the Disallow lines apply to. Giving the asterisk (*) as the value of User-agent means that the Disallow instructions apply to all robots. You may specify individual robots, too, and give different levels of access to each:

User-agent: googlebot
Disallow: /popups/
Disallow: /popup.htm

User-agent: Roverdog
Disallow: /popups/
Disallow: /popup.htm
Disallow: /images/
Disallow: /js/
Disallow: /css/

In this scenario, Google's Googlebot can't look at the popups directory or the popup.htm file, while Roverdog can't get at the images, js, or css directories in addition to the popups folder and the popup.htm file. The values in the Disallow lines are root-relative paths, by the way, so each should begin with a forward slash. If you want to hide a subfolder but not the top-level folder, make sure you give the entire path to the subfolder:

User-agent: Roverdog
Disallow: /swf/sourcefiles/

If you want to hide absolutely everything (in this case, from all robots), use:

User-agent: *
Disallow: /
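If you'd like to check how a robots.txt file will be interpreted before you upload it, Python's standard urllib.robotparser module implements the Robots Exclusion Standard. Here's a sketch that feeds it the two-robot example above; the logo.gif and ad.htm paths are just made-up URLs for testing:

```python
from urllib import robotparser

# The two-robot example, supplied as a string instead of a live file.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /popups/
Disallow: /popup.htm

User-agent: Roverdog
Disallow: /popups/
Disallow: /popup.htm
Disallow: /images/
Disallow: /js/
Disallow: /css/
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot is barred only from the popups material...
print(parser.can_fetch("googlebot", "/images/logo.gif"))  # True
# ...while Roverdog is also barred from images, js, and css.
print(parser.can_fetch("Roverdog", "/images/logo.gif"))   # False
print(parser.can_fetch("Roverdog", "/popups/ad.htm"))     # False
```

Remember that this only tells you what a well-mannered robot will do; a renegade robot ignores the file entirely.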
The following example keeps Google out but permits all other robots:

User-agent: googlebot
Disallow: /

If you want to make everything on your site available to all robots, use:

User-agent: *
Disallow:
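One subtlety worth verifying: a robot with no matching User-agent record (and no catch-all * record to fall back on) is allowed everywhere by default. The sketch below checks the keep-Google-out example with urllib.robotparser; SomeOtherBot is just an illustrative name:

```python
from urllib import robotparser

# The keep-Google-out example: only googlebot is addressed.
RULES = """\
User-agent: googlebot
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(RULES.splitlines())

print(parser.can_fetch("googlebot", "/index.htm"))     # False
# No record matches this robot and there is no * record,
# so it falls through to "allowed".
print(parser.can_fetch("SomeOtherBot", "/index.htm"))  # True
```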
And if you want to permit only one robot (in this case, Google's), use:

User-agent: googlebot
Disallow:

User-agent: *
Disallow: /

Now go back to the example at the beginning of this topic, where you want to force new visitors to come in through the front page. Say your site has five top-level directories: products, services, aboutus, images, and apps, along with an HTML file called contact.htm. Your robots.txt file looks like this:

User-agent: *
Disallow: /products/
Disallow: /services/
Disallow: /aboutus/
Disallow: /images/
Disallow: /apps/
Disallow: /contact.htm

Put this file in the top-level directory of your remote site, and search engines will index only your home page (index.htm).
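If your site has many top-level directories, typing the Disallow lines by hand gets tedious. As a minimal sketch, the front-page-only file above could be generated with a short Python script; the robots_txt helper and the blocked list are hypothetical names for this illustration, not part of any standard tool:

```python
# Hypothetical helper: build a robots.txt that blocks the listed
# root-relative paths for the given robot ("*" means all robots).
def robots_txt(disallowed, user_agent="*"):
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallowed]
    return "\n".join(lines) + "\n"

# The five directories and one file from the front-page-only example.
blocked = ["/products/", "/services/", "/aboutus/",
           "/images/", "/apps/", "/contact.htm"]
text = robots_txt(blocked)
print(text)

# Save it as robots.txt, ready to upload to the remote root folder.
with open("robots.txt", "w") as f:
    f.write(text)
```

Each path keeps its leading slash, since the Disallow values are root-relative.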