Keeping Google Out




Your priority might run contrary to this chapter, in that you want to prevent Google from crawling your site and putting it in the Web search index. It does seem pushy, when you think about it, for any search engine to invade your Web space, suck up all your text, and make it available to anyone with a matching keyword. Some people feel that Google’s cache is more than just pushy, and infringes copyright regulations by caching an unauthorized copy of a site.

If you want to keep the Google crawl out of your site, get familiar with the robots.txt file, which is how you put the Robots Exclusion Protocol to work. Google’s spider understands and obeys this protocol.

The robots.txt file is a short, simple text file that you place in the top-level directory (root directory) of your Web site, so that spiders can find it at an address such as http://www.bradhill.com/robots.txt. (If you use server space provided by a utility ISP, such as AOL, you probably need administrative help in placing the robots.txt file.) The file uses two instructions:

  • User-agent: This instruction specifies which search engine crawler must follow the robots.txt instructions.

  • Disallow: This line specifies which directories (Web page folders) or specific pages at your site are off-limits to the search engine. You must include a separate Disallow line for each excluded directory.
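Because the file must sit in the root directory, its address can be worked out from any page on the site. Here is a minimal Python sketch of that idea; the function name robots_txt_url is made up for illustration, and the page address is only a sample.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt address for the site hosting page_url.

    Hypothetical helper, for illustration only.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host, and swap the page path for /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.bradhill.com/family/reunion-notes.htm"))
# prints http://www.bradhill.com/robots.txt
```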

A sample robots.txt file looks like this:

User-agent: *
Disallow: /

This example is the most common and simplest robots.txt file. The asterisk after User-agent means the instructions apply to all spiders. The forward slash after Disallow means that all site directories are off-limits.
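If you want to see how a well-behaved spider reads that file, Python's standard urllib.robotparser module can play the part of the crawler. This is only a sketch of one parser's behavior, not a description of Google's own software, and the spider name SomeOtherBot is invented for the test.

```python
from urllib.robotparser import RobotFileParser

# Feed the two-line sample straight to the parser instead of fetching it.
block_all = RobotFileParser()
block_all.parse([
    "User-agent: *",
    "Disallow: /",
])

# Every spider is turned away from every address on the site.
home_blocked = not block_all.can_fetch("Googlebot", "http://www.bradhill.com/")
page_blocked = not block_all.can_fetch(
    "SomeOtherBot",  # hypothetical spider name
    "http://www.bradhill.com/family/index.htm",
)
print(home_blocked, page_blocked)  # prints True True
```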

The name of Google’s spider is Googlebot. (“Here, Googlebot! Come to Daddy! Sit. Good Googlebot! Who’s a good boy?”) If you want to exclude only Google and no other search engines, use this robots.txt file:

User-agent: Googlebot
Disallow: /
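The same standard-library parser can confirm that this file turns away Googlebot and nobody else. Treat it as a rough check under the assumption that other spiders match names the way this parser does; SomeOtherBot is again a made-up name.

```python
from urllib.robotparser import RobotFileParser

google_only = RobotFileParser()
google_only.parse([
    "User-agent: Googlebot",
    "Disallow: /",
])

url = "http://www.bradhill.com/index.htm"  # sample page address
googlebot_allowed = google_only.can_fetch("Googlebot", url)
other_allowed = google_only.can_fetch("SomeOtherBot", url)  # hypothetical spider

print(googlebot_allowed, other_allowed)  # prints False True
```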

You may identify certain directories as impervious to the crawl, either from Google or all spiders:

User-agent: *
Disallow: /cgi-bin/
Disallow: /family/
Disallow: /photos/

Notice the forward slash at each end of the directory string in the preceding example. Google understands that the first slash implies your domain address before it. So, if the first Disallow line were found at the bradhill.com site, the line would be shorthand for http://www.bradhill.com/cgi-bin/, and Google would know to exclude that directory from the crawl. The second forward slash is the indicator that you are excluding an entire directory: a Disallow line matches any address that begins with the string, so the ending slash keeps the match inside that one directory.
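The effect of the ending slash is easier to see with a quick test. The sketch below uses Python's urllib.robotparser as a stand-in for a spider; the page /familytree.htm is a hypothetical address chosen because it starts with the same letters as the /family/ directory.

```python
from urllib.robotparser import RobotFileParser

dir_rules = RobotFileParser()
dir_rules.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /family/",
    "Disallow: /photos/",
])

site = "http://www.bradhill.com"
verdicts = {}
for path in ("/family/reunion-notes.htm", "/familytree.htm", "/index.htm"):
    # can_fetch returns False when a Disallow line matches the start of the path.
    allowed = dir_rules.can_fetch("Googlebot", site + path)
    verdicts[path] = "allowed" if allowed else "blocked"
    print(f"{path}: {verdicts[path]}")
# prints:
# /family/reunion-notes.htm: blocked
# /familytree.htm: allowed
# /index.htm: allowed
```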

To exclude individual pages, type the page address following the first forward slash, and leave off the ending forward slash, like this:

User-agent: *
Disallow: /family/reunion-notes.htm
Disallow: /blog/archive00082.htm

Remember

Each excluded directory and page must be listed on its own Disallow line. Do not group multiple items on one line.
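If you build the file with a script, the one-item-per-line rule is easy to enforce. Here is a small Python sketch; build_robots_txt is a hypothetical helper, and the check at the end uses the standard urllib.robotparser module with a sample page (/family/index.htm) that is not part of the exclusion list.

```python
from typing import Iterable
from urllib.robotparser import RobotFileParser


def build_robots_txt(user_agent: str, excluded: Iterable[str]) -> str:
    """Write one User-agent line, then one Disallow line per excluded item."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in excluded]
    return "\n".join(lines) + "\n"


text = build_robots_txt(
    "*", ["/family/reunion-notes.htm", "/blog/archive00082.htm"]
)
print(text, end="")

# Read the generated file back the way a spider would.
checker = RobotFileParser()
checker.parse(text.splitlines())
site = "http://www.bradhill.com"
notes_allowed = checker.can_fetch("Googlebot", site + "/family/reunion-notes.htm")
index_allowed = checker.can_fetch("Googlebot", site + "/family/index.htm")
print(notes_allowed, index_allowed)  # prints False True
```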

Tip 

You may adjust the robots.txt file as often as you like. It’s a good tool when building out fresh pages that you don’t want indexed while still under construction. When they’re finished, take them out of the robots.txt file.
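Taking finished pages out of the file can be scripted, too. The following Python sketch assumes a simple file with one item per Disallow line; release_paths is a made-up helper name and /new-section/ is a hypothetical directory under construction.

```python
from typing import Iterable


def release_paths(robots_text: str, finished: Iterable[str]) -> str:
    """Drop the Disallow lines for items that are ready to be crawled."""
    ready = set(finished)
    kept = []
    for line in robots_text.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip() in ready:
            continue  # this item is finished, so stop excluding it
        kept.append(line)
    return "\n".join(kept) + "\n"


before = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /new-section/\n"
after = release_paths(before, ["/new-section/"])
print(after, end="")
# prints:
# User-agent: *
# Disallow: /cgi-bin/
```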






Google for Dummies
Google AdWords For Dummies
ISBN: 0470455772
Year: 2005
Pages: 188
