Dealing with Robots


Robots, web spiders, and web crawlers are different names for a category of programs that download pages from your website, recursively following your site's links. Web search engines use these programs to scan the Internet for web servers, download their content, and index it. Regular users use them to download an entire website, or a portion of one, for later offline browsing. These programs are usually well behaved, but sometimes they can be very aggressive, swamping your website with too many simultaneous connections or getting caught in cyclic loops.

Well-behaved spiders will request a special file, called robots.txt, that contains instructions about how to access your website and which parts of the website won't be available to them.
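
For example, a minimal robots.txt placed in the root directory of the website might look like the following. The /private/ and /tmp/ paths and the BadBot user agent are only placeholders; substitute the paths and robot names that apply to your site:

    # Ask every robot to skip these directories
    User-agent: *
    Disallow: /private/
    Disallow: /tmp/

    # Ask a specific robot to stay away from the whole site
    User-agent: BadBot
    Disallow: /

The first record applies to all robots and excludes two directories; the second tells a robot identifying itself as BadBot not to crawl the site at all.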

The full syntax for the file is described at http://www.robotstxt.org. You can also stop unwanted requests at the router or operating-system level.

But sometimes web spiders don't honor the robots.txt file. In those cases, you can use the Robotcop Apache module mentioned in the previous section, which enables you to stop misbehaving robots.
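
If Robotcop is not available, a rough alternative sketch is to block the offending user agents with the standard Apache directives provided by mod_setenvif and mod_access. In this example, BadBot and EvilCrawler are placeholder names, and the <Directory> path is an assumption; substitute the User-Agent strings you actually see in your logs and your own document root:

    # Tag requests whose User-Agent matches a misbehaving crawler
    # (BadBot and EvilCrawler are placeholder names)
    BrowserMatchNoCase "BadBot|EvilCrawler" bad_robot

    <Directory "/var/www/html">
        Order Allow,Deny
        Allow from all
        # Reject any request tagged above
        Deny from env=bad_robot
    </Directory>

Keep in mind that this only works for robots that send a recognizable User-Agent header; crawlers that forge their identity have to be blocked by IP address at the firewall or router instead.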



