How Google Searches the Internet | How the Internet Works (8th Edition)

When you search using Google, you're actually searching through an index of web pages. To gather the raw material for the index, Google's web-crawling robot, called Googlebot, sends a request to a web server for a web page and then downloads the page. Googlebot runs on many computers simultaneously and constantly requests and receives web pages, making thousands of requests per second. Even so, Googlebot deliberately makes requests more slowly than it is capable of, because if it operated at full throttle it would overwhelm many web servers, and those servers would not be able to deliver pages quickly enough to users.
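The idea of deliberately holding back can be sketched as a per-host throttle. This is only an illustration, not Googlebot's actual mechanism: the class name HostThrottle, the one-second default delay, and the injectable clock and sleep parameters are all assumptions made for the example.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum delay between requests to the same host.

    Hypothetical helper: the default delay is an assumed value,
    not a figure taken from any real crawler.
    """

    def __init__(self, min_delay_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until the host in `url` may be contacted; return seconds waited."""
        host = urlparse(url).netloc
        last = self._last_request.get(host)
        waited = 0.0
        if last is not None:
            remaining = self.min_delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        # Record the time of this request for the next call.
        self._last_request[host] = self._clock()
        return waited
```

A crawler would call wait(url) immediately before each download, so that requests to one server are spaced out while requests to different servers proceed without delay.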

Webmasters who don't want their sites to be searchable via Google can instruct Google not to index their sites. To do so, they create a text file called robots.txt containing only these two lines and put it in the root directory of the site:

User-agent: *
Disallow: /

That tells all search engines that honor robots.txt, not just Googlebot, to stay away.
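To see how a well-behaved crawler might interpret such a file, here is a small sketch using the robots.txt parser in Python's standard library. The helper name is_allowed and the example.com URLs are assumptions made for illustration; real crawlers, including Googlebot, use their own parsers.

```python
from urllib.robotparser import RobotFileParser

# The two-line file described above: every crawler is barred from every path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URL, used only to exercise the rules.
    print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/page.html"))
```

Because the rule applies to the user agent "*", the same check gives the same answer for any crawler name, which is why the file keeps all compliant search engines away rather than Googlebot alone.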

They can also tell Googlebot or other search engines to not search their site by putting this HTML tag into the <head> section of the HTML for their web page: