How Search Engines Work



So how do search engines work? First, a large number of pages are gathered from the Web using a process often called spidering. Next, the collected pages are indexed to determine what they are about. Finally, a search page is built so that users can enter queries and see which pages relate to them. The best analogy for the process is that the search engine collects as big a haystack as possible, tries to organize the haystack somehow, and then lets the user hunt for the proverbial needle in the resulting haystack of information by entering a query on a search page. Figure 17-2 shows a basic overview of how search engines work.

Figure 17-2: Overview of search engines
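To make the three steps concrete, the following sketch walks through gather, index, and query in miniature. It is purely illustrative and nothing like a real search engine: the seed URL, the regular-expression word extraction, and the "every term must match" query logic are all assumptions made just for this example.

import re
import urllib.request

seed_urls = ["http://www.htmlref.com/"]          # assumed starting point for the sketch

# Step 1: gather (spider) - fetch the raw text of each page.
pages = {}
for url in seed_urls:
    try:
        with urllib.request.urlopen(url) as response:
            pages[url] = response.read().decode("utf-8", errors="replace")
    except OSError:
        pass  # unreachable pages are simply skipped in this sketch

# Step 2: index - build an inverted index mapping each word to the URLs containing it.
index = {}
for url, text in pages.items():
    for word in set(re.findall(r"[a-z0-9]+", text.lower())):
        index.setdefault(word, set()).add(url)

# Step 3: search - return the pages that contain every term in the query.
def search(query):
    results = None
    for term in query.lower().split():
        matches = index.get(term, set())
        results = matches if results is None else results & matches
    return sorted(results or [])

print(search("html reference"))

Real engines differ at every step, of course: they crawl billions of pages, store far richer index data, and rank results rather than simply matching terms.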

Adding to the Engines

Getting a site's pages gathered by a search engine is the first step in making a site findable on the Web. The easiest way to do this is simply to tell search engines that your site exists. Most search engines will allow you to add a URL to be indexed. For example, Google allows you to add a site for gathering by using a simple form (www.google.com/addurl.html). Of course, adding your site to every single search engine could be a tedious task, so many vendors (www.submit-it.com) are eager to provide developers with a way to bulk submit to numerous search engines. Most Web site promotion software, such as WebPosition Gold (www.webposition.com), also includes automated submission utilities.

You may wonder how many search engines you should submit your site to. Some people favor adding only a few links to the important top ten engines, especially Google, MSN, AltaVista, and Yahoo! Numerous studies, as well as this author's experience, suggest that the big search sites, particularly Google and Yahoo!, account for most search engine referring traffic. However, some site promotion experts feel this is not correct and believe it is best to get your site listed in as many places as possible. In fact, a whole class of link sites called "Free For All" links or FFA sites (not to be confused with anything related to the Future Farmers of America) have sprung up to serve people who believe that "all links should lead to me" works. The reality is that most of these link services are pretty much worthless and tend to generate junk traffic and spam messages. Further, consider that even if you do get back links and e-mail, it is mostly from people doing the same thing you are: trying to get links.

Robot Exclusion

Before getting too involved in putting your site in every search engine, consider that it isn't always a good idea to have a robot index your entire site, regardless of whether it is your own internal search engine or a public search engine. First, some pages, such as the programs in your cgi-bin directory, don't need to be indexed. Second, many pages can be transitory, and having them indexed might result in users seeing 404 errors if they enter from a search engine. Last, you might simply not want people to enter on every single page, particularly those deep within a site. So-called "deep linking" can be confusing for users entering from public search engines. Because these users start out deep in a site, they are never exposed to the home or entry page information that is often used to orient site visitors.

Probably the most troublesome aspect of search engines and automated site gathering tools such as offline browsers is that they can be used to stage a denial of service attack on a site. The basic idea of most spiders is to fetch pages and follow links as fast as they can. Now consider what happens if you tell a spider to crawl a single site as fast as it possibly can: the flood of requests could very quickly overwhelm the crawled server, leaving the site unable to fulfill requests and thus denying service to legitimate visitors. Fortunately, most people do not spider maliciously, but understand that the same thing happens inadvertently when a spider keeps re-indexing the same dynamically generated page.
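The sketch below illustrates the two small courtesies a well-behaved spider typically adds to avoid exactly this problem: a pause between requests so the server is not flooded, and a record of pages already visited so the same (possibly dynamically generated) URL is not fetched over and over. The starting URL and the delay value are simply assumptions for the example.

import time
import urllib.request

start_url = "http://www.htmlref.com/"    # assumed starting point
delay_seconds = 2                        # assumed pause between requests

visited = set()
queue = [start_url]

while queue:
    url = queue.pop(0)
    if url in visited:
        continue                         # never re-fetch a page already seen
    visited.add(url)
    try:
        with urllib.request.urlopen(url) as response:
            page = response.read()
    except OSError:
        continue                         # skip pages that cannot be fetched
    # ... extract links from page and append any new ones to queue ...
    time.sleep(delay_seconds)            # throttle so the crawled server is not overwhelmed

An impolite or malicious spider simply omits the bookkeeping and the pause, which is all it takes to hammer a server.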

Robots.txt

To deal with limiting robot access, the Robot Exclusion protocol was adopted. The basic idea is to use a special file called robots.txt, which should be found in the root directory of a Web site. For example, if a spider were indexing http://www.htmlref.com, it would first look for a file at http://www.htmlref.com/robots.txt. If it finds the file, it analyzes it before proceeding to index the site.
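As an illustration, Python's standard urllib.robotparser module performs exactly this lookup; a compliant spider written in any language does something equivalent before fetching pages. The site is the one mentioned above, but the /cgi-bin/search.cgi path is just a made-up example.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.htmlref.com/robots.txt")  # the file a spider checks first
rp.read()                                        # fetch and parse robots.txt

# Ask whether a given user agent may fetch a particular page.
print(rp.can_fetch("*", "http://www.htmlref.com/cgi-bin/search.cgi"))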

Note  

If you have a site such as http://www.bigfakehostingvendor.com/~customer, be aware that spiders look for robots.txt only at the root of the domain, so many of them will ignore a file at http://www.bigfakehostingvendor.com/~customer/robots.txt. Unfortunately, you will have to ask the vendor to place an entry for you in their robots.txt file.

The basic format of the robots.txt file is a listing of the particular spider or user agent you are looking to limit, followed by statements indicating which directory paths to disallow. You can also specify rules that apply to all user agents by using the wildcard *. Consider the following:

 User-agent: *
 Disallow: /cgi-bin/
 Disallow: /temp/
 Disallow: /archive/

In this case, you have denied all robots access to the cgi-bin directory, the temp directory, and an archive directory, possibly a place where you move files that are very old but still need to remain online. You should be very careful about what you put in your robots.txt file. Consider the following file:

 User-agent: *
 Disallow: /cgi-bin/
 Disallow: /images/
 Disallow: /subscribers-only/
 Disallow: /resellers.html

In this file, a subscribers-only directory and a resellers.html file have been disallowed for indexing. However, you have also just let people know that this content is sensitive. For example, if you have content that is hidden unless someone pays to receive a URL via e-mail, you certainly do not want to list it in the robots.txt file; just letting people know the file or directory exists is a problem. Malicious visitors will actually look carefully at a robots.txt file to see just what it is you don't want people to see. That's very easy to do; just type in a URL like so: http://www.companytolookat.com/robots.txt.

Be aware that the robot exclusion standard assumes that spidering programs will abide by it. A malicious spider will, of course, simply ignore this file, and you might be forced to set up your server to block particular IP addresses or user agents if someone decides to attack your site.
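Such blocking is normally configured in the Web server or firewall itself rather than in application code, but the small sketch below shows the idea: a wrapper around a Python WSGI application that refuses requests whose IP address or user-agent string appears on a blocklist. The addresses and agent fragments are made up for the example.

from wsgiref.simple_server import make_server

BLOCKED_IPS = {"192.0.2.10"}                  # example address (assumed)
BLOCKED_AGENTS = ("badspider", "evilbot")     # example user-agent fragments (assumed)

def block_bad_robots(app):
    """Wrap a WSGI app so blocked clients receive a 403 response."""
    def wrapper(environ, start_response):
        ip = environ.get("REMOTE_ADDR", "")
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if ip in BLOCKED_IPS or any(frag in agent for frag in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Access denied.\n"]
        return app(environ, start_response)
    return wrapper

def hello(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Welcome, legitimate visitor.\n"]

if __name__ == "__main__":
    make_server("", 8000, block_bad_robots(hello)).serve_forever()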

Robot Control with <meta>

An alternative to the robots.txt file, useful particularly for those who have no access to the root directory of their domain, is a <meta> tag that controls indexing. To disallow indexing of a particular page, use a <meta> tag such as

  <meta name="robots" content="noindex" />  

in the <head> of the document. You also can instruct a spider to not follow any links coming out of the page:

  <meta name="robots" content="noindex, nofollow" />  

When using this type of exclusion, just make sure not to confuse the robot with contradictory information such as

  <meta name="robots" content="index, noindex" />  

or

  <meta name="robots" content="index, nofollow, follow" />  

as the spider may ignore the information entirely or perhaps even index the page anyway. The other downside to the <meta> tag approach is that fewer of the public search engines support it than support robots.txt.


