SEARCH ENGINES

The vast majority of search engines use spiders. These ingenious software programs have only one task: to crawl the Web 24 hours a day, finding and indexing web pages. These spiders (also called “bots” or “crawlers”) visit a web page, read it, and then follow links to other pages within the site. The spider revisits websites on a regular basis (e.g. every month or two) to look for changes.
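At its core, a spider’s crawl is a simple loop: fetch a page, pull out its links, and queue any pages it has not yet visited. The Python sketch below illustrates the idea using an in-memory “site” (a dict of URL to HTML) in place of real HTTP fetches; all names here are illustrative, not any engine’s actual code.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag -- the way a spider
    discovers new pages to visit."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(site, start):
    """Breadth-first walk over an in-memory 'site' (URL -> HTML),
    visiting each page exactly once -- a stand-in for a spider
    that would fetch pages over HTTP."""
    visited, queue, order = set(), [start], []
    while queue:
        url = queue.pop(0)
        if url in visited or url not in site:
            continue
        visited.add(url)
        order.append(url)
        parser = LinkExtractor()
        parser.feed(site[url])
        queue.extend(parser.links)
    return order

site = {
    "/": '<a href="/about">About</a> <a href="/shop">Shop</a>',
    "/about": '<a href="/">Home</a>',
    "/shop": '<a href="/shop/candles">Candles</a>',
    "/shop/candles": "Beeswax candles",
}
print(crawl(site, "/"))  # → ['/', '/about', '/shop', '/shop/candles']
```

Note that the crawl reaches every page only because each page is linked from another; a page with no inbound links would never be discovered, which is exactly why site-wide linking matters to spiders.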

Everything a spider finds is put in the search engine’s index. The index, sometimes called the catalog, is a giant database that contains a copy of every web page that the spiders find. When a web page changes, the index is updated with the new information, but not immediately: it can take a while (as much as six weeks) for new pages or changes to be added to an index. So although a web page may have been “spidered,” it may not yet have been “indexed,” and until such indexing occurs, that new page is not available to those searching with that search engine.
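The index described above can be sketched as an inverted index: a map from each word to the set of pages containing it. This toy Python version (all names hypothetical) also shows why a changed page must be re-indexed before the change becomes searchable: until `reindex` runs, queries still hit the old entries.

```python
from collections import defaultdict

def build_index(pages):
    """Build a simple inverted index: each word maps to the set of
    pages that contain it -- roughly what a search engine's
    'catalog' stores for every page its spiders have fetched."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def reindex(index, pages, url, new_text):
    """When a page changes, drop its old entries and index the new
    text -- on a real engine this step can lag the crawl by weeks,
    which is the 'spidered but not indexed' gap described above."""
    for urls in index.values():
        urls.discard(url)
    pages[url] = new_text
    for word in new_text.lower().split():
        index[word].add(url)

pages = {"/candles": "beeswax candles", "/soap": "goat milk soap"}
index = build_index(pages)
print(sorted(index["candles"]))  # → ['/candles']
```

Real indexes store far more per word (position, tag context, link text), but the lookup principle is the same.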

Search engines do not index the entire Web (although it may seem like they do). Most also don’t include dynamically created web pages, such as library web catalogs or other data behind CGI walls. And none index the entirety of every website, nor do they share a common search language (i.e. the algorithms that determine what is searched and how results are ranked vary from search engine to search engine). However, there are three important elements that are common to all search engines:

  1. Every search engine is built on a database that operates on the same principles as your website’s database. It consists of indexed descriptions of web pages, including a link list with a short description for each link. When a search request is received from a surfer, the search engine queries this database using special search algorithms and keywords to find the needed web pages.
  2. Search engines give each page they find a ranking that reflects the quality of the match to the surfer’s search query. Relevancy scores reflect the number of times a search term appears, whether it appears in the title, whether it appears near the beginning of the page or in HTML tags, and whether all the search terms are near each other. Some engines allow the user to influence the relevancy score by giving a different weight to each search word. A search term used too many times within a page can be considered web spamming (for which search engines penalize), so don’t overdo the use of a keyword or phrase on a page (don’t exceed the 15-25 count range).
  3. Each search engine has its own peculiar ranking method. For example, if there are no links to other sites or pages within a website (a single-page website), some search engines will not list that website.
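The ranking signals in point 2 can be illustrated with a toy scorer. Every weight below (the title bonus, the early-placement bonus, the spam cutoff) is invented purely for illustration; real engines keep their formulas secret.

```python
def relevancy(page, query, spam_limit=25):
    """Toy relevancy score mirroring the signals above: raw term
    count, a bonus for the term in the title, a bonus for the term
    near the top of the page, and a spam penalty once a term is
    repeated past a threshold. All weights are made up."""
    title = page["title"].lower()
    body = page["body"].lower().split()
    score = 0.0
    for term in query.lower().split():
        count = body.count(term)
        if count > spam_limit:      # keyword stuffing: penalize
            return 0.0
        score += count              # repetition
        if term in title:           # term in the page title
            score += 5
        if term in body[:20]:       # term near the top of the page
            score += 2
    return score

pages = [
    {"title": "Beeswax Candles", "body": "handmade beeswax candles " * 3},
    {"title": "Candle Shop", "body": "we sell candles and wax melts"},
]
ranked = sorted(pages, key=lambda p: relevancy(p, "beeswax candles"),
                reverse=True)
print(ranked[0]["title"])  # → Beeswax Candles
```

The page that contains both search terms, in its title and near the top, ranks first; a page that merely repeats a term past the cutoff scores zero.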

SPIDERS

A spider is an automated software program designed by search engines to follow hyperlinks throughout a website, retrieving and indexing pages in order to document the site for searching purposes. But what should concern a website designer are a spider’s nuances, in particular how it determines relevancy. If someone searches for “beeswax candles,” the search results will include only those web pages that contain the words “beeswax candles.” That is simple enough, but suppose more than one website contains the term “beeswax candles”? Search engine results are presented in descending order of relevancy to the search term that was used: relevancy determines which result is presented first, which second, and so on. The spider’s job is to work out which page is most relevant to the term “beeswax candles” and which is the least relevant.

Spiders calculate relevancy based on four factors: repetition, prominence, emphasis, and link popularity. Let’s examine each of these more closely.

Repetition. This is simply the number of times a word is repeated on the page. The more often a word is repeated, the greater its relevancy to the page. But resist the temptation to simply repeat the “keyword” over and over again, because spiders are programmed to de-list a web page that contains too many repetitions.
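A quick way to stay inside the 15-25 count range suggested earlier is to count the keyword yourself before publishing a page. A small sketch (the threshold is just that rule of thumb, not a limit published by any engine):

```python
def repetition_report(body, keyword, max_count=25):
    """Count how often a keyword appears in the body text and flag
    likely 'spamming' -- the 25 cutoff is only the rule of thumb
    from the text above, not an engine-published limit."""
    words = body.lower().split()
    count = words.count(keyword.lower())
    return {"count": count, "spammy": count > max_count}

print(repetition_report("beeswax candles, pure beeswax", "beeswax"))
# → {'count': 2, 'spammy': False}
```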

Prominence. This refers to where keywords appear within the website. Originally, all a spider looked for was the “keyword” meta tag, but not any longer. Now spiders look in keyword meta tags, description meta tags, alternative text tags (on images), page titles, body text, and link text.
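Those locations can be checked mechanically. The sketch below uses Python’s standard html.parser to record which of them a keyword shows up in; it is a rough illustration of the idea, not how any engine actually parses pages.

```python
from html.parser import HTMLParser

VOID_TAGS = {"meta", "img", "br", "link", "input", "hr"}

class ProminenceScanner(HTMLParser):
    """Records which of the locations listed above a keyword appears
    in: page title, meta tags, image alt text, link text, or body."""
    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.found = set()
        self._stack = []   # open tags, so text can be attributed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description") and \
               self.keyword in (attrs.get("content") or "").lower():
                self.found.add("meta " + name)
        elif tag == "img" and self.keyword in (attrs.get("alt") or "").lower():
            self.found.add("alt text")
        if tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if self.keyword in data.lower():
            tag = self._stack[-1] if self._stack else "body"
            if tag == "title":
                self.found.add("page title")
            elif tag == "a":
                self.found.add("link text")
            else:
                self.found.add("body text")

html_page = """<html><head><title>Beeswax Candles</title>
<meta name="description" content="Pure beeswax candles"></head>
<body><p>Our beeswax candles burn clean.</p>
<img src="c.jpg" alt="beeswax candle"><a href="/shop">beeswax candles</a>
</body></html>"""

scanner = ProminenceScanner("beeswax")
scanner.feed(html_page)
print(sorted(scanner.found))
# → ['alt text', 'body text', 'link text', 'meta description', 'page title']
```

A page that scores in all five locations has far better prominence than one that mentions the keyword only in body copy.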

Emphasis. This combines the number of times a search term appears with how it is presented: whether it appears in the title, at the beginning of the page, or within HTML tags, and whether all the search terms are near each other. Note: some engines allow the user to control the relevancy score by giving a different weight to each search word; others won’t.

Link popularity. This is the number of third-party sites that link to a website. Each link is regarded as a “vote” for the site. But to complicate matters, the “votes” carry greater weight when the linking website is one that the spider recognizes as having a theme similar to that of the web page it is crawling.
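The weighted-vote idea can be sketched in a few lines. The 2x weight for same-theme sites below is an arbitrary illustration, not a figure any engine publishes.

```python
def link_popularity(page_theme, inbound_links):
    """Count inbound links as 'votes', weighting a vote more
    heavily when the linking site shares the page's theme --
    the 2x weight is purely illustrative."""
    score = 0.0
    for site in inbound_links:
        score += 2.0 if site["theme"] == page_theme else 1.0
    return score

links = [
    {"site": "candlemakers.example", "theme": "candles"},
    {"site": "news.example", "theme": "news"},
    {"site": "waxcrafts.example", "theme": "candles"},
]
print(link_popularity("candles", links))  # → 5.0 (2 + 1 + 2)
```

Two on-theme links plus one off-theme link outscore three off-theme links, which is why links from related sites are worth pursuing first.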

While all crawler-based search engines operate in basically the same way, there are differences, which is one of the reasons the same search on different search engines produces different results. For instance, when spiders submit their results for indexing, either the data is placed directly into the search engine’s index, or it is vetted by humans prior to indexing. Once your pages are added to the relevant indexes, your potential customers can search using various keywords and phrases to find the pages that best match their search criteria. Search engines also use software programs to sift through the millions of pages recorded in the index to find matches when a search term is entered. These programs rank the web pages contained in the index; how they perform the ranking task is kept a closely guarded secret.

Appearing within the first 20 returns of any relevant search is critical to driving customers to your website. But since each search engine uses its own special “magic” to determine the rank or position of importance of each individual website, the exact rules that the search engine uses to rank pages for relevance are generally tough to ascertain (and they change often).



The Complete E-Commerce Book, Second Edition: Design, Build & Maintain a Successful Web-based Business
ISBN: B001KVZJWC
Year: 2004
Pages: 159
