Finding Web Pages for the Organic Index | Search Engine Marketing, Inc.: Driving Search Traffic to Your Company's Web Site (2nd Edition)

Sounds easy, right? Searchers enter queries, and the search engine looks up the search terms in its organic index, ranks the best matches first, and displays the results. But how did all those pages get into the index in the first place? That is what Figure 2-5 shows, and what the rest of this chapter explains. This information is critical to you, the search marketer, because if your pages are not in the index, no searcher can ever find them.

Figure 2-5. How organic search engines index pages. Every search engine finds Web pages, analyzes their content, and builds a search index.

To build up the inventory of pages in the search index, search engines use a special kind of program known as a spider (sometimes called a crawler). Spiders start by examining Web pages in a seed list, because the spider needs to start somewhere. But after the spider gets started, it discovers sites on its own by following links.

Following Links

A spider uses the same links you click in your Web browser. When the spider examines a page, it sees the Hypertext Markup Language (HTML) code that indicates a link to another page (see Figure 2-6). This is the same HTML code that your browser formats to show you the page.

Figure 2-6. How spiders follow links. Every spider sees the same HTML code that your browser sees, and can follow links to other pages.

The spider scoops up the HTML for each page, noting links to other pages so it can come back to collect the HTML of those pages later. You can imagine that, given enough time, a spider can eventually find every page on the Web (or at least every page that some other page links to). This process of getting a page, finding all the links on that page, and then getting those pages in turn, is called crawling the Web. Later in this chapter, we explain what the spider does with the HTML it collects from all of those pages that it crawls.
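The crawling loop described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the PAGES dictionary is a made-up, in-memory stand-in for the Web (a real spider would fetch pages over HTTP), and the names LinkExtractor and crawl are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "Web": address -> HTML. A real spider fetches over HTTP.
PAGES = {
    "/home": '<a href="/products">Products</a> <a href="/about">About us</a>',
    "/products": '<a href="/home">Home</a> <a href="/widgets">Widgets</a>',
    "/about": '<a href="/home">Home</a>',
    "/widgets": "<p>No links here.</p>",
    "/orphan": "<p>No page links to this one.</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag found in a page's HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_list, pages):
    """Return page addresses in the order they were crawled."""
    queue = deque(seed_list)   # pages waiting to be collected
    seen = set(seed_list)      # pages already queued, to avoid repeats
    crawled = []
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:
            continue           # the link points to a page that does not exist
        crawled.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:   # note the link so we can come back to it
                seen.add(link)
                queue.append(link)
    return crawled


crawled = crawl(["/home"], PAGES)
print(crawled)
```

Starting from a seed list containing only /home, the sketch reaches every page that some other page links to. The /orphan page is never collected, because no link leads to it.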

HOW MUCH OF THE WEB IS INDEXED BY SEARCH ENGINES?

It sounds easy. Spiders visit the pages and send them to the search index. Those little spiders keep crawling until they index the entire Web, right? Wrong. The truth is that the great majority of Web pages are not indexed in search engines.

Over the years, there have been many estimates of the gap between the number of indexed pages and all Web pages. In 1999, Lawrence and Giles found that the (now defunct) Northern Light search engine indexed just 16 percent of the estimated 800 million publicly available Web pages (Searching the World Wide Web by Steve Lawrence and C. Lee Giles, 1999). But the next year, Michael Dahn claimed the problem might be twice as bad as reported, because the "publicly available" Web may underestimate the total Web by half (Counting Angels on a Pinhead: Critically Interpreting Web Size Estimates by Michael Dahn, 2000).

Not to be outdone, in 2001 two studies estimated the total number of Web pages to be far larger than previously reported. Sherman and Price proclaimed the "Invisible Web is between 2 and 50 times larger than the visible Web" (The Invisible Web by Chris Sherman and Gary Price, p. 82, 2001). In Deep Content: Surfacing Hidden Value (BrightPlanet, 2001), Michael Bergman posited the Web contains 550 billion pages and search engines see only 0.03 percent of them.

Regardless of the wildly divergent numbers, the point is that an enormous number of pages are not indexed, and your Web site probably contains some of them. Each page on your site that is not indexed is completely invisible to searchers, which reduces traffic to your site, so your goal is to get as many indexed as possible.

Your organization's Web site is undoubtedly known to the search engine spiders, and you certainly have some pages listed in their search indexes. But you might not have as many of your pages listed as you think, and any page that is not in the index can never be found by the search engine. So, it is important to have as many pages in the index as possible. Chapter 10, "Get Your Site Indexed," shows you how to find out how many pages are indexed from your organization's site and some simple ways to get more of them indexed.

Remembering Links

Following links is important because it is the best way for a spider to comprehensively crawl the Web. But it is important for another reason, too. Spiders must carefully catalog every link they find, checking which pages link to your page and which words are displayed to describe the link (the anchor text). Earlier in this chapter, we discussed how search engines rank search results; they do so with this information. Figure 2-7 shows how spiders collect the link information that is so important to ranking the results.

Figure 2-7. How spiders collect link information. Spiders pay attention to which pages link to every other page and what words they use on each link.
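As a rough illustration of this kind of catalog, the Python sketch below records, for each page, which pages link to it and what anchor text they use. The sample pages and the names AnchorExtractor and collect_link_info are invented for the example; real search engines store far more detail than this.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical sample pages: address -> HTML.
PAGES = {
    "/home": '<a href="/cameras">digital cameras</a>',
    "/reviews": '<a href="/cameras">best camera deals</a> '
                '<a href="/home">home page</a>',
    "/cameras": "<p>Our cameras.</p>",
}


class AnchorExtractor(HTMLParser):
    """Collects (href, anchor text) pairs for every link on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag we are currently inside
        self._text = []     # pieces of text seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_link_info(pages):
    """Map each linked-to page to a list of (linking page, anchor text)."""
    inbound = defaultdict(list)
    for source, html in pages.items():
        parser = AnchorExtractor()
        parser.feed(html)
        for target, anchor_text in parser.links:
            inbound[target].append((source, anchor_text))
    return dict(inbound)


link_info = collect_link_info(PAGES)
for target, links in link_info.items():
    print(target, links)
```

In this sample, /cameras is linked from two pages with two different anchor texts, which is exactly the kind of information a ranking step could draw on.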

Keeping Up with Changes

As you can imagine, Web crawling is not the most efficient way to keep up with changes to those billions of Web pages. New pages can be added, old pages removed, and existing pages changed at any time, and the spider will not immediately know that anything has changed. It can be days or weeks before the spider returns to see what happened. That is why a searcher sometimes gets a "page not found" message when clicking a search result. The spider found that page during its last crawl, but it has since been removed or given a new address.

This can be an especially vexing situation for some business Web sites. Your site might have fast-changing content, such as product catalogs that list what you have available each day. If you introduce new products frequently, or have a volatile supply environment, the pages on your site might not closely match the copies the spider has put in the search index. Chapter 3, "How Search Marketing Works," covers a service some search engines offer called paid inclusion that can help address this problem.

Even without paid inclusion, however, the best spiders try to keep their indexes "fresh" by varying how often they revisit sites. Spiders return more frequently to sites that change more quickly. Suppose a spider visits two pages on the same day and returns to both exactly a month later. If one of them has changed and the other has not, the spider can decide to revisit the changed page in two weeks, but wait six weeks to return to the unchanged page. Over time, this technique can greatly vary the return rate for the spider, raising the freshness of the index by revisiting volatile pages most frequently.
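A minimal sketch of such a revisit rule appears below. The multipliers (0.5 and 1.5) and the one-week and twelve-week limits are assumptions chosen so that the numbers match the example above, treating a month as four weeks. The function name next_revisit_interval is hypothetical, and actual search engines do not publish their scheduling rules.

```python
def next_revisit_interval(current_weeks, page_changed,
                          speed_up=0.5, slow_down=1.5,
                          min_weeks=1.0, max_weeks=12.0):
    """Return how many weeks to wait before the next visit to a page.

    A page that changed since the last visit is revisited sooner; an
    unchanged page is revisited later. All default values are assumptions
    made for illustration only.
    """
    factor = speed_up if page_changed else slow_down
    proposed = current_weeks * factor
    # Keep the interval within sensible limits.
    return max(min_weeks, min(max_weeks, proposed))


# Both pages were last revisited after a month (about four weeks).
changed_page_wait = next_revisit_interval(4, page_changed=True)
unchanged_page_wait = next_revisit_interval(4, page_changed=False)
print(changed_page_wait, unchanged_page_wait)
```

Applying the rule after every visit makes the intervals drift apart: a page that keeps changing settles at the minimum interval, while a page that never changes drifts toward the maximum.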

Spiders also return more often to sites that have the highest-quality pages. Google, for example, tends to revisit pages with higher PageRank more frequently (perhaps once per week) than other pages. The Yahoo! spider, in general, does not return to sites as frequently as Google's does, but it also pays more attention to well-linked pages.

Feeding the Index Without Crawling

By far, most pages in organic search engines are gathered by the search engine's spider, but crawling is not the only way to get your data into the search engine.

Some search engines allow your site to send its data instead of waiting for the spider to crawl your site. Yahoo! Search, some shopping search engines, and some others allow your site to provide a trusted feed; that is, your site sends pages to the search engine, which are processed and stored in the index as soon as they are received.

Some engines charge for trusted feeds (Yahoo!), some (especially shopping engines) require them but do not charge, and others (Google) do not accept them at all. (Google's shopping engine, Froogle, does accept them.) In Chapter 10, we examine the use of trusted feeds as part of your search marketing program.