After you have determined that your site is indexed, and you have calculated how many pages you have indexed, you are certain to be greedy for more. The number of pages you can have indexed is limited only by the number of pages on your site. Many sites have millions of pages included, whereas some prominent Web sites have only their home page indexed.
You can take several steps to raise the inclusion ratio of your site, including the following:
Many complain about the inability of spiders to index certain content. Although we are the first to agree that spiders can improve their crawling techniques, there are good reasons why spiders stay away from some of this content. You have a choice as to whether you wring your hands and complain about the spiders or set to work pleasing them so your pages are indexed. You can guess which path will be more successful.
If your site is suffering from a low inclusion ratio, you can take several steps, but eliminating spider traps is the most promising place to start.
Eliminate Spider Traps
As we have said before, spiders cannot index all pages. But we have yet to say what causes problems with the spiders. That's where we're going now.
Spiders are actually rather delicate creatures, and they can be thrown off by a wide variety of problems that we call spider traps. Spider traps are barriers that prevent spiders from crawling a site, usually stemming from technical approaches to displaying Web pages that work fine for browsers but do not work for spiders. By eliminating these techniques from your site, you allow spiders to index more of your pages.
Unfortunately, many spider traps are the product of highly advanced technical approaches and highly creative user-experience designs that were frightfully expensive to develop. No one wants to hear, after all the money was spent, that your site has been shut out of search. Yet that is the bad news that you might need to convey.
Luckily, spiders become more sophisticated every year. Designs that trapped spiders a few years ago are now okay. But you need to keep up with spider advances to employ some cutting-edge techniques.
So here they come! Here is how you eliminate the most popular spider traps.
Carefully Set Robots Directives
Pretend that you are the Webmaster of your site, and you just learned that there is a software probe that has entered your Web site and appears to be examining every page on the site. And it seems to come back over and over again. Sounds like a security problem, doesn't it? Even if you could assure yourself that nothing nefarious is afoot, it is wasting the time of your servers.
Too often, that is how Webmasters view search spiders: a menace that needs to be controlled. And the robots.txt file is the way to control spiders.
It is a remarkably innocuous-looking file, a simple text file that is placed in the root directory of a Web server. Your robots.txt file tells the spider what files it is allowed to look at on that server. No technical reasons prevent spiders from looking at the disallowed files, but there is a gentleman's agreement that spiders will be polite and abide by the instructions.
A robots.txt file contains only two operative statements:
Figure 10-5 shows a robots.txt file with explanations of what each line means.
Figure 10-5. Coding robots.txt files. Robots.txt files direct the spider on how to crawl your Web site, or direct them to avoid your site completely.
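A minimal robots.txt file along the lines of the one in Figure 10-5 might look like this (the directory names and the spider name BadBot are hypothetical):

```
# Allow all spiders everything except program files and the beta area
User-agent: *
Disallow: /cgi-bin/
Disallow: /beta/

# Bar one badly behaved spider from the entire site
User-agent: BadBot
Disallow: /
```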
Webmasters have a legitimate reason to keep spiders out of certain directories on their servers: server performance. Most Web servers have programs stored in the cgi-bin directory, so it is a good idea to have your robots.txt file say "disallow: /cgi-bin/" to save the server from having to send the spider all those program files the spider does not want to see anyway. The trouble comes when an unsuspecting Webmaster does not understand the implications of disallowing other files, or all files.
Although many Webmasters use the robots.txt file to deliberately exclude spiders, accidental exclusion is all too common. Imagine a case where this file was used on a beta site to hide it from spiders before the site was launched. Unfortunately, the exclusionary robots.txt file might be left in place after launch, causing the entire Web site to disappear from all search indexes.
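You can check for this kind of accidental exclusion yourself. This sketch uses Python's standard urllib.robotparser module (the URLs are hypothetical) to show how a leftover beta-site file blocks everything:

```python
# A quick check using Python's standard urllib.robotparser module; the URLs
# here are hypothetical.
from urllib.robotparser import RobotFileParser

beta_rules = RobotFileParser()
beta_rules.parse([
    "User-agent: *",
    "Disallow: /",          # fine for a beta site, fatal if left in at launch
])

launch_rules = RobotFileParser()
launch_rules.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",  # keep spiders out of program files only
])

print(beta_rules.can_fetch("Googlebot", "http://www.example.com/index.html"))    # False
print(launch_rules.can_fetch("Googlebot", "http://www.example.com/index.html"))  # True
```

Running the same check against your own live robots.txt after every site launch is a cheap way to catch the beta-site mistake before the spiders do.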
In addition to the robots.txt that controls spiders across your entire site, there is a way to instruct spiders on every page: the robots metatag. In the <head> section of the HTML of your page, a series of metatags are typically found in the form <meta name="type"> (where the "type" is the kind of metatag). One such metatag type is the robots tag (<meta name="robots">), which can control whether the page should be indexed and whether links from the page should be followed.
If the robots.txt file disallows a particular page, it does not matter what the robots metatag on that page says because the spider will not look at the page at all. If the page is allowed by the robots.txt instructions, however, the robots metatag is consulted by the spider as it looks at the page.
Figure 10-6 shows the variations available in the robots metatag for restricting indexing (placing the content in the index) and "link following" (using pages linked from this page as the next page to crawl). If the robots metatag is missing, the page is treated as if "index, follow" was specified.
Figure 10-6. Coding robots tags. Robots tags on your Web page direct the spider on whether to index the page, follow links from it, or do neither.
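In HTML, the variations shown in Figure 10-6 look like this:

```html
<!-- Index this page and crawl its links (the same as having no robots tag at all) -->
<meta name="robots" content="index,follow">

<!-- Index this page, but do not crawl its links -->
<meta name="robots" content="index,nofollow">

<!-- Keep this page out of the index and do not crawl its links -->
<meta name="robots" content="noindex,nofollow">
```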
Although you would normally want your pages to be coded without robots metatags (or with robots metatags specified as "index,follow"), there are legitimate reasons to use a robots tag to suppress spiders. Some pages on your site should be viewed only from the beginning of the sequence, such as a visual tour or a presentation. Although there is no problem with allowing searchers to land in the middle of such sequences, some site owners might not want them to, so they could code a robots tag on the first page of the presentation that says "index,nofollow" and specify "noindex,nofollow" on all the other pages.
Another reason to use a "noindex" robots tag is to prevent an error for the visitor. Your commerce facility might require a certain route through pages to work properly; you cannot land on the site at the shopping cart page, for example. Because there is no reason to have the shopping cart page indexed, you can code "noindex,nofollow" on that page to prevent searchers from falling into your cart.
But most of your pages should be available to be indexed. When many pages from your site are indexed, but a few are not, this tag is frequently the culprit. Unfortunately, it is common for this tag to be defined incorrectly in templates used to create many pages on your site. Or misguided Web developers employ the tags incorrectly. This was the case at Snap Electronics.
If you recall from Chapter 7, Snap's search landing page for the keyword phrase "best digital camera" was not in any of the indexes. Examining the pages showed that a number of Snap pages had restrictive robots tags. The product directory page was using the <meta name="robots" content="index,nofollow"> version of the tag, which caused the spider not to follow any of the directory page's links to the actual product pages. Moreover, even if this problem had not existed, each of the actual product pages had the <meta name="robots" content="noindex,nofollow"> version of the tag. The Web developers indicated they had done this so that the commerce system would not be overloaded by search engine spiders. After the developers learned more about search marketing, the tags were removed and the pages were indexed.
Eliminate Pop-Up Windows
Most Web users dislike pop-up windows, those annoying little ads that get in your face when you are trying to do something else. Pop-up ads are so universally reviled that pop-up blockers are in wide use. Many sites still use pop-ups, however, believing that drawing attention to the window is more important than what Web users want.
Many Web sites use pop-up windows for more than ads. So, if user hatred is not enough to cure you of pop-up windows, maybe this is: Spiders cannot see them. If your site uses pop-ups to display related content, that content will not get indexed. Even worse, if your site uses pop-ups to show menus of links to other pages, the spider cannot follow those links, and those pages cannot be reached by the spider.
If your site uses pop-ups to display complementary content, the only way to get that content indexed is to stop using pop-up windows. You must add that content to the pages that it complements, or you must create a standard Web page with a normal link to it. If you are having trouble convincing your extended search team to dump pop-ups, remind them that the rise of pop-up blockers means that many of your visitors are not seeing this content either.
If you are using pop-up windows for navigation menus, you can correct this spider trap in the same way, by adding the links to each page that requires them and removing the pop-up, but you have another choice, too. You can decide to leave your existing pop-up navigation in place, but provide alternative paths to your pages that the spiders can follow. We cover these so-called spider paths later in the chapter.
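One common way to keep the pop-up while opening a path for spiders, sketched here with a hypothetical specs.html page, is to put the real URL in the link's href so that spiders can follow it while script-enabled browsers still get the pop-up:

```html
<!-- Spider trap: the destination exists only inside script, so spiders cannot follow it -->
<a href="javascript:window.open('specs.html')">Product specifications</a>

<!-- Spider path: a normal link that browsers still open as a pop-up -->
<a href="specs.html" onclick="window.open(this.href); return false;">Product specifications</a>
```

With the second form, a browser runs the script and shows the pop-up, but a spider simply follows the href like any other link.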
Don't Rely on Pull-Down Navigation
Figure 10-7. Pull-down navigation prevents crawling. Pull-down windows block spiders from indexing any pages linked from them.
Simplify Dynamic URLs
So-called dynamic pages are those whose HTML code is not stored permanently in files on your Web server. Instead, for a dynamic page, a program creates the HTML "on the fly," whenever a visitor requests to view that page, and the browser displays that HTML just as if it had been stored in a file.
In the earliest days of the Web, every Web page was created by someone opening a file and entering his HTML code into the file. The name of the file and the directory it was saved within became its URL. So, if you created a file called sale.html and placed it in a top-level directory called offers on your Web server, your URL would be www.yourdomain.com/offers/sale.html (and that URL remained the same until you changed the file's name or moved it to a new directory). These kinds of pages are now referred to as static Web pages, to distinguish them from the dynamic pages possible today.
It did not take long to bump into the limitations of static pages: they contained the exact same information every time they were viewed. Soon the first technique for dynamic pages was defined, called the Common Gateway Interface (CGI), which allowed a Web server to run a program to dynamically create the page's HTML and return it to the visitor's Web browser. That way, there never needs to be a file containing the HTML; the program can generate the HTML the moment the page is requested for viewing.
You have probably noticed that some URLs look "different": they contain special characters that would not occur in the name of a directory or file. Figure 10-8 dissects a dynamic URL and shows what each part of it means.
Figure 10-8. Decoding a dynamic URL. Each part of a dynamic URL has a specific meaning that governs what content appears on the dynamic page.
The parameters in each dynamic URL (the words that start with the ampersand character [&]) are what cause complications for spiders. Because just about any value (the words that follow the equals sign character [=]) can be passed to a parameter, search spiders have no way of knowing how many different variations of the same page can be shown. Sometimes different values passed to each parameter indicate a legitimate difference in the pages, as in Figure 10-8, where each book has a different number. But other times, the values have nothing to do with what content is displayed, such as so-called "tracking codes," in which the Web site is designed to log visitors coming from certain places for measurement purposes. A spider could look at the exact same page thousands of times because the tracking parameter in the URL is different each time. Not only does this waste the spider's time (when it could be looking at truly new pages from other sites), but sometimes it causes these pages to be stored in the index, resulting in massive duplication of content. Clearly spiders must be wary of how they crawl dynamic sites.
In the early days of dynamic pages, spiders had a simple solution for this dynamic site problem: they refused to crawl any page with one of the tell-tale characters (? or & or others) in its URL. But CGI programs were just the first of a long list of techniques allowing programs to generate Web pages dynamically. Over time, more and more Web pages have become dynamic, especially on corporate sites. Highly personalized sites consist of almost 100 percent dynamic pages. Most e-Commerce catalogs consist of dynamic pages.
Because so much important Web content has become dynamic, the search engines have tried to adjust. Search spiders now index dynamic pages under certain circumstances:
If your site relies on passing more than two parameters in the URL, you might benefit from the URL rewrite technique, which presents your dynamic URL as if it were a static URL. For example, the page in Figure 10-8 might be rewritten as http://www.powells.com/book/62-1579123791-0 so that it appears to be a static page. This is a completely ethical technique that search spiders appreciate, and it has the benefit of making your URLs more readable for your human visitors. Each server platform and content management system has its own method of rewriting URLs, but let's look at how to do it for the most widely used Web server, Apache.
Apache Web servers contain a module called mod_rewrite that is very powerful. Just so you know, when technical people call a tool "powerful," it means that you can do anything you want with it, if you could only figure out how. Using mod_rewrite to change URLs is not dissimilar from performing woodcarving with a chain saw. It can be done, but you can also hurt yourself along the way. The mod_rewrite module allows an unlimited number of rules to be defined, requiring great attention to detail to ensure proper results. You can learn more about mod_rewrite at Apache's Web site (http://httpd.apache.org/docs-2.0/mod/mod_rewrite.html). Snap Electronics used mod_rewrite to convert its dynamic URLs for its e-Commerce catalog to appear to be static URLs. This allowed many of its product pages to be crawled that were missing from search indexes previously.
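As a sketch of what such a rule might look like in an Apache configuration or .htaccess file (the program name and parameter mapping here are invented for illustration, not taken from the Powell's site):

```apache
RewriteEngine On

# Show the friendly URL /book/62-1579123791-0 to visitors and spiders, but
# run the dynamic catalog program behind the scenes
RewriteRule ^book/([0-9-]+)$ /cgi-bin/catalog?key=$1 [L]
```

The visitor's browser (and the spider) sees only the static-looking URL; the rewrite to the dynamic program call happens entirely inside the server.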
As noted above, pages with session identifiers cause problems for spiders; your pages will not be indexed unless you remove them from your URL parameters. You might be wondering, "Why did my Web developers use session identifiers in the first place?" It's not that complicated. As visitors move from page to page on your site, each program that displays a new page wants to "remember" what your visitor did on prior pages. So, for example, the order confirmation page wants to remember that your visitor signed in and provided credit card information on the checkout page. Simple enough, but where does the session identifier come in?
Your developer decided (correctly) that the best way to share information between these separate programs that display different pages was to store the information in a database that each program can read and change. So, the program that displays the checkout page can store the credit card and sign-in data in the database; then the program that displays the order confirmation page can read that data from the database. But how does the order confirmation page know which record in the database has the information for each person that views the page? That is where the session identifier comes in.
When the visitor reaches the checkout page, the checkout page program creates a session identifier, a unique number that no other visitor gets, which it will associate with that visitor for the rest of the session (that visit to the Web site). When it stores information in the database, that program stores it with a "key" of the session identifier, and any of the other programs can read that information if they know the key. Which brings us back to the original problem: the developer is passing the key to each program in the URL's session identifier parameter.
Your developers can provide this function without using a session identifier parameter, however. If your programmers are using a sophisticated Web application environment, a "session layer" usually provides a mechanism for programs to pass information to one another; that is the best solution for the session identifier problem. If your Web infrastructure is not so sophisticated, you can use a cookie to hold the session information. If you go the cookie route, be careful not to trap the spider by forcing all visitors to have cookies enabled. We discuss why that is a problem in our next section.
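As a rough sketch of the cookie approach, here is how Python's standard library can both mint the session cookie and strip a leftover session parameter from a URL. The parameter name "sessionid" and these function names are invented for illustration, not taken from any particular commerce system:

```python
# A rough sketch of the cookie approach. The parameter name "sessionid" and
# these function names are invented for illustration.
import uuid
from http.cookies import SimpleCookie
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def make_session_id():
    # A unique number that no other visitor gets
    return uuid.uuid4().hex

def session_cookie_header(session_id):
    """Carry the session identifier in a Set-Cookie header instead of the URL."""
    cookie = SimpleCookie()
    cookie["sessionid"] = session_id
    return cookie.output(header="Set-Cookie:")

def strip_session_param(url):
    """Remove a leftover sessionid parameter so spiders see one stable URL."""
    parts = urlsplit(url)
    params = [(k, v) for k, v in parse_qsl(parts.query) if k != "sessionid"]
    return urlunsplit(parts._replace(query=urlencode(params)))

clean = strip_session_param("http://www.example.com/cart?item=42&sessionid=abc123")
print(clean)  # http://www.example.com/cart?item=42
```

Every visitor (and every spider) then sees the same URL for the same page, while the session key travels invisibly in the cookie.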
Eliminate Dependencies to Display Pages
Web sites are marvels of technology, but sometimes they are a bit too marvelous. Your Web developers might create such exciting pages that they require visitors' Web browsers to support the latest and greatest technology, or require visitors to lower their privacy settings or to reveal personal information. In short, your Web site might require visitors to take certain actions or to enable certain browser capabilities in order to operate. And although that is merely annoying for your visitors, it can be deadly for search spiders, because they might not be up to the task of viewing your Web site.
If you are as old as we are, you might remember the Pac-Man video game, in which a hungry yellow dot roamed the screen eating other dots, but changed course every time it hit a wall or another impediment. Search spiders are very similar. They will hungrily eat your spider food pages until they hit an impediment; then they will turn tail and go in a different direction. Let's look at some of the most popular technical dependencies:
Use Redirects Properly
Tim Berners-Lee, the inventor of the Web, once observed that "URLs do not change, people change them." The best advice to search marketers is never to change your URLs, but at some point you will probably find it necessary to change the URL for one of your pages. Your Webmaster might want to host a page on a different server, which requires the URL to change. At other times the content of a page changes so that the old URL does not make sense anymore, such as when you change the brand name of your product and the old name is still in the URL.
Whenever a URL is changed, you will want your Webmaster to put in place something called a redirect: an instruction to Web browsers to display a different URL from the one the browser requested. Redirects allow old URLs to be "redirected" to the current URL, so that your visitors do not get a "page not found" message (known as an HTTP 404 error) when they use the old URL.
A visitor might be using an old URL for any number of reasons, but here are the most common ones:
Now that you understand that URLs will often change for your pages, and that redirects are required so that visitors can continue to find those pages, you need to know a little bit about spiders and redirects.
Spiders, as you have learned, are finicky little creatures, and they are quite particular about how your Webmaster performs page redirects. When a page has been permanently moved from one URL to another, the only kind of redirect to use is called a server-side redirect; you might hear it called a "301" redirect, after the HTTP status code returned to the spider. A 301 status code tells the spider that the page has permanently changed to a new URL, which causes the spider to do two vitally important things:
Unfortunately, not all Webmasters use server-side redirects. There are several methods of redirecting pages, two of which are especially damaging to your search marketing efforts:
How your Webmaster implements a 301 redirect depends on what kind of Web server displays the URL. For the most common Web server, Apache, the Webmaster might add a line to the .htaccess file, like so:
Redirect 301 /OldDirectory/OldName.html http://www.YourDomain.com/NewDirectory/NewName.html
You would obviously substitute your real directory and filenames. Understand, however, that some Apache servers are configured to ignore .htaccess files, and other kinds of Web servers have different means of setting up permanent redirects, so what your Webmaster does might vary. The point is that your Webmaster probably knows how to implement server-side redirects, and search spiders know how to follow them.
Server-side redirects are also used for temporary URL changes using an HTTP 302 status code. A 302 temporary redirect can be followed by the spider just as easily as a 301. Webmasters have various reasons for implementing 302s, but one is especially important to search marketers: so-called vanity URLs. Sometimes it is nice to have a URL that is easy to remember, such as www.yourdomain.com/product, that shows the home page for one of your products. You tell everyone linking to your product page to use that vanity URL. But behind the scenes, your Webmaster can move that page to a different server whenever needed for load balancing and other reasons. When you use a 302 redirect, the spider keeps your vanity URL in the search index but indexes the content of the page it redirects to.
Before implementing any 301 or 302 redirect, your Webmaster should take care not to add "hops" to the URL; in other words, do not add a redirect on top of a previous redirect. For example, if the vanity URL has been temporarily redirected (302) to the current URL and now needs to point to a new URL, the existing 302 redirect should generally be changed to target the new URL. If, instead, the Webmaster implements a permanent (301) redirect from the current URL to the new URL, you now have two "hops" from your vanity URL to the real page. Not only does this slow performance for your visitors, but spiders are known to abandon pages with too many hops (possibly as few as four). You can use a free tool at www.searchengineworld.com/cgi-bin/servercheck.cgi to check how your URLs redirect.
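The hop problem is easy to simulate. In this sketch the URLs are invented and a table stands in for live HTTP requests, so the hop-counting logic is easy to see:

```python
# Illustrative only: the URLs are invented, and a table stands in for live
# HTTP requests so the hop-counting logic is easy to see.
def follow_redirects(url, redirect_table, max_hops=4):
    """Return the final URL and the number of hops taken to reach it."""
    hops = 0
    while url in redirect_table:
        if hops >= max_hops:
            raise RuntimeError("too many hops; a spider may abandon this URL")
        url = redirect_table[url]
        hops += 1
    return url, hops

# A vanity URL 302-redirected to a page that was later 301-redirected onward:
# two hops where one would do.
table = {
    "http://www.example.com/product": "http://www1.example.com/catalog/product.html",
    "http://www1.example.com/catalog/product.html": "http://www2.example.com/catalog/product.html",
}
final, hops = follow_redirects("http://www.example.com/product", table)
print(hops)  # 2
```

Repointing the vanity URL directly at the newest page collapses the chain back to a single hop.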
Make sure that your Webmaster is intimately familiar with search-safe methods of redirection, and confirm that the proper procedures are explained in your site standards so that all redirects are performed with care. Make sure that redirects are regularly reviewed and purged when no longer needed so that the path to your page is as direct as possible.
Ensure Your Web Servers Respond
If it sounds basic, well, it is; however, it is a problem on all too many Web sites. When the spider comes to call, your Web server must be up. If your server is down, the spider receives no response from your Web site. At best, the spider moves along to a new server and leaves your pages in its search index (without seeing any page changes you have made, of course). At worst, the spider might conclude (after a few such incidents over several crawls) that your site no longer exists and then delete all of the missing pages from the search index.
Don't let this happen to you. Your Webmaster obviously wants to keep your Web site available to serve your visitors anyway, but sometimes hardware problems and other crises cause long or frequent outages, possibly causing your pages to be deleted from one or more search indexes.
A less-severe but related problem is slow page loading. Although your site is technically up, the pages might be displayed so slowly that the spider soon abandons the site. Few spiders will wait 10 seconds for a page. Spiders are in a hurry, so if good performance for your visitors is not enough of a motivation, speed up your site for the spider's sake.
Reduce Ignored Content
After you have eliminated your spider traps and the spiders can crawl your pages, the next issue you might encounter is that they ignore some of your content. Spiders have refined tastes, and if your content is not the kind of food they like, they will move on to the next page or the next site. Let's see what you should do to make your spider food as tasty as possible.
Slim Down Your Pages
Like most of us, spiders do not want to do any unnecessary work. If your HTML pages routinely consist of thousands and thousands of lines, spiders are less likely to index them all, or will index them less frequently. For the same time they spend crawling your bloated site, they could crawl two others.
In fact, every spider will stop crawling a page when it gets to a certain size. The Google and Yahoo! spiders seem to stop at about 100,000 characters, but every spider has a limit programmed into it. If you have very large pages, they might not be getting crawled or not crawled completely.
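A crude size check along these lines is easy to script. The 100,000-character figure is the estimate from the text above; real spider limits vary and change over time:

```python
# A crude size check. The 100,000-character figure is an estimate; real
# spider limits vary and change over time.
def exceeds_crawl_limit(html, limit=100_000):
    """Flag pages whose HTML may be too large to be crawled completely."""
    return len(html) > limit

print(exceeds_crawl_limit("<html><body>A normal-sized page</body></html>"))  # False
print(exceeds_crawl_limit("<html>" + "x" * 150_000 + "</html>"))             # True
```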
Once in a while, someone decides to put all 264 pages of the SnapShot DLR200 User's Guide on one Web page. Obviously, the 264-page manual belongs on dozens of separate Web pages with navigation from the table of contents. Breaking up a large page also helps improve keyword density by making the primary keywords stand out more in the sea of words. Not only is this better for search engines, but your visitors will be happier, too.
Validate Your HTML
When you surf your Web site with your browser, you rarely see an error message. The Web pages load properly and they look okay to you. It is understandable for you to think that the HTML that presents each page on your site has no errors. But you would be wrong.
Here is why. Web browsers, especially Internet Explorer, are designed to make visitors' lives easier by overlooking HTML problems on your pages. Browsers are very tolerant of flaws in the HTML code, striving to always present the page as best as possible, even though there might be many coding errors. Unfortunately, spiders are not so tolerant. Spiders are sticklers for correct HTML code.
And most Web sites are rife with coding errors. Web developers are under pressure to make changes quickly, and the moment it looks correct in the browser, they declare victory and move on to the next task. Very few developers take the time to test that the code is valid.
You must get your developers to validate their HTML code. They must understand that coding errors provide the wrong information to the search spider. Consider something as seemingly minor as misspelling the <title> tag as <tilte> in your HTML. Browsers will not display your title in the title bar at the top of the window, but because the rest of the page looks fine, your developers and your visitors probably will not notice the error. The title tag, however, is an extremely important tag for the search engine; a missing title makes it much harder (sometimes impossible) for that page to be found by searchers. Validating the code catches this kind of error before it hurts your search marketing.
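You can see the effect with a spider's-eye parser. This sketch uses Python's standard html.parser module; like a spider, it matches tag names literally, so a misspelled <tilte> tag yields no title at all (the class and function names are our own):

```python
# A spider's-eye view of the page: unlike a forgiving browser, this parser
# matches tag names literally, so a misspelled <tilte> yields no title.
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def extract_title(html):
    parser = TitleExtractor()
    parser.feed(html)
    return parser.title

good = extract_title("<html><head><title>SnapShot DLR200</title></head></html>")
bad = extract_title("<html><head><tilte>SnapShot DLR200</tilte></head></html>")
print(good)  # SnapShot DLR200
print(bad)   # None
```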
Sometimes the errors are more subtle than a broken <title> tag. Comments in your HTML code might not be ended properly, causing the spider to ignore real page text that you meant to be indexed, because it takes that text as part of the malformed comment. In addition, browsers will often correctly display pages with slight markup errors, such as missing tags to end tables, but search spiders might lose some of your text. So the page might look okay, but not all of your words got indexed, and searchers cannot find your page when they use those words. Occasionally, HTML links, especially those using relative addresses where the full URL of the link is not spelled out, work fine in a browser but trip up the spider.
It is easy for your developers to validate their code. Just send them to http://validator.w3.org/ and they can enter the URL of any page they want to test. There are several flavors of valid HTML, from the strictest compliance with the standards to looser compliance that uses some older tags. As long as your page states which flavor it adheres to in its <!DOCTYPE> declaration, it will be validated correctly, and search spiders can read any flavor of valid HTML code. Make sure that your everyday development process requires that each page's HTML be validated before promotion to your production Web site.
Reserve Flash for Content You Do Not Want Indexed
Macromedia is a very successful company that has brought a far richer user experience to the Web than drab old HTML, allowing animation and other interactive features that spice up visual tours and demonstrations. This technology, called Flash, is supported on 98 percent of all browsers and can make your Web site far more appealing. (There are other graphical user environments similar to Flash, but Flash content is the vast majority, so we will just refer to everything as Flash, which is not far off.)
But (and you knew there was a but coming) spiders cannot index Flash content. Because Flash content is a lot closer to a video than a document, it is not clear how to index that content even if the spider could read it. Clearly, there is a lot less printed information in Flash content than on the average HTML page. So does that mean that you should not use Flash on your site? No. But it does mean you should use it wisely.
Reserve your use of Flash for content that you are happy not to have indexed: that 3D interactive view of your product or the walking tour of your museum's latest exhibit. You can also use Flash for application development, such as your online ordering system, something that you would not want indexed anyway. Do not use Flash to jazz up your annual report, unless you accept that no one can search for any words in the report to find it. And do not make your home page a Flash experience, unless you are exceedingly careful to ensure that spiders have another way into your site besides walking through the Flash door; give the spiders a plain old HTML link to boring old HTML pages. (Remember, you cannot pop up a question asking whether visitors want Flash or non-Flash, because spiders cannot answer that question either.)
When you do use Flash, make sure you always have an HTML landing page to kick off any Flash experience. That way you can have a short page that describes the great walking tour of your museum and allows visitors to click the Flash content. By using this technique, you will give the search engines a page to index that might be found by searchers looking for your walking tour; they will find the dowdy HTML page that leads to the exciting Flash tour.
If you have a Web site built entirely in Flash content and you absolutely cannot change it to HTML, you can legitimately use the IP delivery technique discussed earlier to get your content into the search index. Here's how. Your Webmaster must implement an IP detection program that runs whenever a page requiring Flash is to be displayed. That program uses the user agent name and IP address to recognize the difference between when a spider is calling and when a Web browser is calling. The Flash content is served up as usual for Web browsers (for your visitors), but spiders get a different meal: they are served an HTML page that has the same text on it as the Flash content. This use of IP delivery is entirely legitimate because you are serving the same text content to visitors and spiders. Be extremely careful, however, never to serve different text to visitors and spiders, because that would (rightly) be considered spamming. Ensure that your publishing process forces your Flash and your HTML content to be synchronized after every update so that you do not inadvertently violate spam guidelines.
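A sketch of the user-agent half of such a detection program follows. The spider names were current when this was written and the file names are hypothetical; a production version would also verify IP addresses, because user-agent strings are easily faked:

```python
# The user-agent half of an IP delivery program. Spider names and file names
# are illustrative; a real version would also verify IP addresses.
SPIDER_TOKENS = ("googlebot", "slurp", "msnbot", "teoma")

def is_spider(user_agent):
    """True if the request appears to come from a search spider."""
    ua = user_agent.lower()
    return any(token in ua for token in SPIDER_TOKENS)

def choose_content(user_agent):
    # Serve the same text either way; serving different text would be spamming
    return "tour.html" if is_spider(user_agent) else "tour.swf"

print(choose_content("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # tour.html
print(choose_content("Mozilla/4.0 (compatible; MSIE 6.0)"))       # tour.swf
```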
So remember, use Flash for things that are truly interactive and visualnot documents. Or if you must use Flash for documents, make sure there is an HTML version of the document as well for spiders.
Eliminate Frames
If your site's design has not been updated in a while, you might still have pages that use frames. Frames are an old technique of HTML coding that can display multiple sources of content in separate scrollable windows in the same HTML page. Frames have many usability problems for visitors, and they have been replaced with better ways of integrating content on the same page, using content management systems and dynamic pages. But some sites still have pages coded with frames.
If you are unlucky enough to have frame-based pages on your site, the best thing to do is to replace them. Your visitors will have a better experience, and you will improve search marketing, too, because spiders have a devil of a time interpreting frame-based pages. Typically spiders ignore everything in the "frameset" and look for an HTML tag called <noframes> that was designed for (ancient) browsers that do not support frames.
There are techniques that people use to try to load the pertinent content for search into the <noframes> tag, but they are a lot of work to create and maintain. Our advice is to ditch frames completely. Creating a new frame-free page will end up being a lot less work in the long run and will improve the usability of your site, too.
Create Spider Paths
Now that you have learned all about removing spider traps, let's look at the opposite approach, too. Sometimes it is very difficult or costly to remove a spider trap. In those cases, your only option is to provide an alternative way for the spider to traverse your site, so it can go around your trap. That's where spider paths come in.
Spider paths are just easy-to-follow routes through your site, such as site maps, category maps, country maps, or even text links at the bottom of the key pages. Quite simply, spider paths are any means that allow the spider to get to all the pages on your site. The ultimate spider path is a well thought-out and easy-to-navigate Web site: if your Web site has no spider traps, you might already have a wonderful set of spider paths. With today's ever-more-complex sites full of Flash, dynamic pages, and other spider-blocking technology, however, you need to make accommodations for spiders trapped by your regular navigation.
Site maps are very important, especially for larger sites. Human visitors like them because they can see at a glance the breadth of information available, and spiders love them for the same reason.
Not only do site maps make it easier for spiders to get access to your site's pages, they also serve as very powerful clues to the search engine as to the thematic content of the site. The words you use as the anchor text for links from your site map can sometimes carry a lot of weight. Site maps often use the generic name for a product, whereas the product page itself uses the brand name; searchers for the generic name might be brought to your product page because the site map linked to it using that generic name. Work closely with your information architects to develop your site map, and you will reap large search dividends.
For a small site, your site map can have direct links to every page on your site. You can categorize each page under a certain subject, similar to the way Yahoo! categorizes Web sites in its directory, so that your site map lists a dozen or so topics with links to a few pages under each one. Your site map does not need to follow your folder structure; sometimes the site map can offer an alternative way of navigating the site that helps some visitors. This simple approach probably works until you have about 100 pages.
When your site reaches several hundred pages, you cannot fit that many links on one site map page. You should modify your site map to link to category hub pages (maybe corresponding to the same topics that you used for your original site map). Because you might have just 10 to 15 links on your page (one for each category), you might want to add a descriptive paragraph for each category to augment the link. From each category hub page your visitor can link deeper into the site to see all other pages. This approach can work even for sites with 10,000 pages or more.
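The two-tier structure described above can be sketched as a small routine that groups pages by topic and decides between a flat site map and category hubs. The page data, hub URLs, and 100-page threshold here are illustrative assumptions, not a prescribed implementation.

```python
# Sketch: build a flat site map for a small site, or a site map
# plus one hub page per topic once the page count grows.

from collections import defaultdict

def build_site_map(pages, direct_link_limit=100):
    """pages: list of (url, title, topic) tuples.

    Returns a dict mapping each generated page to its topic->links map.
    """
    by_topic = defaultdict(list)
    for url, title, topic in pages:
        by_topic[topic].append((url, title))

    if len(pages) <= direct_link_limit:
        # Small site: one site map page linking directly to every page.
        return {"site-map": dict(by_topic)}

    # Larger site: the site map links to one hub page per topic,
    # and each hub page links to that topic's pages.
    site_map = {"site-map": {t: [(f"/hub/{t}", t.title())] for t in by_topic}}
    for topic, links in by_topic.items():
        site_map[f"/hub/{topic}"] = {topic: links}
    return site_map

pages = [("/dog-food", "Dog Food", "products"),
         ("/contact", "Contact Us", "company")]
print(build_site_map(pages))
```

A real site map generator would also emit the descriptive paragraph per category that the text recommends, since that anchor text and surrounding copy is what gives spiders their thematic clues.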
Very large Web sites (100,000 pages or more) frequently have multiple top-level hub pages that, taken together, form an overall site map, because they cannot fit all of their topics on one site map page. IBM's Web site (www.ibm.com) uses this approach, with its top three hub pages for "Products," "Services & Solutions," and "Support & Downloads," as you can see in Figure 10-9. These three pages are shown in a navigation bar at the top of every page on the site, including the home page, making it very easy for spiders. Each page lists a number of categories relevant to the page: the "Products" page lists all of IBM's product categories, with a similar list on the "Services & Solutions" page, and the "Support & Downloads" page provides links to the support centers for IBM products. Taken together, these pages form an extensive site map that spiders feast on, returning at least weekly to see whether any important links have been added to IBM's site. In addition, search engines consider these pages to be highly authoritative, with Google assigning them a PageRank of 9 or 10 at times.
Figure 10-9. A system of site map pages for a large Web site. IBM has three hub pages that are visited regularly by the search engines.
Your site map might not do as well as that of a popular site such as ibm.com, but IBM's example shows the importance of a site map page as the key page on your site for spiders. If you have new content that you want the spiders to find quickly, add a link to it from your site map page. Remember, too, that because some search engine spiders limit the number of links that they index on a page, links within the site map should be ordered by level of importance. You should also try to include text on your site map page, rather than just a list of bare links. Adding text to this page provides the spiders with more valuable content to index and more clues as to what your content is about.
As you have seen, different Web sites have different versions of site maps that might list product categories, services, or anything else that appears on your Web site. You can always categorize your pages in an organized manner and display them as a kind of site map. A particular kind of spider path that is a bit different from a site map is a country map.
Many medium-to-large organizations have operations in multiple countries, often requiring them to show similar products and other content in each country using different languages, monetary currencies, and legal terms. These organizations frequently organize their corporate Web domain as a series of country sites. Depending on how the country sites are linked to the main domain, spiders might easily find them all or might be completely stopped.
The simplest techniques work every time. The pet product manufacturer Iams chose the most basic approach: a static HTML page listing every country choice, as shown in Figure 10-10. This basic country map is linked from Iams' worldwide home page, but also from each country home page. This technique ensures that search spiders can access the Iams country sites on a regular basis.
Figure 10-10. A simple HTML country map. Iams' worldwide site map includes search-friendly text links to each of its country Web sites.
Figure 10-11 shows a similar technique used by Castrol. While the country map looks different from that of Iams, it is just as easy to follow for a spider, and equally effective for search marketing.
Figure 10-11. Effective country map techniques. Castrol's worldwide home page provides multiple paths for spiders to reach country sites.
Regardless of what kind of Web site you have, spider paths are an invaluable way to get more pages from your site indexed. Whether you use country maps, site maps, or a related technique, you will provide the spiders with easy access to every page on your site, leaving an escape hatch to avoid those pesky spider traps. Next up, we look at one last way of getting more of your pages in the organic search index: paying your way in.
Use Paid Inclusion
As discussed in Chapter 3, paid inclusion is a technique you can use to get your pages added to the search index, and to get the index updated rapidly every time the content on your pages changes. Not many years ago, almost every major search engine (except Google) had paid inclusion programs, but today only Yahoo! has one (among worldwide search engines). MSN Search withdrew its paid inclusion program in 2004, although some observers believe that MSN will eventually reinstate its program. Despite that trend, most experts believe that paid inclusion will grow. JupiterResearch, for example, projects the current $110 million market will surpass $500 million by 2008.
There are two related types of paid inclusion programs:
Yahoo! offers both kinds of paid inclusion programs, called Site Match and Site Match Xchange. Site Match is a single URL submission program, designed for submitting fewer than 1,000 URLs. Site Match Xchange handles more, allowing you to provide either a trusted feed or a single URL from which the spider can crawl all of your pages. (Remember, if you opt to provide a single URL, such as a site map, to be crawled by the spider, you must be sure that your site is free of the spider traps listed earlier in the chapter, whereas trusted feeds avoid these spider problems.)
Both single URL submission and trusted feed programs have similar cost structures, although the actual prices might differ. You should expect the following costs for both kinds of programs:
Taking the Yahoo! program as an example, we see that Site Match (the single URL submission program) charges an annual fee based on the number of URLs submitted, as shown in Table 10-3. Site Match subscribers also pay a fixed cost per click for each searcher choosing their page (with no cost per action). Most content categories are charged at 15¢ per click, although selected categories are priced at 30¢ each.
Turning to trusted feed programs, we see that Site Match Xchange is open to search marketers submitting more than 1,000 URLs or spending more than $5,000 per month. The per-click fee is the same as for Site Match, and there is no annual fee.
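The pricing just described is easy to model as back-of-the-envelope arithmetic: a fixed per-click fee (15¢ or 30¢ depending on category) plus, for the single URL program, an annual fee. The annual-fee figure in the usage line below is a placeholder for illustration, not Yahoo!'s actual Table 10-3 rate.

```python
# Rough cost model for the paid inclusion pricing described above.

def monthly_inclusion_cost(clicks, cpc_cents=15, annual_fee=0.0):
    """Estimated monthly cost in dollars: per-click fees plus
    one-twelfth of any annual URL fee."""
    return clicks * cpc_cents / 100 + annual_fee / 12

# 2,000 clicks a month in a 15-cent category, no annual fee:
print(round(monthly_inclusion_cost(2000, 15), 2))  # 300.0

# 1,000 clicks in a 30-cent category with a hypothetical $120/year fee:
print(round(monthly_inclusion_cost(1000, 30, annual_fee=120), 2))  # 310.0
```

Running the numbers this way before signing up helps you see whether a page's conversion value justifies its per-click cost, a point the tips later in this section return to.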
Paid Inclusion Can Make Your Life Easier
Paid inclusion can improve your organic search marketing in several ways if you have the budget to pay for it, including the following:
How to Get Started with Paid Inclusion
Signing up with Yahoo! for Site Match (the single URL submission program) is very simple, but implementing trusted feeds for Yahoo! and for shopping search engines takes quite a bit more work.
Site Match submission requires just one step: filling out the submission form shown in Figure 10-12. All you need to do is enter the URLs for each page that you want included, along with the subject category of your site; the category chosen determines whether you are charged 15¢ or 30¢ per click.
Figure 10-12. Submitting to Site Match. The Overture (now Yahoo!) low-volume paid submission program walks you through its simple steps.
Reproduced with permission of Yahoo! Inc. © 2005 by Yahoo! Inc. YAHOO! and the YAHOO! logo are trademarks of Yahoo! Inc.
Trusted feeds take a bit more work, as you might expect. Your programmers must create a file containing the content you want included and send that to the search engine. Different search engines accept different formats, ranging from CSV (comma-separated values) files to Microsoft Excel spreadsheets to custom XML. (As mentioned earlier, some search engines will also crawl your site, but then you need to do all the work of removing spider traps and creating spider paths as described earlier in this chapter.) Just about every search engine accepts XML format, which is the cheapest to maintain in the long run. (XML is a markup language similar to HTML that allows tags to be defined to describe any kind of data you have, making it very popular as a format for data feeds.)
What data you must put in your feed depends on the search engine you are sending it to, because each engine has different data requirements. For example, Yahoo! requires the title, description, URL, and other text from the page. Shopping search engines typically expect the price, availability, and features of your products, in addition to the product's name and description. Most data feeds include some or all of these items:
Inktomi, now owned by Yahoo!, pioneered the concept of feeding large amounts of data from commerce Web sites directly into its search index. Inktomi defined a custom XML format for supplying documents named IDIF (Inktomi Document Interchange Format), which is still used by Yahoo! today and is depicted in Figure 10-13.
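Producing a feed file like the ones described above is straightforward with any XML library. The element names in this sketch are made up for illustration; the real schema (Yahoo!'s IDIF, or a shopping engine's required fields) is defined by each search engine you feed.

```python
# Sketch of generating a trusted-feed XML file from product records.
# Field and element names are hypothetical, not the IDIF schema.

import xml.etree.ElementTree as ET

def build_feed(records):
    """records: list of dicts with page data; returns the feed as XML text."""
    feed = ET.Element("documents")
    for rec in records:
        doc = ET.SubElement(feed, "document")
        for field in ("url", "title", "description", "price"):
            if field in rec:
                ET.SubElement(doc, field).text = str(rec[field])
    return ET.tostring(feed, encoding="unicode")

records = [{"url": "http://example.com/widget",
            "title": "Deluxe Widget",
            "description": "A sturdy widget.",
            "price": "19.99"}]
print(build_feed(records))
```

Because the feed bypasses the spider entirely, the same publishing discipline applies as with IP delivery: regenerate and resend the feed whenever the pages change, so the indexed text never drifts from what visitors see.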
Figure 10-13. Sample trusted feed. To use the Yahoo! trusted feed program, you must regularly send them an XML file containing your data.
Making the Most of Paid Inclusion
It's not all that complicated to get started with paid inclusion, especially if you start with a single URL submission program, but most medium-to-large sites probably need to use trusted feed programs. You also need to use trusted feeds to send your data to shopping search engines, because they cannot be fed any other way. Trusted feeds are a bit more complex to set up, as you have seen, so you want to make sure you get the most out of them and that you avoid any pitfalls along the way. Here are some tips to make your paid inclusion program a success:
Paid inclusion, especially in the form of trusted feeds, can require some work upfront, but it can pay off handsomely when executed properly. If your site would benefit from sales from shopping search engines, or you need to boost the number of pages indexed in Yahoo!, paid inclusion could be the extra organic lottery ticket it makes sense to buy.