Section 8.1. Working with the Bot | The Truth About Search Engine Optimization

8.1. Working with the Bot

To state the obvious, before your site can be indexed by a search engine, it has to be found by the search engine. Search engines find web sites and web pages using software that follows links to crawl the web. This kind of software is variously called a crawler, a spider, a search bot, or simply a bot (bot is a diminutive for "robot").

To be found quickly by a search engine bot, it helps to have inbound links to your site. More important, the links within your site should work properly. If a bot encounters a broken link, it cannot reach, or index, the page pointed to by the broken link.
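As a rough illustration of checking that internal links work, here is a minimal Python sketch. It assumes you already have your pages in memory as a dictionary mapping URL paths to HTML; the names LinkCollector and find_broken_links, and the sample paths, are hypothetical and not from any search engine's tooling. It only checks links within the site and skips external ones.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def find_broken_links(site):
    """Return (page, target) pairs whose target is not a known page.

    `site` maps a URL path such as "/index.html" to that page's HTML
    (an assumed input format for this sketch).
    """
    broken = []
    for page, html in site.items():
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links against the page that contains them.
            parts = urlparse(urljoin(page, href))
            if parts.scheme or parts.netloc:
                continue  # external or non-HTTP link, not checked here
            if parts.path not in site:
                broken.append((page, parts.path))
    return broken


# Hypothetical two-page site with one dead internal link.
site = {
    "/index.html": '<a href="about.html">About</a> <a href="/missing.html">Gone</a>',
    "/about.html": '<a href="/index.html">Home</a>',
}
print(find_broken_links(site))
```

A real check would fetch pages over HTTP and look at the response status; this version only shows the idea of following every link a bot would follow.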

Note: As I explained earlier, you don't have to wait for the bot to find your site on its own if you list your site (manually or using a submission tool) with the important search engines. However, even if you list your site, it's still important that it be "bot friendly."

8.1.1. Images

Pictures don't mean anything to a search bot. The only information a bot can gather about a picture comes from the alt attribute used within the picture's <img> tag, and from the text surrounding the picture. Therefore, always take care to provide descriptive information via the <img> tag's alt attribute along with your images, and to provide at least one text-only (e.g., outside of an image map) link to every page on your site.
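One way to audit a page for this is sketched below in Python, using only the standard library. The class and function names (AltChecker, images_missing_alt) and the sample markup are made up for illustration; the sketch simply lists the src of every <img> tag whose alt attribute is absent or blank.

```python
from html.parser import HTMLParser


class AltChecker(HTMLParser):
    """Records the src of each <img> tag that lacks useful alt text."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = attributes.get("alt")
        # Treat a missing, valueless, or whitespace-only alt as missing.
        if alt is None or not alt.strip():
            self.missing.append(attributes.get("src", "(no src)"))


def images_missing_alt(html):
    """Return the src values of images a bot could learn nothing from."""
    checker = AltChecker()
    checker.feed(html)
    return checker.missing


# Hypothetical page fragment: only the first image is described.
page = '<img src="a.png" alt="Red bicycle"><img src="b.png"><img src="c.png" alt=" "/>'
print(images_missing_alt(page))
```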

8.1.2. Links

Certain kinds of links to pages (and sites) simply cannot be traversed by a search engine bot. The most significant issue is that a bot cannot log in to your site. (This is probably a very good thing, or we'd all be in big trouble!)

So if a site or page requires a user name and a password for access, then it probably will not be included in a search index.

Don't be fooled by seamless page navigation using such techniques as cookies or session identifiers. If an initial login was required, then these pages probably cannot be accessed by a bot.

Complex URLs that involve a script can also confuse the bot (although only the most complex dynamic URLs are absolutely non-navigable). You can generally recognize this kind of URL because a ? is included following the script name. Pages reached with this kind of URL are dynamic, meaning that the content of the page varies depending upon the values of the parameters passed to the script that generates the page (the name of the script, often code written in PHP, comes before the ? in the URL).

You can try this example by comparing the two URLs to see for yourself the difference a changed parameter makes!

Dynamic pages opened using scripts that are passed values are too useful to avoid. Most search engine bots can traverse dynamic URLs provided they are not too complicated. But you should be aware of dynamic URLs as a potential issue with some search engine bots, and try to keep these URLs as simple (using as few parameters) as possible.
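To get a feel for how many parameters your dynamic URLs carry, a small Python sketch like the following may help. The example.com URLs are invented, and the threshold of two parameters is an arbitrary assumption for illustration, not a limit published by any search engine.

```python
from urllib.parse import parse_qsl, urlparse


def count_parameters(url):
    """Count the name=value pairs that follow the ? in a URL."""
    query = urlparse(url).query
    return len(parse_qsl(query, keep_blank_values=True))


def flag_complex_urls(urls, max_params=2):
    """Return the URLs carrying more than max_params parameters.

    max_params=2 is an assumed cutoff chosen only for this example.
    """
    return [url for url in urls if count_parameters(url) > max_params]


# Hypothetical URLs for a PHP-driven catalog.
urls = [
    "http://www.example.com/about.html",
    "http://www.example.com/catalog.php?item=42",
    "http://www.example.com/catalog.php?item=42&sort=price&page=3&sid=abc",
]
for url in flag_complex_urls(urls):
    print(count_parameters(url), url)
```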

8.1.3. File Formats

Most search engines, and search engine bots, are capable of parsing and indexing many different kinds of file formats. For example, Google indexes file types including: pdf, asp, jsp, html, shtml, xml, cfm, php, doc, xls, ppt, rtf, wks, lwp, wri, and swf.

However, simple is often better. To get the best search engine placement, you are well advised to keep your web pages, as they are actually opened in a browser, to straight HTML.

Note: Even though a file opens as straight HTML in a browser, it can be generated using a server-side script. How a file was created is essentially irrelevant to the search engine, which cares only about the actual file that is browsed. So to see what the search engine will index, check the source as shown in a browser rather than the script file used to generate the dynamic page.

Google puts the "simple is best" idea this way: "If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site." The only way to know for sure whether a bot will be unable to crawl your site is to check your site using an all-text browser.

8.1.4. Viewing Your Site with an All-Text Browser

Improvement implies a feedback loop: you can't know how well you are doing without a mechanism for examining your current status. The feedback mechanism that helps you improve your site from an SEO perspective is to view your site as the bot sees it. This means viewing the site using a text-only browser. A text-only browser, just like the search engine bot, will ignore images and graphics, and only process the text on a page.
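The idea of a text-only view can be approximated with a few lines of Python. This is only a sketch under stated assumptions, not a substitute for a real text browser: the TextOnly class and text_view function are hypothetical names, script and style content is dropped, and an image is shown as its alt text in square brackets (an assumption made here so that described images remain visible).

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keeps the readable text of a page and discards everything else."""

    SKIPPED = {"script", "style"}  # content a reader never sees as text

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            # An image contributes only its alt text, if it has any.
            alt = dict(attrs).get("alt")
            if alt:
                self.chunks.append("[" + alt + "]")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def text_view(html):
    """Return the page's text with whitespace collapsed to single spaces."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Hypothetical page: the second image has no alt text and vanishes.
page = (
    "<html><head><title>Shop</title><style>p{color:red}</style></head>"
    '<body><h1>Bikes</h1><img src="b.png" alt="Red bicycle">'
    '<img src="c.png"><script>var x=1;</script><p>Fast  delivery</p></body></html>'
)
print(text_view(page))
```

Anything that does not appear in the output of a view like this is, roughly speaking, content you should not count on a bot seeing.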

The best-known text-only web browser is Lynx. You can find more information about Lynx at http://lynx.isc.org/. Generally, the process of installing Lynx involves downloading source code and compiling it.

The Lynx site also provides links to a variety of pre-compiled Lynx builds you can download.

Don't want to get into compiling source code, or figuring out which idiosyncratic Lynx build to download? There is a simple Lynx Viewer available on the Web at http://www.delorie.com/web/lynxview.html. You'll need to follow the directions carefully to use it. Essentially, these instructions involve adding a file to your web site to prove you own the site. The host of the Lynx Viewer is offering a free service and doesn't want to be deluged, but it is not hard to comply.

Using Lynx Viewer, it's easy to see the text that the search bot sees when you are not distracted by the "eye candy" of the full image version.

Tip: Users, web designers, and advertisers are often very fond of fancy graphics and graphic effects on the Web. But bear in mind that search engines are almost exclusively concerned with words (meaning plain text rather than pictures). Therefore, your SEO efforts should be focused on text, not the "look and feel" of a site. When practicing SEO, stick to word-craft, and don't get sidetracked by irrelevant visual issues.