Analyzing the Content

Now that you see how spiders find pages on the Web, it's time to see what search engines do with all those pages. The first thing that you will find is that not every document in the search index is an HTML-coded Web page.

Converting Different Types of Documents

Up until now, we have assumed that all Web pages are made of HTML, but many are not. Modern search engines can analyze Adobe Acrobat (PDF) files and many other kinds of documents. Trusted feeds, in particular, tend to use their own formats.

When search engines come across non-HTML documents, they convert them to the standard format that they use to store all their other documents. For simplicity's sake, we examine the rest of the text analysis work as if all the documents are coded in HTML, but you know now that it is a bit more complicated than that.

Deciding Which Words Are Important

If you take a look at the average Web page, you will see a lot more than just the text that appears on the screen. If you view the HTML source, in fact, most of what you see is markup, or HTML tags. Because you do not want the names of these tags found when you search, you might imagine that search engines throw them away, but they do not. They use the markup to help them analyze the text.

When you look at a Web page on your screen using your browser, some words stand out more than others. Some words are in color or bold type; others are set in a larger size; some are set apart as headings. Also, because most Web pages are written in "newspaper style," the most important information tends to come near the top of the page.

As discussed previously in this chapter, search engines realize that emphasized words and words near the beginning of the page are more important than the rest of the words on that same page. This is the step of the index building process in which search engines decide which placements of words make them more important than others. In Chapter 12, we show you how to use this information to your advantage as you create and edit your own Web pages.
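To make the idea concrete, here is a minimal sketch in Python of how markup and position might feed into word importance. It is an illustration only, not how any particular search engine works: the tag weights, the 20-word "near the top" cutoff, and the names WordWeigher and weigh_words are all assumptions invented for this example. For brevity, the sketch also treats script and style contents as ordinary text.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re

# Hypothetical weights, chosen only for illustration.
EMPHASIS_WEIGHTS = {"h1": 3.0, "h2": 2.0, "b": 1.5, "strong": 1.5, "em": 1.5}
EARLY_WORD_COUNT = 20  # how many words count as "near the top" (assumed)
EARLY_BONUS = 1.0      # extra score for those early words (assumed)


class WordWeigher(HTMLParser):
    """Scores each word by the emphasis tags around it and by its position."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # emphasis tags that are currently open
        self.words_seen = 0   # running count of words read so far
        self.scores = defaultdict(float)

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recently opened tag of this name, if there is one.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        # The markup is not indexed as text; it only sets the weight.
        weight = max((EMPHASIS_WEIGHTS[t] for t in self.open_tags), default=1.0)
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            bonus = EARLY_BONUS if self.words_seen < EARLY_WORD_COUNT else 0.0
            self.scores[word] += weight + bonus
            self.words_seen += 1


def weigh_words(html):
    """Return a dict mapping each word on the page to its total score."""
    parser = WordWeigher()
    parser.feed(html)
    parser.close()
    return dict(parser.scores)
```

Calling weigh_words("<p>plain <b>bold</b></p>") gives "bold" a higher score than "plain," because both words are near the top of the page but only one of them is emphasized.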

Spotting Words You Don't Normally See

Some of the most important tags are ones that you do not usually see. Because search engines see the actual HTML code, they can learn things about the page that you would never notice unless you viewed the HTML source yourself. These tags that contain information about the page are often called metatags.

The most important metatag is the title tag, but the title tag might not do what you expect. The words at the top of the Web page (the ones that your eye tells you make up the title) are probably generated by a heading tag or by an image. The actual HTML title tag shows up in the title bar of the browser window, as shown in Figure 2-8. (The words coded in the title tag also appear as the name of the page when you bookmark it or save it as a favorite.)

Figure 2-8. How titles are used. You can see titles in several places if you look hard, but the search results page is where it matters the most.

Even though you often do not notice the title, search engines know that this tag provides a lot of information about your page. The theory is that the title contains the words that best describe the page. Search engines pay special attention to the words you use in your title tag, as mentioned in the ranking discussion earlier in this chapter. Moreover, search engines display the title when they show your page in the search results. Searchers use the title as a big factor in deciding whether to click through to your page.

Another important metatag stores the description of the page. As with the title, search engines expect that the words in your description summarize your page, and some search engines give words found in your description special importance. Unlike the title, however, the description is rarely seen by searchers. In the past, many search engines displayed your page's description right under its title in the results list, but few do that today. Therefore, the description tag is less important than it once was. Chapter 12 covers metatags in detail.
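If you want to see these two tags the way a program sees them, the short Python sketch below pulls the title and the description out of a page's HTML source. It uses only the standard library's html.parser module; the names HeadReader and read_head are made up for this example, and real search engines undoubtedly do far more than this.

```python
from html.parser import HTMLParser


class HeadReader(HTMLParser):
    """Collects the text of the title tag and the description metatag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # The description lives in <meta name="description" content="...">.
            if (attr.get("name") or "").lower() == "description":
                self.description = attr.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def read_head(html):
    """Return (title, description); description is None if the tag is absent."""
    reader = HeadReader()
    reader.feed(html)
    reader.close()
    return reader.title.strip(), reader.description
```

Notice that a heading tag in the body of the page has no effect on the result. Only the words coded in the title tag count as the title, no matter what appears largest on the screen.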

Deducing Information from the Page

Search engines also analyze the page to figure out things that are not coded in the HTML. One thing that almost every search engine figures out is the language of the text. Search engines examine the beginning of the page and recognize that the words are from a certain spoken language, such as French or Korean. This recognition helps the search engine to limit its results to pages that are in the language the searcher understands.
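One simple way to picture language recognition is to count how many common words from each language appear at the beginning of the page. The Python sketch below does exactly that. It is a toy under stated assumptions: the word lists are tiny and hand-picked, the 50-word sample size is arbitrary, the name guess_language is invented here, and this approach would not work for a language such as Korean, where a real system would also look at the characters themselves.

```python
import re

# Tiny, hypothetical word lists; a real system would use far more evidence.
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "that"},
    "french": {"le", "la", "les", "et", "de", "des", "est", "un", "une"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine"},
}


def guess_language(text, sample_words=50):
    """Guess the language from the first few words; None if nothing matches."""
    # Keep only runs of letters, and look only at the beginning of the text.
    words = re.findall(r"[^\W\d_]+", text.lower())[:sample_words]
    counts = {
        language: sum(word in stops for word in words)
        for language, stops in STOPWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

For a sentence such as "Le zèbre est un animal de la savane et des plaines," the French list matches seven words and the other lists match none, so the function returns "french."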

Some search engines also deduce other things that are not explicitly coded on the page. Ask Jeeves, for example, algorithmically analyzes the words and links on every page to determine each page's communities, recognizing that pages about a certain subject, such as woodworking or car repair, tend to use similar words and form a commonly linked community. Ask Jeeves uses this information to home in on the pages considered expert within the community of each search query, believing this improves the search results.

More and more, the secret sauce of search engines is composed of special text analytics such as Ask Jeeves communities, where the search engine deduces information about your pages that was not there when you coded them.

What Search Engines Don't See

But as smart as search engines sometimes seem to be, it is striking how much they miss. The biggest misses are the pictures. Search engines read and understand text of any kind, and as you have read, they even deduce information beyond what is encoded in the text.

But pictures have no meaning to search engines. Although a person can look at a picture and immediately recognize that it is a zebra, a search engine cannot make any sense of the pattern in that image file. Some search engines, such as Google, can find zebra images through tricky use of text, such as noticing that the image file is named zebra.gif or that some text associated with the image contains the word zebra, as shown in Figure 2-9.

Figure 2-9. How search engines "see" pictures. You recognize a picture when you see it, but search engines see only the text associated with the image.

In fact, one way to think about search engines is that they use the Web the way sight-impaired people do. Blind Web users employ software called screen readers that literally read the text on the screen out loud to them, using the computer's speaker. Screen readers can speak any text, but they have nothing to say when confronted with a picture, any picture, even a "picture" of text.

Search engines suffer from a similar "blindness." This is an important reason not to use images for display text, the large titles that often occur at the top of the page. Even though sighted visitors to the page can easily read the words displayed from the image, search engines cannot, as shown in Figure 2-10. (Another important reason to avoid images containing text is that screen readers cannot read them to sight-impaired Web users.)

Figure 2-10. How search engines miss words. You can read the text as it is shown on the screen, but search engines only see the image tag's text.

These examples show how important it is for you to use alternate text that strongly describes all of your images so that search engines (and sight-impaired readers) understand them as well as possible.
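To check what text a program can associate with your images, you can list the words in each image's file name and alternate text. The Python sketch below is one hedged way to do that with the standard library. The names ImageTextCollector and image_text are invented for this example, and actual search engines may also draw on other nearby text that this sketch ignores.

```python
from html.parser import HTMLParser
import posixpath
import re
from urllib.parse import urlparse


class ImageTextCollector(HTMLParser):
    """Gathers the only words a text-based program can tie to each image."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src") or ""
        # Words from the file name, such as "zebra" in /photos/zebra.gif.
        filename = posixpath.basename(urlparse(src).path)
        stem = posixpath.splitext(filename)[0]
        name_words = re.findall(r"[a-z0-9]+", stem.lower())
        # Words from the alternate text, if the page author supplied any.
        alt_words = re.findall(r"[a-z0-9]+", (attr.get("alt") or "").lower())
        self.images.append({"src": src, "words": name_words + alt_words})


def image_text(html):
    """Return one entry per image tag, with the words associated with it."""
    collector = ImageTextCollector()
    collector.feed(html)
    collector.close()
    return collector.images
```

An image coded with a descriptive file name and alternate text yields several useful words, whereas an image named banner01.png with no alternate text yields only "banner01," regardless of the words a sighted visitor can read in the picture.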

