Now that you have a better idea of what Google is and what it does, let's take a look at how it does what it does, and in particular how a Google search works. There's a lot of sophisticated technology behind even the simplest search.

How a Typical Search Works

The typical Google search takes less than half a second to complete. That's because all the searching takes place on Google's own servers. That's right; you may think you're searching the Web, but in effect you're searching a huge index of websites stored on Google's servers. That index was created previously, over a period of time; because you're only searching a server, not the entire Web, your searches can be completed in the blink of an eye.

Note: Google's servers are actually midpriced personal computers, just like the kind you have on your desktop. Google uses approximately 10,000 of these PCs, all of which run the Linux operating system.

Google uses three types of servers: web servers (which host Google's public website), index servers (which hold the searchable index to the bigger document database), and document servers (which house copies of all the individual web pages in Google's database).

So what happens when you enter a query into the Google search box? It's a process that looks something like this:
Of course, you're unaware of all this behind-the-scenes activity. You simply type your query into the search box on Google's main web page, click the Google Search button, and then view the search results page when it appears. All the shuffling of data from server to server is invisible to you.

Note: Google's document servers store the full text of each web page in the Google database. Snippets of each page are extracted to create the page listings on Google's search results pages. In addition, these stored documents provide the cached pages that are linked to from the search results page.

How Google Builds Its Database and Assembles Its Index

At the heart of Google's search system is the database of web pages stored on Google's document servers. These servers hold literally billions of individual web pages: not the entire Web, but a good portion of it.

How does Google determine which web pages to index and store on its servers? It's a complex process with several components.

First and foremost, most of the pages in the Google database are found by Google's special spider software. This is software that automatically crawls the Web, looking for new and updated web pages. Google's crawler, known as GoogleBot, not only searches for new web pages (by exploring links to other pages on the pages it already knows about), it also re-crawls pages already in the database, checking for changes and updates. A complete re-crawling of the web pages in the Google database takes place every few weeks, so no individual page is more than a few weeks out of date.

The GoogleBot crawler reads each page it encounters, much like a web browser does. It follows every link on every page until all the links have been followed. This is how new pages are added to the Google database: by following those links GoogleBot hasn't seen before.

Note: GoogleBot is smart about how it updates the Google database. Web pages that are known to be frequently updated are crawled more frequently than other pages. For example, pages on a news site might be crawled hourly.

The pages discovered by GoogleBot are copied verbatim onto Google's document servers, and copied over each time they're updated. These web pages are used to compile the page summaries that appear on search results pages; they can also be viewed in their entirety when you click the Cached link in the search results. (These cached pages are a good way to view older versions of pages that have recently changed or been deleted.)
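The link-following behavior described above can be illustrated with a small sketch. This is not GoogleBot's actual code, which isn't public; it is a minimal breadth-first crawl over a made-up, in-memory "web". The names crawl, fetch_links, and TOY_WEB are hypothetical, and a real crawler would download pages over the network rather than look them up in a dictionary.

```python
from collections import deque


def crawl(seed_urls, fetch_links):
    """Visit every page reachable from the seeds, following each link once.

    fetch_links is a hypothetical callback that returns the list of links
    found on a page; here it stands in for downloading and parsing a page.
    """
    seeds = list(dict.fromkeys(seed_urls))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque(seeds)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)  # a real crawler would store the page copy here
        for link in fetch_links(url):
            if link not in seen:  # only follow links not seen before
                seen.add(link)
                queue.append(link)
    return visited


# A toy web: each page maps to the pages it links to (all names invented).
TOY_WEB = {
    "home": ["news", "about"],
    "news": ["story1", "home"],
    "about": ["home"],
    "story1": [],
}

visited = crawl(["home"], lambda url: TOY_WEB.get(url, []))
print(visited)
```

Starting from a single known page, the crawl discovers the rest of the toy web purely by following links, which is the same idea the text describes for finding new pages.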
In order to search the Google database, Google creates an index to all the stored web pages. This search engine index is much like the index found in the back of this book; it contains a list of all the important words used on every stored web page in the database. Once the index has been compiled, it's easy enough to search for a particular word and get back a list of all the web pages on which that word appears.

And that's exactly how the Google index and database work to serve your search queries. You enter one or more words in your query, Google searches its index for those words, and then those web pages that contain those words are returned as search results. Fairly simple in concept, but much more complex in execution, especially since Google is indexing all the words on several billion web pages.

How Google Ranks Its Results

Searching the Google index for all occurrences of a given word isn't all that difficult, especially with the computing power of 10,000 PCs driving things. What is difficult is returning the results in a format that is usable by and relevant to the person doing the searching. You can't just list the matching web pages in random order, nor is alphabetical or chronological order all that useful. No, Google has to return its search results with the most important or relevant pages listed first; it has to rank the results for its users.

How does Google determine which web pages are the best match to a given query? I wish I could give you all the details behind the scheme, but Google keeps this core methodology under lock and key; this methodology is what makes Google the most effective search engine on the Web today.

Even with all this secrecy, Google does provide some hints as to how its ranking system works. There are three components to the ranking:
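As an aside on the word index described earlier in this section, the idea can be sketched in a few lines. This is a simplified illustration, not Google's implementation: it indexes every word rather than only the "important" ones, it does no ranking, and the function names build_index and search and the sample pages are all invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page addresses on which it appears."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Invented sample pages standing in for the stored document copies.
pages = {
    "a.example/1": "Google search tips",
    "a.example/2": "How search engines rank pages",
    "a.example/3": "Google ranks pages",
}
index = build_index(pages)
print(search(index, "search"))
print(search(index, "google pages"))
```

Because the index is built ahead of time, answering a query is a matter of looking up each word and intersecting the page sets, rather than rereading every stored page.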
Although the other factors are important, PageRank is the secret sauce behind Google's page rankings. The theory is that the more popular a page is, the higher that page's ultimate value. While this sounds a little like a popularity contest (and it is), it's surprising how often this approach delivers high-quality results.

The actual formula used by PageRank (called the PageRank Algorithm) is super-duper top-secret classified, but by all accounts it's calculated using a combination of quantity and quality of the links pointing to a particular web page. In essence, the PageRank Algorithm considers the importance of each page that initiates a link, figuring (rightly so) that some pages have greater value than others. The higher the PageRank of the pages pointing to a given page, the higher the PageRank will be of the linked-to page. It's entirely possible that a page with fewer, higher-ranked pages linking to it will have a higher PageRank than a similar page with more (but lower-ranked) pages linking to it.

The PageRank factor on the linking page is also affected by the number of total outbound links on that page. That is, a page with a lot of outbound links will contribute a lower PageRank to each of its linked-to pages than will a page with just a few outbound links. As an example, a page with PageRank of PR8 that has 100 outbound links will boost a linked-to page's PageRank less than a similar PR8 page with just 10 outbound links.

It's important to note that Google's determination of a page's rank is completely automated. There is no human subjectivity involved, and no person or company can pay to increase the ranking of their listings. It's all about the math.

Note: PageRank is page specific, not site specific. This means that the PageRank of the individual pages on a website can (and probably will) vary from page to page.
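The two properties described above (links from higher-ranked pages count for more, and a page's influence is split among its outbound links) can be illustrated with a toy calculation. This is not Google's actual formula, which the text notes is kept secret. It is a minimal sketch of a simple iterative link-scoring scheme; the damping value of 0.85, the iteration count, the even handling of pages with no outbound links, and the sample pages are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages from a dict mapping each page to the pages it links to.

    Each page passes a share of its current score to every page it links
    to; the share shrinks as the number of outbound links grows.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly among its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Invented link graph: "hub" is linked to by three other pages.
web = {
    "hub": ["a", "b", "c"],
    "a": ["target"],
    "b": ["hub"],
    "c": ["hub"],
    "target": ["hub"],
}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph, "hub" passes only a third of its score to each of the three pages it links to, while "a" passes its whole score to "target" through its single link, which mirrors the outbound-link effect the text describes.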