Choosing a Communication Method


All of the examples to this point in the book have relied on some form of communication to achieve their goals. In fact, every Google Web Services application you create will include some type of communication with the remote server unless that application relies on static data. Even then, you need to consider significant licensing issues for updates because the updated data will have to come from some source. For example, a PDA could obtain updates from a local desktop, which avoids having the PDA connect to Google Web Services, but the desktop will still need some source of updated information (usually a direct connection).

The following sections discuss the design issues surrounding the various communication choices you have. You'll find that Google Web Services isn't very flexible, which means you must exercise care in choosing a communication option. In most cases, your only choice is to use SOAP because that's what Google supports natively.

Understanding That Google Only Directly Supports SOAP

The only form of communication that Google supports directly, at least at the moment, is SOAP. Unlike other Web services, you can't use techniques such as XML over HTTP (also called REST, REpresentational State Transfer) or XML-RPC (eXtensible Markup Language Remote Procedure Call) with the current setup. However, this situation could (and probably will) change in the future. Given that Google only supports this one method of communication directly, most developers will use it rather than develop an alternative that could cause problems later.

Note  

Most JavaScript applications require a separate SOAP library such as SOAPlite. However, Mozilla users can rely on the built-in SOAP support provided by their browser. This chapter doesn't discuss use of the built-in SOAP support. However, you can find the technique described and demonstrated on scottandrew.com at http://www.scottandrew.com/weblog/googleapi.

The Web Services Description Language (WSDL) and sample SOAP files provided with the kit show how the SOAP requests work to an extent. The WSDL file is the most reliable source, but it's completely undocumented, so you're left to figure out what each entry means. The sample SOAP files are easier to understand, but contain inaccuracies. For example, unless you read the README.TXT file that appears in the \GoogleAPI folder, rather than the SAMPLES-README.TXT file that appears with the samples in the \GoogleAPI\soap-samples folder, you won't know that Google ignores the ie and oe arguments. In this case, the sample SOAP files would lead you to believe that using the ie and oe arguments is perfectly acceptable. Unfortunately, the Google Web API Reference doesn't document these arguments clearly for a search request and doesn't include any documentation for the cached page or spelling requests. Consequently, the following sections describe these calls in detail.

Defining the Search Request Arguments

The search request asks Google for a list of links based on the search criteria you provide. Defining a complete and exact search request is so important that all of Chapter 2 focuses on this topic. However, the search criteria are just one element of the search request. The following list defines each of the search request arguments. Your application must provide these arguments in order as part of the search request. Otherwise, Google won't honor the request.

key Every request you make requires the license key you obtained from Google. When you make a request without the license key or using the key found in the Google examples, you'll receive an invalid authorization key message. Google uses a string of zeros (00000000000000000000000000000000) as the sample key; this key looks nothing like the actual key. The examples in this book use "Your-License-Key" as the sample key. In both cases, you must replace the sample key with a real key.

q This string argument contains the search request. Tests indicate that you can make search requests of any length and Google will honor them. However, Google only checks for the first 10 search terms; it ignores the remaining search terms and doesn't raise an error. Even limiting your search to 10 terms means that you can be quite specific in requesting what you want or you can make a general query and filter the data locally. The one caveat you do need to observe is that complex search specifications tend to reduce the number of results to the point that you don't get any results at all. Smart search techniques make the request specific, without attempting to locate the one result that perfectly matches a need. See Chapter 2 for a complete discussion of search request elements.

start This numeric argument contains the 0-based starting point for the search. You must couple this starting point with the number of results you request, with 10 results being the maximum. Consequently, if you want to view the third set of results and you request 10 results for each set, you would set this argument to 20 (the first set starts at 0 and the second at 10). An odd problem can occur when working with Google Web Services, however. Although the results you receive are sequential, they aren't necessarily complete. The actual results might start at 1 or skip a similar result. This means you can't use a strict starting point; you must base the starting point on the returned values. The "Defining the Search Results" section describes this issue in greater detail.

maxResults Google lets you request a maximum of 10 results. However, you can specify fewer than that when you only need a few results. In addition, you may receive fewer results if the search criteria are strict enough. The benefit of requesting fewer results is that you get the response faster, so your application performs better. However, if you plan to request additional results anyway, it's probably better to request the maximum number of results and cache the additional results locally.

filter The documentation doesn't make this particular argument very clear. You provide true or false as the input values, not the values described in the Automatic Filtering section of the Google Web API Reference. Turning filtering on means that Google looks for results that have the same title and snippet and removes them from the result set. The Web service only returns the first result and eliminates the others. In addition, Google only returns the first two results from a particular Web host. Filtering means that you have to select the start argument value carefully because Google will leave out some of the results.

restrict This argument restricts the results you receive to a particular country of origin. Don't confuse this argument with a language restriction. For example, when you select the United States as the country of origin, you could still receive pages written in Spanish or German. However, you won't receive results from either Spain or Germany. The Restricts section of the Google Web API Reference contains a chart of country codes you must use for this argument. As with most restrictions, this argument will create holes in the result set and affect the start argument value.

safeSearch Generally, you can use this argument to ensure you don't receive any results with pornographic content. However, the filter doesn't work all the time and some search terms will almost certainly retrieve adult content despite the use of this filter. In addition, as with most filtering, using the safe search feature may mean that you won't see some results, even though they don't contain any pornographic material. Set the argument to true when you want to avoid adult content. Contact Google at safesearch@google.com when you encounter pornographic material that you don't want. Telling Google about the problem will help refine the filter so that others don't encounter the same results.

lr Sometimes you need a result in a specific language. This argument doesn't restrict the country that you get a result from, but it does restrict the language of the result. For example, you could tell Google that you only want results written in Japanese. The Restricts section of the Google Web API Reference contains a chart of language codes you must use for this argument. As with most restrictions, this argument will create holes in the result set and affect the start argument value.

ie This argument is ignored. You still need to provide it as part of the SOAP message, but leave the content blank (an empty string). Google no longer offers input encoding (the use of special character sets); all output appears in 8-bit Unicode Transformation Format (UTF-8) encoding.

oe This argument is ignored. You still need to provide it as part of the SOAP message, but leave the content blank (an empty string). Google no longer offers output encoding.

As you can see, the search request provides a number of ways to restrict the result set in addition to the search criteria. Listing 3.4 provides a simple example of how to create a SOAP search request. The search examples will increase in complexity as the book progresses.
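The argument order described above can be captured in a small helper. The following is a minimal sketch, not code from the Google kit: the names startForPage and buildSearchArgs and the options object are hypothetical, and the defaults chosen here (filtering on, safe search off, no country or language restriction) are assumptions made for illustration. The helper returns the ten arguments in the order listed, rejects the two sample keys, caps maxResults at 10, and sends ie and oe as empty strings.

```javascript
// Hypothetical helper: 0-based start value for a 1-based page number.
function startForPage(pageNumber, pageSize) {
   return (pageNumber - 1) * pageSize;
}

// Hypothetical helper: builds the search arguments in the order the
// text lists them (key, q, start, maxResults, filter, restrict,
// safeSearch, lr, ie, oe).
function buildSearchArgs(options) {
   var sampleKeys = ["00000000000000000000000000000000",
                     "Your-License-Key"];

   // A missing key or one of the sample keys can't work.
   if (!options.key) {
      throw new Error("A license key is required.");
   }
   for (var i = 0; i < sampleKeys.length; i++) {
      if (options.key === sampleKeys[i]) {
         throw new Error("Replace the sample key with a real key.");
      }
   }

   // Google returns at most 10 results for each request.
   var maxResults = Math.min(options.maxResults || 10, 10);

   return [
      options.key,
      options.q,
      options.start || 0,
      maxResults,
      options.filter !== false,      // assumption: filtering on by default
      options.restrict || "",        // assumption: empty means no restriction
      options.safeSearch === true,   // assumption: safe search off by default
      options.lr || "",              // assumption: empty means any language
      "",                            // ie is ignored; send an empty string
      ""                             // oe is ignored; send an empty string
   ];
}
```

For the third set of 10 results, startForPage(3, 10) yields a start value of 20. The array elements would then be passed, in order, to the search method of whatever SOAP client your application uses.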

Defining the Spelling Request Arguments

You can perform spelling checks using Google Web Services. However, simple experiments so far show that the spelling service appears to work only in English. Google will probably fix this limitation in the future. The actual request process is very easy. All you need to supply are the two arguments shown below using the doSpellingSuggestion() method.

key See the key argument explanation in the " Defining the Search Request Arguments" section.

phrase This argument contains a string that you want to check. Google makes a reasonable effort to correct the spelling. However, it won't correct some types of spelling errors. For example, the spelling checker couldn't fix misspellings such as uneke (unique). The probability of getting a completely corrected string also seems to decrease as the length of this argument increases.

Listing 4.1 shows how to make a spelling request using JavaScript. You'll find the complete source for this example in the \Chapter 04\SpellingRequest folder of the source code located on the Sybex Web site.

Listing 4.1: Spelling Request with a Browser

   function CallGoogle()
   {
      // Create the SOAP client.
      var SoapClient = new ActiveXObject("MSSOAP.SoapClient30");

      // Initialize the SOAP client so it can access Google
      // Web Services.
      SoapClient.MSSoapInit("http://api.google.com/GoogleSearch.wsdl",
                            "GoogleSearchService",
                            "GoogleSearchPort");

      // Make a spelling request.
      SubmissionForm.CorrectStr.value =
         SoapClient.doSpellingSuggestion("Your-License-Key",
                                         SubmissionForm.SpellStr.value);
   }
 

The code begins by creating a SOAP client. You must use a SOAP client to communicate with Google Web Services. The client ensures the message is properly formatted and also obtains any return values provided by the server.

Once the code creates the SOAP client, it initializes a connection to Google Web Services. Every application you create using JavaScript includes these two steps. The location of the Web Services Description Language (WSDL) doesn't change, and you'll always use the same service and port entries.

The code makes the SoapClient.doSpellingSuggestion() method call at this point using your license key and input string. Because this method returns a string, you can place it directly in the output label. Figure 4.1 shows typical results from this method.
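The call itself can also be wrapped so that a failed request doesn't leave the form in an odd state. What follows is a minimal sketch rather than part of the book's example: the function name suggestSpelling is hypothetical, the SOAP client is passed in as a parameter (any object that exposes doSpellingSuggestion() will do), and falling back to the original phrase when the call fails or returns nothing is an assumption about what your application wants.

```javascript
// Hypothetical wrapper around the spelling call. The client is passed
// in so the same function works with any SOAP client object.
function suggestSpelling(soapClient, licenseKey, phrase) {
   try {
      var suggestion = soapClient.doSpellingSuggestion(licenseKey, phrase);

      // Assumption: treat an empty or missing return value as
      // "no suggestion" and keep the user's original text.
      return suggestion ? String(suggestion) : phrase;
   } catch (e) {
      // Connection, license key, or SOAP faults end up here.
      return phrase;
   }
}
```

In Listing 4.1, the assignment to SubmissionForm.CorrectStr.value could receive the return value of this wrapper instead of calling the SOAP client directly.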

Figure 4.1: This example shows typical spelling check results.

Defining the Cache Request Arguments

Google provides access to cached versions of many Web sites. You can use these cached versions for a number of purposes, including obtaining information that no longer appears on a particular site. The following list describes the arguments you supply to Google to retrieve a cached page using the doGetCachedPage() method.

key See the key argument explanation in the " Defining the Search Request Arguments" section.

url This argument contains the URL of the site. You must make sure that Google actually has a cached version of the site using a search call or provide some form of error handling. The return value is a base 64 representation of the Web page. You can find a great description of base 64 encoding at http://www.robertgraham.com/tools/base64coder.html along with a tool you can use to test the results you receive from Google.

Listing 4.2 shows how to obtain the cached page from Google. You'll find the complete source for this example in the \Chapter 04\GetCachedPage folder of the source code located on the Sybex Web site.

Listing 4.2: Cached Page Request with a Browser

   function CallGoogle()
   {
      // Create the SOAP client.
      var SoapClient = new ActiveXObject("MSSOAP.SoapClient30");

      // Initialize the SOAP client so it can access Google
      // Web Services.
      SoapClient.MSSoapInit("http://api.google.com/GoogleSearch.wsdl",
                            "GoogleSearchService",
                            "GoogleSearchPort");

      // Make a cached page request.
      var TheResult =
         SoapClient.doGetCachedPage("Your-License-Key",
                                    SubmissionForm.URLStr.value);

      for (var Counter = 0; Counter < TheResult.Length; Counter++)
      {
         document.write(TheResult[Counter].toString());
      }
   }
 

Unfortunately, this listing points out a problem with JavaScript. The code fails at the for loop because of an inherent limitation in most scripting languages. The return value is a byte array, and the version of JavaScript that comes with Internet Explorer doesn't know how to interact with it. This same limitation occurs with every JavaScript interpreter that follows the ECMAScript standard (http://www.ecma-international.org/publications/standards/Ecma-262.htm). To work with this data, you need to create a special object to interpret the byte array, use an existing object that might not appear on every machine that uses the application, or rely on a third-party nonstandard interpreter such as the NJS JavaScript Interpreter (http://www.bbassett.net/njs/). Because of the issues surrounding this particular request, I recommend that you use a more advanced language that supports byte arrays directly, such as Visual Basic for Applications (VBA), Visual Basic, C#, PHP, or Java (among others).

Note  

ECMA (European Computer Manufacturers Association) is the official group tasked with maintaining the JavaScript (now called ECMAScript) standard. Most versions of JavaScript add extensions to the ECMAScript standard. For example, Microsoft's version, JScript, includes specialized support for the Windows Scripting Engine. You can find a complete list of Microsoft differences at http://www.script-info.net/jsvbs/msscript/js56/js56jsgrpnonecmafeatures.php.

Understanding the Google Data Output

Once you make a request, Google sends a response. Your code must interpret the response and present it to the user. The following sections provide an overview of the responses that Google sends to various requests. The sample code in the remainder of the book helps you explore this topic in greater detail.

Tip  

It's important to remember that Google regularly changes its search algorithms to reduce the risk that some Web sites will receive more representation than they deserve. (There are other reasons for changing the search algorithm, including changes in the type of data provided on the Internet.) According to a Ziff Davis Channel Zone article (http://www.eweek.com/article2/0,4149,1400623,00.asp?kc=EWNWS120203DTX1K0000599), these changes occur regularly and don't always meet with Webmaster expectations. The lesson for anyone using Google Web Services is that you should expect the output of your application to change over time and prepare users for this eventuality. Make sure users don't assume that application output will remain constant.

Defining the Search Results

The search results are the most complex return that Google Web Services provides. The Search Results Format section of the Google Web API Reference discusses these results to an extent. However, it really helps to see a pictorial representation of typical results. Figure 4.2 shows a tree view of the sample SOAP response found in the doGoogleSearchResponse.xml file of the \GoogleAPI\soap-samples folder.

Figure 4.2: Use a tree view of the Google search results to better understand the structure of the data you receive.

You should consider a few coding tricks that the documentation doesn't mention, but that become somewhat obvious when you work with Google Web Services for a while. The first is that the startIndex won't always be the same as the start index that you specified. Google might have removed the first result from the list. Because you can't be sure about the starting index, you should always use the startIndex value when telling the user which results appear on screen.

Another important coding trick is the use of the endIndex. Add 1 to this value and use it as the input to the next request. Otherwise, the user will receive some duplicate search results. For example, consider a case where you request 10 results starting at index 0. Google Web Services does return 10 results, but entries 1, 3, and 5 are missing because they're duplicates, so the last result returned sits at index 12. Consequently, you should use a starting index of 13, rather than 10, for the next request.
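Both tricks fit in a pair of small helpers. This is a minimal sketch that simply follows the rule stated in the text: the function names are hypothetical, and the result parameter stands for a parsed search response that exposes the startIndex and endIndex values as numeric properties (how you obtain that object depends on your SOAP client).

```javascript
// Hypothetical helper: the start value for the next request, using the
// "add 1 to endIndex" rule described in the text.
function nextStart(result) {
   return result.endIndex + 1;
}

// Hypothetical helper: the label to show the user, based on the
// indexes Google actually returned rather than the ones requested.
function describeRange(result) {
   return "Results " + result.startIndex + " through " + result.endIndex;
}
```

For the example above, a response whose endIndex is 12 gives nextStart() a value of 13, so the next request begins after the last result the user has already seen.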

The Google Web API Reference is also unclear about some issues. For example, it never defines ODP (Open Directory Project) or what it means to the user. Consequently, you don't learn about the importance of the directoryCategory element data provided with each item. When an item contains this entry, it tells you that the item also appears on other search engines, such as Netscape Open Directory, Netscape What's Related, Lycos, HotBot, Dogpile, Thunderstone, Mars Society, and Linux.com Links. The additional information is helpful in determining the presence of a particular site. Once you understand ODP, you also understand the difference between a snippet and a summary element. Google always generates the snippet element, while the summary element comes from the ODP database.
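The snippet and summary distinction suggests a simple display rule. The following is a minimal sketch under stated assumptions: the function name describeItem is hypothetical, the item parameter stands for one parsed result element with snippet and summary properties, and the sketch assumes that summary is empty or missing when the item has no ODP entry. Preferring the ODP summary over the snippet is a design choice for illustration, not something the Google documentation prescribes.

```javascript
// Hypothetical helper: picks the text to show for one result item and
// records where that text came from.
function describeItem(item) {
   // Assumption: an item without an ODP entry has no summary text.
   var hasOdpSummary = Boolean(item.summary);

   return {
      text: hasOdpSummary ? item.summary : item.snippet,
      source: hasOdpSummary ? "ODP" : "Google"
   };
}
```

Recording the source lets the application tell the user whether the description was written for the directory or generated by Google from the page content.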

Tip  

You can find a lot of interesting information about ODP online. A history of the project appears at http://www.laisha.com/zine/odphistory.html. One of the main ODP pages is at http://dmoz.org/. This page lets you test ODP out and become part of the project. Suggestions for submitting your site for admittance into the ODP appear at http://www.searchengineworld.com/misc/odp.htm. Finally, you can discuss ODP on the forum at http://www.resourcezone.com/.

Defining the Spelling Results

The spelling request results are the easiest of the responses to interpret. All that Google returns is a simple string that you can display for the user. The string contains corrections for every misspelled word that Google recognizes in the input string. However, you need to consider a few caveats. When Google doesn't recognize a word, it leaves the misspelled version in the return string. Consequently, the user still has to interpret the results. In addition, a user could misspell a word in such a way that Google recognizes it as something else. The return word is correctly spelled, but not the word the user meant to provide.
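Because the user still has to review the result, it can help to show exactly which words Google changed. The following is a minimal sketch of one way to do that; the function names are hypothetical, and the word-by-word comparison assumes that the suggestion contains the same number of words as the input. When the counts differ, the sketch returns null and leaves the comparison to the user.

```javascript
// Hypothetical helper: splits text into words, ignoring extra spaces.
function splitWords(text) {
   var trimmed = text.replace(/^\s+|\s+$/g, "");
   return trimmed === "" ? [] : trimmed.split(/\s+/);
}

// Hypothetical helper: lists the words that differ between the input
// and the suggestion, or returns null when they can't be paired up.
function changedWords(input, suggestion) {
   var before = splitWords(input);
   var after = splitWords(suggestion);

   if (before.length !== after.length) {
      return null;
   }

   var changes = [];
   for (var i = 0; i < before.length; i++) {
      if (before[i].toLowerCase() !== after[i].toLowerCase()) {
         changes.push({ from: before[i], to: after[i] });
      }
   }
   return changes;
}
```

An empty list means Google left every word alone, which covers both correct input and words it didn't recognize. A non-empty list gives the user a chance to reject a correction that isn't the word they meant.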

Defining the Cached Page Results

The cached page results aren't difficult to understand conceptually; all that you get is a base 64 encoded value. Initially, this value is a string of characters that don't bear much of a resemblance to the data. Depending on the SOAP parser you use, the base 64 encoded string could appear as a byte array to the application. In short, the parser performs the essential decoding for you. However, you still can't use the resulting data directly.

Most application programming languages provide some means for working with the byte array. For example, the "Choosing between Current and Cached Data" section of Chapter 10 shows that C# developers can easily convert the byte array to a char array, which then acts as input for a new string. The resulting text is ready to display.
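When the SOAP layer hands the value over as base 64 text rather than as a byte array, the same two steps (decode to bytes, then convert the bytes to a string) look like the following. This is a minimal sketch that assumes a Node.js runtime, where Buffer is a built-in global; it does not apply to the Internet Explorer scripting environment used by the listings in this chapter. The function name decodeCachedPage is hypothetical, and treating the page as UTF-8 text is an assumption, because the cached page may use another character set.

```javascript
// Hypothetical helper: converts the base 64 text of a cached page into
// a displayable string. Assumes Node.js (Buffer is a built-in global).
function decodeCachedPage(base64Text) {
   // Step 1: decode the base 64 text into raw bytes.
   var bytes = Buffer.from(base64Text, "base64");

   // Step 2: interpret the bytes as text.
   // Assumption: the cached page is encoded as UTF-8.
   return bytes.toString("utf8");
}
```

For example, decodeCachedPage("PGh0bWw+SGVsbG88L2h0bWw+") returns the string <html>Hello</html>.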




Mining Google Web Services
Mining Google Web Services: Building Applications with the Google API
ISBN: 0782143334
EAN: 2147483647
Year: 2004
Pages: 157
