Knowing What to Expect as Output | Mining Google Web Services: Building Applications with the Google API

For many developers, the idea of a Web service is easy to grasp ”knowing what to expect from it is hard. The output begins with a certain amount of raw data that you'll receive from the Web service. However, the raw data doesn't really define the Web service output completely. You also need to consider quantifiable components such as the input to the Web service and that data manipulation you'll perform. In addition, there are variant elements to the output, such as the timeliness of the data. Finally, you need to consider the intangible elements. The output has some value to you, but someone else will view the output in another way. Concepts such as relevancy are difficult to quantify or even define.

Google Web Services is no different from any other Web service when it comes to output. You'll provide input, receive raw data, manipulate that data in some way, and view the output ”the result of everything you have done with the Web service. The following sections discuss various elements of Web service output as they relate to Google Web Services. These sections provide an overview ”the book continues to explore the subject in other chapters. However, this is the starting point ”the point at which you start to consider what to expect as output from your efforts.

Limitations of Google Web Services Output

Many developers are used to working with a variety of data types when creating applications. Data types help define the kind of data you're using. For example, if a data element is a number, you might use an integer (a number without a decimal) or real number (one that has a decimal and equates to a Single or Double for Visual Basic developers). A Web service has no concept of data type when it comes to the data itself. Every data transfer is text. The XML used to transfer the data does include type information, but of the sort that's normally associated with database fields, which means you have to know the field names to make an interpretation. For example, you might receive data in a message like the one shown here.

 <item xsi:type="ns1:ResultElement">      <cachedSize xsi:type="xsd:string">12k</cachedSize>      <hostName xsi:type="xsd:string" />      <snippet xsi:type="xsd:string">             <b>...</b> some text <b>highlight</b>) more text <b>...</b>      </snippet>      <directoryCategory xsi:type="ns1:DirectoryCategory">             <specialEncoding xsi:type="xsd:string" />             <fullViewableName xsi:type="xsd:string" />      </directoryCategory>      <relatedInformationPresent xsi:type="xsd:boolean">             True      </relatedInformationPresent>      <directoryTitle xsi:type="xsd:string"/>      <summary xsi:type="xsd:string" />      <URL xsi:type="xsd:string">            http://www.mwt.net/~jmueller      </URL>      <title xsi:type="xsd:string"><b>DataCon Services</b></title> </item>

You don't have to understand the XML portion of this message segment, but look at the data. Google Web Services sends all data as characters (as do all other Web services) and defines the data using tags (the words between the angle brackets) and attributes (extra information within the tag). For example, the line that contains <URL xsi:type= " xsd:string " > http://www.mwt.net/~jmueller </URL> includes the <URL> tag that tells you that this value is http://www.mwt.net/~jmueller and that the tag type is an xsi:type= xsd:string . The tag tells you what kind of information this is. By knowing the Google database layout, you also know the data type and other information about the entry. However, the information you receive from Google is still plain text. You can see other examples of XML responses in the \GoogleAPI\soap-samples of the kit. Simply open them using Internet Explorer or another browser that supports XML. Figure 1.5 shows a typical view of one of the examples in the folder.

Figure 1.5: View example files using Internet Explorer or other XML-compatible browser.

Your browser is actually very handy for viewing XML data, even if it might not make sense right now. The "Viewing XML Data in Your Browser" section of Chapter 3 discusses in detail how you can use your browser. For right now, all you need to know is that you can look at the various kinds of XML responses by opening the files in your browser.

Figure 1.5 points to another potential problem with Web service output. All of the tags and other information supplied in a request and response consume space. The file is larger than a text file with the same data because of all the tag information required. In addition, it's far more efficient to store many data types in their native format, rather than use characters. Consequently, Web service data suffers from bloat. The data uses more bandwidth than a binary message and consequently, you could experience performance problems. Because of this issue, you need to create efficient queries for your application that maximize data throughput despite the limitations of the XML format. The "Making Sensible Queries" section of the chapter discusses this issue in detail.

The results you obtain from Google are largely a matter of the input you provide in the form of a request. The "Conducting an Expansion Search" section of the chapter points out a serious flaw in making any assumptions about the return you receive from Google. The query can become quite complex because even the order of the words makes a difference in the results you receive. Google must make this assumption because most people enter the words in the order they think about them, which is usually most important to least important. Consequently, if you always assume that your first query returns all possible results, you'll find Google Web Services disappointing.

The ranking of results you receive from Google Web Services is also unlikely to be the same as the ranking you need. Google sells keywords to make some sites turn up higher in the result list. In addition, Google often bases the site ranking on criteria that won't match your own, such as the number of times that a keyword appears. The bottom line is that the output you receive from Google is "raw" output ”information that you haven't filtered or organized in any way. One of the reasons to use Google Web Services is to enable you to perform tasks such as site ranking so the results appear in the order that's best for your organization.

Making Sensible Queries

Google Web Services can help you perform a number of tasks. The problem is that each request and response consumes resources. To get the most from this Web service, you need to optimize the requests and responses so that the value of the information you receive exceeds the cost of transmitting and manipulating the data.

Creating a request and then handling the response has several costs associated with it. Some of the costs are real world in that you must provide the infrastructure required to perform the task. Inefficient queries could mean adding additional bandwidth capacity or providing additional servers (if you make enough queries). Some costs are employee related ”inefficient queries mean more waiting time as the computer crunches the data. Finally, inefficient queries can incur intangible costs. For example, people can become frustrated with poor query results, which affects their performance. Some of these costs are impossible to measure accurately, but they're real.

I often rely on the online search engine to help tune queries. Using the Advanced Search (http://www.google.com/advanced_search/) page shown in Figure 1.1 can help you define and tune searches to obtain maximum data with minimal resource use. For example, computer technology is quickly outdated , so I normally provide a date range as part of my search. Using the online search to customize the date range for specific keywords can greatly enhance performance.

The Advanced Search page can help you tune keyword order ”making it possible to reduce the number of keyword permutations you use on an expanded search (see the "Conducting an Expansion Search" section of the chapter for details). It also helps you decide on how to use permanent keywords. For example, a site that sells a specific product might include the product name as a permanent search term ”one that is always included even if the user doesn't specify it.

One of the features the kit provides is a more complete list of special phrases and characters you can use for a search. Although the Advanced Search page will help you ferret out many of these search features, you won't find them all. For example, the Advanced Search page includes a blank for a search phrase, which is different from a keyword in that the search phrase must appear as specified on the target page (keywords can appear in any order). Fortunately, all of the special keywords and characters mentioned in the kit also work on the Advanced Search page so you can try them out. (Chapter 2 discusses search techniques in detail.) Depending on which special features you use for a search, Google Web Services output might not provide the information you imagined. Consequently, it pays to try these special features out to see what effect they have on your search results.

Tip

You might wonder why I'm suggesting such heavy use of the Advanced Search page. Google allows you to make 1,000 requests per day using Google Web Services. The Advanced Search page doesn't have such a limit ”making it easier to keep testing search techniques until you find the technique you want to use in your code. At that point, you can start making requests from Google Web Services to test your code. Don't waste calls on search techniques.

Even when you create a perfect search and properly filter the results using code, the information you receive from Google Web Services might not fulfill every need. At some point, you need to perform some level of human filtering. Users will need to state a preference or define how well a particular search result works. Only by tuning the filter can you hope to obtain specific results from Google Web Services. Tuning makes it possible to reduce search times from hours to minutes.

Defining Static and Dynamic Data

Web applications can include the concept of static and dynamic data. Dynamic data is the best type to use for Web services because it reflects changes in the Google database. An application gains important benefits by using dynamic data. For example, you won't try to access an old Web site that Google used to list because the dynamic nature of your application automatically removes the link from the list of results.

Unfortunately, dynamic data can also cause problems. For one thing, you need a connection to the Internet to work with dynamic data. When you use a desktop machine, maintaining a connection usually isn't a problem. However, many third party developers are working on applications where a connection might not be available, such as a research list application for a PDA. You download the information from Google Web Services and then use it to create a report while on the road ”the connection doesn't exist while you're on the road so the data is no longer dynamic.

Note

Google does support the concept of cached data that is stored from the original Web site, but even the cached data ages at some point and becomes unavailable. Cached data does have an important use. For example, you can use cached data to obtain copies of old articles that a Web site no longer carries.

Using the term dynamic to refer to application data is also somewhat of a misnomer. Nothing is truly a dynamic data application. The moment the response to your query leaves the Google server, it begins to age. The data doesn't change once it leaves the Google server, so in reality it isn't truly dynamic. The only way you can achieve a dynamic presentation of sorts is to make multiple queries. You must define how often is often enough for your needs. Google doesn't provide any guidance in this case because search result viability varies by person, focus, and need.

These facts lead into the discussion of static data. Truly static data never changes at all. Most Web sites still rely on static data presentation because the information they display doesn't change often enough to warrant a dynamic presentation. When you make a single query to Google Web Services, the response you receive is static data. It's a snapshot of that particular part of the database at a specific time. The data won't change unless you make another query.

Understanding the static and dynamic nature of data is important when you design an application that relies on Google Web Services. Errors creep into the presentation you create as the data from Google Web Services ages on your system. Part of the design process for your application is to determine how much error you can accept.