Understanding XML Basics | Mining Google Web Services: Building Applications with the Google API

Almost everyone has heard about and used XML in some way. If nothing else, you've seen XML extensions on some Web pages because many magazines now use XML as a fast way to present highly formatted data online. Even though your browser presents what appears to be a standard Web page, underneath the presentation the page is an XML file. XML sees use in more than Web pages and Web services; it's becoming the glue that holds the Internet together. In fact, you'll find XML used for many non-Internet purposes, such as application configuration files.

In many ways, knowing XML is a way to understand the presentation and distribution of information on the Internet today. Presentation is especially important when working with Google because this search engine can help you find just the right presentation of the many sources of data you find. You might need some data presented as a table, rather than as running text. Distribution is also very important if you want to save time in sending information to other people. For example, it's often possible to find the same data distributed as a Web page, PDF file, and PowerPoint presentation. Although the following sections provide the information you need to work with Google Web Services, you'll eventually want to explore this topic further by using the resources in the "Learning More about XML" section.

Defining the Parts of an XML Message

All XML messages consist of three components: elements, attributes, and data. For all of the complexity of the examples in the previous chapters, XML doesn't contain very much in the way of complex information. In addition, XML messages consist of text. Yes, you can attach encoded data, but the message itself is pure text, which makes XML quite readable. Here's a simple example that shows all three kinds of XML message components. You'll find this example in the \Chapter 03\Sample XML folder of the source code located on the Sybex Web site.

 <?xml version="1.0" encoding="UTF-8"?>
 <Hello xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Element1>Some Text 1</Element1>
    <Element2 MyAttribute="SomeValue">Some Text 2</Element2>
 </Hello>

The first line is an element of sorts. It's a special kind of element, formally called the XML declaration, that XML files normally begin with: the XML heading. The <?xml?> element (or tag as some books say) defines this file as an XML file of some kind. The version attribute further defines the XML file by telling the XML parser that this is a version 1.0 file. The encoding attribute states how the data preparer formed the characters within the file. The two encodings that every XML parser must support are UTF-8 and UTF-16; you might also run across UTF-7, which RFC 2152 describes. You can learn more about the Unicode Transformation Format (UTF) standard at http://www.ietf.org/rfc/rfc2152.txt and http://www.utf-8.com/.
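To make the version and encoding attributes concrete, here is a minimal JavaScript sketch that reads them from the first line of a file. The readXmlDeclaration() name is hypothetical and the regular expression is only an approximation; a real XML parser performs this step for you and handles many more cases.

```javascript
// Hypothetical helper: reads the version and encoding from an XML declaration.
// This is a sketch for illustration, not a replacement for an XML parser.
function readXmlDeclaration(xmlText) {
  const pattern =
    /^<\?xml\s+version=["']([^"']+)["'](?:\s+encoding=["']([^"']+)["'])?/;
  const match = pattern.exec(xmlText);

  // No declaration at the start of the text.
  if (match === null) {
    return null;
  }

  // The encoding attribute is optional, so report null when it's missing.
  return { version: match[1], encoding: match[2] || null };
}

const sampleDeclaration = readXmlDeclaration(
  '<?xml version="1.0" encoding="UTF-8"?><Hello/>'
);
console.log(sampleDeclaration.version);  // 1.0
console.log(sampleDeclaration.encoding); // UTF-8
```

The function returns null when the text doesn't start with a declaration, so the caller can decide what to assume in that case.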

The second line also contains an element. However, notice this element has an opening and a closing tag. The opening <Hello> tag appears first, followed by two child elements, followed by the closing </Hello> tag. Standard elements all require an opening and closing tag unless they're self-contained. You can create a self-contained tag by adding the ending slash as part of the initial tag like this: <Hello/>. The <Hello> element includes a special namespace attribute. You can detect namespace attributes because they normally begin with xmlns, followed by a colon, followed by the prefix for the namespace (xsi in this case). The namespace normally has a URL attached to it. The URL acts as a unique name for the namespace, and the page it points to, if one exists, often contains a description of the elements that the namespace defines. An XML parser doesn't download that page; whenever it sees a namespace attached to an element, it uses the URL to decide which vocabulary the element, associated attributes, or data belong to.

The <Element1> element is a child of the <Hello> element. Elements can have child/parent relationships. This element doesn't include any attributes, but it does have data in the form of Some Text 1. The value of <Element1> is Some Text 1. The XML Parser links the element to its data.

The <Element2> element is also a child of the <Hello> element and a sibling of <Element1>. This element also includes an attribute. In this case, the attribute is extra data that describes the element in some way. The value of MyAttribute appears in quotes after the attribute. To create an attribute, you must always provide a name, followed by the equals sign and a string value in quotes. An element can contain as many attributes as needed to provide a full description of its functionality.
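One way to see how the three parts fit together is to assemble the sample message from them. The following JavaScript is only a sketch: buildElement() and escapeXml() are hypothetical helpers written for this illustration, not functions from any library.

```javascript
// Hypothetical helpers for illustration; they aren't part of any library.

// Replace the characters that have a special meaning in XML text.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build one element from its name, an object of attributes, and its data.
function buildElement(name, attributes, data) {
  let attributeText = "";
  for (const key of Object.keys(attributes)) {
    // Every attribute is a name, an equals sign, and a quoted string value.
    attributeText += " " + key + '="' + escapeXml(attributes[key]) + '"';
  }

  // An element without data can use the self-contained form.
  if (data === "") {
    return "<" + name + attributeText + "/>";
  }
  return "<" + name + attributeText + ">" + escapeXml(data) + "</" + name + ">";
}

// The two children are siblings; the <Hello> tags wrap them as the parent.
const helloMessage =
  "<Hello>" +
  buildElement("Element1", {}, "Some Text 1") +
  buildElement("Element2", { MyAttribute: "SomeValue" }, "Some Text 2") +
  "</Hello>";
console.log(helloMessage);
```

The sketch escapes characters such as < and & in the data because they would otherwise look like markup to the XML parser.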

Viewing XML Data in Your Browser

One of the problems of working with XML data is that it can become quite lengthy. The length of a Google search result can make it difficult to locate the very information you seek. Fortunately, you can see the formatted data in a browser such as Internet Explorer. The data contains indentations to show the relationships between parent and child. In addition, you can differentiate between elements, attributes, and data by looking at the colors. Finally, special elements such as processing instructions and the XML header appear in a different color.

Unless you know how a browser displays XML, you might conclude that the indentation and coloration are the only help you receive. However, browsers have a lot more to offer than that in most cases. At the very least, you can expand and collapse various levels of information. Figure 3.1 shows an example of a Google response that relies on the ability of the browser to collapse information to present a clearer picture of the response. You'll find this example in the \Chapter 03\Sample XML folder of the source code located on the Sybex Web site.

Figure 3.1: Internet Explorer and other browsers can display XML files in a variety of ways.

As you can see, the entire response now fits within one screen, making it easy to get an overview of the data. Notice the minus (-) sign next to the <resultElements> element. This symbol indicates that you can collapse this level. The plus (+) sign that appears next to each of the <item> elements shows that you can expand the level to show child entries. Clicking either a minus or plus sign performs that task within the browser, so you can display any level of detail desired.

Getting XML Data Tools

A browser is a good tool for viewing XML, but you can't modify the XML using it. Because XML is pure text, you can use any editor like Notepad to edit it. However, Notepad isn't optimal because it doesn't display the XML structure. In addition, Notepad lacks tools for making the editing experience better. For example, if you want to add a new element, you must type the tag manually. Manual techniques often leave you open to data errors. Consequently, you need an editor that works well with XML files.

Note

Many of the tools mentioned in this book rely on Microsoft XML Core Services (MSXML) 4.0. In addition, some of the coding examples also rely on this library. The latest version at the time of writing is Service Pack (SP) 2, which you can download at http://www.microsoft.com/downloads/details.aspx?familyid=3144B72B-B4F2-46DA-B4B6-C5D7485F2B42 (MSXML 4.0 is approximately 5MB, so make sure you allocate enough time to download it). Both of the editors in the sections that follow rely on MSXML 4.0. However, Netpadd is perfectly happy using MSXML 5.0, which comes with Microsoft Office 2003. On the other hand, XMLwriter 2 specifically requests MSXML 4.0 every time you start it, even if you have MSXML 5.0 installed. Fortunately, you can install the two versions of MSXML side by side without any ill effects.

XML editors use a number of methods for displaying the XML file. Because presentation is very important when working with XML, you should choose an XML editor that presents the information in a way that you can understand. For example, Figure 3.2 shows the tree view editor used by many XML editors such as XML Notepad and XMLSpy. In addition, XML editors cost differing amounts based on the features they provide. Some editors are very expensive because they provide automated generation features and edit a number of file types.

Figure 3.2: Many XML editors provide a tree view display that experts like.

I chose the two XML editors presented in the sections that follow because they're simple to use and you can download them free. I'm not endorsing these editors as the only selections on the market; you should try a number of editors before you settle on one. However, because these editors provide good functionality and don't provide too many confusing features, you might want to try them as a starting point for your XML learning experience.

Using Netpadd

Netpadd (http://www.netpadd.com/) is a freeware product that has some interesting features, but is also very simple to use. To write XML using this product, you need to type all of the information manually. There's little automation in this product, so when you create an opening tag, you must create the closing tag to go with it.

The display, shown in Figure 3.3, is very much like the display you'd see in Notepad, rather than the heavily formatted display provided by products such as XMLSpy that some newer developers find confusing. (You'll find the sample file shown in Figures 3.2 and 3.3 in the \Chapter 03\Sample XML folder of the source code found on the Sybex Web site.) Netpadd does provide keyword highlighting for your XML file, which makes viewing the information a lot easier. Use the Options Hilite XML menu options to define which file types receive highlighting.

Figure 3.3: Netpadd provides an easy to understand display of the XML file.

Tip

One of the most popular XML editors on the market today is XMLSpy (http://www.xmlspy.com/). You can download a limited time evaluation copy of the product from the Altova Web site. Once the evaluation period ends, you must either remove XMLSpy from your system or buy a copy. Another popular choice is XML Notepad, a free download originally provided by Microsoft. Microsoft doesn't officially support XML Notepad any longer. Consequently, you can't download it from Microsoft. Some alternate download sites include WebAttack at http://www.webattack.com/get/xmlnotepad.shtml (this site has the 1.5 version) and DevHood at http://www.devhood.com/tools/tool_details.aspx?tool_id=261 (this site has the 1.0 version, which works much the same as the 1.5 version and has about the same feature set). Both of these editors provide superior handling of XML files by including special symbols and specific methods for adding data. In addition, XMLSpy works on a number of other file types and provides some task automation that you'll find helpful if you work on XML files frequently.

One of the more interesting features that Netpadd provides for XML developers is multiple data views. For example, you can use the View XML Tree command to display a tree view like the one shown in Figure 3.4. Other commands let you view the XML in other ways. Use the View XSL Transformation command to display the information as transformed by an XSL file. This particular view is very helpful when working with Google Web Services because you can fine-tune your XSL file without making numerous requests to the Web service.

Figure 3.4: View your XML files in various ways using Netpadd's View commands.

You'll also like the special dialog boxes that Netpadd provides. For example, the View Special Characters command displays the Special Characters dialog box. Select the character you want to use and click either Paste or Paste as HTML to place the symbol in your document. If you've ever wasted time looking up language codes online, you'll really like the Language Codes dialog box (displayed using the View Language Codes command). Simply select the language you want to use and click Paste.

Using XMLwriter 2

XMLwriter 2 (http://www.xmlwriter.net/) is a try-before-you-buy product. I won't say that this product is shareware in the strictest sense because the trial period limits use to 30 days. That said, the trial period means you can download the product and try it before making a buying decision, which makes the buying decision easier.

Like Netpadd, and unlike many other XML editors on the market, XMLwriter 2 uses a Notepad-style document display for editing as shown in Figure 3.5. This product automatically assumes you want to use color-coding for keywords. You'll also find the use of automation nice. For example, when you type an opening tag, XMLwriter 2 automatically creates a closing tag for you. Load a schema for your XML file and you'll be able to choose tags directly from the TagBar displayed on the left side of the screen. The IDE also features an XML checker. Simply right-click the document and select Validate XML File from the context menu. Any errors appear in a TODO list at the bottom of the IDE.

Figure 3.5: XMLwriter 2 uses a document style editor, but provides many features found in high-end products.

This product includes a number of features that the serious XML developer will need. For example, you can build projects using XMLwriter 2. Creating a project organizes the files and makes it easier to build the links you need. XMLwriter 2 comes with built-in support for all of the standard files, including XML, XSL, XSLT, HTML, XHTML, CSS, DTD, XSD, and text. In addition, you can open some types of image files, such as the GIF, JPG, and PNG files used by many Web sites. However, the files you can open aren't actually limited to these types. You can add new types to the list, so long as XMLwriter 2 can read them (which means that you can add any text-based file extension).

The IDE itself is fully configurable using any of the Options dialog box entries. Fortunately, the default setup is quite usable. For example, the tabbed presentation means you can see multiple versions of your XML file with ease. For example, if you want to see a tree view of your document, simply right-click the document and select View As Tree from the context menu. Figure 3.6 shows a typical example of the tree view. You can also choose a browser view for your document.

Figure 3.6: Select a tree view to see the overall layout of your XML document.

Sending Special Characters Using URL Encoding

For most people, working with Web sites is a unique experience because they encounter unexpected oddities that they haven't had to consider in the past. When you type a space into a word-processed document, nothing odd happens; the computer simply accepts the character. However, look at the word processor again. Notice how the word processor automatically looks at the space and uses it to determine where to split lines of text. The word processor does treat the space differently: it treats it as a delimiter (a fancy term that programmers use to mean a character that has a special meaning). Likewise, when you add a hyphen to a word, the computer could choose to split the sentence in the middle of the word. The hyphen acts as a delimiter.

The Internet also uses delimiters for a number of purposes, including URLs. When a Web server sees a space, it could assume that it has reached the end of the URL or the beginning of a new input parameter (or a number of other things). Consequently, you must replace spaces, question marks, and other characters with other characters that don't work as delimiters. You might have noticed this practice at work when you fill out a form on the Internet. The browser commonly replaces a space between two words with %20 or a plus sign (+). The Web server interprets these special character sequences as a space.
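JavaScript already includes standard functions that perform both kinds of replacement, so a short sketch can show the two forms side by side. The query text below is made up for illustration. The encodeURIComponent() function produces the %20 form, while the URLSearchParams object follows the form-submission rules and uses the plus sign.

```javascript
// A made-up query string used only for this illustration.
const rawQuery = "XML basics?";

// encodeURIComponent() replaces delimiters with percent codes.
const percentEncoded = encodeURIComponent(rawQuery);
console.log(percentEncoded); // XML%20basics%3F

// URLSearchParams uses form encoding, which turns a space into a plus sign.
const formEncoded = new URLSearchParams({ q: rawQuery }).toString();
console.log(formEncoded); // q=XML+basics%3F

// decodeURIComponent() reverses the percent encoding.
console.log(decodeURIComponent(percentEncoded)); // XML basics?
```

Notice that the question mark is replaced in both cases because a URL uses it as a delimiter between the address and the input parameters.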

At first, you might think that the character replacement is random, but there's some method to the madness. In fact, it's relatively easy to write a JavaScript function that performs the character replacement so you don't need to worry about it. Listing 3.1 shows this function. You'll find the complete source for this example in the \Chapter 03\URL Encode folder of the source code located on the Sybex Web site. Note that you can write similar functions in other languages; I'm just using this one because most people can run JavaScript using their browsers.

Listing 3.1: Replacing Characters in a String

 function ReplaceCharacter(InputStr, Replace, UseInstead)
 {
    // Define the length of the inputs.
    var InputLength = InputStr.length;
    var ReplaceLength = Replace.length;

    // Determine whether either input has a 0 length. If so,
    // the function can't succeed. However, because this is
    // a recursive function, the function does need to return
    // the original string.
    if ((InputLength == 0) || (ReplaceLength == 0))
       return InputStr;

    // Locate the first replacement value.
    var ReplaceIndex = InputStr.indexOf(Replace);

    // If the replacement value doesn't appear within the string,
    // then return. Again, keep the recursive nature of the
    // function in mind.
    if (ReplaceIndex == -1)
       return InputStr;

    // Create a string that includes the first part of the original
    // string and the replacement character, but not the rest of the
    // string.
    var Output = InputStr.substring(0, ReplaceIndex) + UseInstead;

    // Use recursion to process the string again if there is more data
    // to process.
    if (ReplaceIndex + ReplaceLength < InputLength)
       // Keep adding to the output string after each recursion.
       Output += ReplaceCharacter(
          InputStr.substring(ReplaceIndex + ReplaceLength, InputLength),
          Replace,
          UseInstead);

    // Return the output during each recursion.
    return Output;
 }

This might look like a lot of very complicated code, but it's actually an easy program. It uses a special technique called recursion to perform its work. In recursion, the programmer writes a program that solves the simplest form of a problem, and then has that program keep calling itself until it achieves that simple form. No matter how complex the input is, the program can solve it (given enough memory and time) because eventually the input will reach this simple solution.

In this instance, the program keeps calling itself until one of several conditions occurs. First, the program could run out of text to process. Second, the program might have some text left, but it might not contain the special character you want to replace (such as a space). If that's the case, then the program has already performed all of the required work, so it can stop.

Once the program determines there's data to process, it uses the indexOf() function to look for that character in the string. The substring() function then returns just the first part of the string, the part that doesn't contain the special character. To this string, the code adds the replacement characters, such as %20 for a space.

It's at this point that the recursion process occurs. The code still has the other part of the string to consider: the remainder that follows the special character. The first part of the string is free of the special character, but the remainder might not be. When the code detects that there's still string to process, it calls itself again with the remainder of the string. This process continues until the code has processed all of the input string. Figure 3.7 shows typical output from this program.

Figure 3.7: The example program shows how you can perform URL encoding.
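Recursion isn't the only way to perform this replacement. The following sketch shows a loop-based counterpart to the function in Listing 3.1 so you can compare the two approaches. The replaceIteratively() name is hypothetical, and this version is an alternative written for illustration rather than code from the book's source files.

```javascript
// Hypothetical loop-based counterpart to ReplaceCharacter() in Listing 3.1.
function replaceIteratively(inputStr, replace, useInstead) {
  // As in the listing, an empty input means there's nothing to do.
  if (inputStr.length === 0 || replace.length === 0) {
    return inputStr;
  }

  let output = "";
  let start = 0;
  let index = inputStr.indexOf(replace, start);

  // Each pass copies the text before the match and adds the replacement.
  while (index !== -1) {
    output += inputStr.substring(start, index) + useInstead;
    start = index + replace.length;
    index = inputStr.indexOf(replace, start);
  }

  // Add whatever text follows the last match.
  return output + inputStr.substring(start);
}

const encodedTitle = replaceIteratively("Mining Google Web Services", " ", "%20");
console.log(encodedTitle); // Mining%20Google%20Web%20Services
```

The loop does the same work as each recursive call: it locates the next match, copies the text in front of it, and moves on to the remainder. A very long input can't exhaust the call stack with this approach, which is one reason you might prefer it.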

Of course, the problem is figuring out which characters to replace and what numbers to use to replace them. Unfortunately, Google doesn't publish a list of offending characters, so you'll need to experiment a little with special characters that you want to use. A space never works, and you have to exercise care with both double and single quotes. Determining what number to use is easy. Simply break out a copy of the Character Map utility that comes with Windows and you have everything you need. Figure 3.8 shows what this utility looks like.

Figure 3.8: Character Map makes it easy to learn the numbers associated with special characters.

Simply select the character you want and look at the number that appears at the bottom of the dialog box. This hexadecimal number is the one you should use to replace the character in a string. You can also hover the mouse over the character and the program will display both the character name and the associated number. For example, you replace the quotation mark with %22 in a URL-encoded string.
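You can also ask JavaScript for the same number that Character Map displays. The following sketch uses the standard charCodeAt() and toString(16) methods to build the replacement code; the toPercentCode() name is hypothetical. It assumes a plain ASCII character, because characters outside that range need more than one percent code.

```javascript
// Hypothetical helper: returns the percent code for one ASCII character.
function toPercentCode(character) {
  const code = character.charCodeAt(0);

  // Characters beyond ASCII require multiple codes, which this sketch skips.
  if (code > 0x7f) {
    throw new RangeError("This sketch handles only ASCII characters.");
  }

  // Convert the number to two uppercase hexadecimal digits.
  return "%" + code.toString(16).toUpperCase().padStart(2, "0");
}

console.log(toPercentCode('"')); // %22
console.log(toPercentCode(" ")); // %20
```

The quotation mark has the hexadecimal value 22, which matches the number that appears in Character Map.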

Learning More about XML

Whether you know it or not, you'll run into XML many times during your computer use. The reason is simple: XML makes a great way to exchange data between disparate systems. Fortunately, XML is relatively easy to learn. Visit the W3Schools site at http://www.w3schools.com/xml/ to find a complete XML tutorial. You might also want to review the namespace tutorial at http://www.zvon.org/index.php?nav_id=172&ns=34.

Unlike many topics discussed in this book, there are multiple versions of XML so you can't rely on just one reference. The most important reference for Google Web Services appears at http://www.zvon.org/xxl/xmlSchema2001Reference/Output/index.html. However, make sure you also look at the references at http://www.zvon.org/xxl/xmlSchemaReference/Output/index.html for complete information. The annotated XML reference at http://www.xml.com/axml/axml.html is also handy for seeing the specification and expert commentary side by side.

You can also find a number of good general-purpose XML sites online. For example, the Microsoft XML Developer Center (http://msdn.microsoft.com/nhp/default.asp?contentid=28000438) is a great place to visit if you use Microsoft products.