A Crash Course in XML | Professional Web APIs with PHP. eBay, Google, PayPal, Amazon, FedEx, Plus Web Feeds

Before feeds can be examined in detail, a basic understanding of the language they are provided in is needed. XML should look familiar to anyone with experience with HTML because they both follow a similar set of rules. The primary difference in application is that HTML was designed to be used for the Web, whereas XML was designed to be used with anything, and only incidentally is of great use on the Web.

XML is a textual format for describing data, and it is now used almost everywhere, for nearly everything. HTML (Hypertext Markup Language) is moving to an XML-compliant format with XHTML, publishers are using XML to store data internally, content providers are using XML to give their users easy-to-consume data, and web service providers are using XML to allow complex transactions to occur between disparate systems.

As you might suspect, XML documents have to follow certain rules in order to be considered well-formed and valid. Beyond these two terms, XML, like any other technology, has its own special lingo. This section gets you up to speed on all the rules and terminology you will need to be able to use web feeds competently.

Terminology

The best way to learn all the terminology associated with XML is to look at a simple XML document, like the one shown here:

<bookshelf>
  <book>Professional PHP 4</book>
  <book price="$9.99CDN">Learn how to program PHP poorly in 24 minutes</book>
  <magazine>PHP|Architect</magazine>
</bookshelf>

In this instance you can say the following:

bookshelf is the root element for this XML document
bookshelf is the parent of book
book is an element within this XML document
book is the child of bookshelf
price is an attribute within the book element

You should be able to tell that the terms used in the preceding list (root element, parent, element, child, and attribute) are pretty intuitive, assuming you look at them in the context of the structure of the document. The root element contains all other elements; it is therefore the parent of the elements directly contained within it. An element that has a parent (that is, one contained by a higher-level element) is called a child, and elements can carry attributes, which describe their properties.
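The same relationships can be inspected programmatically. The following is a minimal sketch in Python, using the standard library's xml.etree.ElementTree module rather than the PHP tools used elsewhere in this book; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The bookshelf document from the text, held in a string for convenience.
DOCUMENT = """<bookshelf>
  <book>Professional PHP 4</book>
  <book price="$9.99CDN">Learn how to program PHP poorly in 24 minutes</book>
  <magazine>PHP|Architect</magazine>
</bookshelf>"""

root = ET.fromstring(DOCUMENT)               # the root element
child_tags = [child.tag for child in root]   # children of the root, in document order
books = root.findall("book")                 # only the book children
price = books[1].get("price")                # attribute on the second book
missing = books[0].get("price")              # None: the first book has no price attribute

print(root.tag)      # bookshelf
print(child_tags)    # ['book', 'book', 'magazine']
print(price)         # $9.99CDN
print(missing)       # None
```

Iterating over an element yields its direct children, which mirrors the parent and child vocabulary used above.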

That covers most of what you will need to know about how the structure of an XML document is described. However, you still need to understand the concepts of validity and well-formedness.

Well-Formed XML

There are three key rules for creating well-formed XML: It must have a single root-level element, tags must be opened and closed properly, and entities must be well-formed.

A Single Root-Level Tag

Every XML document has a single root-level element. This element must contain all other elements within the document.

Here is an example of XML that is not well-formed:

<book>Professional PHP 4</book>
<book>Beginning PHP 4</book>

Here is an example of well-formed XML:

<bookshelf>
  <book>Professional PHP 4</book>
  <book>Beginning PHP 4</book>
</bookshelf>

The first example has two root elements, both titled book. The second example has a single root element, titled bookshelf. The first is not well-formed because XML documents can have only a single root element. In the second example, the book elements are enclosed within a single bookshelf element, making the bookshelf element the single root tag and fulfilling the first condition for well-formedness in the process.
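A quick way to check the single-root rule is to hand both documents to a parser. This sketch assumes Python's standard xml.etree.ElementTree module; is_well_formed is a hypothetical helper written for this illustration, not part of any library.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as one complete XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# Two top-level book elements: there is no single root.
two_roots = "<book>Professional PHP 4</book>\n<book>Beginning PHP 4</book>"

# The same books wrapped in one bookshelf root.
one_root = (
    "<bookshelf>"
    "<book>Professional PHP 4</book>"
    "<book>Beginning PHP 4</book>"
    "</bookshelf>"
)

print(is_well_formed(two_roots))  # False
print(is_well_formed(one_root))   # True
```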

Note

XML is used to add a logical structure to information. Invariably, you need to structure information to indicate or reflect a real-life situation. In this case, a library has bookshelves, which contain books, so it makes sense that the bookshelf tag contains book elements. You wouldn't really want a book root tag containing bookshelf elements.

Tags Must Be Opened and Closed Properly

Each and every element within the XML document must be opened and closed properly.

An XML tag is considered closed if it has a matching closing tag. You can also open and close a tag at once by placing a forward slash immediately before the closing angle bracket. In contrast, HTML often has tags that are left open (img, p, hr, br, and so on).

This is not well-formed XML:

<bookshelf>
  <book>Professional PHP 4
  <book>Beginning PHP 4
</bookshelf>

This is well-formed XML:

<bookshelf>
  <book>Professional PHP 4</book>
  <book>Beginning PHP 4</book>
  <book title="Learning PHP 4" />
</bookshelf>

It should be obvious to you that the first example document is not well-formed because neither of the book tags is closed. The second example shows two closed book tags and one empty, closed book tag. Why is the final tag empty? Well, if you look closely, the final book tag has a title attribute and is closed immediately after this. In this instance, the attribute tells you about the book's title.
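To see how a parser treats the two documents, here is a small sketch using Python's standard xml.etree.ElementTree module (an assumption of this example, not a tool the chapter relies on). It shows the unclosed document being rejected and the empty book element exposing only its title attribute.

```python
import xml.etree.ElementTree as ET

# Neither book tag is closed, so the parser should refuse the document.
unclosed = "<bookshelf><book>Professional PHP 4<book>Beginning PHP 4</bookshelf>"
try:
    ET.fromstring(unclosed)
    unclosed_accepted = True
except ET.ParseError as error:
    unclosed_accepted = False
    print("rejected:", error)

# Every tag is closed; the last book is opened and closed in a single tag.
closed = """<bookshelf>
  <book>Professional PHP 4</book>
  <book>Beginning PHP 4</book>
  <book title="Learning PHP 4" />
</bookshelf>"""
last_book = ET.fromstring(closed).findall("book")[-1]

print(last_book.text)          # None, because the element has no content
print(last_book.get("title"))  # Learning PHP 4
```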

A tag opened within another tag must close before its parent. Similar rules exist for HTML, which are often ignored when the situation suits the coder. Whereas most programs that deal with HTML are forgiving in this respect, XML parsers, generally, are not.

This is not well-formed XML:

<bookshelf>
  <book>Learn PHP for only <price>$9.99</book>CDN</price>
</bookshelf>

This is well-formed XML:

<bookshelf>
  <book>Learn PHP for only $9.99</book>
  <price>$9.99CDN</price>
</bookshelf>

Here is another example of well-formed XML:

<bookshelf>
  <book price="$9.99CDN">Learn PHP for only $9.99</book>
</bookshelf>

In the first, incorrect example, the book tag is opened, then the price tag is opened, then the book tag is closed, and finally the price tag is closed, so the tags overlap. The second and third examples show well-formed ways to record the information. The last example is preferable because it properly demonstrates the relationship between the book and its price.
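The difference between the three documents shows up clearly once they are parsed. The sketch below assumes Python's standard xml.etree.ElementTree module; the variable names are chosen for this illustration.

```python
import xml.etree.ElementTree as ET

# Overlapping tags: book closes while price is still open.
overlapping = (
    "<bookshelf><book>Learn PHP for only <price>$9.99</book>CDN</price></bookshelf>"
)
try:
    ET.fromstring(overlapping)
    overlap_accepted = True
except ET.ParseError:
    overlap_accepted = False

# Well-formed, but price is only a sibling of book; nothing ties the two together.
siblings = ET.fromstring(
    "<bookshelf><book>Learn PHP for only $9.99</book><price>$9.99CDN</price></bookshelf>"
)
sibling_tags = [child.tag for child in siblings]

# Well-formed, and the price travels with the book as an attribute.
nested = ET.fromstring(
    '<bookshelf><book price="$9.99CDN">Learn PHP for only $9.99</book></bookshelf>'
)
book_price = nested.find("book").get("price")

print(overlap_accepted)  # False
print(sibling_tags)      # ['book', 'price']
print(book_price)        # $9.99CDN
```

With the attribute form, code that holds a book element can read its price directly instead of guessing which neighboring price element belongs to it.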

Entities Must Be Well-Formed

Entities accomplish several things within XML. At their most basic level they provide a method to encode several characters to avoid confusion (<, >, &, ', "), as well as to represent nonstandard characters such as ©. They can also be used to represent user-defined text (entire sentences or paragraphs). The first use is the only one discussed here because it is critical in creating well-formed XML.

Encoded XML entities all take on a similar form: &entity identifier;. The appropriate encoding for an ampersand is &amp;, which uses one of the special named entities. XML has five named entities, as described in the following table.

Entity Name	Character	Encoded Example
lt	Less than <	6 &lt; 7
gt	Greater than >	7 &gt; 6
amp	Ampersand &	Tom &amp; Jerry
apos	Single quote or apostrophe '	Kelly&apos;s Car
quot	Double quote "	100% &quot;Secure&quot;

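Of these characters, the less-than sign and the ampersand must always be encoded in element content and attribute values for a document to be well-formed, and a quote character must be encoded when it appears inside an attribute value delimited by the same kind of quote. Encoding the remaining characters is optional in most positions but harmless, so encoding all five consistently is a safe habit.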

This is not well-formed XML:

<bookshelf>
  <book>Tom & Jerry's, a history of</book>
  <book>2 + 2 < 5, Math for beginners</book>
</bookshelf>

Here is an example of well-formed XML:

<bookshelf>
  <book>Tom &amp; Jerry&apos;s, a history of</book>
  <book>2 + 2 &lt; 5, Math for beginners</book>
</bookshelf>

The difference between the two should be obvious: the special characters were encoded as named entities where necessary in the second example. The raw ampersand and less-than sign are what keep the first example from being well-formed; the apostrophe is encoded for consistency.
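In practice this encoding is usually done by a library function rather than by hand. As a sketch, assuming Python's standard library, xml.sax.saxutils.escape encodes the ampersand, less-than, and greater-than characters, and accepts a dictionary of extra replacements (used here for the apostrophe).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

title = "Tom & Jerry's, a history of"

# escape() handles &, < and > itself; the apostrophe is added as an extra mapping.
encoded = escape(title, {"'": "&apos;"})
print(encoded)  # Tom &amp; Jerry&apos;s, a history of

# The parser turns the entities back into the original characters.
decoded = ET.fromstring(f"<book>{encoded}</book>").text
print(decoded == title)  # True

# Without encoding, the raw ampersand makes the document not well-formed.
try:
    ET.fromstring(f"<book>{title}</book>")
    raw_accepted = True
except ET.ParseError:
    raw_accepted = False
print(raw_accepted)  # False
```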

Encoding other characters (such as ©) is similarly easy; the format is &#Unicode character number;. So © is &#169;. You can also use the hexadecimal representation of the number by prefixing it with an x: &#xA9;.
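Both forms of the reference can be derived from the character's Unicode number. The following sketch assumes Python and its standard xml.etree.ElementTree module; the element name c is arbitrary.

```python
import xml.etree.ElementTree as ET

copyright_sign = "\u00a9"            # the © character
code_point = ord(copyright_sign)     # its Unicode character number

decimal_ref = f"&#{code_point};"     # decimal form
hex_ref = f"&#x{code_point:X};"      # hexadecimal form, prefixed with x

print(decimal_ref)  # &#169;
print(hex_ref)      # &#xA9;

# A parser decodes either reference back to the same character.
decoded = [ET.fromstring(f"<c>{ref}</c>").text for ref in (decimal_ref, hex_ref)]
print(decoded == [copyright_sign, copyright_sign])  # True
```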

Valid XML

Valid XML is the next step from well-formed XML. As such, before an XML document can be considered valid, it must first be well-formed. A document can be well-formed while still not being valid.

A valid XML document references a Document Type Definition (DTD), which may either be contained within the document itself or, more likely, be an external resource. In order for the document to be valid, it must follow the rules outlined by that DTD. Here is an example document type declaration from an RSS feed, which points to an external DTD:

 <!DOCTYPE rss SYSTEM "http://my.netscape.com/publish/formats/rss-0.91.dtd">

With that declaration in place, the program that parses the XML can retrieve the DTD and verify that the document is valid before attempting to process it.

Fully exploring the relationship between DTDs and XML documents is beyond the scope of this book. For now it should suffice to accept that valid XML documents have a DTD and follow the rules outlined within it (all of the feed examples presented within this book are valid).

Additional Considerations

You should be mindful of two additional items when creating XML documents.

Capitalization Matters

In HTML, the capitalization of tag names is irrelevant. That is not the case with XML; for example, bookShelf is different from bookshelf or BookShelf.

This is not well-formed XML:

<bookShelf>
  <book>Professional PHP 4</BOOK>
  <book>Beginning PHP 4</BOOK>
</bookshelf>

This is well-formed XML:

<bookshelf>
  <book>Professional PHP 4</book>
  <book>Beginning PHP 4</book>
</bookshelf>
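Case sensitivity affects lookups as well as well-formedness. In this sketch (Python's standard xml.etree.ElementTree module is assumed), the second document is a hypothetical variation that is well-formed but mixes book and Book elements, which a parser treats as two unrelated names.

```python
import xml.etree.ElementTree as ET

# Opening and closing tags differ only in case, so they do not match.
mismatched = "<bookShelf><book>Professional PHP 4</BOOK></bookshelf>"
try:
    ET.fromstring(mismatched)
    mismatched_accepted = True
except ET.ParseError:
    mismatched_accepted = False

# Hypothetical document: well-formed, but book and Book are different elements.
mixed = ET.fromstring(
    "<bookshelf>"
    "<book>Professional PHP 4</book>"
    "<Book>Beginning PHP 4</Book>"
    "</bookshelf>"
)
lower_count = len(mixed.findall("book"))
upper_count = len(mixed.findall("Book"))

print(mismatched_accepted)       # False
print(lower_count, upper_count)  # 1 1
```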

White Space Will Remain

White space within HTML is largely collapsed or ignored. Repeated spaces, new lines, and so forth are reduced by the browser when the page is displayed, but this is not the case with XML. In XML, white space characters are as much a part of the data as any other character, so they remain. This isn't to say that a web browser displaying the XML won't try to do something funny with the characters (such as displaying white space in a manner identical to its treatment of HTML), just that XML does recognize those characters.
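A parser hands this white space to your code untouched, so any trimming is up to you. The sketch below assumes Python's standard xml.etree.ElementTree module; the padded book title is made up for the illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<bookshelf>\n  <book>  Professional   PHP 4  </book>\n</bookshelf>"
)
book = root.find("book")

print(repr(book.text))  # '  Professional   PHP 4  ' (spaces kept exactly)
print(repr(root.text))  # '\n  ' (the line break and indent before the book element)
print(repr(book.tail))  # '\n' (the line break after the book element)

# Collapsing white space the way a browser displays HTML is an explicit step.
collapsed = " ".join(book.text.split())
print(collapsed)        # Professional PHP 4
```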

Now that you have a pretty good idea of XML, the underlying technology behind web feeds, the next section talks about what a web feed looks like in the flesh. Having put in the hard work learning how XML works, you should find that the content of a web feed is pretty easy to understand.