Section 7.1. Introducing Atom | Developing Feeds with RSS and Atom

7.1. Introducing Atom

When, in 2003, it became painfully clear that the RSS world was not going to declare a truce and agree to sort out the remaining problems (the competing formats being the biggest of them, the lack of documentation the second), a large group of interested developers split off to design a new format from the ground up. After much to-ing, fro-ing, cogitating, and argument, not least over the name of the thing, a format has arisen: the Atom Syndication Format.

At time of writing, the format is at Version 0.5, and this book is based on that version. It is hoped that by April 2005, the Atom format will reach a solid Version 1.0 and will be submitted to the Internet Engineering Task Force as a proposed standard. You should therefore, after reading this chapter, consult the necessary web pages for the latest details. Changes will have been made, but nothing too drastic, I believe. Nevertheless, it is safer to warn you that what I am about to write may well be wrong by the time you come to read it.

This chapter is based on the standard found at http://www.ietf.org/internet-drafts/draft-ietf-atompub-format-05.txt, and the mailing list for discussing the syntax of the specification itself is at http://www.imc.org/atom-syntax/index.html.

One key difference between the development of RSS and the development of Atom is that Atom's whole design process is held out in the open, on the Atom-Syntax mailing list just mentioned and on the Atom wiki. The wiki (http://www.intertwingly.net/wiki/pie/FrontPage) is a great place to find the latest developments, issues, ideas, and pointers to the latest specification documents. It is well worth exploring, once you've finished reading.

7.1.1. The Structure of an Atom Feed

Because Atom is really two standards, one for syndication and one for the remote retrieval, creation, and editing of online resources (or, to put it more simply, a weblog API), an Atom document is deeply structured. The syndication format, our bailiwick here, defines two document formats: the Atom feed and the Atom entry.

An Atom feed is made up of zero or more Atom entries (although it will most probably have at least one entry), plus some additional metadata. An Atom entry is just that: one single indivisible piece of "content." This single indivisible piece of content is what gives Atom its name and is the key to understanding the whole Atom project.

7.1.1.1 The Atom entry

Before we get too excited, let's look at an Atom entry document (Example 7-1).

Example 7-1. An Atom entry document

<?xml version="1.0" encoding="utf-8"?>
<entry version="draft-ietf-atompub-format-05: do not deploy"
  xmlns="http://purl.org/atom/ns#draft-ietf-atompub-format-05">
  <title>Example Entry Document</title>
  <link
    rel="alternate"
    type="text/html"
    href="http://example.org/example_entry"
    hreflang="en"
    title="Example Entry Document" />
  <edit href="http://example.org/edit?title=example_entry" />
  <author>
    <name>Ben Hammersley</name>
    <uri>http://www.benhammersley.com</uri>
    <email>ben@benhammersley.com</email>
  </author>
  <contributor>
    <name>Albert Einstein</name>
    <uri>http://example.org/~al</uri>
    <email>BigAl@example.org</email>
  </contributor>
  <id>http://example.org/2004/12345678</id>
  <updated>2004-10-22T22:08:02Z</updated>
  <published>2004-10-22T20:19:02Z</published>
  <summary type="TEXT">This is an example of an Atom Entry Document.</summary>
  <content type="HTML">&lt;p&gt;&lt;em&gt;This&lt;/em&gt; is an example of an Atom Entry Document. It's rather nice, don't you think?&lt;/p&gt;</content>
  <copyright type="TEXT">This example of an Atom Entry Document is hereby granted into the Public Domain</copyright>
</entry>

I will address the finer details of the syntax in the next section. An Atom entry document contains, quite readably, a good deal of the information you could want to give about an Internet resource, plus the content itself. It doesn't contain any metadata about the meaning of the content (leaving that to RDF and RSS 1.0), but it does give all the information you might need to display the content and the first order of information about that content: who wrote it, and when, for example.

The Atom Publishing API uses this exact same format to transfer documents around. The ramifications of this architecture will be examined later in this chapter after we've looked over the format more thoroughly. For now, let's look at feeds.

7.1.1.2 Combining entries to make a feed

A feed, happily enough, is just a collection of entry documents, wrapped up with some additional information. Example 7-2 shows a single entry feed using the example entry in Example 7-1.

Example 7-2. An example Atom feed document

<?xml version="1.0" encoding="utf-8"?>
<feed version="draft-ietf-atompub-format-05: do not deploy"
  xmlns="http://purl.org/atom/ns#draft-ietf-atompub-format-05">
  <head>
    <title>An Example Feed</title>

    <link
      rel="alternate"
      type="text/html"
      href="http://example.org/index.html"
      hreflang="en"
      title="Example Page" />

    <introspection href="http://www.example.org/introspection.xml" />
    <post href="http://www.example.org/post" />

    <author>
      <name>Ben Hammersley</name>
      <uri>http://www.benhammersley.com</uri>
      <email>ben@benhammersley.com</email>
    </author>

    <contributor>
      <name>Albert Einstein</name>
      <uri>http://example.org/~al</uri>
      <email>BigAl@example.org</email>
    </contributor>

    <tagline>Two Atoms are Walking Down the Street.</tagline>
    <id>http://www.example.org/feed.xml</id>
    <generator
      uri="http://www.example.org/atomtool.html"
      version="1.0">Acme Atom Tool</generator>

    <copyright>Unless otherwise stated, this feed and its entries are all copyright example.org 2004, and may not be reused under any circumstances under pain of death.</copyright>

    <info>This is an example feed.</info>
    <updated>2004-10-22T22:08:02Z</updated>
  </head>
  <entry>
    <title>Example Entry Document</title>
    <link
      rel="alternate"
      type="text/html"
      href="http://example.org/example_entry"
      hreflang="en"
      title="Example Entry Document" />
    <edit href="http://example.org/edit?title=example_entry" />
    <author>
      <name>Ben Hammersley</name>
      <uri>http://www.benhammersley.com</uri>
      <email>ben@benhammersley.com</email>
    </author>
    <contributor>
      <name>Albert Einstein</name>
      <uri>http://example.org/~al</uri>
      <email>BigAl@example.org</email>
    </contributor>
    <id>http://example.org/2004/12345678</id>
    <updated>2004-10-22T22:08:02Z</updated>
    <published>2004-10-22T20:19:02Z</published>
    <summary type="TEXT">This is an example of an Atom Entry Document.</summary>
    <content type="HTML">&lt;p&gt;&lt;em&gt;This&lt;/em&gt; is an example of an Atom Entry Document. It's rather nice, don't you think?&lt;/p&gt;</content>
    <copyright type="TEXT">This example of an Atom Entry Document is hereby granted into the Public Domain</copyright>
  </entry>
</feed>

This too is simple enough. You have the entry document, changed only for the sake of XML syntax (moving up the namespace declaration), and the tiny issue of moving the version attribute to the root element. Other than that, it's unchanged. If there are more entries, they will just drop in below in a predictable manner, as you'll see later.
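To make the feed-wraps-entries structure concrete, here is a minimal sketch in Python that reads a cut-down version of the feed and lists its entries. It assumes the draft-05 namespace used in this chapter and the standard library's xml.etree.ElementTree parser; the function name summarize_feed and the trimmed sample document are illustrative only and are not part of any Atom tool.

```python
import xml.etree.ElementTree as ET

# Namespace from the draft-05 examples in this chapter; later drafts may differ.
ATOM_NS = "http://purl.org/atom/ns#draft-ietf-atompub-format-05"

# A trimmed-down feed: one head block of feed metadata, then the entries.
FEED = """<feed version="draft-ietf-atompub-format-05: do not deploy"
  xmlns="http://purl.org/atom/ns#draft-ietf-atompub-format-05">
  <head>
    <title>An Example Feed</title>
    <updated>2004-10-22T22:08:02Z</updated>
  </head>
  <entry>
    <title>Example Entry Document</title>
    <id>http://example.org/2004/12345678</id>
    <updated>2004-10-22T22:08:02Z</updated>
  </entry>
</feed>"""


def summarize_feed(xml_text):
    """Return the feed title and a list of (id, title) pairs, one per entry."""
    root = ET.fromstring(xml_text)
    ns = {"atom": ATOM_NS}
    feed_title = root.findtext("atom:head/atom:title", namespaces=ns)
    entries = [
        (
            entry.findtext("atom:id", namespaces=ns),
            entry.findtext("atom:title", namespaces=ns),
        )
        for entry in root.findall("atom:entry", ns)
    ]
    return feed_title, entries


if __name__ == "__main__":
    title, entries = summarize_feed(FEED)
    print(title)
    for entry_id, entry_title in entries:
        print(entry_id, entry_title)
```

Further entries would simply appear as additional items in the returned list, in document order.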

7.1.2. The Reusable Syntax of Constructs

Both types of document, feed and entry, are made up of standardized elements. Each element is blessed with content that has been organized into one of the options provided by the Reusable Syntax of Constructs. Apart from being a particularly good name for a modern jazz quintet, the idea behind the Reusable Syntax of Constructs is to make the discussion of elements, both established and proposed, much simpler.

All the elements in an Atom document, therefore, can be one of seven alternative Constructs: Text, Person, Date, Service, Link, Category, and Identity. Here's what they mean:

Text

Human-readable text. This may have a type attribute, set to either TEXT, HTML, or XHTML, denoting its format. If the attribute is missing, it is assumed to be TEXT. If you have entity-encoded markup in a Text construct without declaring it as HTML or XHTML, the application reading the feed will display tags literally and won't render it as if the application were a web browser. This ability to categorically state what the content actually is, is a significant difference and improvement over RSS 2.0.

If the type attribute is set to HTML, the markup must be entity-encoded like so: &lt;em&gt;this&lt;/em&gt;. If you use entity codes within the content itself, they need to be double-encoded. So, if you want to include some HTML code that displays an ampersand, it needs to be marked as &amp;amp; within the Atom element.

With the type attribute set to XHTML, the markup isn't entity-encoded but must be valid and well-formed. Tags must balance and close; if they don't, it throws out the entire document, so great care must be used here.
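The three encodings can be sketched as follows. This is only an illustration of the rules just described, using Python's standard html.escape and ElementTree; the helper name text_construct is hypothetical, and wrapping the XHTML fragment in a div purely to test that it is well-formed is an assumption of this sketch, not something the specification prescribes.

```python
from html import escape
import xml.etree.ElementTree as ET


def text_construct(name, value, text_type="TEXT"):
    """Serialize a Text construct element called `name` (hypothetical helper).

    TEXT and HTML values are entity-encoded once for the XML layer, so HTML
    that already contains &amp; comes out double-encoded as &amp;amp;.
    XHTML values are passed through unencoded, but only if they parse.
    """
    if text_type == "XHTML":
        # Raises xml.etree.ElementTree.ParseError if tags don't balance.
        ET.fromstring(f"<div>{value}</div>")
        return f'<{name} type="XHTML">{value}</{name}>'
    return f'<{name} type="{text_type}">{escape(value, quote=False)}</{name}>'


if __name__ == "__main__":
    print(text_construct("summary", "Fish & chips"))
    print(text_construct("content", "<em>Fish &amp; chips</em>", "HTML"))
    print(text_construct("content", "<em>Fish &amp; chips</em>", "XHTML"))
```

Rejecting a badly formed XHTML fragment before it goes into the feed is the cautious choice here, since one unbalanced tag would otherwise invalidate the whole document.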

Person

This construct describes a "person, corporation, or similar entity" according to the specification. It takes three subelements:

name is mandatory and should contain a human-readable name for the entity.
uri is optional, can occur only once if it occurs at all, and must be a standard URI associated with the entity.
email is optional, can only occur once, and must be a valid email address.

The Person construct can also be extended by any namespace-qualified subelements. We will deal with those in Chapter 11.
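As a rough illustration, the cardinality rules for a Person construct can be checked like this. The sketch assumes the draft-05 namespace from Example 7-1; check_person is a made-up name, and the code deliberately doesn't attempt to validate the URI or the email address themselves.

```python
import xml.etree.ElementTree as ET

# Namespace from the draft-05 examples in this chapter.
ATOM_NS = "http://purl.org/atom/ns#draft-ietf-atompub-format-05"

AUTHOR = f"""<author xmlns="{ATOM_NS}">
  <name>Ben Hammersley</name>
  <uri>http://www.benhammersley.com</uri>
  <email>ben@benhammersley.com</email>
</author>"""


def check_person(element):
    """Return a list of problems found in a Person construct element."""
    counts = {
        tag: len(element.findall(f"{{{ATOM_NS}}}{tag}"))
        for tag in ("name", "uri", "email")
    }
    problems = []
    if counts["name"] < 1:
        problems.append("name is mandatory")
    if counts["uri"] > 1:
        problems.append("uri can occur only once")
    if counts["email"] > 1:
        problems.append("email can occur only once")
    return problems


if __name__ == "__main__":
    print(check_person(ET.fromstring(AUTHOR)))
```

Namespace-qualified extension subelements are simply ignored by this check, which matches the rule that a Person construct may be extended.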

Date

The simplest construct. Its content is a date/time value conforming to RFC 3339, in the format YYYY-MM-DDTHH:MM:SS.ss+HH:MM. A value in UTC can use Z in place of the numeric offset, as in 2004-10-22T22:08:02Z.
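A minimal sketch of reading a Date construct in Python follows. It assumes the value is already a well-formed RFC 3339 timestamp and relies on the standard datetime.fromisoformat; the helper name parse_atom_date is hypothetical. The Z suffix is rewritten to +00:00 first because older Python versions don't accept it directly.

```python
from datetime import datetime, timezone


def parse_atom_date(value):
    """Parse an RFC 3339 timestamp into a timezone-aware datetime."""
    if value[-1] in "Zz":
        # "Z" is RFC 3339 shorthand for a zero offset from UTC.
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


if __name__ == "__main__":
    updated = parse_atom_date("2004-10-22T22:08:02Z")
    published = parse_atom_date("2004-10-22T20:19:02Z")
    print(updated.isoformat())
    print(published < updated)
```

Because the results carry their offsets, timestamps written in different time zones compare correctly as instants.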

Service

The Service construct is a single empty element with an attribute, href, that points to the endpoint of the Atom Publishing API service denoted by the name of the element. For example, the href attribute of the edit element of an Atom entry document points to the endpoint of the edit service for that entry. The href attribute of the post element of a feed document points to the post endpoint of that particular feed's installation.

Link

The Link construct is the most complicated of the constructs but perhaps the most interesting and powerful. It denotes a connection from the Atom document to another web resource. It has five attributes:

rel denotes what sort of relationship the link is. It's most commonly either alternate (for an alternatively formatted version of the same content) or related, and is optional. However, if it's left out, it is assumed to be alternate. I'll cover the many different types of link later in this chapter.
type indicates the media type of the resource. This is optional and is only to be taken as a hint. It doesn't override the media type the server returns with the resource. No amount of wishful thinking on behalf of the feed can make a text/plain resource into an audio/mpeg. Its value must be a registered media type as detailed in RFC 2045.
length indicates the size of the resource in octets. Again, like type, this is optional and is only to be taken as a hint: it doesn't override reality.
href is the URI the link points to and so is compulsory and must be a URI. xml:base processing must be applied to the content of this attribute, which means that if the Atom document has declared an xml:base attribute in its root element, this must be taken into account. The lack of an xml:base declaration, too, is significant: relative URIs are meaningless and wrong without one.
hreflang denotes the language of the resource found at the href. It's an optional attribute whose value must be a standard language tag as per RFC 3066. If this attribute is used with rel="alternate", it implies that the resource referenced is a translation of the document in hand.
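The xml:base rule for href can be sketched in a few lines of Python. This example assumes, as in the description above, that xml:base is declared only on the root element; a fuller implementation would also honor xml:base on nested elements. The function name resolve_links and the sample entry are illustrative, and the namespace is the draft-05 one used in this chapter.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

ATOM_NS = "http://purl.org/atom/ns#draft-ietf-atompub-format-05"
# ElementTree reports the xml:base attribute under the reserved XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

ENTRY = f"""<entry xmlns="{ATOM_NS}" xml:base="http://example.org/blog/">
  <link rel="alternate" type="text/html" href="entries/1.html" />
  <link rel="related" href="http://example.com/other" />
</entry>"""


def resolve_links(xml_text):
    """Return every link href in the document, resolved against the root xml:base."""
    root = ET.fromstring(xml_text)
    base = root.get(XML_BASE, "")
    return [
        urljoin(base, link.get("href"))
        for link in root.iter(f"{{{ATOM_NS}}}link")
    ]


if __name__ == "__main__":
    for href in resolve_links(ENTRY):
        print(href)
```

With no xml:base declared, a relative href comes back unchanged, which is exactly the meaningless case the text warns about.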

Category

The Category construct contains information that categorizes elements of an Atom document: the feed itself, or individual entries. It consists of three attributes:

term is a string that identifies the category of the parent element within whatever taxonomy you are operating. If you use the Category construct, you must have a term attribute present.
scheme is a URI that identifies a formal taxonomy within which the term attribute is found. It's optional.
label provides a human-readable label for the term attribute for display by end-user applications. It's optional and allows an element using the Category construct to provide both nicely readable categorization as well as references to more unfriendly formal categories, which might be written as, say, numeric codes.

Identity

This contains a URI to represent the construct's parent for its entire existence. It must be permanent and universally unique, and doesn't change. No matter what happens to that Atom document (whether it's relocated, migrated, syndicated, republished, exported or imported, updated, downgraded, abused, folded, spindled, or mutilated), an Identity construct is unwavering. It stands by its man. It doesn't change. We salute it.

You should also bear in mind that the Identity construct is a URI, as defined by RFC 2396. This means that it isn't a simple string but is its own datatype with its own rules. See Sidebar 7-1 for details.

Dealing with URIs in Atom

URIs are used in Atom to identify the feed or entry. They allow applications to keep track of things they have seen in order to flag new or changed content. Applications do this by comparing the URIs as strings. However, for various complex reasons (as detailed by Mark Pilgrim at http://www.xml.com/pub/a/2004/08/18/pilgrim.html), this isn't simple. To avoid these problems, Atom specifies that URIs must be normalized by the document's publisher before they are published. Most standard publishing packages already produce normalized URIs, but here are the rules as per RFC 2396bis:

Provide the scheme in lowercase characters: have http:// rather than HTTP://.
Provide the host, if any, in lowercase characters: have www.example.org instead of WWW.EXAMPLE.ORG.
Perform percent-encoding only where essential and use only uppercase A through F characters; decode all percent-encoded characters to their ASCII equivalents if they have any. If not, you should write them like %C3%87 rather than %c3%87 (that's a capital C with a cedilla, by the way).
For schemes that define a default authority, use an empty authority if the default is desired. Some URI schemes allow you to pass a username and password within the URI. HTTP, for example, allows http://username:password@www.example.org/ and rules that leaving them off entirely, and so accessing the resource as the default user, is the same as leaving them blank (http://@www.example.org/ is the same as http://www.example.org/). If this is the case, and you want to access the resource as the default user, leave off the authentication section entirely. So, use http://www.example.org/, not http://@www.example.org/.
For schemes that define an empty path to be equivalent to a path of /, use /. Don't use www.example.org as shorthand for www.example.org/. With some URI schemes, in some circumstances, the presence or absence of the trailing slash changes the meaning. So, the rule is if you can add the slash without changing the meaning of the path, you should always add it.
For schemes that define a port, use an empty port if the default is desired. As with the authentication, if you're using the default setting, leave it off entirely. So, instead of http://www.example.org:80/ or http://www.example.org:/, use http://www.example.org/.
Preserve empty fragment identifiers and queries. With URIs that represent query strings or fragments, you should keep them there, even if they are empty. So http://www.example.org/search?q=atom&x= should remain so; don't change to http://www.example.org/search?q=atom even if the resultant query is exactly the same when dereferenced.
Ensure that all portions of the URI are UTF-8-encoded NFC form Unicode strings. There are multiple ways to encode multibyte Unicode characters. Use the form known as Normalization Form C, or the "composed" version.

Sam Ruby has published a Python script to do all this at http://intertwingly.net/stories/2004/08/04/urlnorm.py.
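To show how the rules in this sidebar fit together, here is a simplified sketch in Python. It is not Sam Ruby's script and makes no claim to cover every case: the function name normalize_uri is hypothetical, the default-port table lists only http and https, and the empty-path rule is applied only to those two schemes. The URI is split with the generic regular expression from the URI specification rather than with urllib, so that empty queries and fragments survive untouched.

```python
import re
import unicodedata

# Generic URI split: scheme, authority, path, "?query", "#fragment".
URI_RE = re.compile(r"^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(\?[^#]*)?(#.*)?$")
DEFAULT_PORTS = {"http": "80", "https": "443"}  # assumption: only these two
UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def _fix_percent(match):
    """Decode escapes for unreserved ASCII; uppercase the hex of all others."""
    char = chr(int(match.group(1), 16))
    return char if char in UNRESERVED else "%" + match.group(1).upper()


def normalize_uri(uri):
    uri = unicodedata.normalize("NFC", uri)  # composed Unicode form
    scheme, authority, path, query, fragment = URI_RE.match(uri).groups()
    scheme = (scheme or "").lower()
    if authority is not None:
        userinfo, _, hostport = authority.rpartition("@")
        host, port = hostport, ""
        port_match = re.fullmatch(r"(.*):(\d*)", hostport)
        if port_match:
            host, port = port_match.groups()
        if port == DEFAULT_PORTS.get(scheme):
            port = ""  # default port: leave it off entirely
        # An empty userinfo ("http://@host/") is dropped along with its "@".
        authority = (
            (userinfo + "@" if userinfo else "")
            + host.lower()
            + (":" + port if port else "")
        )
        if not path and scheme in DEFAULT_PORTS:
            path = "/"  # an empty path is equivalent to "/"
    # Keep empty queries and fragments exactly as they were written.
    rest = path + (query or "") + (fragment or "")
    rest = re.sub(r"%([0-9A-Fa-f]{2})", _fix_percent, rest)
    result = scheme + ":" if scheme else ""
    if authority is not None:
        result += "//" + authority
    return result + rest


if __name__ == "__main__":
    print(normalize_uri("HTTP://@WWW.EXAMPLE.ORG:80"))
    print(normalize_uri("http://www.example.org/a%c3%87%7Eb?"))
```

A publisher would run something like this once, at publication time, so that readers can go on comparing identifiers as plain strings.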