5.1. Metadata in RSS 2.0
As all good tutorials on the subject will tell you, metadata is data about data. In the case of RSS 2.0, this includes the name of the author of the feed, the date the channel was last updated, and so on. In Example 5-1, the bold code is the metadata. You can remove this data, and the feed itself will still both parse and be useful when displayed as HTML. Like a Hitchcock cameo, the metadata is in the background, silent, but meaningful to those who can see it.
Example 5-1. The metadata within an RSS 2.0 feed
<rss version="2.0">
  <channel>
    <title>RSS2.0 Example</title>
    <link>http://www.oreilly.com/example/index.html</link>
    <description>This is an example RSS2.0 feed</description>
    <language>en-gb</language>
    <copyright>Copyright 2004, O'Reilly Media, Inc.</copyright>
    <managingEditor>email@example.com</managingEditor>
    <webMaster>firstname.lastname@example.org</webMaster>
    <pubDate>03 Apr 04 1500 GMT</pubDate>
    <lastBuildDate>03 Apr 04 1500 GMT</lastBuildDate>
    <docs>http://backend.userland.com/rss091</docs>
    <skipDays>
      <day>Monday</day>
    </skipDays>
    <skipHours>
      <hour>20</hour>
    </skipHours>
    <cloud domain="http://www.oreilly.com" port="80" path="/RPC2" registerProcedure="pleaseNotify" protocol="XML-RPC" />
    <image>
      <title>RSS0.91 Example</title>
      <url>http://www.oreilly.com/example/images/logo.gif</url>
      <link>http://www.oreilly.com/example/index.html</link>
      <width>88</width>
      <height>31</height>
      <description>The World's Leading Technical Publisher</description>
    </image>
    <textInput>
      <title>Search</title>
      <description>Search the Archives</description>
      <name>query</name>
      <link>http://www.oreilly.com/example/search.cgi</link>
    </textInput>
    <item>
      <title>The First Item</title>
      <link>http://www.oreilly.com/example/001.html</link>
      <description>This is the first item.</description>
      <source url="http://www.anothersite.com/index.xml">Another Site</source>
      <enclosure url="http://www.oreilly.com/001.mp3" length="54321" type="audio/mpeg"/>
      <category domain="http://www.dmoz.org">Business/Industries/Publishing/Publishers/Nonfiction/</category>
    </item>
    <item>
      <title>The Second Item</title>
      <link>http://www.oreilly.com/example/002.html</link>
      <description>This is the second item.</description>
      <source url="http://www.anothersite.com/index.xml">Another Site</source>
      <enclosure url="http://www.oreilly.com/002.mp3" length="54321" type="audio/mpeg"/>
      <category domain="http://www.dmoz.org">Business/Industries/Publishing/Publishers/Nonfiction/</category>
    </item>
  </channel>
</rss>
With this sort of simple metadata, written in the grammar of RSS 2.0's XML format, we are describing simple statements. Take the first line of metadata, for example, which focuses on the language aspect:
<channel> ... <language>en-gb</language> ... </channel>
Here is the language element with a value of en-gb. The language element is a subelement of channel, so a simple translation of the XML into English reads, "The object called channel has a subelement called language whose value is en-gb."
This phrase is grammatically and semantically correct, but it lacks a certain poetry. (The use of the term "object" is bound to confuse people when they have their programmer's hat on.) Here's a rewrite into something a little more friendly: "The channel's language is en-gb."
Now that's more like it. It's a statement of fact from the metadata: "The language of the channel is British English."
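The translation from markup to statement is mechanical enough to sketch in a few lines of Python. This is only an illustration, assuming the standard library's xml.etree.ElementTree and a cut-down channel; the helper name channel_statement is made up for this example and isn't part of any RSS toolkit.

```python
import xml.etree.ElementTree as ET

# A cut-down channel containing only what this illustration needs.
FEED = """<rss version="2.0">
  <channel>
    <title>RSS2.0 Example</title>
    <language>en-gb</language>
  </channel>
</rss>"""


def channel_statement(feed_xml, element_name):
    """Turn one channel subelement into an English statement.

    Hypothetical helper: the sentence is built purely from the element's
    name, its text value, and its position as a child of <channel>.
    """
    channel = ET.fromstring(feed_xml).find("channel")
    value = channel.findtext(element_name)
    return f"The channel's {element_name} is {value}."


print(channel_statement(FEED, "language"))
# prints: The channel's language is en-gb.
```

Note that the code never learns what "language" or "en-gb" mean; it only rearranges strings. The reader supplies the meaning.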
So far, so easy, you say. Well, you're quite right; metadata is all about making statements. With the simple metadata present in RSS 2.0, we do it all the time:
<language>en-gb</language>
<copyright>Copyright 2004, O'Reilly Media, Inc.</copyright>
<managingEditor>email@example.com</managingEditor>
<webMaster>firstname.lastname@example.org</webMaster>
<pubDate>03 Apr 04 1500 GMT</pubDate>
<lastBuildDate>03 Apr 04 1500 GMT</lastBuildDate>
From this section, you can see the feed is in English; it is copyright 2004, O'Reilly Media, Inc.; the managing editor is email@example.com, and so on.
You will notice, alas, that all isn't perfect with this syntax. For example, the managing editor is defined as email@example.com. To you and me, it is obvious that this is an email address for a person, and we can act accordingly, but to a machine (a search engine, for example) it is a general email address at best and just a string at worst. Either way, the machine can't tell anything at all about the managing editor. Herein lies a classic problem.
Let's recap. The simple metadata found in RSS 2.0 makes a simple statement based on its element, the element's value, and the place of the element within the document. We know the language element refers to the channel that is one level above it within the XML document. We also know that in the example, the value of language is en-gb, and by understanding what the element and its value mean, we can make the statement that the channel is written in British English.
Going back to childhood grammar classes, it's apparent that this is a simple subject/predicate/object sentence: the channel (subject) has the language (predicate) en-gb (object).
This sort of statement is called a triple. Remember this word: you'll need it later. Now, these simple triples work well for most things within RSS 2.0, but they somewhat limit you to raw data values: things such as dates and language codes that are unambiguous and easily understood. They don't help one bit when you're talking about abstract concepts, such as subjects, or when referring to other entities, such as people. Plus, and this is key, without human interaction, the combination of an arbitrary element name, value, and position within the document is meaningless. If you disregard the ability to read English, you can't tell what any of the element names refer to, and you can't understand their values. As it stands, RSS 2.0's metadata can't be understood by machines, and the triples it does contain, though elegant, are limited when you take the human out of the equation. Without machine comprehension, you lose a great deal of potential utility from RSS feeds.
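To see how little a machine actually gets from these triples, here is a rough sketch that reads them out of a channel. It assumes Python's standard xml.etree.ElementTree and a shortened copy of the example feed; channel_triples is a name invented for this illustration.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the channel metadata from the example feed.
FEED = """<rss version="2.0">
  <channel>
    <language>en-gb</language>
    <managingEditor>email@example.com</managingEditor>
    <pubDate>03 Apr 04 1500 GMT</pubDate>
  </channel>
</rss>"""


def channel_triples(feed_xml):
    """Return (subject, predicate, object) tuples for simple channel elements.

    Hypothetical helper: the subject is always the literal word "channel",
    the predicate is the element name, and the object is the element's text.
    """
    channel = ET.fromstring(feed_xml).find("channel")
    triples = []
    for child in channel:
        # Keep only leaf elements with text; wrappers such as <image>
        # hold further elements rather than a single value.
        if len(child) == 0 and child.text and child.text.strip():
            triples.append(("channel", child.tag, child.text.strip()))
    return triples


for triple in channel_triples(FEED):
    print(triple)
```

Every part of every tuple is a bare string. Nothing in the output says which channel is meant, what managingEditor signifies, or that its value is an address you could write to.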
To start rectifying this situation, we need to define exactly what every word in the statement means. To do this, let's study the Uniform Resource Identifier (URI).
5.1.1. Using URIs in RSS
A URI is a string of characters using a particular syntax that identifies a resource. This resource can be anything that has an identity, whether it is tangible or not: a person, a book, a standard, a web site, a service, an email address, and so on. For example:
You'll notice that these look very similar to URLs, the standard hyperlinks. You're right; URLs are a subset of URIs. There are, however, some major differences between the two.
Primarily, even though many URIs are named after, and closely resemble, network-contactable URLs, this doesn't mean that the resources they identify are retrievable via that network method: a person can be represented by a URI that looks like a URL, but pointing a browser at it doesn't retrieve the person. A concept (the XML standard, for example) has its own URI that starts with http://, but typing it into your address bar won't make your computer understand the XML standard. In fact, there is a whole debate, which need not detain us in this book, over exactly what the URI represents: the thing, or a representation of the thing. It gets very philosophical and can be very interesting, but it's also well beyond the scope of actual useful code.
So, a URI simply provides a unique identifier for the resource, whatever it is. Granted, wherever possible, the URI gives you something useful (documentation on the resource, usually) if it is treated like a URL, but this isn't necessary.
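The point that a URI identifies rather than locates can be sketched with Python's standard urllib.parse. The first two URIs below are taken from this chapter's examples; the mailto form of the email address is an assumption made for illustration, and uri_scheme and looks_like_url are made-up helper names.

```python
from urllib.parse import urlsplit

URIS = [
    "http://www.oreilly.com/example/index.html",  # identifies a web page
    "http://purl.org/rss/1.0/modules/rss091#",    # identifies a vocabulary
    "mailto:email@example.com",                   # identifies a mailbox (assumed form)
]


def uri_scheme(uri):
    """Return the scheme, the part of the URI before the first colon."""
    return urlsplit(uri).scheme


def looks_like_url(uri):
    """Crude check on appearance only.

    An http scheme says the identifier resembles a URL; it does not say
    that fetching it returns the resource being identified.
    """
    return uri_scheme(uri) in {"http", "https", "ftp"}


for uri in URIS:
    print(uri_scheme(uri), looks_like_url(uri), uri)
```

The second URI passes the check even though what it names is a set of definitions, not a document you were promised; the identifier is unique either way, which is all the metadata needs.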
Now, by allowing resources to be defined, we can make our metadata more robust. Let's reconsider the managingEditor example:
<managingEditor>email@example.com</managingEditor>
At the moment, we can't make any form of definitive statement about this, except what can be understood from being able to read English. We can't say for sure what managingEditor actually means (what context is this in?), nor can we understand what the value denotes. Is it an email address you can freely contact, or is it something else? You just can't tell.
If you assign URIs to each resource in this statement, you can give it more meaning:
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:RSS091="http://purl.org/rss/1.0/modules/rss091#"
    xmlns:rss="http://purl.org/rss/1.0/">
  <rss:channel rdf:about="http://www.example.org/example.rss">
    <RSS091:managingEditor>email@example.com</RSS091:managingEditor>
  </rss:channel>
</rdf:RDF>
This example introduces a few more concepts, which we'll discuss in the next section. In the meantime, if you look at the emphasized code, you'll see that the channel gains a URI, denoted by the rdf:about="" attribute, and the managingEditor element becomes RSS091:managingEditor.
This immediately gives more context to the metadata. For one, the channel is uniquely defined. Second, the managingEditor element is associated with a concept of RSS091, which itself is given a URI to identify it uniquely. Third, the concept of a channel is associated with its own URI. From this information, you can make the following assertion: the channel identified by http://www.example.org/example.rss has a managingEditor, as defined by http://purl.org/rss/1.0/modules/rss091#, whose value is email@example.com.
Because you can know what the managingEditor element means in the context of the resource represented by the URI http://purl.org/rss/1.0/modules/rss091# (it's the guy in charge of the site the feed is from, but you'll have to wait until Chapter 6 to see why), you can now understand what the statement means. Even better than that, you can start to make definitive statements about the metadata within a document, and hence about the document itself. We, and machines too, can definitively state that the managing editor of this feed has the email address email@example.com, because we've defined all the terms we are using. There is no ambiguity as to what each phrase means or to what it refers.
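As a rough check that the terms really are pinned down, the sketch below reads the namespaced example and prints the statement with every name expanded to a full URI. It assumes Python's standard xml.etree.ElementTree, which is a plain XML parser and not an RDF parser, so this is a hand-rolled reading of one simple document; qualified_triples is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

RDF_DOC = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:RSS091="http://purl.org/rss/1.0/modules/rss091#"
    xmlns:rss="http://purl.org/rss/1.0/">
  <rss:channel rdf:about="http://www.example.org/example.rss">
    <RSS091:managingEditor>email@example.com</RSS091:managingEditor>
  </rss:channel>
</rdf:RDF>"""

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS_NS = "http://purl.org/rss/1.0/"


def qualified_triples(rdf_xml):
    """Return (subject URI, predicate URI, value) for each channel property."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for channel in root.findall(f"{{{RSS_NS}}}channel"):
        # The rdf:about attribute carries the URI that identifies the channel.
        subject = channel.get(f"{{{RDF_NS}}}about")
        for prop in channel:
            # ElementTree stores tags as "{namespace-uri}localname"; joining
            # the two parts gives a full URI for the property.
            namespace, local = prop.tag[1:].split("}", 1)
            value = (prop.text or "").strip()
            triples.append((subject, namespace + local, value))
    return triples


for triple in qualified_triples(RDF_DOC):
    print(triple)
```

Compared with the bare element names of RSS 2.0, the subject and the predicate are now globally unique identifiers, so two programs that have never met can agree on which channel and which property the statement concerns.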
You probably noticed the additional lines of code within the example. This was your first look at RDF. The rest of this chapter deals with RDF, so let's take a look at it in some detail.