Appendix A. The XML You Need for RSS



The purpose of this appendix is to introduce you to XML. A knowledge of XML is essential if you want to write RSS documents directly, rather than having them generated by some utility. If you're already acquainted with XML, you don't need to read this appendix. If not, read on.

The general overview of XML given in this appendix should be more than sufficient to enable you to work with RSS documents. For further information about XML, the O'Reilly books Learning XML and XML in a Nutshell are invaluable guides, as is the weekly online magazine XML.com.

Note that this appendix makes frequent reference to the formal XML 1.0 specification, which can be used for further investigation of topics that fall outside the scope of RSS. Readers are also directed to the "Annotated XML Specification," written by Tim Bray and published online at http://XML.com, which provides an illuminating explanation of the XML 1.0 specification, and "What is XML?" by Norm Walsh, also published on XML.com.


    A.1. What Is XML?

    XML (Extensible Markup Language) is an Internet-friendly format for data and documents, invented by the World Wide Web Consortium (W3C). Markup denotes a way of expressing the structure of a document within the document itself. XML has its roots in a markup language called SGML (Standard Generalized Markup Language), which is used in publishing; XML shares this heritage with HTML. XML was created to do for machine-readable documents on the Web what HTML did for human-readable documents: that is, provide a commonly agreed-upon syntax, so that processing the underlying format becomes a commodity and documents are made accessible to all users.

    Unlike HTML, though, XML comes with very little predefined. HTML developers are accustomed to both the notion of using angle brackets (<>) for denoting elements (that is, syntax) and also the set of element names themselves (such as head, body, etc.). XML shares only the former feature (i.e., the notion of using angle brackets for denoting elements). Unlike HTML, XML has no predefined elements but is merely a set of rules that lets you write other languages like HTML. (To clarify XML's relationship with SGML: XML is an SGML subset. In contrast, HTML is an SGML application. RSS uses XML to express its operations and thus is an XML application.)

    Because XML defines so little, it is easy for everyone to agree to use the XML syntax and then build applications on top of it. It's like agreeing to use a particular alphabet and set of punctuation symbols, but not saying which language to use. However, if you come to XML from an HTML background (and have an interest in extending RSS), you may need to prepare yourself for the shock of having to choose what to call your tags!

    Knowing that XML's roots lie with SGML should help you understand some of XML's features and design decisions. Note that although SGML is essentially a document-centric technology, XML's functionality also extends to data-centric applications, including RSS. Commonly, data-centric applications don't need all the flexibility and expressiveness that XML provides and limit themselves to employing only a subset of XML's functionality.


      A.2. Anatomy of an XML Document

      The best way to explain how an XML document is composed is to present one. The following example shows an XML document you might use to describe two authors:

      <?xml version="1.0" encoding="us-ascii"?>
      <authors>
          <person id="lear">
              <name>Edward Lear</name>
              <nationality>British</nationality>
          </person>
          <person id="asimov">
              <name>Isaac Asimov</name>
              <nationality>American</nationality>
          </person>
          <person id="mysteryperson"/>
      </authors>

      The first line of the document is known as the XML declaration. This tells a processing application which version of XML you are using (the version indicator is mandatory) and which character encoding you have used for the document. In this example, the document is encoded in ASCII. (The significance of character encoding is covered later in this appendix.)

      If the XML declaration is omitted, a processor makes certain assumptions about your document. In particular, it expects it to be encoded in UTF-8, an encoding of the Unicode character set. However, it is best to use the XML declaration wherever possible, both to avoid confusion over the character encoding and to indicate to processors which version of XML you're using.

      A.2.1. Elements and Attributes

      The second line of the example begins an element, which has been named authors. The contents of that element include everything between the right angle bracket (>) in <authors> and the left angle bracket (<) in </authors>. The actual syntactic constructs <authors> and </authors> are often referred to as the element start tag and end tag, respectively. Don't confuse tags with elements! Note that elements may include other elements, as well as text. An XML document must contain exactly one root element, which contains all other content within the document. The name of the root element defines the type of the XML document.

      Elements that contain both text and other elements simultaneously are classified as mixed content. RSS doesn't generally use mixed content.

      The sample authors document uses elements named person to describe the authors themselves. Each person element has an attribute named id. Unlike elements, attributes can contain only textual content. Their values must be surrounded by quotes. Either single quotes (') or double quotes (") may be used, as long as you use the same kind of closing quote as the opening one.

      Within XML documents, attributes are frequently used for metadata (i.e., "data about data"), describing properties of the element's contents. This is the case in our example, where id contains a unique identifier for the person being described.
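      To make the terms concrete, here is a minimal Python sketch that reads the sample authors document with the standard library's xml.etree.ElementTree module and reports the root element name, the id attributes, and the text of each name element. The id values "lear" and "asimov" are assumptions chosen for illustration; only "mysteryperson" is named in the text.

```python
import xml.etree.ElementTree as ET

# The sample authors document. The id values "lear" and "asimov" are
# illustrative assumptions; "mysteryperson" is the id named in the text.
AUTHORS_XML = """<?xml version="1.0" encoding="us-ascii"?>
<authors>
    <person id="lear">
        <name>Edward Lear</name>
        <nationality>British</nationality>
    </person>
    <person id="asimov">
        <name>Isaac Asimov</name>
        <nationality>American</nationality>
    </person>
    <person id="mysteryperson"/>
</authors>
"""

# Hand the parser bytes so that it can honor the encoding declaration.
root = ET.fromstring(AUTHORS_XML.encode("ascii"))

root_name = root.tag                                          # root element name
person_ids = [p.get("id") for p in root.findall("person")]    # attribute values
names = [p.findtext("name") for p in root.findall("person")]  # child element text

print(root_name)
print(person_ids)
print(names)
```

      The last person element has no name child, so findtext reports None for it rather than an empty string.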

      As far as XML is concerned, the order in which attributes are presented in the element start tag doesn't matter. For example, these two elements contain the same information, as far as an XML 1.0-conformant processing application is concerned:

      <animal name="dog" legs="4"/>
      <animal legs="4" name="dog"/>

      On the other hand, the information presented to an application by an XML processor after reading the following two lines is different for each animal element, because the ordering of elements is significant:

      <animal><name>dog</name><legs>4</legs></animal>
      <animal><legs>4</legs><name>dog</name></animal>

      XML treats a set of attributes like a bunch of stuff in a bag (there is no implicit ordering), while elements are treated like items on a list, where ordering matters.
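      The difference can be observed from a program. The following Python sketch, which assumes the standard library's xml.etree.ElementTree parser, compares the two attribute-based animal elements and then the two element-based ones. The helper names attributes and child_names are hypothetical.

```python
import xml.etree.ElementTree as ET

def attributes(xml_text):
    """Return the root element's attributes as a dict (an unordered mapping)."""
    return dict(ET.fromstring(xml_text).attrib)

def child_names(xml_text):
    """Return the names of the root element's children, in document order."""
    return [child.tag for child in ET.fromstring(xml_text)]

# Attribute order does not change what the application sees.
same_attrs = (attributes('<animal name="dog" legs="4"/>')
              == attributes('<animal legs="4" name="dog"/>'))

# Element order does.
first = child_names("<animal><name>dog</name><legs>4</legs></animal>")
second = child_names("<animal><legs>4</legs><name>dog</name></animal>")

print(same_attrs)
print(first, second)
```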

      New XML developers frequently ask when it is best to use attributes to represent information and when it is best to use elements. As you can see from the authors example, if order is important to you, then elements are a good choice. In general, there is no hard-and-fast "best practice" for choosing whether to use attributes or elements.

      The final author described in our document has no information available. All that's known about this person is his or her id, mysteryperson. The document uses the XML shortcut syntax for an empty element. The following is a reasonable alternative:

      <person id="mysteryperson"></person>
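      As a quick check that the two spellings are interchangeable, this Python sketch (using the standard library's xml.etree.ElementTree) parses both forms and compares what the parser reports. The summary helper is a hypothetical name introduced for the comparison.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<person id="mysteryperson"/>')
long_form = ET.fromstring('<person id="mysteryperson"></person>')

def summary(element):
    """Collect what the parser reports: name, attributes, text, child count."""
    return (element.tag, dict(element.attrib), element.text, len(element))

forms_equivalent = summary(short_form) == summary(long_form)
print(forms_equivalent)
```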

      A.2.2. Name Syntax

      XML 1.0 has certain rules about element and attribute names. In particular:

      • Names are case-sensitive: e.g., <person/> isn't the same as <Person/>.

      • Names beginning with xml (in any permutation of uppercase or lowercase) are reserved for use by XML 1.0 and its companion specifications.

      • A name must start with a letter or an underscore, not a digit, and may continue with any letter, digit, underscore, hyphen, or period. (Actually, a name may also contain a colon, but the colon is used to delimit a namespace prefix and isn't available for arbitrary use as of the second edition of XML 1.0. Knowledge of namespaces isn't required for understanding RSS, but for more information, see Tim Bray's "XML Namespaces by Example," published at http://www.xml.com/pub/a/1999/01/namespaces.html.)

      A precise description of names can be found in Section 2.3 of the XML 1.0 specification at http://www.w3.org/TR/REC-xml#sec-common-syn.
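      One low-effort way to test a candidate name is to let a parser decide. The sketch below assumes Python's standard xml.etree.ElementTree module; the function names are hypothetical, and names containing a colon are deliberately left out because the parser would treat the part before the colon as a namespace prefix.

```python
import xml.etree.ElementTree as ET

def parses_as_element_name(name):
    """Return True if the parser accepts <name/> as a complete document."""
    try:
        ET.fromstring("<" + name + "/>")
        return True
    except ET.ParseError:
        return False

def tags_match(start_name, end_name):
    """Return True if <start_name></end_name> parses, i.e. the names match."""
    try:
        ET.fromstring("<%s></%s>" % (start_name, end_name))
        return True
    except ET.ParseError:
        return False

for candidate in ["person", "_private", "item.2", "2items"]:
    print(candidate, parses_as_element_name(candidate))

# Case sensitivity: <person> cannot be closed by </Person>.
print(tags_match("person", "Person"))
```

      This checks syntax only; it does not tell you whether a name beginning with xml is reserved.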

      A.2.3. Well-Formedness

      An XML document that conforms to the rules of XML syntax is said to be well-formed. At its most basic level, well-formedness means that elements should be properly matched, and all opened elements should be closed. A formal definition of well-formedness can be found in Section 2.1 of the XML 1.0 specification at http://www.w3.org/TR/REC-xml#sec-well-formed. Table A-1 shows some XML documents that aren't well-formed.

      Table A-1. Examples of poorly formed XML documents

      Document

      Reason it's not well-formed

      <foo>
        <bar>
        </foo>
      </bar>

      The elements aren't properly nested, because foo is closed while inside its child element bar.

      <foo>
        <bar>
      </foo>

      The bar element was not closed before its parent, foo, was closed.

      <foo baz>
      </foo>

      The baz attribute has no value. While this is permissible in HTML (e.g., <table border>), it is forbidden in XML.

      <foo baz=23>
      </foo>

      The baz attribute value, 23, has no surrounding quotes. Unlike HTML, all attribute values must be quoted in XML.
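      The documents in Table A-1 can be tried against a real parser. This is a minimal sketch that assumes Python's standard xml.etree.ElementTree module, which raises ParseError for input that isn't well-formed; the is_well_formed helper and the labels in BROKEN are hypothetical names.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a ParseError."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# The four documents from Table A-1, written on one line each.
BROKEN = {
    "improper nesting": "<foo><bar></foo></bar>",
    "unclosed child": "<foo><bar></foo>",
    "attribute without value": "<foo baz></foo>",
    "unquoted attribute value": "<foo baz=23></foo>",
}

results = {label: is_well_formed(doc) for label, doc in BROKEN.items()}
repaired = is_well_formed('<foo baz="23"><bar></bar></foo>')

for label, ok in results.items():
    print(label, ok)
print("repaired", repaired)
```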


      A.2.4. Comments

      As in HTML, it is possible to include comments within XML documents. XML comments are intended to be read only by people. With HTML, developers have occasionally employed comments to add application-specific functionality. For example, the server-side include functionality of most web servers uses instructions embedded in HTML comments. XML provides other ways to indicate application-processing instructions. A discussion of processing instructions (PIs) is outside the scope of this book. For more information on PIs, see Section 2.6 of the XML 1.0 specification at http://www.w3.org/TR/REC-xml#sec-pi. Comments should not be used for any purpose other than those for which they were intended.

      The start of a comment is indicated with <!--, and the end of the comment is indicated with -->. Any sequence of characters, aside from the string --, may appear within a comment. Comments tend to be used more in XML documents intended for human consumption than those intended for machine consumption. Comments aren't widely used in RSS.
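      The following Python sketch illustrates both points: a comment is not handed to the application as content, and the string -- inside a comment makes the document poorly formed. It assumes the default behavior of the standard library's xml.etree.ElementTree parser, which discards comments; the channel document is a made-up fragment.

```python
import xml.etree.ElementTree as ET

FEED = """<channel>
  <!-- reminder: update the title before publishing -->
  <title>Example channel</title>
</channel>"""

root = ET.fromstring(FEED)
child_names = [child.tag for child in root]   # the comment is not among them

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a ParseError."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# A double hyphen inside the comment body is not allowed.
double_hyphen_ok = is_well_formed("<channel><!-- bad -- comment --></channel>")

print(child_names)
print(double_hyphen_ok)
```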

      A.2.5. Entity References

      Another feature of XML that is occasionally useful when writing RSS documents is the mechanism for escaping characters.

      Because some characters have special significance in XML, there needs to be a way to represent them. For example, in some cases the < symbol might really be intended to mean "less than" rather than to signal the start of an element name. Clearly, just inserting the character without any escaping mechanism will result in a poorly formed document, because a processing application assumes you are starting another element. Another instance of this problem is the need to include both double quotes and single quotes simultaneously in an attribute's value. Here's an example that illustrates both difficulties:

      <badDoc>
        <para>
          I'd really like to use the < character
        </para>
        <note title="On the proper 'use' of the " character"/>
      </badDoc>

      XML avoids this problem by using the predefined entity reference. The word entity in the context of XML simply means a unit of content. The term entity reference means just that: a symbolic way of referring to a certain unit of content. XML predefines entities for the following symbols: left angle bracket (<), right angle bracket (>), apostrophe ('), double quote ("), and ampersand (&).

      An entity reference is introduced with an ampersand (&), which is followed by a name (using the word "name" in its formal sense, as defined by the XML 1.0 specification) and terminated with a semicolon (;). Table A-2 shows how the five predefined entities can be used within an XML document.

      Table A-2. Predefined entity references in XML 1.0

      Literal character    Entity reference
      <                    &lt;
      >                    &gt;
      '                    &apos;
      "                    &quot;
      &                    &amp;


      Here's the problematic document, revised to use entity references:

      <badDoc>
        <para>
          I'd really like to use the &lt; character
        </para>
        <note title="On the proper &apos;use&apos; of the &quot; character"/>
      </badDoc>

      XML 1.0 allows you to define your own entities and use entity references as shortcuts in your document, but the predefined entities are often all you need for RSS or Atom; in general, entities are provided as a convenience for human-created XML. Section 4 of the XML 1.0 specification, available at http://www.w3.org/TR/REC-xml#sec-physical-struct, describes the use of entities.
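      When a program generates the XML, the escaping is usually delegated to a library rather than done by hand. The sketch below uses escape and quoteattr from Python's standard xml.sax.saxutils module to build the revised document and then parses it back to confirm that the original text survives. The element name goodDoc is a hypothetical choice.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

para_text = "I'd really like to use the < character"
title_text = "On the proper 'use' of the \" character"

doc = "<goodDoc><para>%s</para><note title=%s/></goodDoc>" % (
    escape(para_text),      # replaces &, < and > with entity references
    quoteattr(title_text),  # adds the surrounding quotes and escapes as needed
)

root = ET.fromstring(doc)
recovered_para = root.findtext("para")
recovered_title = root.find("note").get("title")

print(doc)
print(recovered_para)
print(recovered_title)
```

      Note that quoteattr returns the value with its quotes already in place, which is why the format string has no quotes around the title attribute.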

      A.2.6. Character References

      You may find character references in the context of RSS documents. Character references allow you to denote a character by its numeric position in the Unicode character set (this position is known as its code point). Table A-3 contains a few examples that illustrate the syntax.

      Table A-3. Example character references

      Actual character    Character reference
      0                   &#48;
      A                   &#65;
      Ñ                   &#xD1;
      ®                   &#xAE;


      Note that the code point can be expressed in decimal or, with the use of x as a prefix, in hexadecimal.
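      A parser resolves character references for you, and producing them is a matter of looking up the code point. The Python sketch below assumes the standard library's xml.etree.ElementTree module; char_reference is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET

# The four references from Table A-3, resolved by the parser.
decoded = ET.fromstring("<chars>&#48;&#65;&#xD1;&#xAE;</chars>").text

def char_reference(character, hexadecimal=False):
    """Build a character reference from the character's Unicode code point."""
    code_point = ord(character)
    if hexadecimal:
        return "&#x%X;" % code_point
    return "&#%d;" % code_point

print([hex(ord(ch)) for ch in decoded])
print(char_reference("A"))
print(char_reference("\u00ae", hexadecimal=True))
```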

      A.2.7. Character Encodings

      The subject of character encodings is frequently a mysterious one for developers. Most code tends to be written for one computing platform and, normally, to run within one organization. Although the Internet is changing things quickly, most of us have never had cause to think too deeply about internationalization.

      XML, designed to be an Internet-friendly syntax for information exchange, has internationalization at its very core. One of the basic requirements for XML processors is that they support the Unicode standard character encoding. Unicode attempts to include the requirements of all the world's languages within one character set. Consequently, it is very large!

      A.2.7.1 Unicode encoding schemes

      Unicode 3.0 has more than 57,700 code points, each of which corresponds to a character. (You can obtain charts of all these characters online by visiting http://www.unicode.org/charts/.) If you were to express a Unicode string using the position of each character in the character set as its encoding (in the same way as ASCII does), expressing the whole range of characters would require four octets for each character. (An octet is a string of eight binary digits, or bits. A byte is commonly, but not always, considered the same thing as an octet.) Clearly, if a document is written in 100% American English, it would be four times larger than required, since all the characters in ASCII fit into a 7-bit representation. This places a strain on both storage space and on memory requirements for processing applications.

      Fortunately, two encoding schemes for Unicode alleviate this problem: UTF-8 and UTF-16. As you might guess from their names, applications can process documents in these encodings in 8- or 16-bit segments at a time. When code points are required in a document that can't be represented by one chunk, a bit-pattern indicates that the following chunk is required to calculate the desired code point. In UTF-8, this is denoted by the most significant bit of the first octet being set to 1.

      This scheme means that UTF-8 is a highly efficient encoding for representing languages using Latin alphabets, such as English. All of the ASCII character set is represented natively in UTF-8; an ASCII-only document and its equivalent in UTF-8 are identical byte for byte.

      This knowledge will also help you debug encoding errors. One frequent error arises because ASCII is a proper subset of UTF-8; programmers get used to this fact and produce UTF-8 documents but use them as if they were ASCII. Things start to go awry when the XML parser processes a document containing, for example, characters such as Á. Because this character can't be represented using only one octet in UTF-8, a two-octet sequence is produced in the output document; in a non-Unicode viewer or text editor, it looks like a couple of characters of garbage.
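      These claims are easy to verify with Python's built-in codecs. In this sketch, the utf-32-be codec stands in for the fixed-width, four-octet scheme described earlier, and iso-8859-1 plays the part of the non-Unicode viewer; both choices are assumptions made for the demonstration.

```python
ascii_text = "RSS feed"

# An ASCII-only string has identical bytes in ASCII and in UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# A fixed-width encoding spends four octets even on a plain letter.
four_octets = len("A".encode("utf-32-be"))

# U+00C1, LATIN CAPITAL LETTER A WITH ACUTE, needs two octets in UTF-8,
# and the most significant bit of the first octet is set.
a_acute = "\u00c1"
utf8_bytes = a_acute.encode("utf-8")
first_octet_high_bit = bool(utf8_bytes[0] & 0x80)

# Reading those two octets as Latin-1 yields two unrelated characters.
misread = utf8_bytes.decode("iso-8859-1")

print(same_bytes, four_octets)
print(utf8_bytes, first_octet_high_bit)
print(len(misread))
```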

      A.2.7.2 Other character encodings

      Unicode, in the context of computing history, is a relatively new invention. Native operating system support for Unicode is by no means widespread. For instance, although Windows NT offers Unicode support, Windows 95 and 98 don't.

      XML 1.0 allows a document to be encoded in any character set registered with the Internet Assigned Numbers Authority (IANA). European documents are commonly encoded in one of the ISO Latin character sets, such as ISO-8859-1. Japanese documents normally use Shift-JIS, and Chinese documents use GB2312 and Big 5.

      A full list of registered character sets can be found at http://www.iana.org/assignments/character-sets.

      XML processors aren't required by the XML 1.0 specification to support any more than UTF-8 and UTF-16, but most commonly support other encodings, such as US-ASCII and ISO-8859-1. Although most RSS transactions are currently conducted in ASCII (or the ASCII subset of UTF-8), there is nothing to stop RSS documents from containing, say, Korean text. However, you will probably have to dig into the encoding support of your computing platform to find out if it is possible for you to use alternate encodings.
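      To see an alternate encoding at work, the following Python sketch serializes the same element in ISO-8859-1 and in UTF-8 and parses both back. It assumes the standard library's xml.etree.ElementTree module, which writes an XML declaration naming the encoding when the encoding is not UTF-8 or US-ASCII; the sample name is made up.

```python
import xml.etree.ElementTree as ET

element = ET.Element("name")
element.text = "Jos\u00e9 Mart\u00ed"   # contains two non-ASCII characters

# Same document, two different byte sequences.
latin1_bytes = ET.tostring(element, encoding="iso-8859-1")
utf8_bytes = ET.tostring(element, encoding="utf-8")

# The parser reads the declaration (or assumes UTF-8) and recovers the text.
from_latin1 = ET.fromstring(latin1_bytes).text
from_utf8 = ET.fromstring(utf8_bytes).text

print(latin1_bytes)
print(utf8_bytes)
print(from_latin1 == from_utf8)
```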

      A.2.8. Validity

      In addition to well-formedness, XML 1.0 offers another level of verification called validity. To explain why validity is important, let's take a simple example. Imagine you invented a simple XML format for your friends' telephone numbers:

      <phonebook>
        <person>
          <name>Albert Smith</name>
          <number>123-456-7890</number>
        </person>
        <person>
          <name>Bertrand Jones</name>
          <number>456-123-9876</number>
        </person>
      </phonebook>

      Based on your format, you also construct a program to display and search your phone numbers. This program turns out to be so useful, you share it with your friends. However, your friends aren't so hot on detail as you are, and they try to feed your program this phone book file:

      <phonebook>
        <person>
          <name>Melanie Green</name>
          <phone>123-456-7893</phone>
        </person>
      </phonebook>

      Note that, although this file is perfectly well-formed, it doesn't fit the format you prescribed for the phone book, and you find you need to change your program to cope with this situation. If your friends had used number as you did to denote the phone number, and not phone, there wouldn't have been a problem. However, as it is, this second file isn't a valid phonebook document.
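      Without any validation machinery, the program itself has to check the structure. The sketch below is one hand-written way to do that in Python with the standard library's xml.etree.ElementTree; is_valid_phonebook is a hypothetical function, and it is a stand-in for validation rather than an implementation of it.

```python
import xml.etree.ElementTree as ET

YOUR_FILE = """<phonebook>
  <person><name>Albert Smith</name><number>123-456-7890</number></person>
  <person><name>Bertrand Jones</name><number>456-123-9876</number></person>
</phonebook>"""

FRIENDS_FILE = """<phonebook>
  <person><name>Melanie Green</name><phone>123-456-7893</phone></person>
</phonebook>"""

def is_valid_phonebook(xml_text):
    """Check the expected shape: phonebook > person > (name, number)."""
    root = ET.fromstring(xml_text)   # raises ParseError if not well-formed
    if root.tag != "phonebook":
        return False
    for person in root:
        if person.tag != "person":
            return False
        if [child.tag for child in person] != ["name", "number"]:
            return False
    return True

print(is_valid_phonebook(YOUR_FILE))
print(is_valid_phonebook(FRIENDS_FILE))
```

      Every program that reads the format would need its own copy of this logic, which is the duplication a shared, machine-readable definition of the format avoids.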

      A.2.8.1 Document type definitions (DTDs)

      For validity to be a useful general concept, we need a machine-readable way of saying what a valid document is: that is, which elements and attributes must be present and in what order. XML 1.0 achieves this by introducing document type definitions (DTDs). For the purposes of RSS, you don't need to know much about DTDs. Rest assured that RSS does have a DTD, and it spells out in detail exactly which combinations of elements and attributes make up a valid document.

      The purpose of a DTD is to express the allowed elements and attributes in a certain document type and to constrain the order in which they must appear within that document type. A DTD is generally composed of one file, which contains declarations defining the element types and attribute lists. (In theory, a DTD may span more than one file; however, the mechanism for including one file inside another, parameter entities, is outside the scope of this book.) It is common to mistakenly conflate elements and element types. The distinction is that an element is the actual instance of the structure as found in an XML document, whereas the element type is the kind of element that the instance belongs to.

      A.2.9. Putting It Together

      If you want to validate RSS against a DTD, you need to know how to link a document to its defining DTD. This is done with a document type declaration, <!DOCTYPE ...>, inserted at the beginning of the XML document, after the XML declaration in our fictitious example:

      <?xml version="1.0" encoding="us-ascii"?>
      <!DOCTYPE authors SYSTEM "http://example.com/authors.dtd">
      <authors>
          <person id="lear">
              <name>Edward Lear</name>
              <nationality>British</nationality>
          </person>
          <person id="asimov">
              <name>Isaac Asimov</name>
              <nationality>American</nationality>
          </person>
          <person id="mysteryperson"/>
      </authors>

      This example assumes the DTD file has been placed on a web server at example.com. Note that the document type declaration specifies the root element of the document, not the DTD itself. You can use the same DTD to define person, name, or nationality as the root element of a valid document. Certain DTDs, such as the DocBook DTD for technical documentation (see http://www.docbook.org), use this feature to good effect, allowing you to provide the same DTD for multiple document types.

      A validating XML processor is obligated to check the input document against its DTD. If it doesn't validate, the document is rejected. To return to the phone book example, if your application validated its input files against a phone book DTD, you would have been spared the problems of debugging your program and correcting your friend's XML, because your application would have rejected the document as being invalid. While some of the programs that read RSS files do worry about validation, most don't.
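      The difference between validating and non-validating processors shows up quickly in practice. In the Python sketch below, the phone book DTD is a hypothetical one written for this illustration and placed inside the document type declaration for brevity. The standard library's xml.etree.ElementTree parser is non-validating, so it reads the declarations but accepts the friend's file anyway.

```python
import xml.etree.ElementTree as ET

# A hypothetical phone book DTD, embedded in the document type declaration.
FRIENDS_FILE = """<?xml version="1.0"?>
<!DOCTYPE phonebook [
  <!ELEMENT phonebook (person*)>
  <!ELEMENT person (name, number)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT number (#PCDATA)>
]>
<phonebook>
  <person>
    <name>Melanie Green</name>
    <phone>123-456-7893</phone>
  </person>
</phonebook>"""

# No error is raised: only well-formedness is checked.
root = ET.fromstring(FRIENDS_FILE)
person_children = [child.tag for child in root.find("person")]

print(person_children)
```

      A validating processor given the same input would reject it, because the DTD declares no phone element and requires number inside person.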

      A.2.10. XML Namespaces

      XML 1.0 lets developers create their own elements and attributes, but it leaves open the potential for overlapping names. "Title" in one context may mean something entirely different than "Title" in a different context. The "Namespaces in XML" specification (which can be found at http://www.w3.org/TR/REC-xml-names) provides a mechanism developers can use to identify particular vocabularies using URIs.

      RSS 1.0 uses the URI http://purl.org/rss/1.0/ for its base namespace. The URI is just an identifier; opening that page in a web browser reveals some links to the RSS, XML 1.0, and Namespaces in XML specifications. Programs processing documents with multiple vocabularies can use the namespaces to figure out which vocabulary they are handling at any given point in a document.
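      Here is a small Python sketch of how a program tells two vocabularies apart. It assumes the standard library's xml.etree.ElementTree module, which reports each name together with its namespace URI in the form {uri}name. The document is a made-up fragment, not a complete RSS 1.0 feed, and the Dublin Core namespace is used only as an example of a second vocabulary.

```python
import xml.etree.ElementTree as ET

RSS_NS = "http://purl.org/rss/1.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"   # Dublin Core, the second vocabulary

# A made-up fragment with two title elements from different vocabularies.
DOC = """<item xmlns="http://purl.org/rss/1.0/"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>Release notes</title>
  <dc:title>Notes de version</dc:title>
</item>"""

root = ET.fromstring(DOC)

# Each name is reported with its namespace URI, so the two titles differ.
qualified_names = [child.tag for child in root]
rss_title = root.findtext("{%s}title" % RSS_NS)
dc_title = root.findtext("{%s}title" % DC_NS)

print(qualified_names)
print(rss_title)
print(dc_title)
```

      The prefix itself (dc here) never reaches the application; only the URI it is bound to does.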

      Namespaces are very simple on the surface but are a well-known field of combat in XML arcana. For more information on namespaces, see O'Reilly's XML in a Nutshell or Learning XML. The use of namespaces in RSS is discussed in much greater detail in Chapters 6 and 7.