Infosets

While we're on the subject of creating XML documents, it's worth looking at another XML specification: the XML Information Set specification, which you'll find at www.w3.org/TR/xml-infoset.

XML documents excel at storing data, and this has led developers to wonder whether XML will ultimately be able to solve an old problem: directly comparing and classifying the data in multiple documents. Consider the World Wide Web as it stands today: There can be thousands of documents on a particular topic, but how can you possibly compare them? For example, a search for the term XML turns up millions of matches, but it would be extraordinarily difficult to write a program that compares the data in those documents, because that data isn't stored in any common, compatible format.

The idea behind XML information sets, also called infosets, is to set up an abstract way of looking at an XML document so that it can be compared to others. To have an infoset, XML documents may not use colons in tag and attribute names unless they are used to support namespaces. Documents do not need to be valid to have an infoset, but they need to be well formed.
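To make the well-formedness requirement concrete, here is a minimal sketch in Python. It assumes the standard library's xml.dom.minidom module, whose underlying parser is non-validating, so it checks only whether a document is well formed and says nothing about validity. The function name is_well_formed is hypothetical and chosen just for this illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, False otherwise.

    No DTD validation happens here: a document can pass this check
    without being valid, which is all an infoset requires.
    """
    try:
        minidom.parseString(xml_text)
    except ExpatError:
        return False
    return True


# A document with no DTD at all can still be well formed.
print(is_well_formed("<doc><item>Hi</item></doc>"))

# A missing end tag breaks well-formedness, so no infoset exists.
print(is_well_formed("<doc><item>Hi</doc>"))
```

The first call should report a well-formed document and the second should not, because the item element is never closed.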

An XML document's information set consists of two or more information items (the information set for any well-formed XML document contains at least the document information item and one element information item). An information item is an abstract representation of some part of an XML document, and each information item has a set of properties, some of which are considered core and some of which are considered peripheral.

An XML information set can contain 15 different types of information items:

  • A document information item (core)

  • Element information items (core)

  • Attribute information items (core)

  • Processing instruction information items (core)

  • Reference to skipped entity information items (core)

  • Character information items (core)

  • Comment information items (peripheral)

  • A document type definition information item (peripheral)

  • Entity information items (core for unparsed entities, peripheral for others)

  • Notation information items (core)

  • Entity start marker information items (peripheral)

  • Entity end marker information items (peripheral)

  • CDATA start marker information items (peripheral)

  • CDATA end marker information items (peripheral)

  • Namespace declaration information items (core)
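One way to get a feel for these item types is to walk a parsed document and tally what turns up. The following is only a rough sketch: it assumes Python's xml.dom.minidom, and the DOM is not the infoset, so the mapping from DOM node types to information items is an approximation. The names NODE_TO_ITEM, count_items, and SAMPLE are made up for this illustration, and the sample deliberately avoids namespace declarations, which the DOM would report as ordinary attributes.

```python
from collections import Counter
from xml.dom import minidom, Node

# Approximate mapping from DOM node types to infoset item types (an assumption).
NODE_TO_ITEM = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
    Node.COMMENT_NODE: "comment",
    Node.TEXT_NODE: "character",
    Node.CDATA_SECTION_NODE: "character",
    Node.DOCUMENT_TYPE_NODE: "document type definition",
}


def count_items(xml_text):
    """Tally the approximate information items found in xml_text."""
    counts = Counter()

    def visit(node):
        label = NODE_TO_ITEM.get(node.nodeType)
        if label == "character":
            # The infoset has one character item per character of text.
            counts[label] += len(node.data)
        elif label:
            counts[label] += 1
        if node.nodeType == Node.ELEMENT_NODE:
            # Attributes are not child nodes in the DOM; count them separately.
            counts["attribute"] += node.attributes.length
        for child in node.childNodes:
            visit(child)

    visit(minidom.parseString(xml_text))
    return counts


SAMPLE = ('<?xml version="1.0"?><!-- note -->'
          '<doc id="1"><?target data?><item>Hi</item></doc>')

for item_type, total in sorted(count_items(SAMPLE).items()):
    print(item_type, total)
```

Note that the XML declaration is not reported as a processing instruction, so only the target instruction inside the doc element is counted.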

There is always one document information item in the information set. Here's a list of the core properties of the document information item:

  • [children] This property holds an ordered list of references to child information items, in the original document order.

  • [notations] This property holds an unordered set of references to notation information items (which we'll see more about in the next chapter).

  • [entities] This property holds an unordered set of references to entity information items, one for each unparsed entity declaration in the DTD.

The document information item can also have these properties:

  • [base URI] This property holds the absolute URI of the document entity.

  • [children - comments] This property holds a reference to a comment information item for each comment outside the document element.

  • [children - doctype] This property holds a reference to one document type definition information item.

  • [entities - other] This property holds a reference to an entity information item for each parsed general entity declaration in the DTD.
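The properties above can be approximated with a DOM parser. This is a sketch under stated assumptions, not an infoset implementation: it uses Python's xml.dom.minidom and reads the notations and entities that the parser records on the document type node. The function name document_item_properties and the sample document SAMPLE_DTD, including its gif notation and its logo and copy entities, are hypothetical.

```python
from xml.dom import minidom

# Hypothetical sample with one notation, one unparsed and one parsed entity.
SAMPLE_DTD = """<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ELEMENT doc (#PCDATA)>
  <!NOTATION gif SYSTEM "image/gif">
  <!ENTITY logo SYSTEM "logo.gif" NDATA gif>
  <!ENTITY copy "(c)">
]>
<doc>text</doc>"""


def document_item_properties(xml_text):
    """Approximate the document information item's properties from a DOM."""
    doc = minidom.parseString(xml_text)
    notations, unparsed, parsed = set(), set(), set()
    doctype = doc.doctype
    if doctype is not None:
        for i in range(len(doctype.notations)):
            notations.add(doctype.notations.item(i).nodeName)
        for i in range(len(doctype.entities)):
            entity = doctype.entities.item(i)
            # Only unparsed entities carry a notation name (NDATA).
            if entity.notationName is not None:
                unparsed.add(entity.nodeName)
            else:
                parsed.add(entity.nodeName)
    return {
        # [children] is ordered; the DOM includes the doctype node here.
        "children": [child.nodeType for child in doc.childNodes],
        "notations": notations,       # unordered set
        "entities": unparsed,         # unparsed entities only
        "entities - other": parsed,   # parsed general entities
        # [base URI] is omitted: a string has no document entity URI.
    }


props = document_item_properties(SAMPLE_DTD)
for name, value in props.items():
    print(name, value)
```

Sets are used for [notations] and [entities] because the specification describes them as unordered, while [children] stays a list to preserve document order.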

The other information items, such as element information items and processing instruction information items, have similar lists of properties.

Currently, no applications create and work with infosets. However, W3C documentation often refers to the information stored in an XML document as its infoset, so it's an important term to know. The closest you can come to working with infosets right now is to work with canonical XML documents (see the next topic).



Real World XML (2nd Edition)
ISBN: 0735712867
Year: 2005
Pages: 440
Authors: Steve Holzner
