XML and Metadata | Java P2P Unleashed

At IBM in 1974, Charles Goldfarb created the Standard Generalized Markup Language (SGML) to automate document processing in general. SGML is an international standard for the description of marked-up electronic text. In 1989 Tim Berners-Lee and Robert Cailliau at CERN used SGML to develop a flexible hypertext document markup language that they called HTML. Later, members of the W3C designed XML, a new markup language intended specifically for the Web and defined as a subset of SGML.

Because it keeps the powerful features of SGML and omits the more obscure ones, XML offers a simple and general syntax for describing hierarchical data; it is explained in greater detail in Chapter 3, "P2P Application Types." The SGML Editorial Board under the W3C officially developed XML in 1996. The original specification sets out the following goals:

It shall be straightforward to use XML over the Internet.
XML shall support a wide variety of applications.
XML shall be compatible with SGML.
It shall be easy to write programs that process XML documents.
The number of optional features in XML is to be kept to an absolute minimum, ideally zero.
XML documents should be human-legible and reasonably clear.
The XML design should be prepared quickly.
The design of XML shall be formal and concise.
XML documents shall be easy to create.

XML namespaces separate application-defined XML data from the special instructions and information used by XML extensions. These features make XML very useful in expressing metadata.

The Semantic Web: A Historical Perspective

At the beginning of this chapter, we discussed the W3C's vision of a Semantic Web. Layering a Web of metadata over the existing WWW would result in a comprehensive and coherent ontology of content and information. Today, hyperlinks do link pages across the WWW, but they carry no explicit context. To learn from the past for the benefit of the next generation of P2P frameworks, let's review some of these concepts.

An early effort to address some of the concepts of a Semantic Web was the Platform for Internet Content Selection (PICS). One of the driving forces behind this effort was the need to support a wide range of services for filtering and rating content.

The implementation used a simple metadata label like the following, which could capture a classification and rating:

    HTTP/1.0 200 OK
    Date: Thu, 30 Jun 1995 17:51:47 GMT
    Last-modified: Thursday, 29-Jun-95 17:51:47 GMT
    Protocol: {PICS-1.1 {headers PICS-Label}}
    PICS-Label:
     (PICS-1.1 "http://www.gcf.org/v2.5" labels
      on "1994.11.05T08:15-0500"
      exp "1995.12.31T23:59-0000"
      for "http://www.greatdocs.com/foo.html"
      by "George Sanderson, Jr."
      ratings (suds 0.5 density 0 color/hue 1))
    Content-type: text/html

    ...contents of foo.html...

Once a label was created, the developer would distribute it along with the document(s) in one of several ways. The recommended method, if your HTTP server allowed it, was to insert an extra HTTP header that preceded the contents of documents sent to Web browsers. The correct format as documented in the specifications was to include the two headers Protocol and PICS-Label, as seen in the previous example. The other method was to embed the label directly in the HTML, as follows:

    <HTML>
    <head>
      <META http-equiv="PICS-Label" content='
        (PICS-1.1 "http://www.gcf.org/v2.5"
          labels on "1994.11.05T08:15-0500"
            until "1995.12.31T23:59-0000"
            for "http://www.greatdocs.com/foo.html"
            ratings (suds 0.5 density 0 color/hue 1))
      '>
    </head>
    ...
    </HTML>
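To make the label structure concrete, here is a minimal sketch of how a client might pull the rating pairs out of a label like the ones shown above. The class name PicsRatings and the parseRatings helper are hypothetical, and the regular expression handles only the simple "ratings (name value ...)" form used in these examples, not the full PICS-1.1 grammar.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any PICS library.
final class PicsRatings {

    // Matches the body of the "ratings ( ... )" clause of a label.
    private static final Pattern RATINGS =
        Pattern.compile("ratings\\s*\\(([^)]*)\\)");

    // Returns category -> value pairs in the order they appear in the label.
    static Map<String, Double> parseRatings(String label) {
        Map<String, Double> ratings = new LinkedHashMap<>();
        Matcher m = RATINGS.matcher(label);
        if (!m.find()) {
            return ratings; // no ratings clause present
        }
        String[] tokens = m.group(1).trim().split("\\s+");
        // Tokens alternate: category name, then numeric value.
        for (int i = 0; i + 1 < tokens.length; i += 2) {
            ratings.put(tokens[i], Double.parseDouble(tokens[i + 1]));
        }
        return ratings;
    }

    public static void main(String[] args) {
        String label = "(PICS-1.1 \"http://www.gcf.org/v2.5\" labels "
            + "on \"1994.11.05T08:15-0500\" "
            + "for \"http://www.greatdocs.com/foo.html\" "
            + "ratings (suds 0.5 density 0 color/hue 1))";
        System.out.println(parseRatings(label));
    }
}
```

A filtering client could then compare each returned value against thresholds chosen by the user. What the categories (suds, density, color/hue) mean is defined by the rating service named in the label, not by the parser.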

PICS defined a detailed set of classifications and ratings, but they were tightly coupled to specific applications. Although one of its goals was to enable different groups to create their own content rating vernacular, it lacked namespaces, which prevented a label from referencing or borrowing from multiple independent vocabularies. Because of this and other issues, the group decided to preserve the work that they had done in the areas of vocabularies and security in the protocol and led the formation of the Resource Description Framework (RDF). For more information on PICS see http://www.w3.org/PICS/.

RDF is a data model and XML syntax for type description that enables definition of the relationship between two hyperlinked resources. RDF leverages the vocabulary and query protocols from PICS to relate unique resources via unique identifiers. The unique identifier is a key notion in RDF and in the world of metadata. Using unique identifiers, together with specific means of describing them, solves some of the challenges that PICS was unable to overcome. RDF deployments use a common semantic set to identify all elements as resources, regardless of their origin, and use Uniform Resource Identifiers (URIs) to name them. In the WWW, the Uniform Resource Locator (URL) is a type of URI that describes the location and retrieval of a resource. The URI is more generic and can identify a resource that is not retrievable.
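The difference between a URI and a URL can be seen with the standard java.net classes. The following sketch is only an illustration: the class name UriVersusUrl, the isLocator helper, and the urn:isbn value are hypothetical. It relies on the fact that a stock Java runtime has no protocol handler for the urn scheme, so the identifier is a valid URI but cannot be converted into a retrievable URL.

```java
import java.net.MalformedURLException;
import java.net.URI;

// Hypothetical illustration of "every URL is a URI, but not the reverse".
final class UriVersusUrl {

    // True if the URI can also be turned into a java.net.URL.
    static boolean isLocator(URI uri) {
        try {
            uri.toURL(); // builds a URL object only; no network access
            return true;
        } catch (MalformedURLException | IllegalArgumentException e) {
            // No protocol handler (for example "urn:") or not absolute.
            return false;
        }
    }

    public static void main(String[] args) {
        URI page = URI.create("http://www.greatdocs.com/foo.html");
        URI book = URI.create("urn:isbn:0000000000"); // made-up identifier
        System.out.println(page + " locator? " + isLocator(page));
        System.out.println(book + " locator? " + isLocator(book));
    }
}
```

Both values are perfectly good identifiers for an RDF resource; only the first also says where the resource can be fetched.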

Beyond the world of resources lies the notion of properties. Also a URI, the property describes the elements of a resource, as well as its relationships with other resources. Like resources, properties can take a plurality of definitions and meanings, which provides immense flexibility. By explicitly separating the data from the meaning of the data, RDF provides a consistent layer of abstraction that is defined by the developer, rather than by a standards body.
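One way to picture resources and properties is as a list of three-part statements: a resource, a property, and a value. The sketch below is a deliberately small in-memory model written for this discussion. The TripleSketch class, its methods, and the http://example.org/vocab# property names are all hypothetical, and a real RDF toolkit offers far more than this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model; real toolkits offer far richer APIs.
final class TripleSketch {

    // One statement: resource (subject), property, value (object).
    static final class Triple {
        final String resource;
        final String property;
        final String value;

        Triple(String resource, String property, String value) {
            this.resource = resource;
            this.property = property;
            this.value = value;
        }
    }

    private final List<Triple> triples = new ArrayList<>();

    void add(String resource, String property, String value) {
        triples.add(new Triple(resource, property, value));
    }

    // Returns every value recorded for the given resource and property.
    List<String> valuesOf(String resource, String property) {
        List<String> values = new ArrayList<>();
        for (Triple t : triples) {
            if (t.resource.equals(resource) && t.property.equals(property)) {
                values.add(t.value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String vocab = "http://example.org/vocab#"; // made-up vocabulary
        String doc = "http://www.greatdocs.com/foo.html";
        TripleSketch model = new TripleSketch();
        model.add(doc, vocab + "author", "George Sanderson, Jr.");
        model.add(doc, vocab + "language", "en-us");
        System.out.println(model.valuesOf(doc, vocab + "author"));
    }
}
```

Because both the resource and the property are identified by URIs, two developers can describe the same document with different vocabularies without their property names colliding.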

In RDF, the Resource Description Framework Schema (RDFS) language models class and property hierarchies, as well as other primitives from the RDF model. Figure 8.1 displays RDF semantics and relationships. RDFS defines a schema that RDF documents can be checked against for consistency. RDFS support for modeling ontological concepts and relations is quite basic, which places it at the low end of expressiveness. RDFS does not attempt to answer every knowledge representation problem; rather, it provides an extensible core language. This model for managing content and resources for the Web can be applied to P2P, as the challenges and issues are quite similar. For example, if the content being exchanged leveraged RDF, searching and discovery of content across a P2P landscape would likely be significantly more effective.

Figure 8.1. The RDF model.


Dublin Core Metadata Initiative

As we discussed earlier, in the mid-1990s a team set out to pick up where PICS left off by defining a set of elements that could describe any resource available online. By focusing on the requirements common to many kinds of resources, the group intentionally avoided trying to define the semantics for specific instances. The group eventually defined 15 core metadata elements. The following is a typical example of Dublin Core metadata in HTML; <META> tags encapsulate elements from the Dublin Core:

    <html>
      <head>
        <title>John's Book Page</title>
        <meta name="description" content="This book is about...">
        <meta name="subject" content="Computing">
        <meta name="creator" content="John Smith">
        <meta name="date" content="2002-01-22T00:10:00+00:00">
        <meta name="type" content="html">
        <meta name="language" content="en-us">
      </head>
      ...
    </html>

Because HTML itself places no control over the names used in <META> tags, there is the potential for ambiguity. With the Dublin Core, we can also use XML namespaces to enable the mixing of descriptive elements that are defined by individual groups without running into name or label conflicts. Because each piece of data is linked to a URI, the entity behind that URI provides its context and definition.

In the following example using RDF, you can see that the Dublin Core elements have a prefix of dc:, which is associated with the http://purl.org/dc/elements/1.1/ namespace as indicated in the header section of the XML document:

    <?xml version="1.0" encoding="iso-8859-1"?>
    <rdf:RDF
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns="http://purl.org/rss/1.0/"
    >
    ...
      <item rdf:about="http://www.foo.com/.../metadata.html">
        <title>Book</title>
        <link>http://www.foo.com/.../book.html</link>
        <dc:description>This book is about...</dc:description>
        <dc:subject>computing</dc:subject>
        <dc:creator>John Smith</dc:creator>
        <dc:date>2002-01-22T10:34:00+00:00</dc:date>
        <dc:type>book</dc:type>
        <dc:language>en-us</dc:language>
        <dc:format>text/html</dc:format>
        ...
      </item>
    </rdf:RDF>

The <dc:subject> tag equates with the subject element in the dc namespace as defined at the URI http://purl.org/dc/elements/1.1/. As a result, namespaces enable you to take the generic framework provided by the Dublin Core Metadata Initiative and combine it with additional semantics for specific resources.
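The way a consumer separates Dublin Core elements from the rest of a document can be sketched with the namespace-aware DOM parser that ships with Java. The class name DublinCoreReader and the readDublinCore helper are hypothetical, and the sample document in main is a shortened, made-up variant of the example above. The lookup is by namespace URI rather than by the dc: prefix, so it still works if a document binds the same namespace to a different prefix.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only.
final class DublinCoreReader {

    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Collects local name -> text for every element in the Dublin Core namespace.
    static Map<String, String> readDublinCore(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // namespace support is off by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagNameNS(DC_NS, "*");
            Map<String, String> fields = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node n = nodes.item(i);
                fields.put(n.getLocalName(), n.getTextContent().trim());
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'"
            + " xmlns:dc='http://purl.org/dc/elements/1.1/'"
            + " xmlns='http://purl.org/rss/1.0/'>"
            + "<item rdf:about='http://example.org/metadata.html'>"
            + "<title>Book</title>"
            + "<dc:subject>computing</dc:subject>"
            + "<dc:creator>John Smith</dc:creator>"
            + "</item></rdf:RDF>";
        System.out.println(readDublinCore(xml));
    }
}
```

The title element is skipped because it belongs to the default namespace of the document, not to the Dublin Core namespace.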

A document that uses the Dublin Core does not need to incorporate all 15 elements that it defines. However, the more elements a document includes, the better defined its metadata. For P2P frameworks, this might reduce network resource usage, as fewer and more relevant results might be returned for a specific peer's query.
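The effect on query results can be illustrated with a small matching routine. This is only a sketch under simple assumptions: each shared resource is described by a flat map of element names to values, a query is a map of required values, and the MetadataQuery class with its describe, matches, and search methods is hypothetical rather than part of any P2P framework.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not taken from any P2P framework.
final class MetadataQuery {

    // Builds a metadata record from alternating name/value arguments.
    static Map<String, String> describe(String... pairs) {
        Map<String, String> record = new HashMap<>();
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            record.put(pairs[i], pairs[i + 1]);
        }
        return record;
    }

    // A record matches when it carries every field named in the query
    // with the same value (ignoring case).
    static boolean matches(Map<String, String> record, Map<String, String> query) {
        for (Map.Entry<String, String> wanted : query.entrySet()) {
            if (!wanted.getValue().equalsIgnoreCase(record.get(wanted.getKey()))) {
                return false;
            }
        }
        return true;
    }

    // Returns the shared records that satisfy the query.
    static List<Map<String, String>> search(List<Map<String, String>> shared,
                                            Map<String, String> query) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> record : shared) {
            if (matches(record, query)) {
                hits.add(record);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Map<String, String>> shared = new ArrayList<>();
        shared.add(describe("subject", "computing", "language", "en-us",
                            "creator", "John Smith"));
        shared.add(describe("subject", "computing", "language", "fr"));
        shared.add(describe("subject", "cooking"));

        // A broad query matches two records; adding a field narrows it to one.
        System.out.println(search(shared, describe("subject", "computing")).size());
        System.out.println(search(shared,
            describe("subject", "computing", "language", "en-us")).size());
    }
}
```

A record that omits an element can never match a query on that element, which is one reason richer metadata tends to make a peer's content easier to find as well as easier to filter.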

Other Standards

Ontology Interchange Language (OIL) is a representation and inference layer for ontologies to represent semantics of information on the WWW. OIL synthesizes work from three different communities to provide a general-purpose markup language for the Semantic Web. These areas include the following:

Frame-based systems: A model of using classes and properties (modeling primitives)
Description logics: Describing knowledge as concepts (semantics)
XML and RDF Web standards: Syntax

OIL is an early ontology representation language that is grounded in W3C standards such as RDF and XML. For more information, see http://www.ontoknowledge.org/oil.

The DARPA Agent Markup Language (DAML) is another emerging standard that extends RDF, XML, and OIL. Because it includes support for bounded lists, class expressions, equivalence, and formal semantics, many architects are looking to DAML to solve complex metadata problems. For more information on DAML see http://www.daml.org.

Java and RDF: HP's Jena Toolkit

Jena is a Java API for manipulating RDF models that has been released into the open source community by HP Labs. Their vision of standardized representations for data (RDF) and the conceptual structures behind that data (RDFS, DAML) has been realized in the publication of Jena, an open source toolkit that implements these standards (http://www.hpl.hp.com/semWeb/download.html). The toolkit provides:

Statement-centric methods for manipulating an RDF model
Resource-centric methods for manipulating an RDF model as a set of resources with properties
Cascading method calls for more convenient programming
Built-in support for RDF containers
Enhanced resources: The application can extend the behavior of resources
Integrated parsers

To illustrate the Jena API, Listing 8.1 is a class that creates an RDF model and writes it out. The example builds a small model that stores a person's full name, given name, and family name using the vCard vocabulary.

Listing 8.1 Sample from HP Jena Toolkit

    import com.hp.hpl.mesa.rdf.jena.mem.ModelMem;
    import com.hp.hpl.mesa.rdf.jena.model.*;
    import com.hp.hpl.mesa.rdf.jena.vocabulary.*;
    import java.io.PrintWriter;

    public class RDFGenerator {
        public static void main(String[] args) {
            // some definitions
            String personURI    = "http://somewhere/JohnSmith";
            String givenName    = "John";
            String familyName   = "Smith";
            String fullName     = givenName + " " + familyName;
            try {
                // create an empty graph
                Model model = new ModelMem();
                // create the resource
                //   and add the properties cascading style;
                //   the inner createResource() with no URI is a blank node
                Resource johnSmith
                    = model.createResource(personURI)
                           .addProperty(VCARD.FN, fullName)
                           .addProperty(VCARD.N,
                                        model.createResource()
                                             .addProperty(VCARD.Given, givenName)
                                             .addProperty(VCARD.Family, familyName));
                // now write the model in XML form to standard output
                model.write(new PrintWriter(System.out));
            } catch (Exception e) {
                System.out.println("Failed: " + e);
            }
        }
    }

The code above produces the following RDF output:

    <rdf:RDF
      xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
      xmlns:vcard='http://www.w3.org/2001/vcard-rdf/3.0#'
    >
      <rdf:Description rdf:about='http://somewhere/JohnSmith'>
        <vcard:FN>John Smith</vcard:FN>
        <vcard:N rdf:resource='#A0'/>
      </rdf:Description>
      <rdf:Description rdf:about='#A0'>
        <vcard:Given>John</vcard:Given>
        <vcard:Family>Smith</vcard:Family>
      </rdf:Description>
    </rdf:RDF>

The RDF specifications define how to represent RDF as XML, and that representation appears above. Let's examine the RDF output more closely, starting with the <rdf:RDF> element. This element is optional, and here it declares the two namespaces used in the document. The <rdf:Description> element describes the resource whose URI is 'http://somewhere/JohnSmith'. If the rdf:about attribute were missing, this element would represent a blank node. The <vcard:FN> element describes a property of the resource; the property name is FN in the vcard namespace. RDF converts this name to a URI reference by concatenating the namespace URI bound to the prefix with FN, the local part of the name. In this example, this gives a URI reference of 'http://www.w3.org/2001/vcard-rdf/3.0#FN'. More detail on the XML representation of RDF can be found at www.w3.org/TR/rdf-primer.
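The concatenation step can be shown with the standard javax.xml.namespace.QName class, which holds a namespace URI, a local part, and a prefix. The PropertyUri class and its toUriReference helper are hypothetical names used only for this sketch.

```java
import javax.xml.namespace.QName;

// Hypothetical helper showing how a qualified name becomes a URI reference.
final class PropertyUri {

    // Joins the namespace URI and the local name, as RDF does for property names.
    static String toUriReference(QName name) {
        return name.getNamespaceURI() + name.getLocalPart();
    }

    public static void main(String[] args) {
        QName fn = new QName("http://www.w3.org/2001/vcard-rdf/3.0#", "FN", "vcard");
        System.out.println(toUriReference(fn));
        // prints http://www.w3.org/2001/vcard-rdf/3.0#FN
    }
}
```

The prefix plays no part in the result, so two documents that bind different prefixes to the same namespace URI still name the same property.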