Current Representations of Metadata

If you examine the vast world of information, you'll find that many examples of metadata management exist, particularly in the storage space. Relational databases use metadata to manage how data is represented, stored, and retrieved. Beyond the enterprise, you discover that the majority of content exchanged is either structured in text files with delimiters or not structured at all. The delimited, semistructured text files typically have no metadata associated with them; they rely instead on the sender and recipient implicitly understanding the contents and semantics of the files. As we will discuss in Chapter 9, "Integration and Interoperability," EDI fundamentally works on the notion that the sender and the receiver explicitly use a certain structure based on a template for the data.

On the other hand, if you look at a text document without structure, there is no mechanism by which the context or concepts of the information can be inferred. This has led to the use of linguistic and heuristic techniques to build metadata from raw content; several companies have built products that apply these techniques to unstructured text. But what is needed is a nonproprietary ASCII (or ASCII-friendly) "meta language" for assigning a description to each element or attribute. This language is XML.
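To make the contrast concrete, here is a minimal Python sketch. The record layout, field names, and sample values are invented for illustration and do not come from any particular system. It reads the same record twice: once from a delimited line, where the meaning of each position has to be agreed on out of band, and once from XML, where each value carries its own label.

```python
import xml.etree.ElementTree as ET

# Hypothetical delimited record: nothing in the text says what each field means.
delimited = "10045|Widget|12|4.50"

# The receiver must already know the field order (an implicit agreement).
ASSUMED_FIELD_ORDER = ["order_id", "item", "quantity", "unit_price"]
from_delimited = dict(zip(ASSUMED_FIELD_ORDER, delimited.split("|")))

# The same hypothetical record in XML: the element names describe each value.
xml_text = """
<order>
  <order_id>10045</order_id>
  <item>Widget</item>
  <quantity>12</quantity>
  <unit_price>4.50</unit_price>
</order>
"""
root = ET.fromstring(xml_text)
from_xml = {child.tag: child.text for child in root}

# Both routes yield the same data, but only the XML version is self-describing.
print(from_delimited == from_xml)
```

If the sender reorders the delimited fields, the first parser silently mislabels the data, whereas the XML reader is unaffected because it keys on element names rather than positions.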

The P2P Dilemma

As we have discussed, the WWW today provides an immense amount of readily available online information that lacks metadata describing its content. Traditionally, the WWW browser has been the single interface for accessing content on the Web. This interface, built around the concept of the hyperlink, gave rise to most of the content on the Web today.

With P2P architectures, there might be content on the Web with no browser-friendly link, and thus accessible only via a unique interface. A prime example of this would be Napster, which is a network that primarily shares music files encoded in MP3 format. To view or access files on this network, a proprietary client application is required; the only semantic link between content comes from a centralized server that provides content location. There is no fundamental link relating different content in the network, only peers that act as server or client depending on their current role. How does a peer in one network discover content from a peer in a different network?
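The centralized-lookup arrangement described above can be sketched in a few lines of Python. This is an illustrative model only, not Napster's actual protocol; the class name, peer addresses, and file names are all hypothetical. The point it shows is that the file name is the only metadata the index holds, so a search can match on nothing else.

```python
from collections import defaultdict


class CentralIndex:
    """Hypothetical central server: maps shared file names to peer addresses."""

    def __init__(self):
        self._peers_by_file = defaultdict(set)

    def register(self, peer, file_names):
        # A peer announces the files it shares; only the names are recorded.
        for name in file_names:
            self._peers_by_file[name.lower()].add(peer)

    def search(self, term):
        # Substring match on the file name, the only descriptive data available.
        term = term.lower()
        return {
            name: sorted(peers)
            for name, peers in self._peers_by_file.items()
            if term in name
        }


index = CentralIndex()
index.register("peer-a:6699", ["Blue Train.mp3", "So What.mp3"])
index.register("peer-b:6699", ["blue train.mp3"])

results = index.search("blue")
print(results)
```

A query by genre, performer, or year would find nothing here unless that information happened to appear in a file name, and a peer registered with a different index is invisible altogether.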

The issue of building additional infrastructure to connect and share content compounds the difficulty in building a metadata solution to address the heterogeneity of interconnected networks and the content transferred over them. In the P2P domain, how can we learn from our experience with the WWW to prevent history from repeating itself? How can we solve the metadata problem before P2P content and networks are too pervasive for a solution?

Metadata and Information

The notion of metadata initially emerged in the sciences: How do we coherently organize information in a manner that scales well as the amount of information grows geometrically? Metadata is not a complex concept, really; it is just "data about data."

Ontologies are a methodology for defining metadata standards for a particular domain. For example, biologists more than a hundred years ago defined an ontology by which to organize various living organisms in a comprehensive and understandable manner. Ontologies help establish consensual terminologies that make sense to both sides of an exchange, thus allowing distinct peers to interact in a common vernacular.

Metadata plays the largest role in areas where there is a complex or large set of information that requires both searching and indexing. Yahoo! was founded on the concept of building a structure to manage information on the WWW to address this very problem. In the public sector, researchers have spent a great deal of time working on solving this complex problem. One of the larger efforts is the Dublin Core Metadata Initiative. Founded in 1994, the effort uses a minimal set of metadata constructs (15 elements) to simplify the discovery of information on the Web. Later in this chapter, we will explore this initiative in greater detail. Although work has been done on metadata modeling and definition in domains such as library science, our present concern is how to apply metadata research to P2P architectures so that they avoid the current challenges of finding content on the WWW.
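As a rough illustration of how small the Dublin Core vocabulary is, the following Python sketch builds a metadata record from the 15 element names and the Dublin Core 1.1 namespace. The wrapper element `record`, the helper `build_record`, and the sample values are assumptions made for this example; they are not part of the Dublin Core specification.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The 15 elements of the Dublin Core element set.
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]


def build_record(fields):
    """Build an XML record from a dict of Dublin Core element names to values."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


# Sample values are invented for illustration.
record = build_record({
    "title": "Quarterly Sales Report",
    "creator": "Example Author",
    "format": "application/pdf",
    "language": "en",
})
xml_out = ET.tostring(record, encoding="unicode")
print(xml_out)
```

Because every element is optional and repeatable, a record can be as sparse as the one above and still be useful for discovery.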

World Wide Web, P2P, and Metadata

To reiterate, content on the WWW is difficult to locate because of the lack of support for metadata. Although there is some lightweight support (mainly the <META> tag), the WWW infrastructure doesn't provide much help. Search engines do not require metadata information when registering a site, although if provided it can aid indexing and searching. Domain registrars do require a name for an Internet domain, but the name need not be related to the content found within.

If we return to the example of Napster, where all content is shared and the names or labels of the shared entities themselves provide search criteria, the demand for metadata is low. Yet as we look to P2P frameworks such as JXTA, where many different content types can be exchanged, the need for comprehensive metadata will only increase. Otherwise, we will end up with millions of nodes providing services and content that will not be leveraged because of poor discovery mechanisms. Sound familiar? It should.

HTML and Metadata

The <TITLE> tag is the mainstay of metadata for the WWW world. As someone who spent time providing solutions for search engine ranking optimization, I can confidently say that this tag very rarely describes the content of the page accurately. The <META> tag was intended to address this problem, but it's too simplistic and inflexible (no child elements, for example). So although you could tag metadata for files relating to subjects from health to finances, there is no way to build an ontology within which to define the appropriate content.
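The limitation is easy to see when you extract everything a page declares about itself. The sketch below uses Python's standard `html.parser` module; the sample page and the `MetaCollector` class are invented for illustration. Whatever the page author intended, the result is a title string plus a flat set of name and content pairs.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects the <TITLE> text and all <META NAME=... CONTENT=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page: the title says nothing about the subject matter.
page = """<HTML><HEAD>
<TITLE>Welcome!</TITLE>
<META NAME="keywords" CONTENT="health, finances">
<META NAME="description" CONTENT="Articles on health and personal finance">
</HEAD><BODY></BODY></HTML>"""

collector = MetaCollector()
collector.feed(page)
print(collector.title)
print(collector.meta)
```

Each value is an opaque string. There is no way to say that "health" is a category containing narrower topics, or to relate one page's keywords to another's, which is exactly the structure an ontology would supply.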