14.4 Pacific National Laboratories

The last example, like the Exchange 2000 discussion, is a real case study rather than a hypothetical one. It shows how WebDAV can be used to design a custom application that is both extensible and scalable.

14.4.1 Overview

Pacific National Laboratories (PNL) designs problem-solving environments (PSEs) for government research groups. PSEs allow researchers to share and update documents. Researchers author many papers together, even forming teams that cross national boundaries. The final papers are made very widely available, perhaps publicly available on the Web. In contrast to normal multiple-author scenarios, researchers annotate documents more often, adding properties to make searches work better. The information in this chapter is from PNL papers [Schuchardt02a], [Schuchardt02b].

In addition to authoring and publishing papers, researchers now need to share their raw data and semantic data on the Web. For example, a researcher might construct an XML file representing the structure of a specific molecule and publish that semantic data together with raw data from a number of tests involving that molecule. Other researchers can interact with the data and use it more easily than when study results are distributed in paper journals. Standard file formats already exist to make data sets or molecule structure information transferable, but there is no standard way to share these files.

PSEs have been developed to try to unify these data-sharing functions into one flexible application. The traditional approach to storing information in a PSE has been to store documents in database Binary Large Object (BLOB) tables and to store the metadata in related tables. Then the PSE presents a custom view of those documents and metadata, possibly by generating dynamic Web pages. However, this approach leads to overly rigid schemas because it's difficult to plan in advance precisely what metadata will be used. Fixed table schemas for database storage make it difficult to update the system when researchers need to change the way they refer to documents.

PNL researchers designed a new approach to PSEs that involved a number of separate repositories and a flexible schema. Each repository can extend the base metadata schema in different ways. In their design, no one data store or component needs to know the entire schema. Thus, it's much easier for each repository to evolve to serve the needs of its main users (the researchers directly contributing to that repository) while still allowing access to a large readership.

WebDAV is an ideal protocol for this problem. It defines the syntax for this extensible metadata schema (property names and XML namespaces) as well as the access protocol.

14.4.2 Example Problem-Solving Environment

One of the specific problem-solving domains is molecular science and complex chemical systems. PNL has a Molecular Science Software Suite, which includes software for advanced computational chemistry techniques, project management assistance, and calculation engines. It allows scientists to construct models of complex molecules, enter and analyze research results, and launch distributed server-side execution of their computational models for speedy completion.

One component of this suite is called "Ecce," and among other things, Ecce needs to store its data and modify the data at later stages. Originally (1994–2002), Ecce used an Object-Oriented Database (OODB) to store data represented internally as objects. In 2001, Ecce had 70 different kinds of objects it could store in the database, such as Molecule, Task, Experiment, Calculation, and File. Although this approach was successful for years, eventually it began to reach its limitations.

  • Any database schema used by multiple parties requires agreement on all aspects of the schema, and the implementors of the different modules in Ecce found the agreement process arduous.

  • The choice to use OODBs unexpectedly created a dependency on a specific vendor. Although there are some standards in the area, each OODB may have a proprietary binary document storage format and different ties to programming languages.

  • OODBs are not easily and tightly linked to the Web for wide publishing.

  • OODB clients are "fat," not "thin." OODB clients must know the specific data schema in order to consume data in OODBs. Compare this to the Web, where thin clients accept whatever data the Web server is capable of formatting and simply display the data for the user.

Now that Ecce is adding support for molecular dynamics, the problems of the underlying store may be magnified. Database schemas become more and more complex as the system grows to handle more and more kinds of data, and eventually the system becomes quite unmanageable. Although PNL could have addressed some of these problems by spending more money on OODB software and related software, Ecce's designers chose to investigate other technology, such as WebDAV/database interface software.

14.4.3 Solution Requirements

PNL required a solution to meet the following requirements:

  • Allow direct access to raw data. Data should be accessible in its raw format, not just through the libraries that impose an object model on the stored data.

  • Metadata schemas should be discoverable and extensible, rather than fixed in advance.

  • The storage layer should not have to be aware of the nature of each application object (each document or object that has to be stored). This layer separation allows the metadata schemas to evolve independently of the features of the storage system.

  • A standard protocol should be used to do data management operations, yet this standard protocol should still not have to be aware of individual schemas.

The solution also had to be deployable in a widely distributed manner, with data stores in many locations and managed independently by different groups.

14.4.4 Solution Choices

PNL chose HTTP, XML, and WebDAV as a solution. HTTP provides extremely broad access. XML is an ad hoc and extensible way to marshal object data. WebDAV combines the two, making Web servers capable of storing XML properties on arbitrary documents. The documents themselves are not restricted to the XML format but can be images or raw data sets. New document types and properties can be added at any time, and each application can use whatever set of properties happens to be needed and understood by that application.
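
As a minimal sketch of what "storing XML properties on arbitrary documents" looks like on the wire, the following Python builds the body of a WebDAV PROPPATCH request that sets two custom properties. The namespace URI and the property names (formula, charge) are hypothetical, chosen only for illustration; the PNL papers are the authority on the names Ecce actually used.

```python
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"
# Hypothetical namespace URI for application properties (an assumption).
ECCE_NS = "http://example.org/ecce/"


def build_proppatch(properties):
    """Return a PROPPATCH request body that sets the given properties.

    `properties` maps (namespace, local_name) pairs to string values.
    """
    update = ET.Element(f"{{{DAV_NS}}}propertyupdate")
    set_el = ET.SubElement(update, f"{{{DAV_NS}}}set")
    prop_el = ET.SubElement(set_el, f"{{{DAV_NS}}}prop")
    for (namespace, name), value in properties.items():
        # Each property is one XML element qualified by its namespace.
        ET.SubElement(prop_el, f"{{{namespace}}}{name}").text = value
    return ET.tostring(update, encoding="unicode")


# Illustrative values only; the property names are hypothetical.
body = build_proppatch({
    (ECCE_NS, "formula"): "H2O",
    (ECCE_NS, "charge"): "0",
})
print(body)
```

The server does not need to know what formula or charge mean. It stores whatever namespaced elements the client sends, which is what lets each application define its own properties.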

PNL selected mod_dav 1.1 as its WebDAV server and data storage implementation. The Apache module was free and easy to extend, and Apache 1.3.11 provided the required security features. PNL used mod_dav's ability to plug in a new repository layer to replace the existing property storage with a far more scalable one, allowing very large property values (and large numbers of properties per document) to be stored and retrieved reliably. PNL's replacement property storage layer used a hash table in a file managed by the GNU Database Manager (gdbm) version 1.8, with one property storage file for each WebDAV resource. The Apache Xerces 1.3 XML engine was used to parse and generate XML.
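
The "one property file per resource" idea can be sketched in Python as follows. This is an illustration under stated assumptions, not PNL's or mod_dav's actual code: it uses the standard library's dbm.dumb module as a stand-in for the GNU database manager so that it runs anywhere, and the file-naming scheme and key format are hypothetical.

```python
import dbm.dumb
import os
import tempfile


def property_db_path(store_dir, resource_path):
    """Map a resource path to its own property file (hypothetical layout)."""
    safe_name = resource_path.strip("/").replace("/", "_")
    return os.path.join(store_dir, safe_name + ".props")


def make_key(namespace, name):
    """Build the hash-table key for one property (hypothetical key format)."""
    return f"{namespace} {name}".encode("utf-8")


def set_property(store_dir, resource_path, namespace, name, value):
    """Store one property value in the resource's own key/value file."""
    with dbm.dumb.open(property_db_path(store_dir, resource_path), "c") as db:
        db[make_key(namespace, name)] = value.encode("utf-8")


def get_property(store_dir, resource_path, namespace, name):
    """Return a stored property value, or None if it was never set."""
    with dbm.dumb.open(property_db_path(store_dir, resource_path), "c") as db:
        key = make_key(namespace, name)
        return db[key].decode("utf-8") if key in db else None


# Illustrative usage with a temporary directory and hypothetical names.
store = tempfile.mkdtemp()
ns = "http://example.org/ecce/"
set_property(store, "/calc/molecule1", ns, "formula", "H2O")
value = get_property(store, "/calc/molecule1", ns, "formula")
missing = get_property(store, "/calc/molecule1", ns, "charge")
print(value, missing)
```

Because the store is a plain key/value table, adding a new kind of property requires no schema change: a new key simply appears in that resource's file.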

The system was used to create properties as large as 100MB and documents as large as 200MB without problems, sizes that exceeded the expected typical usage.

14.4.5 Mapping Database Schemas to WebDAV Storage

The designers of the new system decided that each data piece that a domain scientist would recognize as a separate object would be stored as a separate WebDAV resource. The designers also considered combining a number of objects inside one WebDAV resource, but this would force clients to download large files and then parse the document body to extract each object. The more granular approach also means that each different object can be annotated individually with metadata of its own. For example, a Molecule object would be stored as a single WebDAV resource with its own properties.

The WebDAV hierarchical storage model was useful in the new system. For example, although each task object in a calculation is a separate WebDAV resource, all the tasks involved in a single calculation can be stored in the same WebDAV collection. This is convenient when using a regular WebDAV browser to look at the repository. However, since each task is tied to each other task in a specified order, there are dependencies or relationships between resources. The implementors planned to represent such relationships between resources through additional property values.
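
One way to picture relationships expressed as property values is a chain of tasks in which each task resource carries a property pointing at the task that follows it. The sketch below assumes a hypothetical next-task property and models already-fetched properties as plain dictionaries; the source does not say which property names or link scheme the implementors chose.

```python
# Hypothetical namespace and property name (assumptions for illustration).
ECCE_NS = "http://example.org/ecce/"
NEXT_TASK = (ECCE_NS, "next-task")


def ordered_tasks(properties_by_url, first_url):
    """Follow next-task links from first_url and return the task URLs in order.

    `properties_by_url` maps each resource URL to a dict of its properties,
    as a client might assemble them from PROPFIND responses.
    """
    order = []
    seen = set()
    url = first_url
    while url is not None:
        if url in seen:
            raise ValueError(f"cycle in task chain at {url}")
        seen.add(url)
        order.append(url)
        # A task without the property is the last one in the chain.
        url = properties_by_url[url].get(NEXT_TASK)
    return order


# All tasks live in one collection; order comes from properties, not names.
calculation = {
    "/calc42/optimize": {NEXT_TASK: "/calc42/frequency"},
    "/calc42/frequency": {},
    "/calc42/setup": {NEXT_TASK: "/calc42/optimize"},
}
chain = ordered_tasks(calculation, "/calc42/setup")
print(chain)
```

The collection groups the tasks for browsing, while the property values carry the ordering that the hierarchy alone cannot express.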

Where data format standards existed (like Molecule representations in Protein Data Bank format), the implementors chose to comply with the data format standard by keeping that data together in one resource body or property value. Otherwise, the implementors broke down metadata into small chunks to maximize flexibility. Many other researchers are working on standard data formats and MIME types for chemistry information, and these standards are easily integrated into the WebDAV data model.

A namespace was defined for all Ecce properties, and the same namespace was used throughout. However, the implementors envisioned that once the basic system was established, extension work would begin in multiple areas independently, and these extensions would define and use their own namespaces.

14.4.6 Results

The Ecce implementors decided that their new design did alleviate the schema sclerosis that was affecting the old system. With a more flexible XML-based schema, changes in one area of the PSE did not affect other areas as much. The entire system involved fewer up-front costs (in paying for commercial software), as well as allowing for more rapid feature development.

Because WebDAV, HTTP, and XML are open standards, more powerful and cheaper middleware software and infrastructure can now be used to deploy a PSE. The system benefits from additional independent layers (storage, transport, data format, data manipulation, and semantics) because most layers can now be constructed out of standard and replaceable software components.

The researchers were happy to report that after Ecce was redesigned to use cheaper middleware, deployment costs became low enough that Ecce was able to reach a wider user base.

14.4.7 Lessons Learned

Higher Performance

The implementors of this solution ran performance tests to compare the new solution to the network performance of the original OODB-based solution, which had used FTP to transfer files after pulling them out of the database. File upload performance on the server tested was as good as that of FTP. The implementors were concerned that the Ecce software modules might perform worse after conversion, and their goal was modest: to avoid a significant performance decrease in these modules. However, overall performance improved, exceeding expectations.

Additional performance optimizations, including HTTP pipelining, multiple TCP connections, and bundling requests, could still be implemented.

Greater Disk Space Required

The basic cost of the flexibility in the new system appears to be storage size. Disk requirements increased by up to 25 percent when files and database property files replaced BLOBs and tables. Part of the bloat, however, was due to the GNU Database Manager software used to store properties: Each property database file has a minimum size of 25KB, even if most of this space is empty. PNL could replace or reconfigure the property database software to improve disk usage. Since the disk storage tests were run with small average resource sizes, the property storage requirements were a significant fraction of the total storage. In a usage scenario with larger documents, the property storage would involve a smaller relative increase in storage space required.
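
The effect of resource size on relative overhead can be worked through with a few lines of arithmetic. The sketch below assumes that each resource has exactly one property file and that the file stays at its 25KB minimum; the resource sizes are hypothetical and are not PNL's test data.

```python
# Minimum size of one property database file, as reported in the text.
MIN_PROPERTY_FILE_KB = 25


def property_overhead(resource_kb, property_file_kb=MIN_PROPERTY_FILE_KB):
    """Return the extra storage for one property file as a fraction of the resource size."""
    return property_file_kb / resource_kb


# Hypothetical resource sizes, from small to large.
for size_kb in (50, 500, 5000):
    print(f"{size_kb:>5} KB resource: +{property_overhead(size_kb):.1%}")
```

Under these assumptions the fixed 25KB cost adds half again to a 50KB resource but only half a percent to a 5,000KB one, which is why a repository of larger documents would see a smaller relative increase.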

The PNL system could also be optimized to use less storage if the server software used techniques like compression or avoiding file duplication. These kinds of optimizations can be added without disrupting the way the WebDAV repository is used.

Lower Cost Than OODB

The implementors also found significant cost benefits resulting from the new architecture. Apache and mod_dav were not only free but also cheaper to maintain than OODB systems, even if each department had to have its own storage server. With WebDAV, departments can share storage servers without having to manually harmonize their schemas, so costs are reduced even further. The departments and laboratories could even contract the WebDAV repository hosting to a third-party generic WebDAV hosting service.

Support for Web Access

Since all the Ecce data is on the Web, it is now possible for researchers to access their data directly through a Web browser or WebDAV explorer, indirectly through Web server extensions that manipulate the data (results viewed in a Web browser), or through specialized tools.

Note that basic support for Web access means that the system is naturally compatible with MIME and XML standards. Many researchers use text or XML data formats, and there are specific MIME types for many kinds of science-related data files. Chemical Markup Language, Chemical Structure Markup Language, Mathematical Markup Language (MathML), and the Extensible Scientific Interchange Language are all based on XML, so data in these formats can be stored as property values as well as file bodies. Any format that has a MIME type is automatically labeled and handled by a WebDAV repository.

WebDAV Is Easy to Work With

The Ecce developers found that the new architecture was much more conducive to debugging. Web browsers and WebDAV explorer tools became debugging tools, enabling the developers to independently verify the sizes, locations, and names of resources stored and accessed by Ecce. Raw XML data can be inspected in text editors or in XML editors.

Security Was Maintained

HTTP security is mature, flexible, and scalable. In addition to meeting the minimum security requirements, the Apache-based system supports many alternative security features. New security features are frequently implemented for Apache, allowing the system to keep pace with recent developments in security functionality.

No Transaction Support

The PNL implementors felt that the lack of transaction support in WebDAV was a drawback. Although it did not block deployment in this case, it could be a blocking issue for other custom applications.

Continual Improvement Possible

A database-backed solution has a certain lack of flexibility. It's a slight pain to add columns to existing tables, and it's more difficult to change columns when data must be upgraded. When functionality changes are large, it's extremely hard to reengineer a set of interrelated tables to handle the new functionality. In comparison, WebDAV makes it easy to change what properties exist on which resources and where those resources are stored.

WebDAV also makes it possible for clients and servers to be improved and upgraded independently. The core WebDAV engine can be upgraded to a new version that will provide backward compatibility for client requests. The custom rules and logic on the server can be updated. New properties can be added to certain types of files, and because of the way PROPFIND is designed, client software that is already deployed will still be able to view those files and retrieve the known properties.
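
The reason older clients keep working can be sketched as follows: a client that reads a PROPFIND response picks out the properties it recognizes and ignores the rest. The sample response, the namespace URI, and the property names below are hypothetical; only the DAV: elements (multistatus, response, href, propstat, prop, status) come from the WebDAV specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical application namespace (an assumption for illustration).
ECCE_NS = "http://example.org/ecce/"
# Properties this (older) client was written to understand.
KNOWN = {f"{{{ECCE_NS}}}formula", "{DAV:}getcontentlength"}

# Hand-written sample response; dynamics-step stands for a newer property
# added after the client was deployed.
SAMPLE_RESPONSE = """<D:multistatus xmlns:D="DAV:" xmlns:e="http://example.org/ecce/">
  <D:response>
    <D:href>/calc42/molecule</D:href>
    <D:propstat>
      <D:prop>
        <D:getcontentlength>2048</D:getcontentlength>
        <e:formula>H2O</e:formula>
        <e:dynamics-step>0.5</e:dynamics-step>
      </D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
</D:multistatus>
"""


def known_properties(multistatus_xml, known):
    """Return {href: {property_tag: text}} limited to properties in `known`."""
    result = {}
    root = ET.fromstring(multistatus_xml)
    for response in root.findall("{DAV:}response"):
        href = response.findtext("{DAV:}href")
        props = {}
        for propstat in response.findall("{DAV:}propstat"):
            status = propstat.findtext("{DAV:}status", default="")
            if " 200 " not in status:
                continue  # skip properties the server could not return
            for prop in propstat.findall("{DAV:}prop/*"):
                if prop.tag in known:  # unknown properties are ignored
                    props[prop.tag] = prop.text
        result[href] = props
    return result


found = known_properties(SAMPLE_RESPONSE, KNOWN)
print(found)
```

The newer dynamics-step property is present in the response, but the client neither fails on it nor needs to be upgraded to keep reading formula and getcontentlength.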



WebDAV. Next Generation Collaborative Web Authoring
ISBN: 130652083
Year: 2003
Pages: 146
