The Data Preservation Challenge


Yes, we can store massive amounts of data for increasingly longer periods of time, and we may never run out of storage space, even if we save absolutely everything forever. Storing data bits that accumulate into data mountains over time is easy to do. Tape and new forms of optical media will store petabytes for centuries, as long as you do not want to retrieve the data. It is when you want to get data back so that you can actually use it that things get increasingly difficult and expensive. As mentioned earlier, the storage industry has yet to devise a storage medium that lasts longer than one or two decades at the most. Long-term data preservation requires periodically refreshing at least the storage media, and usually the underlying hardware as well. In normal practice, data centers replace disk arrays every 3 to 5 years and tape drives every 5 to 7 years. Fermilab migrates 3 terabytes of data per day from old tape cartridges to new ones just so that it can be assured of being able to retrieve the data if and when it is needed.

Data preservation is becoming a hot topic among many information technology administrators within companies, universities, and government agencies such as the Library of Congress. Although our world has been digital for many decades, only recently has this problem of long-term data storage and retrieval become more acute, driven by two conflicting forces:

  • A broader need, imposed by regulatory agencies, for organizations to keep records online and available for longer periods, along with the need for historical preservation

  • Changes in computer "formats," both physical formats and data formats, which ever more rapidly render obsolete the very processes we need to retrieve and interpret that data

Suppose that you live in an old house in a historic New England town and you discover in your attic a cache of historic artifacts. In a sheaf of thin, yellowish-brown pages, you find letters, proclamations, debt notices, and arrest warrants, some of which date back to the early eighteenth century. They are handwritten in a script and syntax that give you pause, but they are intelligible. As you leave your attic to announce your discovery, you accidentally knock over a box containing a collection of old floppy disks dating back to your first PC, a machine you trashed long ago. You suddenly realize that to extract anything intelligible from them, you would have to carry them down to a computer museum. Even so, there is no guarantee that the museum will have the right version of WordPerfect that you used to create many of the documents stored on those floppy disks, or that it can run the right version of DOS, all on a machine with a 5.25-inch floppy disk drive that no one makes anymore. In fact, even if you were somehow able to find the right combination of hardware and software, the floppy disks themselves may have degraded to the point where their data is no longer readable.

If we want to store data for very long periods of time (hundreds of years, in the case of historians), we are now exposed to two significant problems that the storage industry is still years away from solving:

  • Media degradation and the increasingly rapid obsolescence of "readers"

  • Lack of a universal digital format for encoding

The first problem is about physical media: disks, diskettes, and tapes. These all have a physical lifespan that ranges from three to seven years. After that, they degrade (tape dries out and gets brittle if not stored properly, for example) and lose data. CDs buy more time, perhaps a few decades if properly protected, but keeping the hardware that reads these CDs functioning for the same time period is a significant challenge.

The problem of media format obsolescence was more apparent a few years ago and more confined to removable media (tape, for example). This will be less of an issue going forward because the storage industry is trending toward adopting a smaller number of standard formats that can be adhered to for longer periods of time, and toward depending more heavily on online storage for even long-term storage needs and for quick retrieval. For some applications, it is better to have the data online and stored in more than one location than to have a single copy stored offline in some removable form that is subject to physical decay and data loss. The cost per gigabyte of Serial ATA (SATA) disk, for example, now allows this luxury. However, there is no free lunch. The higher the density of the disk media, the more susceptible the media is to data loss over time, meaning the data almost literally "evaporates" from the disk after a period of about five years. Therefore, if you want to keep data online and on disk for more than five years, you must migrate it from one online media format to a newer one. Then, there is the cost of electricity required to keep these mountains of data online and "spinning" continuously. For some, however, this seems a sensible tradeoff.

The real problem in data preservation lies within the actual encoding format. The problem is known as semantic continuity. Suppose that you have somehow managed to keep your 1988 TurboTax file online and available. (Go to the head of the long-term data storage class!) Do you have any mechanism for actually making use of it? Perhaps you can print out the return on hardware you have managed to preserve for all these years and store it away for safekeeping. The odds that your current TurboTax version can use this data electronically (meaning it would have to support a nearly 20-year-old format) are slim to none.

Getting Back Your Bits: Can You Make Sense of Them?

The Harvard University Library has two archival storage facilities, one analog (papers, books, photos, microfiche, etc.) and one digital, known as the Online Computer Library Center (OCLC). Here's how OCLC charges users for its services:

OCLC's current prices are for "bit-preservation" services only. In other words, OCLC will preserve the data bits for as long as OCLC remains in business (presumably forever), but it is then up to you to turn those bits back into information when you want to retrieve them 50 years from now.

Bit-preservation services include data management and backup, ongoing virus checks, periodic media refreshment, disaster recovery, and support of administrative tools for owners to update metadata and generate reports. Prices have not yet been set for "full preservation," wherein OCLC would be obligated to provide standard bit-preservation services, plus the capability to render intellectual content accurately, regardless of technology changes over time.
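
The integrity checks behind a bit-preservation service can be illustrated with a short sketch. The following Python fragment is a minimal, hypothetical fixity audit; the manifest format, file names, and directory layout are assumptions for illustration, not OCLC's actual system:

    # fixity_check.py: a minimal sketch of periodic bit-preservation auditing.
    # Manifest format and paths are illustrative assumptions, not OCLC's system.
    import hashlib
    import json
    from pathlib import Path

    def sha256_of(path: Path) -> str:
        """Compute a SHA-256 digest, reading in chunks to handle large files."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def build_manifest(archive_dir: Path) -> dict:
        """Record a digest for every file at ingest time."""
        return {str(p): sha256_of(p) for p in archive_dir.rglob("*") if p.is_file()}

    def audit(manifest: dict) -> list:
        """Re-hash every file and report any whose bits have silently changed."""
        return [name for name, digest in manifest.items()
                if not Path(name).exists() or sha256_of(Path(name)) != digest]

    if __name__ == "__main__":
        manifest = build_manifest(Path("archive"))
        Path("manifest.json").write_text(json.dumps(manifest, indent=2))
        print("Damaged or missing files:", audit(manifest) or "none")

Run periodically, an audit like this catches bit rot in time to restore from a second copy; it says nothing, of course, about whether anyone can still interpret the bits it protects.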

"Full preservation," that is the really hard part. You can have your bits back anytime you want. If you want them back many years hence in the same format and with the same information content as before, however, well, perhaps you should generate a hard copy and walk it over to the analog preservation side of Harvard's archival storage library just in case.


Semantic continuity is really at the heart of the data preservation problem. When seen in this light, it is not really a problem storage vendors can solve. Solutions must come from the applications vendors (Microsoft and Oracle, for example) or from other sources within the scientific and engineering communities. It is an extremely important problem, perhaps not so much for the individual PC user as for commercial industries, research institutions, and governments. Legal documents, reports, digital pictures, presentations, and the list goes on; all need to be preserved. If we want decades of data available for Inescapable Data utility, we need to have mechanisms to ensure that we can extract information from saved data well into the future. Many organizations are starting to grapple with this challenge, but as of yet, there are no clear solutions.

State Government and Preservation

Document preservation is an acute problem for state governments. "We believe that we have the sacred responsibility to preserve history for future generations. That creates unique challenges in the digital world," says Peter Quinn, CIO of the State of Massachusetts. "Nobody has yet been able to figure out how to preserve this much information in the same way that history has used paper to preserve information. This is one of the top two or three issues in all of government that we have to deal with."

For decades, data was stored on a style of tape that changed slowly over time. The formats for accessing that data changed very little over time as well. Quinn points out that the computer language Cobol has withstood the test of time and lasted many decades. In contrast, in the past 10 years, he has seen dozens of hopeful replacement technologies come and go. Organizations with a high dependency on Cobol (or any platform for that matter) are in a quandary about where to go next. Too many promises for the next dominant programming language or platform have come and gone, leaving remnants of programs along the way that need continual maintenance.

To make matters worse, Massachusetts state government workers are aging, and Quinn believes this represents a significant threat to data preservation. Quinn states that a significant number of his IT staff members are between the ages of 50 and 65; add to that number those in their 40s, and the majority of Quinn's workforce is over 40 years old. "We're going to lose 30% of our workforce in the next 5 years due to retirement and maybe another 20% due to other attrition reasons. That's 50%, and they're taking with them the skills that built our tools that run the government," exclaims Quinn.

Other industries are experiencing the same problem, but it is more pronounced at the state and federal government level because turnover is typically lower and governmental agencies can therefore maintain legacy architectures for longer periods. As governments hire new computer engineering talent, they find that these new workers are more familiar with creating and maintaining Linux and Windows than with Cobol. Therefore, the crumbling IT infrastructure problem self-perpetuates and worsens as time goes on. An important requirement for data preservation is preservation of people skills to maintain and carry forward the "information" locked within the data.


To keep certain electronic documents safe from the ravages of time and planned obsolescence, some companies are choosing to convert Microsoft Word and Excel documents into image formats (TIFF, GIF, PDF) that are standard and well known. Although these formats too will age as they fall out of prevalent use, migration tools will not only copy the data to the newer-faster-cheaper-denser storage media, but also "convert" the image format to whatever is the prevailing standard. This approach has some merit, but it has some important drawbacks as well. First, images of a Word document require orders of magnitude more storage capacity than the original document stored in its original Word format. Second, although the text and embedded graphics are preserved, formatting attributes and file metadata are lost. For example, if you convert a spreadsheet to a picture, you lose the cell formulas and associations that reside behind each spreadsheet cell. If a spreadsheet is archived so as to preserve the underlying math, image or PDF conversion will not help.

Another approach is to use XML, which is wonderful for allowing attributes and other metadata to be expressed in a more universal format. XML enables you to store much of the nonvisual information that many documents contain (such as the relationships within a document, spreadsheet cell formulas being one example). However, XML might not fully describe the "look" of a document in the exact same way that a picture (or perfect re-rendering) can; sometimes, the purpose of retrieving an archived document is to prove (perhaps in court) that the information was laid out in a nonmisleading way (consumer label cases, for example). Therefore, not having an image-based rendering of the document could be troublesome for some industries and applications. Storing both an XML version and an image version of a document has been suggested as a workaround; however, it is one that consumes enormous amounts of storage.
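
To make the XML idea concrete, here is a toy Python sketch of archiving spreadsheet cells so that the underlying formulas survive, which is precisely what an image rendering throws away. The element and attribute names are invented for this illustration and do not correspond to any real archival schema:

    # A toy illustration of archiving spreadsheet cells as XML so that the
    # underlying formulas survive, unlike an image or flat-PDF rendering.
    # Element and attribute names here are invented for this sketch.
    import xml.etree.ElementTree as ET

    cells = [
        {"ref": "A1", "value": "100", "formula": None},
        {"ref": "A2", "value": "250", "formula": None},
        {"ref": "A3", "value": "350", "formula": "=A1+A2"},  # the math an image loses
    ]

    sheet = ET.Element("sheet", name="Q1-budget")
    for cell in cells:
        el = ET.SubElement(sheet, "cell", ref=cell["ref"])
        ET.SubElement(el, "value").text = cell["value"]
        if cell["formula"]:
            ET.SubElement(el, "formula").text = cell["formula"]

    ET.indent(sheet)  # pretty-print; requires Python 3.9+
    print(ET.tostring(sheet, encoding="unicode"))

The resulting markup is self-describing plain text: a future reader with no copy of Excel can still see that A3 was computed as the sum of A1 and A2, not merely that it displayed "350."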

Another approach is to use a more versatile display format, such as Adobe PDF, with some added extensions. If both the format and the rendering logic are similarly well specified, it is conceivable that storing more of the raw contents of a document would be possible. Even so, you would also need to store much ancillary data with the file to be 100 percent sure that the "picture" and all of its associated data can be reconstituted. For example, the actual text fonts would have to be stored in the document (as opposed to merely referenced). Similarly, all color and layout information would have to be specified in some device-independent manner. Likely, there will be other blocks of information that would need to be stored as well to enable perfect reproduction. Finally, it will be of paramount importance to store the document and all of its associations as a 100 percent self-contained object so that it can be migrated and copied without external dependencies. For small documents, these requirements could exact a disproportionately high premium for storage capacity, making the approach impractical for many applications. However, it may well be a practical approach for aircraft manufacturers, for example, that must maintain engineering documents (normally large files) for as long as a plane is still flying.

There is some hope for limiting the runaway accumulation of stored digital documents. Intelligent storage devices can tease out redundancies that occur across many files (as opposed to "compressing" within a file). If a given storage device is holding thousands or even millions of files, in its "spare time" (i.e., when not responding to normal read/write operations) it can examine its contents for redundant data. For example, if you routinely use PowerPoint to create your corporate presentations and typically use the same base template, a healthy percentage of the storage space required by each of those presentations is allocated to data that is identical for every presentation that uses the same template. Similarly, if our new document-saving approach has to embed such things as fonts (over and over again, for each document), such a cross-file examination could lead to impressive compression. This significant amount of purely redundant data can be teased out of each document and applied back to the document when it is retrieved.
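
A minimal sketch of this cross-file deduplication idea follows, assuming fixed-size blocks and an in-memory index keyed by cryptographic fingerprints (real products typically use variable-size, content-defined chunks and persistent indexes):

    # A minimal sketch of cross-file deduplication: identical blocks are stored
    # once, and every file becomes a list of pointers into a shared block store.
    # Fixed-size blocks and SHA-256 fingerprints are simplifying assumptions.
    import hashlib

    BLOCK_SIZE = 4096
    block_store: dict[str, bytes] = {}   # fingerprint -> unique block contents

    def dedup_write(data: bytes) -> list[str]:
        """Store a file as a list of block fingerprints, saving each block once."""
        pointers = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()
            block_store.setdefault(fp, block)   # redundant blocks cost nothing
            pointers.append(fp)
        return pointers

    def dedup_read(pointers: list[str]) -> bytes:
        """Reassemble the original file by re-applying the shared blocks."""
        return b"".join(block_store[fp] for fp in pointers)

    # Two "presentations" sharing the same template mostly deduplicate away.
    template = b"T" * 16384
    deck_a = template + b"slides for the sales meeting"
    deck_b = template + b"slides for the board meeting"
    ptrs_a, ptrs_b = dedup_write(deck_a), dedup_write(deck_b)
    assert dedup_read(ptrs_a) == deck_a and dedup_read(ptrs_b) == deck_b
    stored = sum(len(b) for b in block_store.values())
    print(f"logical bytes: {len(deck_a) + len(deck_b)}, stored bytes: {stored}")

The shared template blocks are written once no matter how many decks reference them, which is exactly the PowerPoint scenario described above.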

In time, and as Inescapable Data devices force companies to store and manage petabytes as opposed to terabytes of data, storage devices will become available that use embedded intelligence to "mine out" repetitive patterns and redundancies. Fat files will be transparently replaced by smaller ones that embody only the unique contents plus pointers to the "shared" or redundant information common to many files. The user of the file will be unaware of the magic going on behind the scenes. This process is not unlike what a RAID (Redundant Array of Independent Disks) storage system does today in that it takes a file and breaks it up into thousands of data "pieces" that are "sprayed" or "striped" (in the disk storage vernacular) across a number of physical disks in a RAID group. Intelligence in the RAID system reconstructs the file when it is recalled. Should one of the physical disks fail, the RAID system reconstructs the missing data by virtue of parity data stored on the other disk members of the RAID group, all done transparently to the application user. Intelligent storage devices are in use today. More intelligence is coming.
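
The parity trick that makes this reconstruction possible can be shown in a few lines. This is a toy, RAID-4-style example with one parity disk; real arrays add parity rotation, block-level striping, and far more machinery:

    # A toy illustration of RAID-style parity: the XOR of the data stripes lets
    # the array rebuild any single failed disk. This shows only the core idea.
    def xor_stripes(*stripes: bytes) -> bytes:
        """XOR equal-length byte strings together, byte by byte."""
        out = bytearray(len(stripes[0]))
        for stripe in stripes:
            for i, b in enumerate(stripe):
                out[i] ^= b
        return bytes(out)

    disk1 = b"Inescapab"
    disk2 = b"le Data, "
    disk3 = b"striped. "
    parity = xor_stripes(disk1, disk2, disk3)   # written to the parity disk

    # Disk 2 fails; XOR of the survivors and the parity recovers its contents.
    recovered = xor_stripes(disk1, disk3, parity)
    assert recovered == disk2
    print(recovered)

Because XOR is its own inverse, combining the surviving stripes with the parity stripe regenerates the lost one, with the application none the wiser.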

Walking Information Through Time: The Universal Virtual Computer

Imagine that you are a twenty-fourth-century archeologist. Digging among the ruins of what was once the house of an eminent twenty-first-century scientist and professor in residence at the university you now call home, you find a sealed metal box that contains an odd-looking piece of plastic. When you bring it back to your university home base, a fellow researcher identifies it as a holographic storage "disk," a once-popular but now extinct form of computer storage. Wow, you think. Maybe the old professor preserved his research notes and other important documents, and deposited them in some sort of digital time capsule. What a find!

Luckily, the university computer museum has a device that can still extract data from the disk. However, you still have a problem. How do you decode the data bits and turn them into something you can understand? For that, you turn to another friend in the university's computer lab who has a universal virtual computer (UVC). Using this digital version of the ancient Rosetta Stone, you discover that the old professor anticipated that future generations would be able to create such a machine, and saved his documents in an appropriate format. Voilà! The digital time capsule has been truly unlocked.

Here's how the UVC works: It is built from coded instructions (a program) contained in a paper document of about 15 written pages. (Paper is still a perfectly acceptable analog storage medium in the twenty-fourth century, as it was in the sixteenth century.) The program simulates the most basic components common to all computers to create a processing environment that is essentially the same, no matter when or by whom the computer was made. All it has to do is something all computers do: run a program. So, as long as you know "the code," you can build a UVC that will retrieve digitized information stored in a format the UVC can understand.
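
To make the idea concrete, here is a toy interpreter in the spirit of the UVC: a tiny, fully specified machine that future generations could rebuild from a paper description. The four-instruction set below is invented for this sketch and is not Lorie's actual UVC definition:

    # A toy interpreter in the spirit of the UVC. Anyone holding the paper
    # spec of these four instructions could reimplement this loop on any
    # future hardware and replay the stored program.
    def run(program, memory):
        """Execute (op, a, b) triples against a register file until HALT."""
        pc = 0
        while True:
            op, a, b = program[pc]
            if op == "LOAD":      # put constant b into register a
                memory[a] = b
            elif op == "ADD":     # add register b into register a
                memory[a] += memory[b]
            elif op == "PRINT":   # emit register a as a character
                print(chr(memory[a]), end="")
            elif op == "HALT":
                return
            pc += 1

    # A "stored document" that decodes itself: prints "HI" then stops.
    program = [
        ("LOAD", 0, 72), ("PRINT", 0, 0),
        ("LOAD", 0, 73), ("PRINT", 0, 0),
        ("HALT", 0, 0),
    ]
    run(program, memory={})
    print()

The point is that the machine, not the document format, is what gets written down on paper; any archive encoded against that machine remains decodable for as long as someone can read the spec.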

The UVC is in fact the brainchild of an IBM researcher named Raymond Lorie. He and a UVC project team have already created a prototype UVC that can decode text and image formats such as JPEG and Adobe PDF. The goal is to propagate the UVC as a feature of a modern culture that now understands the fragility of all things digital and wants to preserve its information stores for the benefit of future generations.



