Yes, we can store massive amounts of data for increasingly longer periods of time, and we may never run out of storage space, even if we save absolutely everything forever. Storing data bits that accumulate into data mountains over time is easy to do. Tape and new forms of optical media will store petabytes for centuries, as long as you do not want to retrieve the data. It is when you want to get data back so that you can actually use it that things get increasingly difficult and expensive. As mentioned earlier, the storage industry has yet to devise a storage medium that lasts longer than one or two decades at most. Long-term data preservation requires periodically refreshing at least the storage media, and usually the underlying hardware as well. In normal practice, data centers replace disk arrays every 3 to 5 years and tape drives every 5 to 7 years. Fermilab migrates 3 terabytes of data per day from old tape cartridges to new ones just so that it can be assured of retrieving the data if and when it is needed. Data preservation is becoming a hot topic among information technology administrators within companies, universities, and government agencies such as the Library of Congress. Although our world has been digital for many decades, only recently has the problem of long-term data storage and retrieval become acute, driven by two conflicting forces:
Suppose that you live in an old house in a historic New England town and discover in your attic a cache of historic artifacts. In a sheaf of thin, yellowish-brown pages, you find letters, proclamations, debt notices, and arrest warrants, some of which date back to the early eighteenth century. They are handwritten in a script and syntax that give you pause, but they are intelligible. As you leave your attic to announce your discovery, you accidentally knock over a box containing a collection of old floppy disks dating back to your first PC, which you trashed long ago. You suddenly realize that to extract anything intelligible from them, you would have to carry them down to a computer museum. Even then, there is no guarantee that the museum will have the right version of WordPerfect that you used to create many of the documents stored on those floppy disks, or that it can run the right version of DOS, all on a machine with a 5.25-inch floppy disk drive that no one makes anymore. In fact, even if you were somehow able to find the right combination of hardware and software, the floppy disks themselves may have degraded to the point where their data is no longer readable. If we want to store data for very long periods of time (hundreds of years, in the case of historians), we are exposed to two significant problems that the storage industry is still years away from solving:
The first problem concerns physical media: disks, diskettes, and tapes. These all have a physical lifespan that ranges from three to seven years. After that, they degrade (tape dries out and becomes brittle if not stored properly, for example) and lose data. CDs buy more time, perhaps a few decades if properly protected, but keeping the hardware that reads those CDs functioning for the same period is a significant challenge.

The problem of media format obsolescence was more apparent a few years ago and was largely confined to removable media (tape, for example). It will be less of an issue going forward because the storage industry is trending toward a smaller number of standard formats that can be adhered to for longer periods, and toward depending more heavily on online storage even for long-term needs and for quick retrieval. For some applications, it is better to have the data online and stored in more than one location than to have a single copy stored offline in some removable form that is subject to physical decay and data loss. The cost per gigabyte of Serial ATA (SATA) disk, for example, now allows this luxury. However, there is no free lunch. The higher the density of the disk media, the more susceptible it is to data loss over time; the data almost literally "evaporates" from the disk after a period of about five years. Therefore, if you want to keep data online and on disk for more than five years, you must migrate it from one online media format to a newer one. Then there is the cost of the electricity required to keep these mountains of data online and "spinning" continuously. For some, however, this seems a sensible tradeoff.

The real problem in data preservation lies within the actual encoding format. The problem is known as semantic continuity. Suppose that you have somehow managed to keep your 1988 TurboTax file online and available. (Go to the head of the long-term data storage class!)
Do you have any mechanism for actually making use of it? Perhaps you can print out the return on hardware you have managed to preserve all these years and store the printout away for safekeeping. The odds that your current TurboTax version can use this data electronically (meaning it would have to support a nearly 20-year-old format) are slim to none.
Semantic continuity is really at the heart of the data preservation problem. Seen in this light, it is not really a problem that storage vendors can solve. Solutions must come from application vendors (Microsoft and Oracle, for example) or from other sources within the scientific and engineering communities. It is an extremely important problem, perhaps not so much for the individual PC user as for commercial industries, research institutions, and governments. Legal documents, reports, digital pictures, presentations, and the list goes on; all need to be preserved. If we want decades of data available for Inescapable Data utility, we need mechanisms that ensure we can extract information from saved data well into the future. Many organizations are starting to grapple with this challenge, but as yet there are no clear solutions.
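The gap between preserving bits and preserving meaning can be illustrated with a toy sketch. The "v1" record layout, field names, and functions below are invented for illustration; they stand in for any application-private binary format. The bits survive either way, but only a reader that still knows the layout can migrate the record into a self-describing form:

```python
import json
import struct

# Hypothetical "v1" application format: a fixed binary record layout.
# Without the layout specification, these 18 bytes are opaque.
V1_LAYOUT = "<10sIf"  # name (10 bytes), year (uint32), amount (float32)

def write_v1(name: str, year: int, amount: float) -> bytes:
    """Save a record the way the original application did."""
    return struct.pack(V1_LAYOUT, name.encode("ascii"), year, amount)

def migrate_v1_to_json(raw: bytes) -> str:
    """Semantic migration: re-express the old record in a
    self-describing format while the v1 layout is still known."""
    name, year, amount = struct.unpack(V1_LAYOUT, raw)
    return json.dumps({
        "name": name.rstrip(b"\x00").decode("ascii"),
        "year": year,
        "amount": round(amount, 2),
    })

record = write_v1("return", 1988, 1234.56)
print(migrate_v1_to_json(record))
```

The point of the sketch is the timing: `migrate_v1_to_json` is trivial today and impossible in fifty years if the `V1_LAYOUT` knowledge has been lost, which is why migration has to happen while format knowledge is still alive.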
To keep certain electronic documents safe from the ravages of time and planned obsolescence, some companies are choosing to convert Microsoft Word and Excel documents into image formats (TIFF, GIF, PDF) that are standard and well known. Although these formats, too, will age as they fall out of prevalent use, migration tools will not only copy the data to newer-faster-cheaper-denser storage media but also convert the images to whatever becomes the prevailing standard. This approach has some merit, but it also has important drawbacks. First, an image of a Word document requires orders of magnitude more storage capacity than the original document stored in its native Word format. Second, although the text and embedded graphics are preserved, formatting attributes and file metadata are lost. For example, if you convert a spreadsheet to a picture, you lose the cell formulas and associations that reside behind each spreadsheet cell. If a spreadsheet is archived so as to preserve the underlying math, image or PDF conversion will not help. Another approach is to use XML, which is wonderful for allowing attributes and other metadata to be expressed in a more universal format. XML enables you to store much of the nonvisual information that many documents contain (such as the relationships behind spreadsheet cell formulas). However, XML might not fully describe the "look" of a document in exactly the way that a picture (or perfect re-rendering) can; sometimes, the purpose of retrieving an archived document is to prove (perhaps in court) that the information was laid out in a nonmisleading way (consumer label cases, for example). Therefore, not having an image-based rendering of the document could be troublesome for some industries and applications. Storing both an XML version and an image version of a document has been suggested as a workaround; however, it is one that consumes enormous amounts of storage.
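As a rough illustration of the XML approach, consider a single spreadsheet cell. The element and attribute names below are a made-up archival schema, not any standard; the point is that the XML keeps both the displayed value and the formula behind it, whereas an image of the sheet would show only "30":

```python
import xml.etree.ElementTree as ET

# Archive one spreadsheet cell: its visible value AND its formula.
cell = ET.Element("cell", ref="C1")
ET.SubElement(cell, "value").text = "30"
ET.SubElement(cell, "formula").text = "=A1+B1"

archived = ET.tostring(cell, encoding="unicode")
print(archived)
# -> <cell ref="C1"><value>30</value><formula>=A1+B1</formula></cell>

# Decades later, a reader recovers the relationship, not just the number.
restored = ET.fromstring(archived)
assert restored.find("formula").text == "=A1+B1"
```

Because the markup is plain text with a published grammar, a future reader needs only an XML parser, not the original spreadsheet application, to recover the underlying math; what it cannot recover is the exact pixel-for-pixel appearance of the sheet.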
Another approach is to use a more versatile display format, such as Adobe PDF, with some added extensions. If both the format and the rendering logic are well specified, it is conceivable that more of the raw contents of a document could be stored. Even so, you would also need to store much ancillary data with the file to be 100 percent sure that the "picture" and all of its associated data can be reconstituted. For example, the actual text fonts would have to be stored in the document (as opposed to merely referenced). Similarly, all color and layout information would have to be specified in some device-independent manner. There will likely be other blocks of information that would need to be stored as well to enable perfect reproduction. Finally, it is of paramount importance to store the document and all of its associations as a 100 percent self-contained object so that it can be migrated and copied without external dependencies. For small documents, these requirements could exact a disproportionately high premium in storage capacity, making the approach impractical for many applications. However, it may well be practical for aircraft manufacturers, for example, which must maintain engineering documents (normally large files) for as long as a plane is still flying.

There is some hope for limiting the runaway accumulation of stored digital documents. Intelligent storage devices can tease out redundancies that occur across many files (as opposed to compressing within a file). If a given storage device holds thousands or even millions of files, then in its "spare time" (i.e., when not responding to normal read/write operations) it can examine its contents for redundant data.
For example, if you routinely use PowerPoint to create your corporate presentations and typically use the same base template, a healthy percentage of the storage space required by each of those presentations is allocated to data that is identical in every presentation that uses that template. Similarly, if our new document-saving approach has to embed such things as fonts (over and over again, for each document), such cross-file examination could lead to impressive compression. This purely redundant data can be teased out of each document and applied back to the document when it is retrieved. In time, as Inescapable Data devices force companies to store and manage petabytes rather than terabytes of data, storage devices will become available that use embedded intelligence to "mine out" repetitive patterns and redundancies. Fat files will be transparently replaced by smaller ones that embody only the unique contents plus pointers to the "shared" or redundant information common to many files. The user of the file will be unaware of the magic going on behind the scenes. This process is not unlike what a RAID (Redundant Array of Independent Disks) storage system does today: it takes a file and breaks it up into thousands of data "pieces" that are "sprayed" or "striped" (in the disk storage vernacular) across a number of physical disks in a RAID group. Intelligence in the RAID system reconstructs the file when it is recalled. Should one of the physical disks fail, the RAID system reconstructs the missing data by virtue of parity data stored on the other disk members of the RAID group, all transparently to the application user. Intelligent storage devices are in use today. More intelligence is coming.
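The unique-contents-plus-pointers scheme described above can be sketched in a few lines of Python. This is a toy model, not any vendor's implementation: files are cut into fixed-size chunks, each chunk is named by its SHA-256 hash, and identical chunks that appear in many files are stored only once.

```python
import hashlib

CHUNK = 64  # toy chunk size in bytes; real systems use larger, often variable chunks

chunk_store: dict[str, bytes] = {}  # hash -> unique chunk, stored once

def store(data: bytes) -> list[str]:
    """Replace a file with a 'recipe': a list of pointers to shared chunks."""
    recipe = []
    for i in range(0, len(data), CHUNK):
        piece = data[i:i + CHUNK]
        digest = hashlib.sha256(piece).hexdigest()
        chunk_store.setdefault(digest, piece)  # redundant pieces are not stored again
        recipe.append(digest)
    return recipe

def retrieve(recipe: list[str]) -> bytes:
    """Reassemble the original file; the user never sees the magic."""
    return b"".join(chunk_store[d] for d in recipe)

# Two "presentations" that share the same 256-byte template:
template = bytes(range(256))
deck_a = template + b"unique slides for deck A"
deck_b = template + b"other content for deck B"
r_a, r_b = store(deck_a), store(deck_b)

assert retrieve(r_a) == deck_a and retrieve(r_b) == deck_b
stored = sum(len(c) for c in chunk_store.values())
print(f"raw: {len(deck_a) + len(deck_b)} bytes, stored once: {stored} bytes")
```

The shared template is held only once, so the two 280-byte files consume 304 bytes of chunk storage instead of 560. Production deduplication systems refine this with content-defined chunk boundaries and kilobyte-scale chunks, but the bookkeeping is the same: each fat file becomes a small recipe of pointers, and retrieval reconstitutes it transparently.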