Breakthrough Compression


It may be impractical to store "thick files" until some sort of breakthrough occurs in the storage space. Compression software (software that reduces files to one-third or even one-fourth of their original size on disk) can achieve wonderfully high compression ratios. However, achieving these higher compression ratios typically entails some form of data loss, however slight. "Lossless" schemes typically achieve only 2:1 or slightly better compression ratios.

For legal, financial, and archival document preservation, one may not be able to tolerate any data loss and must be able to reconstitute a document or record 100 percent correctly. (Just 1 data bit missing from a bank record could mean the difference between being solvent and being in debt, for example. On the other hand, a digital video file could lose a data bit without anyone ever noticing the missing pixel.)

The data compression world is presently divided into two distinct camps. One tolerates "lossy" techniques for digital imaging applications (videos and pictures) as a tradeoff for saving huge amounts of storage capacity. The other, "lossless," camp wants high storage savings ratios coupled with a guarantee of data consistency when a compressed document is retrieved. What is needed is something in the middle: a technique that achieves 100x storage capacity savings (even if it eats a lot of CPU cycles to get there) while at the same time reassembling the data perfectly upon recall.

Wavelets are a newer, though still lossy, form of compression that looks at a much broader sampling of data "chunks" and analyzes them more deeply in an effort to find and extract greater amounts of redundancy, the magic of data compression. Any compression method applied to a single file in isolation will probably not yield significant storage capacity savings, but as the scope of the input data expands, the value of high-ratio data compression increases dramatically.
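The book stays at the conceptual level, but the simplest wavelet, the Haar transform, shows the idea in miniature: a signal is split into pairwise averages and differences, and on smooth or repetitive data the differences land near zero, where they compress extremely well, yet the transform itself loses nothing. The Python sketch below is only an illustration of that principle, not any particular product's algorithm.

    # Minimal sketch of a single-level Haar wavelet transform (the simplest
    # wavelet). Smooth or redundant data produces "detail" values near zero,
    # which compress very well, while the transform remains fully reversible.
    # Assumes an even-length signal, purely for illustration.

    def haar_forward(signal):
        """One level of the Haar transform: pairwise averages and differences."""
        averages, details = [], []
        for i in range(0, len(signal) - 1, 2):
            a, b = signal[i], signal[i + 1]
            averages.append((a + b) / 2.0)
            details.append((a - b) / 2.0)
        return averages, details

    def haar_inverse(averages, details):
        """Exactly reverse the transform -- no information is lost."""
        signal = []
        for avg, det in zip(averages, details):
            signal.append(avg + det)
            signal.append(avg - det)
        return signal

    # A smooth "ramp" of data: every detail coefficient is the same small
    # number, so the transformed form is highly redundant and easy to encode.
    data = [1, 2, 3, 4, 5, 6, 7, 8]
    avgs, dets = haar_forward(data)
    print(avgs)  # [1.5, 3.5, 5.5, 7.5]
    print(dets)  # [-0.5, -0.5, -0.5, -0.5]
    print(haar_inverse(avgs, dets) == [float(x) for x in data])  # True

Lossy wavelet codecs gain their large ratios by discarding or coarsely quantizing those small detail values; kept exactly, the inverse transform reproduces the original data bit for bit.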

The promise of wavelet-based compression is that, given plenty of processing horsepower, there may be no limit to the amount of redundancy that can be squeezed out of a stream of data. Imagine the sequence of characters ABCDEFGHIJ. Such a sequence is hard, if not impossible, to compress. If, on the other hand, it were ABCCCCCHIJ, the redundant Cs could be replaced by a count digit followed by a single C. Most compression algorithms look for such obvious opportunities. A less obvious solution (requiring more CPU time to detect) would be to recognize that the string of letters may contain "ramps" in which each successive letter is one higher (or lower) than the previous letter. The compression scheme could then, for example, encode the starting value (A) and a ramp rate of up or down (1). An even less obvious solution would be to recognize that the majority of that pattern already exists in other documents and, therefore, it might not be necessary to store all of it again, once more at the cost of many more CPU cycles.
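The first two encodings described above are simple enough to sketch directly. The following Python is a minimal illustration of run-length encoding and ramp encoding, not the book's actual scheme:

    # Minimal sketches of the two encodings described above: run-length
    # encoding for repeated characters, and "ramp" encoding for sequences of
    # consecutive characters.

    def run_length_encode(text):
        """Replace runs of a repeated character with a count plus the character."""
        out, i = [], 0
        while i < len(text):
            j = i
            while j < len(text) and text[j] == text[i]:
                j += 1
            run = j - i
            out.append(f"{run}{text[i]}" if run > 1 else text[i])
            i = j
        return "".join(out)

    def ramp_encode(text):
        """Encode a strictly ascending run as start character, '+', and length."""
        if all(ord(text[k + 1]) - ord(text[k]) == 1 for k in range(len(text) - 1)):
            return f"{text[0]}+{len(text)}"
        return text  # no ramp found; leave the data as-is

    print(run_length_encode("ABCCCCCHIJ"))  # AB5CHIJ
    print(ramp_encode("ABCDEFGHIJ"))        # A+10

Both encodings are perfectly reversible, which is exactly why they count as lossless: the decoder can regenerate the original string from the count or the ramp description with nothing lost.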

Sometimes, data can be "folded," like bed linens (or even more creatively rearranged), and when folded, more redundancy appears. The point is this: Myopically looking at small streams of data misses the larger redundancies that can be found by looking through huge chunks of data for commonalities among a great many files (a broader type of folding) or for differences that can be "described" (like the ramp example). To eliminate these greater amounts of redundancy when data is stored, and add it back when the data is retrieved, storage devices need more "extra time" on their hands (to find and catalog all their data), more clues about the contents of the files they contain (for more efficient searches), more advanced compression algorithms, and, most importantly, terabytes of data to sift through in order to find redundant patterns. But once there, 100x improvements in density might be just the beginning.
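The book does not name a mechanism for exploiting redundancy across many files, but block-level deduplication is one widely used approach that captures the spirit: split every file into chunks, fingerprint each chunk, and store any chunk the system has already seen only once. The sketch below uses hypothetical file names and fixed-size chunks purely for illustration.

    # Minimal sketch of cross-file deduplication: hash fixed-size chunks and
    # store each unique chunk once. Real deduplicating storage is far more
    # sophisticated (variable, content-defined chunking, for example), but
    # the redundancy it exploits is exactly this kind.

    import hashlib

    CHUNK_SIZE = 4096
    chunk_store = {}   # hash -> chunk bytes (each unique chunk stored once)
    file_index = {}    # file name -> ordered list of chunk hashes

    def store_file(name, data):
        hashes = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            chunk_store.setdefault(digest, chunk)  # duplicate chunks cost nothing
            hashes.append(digest)
        file_index[name] = hashes

    def retrieve_file(name):
        """Reassemble the original bytes perfectly from the shared chunk store."""
        return b"".join(chunk_store[h] for h in file_index[name])

    # Two largely identical documents share most of their chunks.
    store_file("contract_v1.txt", b"boilerplate " * 2000 + b"signed by Alice")
    store_file("contract_v2.txt", b"boilerplate " * 2000 + b"signed by Bob")
    print(len(chunk_store))  # far fewer chunks stored than the 12 the files contain
    print(retrieve_file("contract_v1.txt").endswith(b"Alice"))  # True -- lossless

Because every chunk is stored exactly and files are reassembled by hash lookup, the savings are lossless: the retrieved file is bit-for-bit identical to the original, which is precisely the guarantee the legal and financial camp demands.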


