Breakthrough Compression
It may be impractical to store "thick files" until some sort of breakthrough occurs in the storage space. Compression software (software that shrinks files to one-third or even one-fourth of their original size) can achieve wonderfully high compression ratios. However, achieving these higher compression ratios typically entails some form of data loss, however slight. "Lossless" schemes typically achieve only 2:1 or slightly better compression ratios.
For legal, financial, and archival document preservation, one may not be able to tolerate any data loss and must be able to reconstitute a document or record with 100 percent accuracy. (Just one missing data bit in a bank record could mean the difference between being solvent and being in debt, for example. On the other hand, a digital video file could lose a data bit without anyone ever noticing the missing pixel.)
The data compression world is presently divided into two distinct camps. One tolerates "lossy" techniques for digital imaging applications (videos and pictures) as a tradeoff for saving huge amounts of storage capacity. The other, "lossless" camp wants high storage savings ratios coupled with a guarantee of data consistency when a compressed document is retrieved. What is needed is something in the middle: a technique that achieves 100x storage capacity savings (even if it eats a lot of CPU cycles to get there) while at the same time perfectly reassembling the data upon recall.
Wavelets are a newer form of compression, albeit a lossy one, that looks at a much broader sampling of data "chunks" and analyzes them more deeply in an effort to find and extract greater amounts of redundancy, the magic of data compression. Any compression method applied to just a single file will probably not yield significant storage capacity savings, but as the scope of input data is expanded, the value of high-ratio data compression increases dramatically.
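The effect of widening the scope of the input can be seen even with an ordinary lossless compressor. The following is a minimal sketch, not a wavelet method: it uses DEFLATE from Python's standard `zlib` module as a stand-in, and the documents and the `BOILERPLATE` text are hypothetical data invented for the illustration.

```python
import zlib

# Hypothetical documents that share a long boilerplate passage.
BOILERPLATE = (
    b"This agreement is entered into by the parties named below and is "
    b"governed by the terms and conditions set out in the attached schedule. "
)
documents = [
    BOILERPLATE + b"Account 1001, balance 250.00.",
    BOILERPLATE + b"Account 1002, balance 13.75.",
    BOILERPLATE + b"Account 1003, balance 9800.10.",
]

# Compress each document on its own, then all of them as one stream.
separate = sum(len(zlib.compress(doc, 9)) for doc in documents)
stream = zlib.compress(b"".join(documents), 9)
combined = len(stream)

print("compressed separately:", separate, "bytes")
print("compressed together:  ", combined, "bytes")

# Lossless: the combined stream decompresses to exactly the original bytes.
restored = zlib.decompress(stream)
```

Compressed together, the shared passage is stored once and later copies become short back-references, so the combined stream comes out smaller than the sum of the separately compressed documents.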
The promise of wavelet-based compression is that, given plenty of processing horsepower, there may be no limit to the amount of redundancy that can be squeezed out of a stream of data. Imagine the sequence of characters ABCDEFGHIJ. Such a sequence is hard, if not impossible, to compress. If, on the other hand, it were ABCCCCCHIJ, the redundant Cs could be replaced by a count digit followed by a single C. Most compression algorithms look for such obvious opportunities. A less obvious solution (requiring more CPU time to detect) would be to realize that the string of letters may contain "ramps" where each successive letter is one higher (or lower) than the previous letter. The compression scheme could then, for example, encode the starting value (A) and a ramp rate of 1 (up or down). A still less obvious solution would be to realize that most of that pattern already exists in other documents and that it therefore might not be necessary to store all of it again, once again at the cost of many more CPU cycles.
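The two encodings described above can be sketched in a few lines. This is an illustrative sketch only: the function names and the `(start, step, length)` segment format are assumptions made for the example, not an established scheme.

```python
def run_length_encode(text):
    """Collapse each run of a repeated character into a (count, char) pair."""
    runs = []
    for ch in text:
        if runs and runs[-1][1] == ch:
            runs[-1] = (runs[-1][0] + 1, ch)
        else:
            runs.append((1, ch))
    return runs


def run_length_decode(runs):
    return "".join(ch * count for count, ch in runs)


def ramp_encode(text):
    """Greedily split text into (start_char, step, length) segments.

    A segment extends for as long as successive code points differ by the
    same step, so a step of 0 also covers plain repeated characters.
    """
    segments = []
    i = 0
    while i < len(text):
        if i + 1 == len(text):
            segments.append((text[i], 0, 1))  # lone trailing character
            break
        step = ord(text[i + 1]) - ord(text[i])
        j = i + 1
        while j + 1 < len(text) and ord(text[j + 1]) - ord(text[j]) == step:
            j += 1
        segments.append((text[i], step, j - i + 1))
        i = j + 1
    return segments


def ramp_decode(segments):
    return "".join(
        chr(ord(start) + step * k)
        for start, step, length in segments
        for k in range(length)
    )


print(run_length_encode("ABCCCCCHIJ"))
print(ramp_encode("ABCDEFGHIJ"))   # [('A', 1, 10)]
print(ramp_encode("ABCCCCCHIJ"))   # [('A', 1, 3), ('C', 0, 4), ('H', 1, 3)]
```

Run-length encoding gains nothing on ABCDEFGHIJ, while the ramp encoder reduces it to a single segment. Both decoders rebuild the input exactly, so the extra CPU time buys density without data loss.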
Sometimes, data can be "folded," like bed linens (or even more creatively rearranged), and when folded, more redundancy appears. The point is this: Myopically looking at small streams of data misses the larger redundancies that can be found by looking through huge chunks of data for commonalities among a great many files (a broader type of folding) or for files whose differences can be "described" (like the ramp example). To eliminate these greater amounts of redundancy, extracting them when data is stored and adding them back when it is retrieved, storage devices need to have more "extra time" on their hands (to be able to find and catalog all their data), more clues about the contents of the files they contain (for more efficient searches), more advanced compression algorithms, and, most importantly, terabytes of data to sift through in order to find redundant patterns. But once there, 100x improvements in density might just be the beginning.
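One simple way to picture a storage device "cataloging" its data is a chunk store that keeps each distinct chunk only once. The sketch below is a rough illustration under stated assumptions: fixed-size chunks, SHA-256 digests as catalog keys, and hypothetical file names and contents. The names `store`, `retrieve`, and `CHUNK_SIZE` are invented for the example.

```python
import hashlib

CHUNK_SIZE = 16  # hypothetical chunk size in bytes


def store(files, chunk_size=CHUNK_SIZE):
    """Return (catalog, recipes).

    catalog maps a chunk digest to the chunk bytes, stored once.
    recipes maps a file name to the ordered digests that rebuild it.
    """
    catalog = {}
    recipes = {}
    for name, data in files.items():
        digests = []
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            catalog.setdefault(digest, chunk)  # keep only the first copy
            digests.append(digest)
        recipes[name] = digests
    return catalog, recipes


def retrieve(name, catalog, recipes):
    """Add the redundancy back: reassemble a file from its chunk digests."""
    return b"".join(catalog[d] for d in recipes[name])


# Hypothetical files that share their first 64 bytes.
shared = bytes(range(64))
files = {"a.dat": shared + b"tail-A", "b.dat": shared + b"tail-B"}

catalog, recipes = store(files)
raw_bytes = sum(len(data) for data in files.values())
stored_bytes = sum(len(chunk) for chunk in catalog.values())
print(raw_bytes, "bytes in,", stored_bytes, "bytes of unique chunks kept")
```

Fixed-size chunking only finds duplicates that line up on chunk boundaries. Detecting shifted or merely similar data takes more elaborate analysis and more CPU time, which is the tradeoff the text describes.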