There are three major approaches to text compression:
Dictionary-based (LZ stands for Lempel and Ziv )
Block sorting-based (BWT, or Burrows-Wheeler Transform)
Symbol probability prediction-based (PPM, or Prediction by Partial Matching)
Most file-compression formats like arj, zip, gz, lzh, and rar (including GIF and PNG) are based on dictionary algorithms of previously occurring phrases. By substituting the distance to the last occurrence and the length of the phrase, they save space. The LZW (Lempel, Ziv, and Welch) algorithm used in the GIF format is differentit substitutes a dictionary index for the phrase. LZ-based algorithms are very fast, with moderate compression ratios and modest memory requirements (< 100KB).
BWT algorithms make a block-sorting transform over the text. The result of the transform is text of the same size , but letters are magically grouped. It can then be efficiently compressed with a very fast and simple coding technique. Block-sorting transforms make sense when they are applied to a big data block. BWT algorithms are fast, with high compression ratios and moderate memory requirements (1MB+).
PPM algorithms calculate the probability distribution for every symbol and then optimally encode it. Most PPM implementations are slow, sophisticated, and memory intensive (8MB+).
The efficiency of these lossless algorithms is measured in bits per character (bpc), or the number of bits needed to encode each character. The English language has an effective bits/char of 1.3, [1] so theoretically if a compression algorithm "knew" all the idioms and -structure of this language, it could approach this figure. To give you an idea of how effective the current web compression algorithms are, here is a table of lossless compression algorithms and their efficiencies (see Table 18.1).
Date | Implementation | Algorithm | Author | Ratio, bpc |
---|---|---|---|---|
1977 | LZ77 | LZ | Ziv, Lempel | 3.94 |
1984 | LZW (GIF, V42/44b) | LZ | Welch | 3.27 |
1987 | Deflate (PNG, gzip) | LZ | Katz | 2.63 |
1997 | BOA | PPM | Sutton | 1.99 |
2000 | RK | PPM | Taylor | 1.82 |
2002 | SBC | BWT | Taylor | 1.88 |
Other than multimedia- or plug-inbased formats, the compression algorithms currently used on the Internet are pretty old and far from state-of-the-art. Will they be replaced with newer and better ones? Not likely anytime soon. Even if somebody came up with the world's-best compression algorithm and the algorithm was patent-free, it will be hard to convince the major browser developers to add this algorithm to their browser code. But even if they do, it will be hard to convince webmasters to use compression that is not supported by older browsers.
So we've got to work with what we have. Let's see how we can squeeze the maximum advantage out of these algorithms.