Text Compression Algorithms

There are three major approaches to text compression:

  • Dictionary-based (LZ, for Lempel and Ziv)

  • Block-sorting-based (BWT, or Burrows-Wheeler Transform)

  • Symbol probability prediction-based (PPM, or Prediction by Partial Matching)

Most file-compression formats, such as arj, zip, gz, lzh, and rar (as well as GIF and PNG), are based on dictionary algorithms that look for previously occurring phrases. By substituting a reference, the distance back to the last occurrence and the length of the phrase, they save space. The LZW (Lempel, Ziv, and Welch) algorithm used in the GIF format is different: it substitutes a dictionary index for the phrase. LZ-based algorithms are very fast, with moderate compression ratios and modest memory requirements (less than 100KB).
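To make the back-reference idea concrete, here is a minimal, illustrative LZ77-style matcher in Python. It is not the actual zip/gzip code (real implementations use hash chains to find matches and then Huffman-code the output); it only shows how a repeated phrase becomes a (distance, length) token:

# Minimal LZ77-style sketch: emit either a literal byte or a
# (distance, length) back-reference to an earlier occurrence.
def lz77_tokens(data: bytes, window: int = 4096, min_len: int = 3):
    i, out = 0, []
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # Matches may overlap the current position, as in real LZ77.
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_len:
            out.append(("match", best_dist, best_len))   # phrase seen before
            i += best_len
        else:
            out.append(("literal", data[i]))
            i += 1
    return out

print(lz77_tokens(b"blah blah blah blah"))
# five literals for "blah ", then one long ("match", 5, 14) back-reference

The decompressor simply copies the referenced bytes back out, which is why decoding LZ streams is so fast.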

BWT algorithms apply a block-sorting transform to the text. The result of the transform is text of the same size, but identical letters end up grouped together, so it can then be compressed efficiently with a very fast and simple coding technique. Block-sorting transforms only make sense when they are applied to a large block of data. BWT algorithms are fast, with high compression ratios and moderate memory requirements (1MB+).
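A naive sketch of the transform itself shows the grouping effect; real block-sorting compressors (bzip2, for example) use suffix sorting on large blocks and then apply move-to-front, run-length, and entropy coding to the transformed text:

# Naive Burrows-Wheeler transform: sort all rotations of the block and
# keep the last column.  Letters that share a context end up next to
# each other, which is what the simple back-end coder exploits.
def bwt(block: str, eob: str = "\0") -> str:
    s = block + eob                      # unique end-of-block marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(repr(bwt("this is a test, this is only a test")))
# the same characters tend to come out clustered into runs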

PPM algorithms predict a probability distribution for each symbol from the symbols that precede it and then encode the symbol near-optimally, typically with an arithmetic coder. Most PPM implementations are slow, sophisticated, and memory intensive (8MB+).
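The toy order-1 model below hints at the idea: it predicts each character from the one before it and charges the ideal code length for that prediction. Real PPM blends several higher-order contexts with escape probabilities and drives an arithmetic coder, so treat this only as a sketch of the "predict from context" step:

from collections import Counter, defaultdict
import math

def order1_bpc(text: str) -> float:
    """Ideal bits per character under a simple order-1 context model."""
    alphabet_size = len(set(text))
    counts = defaultdict(Counter)      # context (previous char) -> next-char counts
    prev, bits = "", 0.0
    for ch in text:
        seen = counts[prev]
        total = sum(seen.values()) + alphabet_size   # crude add-one smoothing
        p = (seen[ch] + 1) / total                   # predicted probability of ch
        bits += -math.log2(p)                        # ideal code length for this symbol
        seen[ch] += 1
        prev = ch
    return bits / len(text)

print(round(order1_bpc("the theory that they thought through " * 50), 2))

The better the model's predictions, the fewer bits per character it needs, which is exactly what the bpc figures below measure.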

The efficiency of these lossless algorithms is measured in bits per character (bpc), the number of bits needed to encode each character. English text has an estimated entropy of about 1.3 bpc,[1] so theoretically a compression algorithm that "knew" all the idioms and structure of the language could approach this figure. To give you an idea of how effective the current web compression algorithms are, Table 18.1 lists some lossless compression algorithms and their efficiencies.

[1] Guy E. Blelloch, "Introduction to Data Compression" [online], (2001 [cited 12 November 2002]), available from the Internet at http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/pscico-guyb/realworld/www/compression.pdf. A draft of a chapter on data compression.
Table 18.1. Lossless Compression Ratios for Text Compression on the Calgary Corpus [1], [2]

Date  Implementation        Algorithm  Author        Ratio (bpc)
1977  LZ77                  LZ         Ziv, Lempel   3.94
1984  LZW (GIF, V42/44b)    LZ         Welch         3.27
1987  Deflate (PNG, gzip)   LZ         Katz          2.63
1997  BOA                   PPM        Sutton        1.99
2000  RK                    PPM        Taylor        1.82
2002  SBC                   BWT        Taylor        1.88
[2] Timothy Bell, Ian H. Witten, and John G. Cleary, "Modeling for Text Compression," ACM Computing Surveys 21, no. 4 (1989): 557-591.
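You can get a rough feel for these numbers yourself by compressing a text file and dividing the bits emitted by the characters read. The sketch below uses Python's zlib module (Deflate, the 2.63 bpc row in Table 18.1); the file name is only a placeholder for whatever large plain-text file you test with:

import zlib

# "book1.txt" is a placeholder; any large plain-text file will do
# (the Calgary corpus files are the traditional benchmark).
with open("book1.txt", "rb") as f:
    data = f.read()

compressed = zlib.compress(data, 9)          # Deflate at maximum compression
bpc = len(compressed) * 8 / len(data)        # bits emitted per input character
print(f"Deflate: {bpc:.2f} bits per character")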

Other than multimedia or plug-in-based formats, the compression algorithms currently used on the Internet are fairly old and far from the state of the art. Will they be replaced with newer and better ones? Not likely anytime soon. Even if somebody came up with the world's best compression algorithm and the algorithm were patent-free, it would be hard to convince the major browser developers to add it to their browser code. And even if they did, it would be hard to convince webmasters to use compression that is not supported by older browsers.

So we've got to work with what we have. Let's see how we can squeeze the maximum advantage out of these algorithms.
