Text Compression Algorithms

There are three major approaches to text compression:

  • Dictionary-based (LZ, for Lempel and Ziv)

  • Block-sorting-based (BWT, or Burrows-Wheeler Transform)

  • Symbol probability prediction-based (PPM, or Prediction by Partial Matching)

Most file-compression formats, such as arj, zip, gz, lzh, and rar (as well as GIF and PNG), are based on dictionary algorithms that look for previously occurring phrases. By substituting a reference, the distance back to the last occurrence and the length of the phrase, they save space. The LZW (Lempel, Ziv, and Welch) algorithm used in the GIF format is different: it substitutes a dictionary index for the phrase. LZ-based algorithms are very fast, with moderate compression ratios and modest memory requirements (less than 100KB).
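To make the back-reference idea concrete, here is a minimal, illustrative LZ77-style matcher in Python. It is not the actual zip/gzip code (real implementations use hash chains to find matches and then Huffman-code the output); it only shows how a repeated phrase becomes a (distance, length) token:

# Minimal LZ77-style sketch: emit either a literal byte or a
# (distance, length) back-reference to an earlier occurrence.
def lz77_tokens(data: bytes, window: int = 4096, min_len: int = 3):
    i, out = 0, []
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # Matches may overlap the current position, as in real LZ77.
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_len:
            out.append(("match", best_dist, best_len))   # phrase seen before
            i += best_len
        else:
            out.append(("literal", data[i]))
            i += 1
    return out

print(lz77_tokens(b"blah blah blah blah"))
# five literals for "blah ", then one long ("match", 5, 14) back-reference

The decompressor simply copies the referenced bytes back out, which is why decoding LZ streams is so fast.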

BWT algorithms apply a block-sorting transform to the text. The result of the transform is text of the same size, but identical letters end up grouped together, so it can then be compressed efficiently with a very fast and simple coding technique. Block-sorting transforms only make sense when they are applied to a large block of data. BWT algorithms are fast, with high compression ratios and moderate memory requirements (1MB+).
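A naive sketch of the transform itself shows the grouping effect; real block-sorting compressors (bzip2, for example) use suffix sorting on large blocks and then apply move-to-front, run-length, and entropy coding to the transformed text:

# Naive Burrows-Wheeler transform: sort all rotations of the block and
# keep the last column.  Letters that share a context end up next to
# each other, which is what the simple back-end coder exploits.
def bwt(block: str, eob: str = "\0") -> str:
    s = block + eob                      # unique end-of-block marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(repr(bwt("this is a test, this is only a test")))
# the same characters tend to come out clustered into runs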

PPM algorithms predict a probability distribution for each symbol from the symbols that precede it and then encode the symbol near-optimally, typically with an arithmetic coder. Most PPM implementations are slow, sophisticated, and memory intensive (8MB+).
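The toy order-1 model below hints at the idea: it predicts each character from the one before it and charges the ideal code length for that prediction. Real PPM blends several higher-order contexts with escape probabilities and drives an arithmetic coder, so treat this only as a sketch of the "predict from context" step:

from collections import Counter, defaultdict
import math

def order1_bpc(text: str) -> float:
    """Ideal bits per character under a simple order-1 context model."""
    alphabet_size = len(set(text))
    counts = defaultdict(Counter)      # context (previous char) -> next-char counts
    prev, bits = "", 0.0
    for ch in text:
        seen = counts[prev]
        total = sum(seen.values()) + alphabet_size   # crude add-one smoothing
        p = (seen[ch] + 1) / total                   # predicted probability of ch
        bits += -math.log2(p)                        # ideal code length for this symbol
        seen[ch] += 1
        prev = ch
    return bits / len(text)

print(round(order1_bpc("the theory that they thought through " * 50), 2))

The better the model's predictions, the fewer bits per character it needs, which is exactly what the bpc figures below measure.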

The efficiency of these lossless algorithms is measured in bits per character (bpc), the number of bits needed to encode each character. English text has an estimated entropy of about 1.3 bpc,[1] so theoretically a compression algorithm that "knew" all the idioms and structure of the language could approach this figure. To give you an idea of how effective the current web compression algorithms are, Table 18.1 lists some lossless compression algorithms and their efficiencies.

[1] Guy E. Blelloch, "Introduction to Data Compression" [online], (2001 [cited 12 November 2002]), available from the Internet at http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/pscico-guyb/realworld/www/compression.pdf. A draft of a chapter on data compression.
Table 18.1. Lossless Compression Ratios for Text Compression on the Calgary Corpus [1], [2]

Date  Implementation        Algorithm  Author        Ratio (bpc)
1977  LZ77                  LZ         Ziv, Lempel   3.94
1984  LZW (GIF, V42/44b)    LZ         Welch         3.27
1987  Deflate (PNG, gzip)   LZ         Katz          2.63
1997  BOA                   PPM        Sutton        1.99
2000  RK                    PPM        Taylor        1.82
2002  SBC                   BWT        Taylor        1.88
[2] Timothy Bell, Ian H. Witten, and John G. Cleary, "Modeling for Text Compression," ACM Computing Surveys 21, no. 4 (1989): 557-591.
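You can get a rough feel for these numbers yourself by compressing a text file and dividing the bits emitted by the characters read. The sketch below uses Python's zlib module (Deflate, the 2.63 bpc row in Table 18.1); the file name is only a placeholder for whatever large plain-text file you test with:

import zlib

# "book1.txt" is a placeholder; any large plain-text file will do
# (the Calgary corpus files are the traditional benchmark).
with open("book1.txt", "rb") as f:
    data = f.read()

compressed = zlib.compress(data, 9)          # Deflate at maximum compression
bpc = len(compressed) * 8 / len(data)        # bits emitted per input character
print(f"Deflate: {bpc:.2f} bits per character")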

Other than multimedia or plug-in-based formats, the compression algorithms currently used on the Internet are fairly old and far from the state of the art. Will they be replaced with newer and better ones? Not likely anytime soon. Even if somebody came up with the world's best compression algorithm and the algorithm were patent-free, it would be hard to convince the major browser developers to add it to their browser code. And even if they did, it would be hard to convince webmasters to use compression that is not supported by older browsers.

So we've got to work with what we have. Let's see how we can squeeze the maximum advantage out of these algorithms.
