13.3 Data Compression

Unless data is purely random, it contains a degree of redundancy. Data compression can be considered a process that uses one or more algorithms to reduce or eliminate data redundancies.

13.3.1 Compression Ratio

The efficiency of compression is referred to as the compression ratio, which represents the ratio of the number of original bytes in a frame to the number of compressed bytes. A high compression ratio represents either efficient compression, data with a high level of redundancy, or both.

Because the data carried by information frames varies over time with respect to content and the amount of redundancy, the compression ratio will also vary with time. Normally, you can expect a compression-performing remote bridge or router to have an average compression ratio between 2.0 and 2.5. Allowing for the fact that the WAN transmission protocol has a 10 to 20 percent overhead, compression typically doubles the information-carrying capacity of a circuit. That is, as a general rule, you can expect a 56-Kbps digital circuit to transfer an average of 112 Kbps of LAN frame data when compression-performing remote bridges or routers are used.
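The arithmetic behind these figures can be sketched in a few lines of Python; the ratio and overhead values used below are illustrative assumptions, not measured results.

```python
# Back-of-the-envelope model of compressed WAN throughput.
# The compression ratio and overhead figures are illustrative assumptions.

def effective_throughput(line_rate_bps: float,
                         compression_ratio: float,
                         protocol_overhead: float = 0.0) -> float:
    """Rate at which LAN frame data crosses the circuit, given the line
    rate, the achieved compression ratio, and the fraction of the line
    rate consumed by the WAN transmission protocol."""
    return line_rate_bps * (1.0 - protocol_overhead) * compression_ratio

# With a 2.35:1 ratio and 15 percent protocol overhead, a 56-Kbps
# circuit carries roughly 112 Kbps of LAN frame data.
```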

13.3.2 Compression Methods

There are three main categories into which compression methods fall: byte-oriented, statistical, and table-lookup or dictionary-based.

13.3.2.1 Byte-Oriented Compression

Byte-oriented compression techniques examine the bit composition of each byte, comparing the composition of the current byte to one or more succeeding bytes. If redundancies are noted, the algorithm replaces a series of bytes with a new sequence that eliminates those redundancies. A few of the more popular byte-oriented compression techniques include run length encoding, half-byte encoding, and null suppression.

Null suppression simply results in the substitution of a compression indicating character and null count character for a sequence of nulls. In comparison, run length encoding permits any character sequence to be reduced by the use of a three-character sequence to replace a repeated run of any character. The three-character sequence includes a compression-indicating character, the character in the run, and a count character.
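As a sketch (not any vendor's implementation), the run-length rule can be expressed in a few lines of Python. The 0x1D flag byte and the four-byte minimum run are arbitrary assumptions; null suppression is simply the special case in which only runs of nulls are replaced.

```python
# Minimal run-length encoding sketch. The flag byte and the minimum
# run length worth compressing are arbitrary choices for illustration.
RLE_FLAG = 0x1D  # hypothetical compression-indicating character

def rle_encode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        if run >= 4:
            # Three-character sequence: flag, repeated character, count.
            out.extend([RLE_FLAG, data[i], run])
        else:
            out.extend(data[i:i + run])
        i += run
    return bytes(out)

# rle_encode(b"AAAAAABC") replaces the six A's with flag, "A", count 6.
```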

Half-byte encoding is a compression technique that results in the replacement of a sequence of characters that has the same half-byte bit composition, such as a string of numerics, by encoding two numerics into a byte. Then, the reduced byte sequence is prefixed by a compression-indicating character and a count byte.
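Half-byte encoding of a digit run can be sketched similarly; here the 0x1E flag byte is an arbitrary assumption, and only strings of ASCII decimal digits are packed two to a byte.

```python
# Half-byte encoding sketch for a run of decimal digits.
HALF_BYTE_FLAG = 0x1E  # hypothetical compression-indicating character

def pack_digits(digits: str) -> bytes:
    """Pack two decimal digits per byte (high nibble, low nibble),
    prefixed by a flag byte and a count of the original digits."""
    nibbles = [int(d) for d in digits]
    if len(nibbles) % 2:
        nibbles.append(0xF)  # pad an odd-length run with a filler nibble
    packed = bytearray([HALF_BYTE_FLAG, len(digits)])
    for hi, lo in zip(nibbles[::2], nibbles[1::2]):
        packed.append((hi << 4) | lo)
    return bytes(packed)
```

Four digits that would occupy four bytes thus occupy two data bytes plus the two-byte prefix, so the technique pays off only for longer digit runs.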

13.3.2.2 Statistical Compression

Statistical compression algorithms replace frequently occurring bytes by short codes and less frequently occurring bytes by longer codes. The net result of this replacement process is to reduce the average number of bits required to represent a character from 8 to a lesser number, typically between 4 and 5, resulting in a compression ratio of 2:1 or less. In replacing each fixed-length byte of 8 bits, the replacement process results in the use of a variable-length code. This variable-length code is constructed based on a statistical compression algorithm, with the Huffman coding technique being the most popular method employed. This technique results in the encoded compressed data bits having a prefix property that makes it instantaneously decodable. Thus, decompression performed according to the Huffman algorithm minimizes delays in comparison to the use of other compression methods. While this was an important consideration during the era of Intel 8088 microprocessors, the widespread availability of Intel Pentium and other modern high-performance microprocessors has reduced the decompression times associated with other compression techniques to the point where they do not adversely affect the throughput of data.
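A minimal Huffman code constructor (a sketch for illustration, not the coding used by any particular product) shows how frequently occurring bytes receive the short codes:

```python
import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict:
    """Build a Huffman code for the bytes in data, returning a map
    from byte value to its bit string."""
    freq = Counter(data)
    # Heap entries: (frequency, tie-breaker, tree), where a tree is
    # either a leaf byte or a (left, right) pair of subtrees.
    heap = [(n, i, b) for i, (b, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        n1, _, t1 = heapq.heappop(heap)  # two least-frequent trees
        n2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, tick, (t1, t2)))
        tick += 1
    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # single-symbol edge case
        return codes
    return walk(heap[0][2])
```

Because no code is the prefix of another, a decoder can emit each character the instant its final bit arrives, which is the prefix property noted above.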

13.3.2.3 Dictionary-Based Compression

Because the replacement of frequently occurring bytes by short codes and less frequently occurring bytes by longer codes results in a reduction in the average number of bits required to represent a character, it is logical to assume that greater efficiencies can be obtained by substituting codes for strings of characters. This logic resulted in the development of dictionary-based compression algorithms in which transmitted data is placed into a dictionary and the location of characters and strings of characters (pointers) is transmitted instead of the actual data.

The most popular dictionary-based compression algorithms date to the work of Jacob Ziv and Abraham Lempel at the Technion in Haifa, Israel, during the late 1970s. More modern versions of their algorithms are referred to as Lempel-Ziv or LZ-based and differ in the method used to flush the dictionary, the number of pointers and size of the dictionary, and how pointers are coded in the compressed output.

One popular version of LZ coding is more commonly known as Lempel-Ziv-Welch (LZW) and is the method used in the CCITT V.42bis modem compression standard.

The LZW algorithm initially considers the character set as 256 individual string table entries whose codes range from 0 to 255. Then, the algorithm operates on the next character in the input string of characters as follows:

  1. If the character is in the table, get the next character.

  2. If the character is not in the table, output the last known string's encoding and add the new string to the table.

In the LZW algorithm, characters from the input data source are read and used to progressively form larger and larger strings until a string is formed that is not in the dictionary. When this occurs, the last known string's encoding is output and the new string is added to the table. The beauty of this technique is that the compressor and expander know that the initial string table consists of the 256 characters in the character set. Because the algorithm uses a numeric code as a token to indicate the position of a string in the dictionary's string table, this technique minimizes the token length as the dictionary begins to fill. To illustrate the simplicity of the operation of the LZW algorithm, let us view an example of its use.

Let us call the string previously read the prefix of the output string. Similarly, the last byte read becomes the suffix, where:

new string = prefix + suffix

Once a new string is formed, the suffix becomes the prefix (prefix = suffix).

Initially, each character in the character set is assigned a code value equivalent to its character code. Thus, in ASCII, "a" would have the code value 97, "b" would have the code value 98, etc.

Now assume that the input string is "ababc." The first operation assumes that the prefix has a value of the null string, which we will indicate as the symbol "_." Thus, the first operation results in the addition of the null string to "a," which forms the new string "a." Because the "a" is in the dictionary, we do not output anything. However, because the suffix becomes the prefix, "a" now becomes the prefix for the next operation. This is illustrated in Table 13.1.

Table 13.1: Using LZW Compression for Encoding the String "ababc"

Prefix   Suffix   New String   Output
_        a        a            -
a        b        ab           97
b        a        ba           98
a        b        ab           -
ab       c        abc          256
c        -        -            99

Processing the next character in the input string (b) results in the addition of the prefix (a) and the suffix (b) to form the new string (ab). Because this represents a new string that is not in the dictionary, we follow rule 2. That is, we output the last known string's encoding, which is 97 for the character "a," and then add the new string to the dictionary or string table. Concerning the latter, because codes 0 through 255 represent the individual characters in an 8-bit character set, we can use the code 256 to represent the string "ab." However, doing so obviously requires the token to be extended; most LZW implementations use between 9 and 14 bits to represent a token, whose length corresponds to the size of the dictionary. Next, "b," which was the suffix when generating the string "ab," becomes the prefix for the next operation, as indicated in row 3 in Table 13.1. Because the next character in the string to be encoded is "a," it functions as the suffix in creating the new string "ba." As that string is not presently in the string table, the last known string (b) is output using its ASCII code value of 98. Next, the string "ba" is added to the string table using the next available code, which is 257.

Because "a" was the suffix in creating the string "ba," it now becomes the prefix for the next string operation, as illustrated in row 4 in Table 13.1. The fourth character in the input string (b) is processed as the suffix in forming a new string. This results in the string "ab," which was previously added to the string table. Because it is already in the string table, no output is generated and the string "ab" becomes the prefix for creating the next string.

Row 5 in Table 13.1 illustrates the creation of the next string. Here, the previously created string "ab" that was in the string table becomes the prefix for generating the next string, while the last character in the input data stream (c) becomes the suffix in forming the next string. The resulting string (abc) is not in the string table. Thus, the last known string "ab" has its code value output. Because "ab" was previously assigned the value 256, that value is output and the suffix (c) becomes the prefix for the creation of the next string. Because "c" was the last character in the input string, we simply output its code value of 99.
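The complete walk-through above can be reproduced with a short encoder sketch that seeds the string table with the 256 single-character strings and applies the two rules listed earlier; run on "ababc," it emits 97, 98, 256, 99, matching Table 13.1.

```python
# LZW encoder sketch matching the walk-through in the text.
def lzw_encode(data: str) -> list:
    table = {chr(i): i for i in range(256)}  # 256 single-character strings
    next_code = 256
    prefix = ""  # the null string "_" in the text
    output = []
    for suffix in data:
        candidate = prefix + suffix
        if candidate in table:
            # Rule 1: the string is in the table; keep extending it.
            prefix = candidate
        else:
            # Rule 2: output the last known string's encoding and add
            # the new string to the table.
            output.append(table[prefix])
            table[candidate] = next_code
            next_code += 1
            prefix = suffix
    if prefix:
        output.append(table[prefix])  # flush the final string
    return output

# lzw_encode("ababc") -> [97, 98, 256, 99]
```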

13.3.3 Performance Considerations

Although compression is standardized for use in modems, the same is not true for its use in remote bridges and routers. This means not only that compression-performing equipment from different vendors will, more likely than not, fail to interoperate but also that you may have to rely on vendor performance data. Unfortunately, some vendors have a tendency to place their equipment performance in the best possible light by transmitting files with a considerable degree of data redundancy and then claiming a high compression ratio, such as 4:1 or 5:1. If compression is an important equipment acquisition consideration, the author suggests that you create a set of files that are representative of your network traffic and have each vendor under consideration transfer those files and provide you with either their transmission time at a fixed WAN rate or their compression ratio.




Enhancing LAN Performance
ISBN: 0849319420
Year: 2003
Pages: 111
Authors: Gilbert Held