B.7 Lempel Ziv-Compression | Text Processing in Python

Probably the most significant lossless-compression technique is Lempel-Ziv. What is explained here is LZ78, but LZ77 and other variants work in a similar fashion. The idea in LZ78 is to encode a streaming byte sequence using a dynamic table. At the start of compressing a bit stream, the LZ table is filled with the actual symbol set, along with some blank slots. Various size tables are used, but for our (whitespace-compressed) telephone number example above, let's suppose that we use a 32-entry table (this should be OK for our example, although much too small for most other types of data). First thing, we fill the first ten slots with our alphabet (digits). As new bytes come in, we first output an existing entry that grabs the longest sequence possible, then fill the next available slot with the N+1 length sequence. In the worst case, we are using 5-bits instead of 4-bits for a single symbol, but we'll wind up getting to use 5-bits for multiple symbols in a lot of cases. For example, the machine might do this (a table slot is noted with square brackets):

 7 --> Lookup: 7 found      --> nothing to add   --> keep looking 7 --> Lookup: 77 not found --> add '77' to [11] --> output [7]=00111 2 --> Lookup: 72 not found --> add '72' to [12] --> output [7]=00111 7 --> Lookup: 27 not found --> add '27' to [13] --> output [2]=00010 6 --> Lookup: 76 not found --> add '76' to [14] --> output [7]=00111 2 --> Lookup: 62 not found --> add '62' to [15] --> output [6]=00110 8 --> Lookup: 28 not found --> add '28' to [16] --> output [2]=00010

So far, we've got nothing out of it, but let's continue with the next phone number:

 7 --> Lookup: 87 not found  --> add '87' to [17]  --> output [8]=00100 7 --> Lookup: 77 found      --> nothing to add    --> keep looking 2 --> Lookup: 772 not found --> add '772' to [18] --> output [11]=01011 8 --> Lookup: 28 found      --> nothing to add    --> keep looking 6 --> Lookup: 286 not found --> add '286' to [19] --> output [16]=10000 ...

The steps should suffice to see the pattern. We have not achieved any net compression yet, but notice that we've already managed to use slot 11 and slot 16, thereby getting two symbols with one output in each case. We've also accumulated the very useful byte sequence 772 in slot 18, which would prove useful later in the stream.

What LZ78 does is fill up one symbol table with (hopefully) helpful entries, then write it, clear it, and start a new one. In this regard, 32 entries is still probably too small a symbol table, since that will get cleared before a lot of reuse of 772 and the like is achieved. But the small symbol table is easy to illustrate.

In typical data sets, Lempel-Ziv variants achieve much better compression rates than Huffman or RLE. On the other hand, Lempel-Ziv variants are very pricey cycle-wise and can use large tables in memory. Most real-life compression tools and libraries use a combination of Lempel-Ziv and Huffman techniques.