B.8 Solving the Right Problem

Just as choosing the right algorithm can often create orders-of-magnitude improvements over even heavily optimized wrong algorithms, choosing the right data representation is often even more important than compression methods (which are always a sort of post hoc optimization of desired features). The simple data set example used in this appendix is a perfect case where reconceptualizing the problem would actually be a much better approach than using any of the compression techniques illustrated.

Think again about what our data represents. It is not a very general collection of data, and the rigid a priori constraints allow us to reformulate our whole problem. What we have is a maximum of 30,000 telephone numbers (7720000 through 7749999), some of which are active, and others of which are not. We do not have a "duty," as it were, to produce a full representation of each telephone number that is active, but simply to indicate the binary fact that it is active. Thinking of the problem this way, we can simply allocate 30,000 bits of memory and storage, and have each bit say "yes" or "no" to the presence of one telephone number. The ordering of the bits in the bit-array can be simple ascending order from the lowest to the highest telephone number in the range.

This bit-array solution is the best in almost every respect. It allocates exactly 3750 bytes to represent the data set; the various compression techniques will use a varying amount of storage depending both on the number of telephone numbers in the set and the efficiency of the compression. But if 10,000 of the 30,000 possible telephone numbers are active, and even a very efficient compression technique requires several bytes per telephone number, then the bit-array is an order-of-magnitude better. In terms of CPU demands, the bit-array is not only better than any of the discussed compression methods, it is also quite likely to be better than the naive noncompression method of listing all the numbers as strings. Stepping through a bit-array and incrementing a "current-telephone-number" counter can be done quite efficiently and mostly within the on-chip cache of a modern CPU.

The lesson to be learned from this very simple example is certainly not that every problem has some magic shortcut (like this one does). A lot of problems genuinely require significant memory, bandwidth, storage, and CPU resources, and in many of those cases compression techniques can help ease or shift those burdens. But a more moderate lesson could be suggested: Before compression techniques are employed, it is a good idea to make sure that one's starting conceptualization of the data representation is a good one.

Text Processing in Python
Text Processing in Python
ISBN: 0321112547
EAN: 2147483647
Year: 2005
Pages: 59
Authors: David Mertz

Similar book on Amazon

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net