8.1. Sorting
Doing expensive computations inside the block of a sort is inefficient. By default, the Perl interpreter now uses merge-sorting to implement sort[*], which means that every sort will call the sort block O(N log N) times. For example, suppose you needed to set up a collection of script files for binary-chop searching. In that case, you might need to sort a set of scripts by their SHA-512 digests. Doing that the obvious way is needlessly slow, because each script is likely to be re-SHA'd several times:
use Digest::SHA qw( sha512 );

# Sort by SHA-512 digest of scripts
@sorted_scripts = sort { sha512($a) cmp sha512($b) } @scripts;
A better solution is to cache the SHA values you've computed and avoid recalculating them. There are various standard techniques for doing that. The most straightforward is known as the Orcish Manoeuvre[*].
The sort block uses a hash (%sha512_of) to record each digest it computes. So, when comparing two scripts, it first checks whether it already knows their SHA-512 values ($sha512_of{$a} cmp $sha512_of{$b}). If either look-up fails to find a cached value for the digest of its key, it returns undef instead, in which case the ||= is evaluated. The sha512( ) call to its right then computes the appropriate value, which is assigned to the corresponding entry of the look-up table for future reference.

Note the use of a do block to limit the scope of the %sha512_of cache to the particular call to sort. If two or more sorts were likely to be applied to some of the same scripts, it would be more efficient to preserve the cache between sorts by moving the declaration of %sha512_of to some suitable outer scope, in which case the do block would no longer be required.

Alternatively, rather than building the cache on the fly, you could precompute all the digests and store them in a look-up table via a sliced assignment. The resulting code is much cleaner and just as efficient, unless there is a chance that the sort block might throw an exception and prematurely abort the sorting, in which case you'll have done some of that precomputation needlessly.

Another approach is to prebuild a list of digest/script pairs, sort on the digests, and then keep only the original scripts. This pipelined solution is known as the Schwartzian Transform. Note the special layout, with the three steps lined up under one another. This format is used because it emphasizes the characteristic map-sort-map structure of the transform, making it much easier to identify when the technique is being used.
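The Orcish Manoeuvre described above might be sketched like this (the hash name %sha512_of follows the text; the sample @scripts data is invented for illustration):

```perl
use strict;
use warnings;
use Digest::SHA qw( sha512 );

my @scripts = ( "print 'a';\n", "print 'b';\n", "print 'c';\n" );   # example data (assumed)

# Sort by SHA-512 digest, computing each digest at most once...
my @sorted_scripts = do {
    my %sha512_of;   # cache of digests, scoped to this one sort
    sort {
        ( $sha512_of{$a} ||= sha512($a) )   # use the cached digest, or compute and cache it
            cmp
        ( $sha512_of{$b} ||= sha512($b) )
    } @scripts;
};
```

The first time either script is seen, its cache entry is undef, so ||= computes and stores the digest; every later comparison reuses the stored value.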
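The precomputation via a sliced assignment might look like this (again, the sample data and variable names are illustrative):

```perl
use strict;
use warnings;
use Digest::SHA qw( sha512 );

my @scripts = ( "print 'a';\n", "print 'b';\n", "print 'c';\n" );   # example data (assumed)

# Precompute every digest up front, in a single sliced assignment...
my %sha512_of;
@sha512_of{@scripts} = map { sha512($_) } @scripts;

# ...then the sort block is reduced to two cheap hash look-ups
my @sorted_scripts = sort { $sha512_of{$a} cmp $sha512_of{$b} } @scripts;
```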
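And the Schwartzian Transform, laid out in its characteristic map-sort-map shape (sample data invented):

```perl
use strict;
use warnings;
use Digest::SHA qw( sha512 );

my @scripts = ( "print 'a';\n", "print 'b';\n", "print 'c';\n" );   # example data (assumed)

my @sorted_scripts
    = map  { $_->[1]             }   # 3. Keep only the original scripts
      sort { $a->[0] cmp $b->[0] }   # 2. Sort on the digests
      map  { [ sha512($_), $_ ]  }   # 1. Build digest/script pairs
      @scripts;
```

Each digest is computed exactly once, in the first (rightmost) map.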
Probably the cleanest and most maintainable solution (albeit slightly slower than direct caching or precomputation) is to memoize the sha512( ) subroutine itself:

use Memoize;
use Digest::SHA qw( sha512 );

memoize('sha512');

@sorted_scripts = sort { sha512($a) cmp sha512($b) } @scripts;

Memoizing a subroutine causes it to remember every value it ever returns and to immediately return that same value (without recomputing it) the next time the subroutine is called with the same arguments. The Memoize module is standard in Perl 5.8 and later, and available from the CPAN for earlier versions of Perl. It's a very clean way to introduce caching into your code without littering that code with the ugly mechanics of an explicit cache. See Chapter 19 for a more detailed discussion of caching and memoization.

All of the previous solutions have distinct performance characteristics and trade-offs, which are likely to vary depending on the size of the list to be sorted, the complexity of the function you're using to compare keys, and even the platform you're running on. In some cases, it might even be faster to just leave the (re-)computations in the sort block. So, if you decide to optimize your sort using one of these techniques, it's vitally important to benchmark the performance of whichever approach you decide to use. See Chapter 19 for more discussion of benchmarking.

The "Automating Sorts" guideline later in this chapter suggests another easy way to build optimized sorts.
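As a sketch of the kind of benchmarking just recommended, Perl's standard Benchmark module can compare two of the approaches directly. The data set and iteration count here are invented for illustration; real measurements should use your actual data:

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );
use Digest::SHA qw( sha512 );

my @scripts = map { "print $_;\n" } 1 .. 100;   # example data (assumed)

# Compare the naive recomputing sort against the Schwartzian Transform,
# running each 200 times and printing a rate comparison table
cmpthese( 200, {
    naive => sub {
        my @sorted = sort { sha512($a) cmp sha512($b) } @scripts;
    },
    schwartzian => sub {
        my @sorted = map  { $_->[1] }
                     sort { $a->[0] cmp $b->[0] }
                     map  { [ sha512($_), $_ ] }
                     @scripts;
    },
});
```

On most data sets the transform wins, but only a benchmark on your own scripts and platform can confirm that.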