Item 14: Learn the myriad ways of sorting. | Effective Perl Programming: Writing Better Programs with Perl

ASCIIbetical sorting

At its most basic, sorting in Perl is simplicity itself. Perl's sort operator takes a list of elements and returns a copy of that list in sorted order:

 @elements = sort qw(    hydrogen    helium    lithium  );

Yields

helium

hydrogen

lithium

The sort order is "ASCIIbetical," meaning that the items are sorted by comparing the ASCII values (well, really just the numeric values) of the first, second, third, et cetera characters of each element as necessary. ^[3] This leads to some interesting results:

^[3] The 8-bit aware reader may prefer to say "sorted characterwise" instead of "ASCIIbetically," but neither that nor "characterbetically" rolls off the tongue. I prefer the more colorful term .

print join ' ', sort 1 .. 10;	`1 10 2 3 4 5 6 7 8 9` , since `'10' lt '2'` .
print join ' ', sort qw(my Dog has Fleas);	`Dog Fleas has my` , since uppercase `lt` lowercase.

Hmm. If ASCIIbetical comparison isn't what you want, you need to write a sort subroutine.

Comparison (sort) subroutines

A Perl sort subroutine is not an entire sorting algorithm. A better name might be " comparison subroutine."

Sort subroutines are different from ordinary subroutines in that the arguments are passed in via the hard-coded variable names $a and $b rather than as elements of @_ . ^[4] $a and $b are localized for the sort subroutine, as if there were an implicit local($a, $b) at the beginning of the sort operation.

^[4] @_ may be present inside a sort subroutine, but if so it refers to the @_ from the most recently entered subroutine scopenot exactly what you would expect!

$a and $b get a "bye" from use strict vars they do not have to be declared explicitly (see Item 36). They belong to the current package, not ( necessarily ) the main package.

The sort subroutine will be called repeatedly during sorting. Its job is to compare $a and $b and return -1 , , or 1 depending upon whether $a sorts less than, equal to, or greater than $b . If this reminds you of C's qsort () , well, it should, since Perl typically uses the qsort() built into the C standard library.

Perl's built-in sorting behavior is as though the elements were compared ASCIIbetically with the cmp operator:

 sub ASCIIbetically { $a cmp $b }  @list = sort ASCIIbetically @list;

A sort subroutine.

Same as @list = sort @list .

Here we have used a named subroutine, ASCIIbetically , to specify the sorting order. To change the sorting order, change the way that $a and $b are compared. For example, replace the cmp operator with <=> (see Item 4) to sort numerically :

Sort numerically using the <=> (spaceship) operator.

sub numerically { $a <=> $b } @list = sort numerically (16, 1, 8, 2, 4, 32);	A numerical sort subroutine.
	`(1, 2, 4, 8, 16, 32)`

A more concise way of using a sort subroutine is to write it as a sort block . Just place the body of the subroutine right where the name of the sort subroutine would go:

Write sorts more concisely using sort blocks.

 @list = sort { $a <=> $b }    (16, 1, 8, 2, 4, 32);

A sort block.

(1, 2, 4, 8, 16, 32)

Here are some more examples:

 @list = sort { uc($a) cmp uc($b) }    qw(This is a test);

Sort ignoring case:

('a', 'is', 'test', 'This')

 @list = sort { $b cmp $a } @list;

Sort in reverse orderjust switch $a and $b .

 @list = sort { -M $a <=> -M $b }    @files

Sort by file mod timesee more efficient version later.

Don't modify $a and $b . They are aliased to the original elements, and modifying them will change the original values. Modifying them may also produce an inconsistent sorting order in addition to other unpleasant side effects, and inconsistent sorting orders have a way of producing core dumps or crashes in Perl. Here is an example of just such a bad thing:

Don't modify $a or $b inside a sort subroutine.

 @list = sort {    $a =~ tr/A-Z/a-z/;    $b =~ tr/A-Z/a-z/;    $a cmp $b  } qw(This is a test);

Modifies $a BAD.

Modifies $b BAD.

Something that comes up from time to time is the need to sort the keys of a hash according to their corresponding values. There is a neat idiom for doing this:

Use $hash{$a} and $hash{$b} to sort hash keys by their corresponding values.

 %elems = (B => 5, Be => 4,    H => 1, He => 2, Li => 3);

 @list = sort { $elems{$a} <=> $elems{$b} }    keys %elems;

This sorts keys by their values, numerically:

('H', 'He', 'Li', 'Be', 'B')

Finally, you may want to sort on multiple keys. There is a standard idiom using the or operator for this. Here's a slightly contrived example:

Use the or operator to sort on multiple keys.

 @first = qw(John Jane Bill Sue Carol);  @last = qw(Smith Smith Jones Jones Smith);

Some data to get us started.

 @index = sort {    $last[$a] cmp $last[$b] or    $first[$a] cmp $first[$b]  } 0 .. $#first;

Sort by

last name, then

first name.

 for (@index) {    print "$last[$_], $first[$_]\n";  }

Jones, Bill

Jones, Sue

Smith, Carol , etc.

Here we are actually sorting a list of indicesa fairly common thing to do. The part I want you to note is the use of the short-circuit Boolean operator or in the sort subroutine. Each time the sort subroutine is called, the part of the expression to the left of the or is evaluated first. If it evaluates to a non-zero "true" valuein this case, meaning that $a doesn't compare equal to $b then or returns that value and we're done. Otherwise Perl evaluates the comparison on the right-hand side and returns that value.

Note that this wouldn't work if Perl's or operator (or its higher-precedence cousin ) returned only 1 or . It's the fact that or returns the actual value computed on the left- or right-hand side that makes this work.

Advanced sorting, the mundane ways

Sometimes it takes a significant amount of processing to compare two keys. For example, let's sort on the third field of a password entry:

 open PASSWD, "/etc/passwd" or die;  @by_uid = sort {    (split /:/, $a)[2] <=>    (split /:/, $b)[2]  } <PASSWD>;

The third field contains the numeric user id, so remember to compare with <=> .

This seems okay at first glance. It does indeed sort the lines in the required order. However, it also performs the relatively complex split operation twice each time the sort subroutine is called.

A sort subroutine will typically be called many times for each key. Comparison-based sorting algorithms perform on the order of n log n comparisons per sort, where n is the number of elements sorted. In order to make your sorts run quickly, you have to make the comparisons run quickly. If your keys require significant transformation before comparison, this means that you will have to find a way to cache the result of the transformation.

In the case above, you could do something like this:

open PASSWD, "/etc/passwd" or die; @passwd = <PASSWD>;	Read lines into `@passwd` .
%key = map { $_, (split /:/)[2] } @passwd;	Create a hashkeys are entire lines from the password file; values are user ids.
@by_uid = sort { $key{$a} <=> $key{$b} } @passwd;	Now, sort the keys (lines) by values (user ids). No expensive `split` used here now.

Note the use of map to return a list that becomes the comparison hash. If you are map -averse, you can write it like this instead:

 for (@passwd) { $key{$_} = (split /:/)[2] }

As you look at this, you will realize that the keys of this hash are entire lines from the password file! If this is not to your liking, you can get the same effect by using array indices, but it requires a little more code and is a little harder to read:

open PASSWD, "/etc/passwd" or die; @passwd = <PASSWD>; @key = map { (split /:/)[2] } @passwd;	Create an array `@key` containing just the keys (user ids).
@by_uid_index = sort { $key[$a] <=> $key[$b] } 0 .. $#key;	Sort indices, using `@key` array.
@by_uid = @passwd[@by_uid_index];	Now, rearrange the contents of `@passwd` into sorted order.

Or just combine the last two statements:

 @by_uid = @passwd[sort { $key[$a] <=> $key[$b] } 0 .. $#key];

Well, this will work, and it's efficient enough, but there are prettier, more Perl-ish ways to present the solution.

Advanced sorting, the cool ways

Through design, or perhaps trial, or maybe just accident , Perl programmers have come up with some convenient idioms for implementing complex sorting transforms.

One of the things that is less than ideal in the examples above is the need for a separate statement to set up an array or hash of transformed keys. One way of getting around this is something I have nicknamed the " Orcish Maneuver." ^[5] It uses the little-known = operator. Let's revisit the example of sorting filenames by modification date that we saw a little earlier:

^[5] The "or-cache" ... Arrgh.

Use the Orcish Maneuver to cache key transformations.

Here's the old way:
@sorted = sort { -M $a <=> -M $b } @files;	Too many uses of `-M` .
Here it is using the Orcish Maneuver:
@sorted = sort { ($m{$a} = -M $a) <=> ($m{$b} = -M $b) } @files;	The `=` operator caches the values returned from `-M` in the hash `%m` as the sort takes place.

Wow! What the heck is going on here?

First of all, note that:

 $m{$a} = -M $a

has the same semantics as:

 $m{$a} = $m{$a}  -M $a

The first time that the sort subroutine encounters a particular filename $a , $m{$a} has the value undef , which is false, and thus the right-hand side of the , -M $a , has to be evaluated. Because this is Perl, not C, the operator returns the actual result of the right-hand side, not a simple 0 or 1 "true or false" value. This value now gets assigned to $m{$a} . Subsequent tests against the same filename will use the modification time value cached in $m{$a} .

The hash %m is a temporary variable and should be empty or undefined when this sort statement is encountered . You may want to wrap this line of code in braces and make %m a my variable, something like:

 ... { my %m; @sorted = sort ... };

The most concise all-around sorting technique, though, is the " Schwartzian Transform," named, of course, after Randal Schwartz. ^[6] A Schwartzian Transform is a sort bracketed by map s.

^[6] But not named by Randal Schwartzit's a long story.

Note

the Schwartzian Transform uses referencesif you aren't familiar with them, you may find it helpful to read Item 30 before continuing with this Item.

The best way to show a Schwartzian Transform is to build it up by pieces. Let's take the prior example of the modification time sorting and do it more efficiently . First, let's start with the filenames:

 @names = <*>;

And now, let's turn the names into a same-length list of two-element anonymous lists:

 @names_and_ages = map { [$_, -M] } @names;

Each element is now a reference to a two-element lista "tuple." The first element of each tuple is the original name (from $_ ), while the second element is the modification age in days (from -M , implied $_ argument).

In the next step, we sort this list of references using a sort with a sort block:

 @sorted_names_and_ages = sort {    $a->[1] <=> $b->[1]  } @names_and_ages;

Within the sort block, $a and $b represent elements of the array @names_and_ages . Thus, $a and $b are array references, and $a->[1] represents the second element of a selected tuple, containing the age in days. The net result is that the tuples will be sorted numerically (note the spaceship <=> operator) by ascending ages.

That gets us most of the way therenow all we need is to extract the original names from each tuple. Simple enough, one more map :

 @sorted_names = map { $_->[0] } @sorted_names_and_ages;

And that's it. But that's much too wordy for the seasoned Perl hacker, so here it is all put together as a Schwartzian Transform:

Use the Schwartzian Transform for sorts with expensive key transformations.

 @sorted_names =    map { $_->[0] }    sort { $a->[1] <=> $b->[1] }    map { [$_, -M] }    @files;

4. Extract original names.

3. Sort [ name, key ] tuples.

2. Create [ name, key ] tuples.

1. The input data.

Read this from bottom to top and you'll see that it does the same thing as the individual statements abovebut now all the steps are strung together.

Simple sorts involving a single key and transformation can use the above pattern, changing only the rightmost map and, if necessary, the comparison operator. Here's the password file sorted by the third field using a Schwartzian Transform:

 open PASSWD, "/etc/passwd" or die;

Open up the file.

 @by_uid =    map { $_->[0] }    sort { $a->[1] <=> $b->[1] }    map { [$_, (split /:/)[2]] }    <PASSWD>;

Use <=> to compare ids.

Create [ line, uid ] tuples.

Note how much more concise this is. We got rid of the hash %key , used to temporarily store the transformed keys. We were also able to eliminate the @passwd array and take the input to the Schwartzian Transform directly from the filehandle. The best part about the Schwartzian Transform, though, is that it tends to be the fastest way to perform complicated sorts like these. Fast and concisenow that's a nice combination!