Section 13.4. Advanced Sorting | Learning Perl, 5th Edition

13.4. Advanced Sorting

In Chapter 3, we showed you could sort a list in ascending ASCIIbetical order by using the built-in sort operator. What if you want a numeric sort or a case-insensitive sort? Maybe you want to sort items according to information stored in a hash. Well, Perl lets you sort a list in whatever order you need; you'll see all of those examples by the end of the chapter.

You tell Perl what order you want by making a sort-definition subroutine, or sort subroutine for short. When you hear the term "sort subroutine," if you've been through any computer science courses, visions of bubble sort, shell sort, and quick sort race through your head, and you say, "No, never again!" Don't worry because it's not that bad. In fact, it's simple. Perl knows how to sort a list of items, but it doesn't know which order you want. So the sort subroutine tells it the order.

Why is this necessary? Well, if you think about it, sorting is putting a bunch of things in order by comparing them all. Since you can't compare them all at once, you need to compare two at a time, eventually using what you find out about each pair's order to put the whole kit and caboodle in line. Perl understands all of those steps except for the part about how you'd like to compare the items, so that's all you have to write.

This means that the sort subroutine doesn't need to sort many items after all. It merely has to be able to compare two items. If it can put two items in the proper order, Perl will be able to tell (by repeatedly consulting the sort subroutine) what order you want for your data.

The sort subroutine is defined like an ordinary subroutine (well, almost). This routine will be called repeatedly, each time checking on a pair of elements from the list you're sorting.

Now, if you were writing a subroutine that's expecting to get two parameters that need sorting, you might write something like this to start:

     sub any_sort_sub {  # It doesn't really work this way       my($a, $b) = @_;  # Get and name the two parameters       # start comparing $a and $b here       ...     }

But the sort subroutine will be called repeatedly, often hundreds or thousands of times. Declaring the variables $a and $b and assigning them values at the top of the subroutine will take just a little time, but multiply that by the thousands of times the routine will be called, and you can see it will contribute significantly to the overall execution speed.

We don't do it like that. (In fact, if you did it that way, it wouldn't work.) Instead, it is as if Perl has done this for us, before our subroutine's code has started. You'll write a sort subroutine without that first line; $a and $b have been assigned for you. When the sort subroutine starts running, $a and $b are two elements from the original list.

The sort subroutine returns a coded value describing how the elements compare (like C's qsort(3) does, but it's Perl's own internal sort implementation). If $a should appear before $b in the final list, the sort subroutine returns -1 to say so. If $b should appear before $a, it returns 1.

If the order of $a and $b doesn't matter, the subroutine returns 0. Why would it not matter? Perhaps you're doing a case-insensitive sort and the two strings are fred and Fred. Perhaps you're doing a numeric sort, and the two numbers are equal.

We could write a numeric sort subroutine like this:

     sub by_number {       # a sort subroutine, expect $a and $b       if ($a < $b) { -1 } elsif ($a > $b) { 1 } else { 0 }     }

To use the sort subroutine, put its name (without an ampersand) between the keyword sort and the list you're sorting. This example puts a numerically sorted list of numbers into @result:

     my @result = sort by_number @some_numbers;

We called the subroutine by_number because that describes how it's sorting. But more importantly, you can read the line of code that uses it with sort as saying "sort by number," as you would in English. Many sort subroutine names begin with by_ to describe how they sort. We could have called this one numerically for a similar reason, but that's more typing and more chance to mess something up.

We don't have to do anything in the sort subroutine to declare $a and $b, and to set their values. If we did, the subroutine wouldn't work right. We let Perl set up $a and $b for us, and all we need to write is the comparison.

In fact, we can make it simpler and more efficient. Since this kind of three-way comparison is frequent, Perl has a convenient shortcut to use to write it. In this case, we use the spaceship operator (<=>).^[*] This operator compares two numbers and returns -1, 0, or 1 as needed to sort them numerically. So, we could have written that sort subroutine better like this:

^[*] We call it that because it looks like one of the Tie-fighters from Star Wars. Well, it looks like that to us.

     sub by_number { $a <=> $b }

Since the spaceship compares numbers, you may have guessed there's a corresponding three-way string comparison operator: cmp. These two are easy to remember and keep straight. The spaceship has a family resemblance to the numeric comparison operators like >=, but it's three characters long instead of two because it has three possible return values instead of two. And cmp has a family resemblance to the string comparison operators like ge, but it's three characters long instead of two because it has three possible return values instead of two.^[] Of course, cmp provides the same order as the default sort. Youll never need to write this subroutine, which yields merely the default sort order:^[]

] This is no accident. Larry does things like this on purpose to make Perl easier to learn and remember. Hes a linguist at heart, so he's studied how people think of languages.

^[] Youd never need to write this unless you were writing an introductory Perl book and needed it for an example.

     sub ASCIIbetically { $a cmp $b } my @strings = sort ASCIIbetically @any_strings;

But you can use cmp to build a more complex sort order, like a case-insensitive sort:

     sub case_insensitive { "\L$a" cmp "\L$b" }

In this case, we're comparing the string from $a (forced to lowercase) against the string from $b (forced to lowercase), giving a case-insensitive sort order.

We're not modifying the elements themselves; we're using their values. That's important: for efficiency reasons, $a and $b aren't copies of the data items. They're new, temporary aliases for elements of the original list, so if we changed them we'd be mangling the original data. Don't do thatit's neither supported nor recommended.

When your sort subroutine is as simple as the ones we show here (most of the time, it is), you can make the code simpler by replacing the name of the sort routine with the entire sort routine "in line," like this:

     my @numbers = sort { $a <=> $b } @some_numbers;

In modern Perl, you'll hardly ever see a separate sort subroutine, but you'll frequently find sort routines written inline as we've done here.

Suppose you want to sort in descending numeric order. That's easy enough to do with the help of reverse:

     my @descending = reverse sort { $a <=> $b } @some_numbers;

But here's a neat trick. The comparison operators (<=> and cmp) are nearsighted in that they can't see which operand is $a and which is $b. They can only see which value is on the left and which is on the right. If $a and $b were to swap places, the comparison operator would get the results backward every time. That means this is another way to get a reversed numeric sort:

     my @descending = sort { $b <=> $a } @some_numbers;

With a little practice, you can read this at a glance. It's a descending-order comparison (because $b comes before $a, which is descending order), and it's a numeric comparison (because it uses the spaceship instead of cmp). So, it's sorting numbers in reverse order. In modern Perl versions it doesn't matter much which one of those you do, because reverse is recognized as a modifier to sort, and special shortcuts are taken to avoid sorting it one way just to have to turn it around the other way.

13.4.1. Sorting a Hash by Value

Once you've been sorting lists for a while, you'll run into a situation where you want to sort a hash by value. For example, three of our characters went out bowling last night, and we've got their bowling scores in the following hash. You want to be able to print out the list in the proper order, with the game winner at the top, so we want to sort the hash by score:

     my %score = ("barney" => 195, "fred" => 205, "dino" => 30);     my @winners = sort by_score keys %score;

You aren't really going to be able to sort the hash by score; that's just a verbal shortcut. You can't sort a hash. But when we've used sort with hashes before, we sorted the keys of the hash (in ASCIIbetical order). Now you're going to sort the keys of the hash, but the order will be defined by their corresponding values from the hash. In this case, the result should be a list of our three characters' names in order, according to their bowling scores.

Writing this sort subroutine is fairly easy. What we want is to use a numeric comparison on the scores rather than the names. Instead of comparing $a and $b (the players' names), we want to compare $score{$a} and $score{$b} (their scores). If you think of it that way, it almost writes itself, as in:

      sub by_score { $score{$b} <=> $score{$a} }

Let's step through this and see how it works. Imagine the first time it's called, Perl has set $a to barney and $b to fred. The comparison is $score{"fred"} <=> $score{"barney"}, which (as we can see by consulting the hash) is 205 <=> 195. The spaceship is nearsighted, so when it sees 205 before 195, it says: "No, that's not the right numeric order; $b should come before $a." It tells Perl that fred should come before barney.

Maybe the next time the routine is called, $a is barney again and $b has become dino. The nearsighted numeric comparison sees 30 <=> 195, so it reports that they're in the right order: $a does indeed sort in front of $b. That is, barney comes before dino. At this point, Perl has enough information to put the list in order: fred is the winner, and barney is in second place, and dino is in third place.

Why did the comparison use the $score{$b} before the $score{$a} instead of the other way around? That's because we want bowling scores arranged in descending order, from the highest score of the winner down. You can (after a little practice) read this one at sight as well: $score{$b} <=> $score{$a} means to sort according to the scores in reversed numeric order.

13.4.2. Sorting by Multiple Keys

We forgot to mention the fourth player bowling last night with the other three, so the hash looked like this:

     my %score = (       "barney" => 195, "fred" => 205,       "dino" => 30, "bamm-bamm" => 195,     );

bamm-bamm has the same score as barney. Which one will be first in the sorted list of players? There's no telling because the comparison operator (seeing the same score on both sides) will have to return zero when checking those two.

Maybe that doesn't matter, but we prefer to have a well-defined sort. If several players have the same score, we want them to be together in the list, but within that group, the names should be in ASCIIbetical order. How can we write the sort subroutine to say that? Again, this turns out to be easy:

     my @winners = sort by_score_and_name keys %score;           sub by_score_and_name {       $score{$b} <=> $score{$a}  # by descending numeric score         or       $a cmp $b                  # ASCIIbetically by name     }

How does this work? If the spaceship sees two different scores, that's the comparison we want to use. It returns -1 or 1, a true value, so the low-precedence short-circuit or will mean that the rest of the expression will be skipped, and the comparison we want is returned. (Remember, the short-circuit or returns the last expression evaluated.) If the spaceship sees two identical scores, it returns 0, a false value, and the cmp operator gets its turn at bat, returning an appropriate ordering value considering the keys as strings. That is, if the scores are the same, the string-order comparison breaks the tie.

We know that when we use the by_score_and_name sort subroutine like this, it will never return 0. (Do you see why it won't? The answer is in the footnote.^[*]) We know the sort order is always well-defined; that is, we know the result today will be the same as the result with the same data tomorrow.

^[*] The only way it could return 0 would be if the two strings were identical, and (since the strings are keys of a hash) we know they're different. If you passed a list with duplicate (identical) strings to sort, it would return 0 when comparing those, but we're passing a list of hash keys.

Your sort subroutine does not have to be limited to two levels of sorting. Here the Bedrock library program puts a list of patron ID numbers in order according to a five-level sort.^[] This example sorts according to the amount of each patrons outstanding fines (as calculated by a subroutine &fines, not shown here), the number of items they currently have checked out (from %items), their name (in order by family name, then by personal name, both from hashes), and finally by the patron's ID number, in case everything else is the same:

^[] Its not unusual in the modern world to need a five-level sort like this, although it was quite infrequent in prehistoric times.

     @patron_IDs = sort {       &fines($b) <=> &fines($a) or       $items{$b} <=> $items{$a} or       $family_name{$a} cmp $family_name{$a} or       $personal_name{$a} cmp $family_name{$b} or       $a <=> $b     } @patron_IDs;