Section 6.1. What Is a Hash? | Learning Perl, 5th Edition

6.1. What Is a Hash?

A hash is a data structure like an array, in that it can hold any number of values and retrieve these values at will. However, instead of indexing the values by number, as we did with arrays, we'll look up the values by name. That is, the indices (here, we'll call them keys) aren't numbers but are arbitrary unique strings (see Figure 6-1).

The keys are strings, first of all, so instead of getting element number 3 from an array, we'll be accessing the hash element named wilma.

These keys are arbitrary stringsyou can use any string expression for a hash key. And they are unique strings; as there's only one array element numbered 3, there's only one hash element named wilma.

Another way to think of a hash is that it's like a barrel of data (see Figure 6-2), where each piece of data has a tag attached. You can reach into the barrel and pull out any tag and see what piece of data is attached. But there's no "first" item in the barrel; it's just a jumble. In an array, we'd start with element 0 and then element 1, element 2, and so on. But in a hash, there's no fixed order, no first element. It's just a collection of key/value pairs.

Figure 6-1. Hash keys and values

Figure 6-2. A hash as a barrel of data

The keys and values are both arbitrary scalars, but the keys are always converted to strings. So, if you used the numeric expression 50/20 as the key,^[*] it would be turned into the three-character string "2.5", which is one of the keys shown in the diagram above.

^[*] That's a numeric expression, not the five-character string "50/20". If we used that five-character string as a hash key, it would stay the same five-character string.

As usual, Perl's no-unnecessary-limits philosophy applies: a hash may be of any size, from an empty hash with zero key/value pairs, up to whatever fills up your memory.

Some implementations of hashes (such as in the original awk language, from where Larry borrowed the idea) slow down as the hashes get larger. This is not the case in Perlit has a good, efficient, scalable algorithm.^[*] So, if a hash has only three key/value pairs, it's quick to "reach into the barrel" and pull out any one of those. If the hash has 3 million key/value pairs, it should be about as quick to pull out any one of those. A big hash is nothing to fear.

^[*] Technically, Perl rebuilds the hash table as needed for larger hashes. The term "hashes" comes from the fact that a hash table is used for implementing them.

Keys are unique, though the values can be duplicated. The values of a hash may be all numbers, strings, undef values, or a mixture,^[] but the keys are arbitrary, unique strings.

] Or, in fact, any scalar values, including scalar types other than the ones well see in this book.

6.1.1. Why Use a Hash?

When you first hear about hashes, especially if you've lived a long and productive life as a programmer using languages that don't have hashes, you may wonder why anyone would want one of these strange beasts. Well, the general idea is that you'll have one set of data "related to" another set of data. For example, here are some hashes you might find in typical applications of Perl:

Given name, family name

The given name (first name) is the key, and the family name is the value. This requires unique given names; if two people were named randal, this wouldn't work. With this hash, you can look up anyone's given name, and find the corresponding family name. If you use the key tom, you get the value phoenix.

Hostname, IP address

You may know that each computer on the Internet has a hostname (such as www.stonehenge.com) and an IP address number (such as 123.45.67.89) because machines like working with the numbers, but we humans have an easier time remembering the names. The hostnames are unique strings, so they can be used to make this hash. With this hash, you could look up a hostname and find the corresponding IP address.

IP address, hostname

Or you could go in the opposite direction. We generally think of the IP address as a number, but it's also a unique string, so it's suitable for use as a hash key. In this hash, we can use the IP address to look up the corresponding hostname. This is not the same hash as the previous example: hashes are a one-way street, running from key to value; there's no way to look up a value in a hash and find the corresponding key. So, these two are a pair of hashes, one for storing IP addresses, one for hostnames. It's easy enough to create one of these given the other, though, as we'll see below.

Word, count of number of times that word appears

This is a common use of a hash. It's so common, in fact, that it just might turn up in the exercises at the end of the chapter.

The idea here is that you want to know how often each word appears in a given document. Perhaps you're building an index to a number of documents, so when a user searches for fred, you'll know that a certain document mentions fred five times, another mentions fred seven times, and another doesn't mention fred at all. The index shows which documents the user is likely to want. As the index-making program reads through a given document, each time it sees a mention of fred, it adds one to the value filed under the key of fred. That is, if we had seen fred twice already in this document, the value would be 2, but now we'll increment it to 3. If we had never seen fred before, we'd change the value from undef (the implicit, default value) to 1.

Username, number of disk blocks they are using [wasting]

System administrators like this one. The usernames on a given system are unique strings, so they can be used as keys in a hash to look up information about that user.

Driver's license number, name

There may be many people named John Smith, but we hope each one has a different driver's license number. That number makes for a unique key, and the person's name is the value.

Another way to think of a hash is as a simple database, in which one piece of data may be filed under each key. If your task description includes phrases like "finding duplicates," "unique," "cross-reference," or "lookup table," it's likely that a hash will be useful in the implementation.