A Supermarket Price Lookup System

Introduction

In this chapter we will use a supermarket price lookup system to illustrate how to save storage by using a restricted character set and how to speed up access to records by employing hash coding (or "scatter storage") and caching (or keeping copies of recently accessed records in memory). We will look items up by their UPC (Universal Product Code), which is printed in the form of a "bar code" on virtually all supermarket items other than fresh produce. We will emphasize rapid retrieval of prices, as maintenance of such files is usually done after hours, when speed would be less significant.

Algorithms Discussed

Algorithms discussed: Hash Coding, Radix40 Data Representation, BCD Data Representation, Caching

Up the Down Staircase

To begin, let us assume that we can describe each item by the information in the structure definition in Figure initstruct.

Item information (Figure initstruct)

typedef struct { 

char upc[10];

char description[21];

float price;

} ItemRecord;

One solution to our price-retrieval problem would be to create a file with one record for each item, sorted by UPC code. This would allow us to use a binary search to locate the price of a particular item. How long would it take to find a record in such a file containing 10,000 items?

To answer this question, we have to analyze the algorithm of the binary search in some detail. We start the search by looking at the middle record in the file. If the key we are looking for is greater than the key of the middle record, we know that the record we are looking for must be in the second half of the file (if it is in the file at all). Likewise, if our key is less than the one in the middle record, the record we are looking for must be in the first half of the file (again, if it is there at all). Once we have decided which half of the file to examine next, we look at the middle record in that half and proceed exactly as we did previously. Eventually, either we will find the record we are looking for or we will discover that we can no longer divide the segment we are looking at, as it has only one record (in which case the record we are looking for is not there).

Probably the easiest way to figure out the average number of accesses that would be required to find a record in the file is to start from the other end of the problem: how many records could be found with one access? Obviously, only the middle record. With another access, we could find either the record in the middle of the first half of the file or the record in the middle of the second half. The next access adds another four records, in the centers of the first, second, third, and fourth quarters of the file. In other words, each added access doubles the number of added records that we can find.

Binary search statistics (Figure binary.search)

Number of           Number of        Total accesses        Total     

accesses newly accessible to find all records

records records accessible

1 x 1 1 1

2 x 2 4 3

3 x 4 12 7

4 x 8 32 15

5 x 16 80 31

6 x 32 192 63

7 x 64 448 127

8 x 128 1024 255

9 x 256 2304 511

10 x 512 5120 1023

11 x 1024 11264 2047

12 x 2048 24576 4095

13 x 4096 53248 8191

14 x 1809 25326 10000

______ ______

10000 123631

Average number of accesses per record = 12.3631 accesses/record

Figure binary.search shows the calculation of the average number of accesses for a 10,000 item file. Notice that each line represents twice the number of records as the one above, with the exception of line 14. The entry for that line (1809) is the number of 14-access records needed to reach the capacity of our 10,000 record file. As you can see, the average number of accesses is approximately 12.4 per record. Therefore, at a typical hard disk speed of 10 milliseconds per access, we would need almost 125 milliseconds to look up an average record using a binary search. While this lookup time might not seem excessive, remember that a number of checkout terminals would probably be attempting to access the database at the same time, and the waiting time could become noticeable. We might also be concerned about the amount of wear on the disk mechanism that would result from this approach.

Some Random Musings

Before we try to optimize our search, let us define some terms. There are two basic categories of storage devices, distinguished by the access they allow to individual records. The first type is sequential access;1 in order to read record 1000 from a sequential device, we must read records 1 through 999 first, or at least skip over them. The second type is direct access; on a direct access device, we can read record 1000 without going past all of the previous records. However, only some direct access devices allow nonsequential accesses without a significant time penalty; these are called random access devices. Unfortunately, disk drives are direct access devices, but not random access ones. The amount of time it takes to get to a particular data record depends on how close the read/write head is to the desired position; in fact, sequential reading of data may be more than ten times as fast as random access.

Is there a way to find a record in a large file with an average of about one nonsequential access? Yes; in fact, there are several such methods, varying in complexity. They are all variations on hash coding, or address calculation; as you will see, such methods actually can be implemented quite simply, although for some reason they have acquired a reputation for mystery.

Hashing It Out

Let's start by considering a linear or sequential search. That is, we start at the beginning of the file and read each record in the file until we find the one we want (because its key is the same as the key we are looking for). If we get to the end of the file without finding a record with the key we are looking for, the record isn't in the file. This is certainly a simple method, and indeed is perfectly acceptable for a very small file, but it has one major drawback: the average time it takes to find a given record increases every time we add another record. If the file gets twice as big, it takes twice as long to find a record, on the average. So this seems useless.

Divide and Conquer

But what if, instead of having one big file, we had many little files, each with only a few records in it? Of course, we would need to know which of the little files to look in, or we wouldn't have gained anything. Is there any way to know that?

Let's see if we can find a way. Suppose that we have 1000 records to search through, keyed by telephone number. To speed up the lookup, we have divided the records into 100 subfiles, averaging 10 numbers each. We can use the last two digits of the telephone number to decide which subfile to look in (or to put a new record in), and then we have to search through only the records in that subfile. If we get to the end of the subfile without finding the record we are looking for, it's not in the file. That's the basic idea of hash coding.

But why did we use the last two digits, rather than the first two? Because they will probably be more evenly distributed than the first two digits. Most of the telephone numbers on your list probably fall within a few telephone exchanges near where you live (or work). For example, suppose my local telephone book contained a lot of 758 and 985 numbers and very few numbers from other exchanges. Therefore, if I were to use the first two digits for this hash coding scheme, I would end up with two big subfiles (numbers 75 and 98) and 98 smaller ones, thus negating most of the benefit of dividing the file. You see, even though the average subfile size would still be 10, about 90% of the records would be in the two big subfiles, which would have perhaps 450 records each. Therefore, the average search for 90% of the records would require reading 225 records, rather than the five we were planning on. That is why it is so important to get a reasonably even distribution of the data records in a hash-coded file.

Unite and Rule

It is inconvenient to have 100 little files lying around, and the time required to open and close each one as we need it makes this implementation inefficient. But there's no reason we couldn't combine all of these little files into one big one and use the hash code to tell us where we should start looking in the big file. That is, if we have a capacity of 1000 records, we could use the last two digits of the telephone number to tell us which "subfile" we need of the 100 "subfiles" in the big file (records 0-9, 10-19 ... 980-989, 990-999). To help visualize this, let's look at a smaller example: 10 subfiles having a capacity of four telephone numbers each and a hash code consisting of just the last digit of the telephone number (Figure initfile).

Hashing with subfiles, initialized file (Figure initfile)

Subfile #                         +-----------+-----------+-----------+-----------+                 0       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 1       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 2       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 3       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 4       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 5       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 6       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 7       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 8       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 9       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+ 

In order to use a big file rather than a number of small ones, we have to make some changes to our algorithm. When using many small files, we had the end-of-file indicator to tell us where to add records and where to stop looking for a record; with one big file subdivided into small subfiles, we have to find another way to handle these tasks.

Knowing When to Stop

One way is to add a "valid-data" flag to every entry in the file, which is initialized to "I" (for invalid) in the entries in Figure initfile, and set each entry to "valid" (indicated by a "V" in that same position) as we store data in it. Then if we get to an invalid record while looking up a record in the file, we know that we are at the end of the subfile and therefore the record is not in the file (Figure distinct).

Hashing with distinct subfiles (Figure distinct)

Subfile #                         +-----------+-----------+-----------+-----------+                 0       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 1       | V 9876541 | V 2323231 | V 9898981 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 2       | V 2345432 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 3       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 4       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 5       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 6       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 7       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 8       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 9       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+ 

For example, if we are looking for the number "9898981", we start at the beginning of subfile 1 in Figure distinct (because the number ends in 1), and examine each record from there on. The first two entries have the numbers "9876541" and "2323231", which don't match, so we continue with the third one, which is the one we are looking for. But what if we were looking for "9898971"? Then we would go through the first three entries without finding a match. The fourth entry is "I 0000000", which is an invalid entry. This is the marker for the end of this subfile, so we know the number we are looking for isn't in the file.

Now let's add another record to the file with the phone number "1212121". As before, we start at the beginning of subfile 1, since the number ends in 1. Although the first three records are already in use, the fourth (and last) record in the subfile is available, so we store our new record there, resulting in the situation in Figure merged.

Hashing with merged subfiles (Figure merged)

Subfile #                         +-----------+-----------+-----------+-----------+                 0       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 1       | V 9876541 | V 2323231 | V 9898981 | V 1212121 |                         +-----------+-----------+-----------+-----------+                 2       | V 2345432 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 3       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 4       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 5       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 6       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 7       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 8       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 9       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+ 

However, what happens if we look for "9898971" in the above situation? We start out the same way, looking at records with phone numbers "9876541", "2323231", "9898981", and "1212121". But we haven't gotten to an invalid record yet. Can we stop before we get to the first record of the next subfile?

Handling Subfile Overflow

To answer that question, we have to see what would happen if we had added another record that belonged in subfile 1. There are a number of possible ways to handle this situation, but most of them are appropriate only for memory-resident files. As I mentioned above, reading the next record in a disk file is much faster than reading a record at a different place in the file. Therefore, for disk-based data, the most efficient place to put "overflow" records is in the next open place in the file, which means that adding a record with phone number "1234321" to this file would result in the arrangement in Figure overflow.2

Hashing with overflow between subfiles (Figure overflow)

Subfile #                         +-----------+-----------+-----------+-----------+                 0       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 1       | V 9876541 | V 2323231 | V 9898981 | V 1212121 |                         +-----------+-----------+-----------+-----------+                 2       | V 2345432 | V 1234321 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 3       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 4       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 5       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 6       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 7       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 8       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+                 9       | I 0000000 | I 0000000 | I 0000000 | I 0000000 |                         +-----------+-----------+-----------+-----------+ 

So the answer is that we can't stop before the beginning of the next subfile, because the record we are looking for might have overflowed to the next subfile, as "1234321" just did. Where can we stop? Well, we know that the record we are looking for is somewhere between the beginning of the subfile it belongs in and the next invalid record, since we go to the next subfile only when the one we are trying to use is filled up. Therefore, if we get to an invalid record, we know that the record could not be stored later in the file, since the current subfile is not filled up yet; this means that we can stop looking when we get to the first invalid record.

Some Drawbacks of Hashing

This points out one of the drawbacks of standard disk-based hashing (SDBH). We cannot delete a record from the file simply by setting the invalid flag in that record; if we did, any record which overflowed past the one we deleted would become inaccessible, as we would stop looking when we got to the invalid record. For this reason, invalid records must be only those that have never been used and therefore can serve as "end-of-subfile" markers.

While we're on the topic of drawbacks of SDBH, we ought to note that as the file gets filled up, the maximum length of a search increases, especially an unsuccessful search. That is because adding a record to the end of a subfile that has only one invalid entry left results in "merging" that subfile and the next one, so that searches for entries in the first subfile have to continue to the next one. In our example, when we added "1212121" to the file, the maximum length of a search for an entry in subfile 1 increased from four to seven, even though we had added only one record. With a reasonably even distribution of hash codes (which we don't have in our example), this problem is usually not serious until the file gets to be about 80% full.

While this problem would be alleviated by increasing the capacity of the file as items are added, unfortunately SDBH does not allow such incremental expansion, since there would be no way to use the extra space; the subfile starting positions can't be changed after we start writing records, or we won't be able to find records we have already stored. Of course, one way to overcome this problem is to create a new file with larger (or more) subfiles, read each record from the old file and write it to the new one. What we will do in our example is simply to make the file 25% bigger than needed to contain the records we are planning to store, which means the file won't get more than 80% full.

Another problem with SDBH methods is that they are not well suited to storage of variable-length records; the address calculation used to find a "slot" for a record relies on the fact that the records are of fixed length.

Finally, there's the ever-present problem of coming up with a good hash code. Unfortunately, unless we know the characteristics of the input data precisely, it's theoretically possible that all of the records will generate the same hash code, thus negating all of the performance advantages of hashing. Although this is unlikely, it makes SDBH an inappropriate algorithm for use in situations where a maximum time limit on access must be maintained.

When considering these less-than-ideal characteristics, we should remember that other search methods have their own disadvantages, particularly in speed of lookup. All in all, the disadvantages of such a simply implemented access method seem rather minor in comparison with its benefits, at least for the current application. In addition, recent innovations in hashing have made it possible to improve the flexibility of SDBH methods by fairly simple changes. For example, the use of "last-come, first-served" hashing, which stores a newly added record exactly where the hash code indicates, greatly reduces the maximum time needed to find any record; it also makes it possible to determine for any given file what that maximum is, thus removing one of the barriers to using SDBH in a time-critical application.3 Even more recently, a spectacular advance in the state of the art has made it possible to increase the file capacity incrementally, as well as to delete records efficiently. Chapter dynhash.htm provides a C++ implementation of this remarkable innovation in hashing.

Caching out Our Winnings

Of course, even one random disk access takes a significant amount of time, from the computer's point of view. Wouldn't it be better to avoid accessing the disk at all?

While that would result in the fastest possible lookup, it would require us to have the entire database in memory, which is usually not feasible. However, if we had some of the database in memory, perhaps we could eliminate some of the disk accesses.

A cache is a portion of a large database that we want to access, kept in a type of storage that can be accessed more rapidly than the type used to store the whole database. For example, if your system has an optical disk which is considerably slower than its hard disk, a portion of the database contained on the optical disk may be kept on the hard disk. Why don't we just use the hard disk for the whole database? The optical disk may be cheaper per megabyte or may have more capacity, or the removability and long projected life span of the optical diskettes may be the major reason. Of course, our example of a cache uses memory to hold a portion of the database that is stored on the hard disk, but the principle is the same: memory is more expensive than disk storage, and with certain computers, it may not be possible to install enough memory to hold a copy of the entire database even if price were no object.

If a few items account for most of the transactions, the use of a cache speeds up access to those items, since they are likely to be among the most recently used records. This means that if we keep a number of the most recently accessed records in memory, we can reduce the number of disk accesses significantly. However, we have to have a way to locate items in the cache quickly: this problem is very similar to a hash-coded file lookup, except that we have more freedom in deciding how to handle overflows, where the entry we wish to use is already being used by another item. Since the cache is only a copy of data that is also stored elsewhere, we don't have to be so concerned about what happens to overflowing entries; we can discard them if we wish to, since we can always get fresh copies from the disk.

The simplest caching lookup algorithm is a direct-mapped cache. That means each key corresponds to one and only one entry in the cache. In other words, overflow is handled by overwriting the previous record in that space with the new entry. The trouble with this method is that if two (or more) commonly used records happen to map to the same cache entry, only one of them can be in the cache at one time. This requires us to go back to the disk repeatedly for these conflicting records. The solution to this problem is to use a multiway associative cache. In this algorithm, each key corresponds to a cache line, which contains more than one entry. In our example, it contains eight entries. Therefore, if a number of our records have keys that map to the same line, up to eight of them can reside in the cache simultaneously. How did I decide on an eight-way associative cache? By trying various cache sizes until I found the one that yielded the greatest performance.

The performance of a disk caching system is defined by its hit ratio, or the proportion of accesses to the disk that are avoided by using the cache. In order to estimate the performance of this caching algorithm, we have to quantify our assumptions. I have written a program called stestgen.cpp to generate test keys in which 20% of the items account for 80% of the database accesses (Figure stestgen.00), along with another called supinit.cpp that initializes the database (Figure supinit.00), and a third called supert.cpp to read the test keys and look them up (Figure supert.00). The results of this simulation indicated that, using an eight-way associative cache, approximately 44% of the disk accesses that would be needed in a noncached system could be eliminated.

Test program for caching (superm\stestgen.cpp) (Figure stestgen.00)

codelist/stestgen.00 

Database initialization program for caching (superm\supinit.cpp) (Figure supinit.00)

codelist/supinit.00 

Retrieval program for caching (superm\supert.cpp) (Figure supert.00)

codelist/supert.00 

Figure line.size shows the results of my experimentation with various line sizes.

Line size effects (Figure line.size)

Line size        Hit ratio 

1 .344

2 .38

4 .42

8 .44

16 .43

The line size is defined by the constant MAPPING_FACTOR in superm.h (Figure superm.00a).

Header file for supermarket lookup system (superm\superm.h) (Figure superm.00a)

codelist/superm.00a 

Heading for The Final Lookup

Now that we have added a cache to our optimization arsenal, only three more changes are necessary to reach the final lookup algorithm that we will implement. The first is to shrink each of the subfiles to one entry. That is, we will calculate a record address rather than a subfile address when we start trying to add or look up a record. This tends to reduce the length of the search needed to find a given record, as each record (on average) will have a different starting position, rather than a number of records having the same starting position, as is the case with longer subfiles.

The second change is that, rather than having a separate flag to indicate whether a record position in the file is in use, we will create an "impossible" key value to mean that the record is available. Since our key will consist only of decimal digits (compressed to two digits per byte), we can set the first digit to 0xf (the hex representation for 15), which cannot occur in a genuine decimal number. This will take the place of our "invalid" flag, without requiring extra storage in the record.

Finally, we have to deal with the possibility that the search for a record will encounter the end of the file, because the last position in the file is occupied by another record. In this case, we will wrap around to the beginning of the file and keep looking. In other words, position 0 in the file will be considered to follow immediately after the last position in the file.

Saving Storage

Now that we have decided on our lookup algorithm, we can shift our attention to reducing the amount of storage required for each record in our supermarket price lookup program. Without any special encoding, the disk storage requirements for one record would be 35 bytes (10 for the UPC, 21 for the description, and 4 for the price). For a file of 10,000 items, this would require 350 Kbytes; allowing 25% extra space to prevent the hashing from getting too slow means that the file would end up taking up about 437 Kbytes. For this application, disk storage space would not be a problem; however, the techniques we will use to reduce the file size are useful in many other applications as well. Also, searching a smaller file is likely to be faster, because the heads have to move a shorter distance on the average to get to the record where we are going to start our search.

If you look back at Figure initstruct, you will notice that the upc field is ten characters long. Using the ASCII code for each digit, which is the usual representation for character data, takes one byte per digit or 10 bytes in all. I mentioned above that we would be using a limited character set to reduce the size of the records. UPC codes are limited to the digits 0 through 9; if we pack two digits into one byte, by using four bits to represent each digit, we can cut that down to five bytes for each UPC value stored. Luckily, this is quite simple, as you will see when we discuss the BCD (binary-coded decimal) conversion code below. The other data compression method we will employ is to convert the item descriptions from strings of ASCII characters, limited to the sets 0-9, A-Z, and the special characters comma, period, minus, and space, to Radix40 representation, mentioned in Chapter prologue.htm. The main difference between Radix40 conversions and those for BCD is that in the former case we need to represent 40 different characters, rather than just the 10 digits, and therefore the packing of data must be done in a slightly more complicated way than just using four bits per character.

The Code

Now that we have covered the optimizations that we will use in our price lookup system, it's time to go through the code that implements these algorithms. This specific implementation is set up to handle a maximum of FILE_CAPACITY items, defined in superm.h (Figure superm.00a).5 Each of these items, as defined in the ItemRecord structure in the same file, has a price, a description, and a key, which is the UPC code. The key would be read in by a bar-code scanner in a real system, although our test program will read it in from the keyboard.

Some User-Defined Types

Several of the fields in the ItemRecord structure definition require some explanation, so let's take a closer look at that definition, shown in Figure superm.00.

ItemRecord struct definition (from superm\superm.h) (Figure superm.00)

codelist/superm.00 

The upc field is defined as a BCD (binary-coded decimal) value of ASCII_KEY_SIZE digits (contained in BCD_KEY_SIZE bytes). The description field is defined as a Radix40 field DESCRIPTION_WORDS in size; each of these words contains three Radix40 characters.

A BCD value is stored as two digits per byte, each digit being represented by a four-bit code between 0000(0) and 1001(9). Function ascii_to_BCD in bcdconv.cpp (Figure bcdconv.00) converts a decimal number, stored as ASCII digits, to a BCD value by extracting each digit from the input argument and subtracting the code for '0' from the digit value; BCD_to_ascii (Figure bcdconv.01) does the reverse.

ASCII to BCD conversion function (from superm\bcdconv.cpp) (Figure bcdconv.00)

codelist/bcdconv.00 

BCD to ASCII conversion function (from superm\bcdconv.cpp) (Figure bcdconv.01)

codelist/bcdconv.01 

A UPC code is a ten-digit number between 0000000000 and 9999999999, which unfortunately is too large to fit in a long integer of 32 bits. Of course, we could store it in ASCII, but that would require 10 bytes per UPC code. So BCD representation saves five bytes per item compared to ASCII.

A Radix40 field, as mentioned above, stores three characters (from a limited set of possibilities) in 16 bits. This algorithm (like some other data compression techniques) takes advantage of the fact that the number of bits required to store a character depends on the number of distinct characters to be represented.6 The BCD functions described above are an example of this approach. In this case, however, we need more than just the 10 digits. If our character set can be limited to 40 characters (think of a Radix40 value as a "number" in base 40), we can fit three of them in 16 bits, because 403 is less than 216.

Let's start by looking at the header file for the Radix40 conversion functions, which is shown in Figure radix40.00a.

The header file for Radix40 conversion (superm\radix40.h) (Figure radix40.00a)

codelist/radix40.00a 

The legal_chars array, shown in Figure radix40.00 defines the characters that can be expressed in this implementation of Radix40.7 The variable weights contains the multipliers to be used to construct a two-byte Radix40 value from the three characters that we wish to store in it.

The legal_chars array (from superm\radix40.cpp) (Figure radix40.00)

codelist/radix40.00 

As indicated in the comment at the beginning of the ascii_to_radix40 function (Figure radix40.01), the job of that function is to convert a null-terminated ASCII character string to Radix40 representation. After some initialization and error checking, the main loop begins by incrementing the index to the current word being constructed, after every third character is translated. It then translates the current ASCII character by indexing into the lookup_chars array, which is shown in Figure radix40.02. Any character that translates to a value with its high bit set is an illegal character and is converted to a hyphen; the result flag is changed to S_ILLEGAL if this occurs.

The ascii_to_radix40 function (from superm\radix40.cpp) (Figure radix40.01)

codelist/radix40.01 

The lookup_chars array (from superm\radix40.cpp) (Figure radix40.02)

codelist/radix40.02 

In the line radix40_data[current_word_index] += weights[cycle] * j;, the character is added into the current output word after being multiplied by the power of 40 that is appropriate to its position. The first character in a word is represented by its position in the legal_chars string. The second character is represented by 40 times that value and the third by 1600 times that value, as you would expect for a base-40 number.

The complementary function radix40_to_ascii (Figure radix40.03) decodes each character unambiguously. First, the current character is extracted from the current word by dividing by the weight appropriate to its position; then the current word is updated so the next character can be extracted. Finally, the ASCII value of the character is looked up in the legal_chars array.

The radix40_to_ascii function (from superm\radix40.cpp) (Figure radix40.03)

codelist/radix40.03 

Preparing to Access the Price File

Now that we have examined the user-defined types used in the ItemRecord structure, we can go on to the PriceFile structure, which is used to keep track of the data for a particular price file.8 The best way to learn about this structure is to follow the program as it creates, initializes, and uses it. The function main, which is shown in Figure superm.01, after checking that it was called with the correct number of arguments, calls the initialize_price_file function (Figure suplook.00) to set up the PriceFile structure.

The main function (from superm\superm.cpp) (Figure superm.01)

codelist/superm.01 

The initialize_price_file function (from superm\suplook.cpp) (Figure suplook.00)

codelist/suplook.00 

The initialize_price_file function allocates storage for and initializes the PriceFile structure, which is used to control access to the price file. This structure contains pointers to the file, to the array of cached records that we have in memory, and to the array of record numbers of those cached records. As we discussed earlier, the use of a cache can reduce the amount of time spent reading records from the disk by maintaining copies of a number of those records in memory, in the hope that they will be needed again. Of course, we have to keep track of which records we have cached, so that we can tell whether we have to read a particular record from the disk or can retrieve a copy of it from the cache instead.

When execution starts, we don't have any records cached; therefore, we initialize each entry in these arrays to an "invalid" state (the key is set to INVALID_BCD_VALUE). If file_mode is set to CLEAR_FILE, we write such an "invalid" record to every position in the price file as well, so that any old data left over from a previous run is erased.

Now that access to the price file has been set up, we can call the process function (Figure superm.02). This function allows us to enter items and/or look up their prices and descriptions, depending on mode.

The process function (from superm\superm.cpp) (Figure superm.02)

codelist/superm.02 

First, let's look at entering a new item (INPUT_MODE). We must get the UPC code, the description, and the price of the item. The UPC code is converted to BCD, the description to Radix40, and the price to unsigned. Then we call write_record (Figure suplook.01) to add the record to the file.

The write_record function (from superm\suplook.cpp) (Figure suplook.01)

codelist/suplook.01 

In order to write a record to the file, write_record calls lookup_record_number (Figure suplook.02) to determine where the record should be stored so that we can retrieve it quickly later. The lookup_record_number function does almost the same thing as lookup_record (Figure suplook.03), except tha the latter returns a pointer to the record rather than its number. Therefore, they are implemented as calls to a common function: lookup_record_and_number (Figure suplook.04).

The lookup_record_number function (from superm\suplook.cpp) (Figure suplook.02)

codelist/suplook.02 

The lookup_record function (from superm\suplook.cpp) (Figure suplook.03)

codelist/suplook.03 

The lookup_record_and_number function (from superm\suplook.cpp) (Figure suplook.04)

codelist/suplook.04 

After a bit of setup code, lookup_record_and_number determines whether the record we want is already in the cache, in which case we don't have to search the file for it. To do this, we call compute_cache_hash (Figure suplook.05), which in turn calls compute_hash (Figure suplook.06) to do most of the work of calculating the hash code.

The compute_cache_hash function (from superm\suplook.cpp) (Figure suplook.05)

codelist/suplook.05 

The compute_hash function (from superm\suplook.cpp) (Figure suplook.06)

codelist/suplook.06 

This may look mysterious, but it's actually pretty simple. After clearing the hash code we are going to calculate, it enters a loop that first shifts the old hash code one (decimal) place to the left, end around, then adds the low four bits of the next character from the key to the result. When it finishes this loop, it returns to the caller, in this case compute_cache_hash. How did I come up with this algorithm?

Making a Hash of Things

Well, as you will recall from our example of looking up a telephone number, the idea of a hash code is to make the most of variations in the input data, so that there will be a wide distribution of "starting places" for the records in the file. If all the input values produced the same hash code, we would end up with a linear search again, which would be terribly slow. In this case, our key is a UPC code, which is composed of decimal digits. If each of those digits contributes equally to the hash code, we should be able to produce a fairly even distribution of hash codes, which are the starting points for searching through the file for each record. As we noted earlier, this is one of the main drawbacks of hashing: the difficulty of coming up with a good hashing algorithm. After analyzing the nature of the data, you may have to try a few different algorithms with some test data, until you get a good distribution of hash codes. However, the effort is usually worthwhile, since you can often achieve an average of slightly over one disk access per lookup (assuming that several records fit in one physical disk record).

Meanwhile, back at compute_cache_hash, we convert the result of compute_hash, which is an unsigned value, into an index into the cache. This is then returned to lookup_record_and_number as the starting cache index. As mentioned above, we are using an eight-way associative cache, in which each key can be stored in any of eight entries in a cache line. This means that we need to know where the line starts, which is computed by compute_starting_cache_hash (Figure suplook.07) and where it ends, which is computed by compute_ending_cache_hash (Figure suplook.08).9

The compute_starting_cache_hash function (from superm\suplook.cpp) (Figure suplook.07)

codelist/suplook.07 

The compute_ending_cache_hash function (from superm\suplook.cpp) (Figure suplook.08)

codelist/suplook.08 

After determining the starting and ending positions where the key might be found in the cache, we compare the key in each entry to the key that we are looking for, and if they are equal, we have found the record in the cache. In this event, we set the value of the record_number argument to the file record number for this cache entry, and return with the status set to FOUND.

Otherwise, the record isn't in the cache, so we will have to look for it in the file; if we find it, we will need a place to store it in the cache. So we pick a "random" entry in the line (cache_replace_index) by calculating the remainder after dividing the number of accesses we have made by the MAPPING_FACTOR. This will generate an entry index between 0 and the highest entry number, cycling through all the possibilities on each successive access, thus not favoring a particular entry number.

However, if the line has an invalid entry (where the key is INVALID_BCD_VALUE), we should use that one, rather than throwing out a real record that might be needed later. Therefore, we search the line for such an empty entry, and if we are successful, we set cache_replace_index to its index.

Next, we calculate the place to start looking in the file, via compute_file_hash, (Figure suplook.09), which is very similar to compute_cache_hash except that it uses the FILE_SIZE constant in superm.h (Figure superm.00a) to calculate the index rather than the CACHE_SIZE constant, as we want a starting index in the file rather than in the cache.

The compute_file_hash function (from superm\suplook.cpp) (Figure suplook.09)

codelist/suplook.09 

As we noted above, this is another of the few drawbacks of this hashing method: the size of the file must be decided in advance, rather than being adjustable as data is entered. The reason is that to find a record in the file, we must be able to calculate its approximate position in the file in the same manner as it was calculated when the record was stored. The calculation of the hash code is designed to distribute the records evenly throughout a file of known size; if we changed the size of the file, we wouldn't be able to find records previously stored. Of course, different files can have different sizes, as long as we know the size of the file we are operating on currently: the size doesn't have to be an actual constant as it is in our example, but it does have to be known in advance for each file.

Searching the File

Now we're ready to start looking for our record in the file at the position specified by starting_file_index. Therefore, we enter a loop that searches from this starting position toward the end of the file, looking for a record with the correct key. First we set the file pointer to the first position to be read, using position_record (Figure suplook.10), then read the record at that position.

The position_record function (from superm\suplook.cpp) (Figure suplook.10)

codelist/suplook.10 

If the key in that record is the one we are looking for, our search is successful. On the other hand, if the record is invalid, then the record we are looking for is not in the file; when we add records to the file, we start at the position given by starting_file_index and store our new record in the first invalid record we find.10 Therefore, no record can overflow past an invalid record, as the invalid record would have been used to store the overflow record.

In either of these cases, we are through with the search, so we break out of the loop. On the other hand, if the entry is neither invalid nor the record we are looking for, we keep looking through the file until either we have found the record we want, we discover that it isn't in the file by encountering an invalid record, or we run off the end of the file. In the last case we start over at the beginning of the file.

If we have found the record, we copy it to the cache entry we've previously selected and copy its record number into the list of record numbers in the cache so that we'll know which record we have stored in that cache position. Then we return to the calling function, write_record, with the record we have found. If we have determined that the record is not in the file, then we obviously can't read it into the cache, but we do want to keep track of the record number where we stopped, since that is the record number that will be used for the record if we write it to the file.

To clarify this whole process, let's make a file with room for only nine records by changing FILE_SIZE to 6 in superm.h (Figure superm.00a). After adding a few records, a dump looks like Figure initcondition.

Initial condition (Figure initcondition)

Position   Key         Data 

0. INVALID

1. INVALID

2. 0000098765: MINESTRONE:245

3. 0000121212: OATMEAL, 1 LB.:300

4. INVALID

5. INVALID

6. 0000012345: JELLY BEANS:150

7. INVALID

8. 0000099887: POPCORN:99

Let's add a record with the key "23232" to the file. Its hash code turns out to be 3, so we look at position 3 in the file. That position is occupied by a record with key "121212", so we can't store our new record there. The next position we examine, number 4, is invalid, so we know that the record we are planning to add is not in the file. (Note that this is the exact sequence we follow to look up a record in the file as well). We use this position to hold our new record. The file now looks like Figure aftermilk.

After adding "milk" record (Figure aftermilk)

Position  Key          Data 

0. INVALID

1. INVALID

2. 0000098765: MINESTRONE:245

3. 0000121212: OATMEAL, 1 LB.:300

4. 0000023232: MILK:128

5. INVALID

6. 0000012345: JELLY BEANS:150

7. INVALID

8. 0000099887: POPCORN:99

Looking up our newly added record follows the same algorithm. The hash code is still 3, so we examine position 3, which has the key "121212". That's not the desired record, and it's not invalid, so we continue. Position 4 does match, so we have found our record. Now let's try to find some records that aren't in the file.

If we try to find a record with key "98789", it turns out to have a hash code of 8. Since that position in the file is in use, but with a different key, we haven't found our record. However, we have encountered the end of the file. What next?

Wrapping Around at End-of-File

In order to continue looking for this record, we must start over at the beginning of the file. That is, position 0 is the next logical position after the last one in the file. As it happens, position 0 contains an invalid record, so we know that the record we want isn't in the file.11 In any event, we are now finished with lookup_record_and_number. Therefore, we return to lookup_record_number, which returns the record number to be used to write_record (Figure suplook.01), along with a status value of FILE_FULL, FOUND, or NOT_IN_FILE (which is the status we want). FILE_FULL is an error, as we cannot add a record to a file that has reached its capacity. So is FOUND, in this situation, as we are trying to add a new record, not find one that alreadys exists. In either of these cases, we simply return the status to the calling function, process, (Figure superm.02), which gives an appropriate error message and continues execution.

However, if the status is NOT_IN_FILE, write_record continues by positioning the file to the record number returned by lookup_record_number, writing the record to the file, and returns the status NOT_IN_FILE to process, which continues execution normally.

That concludes our examination of the input mode in process. The lookup mode is very similar, except that it uses the lookup_record function (Figure suplook.03) rather than lookup_record_number, since it wants the record to be returned, not just the record number. The lookup mode, of course, also differs from the entry mode in that it expects the record to be in the file, and displays the record data when found.

After process terminates when the user enters "*" instead of a code number to be looked up or entered, main finishes up by calling terminate_price_file (Figure suplook.11) , which closes the price file and returns. All processing complete, main exits to the operating system.

The terminate_price_file function (from superm\suplook.cpp) (Figure suplook.11)

codelist/suplook.11 

Summary

In this chapter, we have covered ways to save storage by using a restricted character set and to gain rapid access to data by an exact key, using hash coding and caching. In the next chapter we will see how to use bitmaps and distribution sorting to aid in rearranging information by criteria that can be specified at run-time.

Problems

  1. What modifications to the program would be needed to support:
    1. Deleting records?
    2. Handling a file that becomes full, as an off-line process?
    3. Keeping track of the inventory of each item?
  2. How could hash coding be applied to tables in memory?
  3. How could caching be applied to reduce the time needed to look up an entry in a table in memory?

(You can find suggested approaches to problems in Chapter artopt.htm).

Footnotes

  1. Tape drives are the most commonly used sequential access devices.
  2. In general, the "next open place" is not a very good place to put an overflow record if the hash table is kept in memory rather than on the disk; the added records "clog" the table, leading to slower access times. A linked list approach is much better for tables that are actually memory resident. Warning: this does not apply to tables in virtual memory, where linked lists provide very poor performance. For more discussion on overflow handling, see the dynamic hashing algorithm in Chapter dynhash.htm.
  3. For a description of "last-come, first-served" hashing, see Patricio V. Poblete and J. Ian Munro, "Last-Come-First-Served Hashing", in Journal of Algorithms 10, 228-248, or my article "Galloping Algorithms", in Windows Tech Journal, 2(February 1993), 40-43.
  4. This is a direct-mapped cache.
  5. Increasing the maximum number of records in the file by increasing FILE_CAPACITY would also increase the amount of memory required for the cache unless we reduced the cache size as a fraction of the file size by reducing the value .20 in the calculation of APPROXIMATE_CACHE_SIZE.
  6. The arithmetic coding data compression algorithm covered in Chapter compress.htm, however, does not restrict the characters that can be represented; rather, it takes advantage of the differing probabilities of encountering each character in a given situation.
  7. Note that this legal_chars array must be kept in synchronization with the lookup_chars array, shown in Figure radix40.02.
  8. While we could theoretically have more than one of these files active at a time, our example program uses only one such file.
  9. This function actually returns one more than the index to the last entry in the line because the standard C loop control goes from the first value up to one less than the ending value.
  10. Of course, we might also find a record with the same key as the one we are trying to add, but this is an error condition, since keys must be unique.
  11. If we were adding a new record with this key rather than trying to find one, we would use position 0.


Optimizing C ++
Optimizing C ++
ISBN: 0139774300
EAN: 2147483647
Year: 2005
Pages: 12
Authors: Steve Heller

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net