In the Real World Computer-Based Sorting and Searching

Donald E. Knuth's Sorting and Searching, volume 3 of his The Art of Computer Programming series, is the seminal work on computer algorithms for performing sorts and searches. Dr. Knuth is Professor Emeritus of The Art of Computer Programming at Stanford University. Addison-Wesley published the first edition of Sorting and Searching in 1973, and there's a good probability that every student granted a computer science degree during or after the mid-1970s has a well-worn copy of Knuth's classic text. Knuth updated Sorting and Searching with a second edition in 1998; the book remains required reading for assembly-language programmers, but you need a good foundation in combinatorial mathematics and set theory to understand the contents fully.

The Influence of Computer Power on Knuth's Approach

As Knuth points out on the first page of the chapter on sorting, a better term for the process is "ordering." (The 724-page book has only two chapters: Chapter 6, "Sorting," and Chapter 7, "Searching.") Structured Query Language (SQL) takes Knuth's advice and uses ORDER BY clauses to define sort sequences. One dictionary definition of the verb "to sort" is "to arrange according to characteristics," and the definition of "order" includes "arrange" as a synonym.

Both sort and order imply that the process physically moves records, and in the 1970s, when punched cards were the dominant means of computer data entry and storage, it did. The advent of magnetic tape drives eliminated the need for punched-card sorting and collating machines, but sorting still required that individual records be rewritten to tape in the chosen order. Decks of punched cards and magnetic tapes permit only sequential access, so sorting by merging expedites searching, provided that records matching your search criteria appear early in the deck or tape. Thus Knuth's "Sorting" chapter precedes "Searching," just as sorting precedes searching in this book. Searching is the foundation for all filtering operations.
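The merge step that made tape sorting practical is easy to sketch. The following minimal illustration (the function name and sample runs are invented for this example) merges two already-sorted runs while reading each input strictly sequentially, the same constraint a tape drive imposes:

```python
def merge_runs(run_a, run_b):
    """Merge two sorted runs into one sorted run, reading each
    input strictly front to back, as a tape drive would."""
    merged, i, j = [], 0, 0
    while i < len(run_a) and j < len(run_b):
        if run_a[i] <= run_b[j]:
            merged.append(run_a[i]); i += 1
        else:
            merged.append(run_b[j]); j += 1
    merged.extend(run_a[i:])  # one run is exhausted;
    merged.extend(run_b[j:])  # copy the rest of the other
    return merged

print(merge_runs([1, 4, 9], [2, 3, 10]))  # [1, 2, 3, 4, 9, 10]
```

Repeatedly merging longer and longer runs between tapes is the essence of the external merge sorts Knuth analyzes.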

Note

SQL Server's clustered indexes physically order records by the primary key to speed execution of multitable queries. When you export Jet tables to SQL Server with Access 2003's Upsizing Wizard, the wizard automatically creates a clustered index on the primary key field.


Today's PCs are far more powerful than the largest mainframe computers of the 1970s. Multi-gigabyte fixed-disk drives in PC clients dwarf the storage capacity of the tape and multi-spindle disk drives of the 1970s and early 1980s. When you apply a sort order to a Jet table or query, the records don't change position; Access simply displays them in the desired sequence. If you have plenty of RAM, all the record resequencing occurs in memory, because Jet fetches only the records needed to populate the visible rows of the datasheet, plus a few extra to speed page-down operations. When Jet runs out of RAM, temporary disk files store the overflow. It's no longer necessary to optimize searching by sorting first; the brute-force approach of searching a random-order file usually is fast enough for files of moderate (10,000 records) or even large (1 million or more records) size.

Knuth and Indexes

Two Russian mathematicians, G. M. Adelson-Velskii and E. M. Landis, proposed a balanced binary tree indexing structure in 1962. In a balanced binary tree, the length of the search path to any ordered record is never more than 45% longer than the optimum. Jet, like most other desktop RDBMSs, uses a balanced-tree (B-tree) index structure; a Jet primary key index orders the records.

One of Knuth's other contributions to computer science is his analysis of binary tree searching on ordered tables. An ordered table's records are physically or logically arranged in alphabetic or numeric order by the key field being searched. Binary tree searches minimize the number of comparisons required to zero in on the record(s) with the desired value. Knuth also went into great detail on "hashing" algorithms, which map each record's key to a compact identifying value. Hashing greatly speeds searching on a key field that comprises more than a few characters. The "hash tables" of early databases are called indexes today; SQL Server 2000 still generates temporary hash tables when needed to speed query processing.
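The payoff of searching an ordered table is logarithmic cost: binary search needs at most about log2(n) comparisons. A minimal sketch (the function name and the synthetic table are invented for this illustration) that also counts the comparisons it makes:

```python
def binary_search(ordered, key):
    """Return (index, comparisons) for key in a sorted list,
    or (None, comparisons) if the key is absent."""
    lo, hi, comparisons = 0, len(ordered) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2  # probe the middle of the range
        comparisons += 1
        if ordered[mid] == key:
            return mid, comparisons
        if ordered[mid] < key:
            lo = mid + 1      # discard the lower half
        else:
            hi = mid - 1      # discard the upper half
    return None, comparisons

# A synthetic ordered "table" of 10,000 keys: even the worst-case
# lookup takes no more than 14 probes, versus up to 10,000 for a
# sequential scan of an unordered file.
table = list(range(10_000))
idx, probes = binary_search(table, 8191)
print(idx, probes)
```

Each probe halves the remaining range, which is why ordering (or an ordered index) matters so much for large tables.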

When you search on a field that isn't ordered, called a secondary key, search efficiency drops rapidly for large tables. The early approaches of the 1970s, including a process called combinatorial hashing, have given way to secondary indexes on unordered keys, such as postal codes in a table whose primary key is a customer name or code. Each secondary key you add decreases the speed at which you can insert new records, because every index's tree must be maintained and rebalanced. Despite the performance of today's PC clients and servers, it's still a good idea to minimize the number of secondary indexes on tables used for online transaction processing (OLTP).
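Conceptually, a secondary index is just a lookup structure mapping each value of the unordered field to the primary keys of the matching rows. A minimal sketch (the sample rows are invented, loosely modeled on a customers table with a customer-code primary key and a postal-code secondary field):

```python
# Rows keyed by customer code (the primary key); postal_code is
# an unordered secondary field.
customers = {
    "ALFKI": {"name": "Alfreds Futterkiste",  "postal_code": "12209"},
    "BONAP": {"name": "Bon app'",             "postal_code": "13008"},
    "BLONP": {"name": "Blondel pere et fils", "postal_code": "13008"},
}

# Build the secondary index: postal code -> list of primary keys.
# Every insert into `customers` must also update this structure,
# which is why extra secondary indexes slow OLTP inserts.
postal_index = {}
for code, row in customers.items():
    postal_index.setdefault(row["postal_code"], []).append(code)

print(sorted(postal_index["13008"]))  # ['BLONP', 'BONAP']
```

A real RDBMS keeps such an index in a balanced tree on disk rather than an in-memory dictionary, but the insert-time maintenance cost is the same idea.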

It isn't necessary to understand the underlying details of hashing and balanced B-tree indexes to take full advantage of Access's searching and sorting features. Familiarity with the surprisingly efficient methods used in the early days of computing, however, offers a useful perspective on the dramatic improvements in database design and implementation that have occurred in the 30 years since Knuth published the first edition of Sorting and Searching.



Special Edition Using Microsoft Office Access 2003
ISBN: 0789729520
Year: 2005
Pages: 417
