Armed with the tools outlined in the previous three sections, we now consider the analysis of sequential search and binary search, two basic algorithms for determining whether or not any of a sequence of objects appears among a set of previously stored objects. Our purpose is to illustrate the manner in which we will compare algorithms, rather than to describe these particular algorithms in detail. For simplicity, we assume here that the objects in question are integers. We will consider more general applications in great detail in Chapters 12 through 16. The simple versions of the algorithms that we consider here not only expose many aspects of the algorithm design and analysis problem but also have many direct applications.
For example, we might imagine a credit-card company that has N credit risks or stolen credit cards and wants to check whether any of M given transactions involves any one of the N bad numbers. To be concrete, we might think of N being large (say on the order of 10³ to 10⁶) and M being huge (say on the order of 10⁶ to 10⁹) for this application. The goal of the analysis is to be able to estimate the running times of the algorithms when the values of the parameters fall within these ranges.
Program 2.1 Sequential search

This function checks whether the number v is among a previously stored set of numbers in a[l], a[l+1], ..., a[r], by comparing against each number sequentially, starting at the beginning. If we reach the end without finding the number sought, then we return the value -1. Otherwise, we return the index of the array position containing the number.
static int search(int a[], int v, int l, int r)
  { int i;
    for (i = l; i <= r; i++)
      if (v == a[i]) return i;
    return -1;
  }
Program 2.1 implements a straightforward solution to the search problem. It is implemented as a Java method that operates on an array (see Chapter 3) for better compatibility with other code that we will examine for the same problem in Part IV. It is not necessary to understand these details to understand the algorithm: We store all the objects in an array; then, for each transaction, we look through the array sequentially, from beginning to end, checking each to see whether it is the one that we seek.
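A short driver makes the calling convention concrete. The class name, the sample numbers, and the driver itself are illustrative sketches, not part of Program 2.1; only the search method follows the text.

```java
public class SequentialSearchDemo {
    // Program 2.1's sequential search: scan a[l..r] for v.
    static int search(int[] a, int v, int l, int r) {
        for (int i = l; i <= r; i++)
            if (v == a[i]) return i;
        return -1;
    }
    public static void main(String[] args) {
        // Hypothetical table of stored "bad" numbers.
        int[] a = { 6504, 4548, 5023, 8159 };
        System.out.println(search(a, 5023, 0, a.length - 1)); // found: prints its index, 2
        System.out.println(search(a, 5025, 0, a.length - 1)); // not found: prints -1
    }
}
```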
To analyze the algorithm, we note immediately that the running time depends on whether or not the object sought is in the array. We can determine that the search is unsuccessful only by examining each of the N objects, but a search could end successfully at the first, second, or any one of the objects.
Therefore, the running time depends on the data. If all the searches are for the number that happens to be in the first position in the array, then the algorithm will be fast; if they are for the number that happens to be in the last position in the array, it will be slow. We discuss in Section 2.7 the distinction between being able to guarantee performance and being able to predict performance. In this case, the best guarantee that we can provide is that no more than N numbers will be examined.
To make a prediction, however, we need to make an assumption about the data. In this case, we might choose to assume that all the numbers are randomly chosen. This assumption implies, for example, that each number in the table is equally likely to be the object of a search. On reflection, we realize that it is that property of the search that is critical, because with randomly chosen numbers we would be unlikely to have a successful search at all (see Exercise 2.48). For some applications, the number of transactions that involve a successful search might be high; for other applications, it might be low. To avoid confusing the model with properties of the application, we separate the two cases (successful and unsuccessful) and analyze them independently. This example illustrates that a critical part of an effective analysis is the development of a reasonable model for the application at hand. Our analytic results will depend on the proportion of searches that are successful; indeed, it will give us information that we might need if we are to choose different algorithms for different applications based on this parameter.
Property 2.1
Sequential search examines N numbers for each unsuccessful search and about N/2 numbers for each successful search on the average.
If each number in the table is equally likely to be the object of a search, then

(1 + 2 + ... + N)/N = (N + 1)/2

is the average cost of a search.
Property 2.1 implies that the running time of Program 2.1 is proportional to N, subject to the implicit assumption that the average cost of comparing two numbers is constant. Thus, for example, we can expect that if we double the number of objects, we double the amount of time required for a search.
We can speed up sequential search for unsuccessful search by putting the numbers in the table in order. Sorting the numbers in the table is the subject of Chapters 6 through 11. A number of the algorithms that we will consider get that task done in time proportional to N log N, which is insignificant by comparison to the search costs when M is huge. In an ordered table, we can terminate the search immediately on reaching a number that is larger than the one that we seek. This change reduces the cost of sequential search to about N/2 numbers examined for unsuccessful search, the same as for successful search.
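The modification described above can be sketched as follows. This is our variant of Program 2.1 for a table in increasing order, not code from the text: the loop stops as soon as it reaches an entry larger than the one sought.

```java
public class OrderedSequentialSearch {
    // Sequential search in an ordered table: since a[l..r] is in
    // increasing order, we can stop as soon as a[i] exceeds v.
    static int search(int[] a, int v, int l, int r) {
        for (int i = l; i <= r && a[i] <= v; i++)
            if (v == a[i]) return i;
        return -1;
    }
    public static void main(String[] args) {
        int[] a = { 10, 20, 30, 40, 50 };
        System.out.println(search(a, 30, 0, 4)); // prints 2
        // Unsuccessful search for 25 examines only 3 entries (10, 20, 30),
        // not all 5, before terminating.
        System.out.println(search(a, 25, 0, 4)); // prints -1
    }
}
```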
Program 2.2 Binary search

This program has the same functionality as Program 2.1, but it is much more efficient.
static int search(int a[], int v, int l, int r)
  { while (r >= l)
      { int m = (l+r)/2;
        if (v == a[m]) return m;
        if (v < a[m]) r = m-1;
        else l = m+1;
      }
    return -1;
  }
Property 2.2
Sequential search in an ordered table examines N numbers for each search in the worst case and about N/2 numbers for each search on the average.
We still need to specify a model for unsuccessful search. This result follows from assuming that the search is equally likely to terminate at any one of the N + 1 intervals defined by the N numbers in the table, which leads immediately to the expression

(1 + 2 + ... + N + N)/(N + 1).

The cost of an unsuccessful search ending before or after the Nth entry in the table is the same: N.
Another way to state the result of Property 2.2 is to say that the running time of sequential search is proportional to MN for M transactions, on the average and in the worst case. If we double either the number of transactions or the number of objects in the table, we can expect the running time to double; if we double both, we can expect the running time to go up by a factor of 4. The result also tells us that the method is not suitable for huge tables. If it takes c microseconds to examine a single number, then, for M = 10⁹ and N = 10⁶, the running time for all the transactions would be at least (c/2)10⁹ seconds or, by Figure 2.1, about 16c years, which is prohibitive.
Program 2.2 is a classical solution to the search problem that is much more efficient than sequential search. It is based on the idea that if the numbers in the table are in order, we can eliminate half of them from consideration by comparing the one that we seek with the one at the middle position in the table. If it is equal, we have a successful search. If it is less, we apply the same method to the left half of the table. If it is greater, we apply the same method to the right half of the table. Figure 2.7 is an example of the operation of this method on a sample set of numbers.
To see whether or not 5025 is in the table of numbers in the left column, we first compare it with 6504; that leads us to consider the first half of the array. Then we compare against 4548 (the middle of the first half); that leads us to the second half of the first half. We continue, always working on a subarray that would contain the number being sought, if it is in the table. Eventually, we get a subarray with just one element, which is not equal to 5025, so 5025 is not in the table.
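A trace of Program 2.2 shows this halving in action. The table below is a small made-up sorted sample (it includes 4548, 5023, and 6504 from the description above, but is not the full column of Figure 2.7); the extra print statement showing each subarray is our addition for illustration.

```java
public class BinarySearchTrace {
    // Program 2.2 with a trace line added: each iteration prints the
    // subarray a[l..r] that could still contain v.
    static int search(int[] a, int v, int l, int r) {
        while (r >= l) {
            System.out.println("considering a[" + l + ".." + r + "]");
            int m = (l + r) / 2;
            if (v == a[m]) return m;
            if (v < a[m]) r = m - 1;
            else l = m + 1;
        }
        return -1;
    }
    public static void main(String[] args) {
        int[] a = { 1090, 2486, 4548, 5023, 6504, 7331, 8159, 9679 };
        // The subarray shrinks 8 -> 4 -> 1 elements before the search fails.
        System.out.println(search(a, 5025, 0, a.length - 1)); // prints -1
    }
}
```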
Property 2.3
Binary search never examines more than ⌊lg N⌋ + 1 numbers.
The proof of this property illustrates the use of recurrence relations in the analysis of algorithms. If we let T_N represent the number of comparisons required for binary search in the worst case, then the way in which the algorithm reduces search in a table of size N to search in a table half the size immediately implies that

T_N ≤ T_⌊N/2⌋ + 1, for N ≥ 2, with T_1 = 1.

To search in a table of size N, we examine the middle number, then search in a table of size no larger than ⌊N/2⌋. The actual cost could be less than this value either because the comparison might cause us to terminate a successful search or because the table to be searched might be of size ⌊N/2⌋ - 1 (if N is even). As we did in the solution of Formula 2.2, we can prove immediately that T_N ≤ n + 1 if N = 2ⁿ and then verify the general result by induction.
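The bound can also be checked numerically. The following sketch (ours, not the book's) evaluates the recurrence directly and compares it against ⌊lg N⌋ + 1 for every N up to one million.

```java
public class BinarySearchRecurrence {
    // T_N from the recurrence T_N = T_{floor(N/2)} + 1, with T_1 = 1.
    static int t(int n) { return n == 1 ? 1 : t(n / 2) + 1; }

    // floor(lg n) + 1, the bound claimed by Property 2.3; the position of
    // the highest set bit of n is floor(lg n).
    static int bound(int n) { return (31 - Integer.numberOfLeadingZeros(n)) + 1; }

    public static void main(String[] args) {
        for (int n = 1; n <= 1_000_000; n++)
            if (t(n) > bound(n))
                throw new AssertionError("bound fails at N = " + n);
        System.out.println("T_N <= floor(lg N) + 1 holds for 1 <= N <= 1000000");
    }
}
```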
Property 2.3 says that we can solve a huge search problem with up to 1 billion numbers with at most 30 comparisons per transaction, and that is likely to be much less than the time it takes to read the request or write the result on typical computers. The search problem is so important that several techniques have been developed that are even faster than this one, as we shall see in Chapters 12 through 16.
Note that we express Property 2.1 and Property 2.2 in terms of the operations that we perform most often on the data. As noted in the commentary following Property 2.1, we expect that each operation should take a constant amount of time, and we can conclude that the running time of binary search is proportional to lg N as compared to N for sequential search. As we double N, the running time of binary search hardly changes, but the running time of sequential search doubles. As N grows, the gap between the two methods becomes a chasm.
We can verify the analytic evidence of Properties 2.1 and 2.2 by implementing and testing the algorithms. For example, Table 2.4 shows running times for binary search and sequential search for M searches in a table of size N (including, for binary search, the cost of sorting the table) for various values of M and N. We will not consider the implementation of the program to run these experiments in detail here because it is similar to those that we consider in full detail in Chapters 6 and 11, and because we consider the use of library methods and other details of putting together programs from constituent pieces in Chapter 3. For the moment, we simply stress that doing empirical testing is an integral part of evaluating the efficiency of an algorithm.
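A minimal version of such an experiment might look as follows. This is our sketch, not the book's driver: it times M sequential and M binary searches in a table of N random integers, charging the sort to binary search as Table 2.4 does, and uses the library methods Arrays.sort and Arrays.binarySearch rather than Programs 2.1 and 2.2.

```java
import java.util.Arrays;
import java.util.Random;

public class SearchTimer {
    // Sequential search over the whole array, as in Program 2.1.
    static int seqSearch(int[] a, int v) {
        for (int i = 0; i < a.length; i++)
            if (v == a[i]) return i;
        return -1;
    }

    public static void main(String[] args) {
        int N = 1000, M = 10000; // table size and number of transactions
        Random rnd = new Random(1);
        int[] a = new int[N];
        for (int i = 0; i < N; i++) a[i] = rnd.nextInt();
        int[] keys = new int[M];
        for (int i = 0; i < M; i++) keys[i] = rnd.nextInt();

        long t0 = System.nanoTime();
        for (int v : keys) seqSearch(a, v);
        long seq = System.nanoTime() - t0;

        t0 = System.nanoTime();
        int[] b = a.clone();
        Arrays.sort(b);                          // sort cost charged to binary search
        for (int v : keys) Arrays.binarySearch(b, v);
        long bin = System.nanoTime() - t0;

        System.out.println("sequential: " + seq / 1_000_000 + " ms");
        System.out.println("binary:     " + bin / 1_000_000 + " ms");
    }
}
```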
Table 2.4 validates our observation that the functional growth of the running time allows us to predict performance for huge cases on the basis of empirical studies for small cases. The combination of mathematical analysis and empirical studies provides persuasive evidence that binary search is the preferred algorithm, by far.
This example is a prototype of our general approach to comparing algorithms. We use mathematical analysis of the frequency with which algorithms perform critical abstract operations, then use those results to deduce the functional form of the running time, which allows us to verify and extend empirical studies. As we develop algorithmic solutions to computational problems that are more and more refined, and as we develop mathematical analyses to learn their performance characteristics that are more and more refined, we call on mathematical studies from the literature, so as to keep our attention on the algorithms themselves in this book. We cannot do thorough mathematical and empirical studies of every algorithm that we encounter, but we strive to identify essential performance characteristics, knowing that, in principle, we can develop a scientific basis for making informed choices among algorithms in critical applications.
2.47 Give the average number of comparisons used by Program 2.1 in the case that aN of the searches are successful, for 0 ≤ a ≤ 1.
Table 2.4. Empirical study of sequential and binary search
These relative timings validate our analytic results that sequential search takes time proportional to MN and binary search takes time proportional to M lg N for M searches in a table of N objects. When we increase N by a factor of 2, the time for sequential search increases by a factor of 2 as well, but the time for binary search hardly changes. Sequential search is infeasible for huge M as N increases, but binary search is fast even for huge tables.
            M = 1000       M = 10000      M = 100000
       N      S     B        S      B       S      B
     125      3     3       36     12     357    126
     250      6     2       63     13     636    130
     500     13     1      119     14    1196    137
    1250     28     1      286     15    2880    146
    2500     57     1      570     16            154
    5000    113     1     1172     17            164
   12500    308     2     3073     17            173
   25000    612     1              19            183
   50000   1217     2              20            196
  100000   2682     2              21            209

Key:
  S  sequential search (Program 2.1)
  B  binary search (Program 2.2)
2.48 Estimate the probability that at least one of M random 10-digit numbers matches one of a set of N given values, for M = 10, 100, and 1000 and N = 10³, 10⁴, 10⁵, and 10⁶.
2.49 Write a driver program that generates M random integers and puts them in an array, then counts the number of N random integers that matches one of the numbers in the array, using sequential search. Run your program for M = 10, 100, and 1000 and N = 10, 100, and 1000.
2.50 State and prove a property analogous to Property 2.3 for binary search.