Searching Algorithms | Searching and Sorting

Looking up a phone number, accessing a Web site and checking the definition of a word in a dictionary all involve searching large amounts of data. Searching algorithms all accomplish the same goalfinding an element that matches a given search key, if such an element does, in fact, exist. There are, however, a number of things that differentiate search algorithms from one another. The major difference is the amount of effort they require to complete the search. One way to describe this effort is with Big O notation. For searching and sorting algorithms, this is particularly dependent on the number of data elements.

In Chapter 7, we discussed the linear search algorithm, which is a simple and easy-to-implement searching algorithm. We will now discuss the efficiency of the linear search algorithm as measured by Big O notation. Then, we will introduce a searching algorithm that is relatively efficient but more complex and difficult to implement.

20.2.1. Efficiency of Linear Search

Suppose an algorithm simply tests whether the first element of a vector is equal to the second element of the vector. If the vector has 10 elements, this algorithm requires one comparison. If the vector has 1000 elements, the algorithm still requires one comparison. In fact, the algorithm is completely independent of the number of elements in the vector. This algorithm is said to have a constant runtime, which is represented in Big O notation as O(1). An algorithm that is O(1) does not necessarily require only one comparison. O(1) just means that the number of comparisons is constantit does not grow as the size of the vector increases. An algorithm that tests whether the first element of a vector is equal to any of the next three elements will always require three comparisons, but in Big O notation it is considered O(1). O(1) is often pronounced "on the order of 1" or more simply "order 1."

An algorithm that tests whether the first element of a vector is equal to any of the other elements of the vector requires at most n 1 comparisons, where n is the number of elements in the vector. If the vector has 10 elements, this algorithm requires up to nine comparisons. If the vector has 1000 elements, this algorithm requires up to 999 comparisons. As n grows larger, the n part of the expression "dominates," and subtracting one becomes inconsequential. Big O is designed to highlight these dominant terms and ignore terms that become unimportant as n grows. For this reason, an algorithm that requires a total of n 1 comparisons (such as the one we described in this paragraph) is said to be O(n). An O(n) algorithm is referred to as having a linear runtime. O(n) is often pronounced "on the order of n" or more simply "order n."

Now suppose you have an algorithm that tests whether any element of a vector is duplicated elsewhere in the vector. The first element must be compared with every other element in the vector. The second element must be compared with every other element except the first (it was already compared to the first). The third element must be compared with every other element except the first two. In the end, this algorithm will end up making (n 1) + (n 2) + ... + 2 + 1 or n2/2 n/2 comparisons. As n increases, the n2 term dominates and the n term becomes inconsequential. Again, Big O notation highlights the n2 term, leaving n2/2. As we will soon see, however, constant factors are omitted in Big O notation.

Big O is concerned with how an algorithm's runtime grows in relation to the number of items processed. Suppose an algorithm requires n2 comparisons. With four elements, the algorithm will require 16 comparisons; with eight elements, 64 comparisons. With this algorithm, doubling the number of elements quadruples the number of comparisons. Consider a similar algorithm requiring n2/2 comparisons. With four elements, the algorithm will require eight comparisons; with eight elements, 32 comparisons. Again, doubling the number of elements quadruples the number of comparisons. Both of these algorithms grow as the square of n, so Big O ignores the constant, and both algorithms are considered to be O(n2), which is referred to as quadratic runtime and pronounced "on the order of n-squared" or more simply "order n-squared."

When n is small, O(n2) algorithms (running on today's billion-operation-per-second personal computers) will not noticeably affect performance. But as n grows, you will start to notice the performance degradation. An O(n2) algorithm running on a million-element vector would require a trillion "operations" (where each could actually require several machine instructions to execute). This could require a few hours to execute. A billion-element vector would require a quintillion operations, a number so large that the algorithm could take decades! O(n2) algorithms are easy to write, as you have seen in previous chapters. In this chapter, you will see algorithms with more favorable Big O measures. These efficient algorithms often take a bit more cleverness and effort to create, but their superior performance can be well worth the extra effort, especially as n gets large and as algorithms are compounded into larger programs.

The linear search algorithm runs in O(n) time. The worst case in this algorithm is that every element must be checked to determine whether the search key exists in the vector. If the size of the vector is doubled, the number of comparisons that the algorithm must perform is also doubled. Note that linear search can provide outstanding performance if the element matching the search key happens to be at or near the front of the vector. But we seek algorithms that perform well, on average, across all searches, including those where the element matching the search key is near the end of the vector.

Linear search is the easiest search algorithm to implement, but it can be slow compared to other search algorithms. If a program needs to perform many searches on large vectors, it may be better to implement a different, more efficient algorithm, such as the binary search which we present in the next section.

Performance Tip 20.1

Sometimes the simplest algorithms perform poorly. Their virtue is that they are easy to program, test and debug. Sometimes more complex algorithms are required to realize maximum performance.

20.2.2. Binary Search

The binary search algorithm is more efficient than the linear search algorithm, but it requires that the vector first be sorted. This is only worthwhile when the vector, once sorted, will be searched a great many timesor when the searching application has stringent performance requirements. The first iteration of this algorithm tests the middle element in the vector. If this matches the search key, the algorithm ends. Assuming the vector is sorted in ascending order, then if the search key is less than the middle element, the search key cannot match any element in the second half of the vector and the algorithm continues with only the first half of the vector (i.e., the first element up to, but not including, the middle element). If the search key is greater than the middle element, the search key cannot match any element in the first half of the vector and the algorithm continues with only the second half of the vector (i.e., the element after the middle element through the last element). Each iteration tests the middle value of the remaining portion of the vector. If the element does not match the search key, the algorithm eliminates half of the remaining elements. The algorithm ends either by finding an element that matches the search key or by reducing the subvector to zero size.

As an example, consider the sorted 15-element vector

 2 3 5 10 27 30 34 51 56 65 77 81 82 93 99

and a search key of 65. A program implementing the binary search algorithm would first check whether 51 is the search key (because 51 is the middle element of the vector). The search key (65) is larger than 51, so 51 is discarded along with the first half of the vector (all elements smaller than 51.) Next, the algorithm checks whether 81 (the middle element of the remainder of the vector) matches the search key. The search key (65) is smaller than 81, so 81 is discarded along with the elements larger than 81. After just two tests, the algorithm has narrowed the number of elements to check to three (56, 65 and 77). The algorithm then checks 65 (which indeed matches the search key), and returns the index (9) of the vector element containing 65. In this case, the algorithm required just three comparisons to determine whether an element of the vector matched the search key. Using a linear search algorithm would have required 10 comparisons. [Note: In this example, we have chosen to use a vector with 15 elements, so that there will always be an obvious middle element in the vector. With an even number of elements, the middle of the vector lies between two elements. We implement the algorithm to choose the larger of those two elements.]

Figures 20.220.3 define class BinarySearch and its member functions respectively. Class BinarySearch is similar to LinearSearch (Section 7.7)it has a constructor, a search function (binarySearch), a displayElements function, two private data members and a private utility function (displaySubElements). Lines 1828 of Fig. 20.3 define the constructor. After initializing the vector with random ints from 1099 (lines 2425), line 27 calls the Standard Library function sort on the vector data. Function sort takes two random-access iterators and sorts the elements in vector data in ascending order. A random-access iterator is an iterator that allows access to any data item in the vector at any time. In this case, we use data.begin() and data.end() to encompass the entire vector. Recall that the binary search algorithm will work only on a sorted vector.

Figure 20.2. BinarySearch class definition.

(This item is displayed on page 980 in the print version)

 1 // Fig 20.2: BinarySearch.h
 2 // Class that contains a vector of random integers and a function
 3 // that uses binary search to find an integer.
 4 #include 
 5 using std::vector;
 6
 7 class BinarySearch
 8 {
 9 public:
10 BinarySearch( int ); // constructor initializes vector
11 int binarySearch( int ) const; // perform a binary search on vector
12 void displayElements() const; // display vector elements
13 private:
14 int size; // vector size
15 vector< int > data; // vector of ints
16 void displaySubElements( int, int ) const; // display range of values
17 }; // end class BinarySearch

Figure 20.3. BinarySearch class member-function definition.

(This item is displayed on pages 980 - 981 in the print version)

 1 // Fig 20.3: BinarySearch.cpp
 2 // BinarySearch class member-function definition.
 3 #include 
 4 using std::cout;
 5 using std::endl;
 6
 7 #include  // prototypes for functions srand and rand
 8 using std::rand;
 9 using std::srand;
10
11 #include  // prototype for function time
12 using std::time;
13
14 #include  // prototype for sort function
15 #include "BinarySearch.h" // class BinarySearch definition
16
17 // constructor initializes vector with random ints and sorts the vector
18 BinarySearch::BinarySearch( int vectorSize )
19 {
20 size = ( vectorSize > 0 ? vectorSize : 10 ); // validate vectorSize
21 srand( time( 0 ) ); // seed using current time
22
23 // fill vector with random ints in range 10-99
24 for ( int i = 0; i < size; i++ )
25 data.push_back( 10 + rand() % 90 ); // 10-99
26
27 std::sort( data.begin(), data.end() ); // sort the data
28 } // end BinarySearch constructor
29
30 // perform a binary search on the data 
31 int BinarySearch::binarySearch( int searchElement ) const
32 {
33 int low = 0; // low end of the search area 
34 int high = size - 1; // high end of the search area 
35 int middle = ( low + high + 1 ) / 2; // middle element
36 int location = -1; // return value; -1 if not found 
37
38 do // loop to search for element
39 { 
40 // print remaining elements of vector to be searched
41 displaySubElements( low, high ); 
42
43 // output spaces for alignment 
44 for ( int i = 0; i < middle; i++ )
45  cout << " "; 
46
47 cout << " * " << endl; // indicate current middle
48
49 // if the element is found at the middle 
50 if ( searchElement == data[ middle ] ) 
51  location = middle; // location is the current middle 
52 else if ( searchElement < data[ middle ] ) // middle is too high
53  high = middle - 1; // eliminate the higher half 
54 else // middle element is too low 
55  low = middle + 1; // eliminate the lower half 
56
57  middle = ( low + high + 1 ) / 2; // recalculate the middle
58 } while ( ( low <= high ) && ( location == -1 ) ); 
59
60 return location; // return location of search key
61 } // end function binarySearch
62
63 // display values in vector
64 void BinarySearch::displayElements() const
65 {
66 displaySubElements( 0, size - 1 );
67 } // end function displayElements
68
69 // display certain values in vector
70 void BinarySearch::displaySubElements( int low, int high ) const
71 {
72 for ( int i = 0; i < low; i++ ) // output spaces for alignment
73 cout << " ";
74
75 for ( int i = low; i <= high; i++ ) // output elements left in vector
76 cout << data[ i ] << " ";
77
78 cout << endl;
79 } // end function displaySubElements

Lines 3161 define function binarySearch. The search key is passed into parameter searchElement (line 31). Lines 3335 calculate the low end index, high end index and middle index of the portion of the vector that the program is currently searching. At the beginning of the function, the low end is 0, the high end is the size of the vector minus 1 and the middle is the average of these two values. Line 36 initializes the location of the found element to -1the value that will be returned if the search key is not found. Lines 3858 loop until low is greater than high (this occurs when the element is not found) or location does not equal -1 (indicating that the search key was found). Line 50 tests whether the value in the middle element is equal to searchElement. If this is TRue, line 51 assigns middle to location. Then the loop terminates and location is returned to the caller. Each iteration of the loop tests a single value (line 50) and eliminates half of the remaining values in the vector (line 53 or 55).

Lines 2541 of Fig. 20.4 loop until the user enters the value -1. For each other number the user enters, the program performs a binary search on the data to determine whether it matches an element in the vector. The first line of output from this program is the vector of ints, in increasing order. When the user instructs the program to search for 38, the program first tests the middle element, which is 67 (as indicated by *). The search key is less than 67, so the program eliminates the second half of the vector and tests the middle element from the first half of the vector. The search key equals 38, so the program returns the index 3.

Figure 20.4. BinarySearch test program.

(This item is displayed on pages 983 - 984 in the print version)

 1 // Fig 20.4: Fig20_04.cpp
 2 // BinarySearch test program.
 3 #include 
 4 using std::cin;
 5 using std::cout;
 6 using std::endl;
 7
 8 #include "BinarySearch.h" // class BinarySearch definition
 9
10 int main()
11 {
12 int searchInt; // search key
13 int position; // location of search key in vector
14
15 // create vector and output it
16 BinarySearch searchVector ( 15 );
17 searchVector.displayElements();
18
19 // get input from user
20 cout << "
Please enter an integer value (-1 to quit): ";
21 cin >> searchInt; // read an int from user
22 cout << endl;
23
24 // repeatedly input an integer; -1 terminates the program
25 while ( searchInt != -1 )
26 {
27 // use binary search to try to find integer 
28 position = searchVector.binarySearch( searchInt );
29
30 // return value of -1 indicates integer was not found
31 if ( position == -1 )
32 cout << "The integer " << searchInt << " was not found.
";
33 else
34 cout << "The integer " << searchInt
35 << " was found in position " << position << ".
";
36
37 // get input from user
38 cout << "

Please enter an integer value (-1 to quit): ";
39 cin >> searchInt; // read an int from user
40 cout << endl;
41 } // end while
42
43 return 0;
44 } // end main

 26 31 33 38 47 49 49 67 73 74 82 89 90 91 95

 Please enter an integer value (-1 to quit): 38

 26 31 33 38 47 49 49 67 73 74 82 89 90 91 95
 *
 26 31 33 38 47 49 49
 *
 26 31 33 38 47 49 49 67 73 74 82 89 90 91 95

 Please enter an integer value (-1 to quit): 38

 26 31 33 38 47 49 49 67 73 74 82 89 90 91 95
 *
 26 31 33 38 47 49 49
 *
 The integer 38 was found in position 3.


 Please enter an integer value (-1 to quit): 91

 26 31 33 38 47 49 49 67 73 74 82 89 90 91 95
 *
 73 74 82 89 90 91 95
 *
 90 91 95
 *
 The integer 91 was found in position 13.


 Please enter an integer value (-1 to quit): 25

 26 31 33 38 47 49 49 67 73 74 82 89 90 91 95
 *
 26 31 33 38 47 49 49
 *
 26 31 33
 *
 26
 *
 The integer 25 was not found.


 Please enter an integer value (-1 to quit): -1

Efficiency of Binary Search

In the worst-case scenario, searching a sorted vector of 1023 elements will take only 10 comparisons when using a binary search. Repeatedly dividing 1023 by 2 (because, after each comparison, we are able to eliminate half of the vector) and rounding down (because we also remove the middle element) yields the values 511, 255, 127, 63, 31, 15, 7, 3, 1 and 0. The number 1023 (210 1) is divided by 2 only 10 times to get the value 0, which indicates that there are no more elements to test. Dividing by 2 is equivalent to one comparison in the binary search algorithm. Thus, a vector of 1,048,575 (220 1) elements takes a maximum of 20 comparisons to find the key, and a vector of over one billion elements takes a maximum of 30 comparisons to find the key. This is a tremendous improvement in performance over the linear search. For a one-billion-element vector, this is a difference between an average of 500 million comparisons for the linear search and a maximum of only 30 comparisons for the binary search! The maximum number of comparisons needed for the binary search of any sorted vector is the exponent of the first power of 2 greater than the number of elements in the vector, which is represented as log2 n. All logarithms grow at roughly the same rate, so in Big O notation the base can be omitted. This results in a Big O of O(log n) for a binary search, which is also known as logarithmic runtime and pronounced "on the order of log n" or more simply "order log n."

Introduction to Computers, the Internet and World Wide Web

Introduction to C++ Programming

Introduction to Classes and Objects

Control Statements: Part 1

Control Statements: Part 2

Functions and an Introduction to Recursion

Arrays and Vectors

Pointers and Pointer-Based Strings

Classes: A Deeper Look, Part 1

Classes: A Deeper Look, Part 2

Operator Overloading; String and Array Objects

Object-Oriented Programming: Inheritance

Object-Oriented Programming: Polymorphism

Templates

Stream Input/Output

Exception Handling

File Processing

Class string and String Stream Processing

Web Programming

Searching and Sorting

Data Structures

Bits, Characters, C-Strings and structs

Standard Template Library (STL)