Section 10.6. Using Containers: Text-Query Program


10.6. Using Containers: Text-Query Program

To conclude this chapter, we'll implement a simple text-query program.

Our program will read a file specified by the user and then allow the user to search the file for words that might occur in it. The result of a query will be the number of times the word occurs and a list of lines on which it appears. If a word occurs more than once on the same line, our program should be smart enough to display that line only once. Lines should be displayed in ascending orderthat is, line 7 should be displayed before line 9, and so on.

Exercises Section 10.5.2

Exercise 10.26:

Write a program that populates a multimap of authors and their works. Use find to find an element in the multimap and erase that element. Be sure your program works correctly if the element you look for is not in the map.

Exercise 10.27:

Repeat the program from the previous exercise, but this time use equal_range to get iterators so that you can erase a range of elements.

Exercise 10.28:

Using the multimap from the previous exercise, write a program to generate the list of authors whose name begins with the each letter in the alphabet. Your output should look something like:

      Author Names Beginning with 'A':      Author, book, book, ...      ...      Author Names Beginning with 'B':      ... 

Exercise 10.29:

Explain the meaning of the operand pos.first->second used in the output expression of the final program in this section.


For example, we might read the file that contains the input for this chapter and look for the word "element." The first few lines of the output would be:

      element occurs 125 times         (line 62) element with a given key.         (line 64) second element with the same key.         (line 153) element |==| operator.         (line 250) the element type.         (line 398) corresponding element. 

followed by the remaining 120 or so lines in which the word "element" occurs.

10.6.1. Design of the Query Program

A good way to start the design of a program is to list the program's operations. Knowing what operations we need to provide can then help us see what data structures we'll need and how we might implement those actions. Starting from requirements, the tasks our program needs to support include:

  1. It must allow the user to indicate the name of a file to process. The program will store the contents of the file so that we can display the original line in which each word appears.

  2. The program must break each line into words and remember all the lines in which each word appears. When it prints the line numbers, they should be presented in ascending order and contain no duplicates.

  3. The result of querying for a particular word should be the line numbers on which that word appeared.

  4. To print the text in which the word appeared, it must be possible to fetch the line from the input file corresponding to a given line number.

Data Structure

We'll implement our program as a simple class that we'll name TextQuery. Our requirements can be met quite neatly by using various containers:

  1. We'll use a vector<string> to store a copy of the entire input file. Each line in the input file will be an element in this vector. That way, when we want to print a line, we can fetch it by using the line number as an index.

  2. We'll store each word's line numbers in a set. Using a set will guarantee that there is only one entry per line and that the line numbers will be automatically stored in ascending order.

  3. We'll use a map to associate each word with the set of line numbers on which the word appears.

Our class will have two data members: a vector to hold the input file and a map to associate each input word with the set of line numbers on which it appears.

Operations

The requirements also lead fairly directly to an interface for our class. However, we have one important design decision to make first: The function that does the query will need to return a set of line numbers. What type should it use to do so?

We can see that doing the query will be simple: We'll index into the map to obtain the associated set. The only question is how to return the set that we find. The safest design is to return a copy of that set. However, doing so means that each element in the set must be copied. Copying the set could be expensive if we process a very large file. Other possible return values are a pair of iterators into the set, or a const reference to the set. For simplicity, we'll return a copy, noting that this decision is one that we might have to revisit if the copy is too expensive in practice.

The first, third, and fourth tasks are actions programmers using our class will perform. The second task is internal to the class. Mapping these tasks to member functions, we'll have three public functions in the class interface:

  • read_file takes an ifstream&, which it reads a line at a time, storing the lines in the vector. Once it has read all the input, read_file will create the map that associates each word to the line numbers on which it appears.

  • run_query takes a string and returns the set of line numbers on which that string appears.

  • text_line takes a line number and returns the corresponding text for that line from the input file.

Neither run_query nor text_line changes the object on which it runs, so we'll define these operations as const member functions (Section 7.7.1, p. 260).

To do the work of read_file, we'll also define two private functions to read the input file and build the map:

  • store_file will read the file and store the data in our vector.

  • build_map will break each line into words and build the map, remembering the line number on which each word appeared.

10.6.2. TextQuery Class

Having worked through our design, we can now write our TextQuery class:

      class TextQuery {      public:          // typedef to make declarations easier          typedef std::vector<std::string>::size_type line_no;          /* interface:           * read_file builds internal data structures for the given file           * run_query finds the given word and returns set of lines on which it appears           * text_line returns a requested line from the input file           */          void read_file(std::ifstream &is)                     { store_file(is); build_map(); }          std::set<line_no> run_query(const std::string&) const;          std::string text_line(line_no) const;      private:          // utility functions used by read_file          void store_file(std::ifstream&); // store input file          void build_map(); // associated each word with a set of line numbers          // remember the whole input file          std::vector<std::string> lines_of_text;          // map word to set of the lines on which it occurs          std::map< std::string, std::set<line_no> > word_map;      }; 

The class directly reflects our design decisions. The only part we hadn't described is the typedef that defines an alias for size_type of vector.

For the reasons described on page 80, this class definition uses fully qualified std:: names for all references to library entities.



The read_file function is defined inside the class. It calls store_file to read and store the input file and build_map to build the map from words to line numbers. We'll define the other functions in Section 10.6.4 (p. 385). First, we'll write a program that uses this class to solve our text-query problem.

Exercises Section 10.6.2

Exercise 10.30:

The member functions of TextQuery use only capabilities that we have already covered. Without looking ahead, write your own versions of the member functions. Hint: The only tricky part is what to return from run_query if the line number set is empty. The solution is to construct and return a new (temporary) set.


10.6.3. Using the TextQuery Class

The following main program uses a TextQuery object to perform a simple query session with the user. Most of the work in this program involves managing the interaction with the user: prompting for the next search word and calling the print_results functionwhich we shall write nextto print the results.

      // program takes single argument specifying the file to query      int main(int argc, char **argv)      {          // open the file from which user will query words          ifstream infile;          if (argc < 2 || !open_file(infile, argv[1])) {              cerr << "No input file!" << endl;              return EXIT_FAILURE;          }          TextQuery tq;          tq.read_file(infile); // builds query map          // iterate with the user: prompt for a word to find and print results          // loop indefinitely; the loop exit is inside the while          while (true) {              cout << "enter word to look for, or q to quit: ";              string s;              cin >> s;              // stop if hit eof on input or a 'q'is entered              if (!cin || s == "q") break;              // get the set of line numbers on which this word appears              set<TextQuery::line_no> locs = tq.run_query(s);              // print count and all occurrences, if any              print_results(locs, s, tq);           }          return 0;      } 

Preliminaries

This program checks that argv[1] is valid and then uses the open_file function (Section 8.4.3, p. 299) to open the file we're given as an argument to main. We test the stream to determine whether the input file is okay. If not, we generate an appropriate message and exit, returning EXIT_FAILURE (Section 7.3.2, p. 247) to indicate that an error occurred.

Once we have opened the file, it is a simple matter to build up the map that will support queries. We define a local variable named tq to hold the file and associated data structures:

      TextQuery tq;      tq.read_file(infile);   builds query map 

We call the read_file operation on tq, passing it the file opened by open_file.

After read_file completes, tq holds our two data structures: the vector that corresponds to the input file and the map from word to set of line numbers. That map contains an entry for each unique word in the input file. The map associates with each word the set of line numbers on which that word appeared.

Doing the Search

We want the program to let the user look for more than one word in each session, so we wrap the prompt in a while loop:

       // iterate with the user: prompt for a word to find and print results       // loop indefinitely; the loop exit is inside the while        while (true) {           cout << "enter word to look for, or q to quit: ";           string s;           cin >> s;           // stop if hit eof on input or a 'q' is entered           if (!cin || s == "q") break;           // get the set of line numbers on which this word appears           set<TextQuery::line_no> locs = tq.run_query(s);           // print count and all occurrences, if any           print_results(locs, s, tq);       } 

The test in the while is the boolean literal true, which means that the test always succeeds. We exit the loop through the break after the test on cin and the value read into sought. The loop exits when cin hits an error or end-of-file or when the user enters a 'q' to quit.

Once we have a word to look for, we ask tq for the set of line numbers on which that word appears. We pass that set along with the word we are looking for and the TextQuery object to the print_results function. That function will write the output of our program.

Printing the Results

What remains is to define the print_results function:

       void print_results(const set<TextQuery::line_no>& locs,                          const string& sought, const TextQuery &file)       {           // if the word was found, then print count and all occurrences           typedef set<TextQuery::line_no> line_nums;           line_nums::size_type size = locs.size();           cout << "\n" << sought << " occurs "                << size << " "                << make_plural(size, "time", "s") << endl;           // print each line in which the word appeared           line_nums::const_iterator it = locs.begin();           for ( ; it != locs.end(); ++it) {               cout << "\t(line "                    // don't confound user with text lines starting at 0                    << (*it) + 1 << ") "                    << file.text_line(*it) << endl;           }      } 

The function starts by defining a typedef to simplify the use of the line number set. Our output first reports how many matches were found, which we know from the size of the set. We call make_plural (Section 7.3.2, p. 248) to print time or times, depending on whether that size is equal to one.

The messiest part of the program is the for loop that processes locs to print the line numbers on which the word was found. The only subtlety here is remembering to change the line numbers into more human-friendly counting. When we stored the text, we stored the first line as line number zero, which is consistent with how C++ containers and arrays are numbered. Most users think of the first line as line number 1, so we systematically add one to our stored line numbers to convert to this more common notation.

Exercises Section 10.6.3

Exercise 10.31:

What is the output of main if we look for a word that is not found?


10.6.4. Writing the Member Functions

What remains is to write the definitions of the member functions that were not defined inside the class.

Storing the Input File

Our first task is to read the file that our user wishes to query. Using string and vector operations, this task is handled easily:

      // read input file: store each line as element in lines_of_text      void TextQuery::store_file(ifstream &is)      {          string textline;          while (getline(is, textline))             lines_of_text.push_back(textline);      } 

Because we want to store the file a line at a time, we use getline to read our input. We push each line we read onto the lines_of_text vector.

Building the Word map

Each element in the vector holds a line of text. To build the map from words to line numbers, we need to break each line into its individual words. We again use an istringstream in ways outlined in the program on page 300:

       // finds whitespace-separated words in the input vector       // and puts the word in word_map along with the line number       void TextQuery::build_map()       {           // process each line from the input vector           for (line_no line_num = 0;                        line_num != lines_of_text.size();                        ++line_num)           {               //we'll use line to read the text a word at a time               istringstream line(lines_of_text[line_num]);               string word;               while (line >> word)                   // add this line number to the set;                   // subscript will add word to the map if it's not already there                   word_map[word].insert(line_num);           }      } 

The for loop marches through lines_of_text a line at a time. We start by binding an istringstream object named line to the current line and use the istringstream input operator to read each word on the line. Remember that that operator, like the other istream operators, ignores whitespace. Thus, the while reads each whitespace-separated word from line into word.

The last part of this function is similar to our word-count programs. We use word to subscript the map. If word was not already present, then the subscript operator adds word to the word_map, giving it an inital value that is the empty set. Regardless of whether word was added, the return value from the subscript is a set. We then call insert to add the current line number. If the word occurs more than once in the same line, then the call to insert does nothing.

Supporting Queries

The run_query function handles the actual queries:

      set<TextQuery::line_no>      TextQuery::run_query(const string &query_word) const      {          //Note: must use find and not subscript the map directly          //to avoid adding words to word_map!          map<string, set<line_no> >::const_iterator                                loc = word_map.find(query_word);      if (loc == word_map.end())          return set<line_no>(); // not found, return empty set      else          // fetch and return set of line numbers for this word          return loc->second;      } 

The run_query function takes a reference to a const string and uses that value to index the word_map. Assuming the string is found, it returns the set of line numbers associated with the string. Otherwise, it returns an empty set.

Using the Return from run_query

Once we've run the run_query function, we get back a set of line numbers on which the word we sought appears. In addition to printing how many times each word appears, we also want to print the line on which the word appeared. The text_line function lets us do so:

    string TextQuery::text_line(line_no line) const    {        if (line < lines_of_text.size())            return lines_of_text[line];        throw std::out_of_range("line number out of range");    } 

This function takes a line number and returns the input text line corresponding to that line number. Because the code using our TextQuery class cannot do so lines_of_text is privatewe first check that the line we are asked for is in range. If it is, we return the corresponding line. If it is not, we tHRow an out_of_range exception.

Exercises Section 10.6.4

Exercise 10.32:

Reimplement the text-query program to use a vector instead of a set to hold the line numbers. Note that because lines appear in ascending order, we can append a new line number to the vector only if the last element of the vector isn't that line number. What are the performance and design characteristics of the two solutions? Which do you feel is the preferred design solution? Why?

Exercise 10.33:

Why doesn't the TextQuery::text_line function check whether its argument is negative?




C++ Primer
C Primer Plus (5th Edition)
ISBN: 0672326965
EAN: 2147483647
Year: 2006
Pages: 223
Authors: Stephen Prata

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net