Tokenizing a String

Problem

You need to break a string into pieces using a set of delimiters.

Solution

Use the find_first_of and find_first_not_of member functions of basic_string to iterate through the string, alternately locating the start of the next token and the next delimiter. Example 4-12 presents a simple StringTokenizer class that does just that.

Example 4-12. A string tokenizer

#include <cstddef>
#include <string>
#include <iostream>

using namespace std;

// String tokenizer class.
class StringTokenizer {

public:

   StringTokenizer(const string& s, const char* delim = NULL) :
      str_(s), count_(-1), begin_(0), end_(0) {

      if (!delim)
         delim_ = " \f\n\r\t\v";  // default to whitespace
      else
         delim_ = delim;

      // Point to the first token
      begin_ = str_.find_first_not_of(delim_);
      end_ = str_.find_first_of(delim_, begin_);
   }

   size_t countTokens( ) {
      if (count_ >= 0) // return if we've already counted
         return(count_);

      string::size_type n = 0;
      string::size_type i = 0;

      for (;;) {
         // advance to the first token
         if ((i = str_.find_first_not_of(delim_, i)) == string::npos)
            break;
         // advance to the next delimiter
         i = str_.find_first_of(delim_, i+1);
         n++;
         if (i == string::npos)
            break;
      }
      return (count_ = n);
   }

   // begin_ and end_ are equal only when both are npos, i.e., no tokens remain
   bool hasMoreTokens( ) {return(begin_ != end_);}

   void nextToken(string& s) {
      if (begin_ != string::npos && end_ != string::npos) {
         // token is followed by a delimiter
         s = str_.substr(begin_, end_-begin_);
         begin_ = str_.find_first_not_of(delim_, end_);
         end_ = str_.find_first_of(delim_, begin_);
      }
      else if (begin_ != string::npos && end_ == string::npos)
      {
         // last token runs to the end of the string
         s = str_.substr(begin_, str_.length( )-begin_);
         begin_ = str_.find_first_not_of(delim_, end_);
      }
   }

private:
   StringTokenizer( ) {}
   string delim_;
   string str_;
   int count_;
   string::size_type begin_;  // size_type, so comparisons with npos are exact
   string::size_type end_;
};

int main( ) {
   string s = " razzle dazzle giddyup ";
   string tmp;

   StringTokenizer st(s);

   cout << "there are " << st.countTokens( ) << " tokens.\n";
   while (st.hasMoreTokens( )) {
      st.nextToken(tmp);
      cout << "token = " << tmp << '\n';
   }
}

Discussion

Splitting a string with well-defined structure, as in Example 4-10, is nice, but it's not always that easy. Suppose that, instead of simply breaking a string into pieces at a single delimiter, you have to tokenize it. The most common incarnation of this is tokenizing on whitespace, where any run of whitespace separates two tokens and is itself ignored. Example 4-12 gives an implementation of a StringTokenizer class (like the standard Java class of the same name) for C++ that accepts delimiter characters, but defaults to whitespace.

The most important lines in StringTokenizer use basic_string's find_first_of and find_first_not_of member functions. I describe how they work and when to use them in Recipe 4.9. Example 4-12 produces this output:

there are 3 tokens.
token = razzle
token = dazzle
token = giddyup

StringTokenizer is a more flexible form of the split function in Example 4-10. It maintains state, so you can advance from one token to the next instead of parsing the input string all at once. You can also count the number of tokens.
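For contrast, here is a minimal sketch of the "all at once" approach built on the same two member functions. The function name tokenizeAll and its default delimiter set are assumptions made for illustration; they are not part of Example 4-12. The loop alternates between find_first_not_of (skip delimiters to reach a token) and find_first_of (find where that token ends).

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of Example 4-12): collect every token in s
// that is separated by one or more of the characters in delims.
std::vector<std::string> tokenizeAll(const std::string& s,
                                     const std::string& delims = " \f\n\r\t\v") {
   std::vector<std::string> tokens;

   // Skip any leading delimiters to find the start of the first token.
   std::string::size_type begin = s.find_first_not_of(delims);

   while (begin != std::string::npos) {
      // The token runs up to the next delimiter, or to the end of the string.
      std::string::size_type end = s.find_first_of(delims, begin);
      if (end == std::string::npos) {
         tokens.push_back(s.substr(begin));
         break;
      }
      tokens.push_back(s.substr(begin, end - begin));

      // Skip the run of delimiters that follows the token.
      begin = s.find_first_not_of(delims, end);
   }
   return tokens;
}
```

Calling tokenizeAll(" razzle dazzle giddyup ") yields the same three tokens as the example, but you pay for all of them up front, whereas StringTokenizer lets you stop after the first token you care about.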

There are a couple of improvements you can make on StringTokenizer. First, for simplicity, I wrote StringTokenizer to work only with string, that is, with narrow-character strings. If you want the same class to work for both narrow and wide characters, you can parameterize the character type as I have done in previous recipes. The other thing you may want to do is extend StringTokenizer to allow more friendly interaction with sequences and more extensibility. You can always write all of this yourself, or you can use an existing tokenizer class instead. The Boost project has a class named tokenizer that does this. See www.boost.org for more details.
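As a rough illustration of the first improvement, the sketch below parameterizes the character type. The class name BasicStringTokenizer is an assumption, and the interface is trimmed to the token-advancing members. One design difference from Example 4-12 is that the delimiters are a required argument here, since a default such as " \t\n" is a narrow literal and would not convert to a wide string.

```cpp
#include <string>

// Hypothetical variant of StringTokenizer (the name is an assumption) with
// the character type supplied as a template parameter.
template<typename T>
class BasicStringTokenizer {
public:
   typedef std::basic_string<T> string_type;
   typedef typename string_type::size_type size_type;

   // The delimiters have no default: a string literal has a fixed
   // character type, so the caller must pass one that matches T.
   BasicStringTokenizer(const string_type& s, const string_type& delim)
      : str_(s), delim_(delim) {
      // Point to the first token
      begin_ = str_.find_first_not_of(delim_);
      end_ = str_.find_first_of(delim_, begin_);
   }

   bool hasMoreTokens() const { return begin_ != string_type::npos; }

   // Copy the current token into s and advance to the next one.
   // Leaves s untouched if no tokens remain.
   void nextToken(string_type& s) {
      if (begin_ == string_type::npos)
         return;

      if (end_ == string_type::npos)
         s = str_.substr(begin_);               // last token: runs to the end
      else
         s = str_.substr(begin_, end_ - begin_);

      // Both searches return npos when started at npos, so this is safe
      // even after the last token.
      begin_ = str_.find_first_not_of(delim_, end_);
      end_ = str_.find_first_of(delim_, begin_);
   }

private:
   string_type str_;
   string_type delim_;
   size_type begin_;
   size_type end_;
};
```

With this in place, BasicStringTokenizer<char> st("a,b", ",") behaves like the narrow version, and BasicStringTokenizer<wchar_t> wst(L"a b", L" ") tokenizes a wide string with the same code.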