Usage | Beyond the C++ Standard Library: An Introduction to Boost

To begin using Boost.Regex, you need to include the header "boost/regex.hpp". Regex is one of the two libraries covered in this book (the other is Boost.Signals) that need to be separately compiled. You'll be glad to know that after you've built Boost (a one-liner from the command prompt), linking is automatic (for Windows-based compilers anyway), so you're relieved from the tedium of figuring out which lib file to use.

The first thing you need to do is to declare a variable of type basic_regex (the type boost::regex used throughout this section is a typedef for basic_regex<char>). This is one of the core classes in the library, and it's the one that stores the regular expression. Creating one is simple; just pass a string containing the regular expression you want to use to the constructor.

 boost::regex reg("(A.*)");

This regular expression contains three interesting features of regular expressions. The first is the enclosing of a subexpression within parentheses; this makes it possible to refer to that subexpression later on in the same regular expression or to extract the text that matches it. We'll talk about this in detail later on, so don't worry if you don't yet see how that's useful. The second feature is the wildcard character, the dot. The wildcard has a very special meaning in regular expressions; it matches any character. Finally, the expression uses a repeat, *, called the Kleene star, which means that the preceding expression may match zero or more times. This regular expression is ready to be used in one of the algorithms, like so:

 bool b=boost::regex_match(
   "This expression could match from A and beyond.",
   reg);

As you can see, you pass the regular expression and the string to be parsed to the algorithm regex_match. The result of calling the function is true if there is an exact match for the regular expression; otherwise, it is false. In this case, the result is false, because regex_match only returns true when all of the input data is successfully matched by the regular expression. Do you see why that's not the case for this code? Look again at the regular expression. The first character is a capital A, so that's obviously the first character that could ever match the expression. So, a part of the input ("A and beyond.") does match the expression, but it does not exhaust the input. Let's try another input string.

 bool b=boost::regex_match(
   "As this string starts with A, does it match? ",
   reg);

This time, regex_match returns true. When the regular expression engine matches the A, it then goes on to see what should follow. In our regex, A is followed by the wildcard, to which we have applied the Kleene star, meaning that any character may match any number of times. Thus, the engine consumes the rest of the input string, and the match covers all of the input.

Next, let's see how we can put regexes and regex_match to work with data validation.

Validating Input

A common scenario where regular expressions are used is in validating the format of input data. Applications often require that input adhere to a certain structure. Consider an application that accepts input that must come in the form "3 digits, a word, any character, 2 digits or the string "N/A," a space, then the first word again." Coding such validations manually is both tedious and error prone, and furthermore, these formats are typically exposed to changing requirements; before you know it, some variation of the format needs to be supported, and your carefully crafted parser suddenly needs to be changed and debugged. Let's assemble a regular expression that can validate such input correctly. First, we need an expression that matches exactly 3 digits. There's a special shortcut for digits, \d, that we'll use. To have it repeated 3 times, there's a special kind of repeat called the bounds operator, which encloses the bounds in curly braces. Putting these two together, here's the first part of our regular expression.

 boost::regex reg("\\d{3}");

Note that we need to escape the escape character, so the shortcut \d becomes \\d in our string. This is because the compiler consumes the first backslash as an escape character; we need to escape the backslash so a backslash actually appears in the regular expression string.

Next, we need a way to define a word; that is, a sequence of characters, ended by any character that is not a letter. There is more than one way of accomplishing this, but we will do it using the regular expression features character classes (also called character sets) and ranges. A character class is an expression enclosed in square brackets. For example, a character class that matches any one of the characters a, b, and c, looks like this: [abc]. Using a range to accomplish the same thing, we write it like so: [a-c]. For a character class that encompasses all letters, we could go slightly crazy and write it like [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ], but we won't; we'll use ranges instead: [a-zA-Z]. It should be noted that using ranges like this can make one dependent on the locale that is currently in use, if the basic_regex::collate flag is turned on for the regular expression. Using these tools and the repeat +, which means that the preceding expression can be repeated, but must exist at least once, we're now ready to describe a word.

 boost::regex reg("[a-zA-Z]+");

That regular expression works, but because it is so common, there is an even simpler way to represent a word character: \w. That shortcut matches any single word character (which includes digits and the underscore as well as letters), not just the ASCII ones, so \w+ is not only shorter than [a-zA-Z]+, it is better for internationalization purposes. The next character should be exactly one of any character, which we know is the purpose of the dot.

 boost::regex reg(".");

The next part of the input is 2 digits or the string "N/A." To match that, we need to use a feature called alternatives. Alternatives match one of two or more subexpressions, with each alternative separated from the others by |. Here's how it looks:

 boost::regex reg("(\\d{2}|N/A)");

Note that the expression is enclosed in parentheses, to make sure that the full expressions are considered as the two alternatives. Adding a space to the regular expression is simple; there's a shortcut for it: \s. Putting together everything we have so far gives us the following expression:

 boost::regex reg("\\d{3}[a-zA-Z]+.(\\d{2}|N/A)\\s");

Now things get a little trickier. We need a way to validate that the next word in the input data exactly matches the first word (the one we capture using the expression [a-zA-Z]+). The key to accomplish this is to use a back reference, which is a reference to a previous subexpression. For us to be able to refer to the expression [a-zA-Z]+, we must first enclose it in parentheses. That makes the expression ([a-zA-Z]+) the first subexpression in our regular expression, and we can therefore create a back reference to it using the index 1.

That gives us the full regular expression for "3 digits, a word, any character, 2 digits or the string "N/A," a space, then the first word again":

 boost::regex reg("\\d{3}([a-zA-Z]+).(\\d{2}|N/A)\\s\\1");

Good work! Here's a simple program that makes use of the expression with the algorithm regex_match, validating two sample input strings.

 #include <iostream>
 #include <cassert>
 #include <string>
 #include "boost/regex.hpp"

 int main() {
   // 3 digits, a word, any character, 2 digits or "N/A",
   // a space, then the first word again
   boost::regex reg("\\d{3}([a-zA-Z]+).(\\d{2}|N/A)\\s\\1");

   std::string correct="123Hello N/A Hello";
   std::string incorrect="123Hello 12 hello";

   assert(boost::regex_match(correct,reg)==true);
   assert(boost::regex_match(incorrect,reg)==false);
 }

The first string, 123Hello N/A Hello, is correct; 123 is 3 digits, Hello is a word, followed by any character (a space), then N/A, then another space, and finally the word Hello is repeated. The second string is incorrect, because the word Hello is not repeated exactly. By default, regular expressions are case-sensitive, and the back reference therefore does not match.

One of the keys in crafting regular expressions is successfully decomposing the problem. When looking at the final expression that you just created, it can seem quite intimidating to the untrained eye. However, when decomposing the expression into smaller components, it's not very complicated at all.
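The decomposition can be made visible in code by naming each piece before joining them. The following is a sketch rather than the program shown earlier: it assumes the standard <regex> header (std::regex, whose interface descends from Boost.Regex) so that it compiles without a Boost build, and the names validation_pattern and is_valid are hypothetical helpers introduced only for illustration.

```cpp
#include <regex>
#include <string>

// Assemble the validation expression from its named parts.
// (Hypothetical helper; std::regex stands in for boost::regex here.)
std::string validation_pattern() {
  const std::string three_digits = "\\d{3}";
  const std::string word         = "([a-zA-Z]+)";   // subexpression 1
  const std::string any_char     = ".";
  const std::string two_or_na    = "(\\d{2}|N/A)";  // subexpression 2
  const std::string space        = "\\s";
  const std::string same_word    = "\\1";           // back reference to 1
  return three_digits + word + any_char + two_or_na + space + same_word;
}

// True only if the whole input conforms to the format.
bool is_valid(const std::string& input) {
  static const std::regex reg(validation_pattern());
  return std::regex_match(input, reg);
}
```

Calling is_valid with the two sample strings from the program above should accept the first and reject the second, and a changed requirement then touches only one named piece.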

Searching

We shall now take a look at another of Boost.Regex's algorithms, regex_search. The difference from regex_match is that regex_search does not require that all of the input data matches, but only that part of it does. For this exposition, consider the problem of a programmer who suspects that he has forgotten one or two calls to delete in his program. Although he realizes that it's by no means a foolproof test, he decides to count the number of occurrences of new and delete and see if the numbers add up. The regular expression is very simple; we have two alternatives, new and delete.

 boost::regex reg("(new)|(delete)");

There are two reasons for us to enclose the subexpressions in parentheses: one is that we must do so in order to form the two groups for our alternatives. The other reason is that we will want to refer to these subexpressions when calling regex_search, to enable us to determine which of the alternatives was actually matched. We will use an overload of regex_search that also accepts an argument of type match_results. When regex_search performs its matching, it reports subexpression matches through an object of type match_results. The class template match_results is parameterized on the type of iterator that applies to the input sequence.

 template <class Iterator,
   class Allocator=std::allocator<sub_match<Iterator> > >
     class match_results;
 typedef match_results<const char*> cmatch;
 typedef match_results<const wchar_t*> wcmatch;
 typedef match_results<std::string::const_iterator> smatch;
 typedef match_results<std::wstring::const_iterator> wsmatch;

We will use std::string, and are therefore interested in the typedef smatch, which is short for match_results<std::string::const_iterator>. When regex_search returns true, the reference to match_results that is passed to the function contains the results of the subexpression matches. Within match_results, there are indexed sub_matches for each of the subexpressions in the regular expression. Let's see what we have so far that can help our confused programmer assess the calls to new and delete.

 boost::regex reg("(new)|(delete)");
 boost::smatch m;
 std::string s=
   "Calls to new must be followed by delete. \
   Calling simply new results in a leak!";

 if (boost::regex_search(s,m,reg)) {
   // Did new match?
   if (m[1].matched)
     std::cout << "The expression (new) matched!\n";
   if (m[2].matched)
     std::cout << "The expression (delete) matched!\n";
 }

The preceding program searches the input string for new or delete, and reports which one it finds first. By passing an object of type smatch to regex_search, we gain access to the details of how the algorithm succeeded. In our expression, there are two subexpressions, and we can thus get to the subexpression for new by the index 1 of match_results. We then hold an instance of sub_match, which contains a Boolean member, matched, that tells us whether the subexpression participated in the match. So, given the preceding input, running this code would output "The expression (new) matched!\n". Now, you still have some more work to do. You need to continue applying the regular expression to the remainder of the input, and to do that, you use another overload of regex_search, which accepts two iterators denoting the character sequence to search. Because std::string is a container, it provides iterators. Now, for each match, you must update the iterator denoting the beginning of the range to refer to the end of the previous match. Finally, add two variables to hold the counts for new and delete. Here's the complete program:

 #include <iostream>
 #include <string>
 #include "boost/regex.hpp"

 int main() {
   // Are there equally many occurrences of
   // "new" and "delete"?
   boost::regex reg("(new)|(delete)");
   boost::smatch m;
   std::string s=
     "Calls to new must be followed by delete. \
      Calling simply new results in a leak!";
   int new_counter=0;
   int delete_counter=0;
   std::string::const_iterator it=s.begin();
   std::string::const_iterator end=s.end();

   while (boost::regex_search(it,end,m,reg)) {
     // New or delete?
     m[1].matched ? ++new_counter : ++delete_counter;
     it=m[0].second;
   }

   if (new_counter!=delete_counter)
     std::cout << "Leak detected!\n";
   else
     std::cout << "Seems ok...\n";
 }

Note that the program always sets the iterator it to m[0].second. match_results[0] returns a reference to the submatch that matched the whole regular expression, so we can be sure that the end of that match is always the correct location to start the next run of regex_search. Running this program outputs "Leak detected!", because there are two occurrences of new, and only one of delete. Of course, one variable could be deleted twice, there could be calls to new[] and delete[], and so forth.

By now, you should have a good understanding of how subexpression grouping works. It's time to move on to the final algorithm in Boost.Regex, one that is used to perform substitutions.

Replacing

The third in the family of Regex algorithms is regex_replace. As the name implies, it's used to perform text substitutions. It searches through the input data, finding all matches to the regular expression. For each match of the expression, the algorithm calls match_results::format and outputs the result to an output iterator that is passed to the function.

In the introduction to this chapter, I gave you the example of changing the British spelling of colour to the U.S. spelling of color. Changing the spelling without using regular expressions is very tedious, and extremely error prone. The problem is that there might be different capitalization, and a lot of words that are affected; for example, colourize. To properly attack this problem, we need to split the regular expression into three subexpressions.

 boost::regex reg("(Colo)(u)(r)",
   boost::regex::icase|boost::regex::perl);

We have isolated the villain (the letter u) in order to surgically remove it from any matches. Also note that this regex is case-insensitive, which we achieve by passing the format flag boost::regex::icase to the constructor of regex. Note that you must also pass any other flags that you want to be in effect. A common user error when setting format flags is to omit the ones that regex turns on by default, but that doesn't work; you must always apply all of the flags that should be set.

When calling regex_replace, we are expected to provide a format string as an argument. This format string determines how the substitution will work. In the format string, it's possible to refer to subexpression matches, and that's precisely what we need here. You want to keep the first matched subexpression, and the third, but let the second (u), silently disappear. The expression $N, where N is the index of a subexpression, expands to the match for that subexpression. So our format string becomes "$1$3", which means that the replacement text is the result of the first and the third subexpressions. By referring to the subexpression matches, we are able to retain any capitalization in the matched text, which would not be possible if we were to use a string literal as the replacement text. Here's a complete program that solves the problem.

 #include <iostream>
 #include <string>
 #include "boost/regex.hpp"

 int main() {
   boost::regex reg("(Colo)(u)(r)",
     boost::regex::icase|boost::regex::perl);

   std::string s="Colour, colours, color, colourize";
   s=boost::regex_replace(s,reg,"$1$3");
   std::cout << s;
 }

The output of running this program is "Color, colors, color, colorize". regex_replace is enormously useful for applying substitutions like this.

A Common User Misunderstanding

One of the most common questions that I see about Boost.Regex concerns the semantics of regex_match. It's easy to forget that all of the input to regex_match must match the regular expression. Thus, users often think that code like the following should yield true.

 boost::regex reg("\\d*");
 bool b=boost::regex_match("17 is prime",reg);

Rest assured that this call never results in a successful match. All of the input must be consumed for regex_match to return true! Almost all of the users asking why this doesn't work should use regex_search rather than regex_match.

 boost::regex reg("\\d*");
 bool b=boost::regex_search("17 is prime",reg);

This most definitely yields true. It is worth noting that it's possible to make regex_search behave like regex_match, using special buffer operators. \A matches the start of a buffer, and \Z matches the end of a buffer, so if you put \A first in your regular expression, and \Z last, you'll make regex_search behave like regex_match; that is, it must consume all input for a successful match. (The current Boost.Regex documentation describes \Z as also accepting trailing newlines before the end of the buffer, whereas \z matches at the very end only.) The following regular expression always requires that the input be exhausted, regardless of whether you are using regex_match or regex_search.

 boost::regex reg("\\A\\d*\\Z");

Please understand that this does not imply that regex_match should not be used; on the contrary, it should be a clear indication that the semantics we just talked about (that all of the input must be consumed) are in effect.

About Repeats and Greed

Another common source of confusion is the greediness of repeats. Some of the repeats (for example, + and *) are greedy. This means that they will consume as much of the input as they possibly can. It's not uncommon to see regular expressions such as the following, with the intent of capturing a digit after a greedy repeat is applied.

 boost::regex reg("(.*)(\\d{2})");

This regular expression succeeds, but it might not match the subexpressions that you think it should! The expression .* happily eats everything that following subexpressions don't match. Here's a sample program that exhibits this behavior:

 #include <iostream>
 #include "boost/regex.hpp"

 int main() {
   boost::regex reg("(.*)(\\d{2})");
   boost::cmatch m;
   const char* text = "Note that I'm 31 years old, not 32.";
   if(boost::regex_search(text,m, reg)) {
     if (m[1].matched)
       std::cout << "(.*) matched: " << m[1].str() << '\n';
     if (m[2].matched)
       std::cout << "Found the age: " << m[2] << '\n';
   }
 }

In this program, we are using another parameterization of match_results, through the type cmatch. It is a typedef for match_results<const char*>, and the reason we must use it rather than the type smatch we've been using before is that we're now calling regex_search with a string literal rather than an object of type std::string. What do you expect the output of running this program to be? Typically, users new to regular expressions first think that both m[1].matched and m[2].matched will be true, and that the result of the second subexpression will be "31". Next, after realizing the effects of greedy repeats (that they consume as much input as possible), they tend to think that only the first subexpression can be true; that is, the .* has successfully eaten all of the input. Finally, new users come to the conclusion that the expression will match both subexpressions, but that the second expression will match the last possible sequence. Here, that means that the first subexpression will match "Note that I'm 31 years old, not " (including the trailing space) and the second will match "32".

So, what do you do when what you actually want is a repeat followed by the first occurrence of another subexpression? Use non-greedy repeats. By appending ? to the repeat, it becomes non-greedy. This means that the expression tries to find the shortest possible match that doesn't prevent the rest of the expression from matching. So, to make the previous regex work correctly, we need to update it like so.

 boost::regex reg("(.*?)(\\d{2})");

If we change the program to use this regular expression, both m[1].matched and m[2].matched will still be true. The expression .*? consumes as little of the input as it can, which means that it stops at the first character 3, because that's what the expression needs in order to successfully match. Thus, the first subexpression matches "Note that I'm " (including the trailing space) and the second matches "31".
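To compare the greedy and non-greedy forms side by side, a small helper can return what the two subexpressions captured. This is only a sketch: it assumes the standard <regex> header (std::regex and std::smatch in place of their boost:: counterparts) so that it builds without Boost, and capture_two is a hypothetical name introduced for illustration.

```cpp
#include <regex>
#include <string>
#include <utility>

// Returns the text captured by subexpressions 1 and 2, or two empty
// strings if the pattern does not match anywhere in the input.
// (Hypothetical helper; std::regex stands in for boost::regex here.)
std::pair<std::string, std::string> capture_two(const std::string& pattern,
                                                const std::string& input) {
  const std::regex reg(pattern);
  std::smatch m;
  if (std::regex_search(input, m, reg))
    return std::make_pair(m[1].str(), m[2].str());
  return std::make_pair(std::string(), std::string());
}
```

With the sentence "Note that I'm 31 years old, not 32." as input, the pattern "(.*)(\\d{2})" should report "32" as the second capture, and "(.*?)(\\d{2})" should report "31".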

A Look at regex_iterator

We have seen how to use several calls to regex_search in order to process all of an input sequence, but there's another, more elegant way of doing that, using a regex_iterator. This iterator type enumerates all of the regular expression matches in a sequence. Dereferencing a regex_iterator yields a reference to an instance of match_results. When constructing a regex_iterator, you pass to it the iterators denoting the input sequence, and the regular expression to apply. Let's look at an example where we have input data that is a comma-separated list of integers. The regular expression is simple.

 boost::regex reg("(\\d+),?");

Adding the repeat ? (match zero or one times) to the end of the regular expression ensures that the last number will be successfully parsed, even if the input sequence does not end with a comma. Further, we are using another repeat, +. This repeat ensures that the expression matches one or more times. Now, rather than doing multiple calls to regex_search, we create a regex_iterator, call the algorithm for_each, and supply it with a function object to call with the result of dereferencing the iterator. Here's a function object that accepts any form of match_results due to its parameterized function call operator. All it does is add the value of the current match to a total (in our regular expression, the first subexpression is the one we're interested in).

 // Requires <cstdlib> for atoi
 class regex_callback {
   int sum_;
 public:
   regex_callback() : sum_(0) {}

   template <typename T> void operator()(const T& what) {
     sum_+=atoi(what[1].str().c_str());
   }

   int sum() const {
     return sum_;
   }
 };

You now pass an instance of this function object to std::for_each, which results in an invocation of the function call operator for every dereference of the iterator it; that is, it is invoked every time there is a match of the regex.

 // Requires <algorithm>, <string>, and "boost/regex.hpp"
 int main() {
   boost::regex reg("(\\d+),?");
   std::string s="1,1,2,3,5,8,13,21";
   boost::sregex_iterator it(s.begin(),s.end(),reg);
   boost::sregex_iterator end;

   regex_callback c;
   int sum=std::for_each(it,end,c).sum();
 }

As you can see, the past-the-end iterator passed to for_each is simply a default-constructed instance of regex_iterator. Also, the type of it and end is boost::sregex_iterator, which is a typedef for regex_iterator<std::string::const_iterator>. Using regex_iterator this way is a much cleaner way of matching multiple times than what we did previously, where we manually had to advance the starting iterator and call regex_search in a loop.

Splitting Strings with regex_token_iterator

Another iterator type, or to be more precise, an iterator adaptor, is boost::regex_token_iterator. It is similar to regex_iterator, but may also be employed to enumerate each character sequence that does not match the regular expression, which is useful for splitting strings. It is also possible to select which subexpressions are of interest, so that when dereferencing the regex_token_iterator, only the subexpressions that are "subscribed to" are returned. Consider an application that receives input data where the entries are separated using a forward slash. Anything in between constitutes an item that the application needs to process. With regex_token_iterator, splitting the strings is easy. The regular expression is very simple.

 boost::regex reg("/");

The regex matches the separator of items. To use it for splitting the input, simply pass the special index -1 to the constructor of regex_token_iterator. Here is the complete program:

 // Requires <algorithm>, <cassert>, <string>, <vector>, and "boost/regex.hpp"
 int main() {
   boost::regex reg("/");
   std::string s="Split/Values/Separated/By/Slashes,";
   std::vector<std::string> vec;
   boost::sregex_token_iterator it(s.begin(),s.end(),reg,-1);
   boost::sregex_token_iterator end;
   while (it!=end)
     vec.push_back(*it++);

   assert(vec.size()==std::count(s.begin(),s.end(),'/')+1);
   assert(vec[0]=="Split");
 }

Similar to regex_iterator, regex_token_iterator is a template class parameterized on the iterator type for the sequence it wraps. Here, we're using sregex_token_iterator, which is a typedef for regex_token_iterator<std::string::const_iterator>. Each time the iterator it is dereferenced, it returns the current sub_match, and when the iterator is advanced, it tries to match the regular expression again. These two iterator types, regex_iterator and regex_token_iterator, are very useful; you'll know that you need them when you find yourself about to call regex_search multiple times!
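The other use mentioned above, subscribing to a particular subexpression, can be sketched by revisiting the comma-separated integers. Passing the index 1 instead of -1 makes each dereference yield the digits captured by the first subexpression. The sketch assumes the standard <regex> header (std::sregex_token_iterator in place of boost::sregex_token_iterator) so that it builds without Boost; sum_numbers is a hypothetical name introduced for illustration.

```cpp
#include <regex>
#include <string>

// Sums a comma-separated list of integers by subscribing to
// subexpression 1 of each match (the digits, without the comma).
// (Hypothetical helper; std::regex stands in for boost::regex here.)
int sum_numbers(const std::string& s) {
  static const std::regex reg("(\\d+),?");
  std::sregex_token_iterator it(s.begin(), s.end(), reg, 1);
  const std::sregex_token_iterator end;
  int sum = 0;
  for (; it != end; ++it)
    sum += std::stoi(it->str());
  return sum;
}
```

For the input "1,1,2,3,5,8,13,21" used earlier, the helper should return 54, without a separate function object.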

More Regular Expressions

You have already seen quite a lot of regular expression syntax, but there's still more to know. This section quickly demonstrates the uses of some of the remaining functionality that is useful in your everyday regular expressions. To begin, we will look at the whole set of repeats; we've already looked at *, +, and bounded repeats using {}. There's one more repeat, and that's ?. You may have noted that it is also used to declare non-greedy repeats, but by itself, it means that the expression must occur zero or one times. It's also worth mentioning that the bounded repeats are very flexible; here are three different ways of using them:

 boost::regex reg1("\\d{5}");
 boost::regex reg2("\\d{2,4}");
 boost::regex reg3("\\d{2,}");

The first regex matches exactly 5 digits. The second matches 2, 3, or 4 digits. The third matches 2 or more digits, without an upper limit.
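A quick way to convince yourself of these bounds is to try each expression against inputs of different lengths. The sketch below assumes the standard <regex> header (std::regex in place of boost::regex) so that it builds without Boost, and matches_fully is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// True if the entire input matches the pattern.
// (Hypothetical helper; std::regex stands in for boost::regex here.)
bool matches_fully(const std::string& pattern, const std::string& input) {
  const std::regex reg(pattern);
  return std::regex_match(input, reg);
}
```

For example, matches_fully("\\d{2,4}", "123") should be true, whereas matches_fully("\\d{2,4}", "12345") should be false because the upper bound is exceeded.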

Another important regular expression feature is the negated character class, formed with the metacharacter ^. You use it to form character classes that match any character that is not part of the character class; that is, the complement of the elements you list in the character class. For example, consider this regular expression.

 boost::regex reg("[^13579]");

It contains a negated character class that matches any character that is not one of the odd digits. Take a look at the following short program, and try to figure out what the output will be.

 int main() {
   boost::regex reg4("[^13579]");
   std::string s="0123456789";
   boost::sregex_iterator it(s.begin(),s.end(),reg4);
   boost::sregex_iterator end;
   while (it!=end)
     std::cout << *it++;
 }

Did you figure it out? The output is "02468"; that is, all of the even digits. Note that this character class does not only match even digits; had the input string been "AlfaBetaGamma," that would have matched just fine too.

The metacharacter we've just seen, ^, serves another purpose too. It is used to denote the beginning of a line. The metacharacter $ denotes the end of a line.
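The effect of the two anchors is easiest to see by searching the same input with and without them. The sketch below assumes the standard <regex> header (std::regex in place of boost::regex) so that it builds without Boost, and it assumes single-line input, where the beginning and end of the line coincide with the beginning and end of the string. The function names are hypothetical.

```cpp
#include <regex>
#include <string>

// True if a run of digits occurs somewhere in the input.
// (Hypothetical helper; std::regex stands in for boost::regex here.)
bool contains_digits(const std::string& input) {
  static const std::regex reg("\\d+");
  return std::regex_search(input, reg);
}

// True only if the digits span the input from beginning to end
// (assumes the input is a single line).
bool only_digits(const std::string& input) {
  static const std::regex reg("^\\d+$");
  return std::regex_search(input, reg);
}
```

For "12a45", contains_digits should succeed while only_digits should fail, because the anchors leave no room for the letter.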

Bad Regular Expressions

A bad regular expression is one that doesn't conform with the rules that govern regexes. For example, if you happen to forget a closing parenthesis, there's no way the regular expression engine can successfully compile the regular expression. When that happens, an exception of type bad_expression is thrown. As I mentioned before, this name will change in the next version of Boost.Regex, and in the version that's going to be added to the Library Technical Report. The exception type bad_expression will be renamed to regex_error.

If all of your regular expressions are hardcoded into your application, you may be safe from having to deal with bad expressions, but if you're accepting user input in the form of regexes, you must be prepared to handle errors. Here's a program that prompts the user to enter a regular expression, followed by a string to be matched against the regex. As always, when there's user input involved, there's a chance that the input will be invalid.

 #include <iostream>
 #include <string>
 #include "boost/regex.hpp"

 int main() {
   std::cout << "Enter a regular expression:\n";
   std::string s;
   std::getline(std::cin, s);
   try {
     boost::regex reg(s);
     std::cout << "Enter a string to be matched:\n";
     std::getline(std::cin,s);
     if (boost::regex_match(s,reg))
       std::cout << "That's right!\n";
     else
       std::cout << "No, sorry, that doesn't match.\n";
   }
   catch(const boost::bad_expression& e) {
     std::cout <<
       "That's not a valid regular expression! (Error: " <<
       e.what() << ") Exiting...\n";
   }
 }

To protect the application and the user, a try/catch block ensures that if boost::regex throws upon construction, an informative message will be printed, and the application will shut down gracefully. Putting this program to the test, let's begin with some reasonable input.

 Enter a regular expression:
 \d{5}
 Enter a string to be matched:
 12345
 That's right!

Now, here's grief coming your way, in the form of a very poor attempt at a regular expression.

 Enter a regular expression:
 (\w*))
 That's not a valid regular expression! (Error: Unmatched ( or \() Exiting...

An exception is thrown when the regex reg is constructed, because the regular expression cannot be compiled. Consequently, the catch handler is invoked, and the program prints an error message and exits. There are only three places where you need to be aware of potential exceptions being thrown. One is when constructing a regular expression, similar to the example you just saw; another is when assigning regular expressions to a regex, using the member function assign. Finally, the regex iterators and the algorithms can also throw exceptions if memory is exhausted or if the complexity of the match grows too quickly.
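The first of these cases, construction, can be wrapped in a small check that reports whether a user-supplied pattern compiles. The sketch below assumes the standard <regex> header, where the exception type carries the name regex_error mentioned earlier; with Boost.Regex as described in this chapter you would catch boost::bad_expression instead. The function name compiles is hypothetical.

```cpp
#include <regex>
#include <string>

// True if the pattern can be compiled into a regular expression.
// (Hypothetical helper; std::regex and std::regex_error stand in for
// boost::regex and boost::bad_expression here.)
bool compiles(const std::string& pattern) {
  try {
    const std::regex reg(pattern);  // throws if the pattern is malformed
    return true;
  } catch (const std::regex_error&) {
    return false;
  }
}
```

A pattern with a forgotten closing parenthesis, such as "(abc", should be rejected, whereas "\\d{5}" should be accepted.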