Regular expressions are powerful tools for validating input, for extracting data from input, and for searching and replacing (see Table 13.1). The regular expression pattern matching language is used to describe regular expressions. A regular expression (regex for short) describes constraints on the way a string is composed.
Pattern |
Meaning |
---|---|
hello |
matches the literal string, hello |
c*at |
(quantifier) zero or more occurrances of c, followed by at |
c?at |
zero or 1 occurrances of c, followed by at |
c.t |
(character set: anychar) c followed by any character followed by t: cat, cot, c3t, ct |
c.*t |
(char set and quantifier) c followed by 0 or more anychar followed by t: ct, caaatttt, carsdfsdft |
ca+t |
(quantifier) + means 1 or more of the preceding "thing," so this matches cat, caat, caaaat, but not ct |
c.*t |
(literals) backslashes precede special characters to "escape them," so this matches the literal: c.*t |
c\.t |
(literals) c.t |
c[0-9a-c]+z |
matches c312abbaz caa211bac2z |
the (cat|dog) ate the (fish|mouse) |
(alternation) the cat ate the fish or the dog ate the mouse or the dog ate the fish, and the obvious last one. |
w+ |
(charset) a sequence of alphanumerics, or a word, same as [a-zA-Z0-9]+ |
W |
a character that is not part of a word (punctuation, whitespace, etc.) |
s{5} |
exactly 5 whitespace chars (tabs, spaces, newlines) |
S{1,5} |
at least 1, at most 5 non-whitespace (printable characters) |
d |
a digit [0-9] (and D is a non-digit, i.e., [^0-9]) |
d{3}-d{4} |
7-digit phone numbers: 555-1234 |
m[A-Z]w+ |
means word boundary: matches mBuffer but not StreamBuffer |
Regular expressions were first available in tools such as vi, emacs, awk, sed, and the POSIX Standard Library. Perl was the first mainstream programming language to integrate regular expressions so tightly into the language that it caused many people to learn regular expressions for the first time. Many enhancements have been made to the version of regular expressions that Perl recognizes. The enhancements are part of what we call Perl-style extended regular expressions. These extended regular expressions are also available in Java and Python.
Qt has a class, QRegExp, that implements most of the Perl-style extended regular expression language.
13.2.1. Regular Expression Syntax
A regular expression can be a simple string, in which case it specifies an exact string match, or it can be a string that includes regular expression meta-characters. A meta-character is a character that describes other characters.
Here are some of the most commonly used meta-characters.
Character sets can also be specified in square brackets:
For convenience, up to 9 groups can be referenced within the regular expression by using the identifiers 1 thru 9.
There is also a QRegExp member function cap(int nth) that returns the nth group (as a QString).
Backslashes are used for escaping special characters in C++ strings as well, so this means that regular expression strings inside C++ strings must be "double-backslashed." In other words, every becomes \, and to match the backslash character itself you will need four: \\ |
There is much more to regular expressions. Time spent learning to use them is well-invested time. The documentation for QRegExp is a good place to start. For a much more extensive discussion, we recommend [Fried198].
13.2.2. Regular Expressions: Phone Number Recognition
The Problem
We want to specify conditions, in a generic way, that must be satisfied by input data at runtime. For example:
How can we impose conditions such as these on incoming data in an object-oriented way?
Suppose that you wanted to write a program that recognized phone number formats, and could accept Dutch or U.S./Canada phone numbers. You would need to take the following things into consideration.
Imagine how you would write this program using the standard tools available to you in C++. It would be necessary to write lengthy parsing routines for each possible format. Example 13.3 shows the desired output of such a program.
Example 13.3. src/regexp/testphone.txt
$> ./testphone Enter a phone Number (or q to quit): 1112223333 validated: (US/Canada) +1 111-222-3333 Enter a phone Number (or q to quit): 20618676017 validated: (Europe) +20 (0)61-86-76-017 Enter a phone Number (or q to quit): 31206472582 validated: (Europe) +31 (0)20-64-72-582 Enter a phone Number (or q to quit): 16175551212 validated: (US/Canada) +1 617-555-1212 Enter a phone Number (or q to quit): +31 20 64 28 258 validated: (Europe) +31 (0)20-64-28-258 Enter a phone Number (or q to quit): 1 (617) 222 3333 validated: (US/Canada) +1 617-222-3333 Enter a phone Number (or q to quit): 31 20 111 1111 validated: (Europe) +31 (0)20-11-11-111 Enter a phone Number (or q to quit): asdf Unknown format Enter a phone Number (or q to quit): 111 2222 333 Unknown format Enter a phone Number (or q to quit): 111223 Unknown format Enter a phone Number (or q to quit): 1112222333 validated: (US/Canada) +1 111-222-2333 Enter a phone Number (or q to quit): q $ > |
A procedural C-style solution that shows how to use QRegExp is shown in Example 13.4.
Example 13.4. src/regexp/testphoneread.cpp
[ . . . . ] QRegExp usformat ("(\+?1[- ]?)?\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4}"); <-- 1 QRegExp nlformat ("(\+|00)?[\s\-]?(31)[\s\-]?(\d\d)[\s\-]?(.*)$"); <-- 2 QRegExp nlformat2 ("(\d\d)(\d\d)(\d{3}"); <-- 3 QRegExp filtercharacters ("[\s-\+\(\)\-]"); <-- 4 QString stdinReadPhone() { <-- 5 QString str; bool format=false; do { <-- 6 cout << "Enter a phone Number (or q to quit): "; cout.flush(); str = cin.readLine(); if (str=="q") return str; if (usformat.exactMatch(str)) { format = true; QString areacode = usformat.cap(2); QString exchange = usformat.cap(3); QString number = usformat.cap(4); str = QString("(US/Canada) +1 %1-%2-%3") .arg(areacode).arg(exchange).arg(number); } [ . . . . ] if (format == false) { cout << "Unknown format" << endl; } } while (format == false) ; return str; } int main() { QString str; do { str = stdinReadPhone(); if (str != "q") cout << "validated: " << str << endl; } while (str != "q"); return 0; } [ . . . . ]
|
Exercises: Regular Expressions: Phone Number Recognition
1. |
Rewrite the birthday reminder application from Section 3.6 so that it accepts dates in any of the following formats:
|
2. |
Many operating systems come with a reasonably good list of words that are used by various programs to check spelling. On *nix systems the word list is generally named (at least indirectly) "words". You can locate the file on your *nix system by typing the command locate words | grep dict Piping the output of this command through grep reduces the output to those lines that contain the string "dict". |
Once you have located your system word list file, write a program that will read lines from the file and, using a suitable regex, display all the words that do the following:
If you cannot find such a suitable word list on your system, you can use the file canadian-english-small.gz, in our dist directory.[1] After you download it, you must uncompress it with the command
[1] http://oop.mcs.suffolk.edu/dist
gunzip canadian-english-small.gz
Part I: Introduction to C++ and Qt 4
C++ Introduction
Classes
Introduction to Qt
Lists
Functions
Inheritance and Polymorphism
Part II: Higher-Level Programming
Libraries
Introduction to Design Patterns
QObject
Generics and Containers
Qt GUI Widgets
Concurrency
Validation and Regular Expressions
Parsing XML
Meta Objects, Properties, and Reflective Programming
More Design Patterns
Models and Views
Qt SQL Classes
Part III: C++ Language Reference
Types and Expressions
Scope and Storage Class
Statements and Control Structures
Memory Access
Chapter Summary
Inheritance in Detail
Miscellaneous Topics
Part IV: Programming Assignments
MP3 Jukebox Assignments
Part V: Appendices
MP3 Jukebox Assignments
Bibliography
MP3 Jukebox Assignments