Regular Expressions

Regular expressions are powerful tools for validating input, for extracting data from input, and for searching and replacing (see Table 13.1). The regular expression pattern matching language is used to describe regular expressions. A regular expression (regex for short) describes constraints on the way a string is composed.

Table 13.1. Examples of Regular Expressions

Pattern

Meaning

hello

matches the literal string, hello

c*at

(quantifier) zero or more occurrances of c, followed by at

c?at

zero or 1 occurrances of c, followed by at

c.t

(character set: anychar) c followed by any character followed by t: cat, cot, c3t, ct

c.*t

(char set and quantifier) c followed by 0 or more anychar followed by t: ct, caaatttt, carsdfsdft

ca+t

(quantifier) + means 1 or more of the preceding "thing," so this matches cat, caat, caaaat, but not ct

c.*t

(literals) backslashes precede special characters to "escape them," so this matches the literal: c.*t

c\.t

(literals) c.t

c[0-9a-c]+z

matches c312abbaz caa211bac2z

the (cat|dog) ate the (fish|mouse)

(alternation) the cat ate the fish or the dog ate the mouse or the dog ate the fish, and the obvious last one.

w+

(charset) a sequence of alphanumerics, or a word, same as [a-zA-Z0-9]+

W

a character that is not part of a word (punctuation, whitespace, etc.)

s{5}

exactly 5 whitespace chars (tabs, spaces, newlines)

S{1,5}

at least 1, at most 5 non-whitespace (printable characters)

d

a digit [0-9] (and D is a non-digit, i.e., [^0-9])

d{3}-d{4}

7-digit phone numbers: 555-1234

m[A-Z]w+

 means word boundary: matches mBuffer but not StreamBuffer

Regular expressions were first available in tools such as vi, emacs, awk, sed, and the POSIX Standard Library. Perl was the first mainstream programming language to integrate regular expressions so tightly into the language that it caused many people to learn regular expressions for the first time. Many enhancements have been made to the version of regular expressions that Perl recognizes. The enhancements are part of what we call Perl-style extended regular expressions. These extended regular expressions are also available in Java and Python.

Qt has a class, QRegExp, that implements most of the Perl-style extended regular expression language.

13.2.1. Regular Expression Syntax

A regular expression can be a simple string, in which case it specifies an exact string match, or it can be a string that includes regular expression meta-characters. A meta-character is a character that describes other characters.

Here are some of the most commonly used meta-characters.

  1. Special characters

    • . (the dot matches any character)
    • (matches the newline character)
    • f (matches the form feed character)
    • (matches the tab character)
    • xhhhh (matches the Unicode character whose code is the hexadecimal number hhhh in the range 0x0000 to 0xFFFF)
  2. Quantifiers: Modifiers that specify the number of occurrences of the preceding character (or group) that may appear in the matching expression.

    • + (1 or more occurrences)
    • ? (0 or 1 occurrences)
    • * (0 or more occurrences)
    • {i,j} (at least i but not more than j occurrences)
  3. Character sets: Sets of allowable values for the character in the specified position of the matching expression. Several character sets are predefined:

    • s (matches any whitespace character)
    • S (matches any non-whitespace character)
    • d (matches any digit character: '0' to '9')
    • D (matches any non-digit character)
    • w (matches any "word" character; i.e., any letter or digit or the underscore '_')
    • W (matches any non-word character)

    Character sets can also be specified in square brackets:

    • [AEIOU] (matches any of the chars ' A ', ' E ', ' I ', ' O ', or ' U ')
    • [a-g] (the dash makes this a range from 'a' to 'g')
    • [^xyz] (matches any character except for ' x ', ' y ', and ' z ')
  4. Grouping and capturing characters (round parentheses): Characters that can be used to form a group. Groups can be back-referenced, meaning that if there is a match, the grouped values can be captured and accessed in various ways.

    For convenience, up to 9 groups can be referenced within the regular expression by using the identifiers 1 thru 9.

    There is also a QRegExp member function cap(int nth) that returns the nth group (as a QString).

  5. Anchoring characters: Assertions that specify the boundaries of a matching effort.

    • The caret (^), if it is the first character in the regex, indicates that the match starts at the beginning of the string.
    • The dollar sign ($), when it is the last character in the regex, means that the effort to match must continue to the end of the string.
    • In addition, there are word boundary () or non-word boundary (B) assertions that help to focus the attention of the regex.

Backslashes are used for escaping special characters in C++ strings as well, so this means that regular expression strings inside C++ strings must be "double-backslashed." In other words, every becomes \, and to match the backslash character itself you will need four: \\

There is much more to regular expressions. Time spent learning to use them is well-invested time. The documentation for QRegExp is a good place to start. For a much more extensive discussion, we recommend [Fried198].

13.2.2. Regular Expressions: Phone Number Recognition

The Problem

We want to specify conditions, in a generic way, that must be satisfied by input data at runtime. For example:

  • In a U.S. address, every zip code can have five digits, followed by an optional dash (-) and four more digits.
  • A U.S. phone number consists of ten digits, usually grouped 3 + 3 + 4, with optional parentheses and dashes and an optional initial 1.
  • A U.S. state abbreviation must be one from the set of 50 approved abbreviations.

How can we impose conditions such as these on incoming data in an object-oriented way?

Suppose that you wanted to write a program that recognized phone number formats, and could accept Dutch or U.S./Canada phone numbers. You would need to take the following things into consideration.

  • For any U.S./Canada format numbers, there must AAA EEE NNNN, where A = area code, E = exchange, and N = number.
  • For Dutch format numbers, there must be CC MM followed by either NN NN NNN or NNN NNNN, where C = country code, M = municipal code, and N = localNumberDigits.
  • There might be dashes or spaces between number clusters.
  • There might be + or 00 in front of the country code.

Imagine how you would write this program using the standard tools available to you in C++. It would be necessary to write lengthy parsing routines for each possible format. Example 13.3 shows the desired output of such a program.

Example 13.3. src/regexp/testphone.txt

$> ./testphone
Enter a phone Number (or q to quit): 1112223333
validated: (US/Canada) +1 111-222-3333
Enter a phone Number (or q to quit): 20618676017
validated: (Europe) +20 (0)61-86-76-017
Enter a phone Number (or q to quit): 31206472582
validated: (Europe) +31 (0)20-64-72-582
Enter a phone Number (or q to quit): 16175551212
validated: (US/Canada) +1 617-555-1212
Enter a phone Number (or q to quit): +31 20 64 28 258
validated: (Europe) +31 (0)20-64-28-258
Enter a phone Number (or q to quit): 1 (617) 222 3333
validated: (US/Canada) +1 617-222-3333
Enter a phone Number (or q to quit): 31 20 111 1111
validated: (Europe) +31 (0)20-11-11-111
Enter a phone Number (or q to quit): asdf
Unknown format
Enter a phone Number (or q to quit): 111 2222 333
Unknown format
Enter a phone Number (or q to quit): 111223
Unknown format
Enter a phone Number (or q to quit): 1112222333
validated: (US/Canada) +1 111-222-2333
Enter a phone Number (or q to quit): q
$ >

A procedural C-style solution that shows how to use QRegExp is shown in Example 13.4.

Example 13.4. src/regexp/testphoneread.cpp

[ . . . . ]
QRegExp usformat
("(\+?1[- ]?)?\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4}"); <-- 1

QRegExp nlformat
("(\+|00)?[\s\-]?(31)[\s\-]?(\d\d)[\s\-]?(.*)$"); <-- 2

QRegExp nlformat2
("(\d\d)(\d\d)(\d{3}"); <-- 3

QRegExp filtercharacters
("[\s-\+\(\)\-]"); <-- 4

QString stdinReadPhone() { <-- 5
 QString str;
 bool format=false;
 do { <-- 6
 cout << "Enter a phone Number (or q to quit): ";
 cout.flush();
 str = cin.readLine();
 if (str=="q")
 return str;
 if (usformat.exactMatch(str)) {
 format = true;
 QString areacode = usformat.cap(2);
 QString exchange = usformat.cap(3);
 QString number = usformat.cap(4);
 str = QString("(US/Canada) +1 %1-%2-%3")
 .arg(areacode).arg(exchange).arg(number);
 }
[ . . . . ]
if (format == false) {
 cout << "Unknown format" << endl;
 }
 } while (format == false) ;
 return str;
}

int main() {
 QString str;
 do {
 str = stdinReadPhone();
 if (str != "q")
 cout << "validated: " << str << endl;
 } while (str != "q");
 return 0;
}
[ . . . . ]
 

(1)All usformat numbers have country code 1, and have 3 + 3 + 4 = 10 digits. White spaces, dashes, and parantheses between these digit groups are ignored, but they help to make the digit groups recognizable.

(2)Netherlands (country code 31) numbers have 2 + 2 + 7 = 11 digits.

(3)The last seven digits will be arranged as 2 + 2 + 3.

(4)These are characters we ignore in the last seven digits of NL numbers.

(5)Ensures that the user-entered phone string complies with a regular expression and extracts the proper components from it.

(6)Keep asking until you get a valid number.

Exercises: Regular Expressions: Phone Number Recognition

1.

Rewrite the birthday reminder application from Section 3.6 so that it accepts dates in any of the following formats:

  • YYYY-MM-DD
  • YYYY/MM/DD
  • MM/DD (year defaults to this year)
  • MMM DD YYYY (year optional, defaults to this year) (MMM = 3-letter abbreviation for month, ignores upper/lower case but does verify validity of month abbreviation)
2.

Many operating systems come with a reasonably good list of words that are used by various programs to check spelling. On *nix systems the word list is generally named (at least indirectly) "words". You can locate the file on your *nix system by typing the command

locate words | grep dict
 

Piping the output of this command through grep reduces the output to those lines that contain the string "dict".

Once you have located your system word list file, write a program that will read lines from the file and, using a suitable regex, display all the words that do the following:

  1. Begin with a pair of repeated letters
  2. End in "gory"
  3. Have more than one pair of repeated letters
  4. Are palindromes
  5. Consist of letters arranged in strictly increasing alphabetic order (e.g., knot)

    If you cannot find such a suitable word list on your system, you can use the file canadian-english-small.gz, in our dist directory.[1] After you download it, you must uncompress it with the command

    [1] http://oop.mcs.suffolk.edu/dist

    gunzip canadian-english-small.gz
    


Part I: Introduction to C++ and Qt 4

C++ Introduction

Classes

Introduction to Qt

Lists

Functions

Inheritance and Polymorphism

Part II: Higher-Level Programming

Libraries

Introduction to Design Patterns

QObject

Generics and Containers

Qt GUI Widgets

Concurrency

Validation and Regular Expressions

Parsing XML

Meta Objects, Properties, and Reflective Programming

More Design Patterns

Models and Views

Qt SQL Classes

Part III: C++ Language Reference

Types and Expressions

Scope and Storage Class

Statements and Control Structures

Memory Access

Chapter Summary

Inheritance in Detail

Miscellaneous Topics

Part IV: Programming Assignments

MP3 Jukebox Assignments

Part V: Appendices

MP3 Jukebox Assignments

Bibliography

MP3 Jukebox Assignments



An Introduction to Design Patterns in C++ with Qt 4
An Introduction to Design Patterns in C++ with Qt 4
ISBN: 0131879057
EAN: 2147483647
Year: 2004
Pages: 268

Flylib.com © 2008-2020.
If you may any questions please contact us: flylib@qtcs.net