REALbasic uses the Perl Compatible Regular Expression (PCRE) library, which is open source and whose Perl-style syntax is shared by many other programming and scripting languages, such as Python and PHP. This is good because if you have used regular expressions in those languages, working with them in REALbasic should feel familiar. If you are new to regular expressions and would like to know what exactly a regular expression is, I'll explain that first and then move on to some examples of how to use one.

Regular expressions are used to identify patterns in strings of text and, in some cases, to replace one pattern with another. You describe patterns of text using a series of characters that have special meaning in this context.

You can download sample applications on the REALbasic website (http://realsoftware.com/), and one of them, called Regular Expressions, lets you test regular expressions. It is an excellent tool to use whenever you need a regular expression, because you do not want to be testing them within your application. They are notoriously difficult to get right, especially with complex pattern requirements, so you want to focus on the expression first, and then include it in whichever application you are working on. I used it to test the regular expressions I use as examples in this section.

There are special identifiers for individual characters. (When using regular expressions, you will see a lot of \, which is called an escape and which lets the regular expression engine know that the following letter has a special meaning, rather than being a literal representation of the letter.) Here are the references to specific letters:
There is another group of identifiers that refer to classes of characters:
Although there are more identifiers you can use, this is enough to get started. The next step is to combine these characters into a sequence that represents the pattern you are looking for. This can be a combination of literal characters, plus any of the above classes. Suppose you wanted to identify phone numbers that are listed throughout a document, so you want to create a regular expression that would match against a pattern like this:

(800) 555-1212

Your first attempt might look something like this:

(\d\d\d)\s\d\d\d-\d\d\d\d

If you tried to use it, you would find that it didn't work right. The change you need to make relates to the parentheses. They have special meaning in regular expressions, so you need to escape them. You can update this expression accordingly:

\(\d\d\d\)\s\d\d\d-\d\d\d\d

You can now find phone numbers, but only if they are formatted exactly like the preceding phone number. In real life, things are never formatted that consistently, and being able to accommodate those kinds of formatting fluctuations is where regular expressions become very powerful.

Before I add additional characters to your repertoire, I want to modify the regular expression I just shared with you. Let's say you wanted to find all the "800" numbers that were in a particular document. To do that, you can use the literal values for 800, along with the other expressions, like so:

\(800\)\s\d\d\d-\d\d\d\d

Now I want to move on and identify ways of making this regular expression more flexible and accommodating of the various formats that you will likely encounter. First, I'll make a list of ways that I might find phone numbers formatted in a particular document:

(800) 555-1212
(800)555-1212

There are a lot of potential variations, but this identifies two. The difference between the two is that the first has an extra space. To figure out how to match this, you'll need some additional tools:
The preceding elements let you specify how many times in sequence a particular character should appear. Not only will this help to handle the space that may or may not exist, it can also make the regular expression itself much more compact.

\(\d{3}\)\s?\d{3}-\d{4}

Instead of using \d for every character that's a digit, I was able to specify how many digits should appear in sequence. For the space character \s, I specified that there should be zero or one space in that location, meaning that both examples will match.

Thus far, I have looked at expressions that match individual characters and classes of characters, as well as expressions that determine the number of times in sequence the matching character or character class should appear. You can also group characters, define your own character classes, and specify alternatives, which is like saying match this class or that class. The characters used to define these expressions are as follows:
Grouping is most important when it comes to replacing text, because when you search against the text and get a match, the match is defined in terms of the groups you have defined. Using the same regular expression I was using before, I will now group it:

\((\d{3})\)\s?(\d{3})-(\d{4})

I have created three groups. Now you can see why I had to escape the original parentheses; it's because regular expressions use parentheses to define groups. For the engine to understand that I mean the actual parenthesis character (, I precede it with \.

Every group you define is assigned to a variable called a backreference, which represents the subexpression matched and is identified by a number preceded by \ or $. Suppose we use the preceding regular expression to search for the phone number I've used in the examples. There will be four subexpressions that compose the match: the entire match, plus one for each of the three groups. (You will see more of this when I review the RegExMatch class in more detail.) The following table pairs the subexpression match number with the characters that are matched:
I can also use grouping to match a sequence of characters and to treat that sequence of characters as a single unit. For example, suppose I want to search through some text to see if I can find instances of a particular word that repeats itself. I can use regular expressions to find them, like this:

(the){2}

This expression will find instances of "thethe" in the text. Note how the {2} applies to the group, not simply the preceding character.

You can also define your own classes of characters by enclosing them in brackets. You can list individual characters:

[aeiou]

In that example, I've created a class that matches vowels. If I want a class that matches consonants, I could write out a list of consonants, or I can negate the previous list:

[^aeiou]

That class means any character that is not a, e, i, o, or u, which technically isn't the same as a consonant because it also includes digits, punctuation, and whitespace. So I should modify it to read:

[^aeiou\W]

This class will not match any vowels or any nonword characters. Word characters are the letters of the alphabet, the digits, and the underscore (_), so digits and the underscore still slip through; the class is closer to a consonant class, but not an exact one.

Now, if I really want to write a long, hard-to-read class, I can decide to create one that matches all punctuation characters. You will see that a number of them are escaped to account for the fact that they are characters used in constructing regular expressions:

[\!"#\$%&'()\*\+,\-\./:;<=>\?@\[\\\]\^`{\|}\~]

A character class matches against one character, by default, but you can use the standard modifiers to make it match more than one in a row. Here's a regular expression that matches two consecutive punctuation marks:

[\!"#\$%&'()\*\+,\-\./:;<=>\?@\[\\\]\^`{\|}\~]{2}

You can also use the alternation character | to identify "or" conditions where you want to match either this or that character, so to speak. Consider this:

John|Jack

This expression matches either John or Jack.
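The three ideas above (a quantifier applied to a group, a negated character class, and alternation) can be tried outside REALbasic as well. The following is a minimal sketch that assumes Python's re module as a stand-in, because it accepts the same syntax for these constructs; it does not exercise REALbasic's RegEx class, and the variable names and sample strings are made up for illustration.

```python
import re

# Assumption: Python's re module stands in for REALbasic's RegEx class.

# A quantifier placed after a group applies to the whole group.
doubled = re.compile(r"(the){2}")
print(doubled.search("Paris in thethe spring").group(0))  # thethe

# [^aeiou\W] rejects vowels and non-word characters. Digits and the
# underscore are word characters, so they are still accepted.
not_vowel = re.compile(r"[^aeiou\W]")
print(not_vowel.findall("a1_b e!"))  # ['1', '_', 'b']

# Alternation matches either branch.
names = re.compile(r"John|Jack")
print(names.findall("Jack met John and Jill"))  # ['Jack', 'John']
```

The second result is worth a look before relying on a negated class to mean "consonant".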
The following expression searches for paragraphs that start with a digit ranging from 1 through 6:

"^(1|2|3|4|5|6)\s(.*)\n\n"

In this example, note that I grouped the sequence of alternations, which you often need to do to get it to work right. You might also notice the use of ".", which matches any character and has yet to be discussed. This is one that you will use often:

.    Match any character except newline. You can match newlines as well by configuring RegExOptions.

There are many occasions when you want to match a sequence of characters, regardless of what they are, until you reach a particular string of characters. A good example is XML, where there are elements that surround a lot of text:

<p>This is a paragraph.</p>

In this case, suppose you want to get the text that sits between the two paragraph tags. A regular expression can do that for you, as follows:

<p>(.*)</p>

That's simple enough, but what if you have a document full of paragraphs, like so:

<p>This is paragraph one</p> <p>This is paragraph two</p>

In the previous example, subexpression one ("\1") would reference the string:

This is a paragraph.

However, in the second example, it would reference

This is paragraph one</p> <p>This is paragraph two

The reason for this is that the "." matches all characters except for newline, and the regular expression says to find <p> and match any character zero or more times, and then terminate the match at the appearance of </p>. Nothing in the expression says to stop at the first </p>. The regular expression engine is, by default, greedy, which means that it is going to find the longest valid match possible, but this is not always what you want, so you can specify the expression's or subexpression's greediness. There is an option you can set in the RegExOptions class, but you can also write the regular expression itself to do so, which often gives you much more fine-grained control over which parts of the expression you want to be greedy.
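The greedy behavior described above is easy to observe directly. This is a small sketch, again assuming Python's re module as a stand-in for REALbasic's engine; it also shows the lazy quantifier *?, which makes the same group stop at the first closing tag. The variable names are illustrative only.

```python
import re

# Assumption: Python's re module stands in for REALbasic's RegEx class.
text = "<p>This is paragraph one</p> <p>This is paragraph two</p>"

# Greedy: .* runs as far as it can, so the group spans both paragraphs.
greedy = re.search(r"<p>(.*)</p>", text)
print(greedy.group(1))
# This is paragraph one</p> <p>This is paragraph two

# Lazy: .*? stops at the first </p>, so each paragraph matches separately.
lazy = re.findall(r"<p>(.*?)</p>", text)
print(lazy)
# ['This is paragraph one', 'This is paragraph two']
```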
The non-greedy version of the previous regular expression is this:

<p>(.*?)</p>

The question mark, placed directly after the * quantifier, reverses the greediness of that quantifier, and in this instance means that it will find the shortest match, which is exactly what you are looking for. The first search will match the text of paragraph one, and the second search will match the text of paragraph two.

There are some characters that create what are known as zero-width matches. In other words, they match something, but it's not a character that will show up in the match string. The following is a list of such items.
A simple example is to create a regular expression that looks for paragraphs or strings that start with the word "The". It would look like this:

^The.*

The ^ matches the beginning of the line, but there is no character that is actually matched. Likewise, \b matches a word boundary, which is not a character but a position between characters.

There is also a group of POSIX character classes that you can reference using the following notation:

[:alpha:]

In practice, you would use it in a regular expression like this:

[[:cntrl:]]+

This expression searches for instances of one or more control characters in a given String. Following is a list of all the POSIX classes that PCRE supports. In my testing, all worked in REALbasic except for [:blank:], which raised an error when used.
In more complex regular expressions, there are times when you will need to group characters, but do not want to create a backreference to them. You have already encountered zero-width matches, such as ^, but those matches didn't take up any space because of the nature of the kind of match that they were. If I just want to match a group, but avoid creating any backreferences, I use the following syntax:
The first item in the table is a way to group characters without creating a backreference. The following four items seem to me to be more functional because they are designed to solve a common problem, which is to find a match of a certain pattern, as long as it doesn't precede or follow a particular pattern. Suppose I want to identify URLs in an HTML document that appear only in character text, and that are not part of an actual link. I would want to match the URL in this sentence:

<p>This is a sentence that refers to a site: http://google.com/</p>

But I do not want to match the URL in this one:

<p>I like to visit <a href="http://google.com/">google</a></p>

Here is one approach to solving this problem using a zero-width negative lookbehind:

(?<!href=")(http://.*/)

Using this regular expression, the URL will be matched only when not preceded by href=". To do the opposite and match only URLs that are part of anchors, do this:

(?<=href=")(http://.*/)

There is a lot to be said about how to construct regular expressions, more than I can say here. The language reference is a good place to look, as well, but because REALbasic's RegEx class is based on an open source library, there is also a lot of information available online that goes into much greater detail. You can start your search at the PCRE web page: http://www.pcre.org/. Next, I'll take a look at the regular expression classes that are part of REALbasic.

RegEx Class

RegEx is the central class you will use when working with regular expressions. There are two things you can do with regular expressions: search for a pattern and replace a string that matches a pattern with a different pattern. Hence, the two RegEx properties, SearchPattern and ReplacementPattern:

RegEx.SearchPattern as String
RegEx.ReplacementPattern as String

The SearchStartPosition property determines where in the string the search will begin. By default, this value is zero.
RegEx.SearchStartPosition as Integer

The RegEx class has a property called Options, which is an instance of the RegExOptions class. You use this class to set a number of parameters that affect the way matches are made.

RegEx.Options as RegExOptions

I will discuss the RegExOptions class separately. The RegEx class has two methods, Search and Replace, which you typically use to perform searches on a string.

RegEx.Replace([TargetString as String],[SearchStartPosition as Integer]) as String
RegEx.Search([TargetString as String],[SearchStartPosition as Integer]) as RegExMatch

The parameters used with Replace and Search are optional, which at first glance may not make much sense. If you are searching for a pattern within a string, which is what the TargetString parameter specifies, how can the TargetString be optional? The answer is that it is the sequence of searches that makes this determination. In other words, you often will perform multiple searches consecutively on the same string. On the second pass, you do not need to specify the TargetString again. Here's an example, using the regular expression I used earlier to match against phone numbers:

Dim re as RegEx
Dim match as RegExMatch
re = New RegEx
re.SearchPattern = "\(\d{3}\)\s?\d{3}-\d{4}"
match = re.Search("My number is (800) 555-1212. Is your number (800) 555-1212?")

I'll talk about the RegExMatch class momentarily, but for now understand that the search that was just executed started at the beginning of the string, by default, and it returned a match at the first instance of a match. Because there are two phone numbers in the String, all I have to do is call Search again to begin searching after the match to find the second number. The only issue is that you need to call Search the next time without any parameters:

match = re.Search

If you pass the TargetString again, the search starts over. Likewise, if you also set the SearchStartPosition, it overrides this automatic behavior.
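The idea of resuming each search where the previous match ended can be sketched in other environments too. The example below assumes Python's re module as a stand-in; Python has no parameterless Search call, so the resume position is passed explicitly, which makes the mechanism visible. The function name find_all_phones and the sample text are hypothetical.

```python
import re

# Assumption: Python's re module stands in for REALbasic's RegEx class.
PHONE = re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}")

def find_all_phones(text):
    """Search repeatedly, resuming each search where the last match ended."""
    found = []
    match = PHONE.search(text)
    while match is not None:
        found.append(match.group(0))
        # Start the next search just past the end of the current match.
        match = PHONE.search(text, match.end())
    return found

sample = "My number is (800) 555-1212. Is your number (800)555-1313?"
print(find_all_phones(sample))  # ['(800) 555-1212', '(800)555-1313']
```

Each match is handled before the next search is issued, so the loop never touches a missing match.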
If you want to loop through the String and find all the relevant matches, you can use a loop. After you've made the first match, process it and then search again, until Search returns Nil:

While match <> Nil
  // do something with match
  match = re.Search
Wend

RegExMatch Class

The RegExMatch class comes with four properties and methods, as follows:

RegexMatch.SubExpressionCount as Integer
RegexMatch.SubExpressionString(matchNumber as Integer) as String
RegexMatch.SubExpressionStartB(matchNumber as Integer)
RegexMatch.Replace(ReplacementPattern as String)

This class is returned as the result of a RegEx.Search call. If there was nothing matched, it is Nil; otherwise, it is not Nil and you can use the properties to examine the matches. The SubExpressionCount refers to the groups that were established in the regular expression just searched on. Recall this regular expression used to find phone numbers:

\((\d{3})\)\s?(\d{3})-(\d{4})

The RegExMatch object that would get returned for this expression if a match was identified would have a SubExpressionCount of four. Like arrays and other things, the counting starts from zero, so you could refer to the following matches:

\0, \1, \2, \3

After you have the number of subexpressions that have been matched, you can use each backreference to get a reference to the string that was matched and to the offset of where the matched subexpression starts. Here's an example:

Dim re as RegEx
Dim match as RegExMatch
re = New RegEx
re.SearchPattern = "\((\d{3})\)\s?(\d{3})-(\d{4})"
match = re.Search("(800) 555-1212")
MsgBox str(match.SubExpressionCount) // displays "4"
MsgBox match.SubExpressionString(1) // displays "800"
MsgBox str(match.SubExpressionStartB(1)) // displays 1
MsgBox str(match.SubExpressionStartB(2)) // displays 6

There is one very important caveat to finding the start position of matches, which is that the number the RegExMatch object returns represents the binary offset of where the match begins and not the character position.
In this example, SubExpressionStartB(1) starts at byte one and not character position one. Because PCRE regular expressions support searches against characters encoded in UTF-8, this could be a problem if you were expecting to find the match at a certain character position. If you want to do something to that particular passage of text that you have matched, such as set a style for it, you will need to figure out the character position rather than the byte position of the match. The easy way to do it is to take the SubExpressionString for that match and then use the global InStr function to find the occurrence of the matched string in the original string. You could also use a MemoryBlock to extract the bytes leading up to the SubExpressionStartB offset and then use Len to find the length of that string in terms of characters rather than bytes. You would then be able to calculate the starting position of the matched string.

RegExOptions Class

Finally, you can set a number of parameters using the RegExOptions class (the Options property of the RegEx class). Here they are:
If the target string you are searching against does not have any newline characters, you may want to specify that the RegEx class treat the beginning of the string as the start of a line, and the end of the string as the end of the line, even though there is no newline character at all. The following properties allow you to do just that:

RegexOptions.StringBeginIsLineBegin as Boolean
RegexOptions.StringEndIsLineEnd as Boolean

Finally, you can tell it to treat the target string as one line for the purposes of matching ^ or $, causing it to ignore newline instances in the middle of the string.

RegexOptions.TreatTargetAsOneLine

Replacing Text

The previous examples dealt with matching text, rather than using regular expressions to replace text. Now I want to share some examples of how you can use regular expressions to search and replace text. The replacement pattern uses a combination of backreferences and new characters to let you build the replacement text. The hard part is getting the right match to begin with, with the right groups. After that's done, the rest is fairly straightforward.

Listing 6.88. Function Replace(matchstring as String) as String
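As an illustration of a replacement pattern built from backreferences, here is a minimal sketch that rewrites the phone numbers used earlier into a dashed form. It assumes Python's re module as a stand-in and is not the body of the listing named above; the function name dashed is hypothetical. Note that Python's sub replaces every match by default, which may differ from how REALbasic's Replace behaves under its default options.

```python
import re

# Assumption: Python's re module stands in for REALbasic's RegEx class.
# Three groups: area code, exchange, line number.
PHONE = re.compile(r"\((\d{3})\)\s?(\d{3})-(\d{4})")

def dashed(text):
    """Rewrite (800) 555-1212 style numbers as 800-555-1212."""
    # \1, \2 and \3 in the replacement refer to the three groups above.
    return PHONE.sub(r"\1-\2-\3", text)

print(dashed("Call (800) 555-1212 or (800)555-1313."))
# Call 800-555-1212 or 800-555-1313.
```

Once the groups capture the right pieces, the replacement pattern itself is only a template that rearranges them.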
Text Validation and Variable Substitution

The following two examples show a few ways of using regular expressions. The first example is the mwRegex class, which I originally wrote to validate data entry into fields in a more complex way than the Mask property of the EditField could. In addition, some convenience functions make it an easy way to work with regular expressions. The second example comes from a part of the Properties class that I glossed over earlier because it involved regular expressions. One feature of properties files used by Java is variable substitution. I'll get into the details when I get to the actual example, but it shows a slightly more complicated search-and-replace problem than the previous examples because you have to create the replacement pattern based on the current match.

Listing 6.89. mwRegex class
Listing 6.90. Sub mwRegex.Constructor(pattern as string, rep_pattern as string, greedy as Boolean)
Listing 6.91. Function mwRegex.isMatch(aTargetString as string) As Boolean
Listing 6.92. Function mwRegex.isMatch(aTargetString as string, offset as integer) As Boolean
Listing 6.93. Function find(aString as String, ByRef os as Integer, ByRef oe as Integer) As Boolean
Listing 6.94. Function findNext(aString as String, ByRef os as Integer, ByRef oe as Integer) As Boolean
Listing 6.95. Function getMatch(matchNumber as Integer) As String
Listing 6.96. Function replace(aString as String) As String
Listing 6.97. Function replaceAll(aString as String) As String
Listing 6.98. Sub setReplacementPattern(rep_pattern as String)
Listing 6.99. Function search(target_string as String) As RegexMatch
Listing 6.100. Function search() As RegexMatch
Listing 6.101. Function searchAll(target_String as String) As RegexMatch()
Variable Substitution in the Properties Class

It's best to start with an example to see how variable replacement is used in properties files:

root_dir=/home/choate
doc_dir=${root_dir}/documents
prog_dir=${root_dir}/programs
book_dir=${doc_dir}/book

As you can see, variables are being used to make it easier to write out the properties. The alternative would be to write the property file like this:

root_dir=/home/choate
doc_dir=/home/choate/documents
prog_dir=/home/choate/programs
book_dir=/home/choate/documents/book

Not only does this second approach mean that you will type more, it's also a gigantic hassle if, for some reason, you decide you want to change the value of root_dir. If you are not using variable substitution, you would have to go back and change every path every time. However, if you do use variable substitution, you need only change the value of root_dir and be done with it.

The logic of variable substitution works like this: when I encounter a variable, I look for a property that has already been defined whose name is the same as the name contained by the ${...} characters. If that property exists, I need to get the value of that property and use that string to replace the variable itself. Follow the logic in the ParseVariable function that uses a regular expression and grouping to find the variable name, and then look up the value to be used to replace the variable.

Listing 6.102. Function ParseVariable(aString as String) as String
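The substitution logic described above can be sketched compactly. The following is an illustration only, written in Python as a stand-in rather than REALbasic, and it is not the author's ParseVariable listing. The names VARIABLE, parse_variable, and load_properties are hypothetical, and leaving an undefined variable untouched is an assumption made for this sketch.

```python
import re

# Assumption: Python's re module stands in for REALbasic's RegEx class.
# One group captures the name between ${ and }.
VARIABLE = re.compile(r"\$\{([^}]+)\}")

def parse_variable(value, properties):
    """Replace each ${name} in value with the already-defined property."""
    def lookup(match):
        name = match.group(1)
        # Assumption: an undefined variable is left as-is.
        return properties.get(name, match.group(0))
    return VARIABLE.sub(lookup, value)

def load_properties(lines):
    """Parse key=value lines in order, expanding variables as they appear."""
    properties = {}
    for line in lines:
        key, _, value = line.partition("=")
        properties[key.strip()] = parse_variable(value.strip(), properties)
    return properties

props = load_properties([
    "root_dir=/home/choate",
    "doc_dir=${root_dir}/documents",
    "prog_dir=${root_dir}/programs",
    "book_dir=${doc_dir}/book",
])
print(props["book_dir"])  # /home/choate/documents/book
```

Because each value is expanded as it is read, book_dir can refer to doc_dir, which itself was built from root_dir.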