17.1 Elementary Regular Expression Syntax

   

Regular expressions provide a syntax to describe patterns in text. Being able to describe a passage of text gives you power, the power to manipulate that text in ways that would otherwise be impossible. For example, regular expressions enable you to easily perform an operation such as the following:

Find all apparent phone numbers in a free-text, comment column, and reformat those numbers, at least the U.S. and Canadian numbers, so that all appear in the form: (999) 999-9999.

By the end of this chapter, we'll have shown you exactly how to implement this directive. For now, we want to familiarize you with some elementary regular expression syntax.

17.1.1 Matching a Single Character

The most basic regular expression metacharacter is probably the period (.). Use it to match any single character, except for the newline (usually chr(10)). For example, assume that users of our book's example database have taken to entering employee address and phone number information into a free-form text column. Further suppose that you need to identify and extract the phone numbers from that column. All employees are in the States, so you can use the pattern ...-.... to search for the common phone number pattern of three digits, a hyphen, and four digits:

SELECT emp_id, text FROM employee_comment WHERE REGEXP_LIKE(text,'...-....');     EMP_ID TEXT ---------- ------------------------------------------------------------       7369 126 Varnum, Edmore MI 48829, 989 313-5351       7499 1105 McConnell Court            Cedar Lake MI 48812            Home: 989-387-4321            Cell: (237) 438-3333       7521 Her address is: 12646 Bramell. Her city and state are Vestab            urg, MI 48891            Home phone is 387-1698  . . .        7900 Cares for 100-year-old aunt during the day. Schedule only fo            r evening and night shifts.  . . .

REGEXP_LIKE is the regular expression equivalent to the LIKE predicate. You'll notice that it looks like a function, and if you prefer to think of it that way, that's fine, because in PL/SQL, REGEXP_LIKE is, in fact, a function, a Boolean function. However, in SQL, Oracle refers to REGEXP_LIKE as a predicate.

Whatever you choose to call it, REGEXP_LIKE searches the target string to see whether it contains an occurrence of the pattern you specify. In our example, the predicate searches the text column for the phone number pattern, and evaluates to TRUE whenever that pattern is found.

As you can see from our query's output, our users have taken the words "free-form" to heart. Phone numbers have been entered using a variety of formats. The expression used in our query doesn't catch every phone number. It also matches at least one string that is not a phone number: "100-year-old" in the comment for #7900. We'll refine the expression as we present more of Oracle's regular expression syntax.

Before running the examples in this chapter, we used the following SQL*Plus formatting commands: SET RECSEP OFF, COLUMN text FORMAT A60.


When using regular expressions in Oracle, you aren't limited to detecting occurrences of a pattern. After all, once you've determined that a string contains text matching a pattern, you're likely to want to do something with that text. For that Oracle provides the REGEXP_SUBSTR function. REGEXP_SUBSTR extracts text matching a regular expression:

SELECT emp_id, REGEXP_SUBSTR(text,'...-....') text FROM employee_comment WHERE REGEXP_LIKE(text,'...-....');     EMP_ID TEXT ---------- ---------------       7369 313-5351       7499 989-387-       7521 387-1698

Hmm . . . What's up with 989-387-? We didn't ask for a trailing hyphen, or did we? In a way, we did. Our pattern used a period (.) to ask for any four characters following a hyphen, and a hyphen is itself a character. The string 989-387- comes from the combined area code and phone number 989-387-4321. Keep reading! We'll show you how to fix this.

Fuzziness

Fuzziness is a term some of us use to describe the fact that using regular expressions to identify data in free-form text is not an exact science. When you query the employee table, and you select values from the dept_id column, you can be certain of getting department ID numbers. After all, those are what the column is defined as holding, and you have database constraints and application logic to ensure that the column holds only the sort of values it is supposed to hold.

When using regular expressions to extract data elements from free-form text, you always run the risk of getting back something other than what you are looking for. Search for phone numbers using a pattern such as ...-...., and you might find a string such as '210-acre' instead. No matter how well you constrain your search pattern, you can never be certain that text matching your pattern has the meaning you expect.


17.1.2 Matching Any of a Set of Characters

Phone numbers are not made up of any characters; they are, well, numbers. Our previous regular expression is a bit too broad in this respect, as it allows for hyphens and other characters where we want to see only digits. Not to worry! There is a solution. You can use square-brackets to define a matching-list consisting of only the digits 0 through 9. The pattern [0123456789] will match any of the digits, 0 through 9, given in the list. Using this pattern, you can write a phone number expression that more narrowly focuses in on phone numbers:

[0123456789][0123456789][0123456789]-[0123456789][0123456789][0123456789][0123456789]

This pattern is rather awful though, in that it repeats the digits from 0 through 9 seven times. It's a good thing we aren't looking for area codes too. Maybe we should look for area codes, but for now we'll just tell you that you can use a hyphen within square-brackets as a short-hand for enumerating each character in a range. Rather than write [0123456789], you can use the much simpler [0-9] to represent the same set of values. You can now shorten the phone number expression to:

[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]

The following query shows the change in results from using this more refined expression:

SELECT emp_id,     REGEXP_SUBSTR(text,'[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]') text FROM employee_comment WHERE REGEXP_LIKE(text,'[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]');     EMP_ID TEXT ---------- -------------       7369 313-5351       7499 387-4321       7521 387-1698

This query is much better. Notice 387-4321 in the second row of output. This is the correct phone number that was misrepresented as 989-387- by our query at the end of the preceding section.

You're not limited to specifying a single range within square-brackets. You can specify any combination of ranges and single characters. For example, you can generate the same matching set as from [0-9] using the following expression:

[01-3456-9]

The matching set in this expression is composed of:

  • The single digit 0

  • The range of digits from 1 through 3

  • The single digits 4 and 5

  • The range of digits 6 through 9

Be aware that, under the default NLS_SORT=BINARY setting, any ranges you specify depend on the underlying character set encoding. It so happens that most, if not all, ASCII- and Unicode-based encodings map the digits 0 through 9 to contiguous code points. However, you can run into problems in this area. For example, the range [A-Z] encompasses all the uppercase Latin characters used in American English, but that range does not encompass other uppercase Latin characters, such as the Croatian alphabet's LJ. You'll also find that, while the letters A-Z are contiguous in ASCII, they are not contiguous in EBCDIC (an encoding with roots in IBM operating systems). In EBCDIC, you'll find that the range [A-Z] encompasses not only the letters A through Z, but also the closing curly-brace (}), the backslash (\), and some unassigned code points.

In addition to defining a range, you can negate a matching set. Do this by writing a caret (^) as the first character following the opening square-bracket. For example, use the expression [^0-9] to match all characters but the digits 0 through 9. Use [^a-zA-Z] to match characters other than the 26 letters used in American English.

Within square-brackets, the hyphen (-), caret (^), and other characters take on a special meaning. Table 17-1 tells you a bit more about working with these characters.

Table 17-1. Characters with special meaning inside square brackets

Character

Usage and example

^

Negates a matching set. To include the actual ^ character in a set, place it in any position but the first. For example: [0-9^] includes the digits 0 through 9, plus the caret character.

-

Separates the first and last characters in a range. To include the hyphen as itself, use it in either the first or last position, where it can't possibly define a range. For example: [-0-9] and [0-9-] both include the digits 0-9 along with the hyphen. In a negation situation you can write the hyphen immediately following the caret. The expression [^-0-9] excludes the digits 0 through 9, as well as excluding the hyphen.

[

Within a square-bracket expression, an opening square-bracket simply represents itself. For example, [[0-9] includes the digits 0 through 9 as well as the [ character.

]

Closes a matching set. To include ] in a set, specify it as the first character following the opening square-bracket (e.g., []0-9]). When negating a set, you can specify ] immediately following the caret, as in: [^]0-9].


Bracket expressions seem simple at first, but can get a bit tricky to write because of the special characters described in Table 17-1. The treatment of those characters may seem arbitrary at first, but there is a logic that will begin to sink in as you gain more experience writing regular expressions.

In addition to the characters described in Table 17-1, the sequences [:. : :], [. .], and [= =] also have special meaning. We describe these in Section 17.3.


REGEXP_LIKE Versus LIKE

There's a subtle difference in operation between REGEXP_LIKE and LIKE. The difference lies in whether the pattern you supply must match the entire target string. LIKE evaluates to TRUE only when a pattern matches the entire target string. Thus, to find strings containing a phone number using LIKE, you need to write a pattern such as '%_ _ _-_ _ _ _%'. The percent signs (%) on either end allow for other text on either side of the phone number.

REGEXP_LIKE is different from LIKE in this regard. REGEXP_LIKE evaluates to true whenever a pattern is found anywhere within a string. That's why you can use an expression like '...-....' to detect phone numbers without worrying about text that might precede or follow those numbers.

This subtle difference in operation between the two predicates is something you'll want to take into account should you ever translate a pattern from LIKE's syntax to the regular expression syntax used with REGEXP_LIKE.


17.1.3 Matching Repeating Sequences

The period (.) matches a single character in a target string. So does a bracket expression, which is why we had to repeat [0-9] three times, and then again four times, in our phone number expression. Wouldn't it be nice to more easily define elements of an expression that repeat? Sure it would, and to that end you have what are termed quantifiers. A quantifier is a special character, or sequence of characters that specify a valid range of occurrences for the immediately preceding element.

Getting back to our phone number example, rather than repeat each element in the phone number expression:

[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]

you can specify each element once followed by a quantifier giving the repeat count:

[0-9]{3}-[0-9]{4}

The first element, in this case, is the bracket expression [0-9]. The quantifier for the first element is {3}, which means that we want three repetitions of [0-9]. Similarly, the quantifier {4} for the third element specifies four repetitions of a digit. The second element is the hyphen, for which we didn't specify a quantifier, so the default is to look for a single occurrence. Plugging this shorter and simpler expression into our query from the previous section gives the following, equivalent query:

SELECT emp_id,     REGEXP_SUBSTR(text,'[0-9]{3}-[0-9]{4}') text FROM employee_comment WHERE REGEXP_LIKE(text,'[0-9]{3}-[0-9]{4}');

Oracle's regular expression syntax supports the quantifiers shown in Table 17-2.

Table 17-2. Regular expression quantifiers

Quantifier

Description

*

Specifies that an element may occur zero or more times. For example, .* is the regular expression equivalent to the LIKE predicate's % metacharacter.

+

Specifies that an element may occur one or more times.

?

Specifies that an element may occur zero or one times, but no more.

{n}

Specifies that an element must occur n times.

{n,}

Specifies than an element must occur at least n times, but may occur more than n times.

{n,m}

Specifies that an element must occur anywhere between n and m times, inclusive.

{0,m}

Is the same case really, as {n,m}, but this is a way to specify that an element occurs up to, but not more than, m times.


17.1.4 Defining Alternate Possibilities

Our users have not been at all consistent in how they've entered data into our free-form comment column. Some phone numbers have area codes. Some do not. And when it comes to separating digit groups, our users show no end of creativity. They've used hyphens, periods, and spaces as separators, have mixed those in a single phone number, and you'll even find a few area codes enclosed within parentheses. What's a SQL programmer to do?

Dealing with alternate possibilities is an area in which regular expressions really shine when compared to the rather lame LIKE predicate. Let's begin with the problem of area codes. Some phone numbers have them, and others do not. One way to handle this is to simply write an expression for each case, and join those two expressions with the vertical-bar (|) operator:

[0-9]{3}-[0-9]{3}-[0-9]{4}|[0-9]{3}-[0-9]{4}

This regular expression will find phone numbers in the following two forms: 999-999-9999 and 999-9999. The vertical bar defines an alternation, or a choice between two possibilities: one with an area code and one without.

Alternation appears to take care of the area code problem. Let's turn to the issue of separators. You could try to take care of the different separators by adding even more alternations, with a different expression to cover each permutation of separators:

[0-9]{3}-[0-9]{3}-[0-9]{4}|[0-9]{3}.[0-9]{3}.[0-9]{4}|[0-9]{3} [0-9]{3} [0-9]{4}|[0- 9]{3}-[0-9]{3}.[0-9]{4}|[0-9]{3}.[0-9]{3}-[0-9]{4}|[0-9]{3} [0-9]{3}.[0-9]{4}| . . .

This gets messy fast. A combined area code and phone number contains two separator characters, and with three possible characters to choose from in each position, the number of permutations gets quickly out of hand. Since we're dealing with single characters, a better approach might be to define a matching set of valid separators:

[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}|[0-9]{3}[-. ][0-9]{4}

This is much better. We're back to a single alternation between two patterns. The matching set [-. ] matches any of the three separator characters that we are recognizing as valid.

Because the hyphen (-) comes first in the matching set [-. ], that hyphen is treated as a character in the set. On the other hand, in the matching set [0-9], the hyphen, because it is neither first nor last, is treated as a delimiter between the first (0) and last (9) characters in a range.


The following query shows the results of using this improved expression:

SELECT emp_id,     REGEXP_SUBSTR(       text,       '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}|[0-9]{3}[-. ][0-9]{4}')          text FROM employee_comment WHERE REGEXP_LIKE(    text,    '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}|[0-9]{3}[-. ][0-9]{4}');     EMP_ID TEXT ---------- -------------       7369 989 313-5351       7499 989-387-4321       7521 387-1698       7566 989.387.4444       7654 231-898-9823       7698 388-1234       7844 989-387.5359       7876 453-9999

Order matters with alternation. Consider the following two regular expressions:

[0-9]{3}-[0-9]{3}|[0-9]{3}-[0-9]{3}-[0-9]{4} [0-9]{3}-[0-9]{3}-[0-9]{4}|[0-9]{3}-[0-9]{3}

Both of these expressions look for either a 999-999 or a 999-999-9999 pattern. The 999-999 is a purposeful deviation from our phone number pattern. The difference between the expressions lies solely in the order in which the alternation is written. Look at the difference in output:

SELECT REGEXP_SUBSTR(text,'[0-9]{3}-[0-9]{3}|[0-9]{3}-[0-9]{3}-[0-9]{4}') FROM employee_comment WHERE emp_id = 7499; REGEXP_SUBSTR ------------- 989-387 SELECT REGEXP_SUBSTR(text,'[0-9]{3}-[0-9]{3}-[0-9]{4}|[0-9]{3}-[0-9]{3}') FROM employee_comment WHERE emp_id = 7499; REGEXP_SUBSTR ------------- 989-387-4321

Why the difference in results? It's because Oracle's regular expression engine looks at the alternates in left-to-right order. For reference, here's an example showing the target text:

SELECT text FROM employee_comment WHERE emp_id = 7499; TEXT --------------------- 1105 McConnell Court Cedar Lake MI 48812 Home: 989-387-4321 Cell: (237) 438-3333

Both alternates begin to match at the same point in the target text. Whenever that happens, the regular expression engine looks at the alternates from left-to-right. If the left-most alternate matches, the regular expression engine is satisfied, and that match is returned from REGEXP_SUBSTR. When writing alternations, if you have any preference as to which alternation takes precedence, be sure to write that alternation first.

17.1.5 Subexpressions

We're not quite done with phone numbers yet. One last issue to deal with is that of phone numbers with area codes enclosed in parentheses, in the form: (999) 999-9999. Here's one way to think about this problem:

  • A phone number might have an area code:

    • The area code might be enclosed within parentheses.

    • The area code might not be enclosed within parentheses.

  • The phone number might not have an area code.

Essentially, what you have here is an alternation within an alternation. Working from the inside out, you can write the following expression to accommodate both area code possibilities:

[0-9]{3}|\([0-9]{3}\)

The left side of this expression looks for three digits. The right side looks for three digits enclosed within parentheses. Why the backslash (\) characters? Those are there because otherwise the parentheses are treated as special characters, enclosing a subexpression.

A subexpression is a portion of a longer expression that you wish to treat as a discrete unit. For example, you might wish to apply a quantifier to a subexpression. Indeed, that's a good way to indicate that area codes are optional:

([0-9]{3}|\([0-9]{3}\))?

The parentheses, the ones not preceded by \ characters, define a subexpression defining two alternate representations of an area code. The ? quantifier specifies that the preceding subexpression, the area code, occurs either once or not at all.

There's one more housekeeping detail to take care of before filling out the non-optional portion of the phone number expression. If there's an optional area code, it will be followed by some sort of separator. That separator will be a space if the area code is enclosed by parentheses, or it will be one of either a hyphen, period, or space:

([0-9]{3}[-. ]|\([0-9]{3}\) )?

Now, all that remains is to add the seven-digit phone number pattern onto the end of the preceding expression:

([0-9]{3}[-. ]|\([0-9]{3}\) )?[0-9]{3}[-. ][0-9]{4}

The following query shows the results of this latest iteration of the phone number pattern:

SELECT emp_id, REGEXP_SUBSTR(text,    '([0-9]{3}[-. ]|\([0-9]{3}\) )?[0-9]{3}[-. ][0-9]{4}')    text FROM employee_comment;     EMP_ID TEXT ---------- ---------------       7369 989 313-5351       7499 989-387-4321       7521 387-1698       7566 989.387.4444       7654 231-898-9823       7698 (989) 388-1234       7782       7788       7839       7844 989-387.5359       7876 (231) 453-9999       7900

In addition to dealing with tricky alternations, subexpressions are also essential when working with backreferences, something you'll read more about later in Section 17.3.

Regular Expressions and Index Usage

Many of our examples in this chapter use REGEXP_LIKE in the WHERE clause of a SELECT statement to search for patterns of text. You might rightfully wonder about index usage when executing such queries. When you apply REGEXP_LIKE to a column, you have the same issue that you have when you apply any other SQL function to a column: you preclude the use of any index defined on that column.

If you always search for the same pattern, you can create a function-based index to help you locate that pattern. However, if you always search for the same pattern, we'd argue that you should redesign your database to store your target data in discrete columns. For example, if you are constantly searching for phone number patterns, you should extract phone numbers once, and place them into their own column.

If you must create a function-based index, you won't be able to base it on REGEXP_LIKE, because that function returns a Boolean value. SQL, and by extension indexes, do not support Boolean values. You can, however, use one of the other regular expression functions:

CREATE INDEX phone_number  ON employee_comment (    REGEXP_SUBSTR(text,'...-....'));

You can use this index to support queries such as:

SELECT *  FROM employee_comment WHERE REGEXP_SUBSTR(text,'...-....') = '313-5351';

Again though, we would argue that regular expressions are best reserved for ad-hoc queries. If you issue the same regular expression query often enough to want to index for it, we think you should consider modifying your database design to enable querying the target data without resorting to regular expressions. More importantly, realize that you can do a lot with regular expressions outside of the WHERE clause, where indexes, or the lack thereof, aren't a factor.


17.1.6 Anchoring an Expression

Two metacharacters allow you to anchor an expression to either the beginning or end, or both, of the target string. When you anchor an expression to the beginning of the target string, you are specifying that no other characters may precede the expression. Likewise, when you anchor an expression to the end of a string, you are specifying that no characters are allowed to follow the expression.

Use a caret (^) to anchor an expression to the beginning of a string. Use a dollar sign ($) to anchor an expression to the end of a string. Here's an example, using REGEXP_INSTR, to illustrate. The REGEXP_INSTR function returns the character position where a match for an expression is found:

SELECT REGEXP_INSTR('123','[0-9]') unanchored,        REGEXP_INSTR('123','[0-9]$') anchored_end FROM dual; UNANCHORED ANCHORED_END ---------- ------------          1            3

Notice the difference in character positions returned by the two function calls. The expression in the second function call used a $ to search for a digit at the end of the string, and that's exactly what the function found.

The ^ and $ are anchor characters only when used outside of a bracket expression.


You can combine the use of ^ and $ to write an expression encompassing the entire target string. The following example searches for comments containing only a phone number:

SELECT emp_id, REGEXP_SUBSTR(text,    '^([0-9]{3}[-. ]|\([0-9]{3}\) )?[0-9]{3}[-. ][0-9]{4}$')    text FROM employee_comment WHERE REGEXP_LIKE(text,     '^([0-9]{3}[-. ]|\([0-9]{3}\) )?[0-9]{3}[-. ][0-9]{4}$');     EMP_ID TEXT ---------- -------------       7844 989-387.5359

There's one thing to be careful of when using the anchoring characters. It's possible to write expressions that can't possibly match anything in the target text. For example, you could write $[0-9], which anchors the beginning of the expression to the end of the target string. You won't find any digits after the string ends, so this expression will always fail. Watch for this sort of thing, and be careful about where you place your anchoring metacharacters.

17.1.7 Understanding Greediness

Greediness is an important concept to understand when writing regular expressions. We don't mean greed of the Enron sort, but rather that each quantifier in a regular expression will always match as much text as possible. The results from this rule can sometimes be surprising.

For example, consider the following quotation:

Brighten the corner where you are.

Think about the problem of extracting the first word from this text. Many would look at the text, see that a word is a series of letters followed by a space, and would immediately translate that thought to an expression like '.* ', which matches any number of characters followed by space. That seems logical, doesn't it? Yet look at the results:

SELECT REGEXP_SUBSTR('Brighten the corner where you are',                      '.* ') FROM dual; REGEXP_SUBSTR('BRIGHTENTHECORN ------------------------------ Brighten the corner where you

Is this the result you expected? If it is, then go to the head of the class and take a gold star, because you're already way ahead of us. If this result surprises you, then think about what we asked for:

We asked for a series of characters, of any length, followed by a space.

Faced with this requirement, how would you satisfy it? It turns out there are three possible approaches you can take:

  • You could stop at the first space.

  • You could stop at the last space.

  • You could stop at some arbitrary space somewhere in between the first and the last.

Computers are not good at being arbitrary, and no one wants them to be arbitrary either, so that last option is out. Here's how Oracle's regular expression engine conceptually matches the expression '.* ' with our example text:

  1. The first element of the expression is a period, representing any character. The quantifier allows any number of occurrences, so the engine finds all the characters that it can, stopping only when it reaches the letter e at the end of the string. At this point, the engine has matched the entire string: "Brighten the corner where you are".

  2. The expression calls for a space. An e is not a space, so the engine backs up one character to the r.

  3. An r is not a space, so the engine backs up another character to the a.

  4. An a is not a space, so the engine backs up again, finds the space following "you", and returns "Brighten the corner where you", including the trailing-space, as the result from REGEXP_SUBSTR.

The key point to take away from this section is that the regular expression engine doesn't look at a pattern quite the same way you might. When you mentally apply the pattern '.* ' to a string, your tendency is probably going to be to stop the moment you have a match. You can do that easily, because your brain is a much better pattern-matching engine than a computer. You see the pattern as a whole, and you'll tend to gravitate toward the interpretation that you had in mind when you wrote the expression. Regular expression engines look at patterns and strings a character at a time, moving back and forth, finding the longest match for each regular expression element.

Matching a Single Word

In Section 17.1.7, we used the expression '.* ' to match a word in a string. That expression actually matches more than just a word. It matches a word followed by a space, and it won't find the single word in the string 'Brighten', because there is no trailing space at the end of the string.

Although it worked well to illustrate greediness, the pattern '.* ' isn't at all the best way to define a word. For one thing, you don't want to define a word in terms of what it isn't (that trailing space), but rather in terms of what it is. Perhaps a better way to look at a word is as a sequence of non-space characters. You can use negation inside a bracket expression to accomplish that:

SELECT REGEXP_SUBSTR(    'Brighten the corner where you are', '[^ ]*') FROM dual; REGEXP_S -------- Brighten

However, non-space characters include punctuation as well, so you might want to narrow things down even more, by insisting that your word contain only alphabetic characters:

SELECT REGEXP_SUBSTR(    'Brighten the corner where you are',    '[A-Za-z]*') FROM dual;

Depending on your application, you might or might not wish to allow for the possibility that a word might contain digits. A password, for example, might look like My8Secret. And then you have hyphenated words to think about. Is area-code one word or two? Sometimes the problem is not so much writing the expression as it is defining the thing you want that expression to match.




Mastering Oracle SQL
Mastering Oracle SQL, 2nd Edition
ISBN: 0596006322
EAN: 2147483647
Year: 2003
Pages: 154

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net