Recipe14.14.Finding Text Not Matching a Pattern (Oracle)

Recipe 14.14. Finding Text Not Matching a Pattern (Oracle)

Problem

You have a text field that contains some structured text values (e.g., phone numbers), and you wish to find occurrences where those values are structured incorrectly. For example, you have data like the following:

  select emp_id, text   from employee_comment EMP_ID     TEXT ---------- ------------------------------------------------------------ 7369       126 Varnum, Edmore MI 48829, 989 313-5351 7499       1105 McConnell Court            Cedar Lake MI 48812            Home: 989-387-4321            Cell: (237) 438-3333

and you wish to list rows having invalidly formatted phone numbers. For example, you wish to list the following row because its phone number uses two different separator characters:

 7369            126 Varnum, Edmore MI 48829, 989 313-5351

You wish to consider valid only those phone numbers that use the same character for both delimiters.

Solution

This problem has a multi-part solution:

Find a way to describe the universe of apparent phone numbers that you wish to consider.
Remove any validly formatted phone numbers from consideration.
See whether you still have any apparent phone numbers left. If you do, you know those are invalidly formatted.

The following solution makes good use of the regular expression functionality introduced in Oracle Database 10g

  select emp_id, text from employee_comment where regexp_like(text, '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}')   and regexp_like(          regexp_replace(text,             '[0-9]{3}([-. ])[0-9]{3}\1[0-9]{4}','***'),          '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}')     EMP_ID TEXT ---------- ------------------------------------------------------------   7369     126 Varnum, Edmore MI 48829, 989 313-5351   7844     989-387.5359   9999     906-387-1698, 313-535.8886

Each of these rows contains at least one apparent phone number that is not correctly formatted.

Discussion

The key to this solution lies in the detection of an "apparent phone number." Given that the phone numbers are stored in a comment field, any text at all in the field could be construed to be an invalid phone number. You need a way to narrow the field to a more reasonable set of values to consider. You don't, for example, want to see the following row in your output:

     EMP_ID TEXT ---------- ----------------------------------------------------------       7900 Cares for 100-year-old aunt during the day. Schedule only            for evening and night shifts.

Clearly there's no phone number at all in this row, much less one that is invalid. You and I can see that. The question is, how do you get the RDBMS to "see" it. I think you'll enjoy the answer. Please read on.

This recipe comes (with permission) from an article by Jonathan Gennick called "Regular Expression Anti-Patterns," which you can read at: http://gennick.com/antiregex.htm.

The solution uses Pattern A to define the set of "apparent" phone numbers to consider:

 Pattern A: [0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}

Pattern A checks for two groups of three digits followed by one group of four digits. Any one of a dash (-), a period (.), or a space are accepted as delimiters between groups. You could come up with a more complex pattern. For example, you could decide that you also wish to consider seven-digit phone numbers. But don't get side-tracked. The point now is that somehow you do need to define the universe of possible phone number strings to consider, and for this problem that universe is defined by Pattern A. You can define a different Pattern A, and the general solution still applies.

The solution uses Pattern A in the WHERE clause to ensure that only rows having potential phone numbers (as defined by the pattern!) are considered:

 select emp_id, text   from employee_comment  where regexp_like(text, '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}')

Next, you need to define what a "good" phone number looks like. The solution does this using Pattern B:

 Pattern B: [0-9]{3}([-. ])[0-9]{3}\1[0-9]{4}

This time, the pattern uses \1 to reference the first subexpression. Whichever character is matched by ([-. ]) must also be matched by \1. Pattern B describes good phone numbers, which must be eliminated from consideration (as they are not bad). The solution eliminates the well-formatted phone numbers through a call to REGEXP_ REPLACE:

 regexp_replace(text,    '[0-9]{3}([-. ])[0-9]{3}\1[0-9]{4}','***'),

This call to REGEXP_REPLACE occurs in the WHERE clause. Any well-formatted phone numbers are replaced by a string of three asterisks. Again, Pattern B can be any pattern that you desire. The point is that Pattern B describes the acceptable pattern that you are after.

Having replaced well-formatted phone numbers with strings of three asterisks (***), any "apparent" phone numbers that remain must, by definition, be poorly formatted. The solution applies REGEXP_LIKE to the output from REGEXP_LIKE to see whether any poorly formatted phone numbers remain:

 and regexp_like(        regexp_replace(text,           '[0-9]{3}([-. ])[0-9]{3}\1[0-9]{4}','***'),        '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}')

This recipe would be difficult to implement without the pattern matching capabilities inherent in Oracle's relatively new regular expression features. In particular, this recipe depends on REGEXP_REPLACE. Other databases (notably PostgreSQL) implement support for regular expressions. But to my knowledge, only Oracle supports the regular expression search and replace functionality on which this recipe depends.