Pattern Matching | XML and SOAP Programming for BizTalk(TM) Servers (DV-MPS Programming)


OmniMark allows you to use scanning to search for patterns in input data. For example, the following find rule will fire if the string "Hamlet:" is encountered in the input stream:

 find "Hamlet:" output "<b>Hamlet</b>: "

Using this approach, however, requires you to write a separate find rule for each character name you want to enclose in HTML bold tags, as in the following example:

 find "Hamlet:" output "<b>Hamlet</b>: "
 find "Horatio:" output "<b>Horatio</b>: "
 find "Bernardo:" output "<b>Bernardo</b>: "

This is where OmniMark patterns come in. OmniMark has rich, built-in pattern-matching capabilities that allow you to match strings through a more abstract string model rather than by matching a specific string, as in the following example. This find rule will match any string that contains any number of letters followed immediately by a colon:

 find letter+ ":"

Unfortunately, the pattern described in this find rule isn't specific enough to flawlessly match only character names. It will match any string of letters followed by a colon that appears anywhere in the text, including words in the middle of sentences.

Words that appear in the middle of sentences rarely begin with an uppercase letter, whereas proper names usually do. This knowledge allows us to add further detail to our find rule. This find rule matches any string that begins with an uppercase letter (uc) followed by at least one other letter (letter+) and a colon (:):

 find uc letter+ ":"

If we were actually trying to mark up an ASCII copy of Hamlet, however, our find rule would only correctly match character names that consist of a single word, such as Hamlet, Ophelia, or Horatio. For two-part names such as Queen Gertrude and Lord Polonius, only the second word would be matched, so those names would be marked up incorrectly.

To match these more complex names as well as the single-word names, we'll have to further refine our find rule:

 find uc letter+ (white-space+ uc letter+)? ":"

In this version of the find rule, the pattern can match a second word prior to the colon. The pattern (white-space+ uc letter+)? can match one or more white-space characters followed by an uppercase letter and one or more letters. These changes to the find rule allow it to match character names that consist of one or two words.
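For readers who know regular expressions better than OmniMark patterns, the refined rule can be approximated in Python. This is only a sketch: the name SPEAKER and the function bold_speakers are hypothetical, letter is assumed to mean the ASCII letters A-Z and a-z, and Python's regex engine is not guaranteed to behave exactly like OmniMark's scanner.

```python
import re

# Rough analogue of: find uc letter+ (white-space+ uc letter+)? ":"
# Assumption: "letter" and "uc" are taken to mean ASCII letters only.
SPEAKER = re.compile(r"([A-Z][A-Za-z]+(?:\s+[A-Z][A-Za-z]+)?):")


def bold_speakers(text: str) -> str:
    """Wrap each one- or two-word speaker name in HTML bold tags."""
    return SPEAKER.sub(r"<b>\1</b>:", text)


sample = (
    "Hamlet: Madam, how like you this play?\n"
    "Queen Gertrude: The lady doth protest too much, methinks."
)
print(bold_speakers(sample))
```

Both the single-word name and the two-word name are wrapped as a whole, which is the behavior the refined find rule is meant to achieve.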

If you want to match a series of three digits, use the following pattern:

 find digit {3}

If you want to match either a four-digit or five-digit number, use the following pattern:

 find digit {4 to 5}

To match a date that occurs in the yy/mm/dd format, use the following pattern:

 find digit {2} "/" digit {2} "/" digit {2}

Match a Canadian postal code with the following pattern:

 find letter digit letter space digit letter digit
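The counted patterns above have close counterparts in regular-expression syntax. The following Python sketch shows one possible translation; the constant names are hypothetical, and digit and letter are assumed to mean the ASCII ranges 0-9, A-Z, and a-z.

```python
import re

# Each constant mirrors one of the OmniMark patterns shown above.
THREE_DIGITS = re.compile(r"[0-9]{3}")                      # digit {3}
FOUR_OR_FIVE = re.compile(r"[0-9]{4,5}")                    # digit {4 to 5}
DATE_YYMMDD = re.compile(r"[0-9]{2}/[0-9]{2}/[0-9]{2}")     # yy/mm/dd
POSTAL_CODE = re.compile(r"[A-Za-z][0-9][A-Za-z] [0-9][A-Za-z][0-9]")

for candidate in ["123", "12345", "99/12/31", "K1A 0B1"]:
    # Report which of the patterns match the whole candidate string.
    hits = [
        name
        for name, pattern in [
            ("three digits", THREE_DIGITS),
            ("four or five digits", FOUR_OR_FIVE),
            ("date", DATE_YYMMDD),
            ("postal code", POSTAL_CODE),
        ]
        if pattern.fullmatch(candidate)
    ]
    print(candidate, "->", hits)
```

Like the OmniMark patterns, these expressions check only the shape of the data. They would accept 99/99/99 as a date, so any range checking has to happen separately.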

The keywords used to create the preceding patterns, such as letter, uc, digit, space, and white-space, are named character classes. OmniMark provides a variety of built-in character classes:

letter Matches a single letter character, uppercase or lowercase

uc Matches a single uppercase letter

lc Matches a single lowercase letter

digit Matches a single digit (0 through 9)

space Matches a single space character

blank Matches a single space or tab character

white-space Matches a single space, tab, or newline character

any-text Matches any single character except for a newline character

any Matches any single character

You can also define your own customized character classes. For example, the following find rule fires when any one of the four arithmetic operators is encountered in the input data:

 find ["+-*/"] output "found an arithmetic operator%n"

You can define character classes by exclusion using the except operator (\). The following find rule would match any character except for a right brace:

 find [\ "}"]

You can use the except operator with a built-in character class. The following find rule matches any consonant:

 find [letter \ "aeiouAEIOU"]

You can add string and built-in character classes together with the or operator to create new character classes. The following find rule matches any one of the arithmetic operators or a single digit:

 find ["+-*/" | digit]

The following find rule matches any of the arithmetic operators or any digit except zero ("0"):

 find ["+-*/" | digit \ "0"]
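As a point of comparison, the user-defined classes above can be approximated with regular-expression character sets. In this Python sketch the constant names are hypothetical. Python's re module has no direct counterpart to the except operator, so the consonant class is written with a negative lookahead, which is a substitute technique rather than a translation of the OmniMark syntax.

```python
import re

OPERATOR = re.compile(r"[+\-*/]")                    # ["+-*/"]
NOT_RIGHT_BRACE = re.compile(r"[^}]")                # [\ "}"]
# letter except vowels: reject a vowel first, then accept any ASCII letter.
CONSONANT = re.compile(r"(?![aeiouAEIOU])[A-Za-z]")  # [letter \ "aeiouAEIOU"]
OPERATOR_OR_DIGIT = re.compile(r"[+\-*/0-9]")        # ["+-*/" | digit]
OPERATOR_OR_NONZERO = re.compile(r"[+\-*/1-9]")      # ["+-*/" | digit \ "0"]

expression = "a0+b7"
# Keep only the characters that are operators or nonzero digits.
kept = "".join(ch for ch in expression if OPERATOR_OR_NONZERO.fullmatch(ch))
print(kept)
```

Each expression matches exactly one character, just as a character class does in an OmniMark pattern.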

You can use the following occurrence operators to modify any pattern:

+ One or more

* Zero or more

? Zero or one

** Zero or more up to

++ One or more up to

As you saw earlier in this appendix, letter+ matches one or more letters. In the same way, letter* matches zero or more letters, and uc? matches zero or one uppercase letter.

OmniMark pattern matching is greedy. The following rule will never fire:

 find "<table>" any* "</table>"

This rule will never fire because any* will match the entire input, including any occurrence of the string "</table>". Because the "</table>" part of the pattern can never be matched, the whole pattern will always fail.

To write a pattern that matches characters up to a specific delimiter, use the ** or ++ occurrence indicators:

 find "<table>" any** "</table>"

Here any** matches zero or more characters up to the string "</table>".

The following rule will only fire if there is at least one character between "<table>" and "</table>" in the input:

 find "<table>" any++ "</table>"
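The closest regular-expression counterparts to ** and ++ are the lazy quantifiers *? and +?. The Python sketch below uses hypothetical names and comes with one important caveat: Python's regex engine backtracks, so a greedy <table>.*</table> would still succeed there (it would stretch to the last "</table>"), unlike the any* rule described above. The sketch therefore illustrates only the "up to" behavior.

```python
import re

# any** "</table>": zero or more characters up to the closing tag.
TABLE = re.compile(r"<table>(.*?)</table>", re.DOTALL)
# any++ "</table>": at least one character up to the closing tag.
NONEMPTY_TABLE = re.compile(r"<table>(.+?)</table>", re.DOTALL)

filled = "<table>row</table>"
empty = "<table></table>"

print(TABLE.search(filled).group(1))          # content between the tags
print(TABLE.search(empty).group(1) == "")     # empty content is accepted
print(NONEMPTY_TABLE.search(empty) is None)   # empty content is rejected
```

The re.DOTALL flag lets the dot match newline characters as well, which corresponds to using any rather than any-text.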

You can use lookahead in a pattern to see whether a pattern exists without consuming it:

 find any++ lookahead "</table>"

This pattern will match any number of characters from the current point up to the occurrence of the string "</table>", but will not consume "</table>", leaving it in the input where other rules can match it. Note that at least one character must precede "</table>" or this rule will either fail or match all the way up to the next "</table>" in the data, if there is one.

The following rule eliminates this problem:

 find any** lookahead "</table>"

However, this rule will fire twice if it fires at all: once when it matches all the characters up to "</table>" and again when it consumes zero characters followed by "</table>" (its location in the source following the first match). This rule could potentially go on matching zero characters followed by "</table>" forever, without ever moving forward. However, OmniMark does not permit two consecutive matches that consume zero data, so the rule will not fire again until more data has been consumed.

Both the * and ** occurrence operators can create patterns that match without consuming any data. OmniMark will never permit two consecutive matches that do not consume data, but if you see rules or matches fire twice when you expected them to fire only once, it is probably because the second firing is a zero-length match permitted by the use of * or **.
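A similar double match can be observed with a lazy quantifier and a lookahead in Python. This is an analogy only, since Python's rules for handling empty matches are its own and are not claimed to be identical to OmniMark's. The names UP_TO_CLOSE and spans are hypothetical.

```python
import re

# Rough analogue of: find any** lookahead "</table>"
UP_TO_CLOSE = re.compile(r".*?(?=</table>)", re.DOTALL)

data = "abc</table>xyz"

# Record the (start, end) offsets of every match found while scanning.
spans = [m.span() for m in UP_TO_CLOSE.finditer(data)]
for start, end in spans:
    print(start, end, repr(data[start:end]))
```

The scan reports two matches: one covering "abc", and then a zero-length match at the position where "</table>" begins. This is the same shape of result that the paragraph above warns about.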