Streaming Programming Language | XML and SOAP Programming for BizTalk(TM) Servers (DV-MPS Programming)


A streaming programming language like OmniMark is designed to let the programmer process data streams directly and produce output streams immediately, minimizing the need to build intermediate data structures. Streaming languages are particularly useful in Internet or intranet environments, in which streaming data is ubiquitous.

Streaming languages have three principal advantages for network programming: efficiency, robustness, and productivity. Streaming languages are efficient because they minimize data copying and optimize data movement. They favor robustness because they reduce the use of variables and intermediate data structures, thus minimizing the places where errors can occur. Streaming languages enhance productivity because they foster a process-oriented programming style in which the program is a clear and direct description of the process being applied to the data.
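To make the idea of avoiding intermediate data structures concrete, here is a rough sketch in Python rather than OmniMark. The helper names chunks and count_words are invented for this illustration, and the fixed-size chunks merely stand in for data arriving over a network. The counter keeps a single flag between chunks and never assembles the whole input or a list of words.

```python
from typing import Iterable, Iterator


def chunks(text: str, size: int) -> Iterator[str]:
    """Yield the text in fixed-size pieces, standing in for a network stream."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


def count_words(stream: Iterable[str]) -> int:
    """Count runs of ASCII letters without ever holding the whole input."""
    count = 0
    in_word = False  # the only state carried from one chunk to the next
    for chunk in stream:
        for ch in chunk:
            is_letter = ch.isascii() and ch.isalpha()
            if is_letter and not in_word:
                count += 1  # a new run of letters has just started
            in_word = is_letter
    return count


print(count_words(chunks("A duck walks into a bar", 4)))  # prints 6
```

Because the flag survives chunk boundaries, a word that is split across two chunks is still counted once.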

As a streaming language, OmniMark handles input and output at the core of the language. OmniMark abstracts all data sources and data destinations so that all programming is done against generic OmniMark source objects and all output flows to generic OmniMark streams. Programmers attach sources and streams to data sources and destinations through either the core language or OmniMark extension (OMX) components. The data processing and output creation are entirely independent of the kind of source or destination to which the data stream is attached.

OmniMark maintains a current input and a current output. To process a source, you make that source the current input. To output to a destination, you make the stream attached to that destination the current output. This input-output model greatly simplifies processing, since you never need to specify what data an action operates on or what destination output goes to. You simply attach the appropriate streams and process the data as it flows. You can process data either by scanning or by parsing. Scanning employs OmniMark's sophisticated pattern-matching capabilities. Parsing employs OmniMark's integrated XML or SGML parser.
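The notion of a current output has a loose parallel in Python's standard library, which may help readers coming from other languages. The sketch below is only an analogy, not a description of how OmniMark is implemented; the function name report is made up for the example. The action writes to whatever the current output happens to be, and the caller decides where that output is attached.

```python
import io
import sys
from contextlib import redirect_stdout


def report(word_count: int) -> None:
    """Write to the current output; the action never names a destination."""
    sys.stdout.write(f"{word_count}\n")


# With nothing attached explicitly, output goes to standard output.
report(6)

# Attach a different destination, then run the same action unchanged.
buffer = io.StringIO()
with redirect_stdout(buffer):
    report(6)

captured = buffer.getvalue()  # the text that went to the attached buffer
```

The action itself is identical in both calls; only the attachment around it changes.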

An OmniMark program consists of rules. A process rule initiates processing: you use a process rule to establish the current input and output and to initiate parsing or scanning. You use find rules for scanning. A find rule consists of a pattern to be matched in the data and a set of actions to be performed when that pattern is found. A variety of other scanning tools are also available for local scanning, the manipulation of variables, and dealing with various aspects of markup. Markup rules are used in parsing. Markup rules include element rules, which are fired when a parser encounters an element in XML or SGML data; data-content rules, which you use to process the data content of an element; and markup-error rules, which you use to catch and process errors in the markup.

The following simple OmniMark program counts the words in the text "A duck walks into a bar":

    global integer word-count initial {0}

    process
       submit "A duck walks into a bar"
       output "d" % word-count || "%n"

    find letter+
       increment word-count

The program consists of two rules and a global variable declaration. The process rule is fired when the program runs. The submit statement establishes the current input as the text "A duck walks into a bar" and initiates scanning.

The find rule is activated by the submit statement. It uses the pattern letter+ to match any sequence of letters. In this pattern, the word letter is a character class representing all the uppercase and lowercase letters of the alphabet. The plus sign (+) is a repetition indicator. Together, they say "one or more letters."
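For readers more familiar with regular expressions, the pattern letter+ corresponds roughly to the expression below. This is a Python approximation, and it assumes that the letter class covers only the unaccented letters A to Z in both cases.

```python
import re

# Assumption: OmniMark's `letter` class is approximated by ASCII letters.
LETTER_PLUS = re.compile(r"[A-Za-z]+")  # one or more letters

words = LETTER_PLUS.findall("A duck walks into a bar")
print(words)       # ['A', 'duck', 'walks', 'into', 'a', 'bar']
print(len(words))  # 6
```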

Each time the find rule fires, it increments the global variable word-count. The find rule will fire once for each word (that is, each uninterrupted sequence of letters) in the current input. Once scanning is complete, execution moves on to the next statement in the process rule. This is the output statement that outputs the word count. Since the current output has not been established explicitly, output goes to the default current output, which is standard output, or standard out. If you run the program in the OmniMark Integrated Development Environment (IDE), the output will appear in the log window. If you run it on the command line, it will appear on the screen.

The output statement uses the format operator (%) to convert the value of the integer word-count to a string; it uses the concatenation operator (||) to add a new line, represented by "%n".

To try this program, type it into the OmniMark IDE and click Run. In this program, the order in which the rules appear doesn't matter, since each rule fires only when a specific event occurs. Thus we could just as easily write the program like this:

    global integer wordcount initial {0}

    find letter+
       increment wordcount

    process
       submit "A duck walks into a bar"
       output "d" format wordcount || "%n"

This program runs the same way as the first program. This doesn't mean that the order of rules never matters in an OmniMark program. If more than one find rule can match at the same point in the input, only the rule that appears first fires; the later ones do not. This allows you to put more specific rules before more general rules and have the general rules fire only if the specific rules do not. Take a look at this:

    global integer wordcount initial {0}

    process
       submit "A duck walks into a bar"
       output "d" format wordcount || "%n"

    find "duck"
       output "*"

    find letter+
       increment wordcount

    find any

The preceding program prints "*5". The following program changes the order of the find rules and produces a different output:

    global integer wordcount initial {0}

    process
       submit "A duck walks into a bar"
       output "d" format wordcount || "%n"

    find letter+
       increment wordcount

    find "duck"
       output "*"

    find any

This program prints "6".
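The effect of rule order can be modeled with a small Python sketch. It is a simplified model of first-match-wins scanning, not OmniMark's actual engine, and every name in it (scan, DUCK, WORD, ANY, and the action functions) is hypothetical. At each position in the input the rules are tried in the order given, the first one that matches fires, and scanning resumes after the matched text.

```python
import re
from typing import Callable, Dict, List, Tuple

Action = Callable[[Dict[str, int], List[str]], None]
Rule = Tuple[str, Action]  # (regular expression, action to run on a match)


def scan(text: str, rules: List[Rule]) -> str:
    """Try the rules in order at each position; the first match wins."""
    state = {"wordcount": 0}
    out: List[str] = []
    compiled = [(re.compile(pattern), action) for pattern, action in rules]
    pos = 0
    while pos < len(text):
        for pattern, action in compiled:
            m = pattern.match(text, pos)
            if m and m.end() > pos:
                action(state, out)
                pos = m.end()
                break  # later rules get no chance at this position
        else:
            out.append(text[pos])  # unmatched input passes through
            pos += 1
    out.append(f"{state['wordcount']}\n")  # the final output statement
    return "".join(out)


def star(state: Dict[str, int], out: List[str]) -> None:
    out.append("*")


def count(state: Dict[str, int], out: List[str]) -> None:
    state["wordcount"] += 1


def ignore(state: Dict[str, int], out: List[str]) -> None:
    pass


DUCK: Rule = ("duck", star)         # models: find "duck"
WORD: Rule = ("[A-Za-z]+", count)   # models: find letter+
ANY: Rule = ("(?s).", ignore)       # models: find any

TEXT = "A duck walks into a bar"
print(repr(scan(TEXT, [DUCK, WORD, ANY])))  # '*5\n'
print(repr(scan(TEXT, [WORD, DUCK, ANY])))  # '6\n'
```

In the second call the general word rule comes first, so it consumes "duck" before the specific rule is ever tried.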

Why did I add find any as a new rule in both programs? Because it fixes an error in the first version of the program. The find letter+ rule matches words, but nothing matches the spaces between them. If you actually ran the first program, you might have noticed that the result it printed was indented by five spaces. That indentation is the five unmatched spaces that separate the six words of the input: any input not matched by a find rule goes straight through to the current output. A find any rule at the end of a set of find rules is a sponge rule that soaks up any unmatched input. Because find any matches every character and the first matching rule wins, it must always be the last find rule; placed earlier, it would keep the rules after it from ever firing.
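The pass-through behavior can be imitated in Python to see exactly what reaches the output with and without a sponge rule. This is an illustrative model only; the function run_word_count and its sponge flag are inventions for this sketch and have no counterpart in OmniMark's syntax.

```python
import re


def run_word_count(text: str, sponge: bool) -> str:
    """Count letter runs; without a sponge, unmatched text reaches the output."""
    out = []
    count = 0
    pos = 0
    for m in re.finditer(r"[A-Za-z]+", text):
        if not sponge:
            out.append(text[pos:m.start()])  # unmatched input passes through
        count += 1
        pos = m.end()
    if not sponge:
        out.append(text[pos:])  # anything left after the last word
    out.append(f"{count}\n")
    return "".join(out)


print(repr(run_word_count("A duck walks into a bar", sponge=False)))  # '     6\n'
print(repr(run_word_count("A duck walks into a bar", sponge=True)))   # '6\n'
```

Printing with repr makes the leading spaces visible, which is easy to miss in a log window.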