Creating the StringParser Module


In the following sections I am going to step through the process of creating a new module. For this example I want to build on some of the string processing examples I have already used and package them in a module.

The module will provide two functions for parsing (or tokenizing) a string. In the InStr example, I split a string up into words. In this new module I will create functions that can split a string up into words as well as sentences.

The first step in this process is to create a new project in the IDE and add a new module. I'll show you how to do that momentarily.

Create a New Project

In the original example, I used InStr to split up a string into an Array of sentences. Now I want to take that example and create a module that is more flexible and that can split up a string into an Array of words.

Dim punct as String
Dim paragraph as String
Dim isPunct as Integer
Dim position as Integer
Dim chars(-1) as String
Dim x,y as Integer
punct = ".,?!;:[]{}()"
paragraph = "This is a paragraph. It has two sentences!"
chars = Split(paragraph, "")
y = Ubound(chars)
For x = 0 to y
  isPunct = InStr(punct, chars(x))
  If isPunct > 0 then
    position = x // position equals 19
    Exit
  End If
Next


In the original example, I had a local variable, punct, which was a string that held all of the punctuation characters. Because the punctuation characters will never change, it makes more sense for punct to be a constant, so converting it is the first step. A constant also helps with the second goal, which is to make the function that tokenizes the string generic enough to split a string into words as well. Once punct has moved out of the method and into a constant, it is easy to create other constants that contain different sets of characters to split the string with.
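To make the idea of a more generic function concrete, here is a minimal sketch of the original loop with the set of characters passed in as a parameter instead of being held in a local variable. The function name FindFirstDelimiter and its parameter names are assumptions made for illustration; they are not part of the module built in this chapter.

```realbasic
Function FindFirstDelimiter(aString as String, DelimiterList as String) as Integer
  // Hypothetical helper, not part of the StringParser module.
  // Returns the zero-based position of the first character of aString
  // that appears in DelimiterList, or -1 if none of them appears.
  Dim chars(-1) as String
  Dim x, y as Integer
  chars = Split(aString, "")
  y = Ubound(chars)
  For x = 0 to y
    If InStr(DelimiterList, chars(x)) > 0 Then
      Return x
    End If
  Next
  Return -1
End Function
```

Called with the paragraph and punctuation string from the original example, this should return the same position the original loop found. Passing a string of whitespace characters instead would locate the end of the first word, which is the flexibility the constants are meant to provide.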

Creating a Module

Throughout the rest of this chapter, I will step you through the process of creating a module.

If the IDE is already open, select New Project from the File menu. A window will pop up so that you can select the kind of project. Select Desktop Application and click OK. If the IDE is not already open, launch it and it will open to a new desktop application project by default.

After you have the new project in place, you can create a module in the IDE by clicking the Add Module icon in the Project Tab. Whenever a new module is added, it is given the name Module1 by default (if Module1 already exists, it will be called Module2, and so on). To change that name, select the module in the Project Editor on the left side of the window. When you have done this, the Properties pane to the right will list the properties for this particular module; the only module property available is the module name. You should click the name in the Property pane. The IDE will automatically select the entire word for you. At this point, type in a more descriptive name, such as StringParser. You can either press Return or click somewhere else in the IDE and the new name is automatically updated. You will then see your new module in the Project pane, as shown in Figure 2.4.

Figure 2.4. Project window showing the StringParser module.


Double-click the StringParser module in the Project Item pane and a new StringParser Tab will appear revealing an empty module.

The StringParser tab will have a Module Toolbar at the top, with the rest of the area divided into two regions. To the left will be a list of constants, methods, and properties that have been declared for this module, and to the right will be the Code Editor. You can add one of four things to a module: a new constant, a new method, a new property, or a new note. The purpose of a note is purely for documentation. The only programmatic features that you can add to a module are constants, methods, and properties. For starters, we'll have a constant and some methods.

Access Scope

All module members have an access scope, and now that we are creating a module, we need to cover the topic in a little more depth. In the sections on developing classes, I will talk about scope in much more detail because it's an important part of object-oriented programming and the idea of encapsulation. For now, keep in mind that access scope in modules works differently than it does in classes. Module members must be designated one of the following three:

Global

Global scope means that the constant, method, or property can be accessed anywhere in your program at any time by using the member's name, and without reference to the module itself. Shortly, I will add a new constant called kPunctuation. If that constant were given Global scope, I could get its value at any time by referring to the name of the constant, like so:

aVar = kPunctuation 


You can name a constant anything you'd like, as long as it contains legal characters. However, it's a good idea to use a naming convention when naming them so that you can more readily tell constants from other variables in your code. Some people use all uppercase letters, but I find that hard to type and harder to read. Another common approach is to begin the name of the constant with the letter k, followed by the name of the constant. The first letter after the k is capitalized.

Protected

Protected scope means that a member can be accessed anywhere, but only by way of the module name. That means we would have to modify the previous statement to look something like this:

aVar = StringParser.kPunctuation 


Although the constant can still be accessed from anywhere in your programming code, you will have to prepend the module name before calling it.

My advice would be to avoid Global scope if at all possible because of the potential for namespace collisions. In other words, if you create a module with a long list of global constants and properties, you run the risk that some time in the future you will import a module that also uses one of the same names for a global constant or property and this will make the compiler unhappy. Even though you have to do some extra typing when the scope is set to Protected, it reduces the chances of having a collision.

This is the closest thing to the use of namespaces that REALbasic currently has (something I consider to be its biggest weakness to date, and something which will likely be corrected in the future).

Private

A Private member can be accessed only by a member of the same module. If I were to make the kPunctuation constant Private, the only time I could access the value of that constant would be from within a method defined in the module itself and not elsewhere. This would not likely be used with a constant. The only time it would really make sense would be with a method or a property. I'll show you an example of why this is useful when we create the methods for our new module.

Constants

Now that the module has been created, it's time to add some members to it. The first thing to add is the constant kPunctuation. Because it might be useful for other parts of my application to read the value of kPunctuation, there's no need to make it Private. I avoid Global scope in modules as much as possible because it leaves open the possibility of a namespace collision. That leaves us with a scope of Protected for our new constant.

When you create a constant in the IDE, you are provided fields to enter the data so you do not have to declare it in the same way that you declare a local variable. There is a field for the constant name and for the default value of the constant. If you only enter the default value for the constant, that is the default value that will be associated with that constant whenever it is accessed from any location.

However, there is also a ListBox beneath the editing area that allows you to enter values that are specific to particular platforms or languages. Important: these values override whatever value you have established for the constant in the primary editing area. In particular, if you add a new value that is for Any platform and the Default language, this value is the value that will appear when accessed, and not the original value entered in the primary editing area. In fact, those parameters mimic how the value in the primary editing area is used. Use the ListBox to provide platform-specific or language-specific values, and the primary area for more general or universal values.

All text that is used in your application for communicating with the user should be stored as a constant in a module. This means that if you decide to distribute your application in France, you can easily add a French translation of your messages that will be localized for French users. It also ensures consistency in your interface. It's easy for slightly different phrases to be used in different parts of your application, which can create a perception of sloppiness and poor quality.

For now, we will be using only the punctuation characters common to the English language, so we will use those as a default. If, at some later time, we decide we need to add additional characters for other languages, we can simply come back and add them here.

Name the constant kPunctuation in the Constant Name field, and type .,?!()[] into the Default Value field. To the right of the Constant Name are three buttons. Select the button in the middle, with a yellow triangle. This sets the scope to Protected. Below the Default Value field, check to see that the string button is selected because the data type of our constant will be a string (it should be selected by default).

There you have it. Your module now has a constant as shown in Figure 2.5.

Figure 2.5. The kPunctuation constant added to your StringParser module.


Properties

Although I won't be adding a property to the module at this time, remember that module properties have scope, much like class properties have. Global module properties are accessible by all parts of your application. If your application has preferences, making those preferences available through global module properties can be convenient.

Protected module properties are available everywhere as well. The difference is that you have to use the module name when accessing the property.

Private module properties are available only to methods of the module.

Adding a Method

The next step in the process of creating our module is to add the methods that will be used to generate the tokenized string. The way the method will work is that it will take a string as a parameter and then return the string tokenized into an Array. Because it will return a value, we will be creating a function. The next question is to decide what the scope of the function should be. In this particular example, you can probably approach this a few ways and do just fine, but what I will do is create one function that is the generic function used to tokenize all the strings. It will take the string to be tokenized as a parameter, as well as a string that represents the characters to split the string with. If we are tokenizing the string into sentences, the kPunctuation constant will be the string we pass here.

I want to set the scope of this method to Private so that only methods defined in this module will be able to access it. In a moment, I will also create a method that is Protected in scope that will be used specifically for tokenizing strings into sentences.

Because my first method is Private, I don't need to worry about any naming conflicts, so I will name the method Split, because that is descriptive. Note that there's already a global function named Split, so I would possibly have problems if I set the scope of this method to be Global. Protected would avoid problems as well, but for now I'll stick with Private. The reason I say possibly have problems is because methods can exist with the same name as long as their signatures are different, meaning as long as they take a different set of parameters. This is an example of polymorphism and is explained in more detail in the object-oriented programming section.

Adding the "Split" Method

To add a method in the IDE, first get back to the Project tab and then double-click the StringParser module. Now, instead of adding a constant, select Add Method.

When you are adding a method, the interface is similar to that used when adding a constant. There is a field for entering the method name, the parameters that will be passed to the method and the return type. If the method is to be a function, enter something into the return type field; otherwise, leave it blank. There are also the same three buttons next to the name that allow you to set the scope of the method.

Type Split into the method name field and select the Private button, which is a red circle with a horizontal line in the center. For parameters, type in the following:

aString as String, DelimiterList as String 


In the return type field, type

String() 


You don't need to name the variable in the return type field; you just need to specify the type. In this case, we will return a string Array, so we type in String(). If the value were just a simple string, we would only have to type in String.

You have now created the signature for the method. Right above the method name field, you should see

Private Function Split(aString as String, DelimiterList as String) as String() 


Our added method at this point is shown in Figure 2.6.

Figure 2.6. The Split method is added to StringParser, but it still lacks the code to do its work.


Now, in the lower portion of the Editor you can write the code for this method, which follows:

Dim chars(-1) as String
Dim char_buffer(-1) as String
Dim word_buffer(-1) as String
Dim x,y as Integer
Dim pos as Integer
Dim prev as Integer
Dim tmp as String
// Use the complete name for the global Split function
// to avoid naming conflicts
chars = REALbasic.Split(aString, "")
y = ubound(chars)
prev = 0
for x = 0 to y
  pos = DelimiterList.inStr(chars(x))
  // If inStr returns a value greater than 0, then this character is a delimiter
  if pos > 0 then
    word_buffer.append(join(char_buffer,""))
    prev = x+1
    redim char_buffer(-1)
  else
    char_buffer.append(chars(x))
  end if
next
// get the final word
word_buffer.append(join(char_buffer,""))
return word_buffer
exception err
  break


In addition to Split, I will add another method called SplitSentences. This method will be Protected and will call the Private Split method. Following the steps outlined above, create the following method in the IDE:

Protected Function splitSentences(aString as String) As String()
  return StringParser.split(aString, StringParser.kPunctuation)
End Function


Now, anytime you want to take a string and split it up into individual sentences, just call the splitSentences function, like so:

Dim anArray(-1) as String
Dim s1, s2 as String
anArray = StringParser.splitSentences("This is a sentence. So is this.")
s1 = anArray(0) // s1 equals "This is a sentence"
s2 = anArray(1) // s2 equals " So is this"
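Because Split always appends whatever is left in char_buffer after the loop finishes, a string that ends with a delimiter produces one extra, empty element at the end of the Array, and each sentence after the first keeps the space that followed the delimiter. The following is a minimal sketch of how calling code might tidy the results; the variable names are illustrative only, and Trim and MsgBox are the built-in functions.

```realbasic
Dim sentences(-1) as String
Dim sentence as String
Dim x as Integer

sentences = StringParser.splitSentences("This is a sentence. So is this.")
For x = 0 to Ubound(sentences)
  // Trim removes the leading space left behind after each delimiter
  sentence = Trim(sentences(x))
  // Skip the empty element that follows a trailing delimiter
  If sentence <> "" Then
    MsgBox sentence
  End If
Next
```

An alternative would be to filter out empty strings inside Split itself, but leaving that decision to the caller keeps the Private function generic.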


I will also add a splitWords method that works just like splitSentences except that it splits the strings on whitespace rather than on punctuation. I need to create another constant, kWhiteSpace, that is a string containing space, tab, newline, and carriage return characters. Then I create the splitWords method, like so:

Protected Function splitWords(aString as String) As String()
  return StringParser.split(aString, StringParser.kWhiteSpace)
End Function
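The text does not show how the tab, newline, and carriage return characters get into kWhiteSpace. If entering them in the Default Value field proves awkward, one possible alternative (an assumption on my part, not the approach used in this chapter) is a Private function in the module that builds the same string with the built-in Chr function. The name WhiteSpaceChars is hypothetical.

```realbasic
Private Function WhiteSpaceChars() as String
  // Hypothetical stand-in for the kWhiteSpace constant.
  // Chr(32) = space, Chr(9) = tab,
  // Chr(10) = newline, Chr(13) = carriage return
  Return Chr(32) + Chr(9) + Chr(10) + Chr(13)
End Function
```

With this in place, the word-splitting function would pass WhiteSpaceChars() as the delimiter list in place of StringParser.kWhiteSpace.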


Now that we have a splitWords method, we can do even more. Just for fun, let's create a method that will capitalize a sentence in a way that will win the approval of English teachers everywhere (or will at least be better than the built-in TitleCase method). I'll use the set of guidelines I found in my dusty and possibly out-of-date AP Stylebook, which says that words of four letters or more should be capitalized, while words fewer than four letters long should not be capitalized unless they are the first or last word in the title. A good copyeditor will tell you that the rules are actually a little more complicated than that, but this is sufficient for my purposes.

I will make this function Global, so that it can be accessed in the same way that the regular TitleCase function is accessed.

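Function TitleCaseAP(sourceString as String) as String
  Dim word as String
  Dim words(-1) as String
  Dim x,y as Integer
  words = StringParser.splitWords(sourceString)
  y = Ubound(words)
  For x = 0 to y
    word = words(x)
    If Len(word) >= 4 Then
      // Capitalize only the first letter; UpperCase(word) on its own
      // would convert the entire word to capitals
      words(x) = UpperCase(Left(word, 1)) + LowerCase(Mid(word, 2))
    Else
      If (x=0) Or (x=y) Then
        // Short words are capitalized only at the start or end of the title
        words(x) = UpperCase(Left(word, 1)) + LowerCase(Mid(word, 2))
      Else
        words(x) = LowerCase(word)
      End If
    End If
  Next
  Return Join(words, " ")
End Function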


There you have it. The one shortcoming is that it does not account for titles with colons or semicolons which, when encountered, should cause the previous word to be capitalized as if it were the end of the sentence and the following word as if it were the beginning of one. If you recall, the splitWords() function doesn't strip out punctuation characters, so it would be easy to add a test for words that are fewer than four characters long. The test would be to check to see if the word ends with a colon or a semicolon. If it does, you can capitalize the word. You would also need to account for capitalizing the following word, too. The easiest (but not the most efficient) way would be to test both the current word and the previous word to see if either ends with a colon or a semicolon, but that means you are testing the same word twice. A better way would be to declare a variable and set its value so that you would know if the current word directly followed a colon or semicolon. You could declare a variable as an integer, name it previousColon, and set its initial value to -1. When you encounter a colon or semicolon, set the value of previousColon to the value of x. During each loop (before testing the new word to see if it ends with a colon), check the value of previousColon and whether it is equal to x minus 1. If it is, then you know this word should be capitalized as well.
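The previousColon approach just described might look something like the following. Treat it as a sketch under stated assumptions rather than a finished implementation: the function name TitleCaseAPColons is hypothetical, the capitalize variable is my own addition, and I am assuming that capitalizing a word means upper-casing its first letter and lower-casing the rest.

```realbasic
Function TitleCaseAPColons(sourceString as String) as String
  // Hypothetical variant of TitleCaseAP that also handles
  // words next to a colon or semicolon.
  Dim word, lastChar as String
  Dim words(-1) as String
  Dim x, y as Integer
  Dim previousColon as Integer
  Dim capitalize as Boolean

  previousColon = -1
  words = StringParser.splitWords(sourceString)
  y = Ubound(words)
  For x = 0 to y
    word = words(x)
    lastChar = Right(word, 1)
    // Long words and the first and last words are always capitalized
    capitalize = (Len(word) >= 4) Or (x = 0) Or (x = y)
    // A word that directly follows a colon or semicolon is capitalized
    If (x > 0) And (previousColon = x - 1) Then
      capitalize = True
    End If
    // A word that ends with a colon or semicolon is capitalized,
    // and its position is remembered for the next pass
    If (lastChar = ":") Or (lastChar = ";") Then
      capitalize = True
      previousColon = x
    End If
    If capitalize Then
      words(x) = Uppercase(Left(word, 1)) + Lowercase(Mid(word, 2))
    Else
      words(x) = Lowercase(word)
    End If
  Next
  Return Join(words, " ")
End Function
```

Each word is examined only once, and the check against previousColon happens before the current word is tested for a trailing colon, as the paragraph above suggests.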




REALbasic Cross-Platform Application Development
ISBN: 0672328135
Year: 2004
Pages: 149
