Source Discussion | Visual Basic Developers Guide to ASP and IIS: Build Powerful Server-Side Web Applications with Visual Basic. (Visual Basic Developers Guides)

The source for parsing text into bigram chains and then building sentences from them is surprisingly simple. The bigram is implemented as a simple two-dimensional array. Each dimension is represented by the unique words that were parsed from the text. The first dimension is the first word of a bigram and the second dimension is the word that follows. The contents of the two-dimensional array at this intersection are the number of times that the second word follows the first.

Listing 10.1 contains the symbolic constants and global variables that are used within the program.

Listing 10.1: Symbolic Constants and Global Variables.

 #define MAX_WORD_LEN      40 #define MAX_WORDS         1000 #define FIRST_WORD 0 #define MIDDLE_WORD       1 #define LAST_WORD         2 #define START_SYMBOL      0 #define END_SYMBOL 1 static int curWord = 2; char wordVector[MAX_WORDS][MAX_WORD_LEN]; int  bigramArray[MAX_WORDS][MAX_WORDS]; int  sumVector[MAX_WORDS];

The largest word possible is 40 characters ( MAX_WORD_LEN ) and the maximum number of unique words is 1000 ( MAX_WORDS ). The curWord variable identifies the current word index in the wordVector and bigramArray . Symbols START_SYMBOL and END_SYMBOL represent the indices for the start and end states of our Markov Chain. Finally, the symbols FIRST_WORD , MIDDLE_WORD , and LAST_WORD are used to identify the context of a particular word (and whether it will affect the start or end of the Markov Chain).

The main program for the bigram demonstration is very simple and shown in Listing 10.2. The two major steps performed are parsing the input file that contains the corpus ( parseFile ), and then emitting a sentence based upon the training file ( buildSentence ).

Listing 10.2: Main Program.

 int main( int argc, char *argv[] ) {   char filename[40];   int  debug = 0;   /* Parse the options from the command line */   parseOptions( argv, argc, &debug, filename );   /* Seed the random number generator */   srand(time(NULL));   bzero(bigramArray, sizeof(bigramArray));   strcpy(wordVector[0], "<START>");   strcpy(wordVector[1], "<END>");   /* Parse the Corpus */   parseFile( filename ); if (debug) emitMatrix();   /* Randomly generate a sentence */   buildSentence(); return 0; }

The remaining elements of the main program are parsing command-line options to alter the behavior of the program, and initializing the necessary structures. An important element to note here is the initialization of wordVector . Recall that wordVector is a list of the unique words that were found in the training set. Two symbols are loaded into wordVector to identify the start of a chain and the end. The start of the chain is always defined as element zero of wordVector , and the end of the chain as element one. All subsequent offsets represent unique words found within the corpus.

The program accepts two options, a "-f" to specify a filename for the corpus and a "-v" to enable verbose (or debug) mode. In the debug mode, the bigram array is emitted to show exactly what was parsed from the file (see Listing 10.3).

Listing 10.3: parseOptions Function.

 void parseOptions( char *argv[], int argc, int *dbg, char *fname ) {   int opt, error = 1;   *dbg = 0;   if (argc > 1) {     while ((opt = getopt(argc, argv, "vf:")) != -1) {       switch(opt) {         case 'v':           /* Verbose Mode */           *dbg = 1;           break;         case 'f':           /* Filename option */           strcpy(fname, optarg);           error = 0;           break;         default:           /* Any other option is an error */           error = 1;       }     }   }   if (error) {     printf("\nUsage is : \n\n");     printf("\t%s -f <filename> -v\n\n", argv[0]);     printf("\t\t -f corpus filename\n\t\t -v verbose mode\n\n");     exit(0);   }   return; }

The parseOptions function uses the standard getopt function to simplify the parsing of command-line options.

The parseFile function is used to extract the word relationships from the user -defined corpus (see Listing 10.4).

Listing 10.4: parseFile Function.

 void parseFile( char *filename ) {   FILE *fp;   int  inp, index = 0;   char word[MAX_WORD_LEN+1];   int  first = 1;   fp = fopen(filename, "r");   while (!feof(fp)) {     inp = fgetc(fp);     if (inp == EOF) {       if (index > 0) {         /* For the last word, update the matrix accordingly */         word[index++] = 0;         loadWord(word, LAST_WORD);         index = 0;       }     } else if (((char)inp == 0x0d)  ((char)inp == 0x0a)                 ((char)inp == ' ')) {       if (index > 0) {         word[index++] = 0;         if (first) {           /* First word in a sequence */           loadWord(word, FIRST_WORD);           index = 0;           first = 0;         } else {           /* Middle word of a sequence */           loadWord(word, MIDDLE_WORD);           index = 0;         }       }     } else if (((char)inp == '.')  ((char)inp == '?')) {       /* Handle punctuation by ending the current sequence */       word[index++] = 0;       loadWord(word, MIDDLE_WORD);       loadWord(word, LAST_WORD);       index = 0;       first = 1;     } else {       /* Skip white space */       if (((char)inp != 0x0a) && ((char)inp != ',')) {         word[index++] = (char)inp;       }     }   }   fclose(fp); }

The parseFile function accepts a filename that contains the training data. The goal of the parseFile function is to extract the individual words from the file and then load them into the wordVector and bigramArray structures in a meaningful way (using the loadWord function). By meaningful we mean the loading must take into account the order in which the word was found. Was it the first word in a sentence, the last, or simply a word in the middle? Depending upon the word order, we call the loadWord function with a symbol defining the order in which the word was found. The function also properly parses word context from end-of-file markers, carriage -returns and linefeeds, and a number of punctuation symbols.

The loadWord function, in Listing 10.5, updates the bigram structures for the word and its defined order within the stream.

Listing 10.5: loadWord Function.

 void loadWord( char *word, int order ) {   int wordIndex;   static int lastIndex = START_SYMBOL;   /* First, see if the word has already been recorded */   for (wordIndex = 2 ; wordIndex < curWord ; wordIndex++) {     if (!strcmp(wordVector[wordIndex], word)) {       break;     }   }   if (wordIndex == curWord) {     if (curWord == MAX_WORDS) {       printf("\nToo may words, increase MAX_WORDS\n\n");       exit(-1);     }     /* Doesn't exist, add it in */     strcpy(wordVector[curWord++], word);   }   /* At this point, we have a wordIndex that points to the current        word    * vector.    */   if (order == FIRST_WORD) {     /* Load word as the start of a sequence */     bigramArray[START_SYMBOL][wordIndex]++;     sumVector[START_SYMBOL]++;   } else if (order == LAST_WORD) {     /* Load word as the end of a sequence */     bigramArray[wordIndex][END_SYMBOL]++;     bigramArray[END_SYMBOL][wordIndex]++;     sumVector[END_SYMBOL]++;   } else {     /* Load word as the middle of a sequence */     bigramArray[lastIndex][wordIndex]++;     sumVector[lastIndex]++;   }   lastIndex = wordIndex;   return; }

Function loadWord must first identify whether the current word is unique (never been seen before). A quick scan of the wordVector array provides for this. If the word is new, we create a new slot for it in the wordVector array and update the wordVector limit accordingly ( curWord ).

Now we have the index within the wordVector array for our word. Using the order argument, (which defines the order in which the word appeared within the text), we update the bigramArray . If the word was the first in the sentence, we update the START_SYMBOL row (the first row in the table), and increment the column offset defined by the wordIndex . We also update the sum or the row as defined by the sumVector array (used to calculate the relative probabilities for each word).

If the order of the current word is the last word, then we update the bigramArray using the word as the first dimension of the array and the LAST_SYMBOL as the second dimension. Finally, if the word is in the middle, we use the last word as the first index and the current word as the second index. The last word is always saved within the function within the lastIndex variable. See Figure 10.5 for a representation of a sample sentence.

Figure 10.5: bigramArray for the sample sentence.

This completes the parsing and filling in of the bigram array. The next two functions provide for emitting a sentence using the bigram array as the model.

Function buildSentence walks through the bigramArray structure using the sumVector array to determine which path to take and thus which words to emit (see Listing 10.6).

Listing 10.6: Function buildSentence .

 int buildSentence( void ) {   int word = START_SYMBOL;   int max = 0;   printf("\n");   /* Start with a random word */   word = nextWord(word);   /* Loop until we've reached the end of the random sequence */   while (word != END_SYMBOL) {     /* Emit the current word */     printf("%s ", wordVector[word]);     /* Choose the next word */     word = nextWord(word);     /* Only allow a maximum number of words */     max += getRand(12) + 1;     /* If we've reached the end, break */     if (max >= 100) break;   }   /* Emit a backspace, '.' and a blank line */   printf("%c.\n\n", 8); return 0; }

The buildSentence function creates a path through the bigramArray using the probabilities that are defined by the contents of the bigramArray . Recall that the intersection of two words within the array is the count of the number of times the second word followed the first word. A random count ( max ) is also accrued during the walk to limit the number of words that are emitted.

Function buildSentence uses the nextWord function to determine the next word in the chain (the next step through the Markov path). See Listing 10.7.

Listing 10.7: Function nextWord.

 int nextWord( int word   ) {   int nextWord = (word + 1);   int max = sumVector[word];   int lim, sum = 0;   /* Define a limit for the roulette selection */   lim = getRand(max)+1;   while (nextWord != word) {     /* Bound the limit of the array using modulus */     nextWord = nextWord % curWord;     /* Keep a sum of the occurrences (for the roulette wheel) */     sum += bigramArray[word][nextWord];     /* If we've reached our limit, return the current word */     if (sum >= lim) {       return nextWord;     }     /* For the current word (row), go to the next word column */     nextWord++;   }   return nextWord; }

Function nextWord uses a form of roulette wheel selection to determine the next step based upon the probabilities of the steps. A limit is defined as a random number from one to sumvector for the row (total counts for all words that follow the current). Using the random limit ( lim ), we walk through the row incrementing a sum variable. When the sum variable is greater than or equal to our limit, we use the current word as the next step in the path. Otherwise, we continue to sum until the limit is reached. Probabilistically, this means that words with higher counts will be selected more often for the path.

Finally, a debug function that is used when the program is in debug mode is shown in Listing 10.8 ( emitMatrix ). This function is used to emit the bigramArray as well as the sumVector array.

Listing 10.8: Function emitMatrix .

 void emitMatrix( void ) {   int x, y;   printf("\n");   for (x = 0 ; x < curWord ; x++) {     printf("%20s : ", wordVector[x]);     for ( y = 0 ; y < curWord ; y++) {       printf("%d ", bigramArray[x][y]);     }     printf(" : %d\n", sumVector[x]);   } }