9.4. Word Frequency Counting


9.4. Word Frequency Counting

This program demonstrates several basic programming techniques, including string and character manipulation, file input/output, data structure manipulation, and recursion. The program is adapted from Chapter 6 of The C Programming Language [15]. One reason for using this particular example is to show how a C program might look when converted almost literally into Scheme.

A few differences between the Scheme program and the original C program are worth noting. First, the Scheme version employs a different protocol for file input and output. Rather than implicitly using the standard input and output ports, it requires that filenames be passed in, thus demonstrating the opening and closing of files. Second, the procedure get-word returns one of three values: a string (the word), a nonalphabetic character, or an eof value. The original C version returned a ag for letter (to say that a word was read) or a nonalphabetic character. Furthermore, the C version passed in a string to fill and a limit on the number of characters in the string; the Scheme version builds a new string of whatever length is required (the characters in the word are held in a list until the end of the word has been found, then converted into a string with list->string). Finally, char-type uses the primitive Scheme character predicates char-alphabetic? and char-numeric? to determine whether a character is a letter or digit.

The main program, frequency, takes an input filename and an output filename as arguments, e.g., (frequency "pickle" "freq.out") prints the frequency count for each word in the file "pickle" to the file "freq.out." As frequency reads words from the input file, it inserts them into a binary tree structure (using a binary sorting algorithm). Duplicate entries are recorded by incrementing the count associated with each word. Once end of file is reached, the program traverses the tree, printing each word with its count.

Assume that the file "pickle" contains the following text.

 Peter Piper picked a peck of pickled peppers; A peck of pickled peppers Peter Piper picked. If Peter Piper picked a peck of pickled peppers, Where's the peck of pickled peppers Peter Piper picked? 

Then, after typing (frequency "pickle" "freq.out"), the file "freq.out" should contain the following.

 1 A 1 If 4 Peter 4 Piper 1 Where 2 a 4 of 4 peck 4 peppers 4 picked 4 pickled 1 s 1 the 

(On some systems, the capitalized words may appear after the others.)

 ;;; If the next character on p is a letter, get-word reads a word ;;; from p and returns it in a string. If the character is not a ;;; letter, get-word returns the character (on eof, the eof-object). (define get-word   (lambda (p)     (let ((c (read-char p)))       (if (eq? (char-type c) 'letter)           (list->string             (let loop ((c c))               (cons c                 (if (memq (char-type (peek-char p)) '(letter digit))                     (loop (read-char p))                     '()))))        c)))) ;;; char-type tests for the eof-object first, since the eof-object ;;; may not be a valid argument to char-alphabetic? or char-numeric? ;;; It returns the eof-object, the symbol letter, the symbol digit, ;;; or the argument itself if it is not a letter or digit. (define char-type   (lambda (c)     (cond       ((eof-object? c) c)       ((char-alphabetic? c) 'letter)       ((char-numeric? c) 'digit)       (else c)))) ;;; Trees are represented as vectors with four fields: word, left, ;;; right, and count. Only one field, word, is initialized by an ;;; argument to the constructor procedure make-tree. The remaining ;;; fields are explicitly initialized and changed by subsequent ;;; operations. Most Scheme systems provide structure definition ;;; facilities that automate creation of structure manipulation ;;; procedures, but we simply define the procedures by hand here. (define make-tree   (lambda (word)     (vector word '() '() 1))) (define tree-word (lambda (tree) (vector-ref tree 0))) (define tree-left (lambda (tree) (vector-ref tree 1))) (define set-tree-left!   (lambda (tree new-left)     (vector-set! tree 1 new-left))) (define tree-right (lambda (tree) (vector-ref tree 2))) (define set-tree-right!   (lambda (tree new-right)     (vector-set! tree 2 new-right))) (define tree-count (lambda (tree) (vector-ref tree 3))) (define set-tree-count!   (lambda (tree new-count)     (vector-set! tree 3 new-count))) ;;; If the word already exists in the tree, tree increments its ;;; count. Otherwise, a new tree node is created and put into the ;;; tree. In any case, the new or modified tree is returned. (define tree   (lambda (node word)     (cond       ((null? node) (make-tree word))       ((string=? word (tree-word node))        (set-tree-count! node (+ (tree-count node) 1))        node)       ((string<? word (tree-word node))        (set-tree-left! node (tree (tree-left node) word))        node)       (else        (set-tree-right! node (tree (tree-right node) word))       node)))) ;;; tree-print prints the tree in "in-order," i.e., left subtree, ;;; then node, then right subtree. For each word, the count and the ;;; word are printed on a single line. (define tree-print   (lambda (node p)     (if (not (null? node))         (begin           (tree-print (tree-left node) p)           (write (tree-count node) p)           (write-char #\space p)           (display (tree-word node) p)           (newline p)           (tree-print (tree-right node) p))))) ;;; frequency is the driver routine. It opens the files, reads the ;;; words, and enters them into the tree. When the input port ;;; reaches end-of-file, it prints the tree and closes the ports. (define frequency   (lambda (infn outfn)     (let ((ip (open-input-file infn))          (op (open-output-file outfn)))       (let loop ((root '()))         (let ((w (get-word ip)))           (cond              ((eof-object? w) (tree-print root op))              ((string? w) (loop (tree root w)))              (else (loop root)))))       (close-input-port ip)       (close-output-port op)))) 

Exercise 9.4.1.

start example

Replace the procedures used to implement the tree datatype with a record or structure definition, using the facilities provided by the Scheme system you are using or define-structure from Section 8.4.

end example

Exercise 9.4.2.

start example

In the output file shown earlier, the capitalized words appeared before the others in the output file, and the capital A was not recognized as the same word as the lower-case a. Modify tree to use the case-insensitive versions of the string comparisons so that this does not happen.

end example

Exercise 9.4.3.

start example

The "word" s appears in the file "freq.out," although it is really just a part of the contraction Where's. Adjust get-word to allow embedded single quote marks.

end example

Exercise 9.4.4.

start example

Modify this program to "weed out" certain common words such as a, an, the, is, of, etc., in order to reduce the amount of output for long input files. Try to devise other ways to cut down on useless output.

end example

Exercise 9.4.5.

start example

get-word buffers characters in a list, allocating a new pair (with cons) for each character. Make it more efficient by using a string to buffer the characters. Devise a way to allow the string to grow if necessary. [Hint: Use string-append.]

end example

Exercise 9.4.6.

start example

This tree algorithm works by creating trees and later filling in its left and right fields. This requires many unnecessary assignments. Rewrite the tree procedure to avoid set-tree-left! and set-tree-right! entirely.

end example




The Scheme Programming Language
The Scheme Programming Language
ISBN: 026251298X
EAN: 2147483647
Year: 2003
Pages: 98

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net