Recipe 23.7. Processing Every Word in a File | PHP Cookbook: Solutions and Examples for PHP Programmers

23.7.1. Problem

You want to do something with every word in a file. For example, you want to build a concordance of how many times each word is used so that you can compute similarities between documents.

23.7.2. Solution

Read each line with fgets(), split the line into words, and process each word, as in Example 23-24.

Example 23-24. Processing each word in a file

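<?php
// $php_errormsg was removed in PHP 8, so die() with an explicit message instead
$fh = fopen('great-american-novel.txt','r') or die("can't open great-american-novel.txt");
// fgets() returns false at end of file; compare strictly so a final line of "0" isn't skipped
while (($s = fgets($fh)) !== false) {
    $words = preg_split('/\s+/',$s,-1,PREG_SPLIT_NO_EMPTY);
    // process words
}
fclose($fh) or die("can't close great-american-novel.txt");
?>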
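One way to fill in the "process words" step is to build the concordance mentioned in the Problem. The following is a minimal sketch rather than part of the original example: build_concordance() is a hypothetical helper name, folding words to lowercase is an assumption about what counts as the same word, and the type declarations and the ?? operator assume PHP 7 or later.

```php
<?php
// Hypothetical helper: count how often each whitespace-delimited word occurs in a file.
function build_concordance(string $path): array
{
    $counts = [];
    $fh = fopen($path, 'r');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }
    while (($line = fgets($fh)) !== false) {
        $words = preg_split('/\s+/', $line, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            // Assumption: fold case so "The" and "the" count as one word.
            $key = strtolower($word);
            $counts[$key] = ($counts[$key] ?? 0) + 1;
        }
    }
    fclose($fh);
    arsort($counts); // most frequent words first
    return $counts;
}
```

Calling build_concordance('great-american-novel.txt') returns an array keyed by word, with the number of occurrences as each value. Punctuation stays attached to words here, because the split is on whitespace only.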

23.7.3. Discussion

Example 23-25 calculates the average word length in a file.

Example 23-25. Calculating average word length

<?php
$word_count = $word_length = 0;

if ($fh = fopen('great-american-novel.txt','r')) {
    // fgets() returns false at end of file
    while (($s = fgets($fh)) !== false) {
        $words = preg_split('/\s+/',$s,-1,PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            $word_count++;
            $word_length += strlen($word);
        }
    }
    fclose($fh);
}

// Guard against division by zero when the file can't be opened or has no words
if ($word_count > 0) {
    printf("The average word length over %d words is %.02f characters.",
           $word_count,
           $word_length/$word_count);
}
?>

How you process every word depends on how you define "word." The code in this recipe uses the Perl-compatible regular expression engine's \s whitespace metacharacter, which matches space, tab, newline, carriage return, and formfeed. Recipe 1.5 breaks a line into words by splitting on a space, which suits that recipe because the words have to be rejoined with spaces. The Perl-compatible engine also has a word-boundary assertion (\b) that matches between a word character (a letter, digit, or underscore) and a non-word character (anything else). The most noticeable difference between delimiting words with \b and with \s is in how words with embedded punctuation are treated. The term 6 o'clock is two words when split by whitespace (6 and o'clock); it's four words when split by word boundaries (6, o, ', and clock), not counting the space itself, which \b also sets apart as its own piece.
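The difference between the two delimiters can be seen by splitting the same term both ways. This is an illustrative sketch; the variable names are chosen here, and dropping the whitespace-only pieces after the \b split is an assumption about what should count as a word.

```php
<?php
$term = "6 o'clock";

// Split on runs of whitespace: punctuation stays attached to its word.
$by_space = preg_split('/\s+/', $term, -1, PREG_SPLIT_NO_EMPTY);

// Split at every word boundary. The space between the words comes back
// as its own piece, because boundaries sit on both sides of it.
$pieces = preg_split('/\b/', $term, -1, PREG_SPLIT_NO_EMPTY);

// Assumption: whitespace-only pieces are not words, so filter them out.
$by_boundary = array_values(array_filter($pieces, function ($piece) {
    return trim($piece) !== '';
}));

print implode(' | ', $by_space) . "\n";    // 6 | o'clock
print implode(' | ', $by_boundary) . "\n"; // 6 | o | ' | clock
```

Which behavior is preferable depends on the task: whitespace splitting keeps contractions intact, while boundary splitting separates punctuation from the letters around it.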

23.7.4. See Also

Recipe 22.2 for regular expressions that match words; Recipe 1.5 for breaking apart a line by words; documentation on fgets() at http://www.php.net/fgets, on preg_split() at http://www.php.net/preg-split, and on the Perl-compatible regular expression extension at http://www.php.net/pcre.