6.13 Working with a File Word-by-Word

You want to perform a particular action separately on every word in a file.

Technique

Load the file into an array of lines, implode it into a single string, and split that string on runs of whitespace:

 <?php $f_contents = preg_split ("/\s+/", implode ("", file ($fn)), -1, PREG_SPLIT_NO_EMPTY); ?>

Comments

Splitting up a file by words can be tricky because one person might consider "words" to mean everything but whitespace (as we did here), whereas another person will count only the words of a particular language (English, Korean, Latin, French, and so on). In short, the term "word" is not clearly defined for programmers. So, the best way to work with a file word-by-word is to split the file on whitespace (no definition of a word includes whitespace), and then test each resulting non-whitespace token against your own criterion.
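As a minimal sketch of that split-then-test approach, the following applies one possible criterion to each token. The helper name is_word and the rule it encodes (letters, optionally joined by apostrophes or hyphens) are assumptions made for illustration, not a general definition of a word; the sample string stands in for the contents of a file.

```php
<?php
// Hypothetical helper: the criterion (letters, optionally joined by
// apostrophes or hyphens) is an assumption chosen for illustration.
function is_word($token)
{
    return preg_match('/^\p{L}+(?:[\'-]\p{L}+)*$/u', $token) === 1;
}

$text = "Don't split well-known words: 42 is not one, nor is --.";

// Step 1: split on whitespace, dropping empty strings.
$tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

// Step 2: test each token against the criterion.
$words = array_values(array_filter($tokens, 'is_word'));

print_r($words);
```

Tokens with attached punctuation, such as "words:" and "one,", are rejected by this particular rule. Whether to reject them or to strip the punctuation first depends on what you mean by a word.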

Here is an example that will test for duplicates of words in a file:

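 <?php
 $f_contents = preg_split ("/\s+/", implode ("", file ($fn)), -1, PREG_SPLIT_NO_EMPTY);

 // Count each word, initializing its counter the first time it is seen.
 $ar = array();
 foreach ($f_contents as $word) {
     $ar[$word] = isset($ar[$word]) ? $ar[$word] + 1 : 1;
 }

 print "the following words have duplicates\n";
 foreach ($ar as $word => $word_count) {
     if ($word_count > 1) {
         print "Word: $word\nNumber Of Occurrences: $word_count\n\n";
     }
 }
 ?>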

Note that even this simple script is not perfect. For example, Massachusetts Institute of Technology would not be equated with MIT. Whenever you have a system dealing with humans, you will not catch all the exceptions; you just have to test and retest until you catch all the common ones.
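Some of the common exceptions are mechanical, such as capitalization and attached punctuation. The sketch below normalizes each token before counting it. The helper name normalize_word and its rules are assumptions for illustration, and strtolower only handles ASCII letters, so text in other scripts would need a multibyte-aware function instead.

```php
<?php
// Hypothetical helper: lowercases a token and trims leading and trailing
// punctuation. What counts as "the same word" is an assumption here.
function normalize_word($token)
{
    $trimmed = preg_replace('/^\p{P}+|\p{P}+$/u', '', $token);
    return strtolower($trimmed);
}

$text = "The cat saw the dog. The dog, too, saw the cat!";

$counts = array();
foreach (preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY) as $token) {
    $word = normalize_word($token);
    if ($word === '') {
        continue; // the token was punctuation only
    }
    $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
}

print_r($counts);
```

With this normalization, "The", "the", "dog." and "dog," collapse into two keys instead of four. It still does nothing for abbreviations such as MIT, which need a lookup table or human review.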

This solution requires more memory at a single point in time than reading the file line-by-line would; therefore, when dealing with extremely large files, read and split one line at a time:

 <?php
 $fp = fopen ($fn, 'r') or die("Cannot open file $fn");
 // Compare against false so that a line containing only "0" does not end the loop.
 while (($line = fgets($fp)) !== false) {
     $words = preg_split ('/\s+/', $line, -1, PREG_SPLIT_NO_EMPTY);
     //... manipulate all the words in line, see above techniques
 }
 fclose($fp) or die("Cannot close file $fn");
 ?>
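The line-by-line loop can be combined with the duplicate counter from earlier in this section, so that only one line plus the table of counts is held in memory at a time. This is a sketch under stated assumptions: the temporary file and its contents are invented here purely to make the example self-contained.

```php
<?php
// Assumption: a small temporary file stands in for the large input file.
$fn = tempnam(sys_get_temp_dir(), 'words');
file_put_contents($fn, "one two two\nthree two one\n0\n");

$counts = array();
$fp = fopen($fn, 'r') or die("Cannot open file $fn");

// Comparing against false keeps a line containing only "0" from ending the loop.
while (($line = fgets($fp)) !== false) {
    foreach (preg_split('/\s+/', $line, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
    }
}

fclose($fp);
unlink($fn); // remove the temporary file

foreach ($counts as $word => $word_count) {
    if ($word_count > 1) {
        print "Word: $word\nNumber Of Occurrences: $word_count\n\n";
    }
}
```

The count table still grows with the number of distinct words, so this saves memory only when the file is much larger than its vocabulary.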