Recipe 23.6. Counting Lines, Paragraphs, or Records in a File


23.6.1. Problem

You want to count the number of lines, paragraphs, or records in a file.

23.6.2. Solution

To count lines, use fgets( ), as in Example 23-18. Because it reads a line at a time, you can count the number of times it's called before reaching the end of a file.

Example 23-18. Counting lines in a file

<?php
$lines = 0;
if ($fh = fopen('orders.txt','r')) {
  while (! feof($fh)) {
    if (fgets($fh)) {
      $lines++;
    }
  }
}
print $lines;
?>

To count paragraphs, increment the counter only when you read a blank line, as in Example 23-19.

Example 23-19. Counting paragraphs in a file

<?php
$paragraphs = 0;
if ($fh = fopen('great-american-novel.txt','r')) {
  while (! feof($fh)) {
    $s = fgets($fh);
    if (("\n" == $s) || ("\r\n" == $s)) {
      $paragraphs++;
    }
  }
}
print $paragraphs;
?>

To count records, increment the counter only when the line read contains just the record separator and whitespace. In Example 23-20, the record separator is stored in $record_separator.

Example 23-20. Counting records in a file

<?php
$records = 0;
$record_separator = '--end--';
if ($fh = fopen('great-american-textfile-database.txt','r')) {
  while (! feof($fh)) {
    $s = rtrim(fgets($fh));
    if ($s == $record_separator) {
      $records++;
    }
  }
}
print $records;
?>

23.6.3. Discussion

In Example 23-18, $lines is incremented only if fgets( ) returns a true value. As fgets( ) moves through the file, it returns each line it retrieves. Once it has read past the last line and there is nothing left to read, it returns false, so $lines isn't incremented incorrectly. Because EOF has been reached on the file, feof( ) returns true, and the while loop ends.
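One edge case worth noting: the truthiness test in Example 23-18 skips a final line that consists only of the character 0 with no trailing newline, because PHP treats the string "0" as false. A minimal sketch of a stricter variant follows. It assumes a hypothetical helper named count_lines_strict( ), which is not part of the recipe, and compares the return value of fgets( ) against false explicitly:

```php
<?php
// Hypothetical helper, not from the recipe: counts lines using a strict
// comparison so that a final line of "0" with no newline is still counted.
function count_lines_strict($path) {
    $fh = fopen($path, 'r');
    if (false === $fh) {
        return false;
    }
    $lines = 0;
    // fgets() returns false only when there is nothing left to read.
    while (false !== fgets($fh)) {
        $lines++;
    }
    fclose($fh);
    return $lines;
}

// Usage: write a small temporary file (made-up contents) and count its lines.
$tmp = tempnam(sys_get_temp_dir(), 'lines');
file_put_contents($tmp, "first\n\nthird\n0");
print count_lines_strict($tmp) . "\n";
unlink($tmp);
```

Because the loop condition is the read itself, no separate feof( ) check is needed in this variant.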

Example 23-19 works fine on simple text but may produce unexpected results when presented with a long string of blank lines or a file without two consecutive line breaks. These problems can be remedied with functions based on preg_split( ). If the file is small and can be read into memory, use the pc_split_paragraphs( ) function shown in Example 23-21. This function returns an array containing each paragraph in the file.

Example 23-21. pc_split_paragraphs( )

<?php
function pc_split_paragraphs($file,$rs="\r?\n") {
    $text = file_get_contents($file);
    $matches = preg_split("/(.*?$rs)(?:$rs)+/s",$text,-1,
                          PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
    return $matches;
}
?>

In Example 23-21, the contents of the file are broken on two or more consecutive newlines and returned in the $matches array. The default record-separation regular expression, \r?\n, matches both Windows and Unix line breaks.
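To see the idea of breaking on two or more consecutive line breaks in isolation, here is a small sketch that works on a string instead of a file. It does not use the pattern from pc_split_paragraphs( ); it uses a simplified stand-in that discards the separators instead of capturing them, and the sample text is made up for illustration:

```php
<?php
// Sample text (invented for this sketch) mixing Unix and Windows line breaks,
// with an extra blank line between the second and third paragraphs.
$text = "First paragraph,\nsecond line.\n\n"
      . "Second paragraph.\r\n\r\n\r\n"
      . "Third paragraph.\n";

// Split wherever two or more line breaks occur in a row. A single line
// break inside a paragraph does not match, so it stays in place.
$paragraphs = preg_split('/(?:\r?\n){2,}/', $text, -1, PREG_SPLIT_NO_EMPTY);

print count($paragraphs) . "\n";
```

The run of three Windows line breaks is consumed as a single separator, so extra blank lines do not produce empty paragraphs.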

If the file is too big to read into memory at once, use the pc_split_paragraphs_largefile( ) function shown in Example 23-22, which reads the file in 16 KB chunks.

Example 23-22. pc_split_paragraphs_largefile( )

<?php
function pc_split_paragraphs_largefile($file,$rs="\r?\n") {
    // $php_errormsg is not available in PHP 8, so use explicit messages
    $unmatched_text = '';
    $paragraphs = array();
    $fh = fopen($file,'r') or die("can't open $file");
    while(! feof($fh)) {
        $s = fread($fh,16384) or die("can't read $file");
        $text_to_split = $unmatched_text . $s;
        $matches = preg_split("/(.*?$rs)(?:$rs)+/s",$text_to_split,-1,
                              PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
        // if the last chunk doesn't end with two record separators, save it
        // to prepend to the next section that gets read
        $last_match = $matches[count($matches)-1];
        if (! preg_match("/$rs$rs\$/",$last_match)) {
            $unmatched_text = $last_match;
            array_pop($matches);
        } else {
            $unmatched_text = '';
        }
        $paragraphs = array_merge($paragraphs,$matches);
    }
    // after reading all sections, if there is a final chunk that doesn't
    // end with the record separator, count it as a paragraph
    if ($unmatched_text) {
        $paragraphs[] = $unmatched_text;
    }
    return $paragraphs;
}
?>

This function uses the same regular expression as pc_split_paragraphs( ) to split the file into paragraphs. When it finds a paragraph end in a chunk read from the file, it saves the rest of the text in the chunk in $unmatched_text and prepends it to the next chunk read. This includes the unmatched text as the beginning of the next paragraph in the file.
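If only the paragraph count is needed, not the paragraph text, a line-by-line pass is a simpler alternative to chunked splitting. The sketch below is a different technique from the regular-expression functions above. It assumes a hypothetical helper named count_paragraphs_streaming( ) and treats a paragraph as any run of non-blank lines, so a long run of blank lines or a missing final blank line does not change the count:

```php
<?php
// Hypothetical helper, not from the recipe: counts paragraphs one line at a
// time, so memory use does not grow with file size.
// Assumption: a paragraph is any run of non-blank lines, and a line holding
// only whitespace counts as blank.
function count_paragraphs_streaming($path) {
    $fh = fopen($path, 'r');
    if (false === $fh) {
        return false;
    }
    $paragraphs = 0;
    $in_paragraph = false;
    while (false !== ($line = fgets($fh))) {
        if ('' === trim($line)) {
            // A blank line ends the current paragraph, if any.
            $in_paragraph = false;
        } elseif (! $in_paragraph) {
            // First non-blank line after a gap starts a new paragraph.
            $paragraphs++;
            $in_paragraph = true;
        }
    }
    fclose($fh);
    return $paragraphs;
}

// Usage with made-up text: three blank lines in a row, and no blank line
// after the final paragraph.
$tmp = tempnam(sys_get_temp_dir(), 'para');
file_put_contents($tmp, "One.\n\n\n\nTwo,\nstill two.\n\nThree, no trailing blank line.");
print count_paragraphs_streaming($tmp) . "\n";
unlink($tmp);
```

The trade-off is that this version returns only a number; collecting the paragraph text would require accumulating lines inside the loop.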

The record-counting function in Example 23-20 lets fgets( ) figure out how long each line is. If you can supply a reasonable upper bound on line length, stream_get_line( ) provides a more concise way to count records. This function reads a line until it reaches a certain number of bytes or it sees a particular delimiter. Supply it with the record separator as the delimiter, as in Example 23-23.

Example 23-23. Counting records in a file with stream_get_line( )

<?php
$records = 0;
$record_separator = '--end--';
if ($fh = fopen('great-american-textfile-database.txt','r')) {
    $done = false;
    while (! $done) {
        $s = stream_get_line($fh, 65536, $record_separator);
        if (feof($fh)) {
            $done = true;
        } else {
            $records++;
        }
    }
}
print $records;
?>

Example 23-23 assumes that each record is no more than 64 KB (65,536 bytes) long. Each call to stream_get_line( ) returns one record, not including the record separator. When stream_get_line( ) has advanced past the last record separator, it reaches the end of the file, so $done is set to true to stop counting records.
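To inspect what each call to stream_get_line( ) hands back, the following sketch collects the returned strings from an in-memory stream instead of a file on disk. The helper name read_records( ) and the sample data are hypothetical. The sketch assumes every record, including the last, ends with the separator; any unterminated text after the final separator would come back as one extra element.

```php
<?php
// Hypothetical helper, not part of the recipe: gathers every string that
// stream_get_line() returns until it reports that no data is left.
function read_records($fh, $record_separator, $max_length = 65536) {
    $records = array();
    while (false !== ($s = stream_get_line($fh, $max_length, $record_separator))) {
        // $s holds one record without the trailing separator.
        $records[] = $s;
    }
    return $records;
}

// Build a small in-memory "file" with made-up data; each record ends with
// the separator.
$fh = fopen('php://memory', 'w+');
fwrite($fh, "alice,42--end--bob,17--end--");
rewind($fh);

$records = read_records($fh, '--end--');
fclose($fh);

print count($records) . "\n";
print_r($records);
```

Unlike Example 23-23, this loop stops on the false return value instead of checking feof( ), which is why the assumption about a terminating separator matters.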

23.6.4. See Also

Documentation on fgets( ) at http://www.php.net/fgets, on feof( ) at http://www.php.net/feof, on preg_split( ) at http://www.php.net/preg-split, and on stream_get_line( ) at http://www.php.net/stream_get_line.




PHP Cookbook, 2nd Edition
PHP Cookbook: Solutions and Examples for PHP Programmers
ISBN: 0596101015
Year: 2006
Pages: 445
