Recipe 1.13. Parsing Fixed-Width Field Data Records | PHP Cookbook: Solutions and Examples for PHP Programmers

1.13.1. Problem

You need to break apart fixed-width records in strings.

1.13.2. Solution

Use substr( ) as shown in Example 1-34.

Example 1-34. Parsing fixed-width records with substr( )

<?php
$fp = fopen('fixed-width-records.txt','r') or die("can't open file");
// compare against false so that a final line of "0" does not end the loop early
while (($s = fgets($fp,1024)) !== false) {
    $fields[1] = substr($s,0,10);  // first field:  first 10 characters of the line
    $fields[2] = substr($s,10,5);  // second field: next 5 characters of the line
    $fields[3] = substr($s,15,12); // third field:  next 12 characters of the line
    // a function to do something with the fields
    process_fields($fields);
}
fclose($fp) or die("can't close file");
?>

Or use unpack( ), as shown in Example 1-35.

Example 1-35. Parsing fixed-width records with unpack( )

<?php
$fp = fopen('fixed-width-records.txt','r') or die("can't open file");
// compare against false so that a final line of "0" does not end the loop early
while (($s = fgets($fp,1024)) !== false) {
    // an associative array with keys "title", "author", and "publication_year"
    $fields = unpack('A25title/A14author/A4publication_year',$s);
    // a function to do something with the fields
    process_fields($fields);
}
fclose($fp) or die("can't close file");
?>

1.13.3. Discussion

Data in which each field is allotted a fixed number of characters per line may look like this list of book titles, authors, and publication years:

<?php
$booklist=<<<END
Elmer Gantry             Sinclair Lewis1927
The Scarlatti InheritanceRobert Ludlum 1971
The Parsifal Mosaic      Robert Ludlum 1982
Sophie's Choice          William Styron1979
END;
?>

In each line, the title occupies the first 25 characters, the author's name the next 14 characters, and the publication year the next 4 characters. Knowing those field widths, you can easily use substr( ) to parse the fields into an array:

<?php
$books = explode("\n",$booklist);
for($i = 0, $j = count($books); $i < $j; $i++) {
  $book_array[$i]['title'] = substr($books[$i],0,25);
  $book_array[$i]['author'] = substr($books[$i],25,14);
  $book_array[$i]['publication_year'] = substr($books[$i],39,4);
}
?>

Exploding $booklist into an array of lines makes the looping code the same whether it's operating over a string or a series of lines read in from a file.
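As a rough illustration of that point, the sketch below loads records from a file into an array of lines with file( ) and then runs the same substr( ) calls over them. The temporary file and its two sample records are assumptions, included only to keep the example self-contained.

```php
<?php
// Assumption: sample records written to a temporary file for illustration.
$path = tempnam(sys_get_temp_dir(), 'fw');
file_put_contents($path,
    "Elmer Gantry             Sinclair Lewis1927\n" .
    "The Parsifal Mosaic      Robert Ludlum 1982\n");

// FILE_IGNORE_NEW_LINES drops the newline at the end of each line and
// FILE_SKIP_EMPTY_LINES drops blank lines, leaving a plain array of records.
$books = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
unlink($path);

// The loop body is identical to the one used on the exploded string.
$book_array = array();
foreach ($books as $i => $line) {
    $book_array[$i]['title'] = substr($line, 0, 25);
    $book_array[$i]['author'] = substr($line, 25, 14);
    $book_array[$i]['publication_year'] = substr($line, 39, 4);
}
```

Only the step that produces $books changes; the field extraction does not care where the lines came from.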

The loop can be made more flexible by specifying the field names and widths in a separate array that can be passed to a parsing function, as shown in the pc_fixed_width_substr( ) function in Example 1-36.

Example 1-36. pc_fixed_width_substr( )

<?php
function pc_fixed_width_substr($fields,$data) {
  $r = array();
  for ($i = 0, $j = count($data); $i < $j; $i++) {
    $line_pos = 0;
    foreach($fields as $field_name => $field_length) {
      $r[$i][$field_name] = rtrim(substr($data[$i],$line_pos,$field_length));
      $line_pos += $field_length;
    }
  }
  return $r;
}
$book_fields = array('title' => 25,
                     'author' => 14,
                     'publication_year' => 4);
$book_array = pc_fixed_width_substr($book_fields,$books);
?>

The variable $line_pos keeps track of the start of each field and is advanced by the previous field's width as the code moves through each line. Use rtrim( ) to remove trailing whitespace from each field.
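The function does not check that a line is long enough to hold every field, so a truncated line quietly produces short or empty values. A minimal sketch of such a check follows, assuming that incomplete lines should be detected before parsing. The helper name pc_line_is_complete( ) is hypothetical and not part of the recipe.

```php
<?php
// Hypothetical helper: true when $line has at least as many bytes as
// the widths of all fields added together.
function pc_line_is_complete($fields, $line) {
    return strlen($line) >= array_sum($fields);
}

$book_fields = array('title' => 25,
                     'author' => 14,
                     'publication_year' => 4);

// A full 43-byte record and a truncated one, both made up for illustration.
$good = str_pad('Elmer Gantry', 25) . 'Sinclair Lewis' . '1927';
$short = 'Elmer Gantry';

$complete = pc_line_is_complete($book_fields, $good);    // true
$truncated = pc_line_is_complete($book_fields, $short);  // false
```

Because the widths already live in one array, array_sum( ) gives the expected record length without repeating any numbers.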

You can use unpack( ) as a substitute for substr( ) to extract fields. Instead of specifying the field names and widths as an associative array, create a format string for unpack( ). A fixed-width field extractor using unpack( ) looks like the pc_fixed_width_unpack( ) function shown in Example 1-37.

Example 1-37. pc_fixed_width_unpack( )

<?php
function pc_fixed_width_unpack($format_string,$data) {
  $r = array();
  for ($i = 0, $j = count($data); $i < $j; $i++) {
    $r[$i] = unpack($format_string,$data[$i]);
  }
  return $r;
}
$book_array = pc_fixed_width_unpack('A25title/A14author/A4publication_year',
                                    $books);
?>

Because the A format to unpack( ) means "space-padded string," there's no need to rtrim( ) off the trailing spaces.
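To see the difference the format code makes, the short sketch below unpacks the same field twice: once with A and once with a, which returns the field as stored. The sample record is built with str_pad( ) purely for illustration.

```php
<?php
// Sample record: 25-byte title, 14-byte author, 4-byte year.
$record = str_pad('Elmer Gantry', 25) . 'Sinclair Lewis' . '1927';

// "A" strips the trailing padding from the field.
$stripped = unpack('A25title', $record);

// "a" keeps the field exactly as stored, padding included.
$padded = unpack('a25title', $record);

$stripped_length = strlen($stripped['title']); // 12
$padded_length = strlen($padded['title']);     // 25
```

If the padding itself matters (for example, when writing the record back out unchanged), a is the format to reach for.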

Once the fields have been parsed into $book_array by either function, the data can be printed as an HTML table, for example:

<?php
$book_array = pc_fixed_width_unpack('A25title/A14author/A4publication_year',
                                    $books);
print "<table>\n";
// print a header row
print '<tr><td>';
print join('</td><td>',array_keys($book_array[0]));
print "</td></tr>\n";
// print each data row
foreach ($book_array as $row) {
    print '<tr><td>';
    print join('</td><td>',array_values($row));
    print "</td></tr>\n";
}
// double quotes, so that \n is printed as a newline rather than literally
print "</table>\n";
?>

Joining data on </td><td> produces a table row that is missing its first <td> and last </td>. We produce a complete table row by printing out <tr><td> before the joined data and </td></tr> after the joined data.
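The loop prints field values as they are, which is fine for the sample data but not for values containing characters such as < or &. One possible way to handle that is to escape each cell before joining. The sketch below assumes UTF-8 data; the helper name pc_table_row( ) and the sample values are hypothetical.

```php
<?php
// Hypothetical helper: builds one HTML table row from an array of cell
// values, escaping each value so < and & are not treated as markup.
function pc_table_row($cells) {
    $escaped = array();
    foreach ($cells as $cell) {
        $escaped[] = htmlspecialchars($cell, ENT_QUOTES, 'UTF-8');
    }
    return '<tr><td>' . join('</td><td>', $escaped) . "</td></tr>\n";
}

// Made-up row containing characters that need escaping.
$row = pc_table_row(array('title' => 'Q&A <Draft>',
                          'author' => 'A. N. Other',
                          'publication_year' => '2000'));
```

The join-on-</td><td> technique is unchanged; only the values passed to join( ) are escaped first.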

substr( ) and unpack( ) have equivalent capabilities when the fixed-width fields are plain strings, but unpack( ) is the better choice when some fields hold other kinds of data, such as binary-encoded numbers.
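As a sketch of that last case, suppose a record pairs a space-padded name with a quantity stored as a two-byte binary integer. The record layout here is made up for illustration; pack( ) builds the sample record, and a single unpack( ) call decodes both fields.

```php
<?php
// Made-up layout: a 10-byte space-padded name followed by a 16-bit
// unsigned big-endian integer (the "n" format code).
$record = pack('A10n', 'widget', 513);

// One format string describes both the string field and the binary field.
$fields = unpack('A10name/nquantity', $record);
// $fields['name'] is the string "widget"
// $fields['quantity'] is the integer 513
```

Getting the same result with substr( ) would mean extracting the two bytes and decoding them by hand.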

If all of your fields are the same size, str_split( ) is a handy shortcut for chopping up incoming data. Available as of PHP 5, it returns an array of consecutive, equal-length pieces of a string (the final piece may be shorter). Example 1-38 uses str_split( ) to break apart a string into 32-byte pieces.

Example 1-38. Chopping up a string with str_split( )

<?php
$fields = str_split($line_of_data,32);
// $fields[0] is bytes 0 - 31
// $fields[1] is bytes 32 - 63
// and so on
?>
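A small worked case may make the behavior more concrete. The sketch below uses made-up 8-byte fields rather than 32-byte ones so the values are easy to read, and strips the padding from each piece with rtrim( ).

```php
<?php
// Made-up record: three 8-byte, space-padded fields.
$line_of_data = str_pad('alpha', 8) . str_pad('beta', 8) . str_pad('gamma', 8);

// Split into 8-byte pieces, then strip the padding from each piece.
$fields = array_map('rtrim', str_split($line_of_data, 8));
// $fields is array('alpha', 'beta', 'gamma')

// When the length is not a multiple of the chunk size,
// the last piece is simply shorter.
$uneven = str_split('abcdefghij', 4);
// $uneven is array('abcd', 'efgh', 'ij')
```

A shorter last piece is therefore a cheap signal that a record was truncated.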

1.13.4. See Also

For more information about unpack( ), see Recipe 1.16 and http://www.php.net/unpack. Documentation on str_split( ) is at http://www.php.net/str_split. Recipe 4.8 discusses join( ).