8.4. Fixed-Width Data

Use unpack to extract fixed-width fields.

Fixed-width text data:

    X123-S000001324700000199
    SFG-AT000000010200009099
    Y811-Q000010030000000033

is still widely used in many data processing applications. The obvious way to extract this kind of data is with Perl's built-in substr function. But the resulting code is unwieldy and surprisingly slow:

    # Specify field locations...
    Readonly my %FIELD_POS => (ident=>0,  sales=>6,   price=>16);
    Readonly my %FIELD_LEN => (ident=>6,  sales=>10,  price=>8);

    # Grab each line/record...
    while (my $record = <$sales_data>) {

        # Extract each field...
        my $ident = substr($record, $FIELD_POS{ident}, $FIELD_LEN{ident});
        my $sales = substr($record, $FIELD_POS{sales}, $FIELD_LEN{sales});
        my $price = substr($record, $FIELD_POS{price}, $FIELD_LEN{price});

        # Append each record, translating ID codes and
        # normalizing sales (which are stored in 1000s)...
        push @sales, {
            ident => translate_ID($ident),
            sales => $sales * 1000,
            price => $price,
        };
    }

Using regexes to capture the various fields produces slightly cleaner code, but the matches are still not optimally fast:

    # Specify order and lengths of fields...
    Readonly my $RECORD_LAYOUT
        => qr/\A (.{6}) (.{10}) (.{8}) /xms;

    # Grab each line/record...
    while (my $record = <$sales_data>) {

        # Extract all fields...
        my ($ident, $sales, $price)
            = $record =~ m/ $RECORD_LAYOUT /xms;

        # Append each record, translating ID codes and
        # normalizing sales (which are stored in 1000s)...
        push @sales, {
            ident => translate_ID($ident),
            sales => $sales * 1000,
            price => $price,
        };
    }

The built-in unpack function is optimized for this kind of task. In particular, a series of 'A' specifiers can be used to extract a sequence of multicharacter substrings:

    # Specify order and lengths of fields...
    Readonly my $RECORD_LAYOUT => 'A6 A10 A8';  # 6 ASCII, then 10 ASCII, then 8 ASCII

    # Grab each line/record...
    while (my $record = <$sales_data>) {

        # Extract all fields...
        my ($ident, $sales, $price)
            = unpack $RECORD_LAYOUT, $record;

        # Append each record, translating ID codes and
        # normalizing sales (which are stored in 1000s)...
        push @sales, {
            ident => translate_ID($ident),
            sales => $sales * 1000,
            price => $price,
        };
    }
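
As a rough way to check the claims above on your own system, the following sketch extracts the same record with substr, a regex, and unpack, confirms that all three produce identical fields, and optionally times them with the core Benchmark module. It uses plain lexical variables instead of Readonly so that it runs without CPAN modules. The subroutine names and the RUN_BENCHMARK environment variable are inventions for this illustration, and no particular timings are promised:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# One sample record from the text: ident (6), sales (10), price (8).
my $record = "X123-S000001324700000199\n";

my $LAYOUT_RE     = qr/\A (.{6}) (.{10}) (.{8}) /xms;
my $LAYOUT_UNPACK = 'A6 A10 A8';

# Three equivalent extractors, one per technique discussed above
# (the sub names are hypothetical, chosen for this sketch)...
sub by_substr {
    my ($rec) = @_;
    return ( substr($rec, 0, 6), substr($rec, 6, 10), substr($rec, 16, 8) );
}

sub by_regex {
    my ($rec) = @_;
    return $rec =~ $LAYOUT_RE;      # List context: returns the three captures
}

sub by_unpack {
    my ($rec) = @_;
    return unpack $LAYOUT_UNPACK, $rec;
}

# Each extractor should print: X123-S|0000013247|00000199
for my $extract (\&by_substr, \&by_regex, \&by_unpack) {
    print join('|', $extract->($record)), "\n";
}

# Timing is opt-in; RUN_BENCHMARK is a made-up environment variable...
if ($ENV{RUN_BENCHMARK}) {
    cmpthese(-1, {
        'substr' => sub { my @fields = by_substr($record) },
        'regex'  => sub { my @fields = by_regex($record)  },
        'unpack' => sub { my @fields = by_unpack($record) },
    });
}
```

Relative speeds depend on the Perl version, the record length, and the number of fields, so it is worth running the comparison against records that resemble your real data.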

Some fixed-width formats insert one or more empty columns between the fields of each record, to make the resulting data more readable to humans. For example:

    X123-S  0000013247  00000199
    SFG-AT  0000000102  00009099
    Y811-Q  0000100300  00000033

When extracting fields from such data, you should use the '@' specifier to tell unpack where each field starts. For example:

    # Specify order and lengths of fields...
    Readonly my $RECORD_LAYOUT
        => '@0 A6 @8 A10 @20 A8';  # At column zero extract 6 ASCII chars
                                   # then at column 8 extract 10,
                                   # then at column 20 extract 8.

    # Grab each line/record...
    while (my $record = <$sales_data>) {

        # Extract all fields...
        my ($ident, $sales, $price)
            = unpack $RECORD_LAYOUT, $record;

        # Append each record, translating ID codes and
        # normalizing sales (which are stored in 1000s)...
        push @sales, {
            ident => translate_ID($ident),
            sales => $sales * 1000,
            price => $price,
        };
    }
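
To see why the '@' specifiers matter, here is a small self-contained sketch that applies both the unspaced and the spaced template to one of the spaced sample records. The variable names are illustrative only:

```perl
use strict;
use warnings;

# A spaced record from the text: two empty columns between fields.
my $spaced = "X123-S  0000013247  00000199\n";

# Without '@', the empty columns are consumed as if they were field data,
# so the second and third fields are misaligned...
my @misaligned = unpack 'A6 A10 A8', $spaced;

# With '@', each field starts at an absolute column offset...
my @aligned = unpack '@0 A6 @8 A10 @20 A8', $spaced;

print join('|', @misaligned), "\n";   # X123-S|  00000132|47  0000
print join('|', @aligned),    "\n";   # X123-S|0000013247|00000199
```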

This approach scales extremely well, and can also cope with non-spaced data or variant layouts (i.e., with reordered fields). In particular, unpack doesn't require that '@' specifiers appear in increasing column order. This means that an unpack can roam back and forth through a string (much like seeking on a filehandle) and thereby extract fields in any convenient order. For example:

    # Specify order and lengths of fields...
    Readonly my %RECORD_LAYOUT  => (
                     #  Ident   Sales   Price
        Unspaced => '    A6     A10      A8',   # Legacy layout
          Spaced => ' @0 A6  @8 A10  @20 A8',   # Standard layout
         ID_last => '@21 A6  @0 A10  @12 A8',   # New, more convenient layout
    );

    # Select record layout...
    my $layout_name = get_layout($filename);

    # Grab each line/record...
    while (my $record = <$sales_data>) {

        # Extract all fields...
        my ($ident, $sales, $price)
            = unpack $RECORD_LAYOUT{$layout_name}, $record;

        # Append each record, translating ID codes and
        # normalizing sales (which are stored in 1000s)...
        push @sales, {
            ident => translate_ID($ident),
            sales => $sales * 1000,
            price => $price,
        };
    }

The loop body is very similar to those in the earlier examples, except for the record layout now being looked up in a hash. The three variations in formatting and sequence have been cleanly factored out into a table.

Note that the entry for $RECORD_LAYOUT{ID_last}:

    ID_last => '@21 A6  @0 A10  @12 A8',

makes use of non-monotonic '@' specifiers. By jumping to column 21 first, then back to column 0, and on again to column 12, this ID_last format ensures that the call to unpack within the loop:

    my ($ident, $sales, $price)
        = unpack $RECORD_LAYOUT{$layout_name}, $record;

will extract the record ID before the sales amount and the price, even though the ID field comes after those other two fields in the file.
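
The table-driven approach can be tried out with the following minimal sketch, which stores the same record in each of the three layouts and checks that every layout yields the same three fields. The ID_last sample record is constructed here from the column positions in the template (it does not appear in the text), and the %sample hash and extract_fields() helper are hypothetical names used only for illustration:

```perl
use strict;
use warnings;

# The three layouts from the table above (plain hash instead of Readonly,
# so the sketch runs without CPAN modules)...
my %RECORD_LAYOUT = (
    Unspaced => '    A6     A10      A8',
    Spaced   => ' @0 A6  @8 A10  @20 A8',
    ID_last  => '@21 A6  @0 A10  @12 A8',
);

# The same record written in each layout. The ID_last line is an assumed
# example: sales at column 0, price at column 12, ident at column 21.
my %sample = (
    Unspaced => 'X123-S000001324700000199',
    Spaced   => 'X123-S  0000013247  00000199',
    ID_last  => '0000013247  00000199 X123-S',
);

# Hypothetical helper: look up the template by name and unpack the record.
sub extract_fields {
    my ($layout_name, $record) = @_;
    die "Unknown layout: $layout_name\n"
        if !exists $RECORD_LAYOUT{$layout_name};
    return unpack $RECORD_LAYOUT{$layout_name}, $record;
}

# Every layout should report the same ident, sales, and price...
for my $layout_name (sort keys %sample) {
    my ($ident, $sales, $price)
        = extract_fields($layout_name, $sample{$layout_name});
    print "$layout_name: ident=$ident sales=$sales price=$price\n";
}
```

Because the field order is fixed by the template rather than by the file, the code that consumes ($ident, $sales, $price) never needs to know which layout a given file uses.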