Section 8.4. Fixed-Width Data


Use unpack to extract fixed-width fields.

Fixed-width text data:

    X123-S000001324700000199
    SFG-AT000000010200009099
    Y811-Q000010030000000033

is still widely used in many data processing applications. The obvious way to extract this kind of data is with Perl's built-in substr function. But the resulting code is unwieldy and surprisingly slow:

    # Specify field locations...
    Readonly my %FIELD_POS => (ident=>0,  sales=>6,   price=>16);
    Readonly my %FIELD_LEN => (ident=>6,  sales=>10,  price=>8);

    # Grab each line/record...
    while (my $record = <$sales_data>) {

        # Extract each field...
        my $ident = substr($record, $FIELD_POS{ident}, $FIELD_LEN{ident});
        my $sales = substr($record, $FIELD_POS{sales}, $FIELD_LEN{sales});
        my $price = substr($record, $FIELD_POS{price}, $FIELD_LEN{price});

        # Append each record, translating ID codes and
        # normalizing sales (which are stored in 1000s)...
        push @sales, {
            ident => translate_ID($ident),
            sales => $sales * 1000,
            price => $price,
        };
    }

Using regexes to capture the various fields produces slightly cleaner code, but the matches are still not optimally fast:

    # Specify order and lengths of fields...
    Readonly my $RECORD_LAYOUT
        => qr/\A (.{6}) (.{10}) (.{8}) /xms;

    # Grab each line/record...
    while (my $record = <$sales_data>) {

        # Extract all fields...
        my ($ident, $sales, $price)
            = $record =~ m/ $RECORD_LAYOUT /xms;

        # Append each record, translating ID codes and
        # normalizing sales (which are stored in 1000s)...
        push @sales, {
            ident => translate_ID($ident),
            sales => $sales * 1000,
            price => $price,
        };
    }

The built-in unpack function is optimized for this kind of task. In particular, a series of 'A' specifiers can be used to extract a sequence of multicharacter substrings:

    # Specify order and lengths of fields...
    Readonly my $RECORD_LAYOUT => 'A6 A10 A8';  # 6 ASCII, then 10 ASCII, then 8 ASCII

    # Grab each line/record...
    while (my $record = <$sales_data>) {

        # Extract all fields...
        my ($ident, $sales, $price)
            = unpack $RECORD_LAYOUT, $record;

        # Append each record, translating ID codes and
        # normalizing sales (which are stored in 1000s)...
        push @sales, {
            ident => translate_ID($ident),
            sales => $sales * 1000,
            price => $price,
        };
    }
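To see the 'A' specifiers at work in isolation, here is a minimal, self-contained sketch that unpacks the first record from the sample data above. It uses a plain lexical instead of Readonly so that it runs without any CPAN modules, and it omits translate_ID(), whose behaviour the text does not specify. The final lines illustrate a property of 'A' worth knowing: trailing whitespace is stripped from each extracted field.

```perl
use strict;
use warnings;

# Layout from the text: 6-char ident, 10-char sales, 8-char price
# (plain lexical here; the book's own code would use Readonly)
my $RECORD_LAYOUT = 'A6 A10 A8';

# First record of the sample data, with the newline a file read would leave
my $record = "X123-S000001324700000199\n";

# The template consumes 24 characters; the trailing newline is ignored
my ($ident, $sales, $price) = unpack $RECORD_LAYOUT, $record;

# Sales are stored in 1000s, so normalize as in the loop above
my $normalized_sales = $sales * 1000;

print "ident=$ident sales=$normalized_sales price=$price\n";

# 'A' strips trailing whitespace from the extracted field
my ($short_field) = unpack 'A6', 'AB    ';
print "short_field=[$short_field]\n";
```

Run as a script, this prints ident=X123-S sales=13247000 price=00000199, followed by short_field=[AB].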

Some fixed-width formats insert one or more empty columns between the fields of each record, to make the resulting data more readable to humans. For example:

    X123-S  0000013247  00000199
    SFG-AT  0000000102  00009099
    Y811-Q  0000100300  00000033

When extracting fields from such data, you should use the '@' specifier to tell unpack where each field starts. For example:

    # Specify order and lengths of fields...
    Readonly my $RECORD_LAYOUT
        => '@0 A6 @8 A10 @20 A8';   # At column zero extract 6 ASCII chars
                                    # then at column 8 extract 10,
                                    # then at column 20 extract 8.

    # Grab each line/record...
    while (my $record = <$sales_data>) {

        # Extract all fields...
        my ($ident, $sales, $price)
            = unpack $RECORD_LAYOUT, $record;

        # Append each record, translating ID codes and
        # normalizing sales (which are stored in 1000s)...
        push @sales, {
            ident => translate_ID($ident),
            sales => $sales * 1000,
            price => $price,
        };
    }
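The column arithmetic is easy to check on a single record. The following sketch, again using a plain lexical rather than Readonly so that it is self-contained, applies the spaced layout to the first line of the sample data: the ident occupies columns 0 to 5, the sales figure columns 8 to 17, and the price columns 20 to 27.

```perl
use strict;
use warnings;

# Spaced layout from the text: each '@N' jumps to absolute column N
my $RECORD_LAYOUT = '@0 A6 @8 A10 @20 A8';

# First record of the spaced sample data (two blank columns between fields)
my $record = "X123-S  0000013247  00000199\n";

my ($ident, $sales, $price) = unpack $RECORD_LAYOUT, $record;

print "ident=$ident sales=$sales price=$price\n";
```

This prints ident=X123-S sales=0000013247 price=00000199; the blank columns at 6 to 7 and 18 to 19 are simply never visited.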

This approach scales extremely well, and can also cope with non-spaced data or variant layouts (i.e., with reordered fields). In particular, the unpack function doesn't require that '@' specifiers be specified in increasing column order. This means that an unpack can roam back and forth through a string (much like seek-ing a filehandle) and thereby extract fields in any convenient order. For example:

    # Specify order and lengths of fields...
    Readonly my %RECORD_LAYOUT => (
        #              Ident     Sales     Price
        Unspaced => '     A6      A10        A8',   # Legacy layout
        Spaced   => '  @0 A6   @8 A10    @20 A8',   # Standard layout
        ID_last  => ' @21 A6   @0 A10    @12 A8',   # New, more convenient layout
    );

    # Select record layout...
    my $layout_name = get_layout($filename);

    # Grab each line/record...
    while (my $record = <$sales_data>) {

        # Extract all fields...
        my ($ident, $sales, $price)
            = unpack $RECORD_LAYOUT{$layout_name}, $record;

        # Append each record, translating ID codes and
        # normalizing sales (which are stored in 1000s)...
        push @sales, {
            ident => translate_ID($ident),
            sales => $sales * 1000,
            price => $price,
        };
    }

The loop body is very similar to those in the earlier examples, except for the record layout now being looked up in a hash. The three variations in formatting and sequence have been cleanly factored out into a table.

Note that the entry for $RECORD_LAYOUT{ID_last}:

    ID_last => '@21 A6  @0 A10  @12 A8',

makes use of non-monotonic '@' specifiers. By jumping to column 21 first, then back to column 0, and on again to column 12, this ID_last format ensures that the call to unpack within the loop:

    my ($ident, $sales, $price)
        = unpack $RECORD_LAYOUT{$layout_name}, $record;

will extract the record ID before the sales amount and the price, even though the ID field comes after those other two fields in the file.
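As a quick check that the three layouts really are interchangeable, the sketch below runs each of them over the same record, laid out three different ways. The sample records in %sample_record are hypothetical: they were constructed here to match the column positions in the layout table and are not taken from any real data file. A plain hash stands in for the Readonly one so that the code is self-contained.

```perl
use strict;
use warnings;

# Layout table from the text (plain hash rather than Readonly)
my %RECORD_LAYOUT = (
    Unspaced => 'A6 A10 A8',
    Spaced   => '@0 A6 @8 A10 @20 A8',
    ID_last  => '@21 A6 @0 A10 @12 A8',
);

# Hypothetical sample records: the same data arranged per layout
my %sample_record = (
    Unspaced => 'X123-S000001324700000199',
    Spaced   => 'X123-S  0000013247  00000199',
    ID_last  => '0000013247  00000199 X123-S',
);

# Unpack each sample with its own layout and record what came out
my %fields_for;
for my $layout_name (sort keys %RECORD_LAYOUT) {
    my ($ident, $sales, $price)
        = unpack $RECORD_LAYOUT{$layout_name}, $sample_record{$layout_name};

    $fields_for{$layout_name} = "$ident|$sales|$price";
    print "$layout_name: $fields_for{$layout_name}\n";
}
```

All three lines of output show the same fields, X123-S|0000013247|00000199, in the same order. For ID_last, that is because the template jumps to column 21 first and only then returns to columns 0 and 12.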



Perl Best Practices
ISBN: 0596001738
Year: 2004
Pages: 350
Authors: Damian Conway
