8.4. Fixed-Width Data
Fixed-width text data:

    X123-S000001324700000199
    SFG-AT000000010200009099
    Y811-Q000010030000000033

is still widely used in many data processing applications. The obvious way to extract this kind of data is with Perl's built-in substr function. But the resulting code is unwieldy and surprisingly slow:

    # Specify field locations...
    Readonly my %FIELD_POS => (ident=>0, sales=>6,  price=>16);
    Readonly my %FIELD_LEN => (ident=>6, sales=>10, price=>8);

    # Grab each line/record...
    while (my $record = <$sales_data>) {

        # Extract each field...
        my $ident = substr($record, $FIELD_POS{ident}, $FIELD_LEN{ident});
        my $sales = substr($record, $FIELD_POS{sales}, $FIELD_LEN{sales});
        my $price = substr($record, $FIELD_POS{price}, $FIELD_LEN{price});

        # Append each record, translating ID codes and
        # normalizing sales (which are stored in 1000s)...
        push @sales, {
            ident => translate_ID($ident),
            sales => $sales * 1000,
            price => $price,
        };
    }

Using regexes to capture the various fields produces slightly cleaner code, but the matches are still not optimally fast:

    # Specify order and lengths of fields...
    Readonly my $RECORD_LAYOUT => qr/\A (.{6}) (.{10}) (.{8}) /xms;

    # Grab each line/record...
    while (my $record = <$sales_data>) {

        # Extract all fields...
        my ($ident, $sales, $price) = $record =~ m/ $RECORD_LAYOUT /xms;

        # Append each record, translating ID codes and
        # normalizing sales (which are stored in 1000s)...
        push @sales, {
            ident => translate_ID($ident),
            sales => $sales * 1000,
            price => $price,
        };
    }

The built-in unpack function is optimized for this kind of task. In particular, a series of 'A' specifiers can be used to extract a sequence of multicharacter substrings. For the records above, the format 'A6 A10 A8' extracts the six-character ID, the ten-character sales figure, and the eight-character price in a single call.

Some fixed-width formats insert one or more empty columns between the fields of each record, to make the resulting data more readable to humans.
For example:

    X123-S 0000013247 00000199
    SFG-AT 0000000102 00009099
    Y811-Q 0000100300 00000033

When extracting fields from such data, you should use the '@' specifier to tell unpack where each field starts. An '@N' specifier moves the extraction position to column N of the string (counting from zero), so the 'A' specifier that follows it always starts at a known column, no matter how many empty columns precede the field.

This approach scales extremely well, and can also cope with non-spaced data or variant layouts (i.e., with reordered fields). In particular, the unpack function doesn't require that '@' specifiers be specified in increasing column order. This means that an unpack can roam back and forth through a string (much like seek-ing a filehandle) and thereby extract fields in any convenient order.

For example, the layouts for several file formats can be stored in a hash, %RECORD_LAYOUT, keyed by layout name. The loop body is then very similar to those in the earlier examples, except for the record layout now being looked up in a hash. The three variations in formatting and sequence have been cleanly factored out into a table.

Note that the entry for $RECORD_LAYOUT{ID_last}:

    ID_last => '@21 A6 @0 A10 @12 A8',

makes use of non-monotonic '@' specifiers. By jumping to column 21 first, then back to column 0, and on again to column 12, this ID_last format ensures that the call to unpack within the loop:

    my ($ident, $sales, $price) = unpack $RECORD_LAYOUT{$layout_name}, $record;

will extract the record ID before the sales amount and the price, even though the ID field comes after those other two fields in the file.
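The table-driven approach described above can be sketched as follows. This is a minimal illustration, not the original listing. The layout names Unspaced and Spaced, the sample records, the helper parse_record, and the column offsets in the Spaced entry (which assume single-column gaps) are all assumptions made for the example. Only the ID_last format comes from the text. Plain lexical variables stand in for Readonly so that the sketch needs no non-core modules, and the translate_ID step is omitted.

```perl
use strict;
use warnings;

# Hypothetical layout table: one unpack format per file layout.
# Only the ID_last entry is taken from the text; the others are assumed.
my %RECORD_LAYOUT = (
    Unspaced => 'A6 A10 A8',               # fields packed end to end
    Spaced   => '@0 A6 @7 A10 @18 A8',     # assumes one empty column between fields
    ID_last  => '@21 A6 @0 A10 @12 A8',    # ID stored last, but extracted first
);

# One made-up sample record per layout (illustrative data only).
my %SAMPLE_RECORD = (
    Unspaced => 'X123-S000001324700000199',
    Spaced   => 'X123-S 0000013247 00000199',
    ID_last  => '0000013247  00000199 X123-S',
);

# Extract the three fields of one record using the named layout.
sub parse_record {
    my ($layout_name, $record) = @_;

    # The '@' specifiers may jump forwards or backwards through the string,
    # so the fields always come back in the order: ident, sales, price.
    my ($ident, $sales, $price)
        = unpack $RECORD_LAYOUT{$layout_name}, $record;

    return {
        ident => $ident,
        sales => $sales * 1000,    # sales are stored in 1000s
        price => $price,
    };
}

# Parse each sample record with its own layout and report the fields.
for my $layout_name (sort keys %SAMPLE_RECORD) {
    my $field_ref = parse_record($layout_name, $SAMPLE_RECORD{$layout_name});
    printf "%-8s ident=%s sales=%d price=%s\n",
        $layout_name, @{$field_ref}{qw(ident sales price)};
}
```

All three layouts should yield the same ident, sales, and price values, because the differences in spacing and field order are confined to the format strings in the table rather than spread through the loop body.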