8.6. Variable-Width Data

Use Text::CSV_XS to extract complex variable-width fields.

Perl's built-in functions aren't always the right answer. Using split to extract variable-width fields is efficient and easy, provided those fields really are always delimited by a simple separator. More often, though, even if your records start out as purely comma-delimited:

    Readonly my $RECORD_SEPARATOR => q{,};
    Readonly my $FIELD_COUNT      => 3;

    my ($ident, $sales, $price) = split $RECORD_SEPARATOR, $record, $FIELD_COUNT+1;

it soon becomes necessary to extend the format rules to cope with human vagaries (such as ignoring whitespace around commas):

    Readonly my $RECORD_SEPARATOR => qr/\s* , \s*/xms;
    Readonly my $FIELD_COUNT      => 3;

    my ($ident, $sales, $price) = split $RECORD_SEPARATOR, $record, $FIELD_COUNT+1;

Or else someone will need to include a comma in a field and will decide to escape it with a backslash, in which case you'll need:

     Readonly my $RECORD_SEPARATOR => qr/ \s* (?<!\\) , \s* /xms;  # Unbackslashed comma
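For example, with that separator a backslashed comma stays inside its field. (The sample record and field names here are hypothetical, purely for illustration:)

```perl
my $RECORD_SEPARATOR = qr/ \s* (?<!\\) , \s* /xms;    # Unbackslashed comma

# 'Smith\, John' contains an escaped comma, so it must remain one field...
my ($ident, $name, $price)
    = split $RECORD_SEPARATOR, 'X123,Smith\, John,24.95', 4;

# $ident is 'X123', $name is 'Smith\, John', $price is '24.95'
```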

And from there it's "Oh, we ought to be able to backslash a backslash too" and then "Hey, let's allow double-quoted fields so we don't have to backslash any of the commas in them". At which point your attempts to write a suitable separator regex for split have become a whirling vortex of pain, as you struggle to reinvent the "Comma-Separated Values" encoding. Badly.
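A single record with a double-quoted field shows how quickly a plain split breaks down (again, the sample data is hypothetical):

```perl
my $record = q{X123,"Pratchett, Terry",29.95};

# Naive comma-splitting cuts the quoted field in half...
my @fields = split /,/, $record;

# @fields is now: ('X123', '"Pratchett', ' Terry"', '29.95')
# ...four "fields" instead of three, with stray quotes left in the data
```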

The split function is ideal for simple cases, but scales very poorly when some variant of CSV is being parsed. As soon as your record format goes beyond a simple separator that can be recognized with a (non-lookbehind) regex, consider whether you can respecify your data format and rewrite your code to use the Text::CSV_XS module instead:

    use Text::CSV_XS;

    # Specify format...
    my $csv_format = Text::CSV_XS->new({
        sep_char    => q{,},     # Fields are comma-separated
        escape_char => q{\\},    # Backslashed chars are always data
        quote_char  => q{"},     # Fields can be double-quoted
    });

    # Grab each line/record...
    RECORD:
    while (my $record = <$sales_data>) {
        # Verify record is correctly formatted (or skip it)...
        if (!$csv_format->parse($record)) {
            warn 'Record ', $sales_data->input_line_number(), " not valid: '$record'";
            next RECORD;
        }

        # Extract all fields...
        my ($ident, $sales, $price) = $csv_format->fields();

        # Append each record, translating ID codes and
        # normalizing sales (which are stored in 1000s)...
        push @sales, {
            ident => translate_ID($ident),
            sales => $sales * 1000,
            price => $price,
        };
    }

This solution first constructs a specialized CSV parser (Text::CSV_XS->new()), specifying which characters to use as the field separator, the escape character, and the field-quoting delimiter. Then the while loop checks whether each line conforms to the CSV syntax ($csv_format->parse($record)) and, if so, retrieves the fields that the call to parse() successfully extracted.
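For instance, that parser correctly handles a quoted field containing a comma, which defeated every split-based attempt above. (The sample record is hypothetical, and Text::CSV_XS must be installed from CPAN:)

```perl
use Text::CSV_XS;

my $csv_format = Text::CSV_XS->new({
    sep_char    => q{,},
    escape_char => q{\\},
    quote_char  => q{"},
});

my $record = q{X123,"Pratchett, Terry",29.95};

if ($csv_format->parse($record)) {
    my ($ident, $name, $price) = $csv_format->fields();
    # $name is now 'Pratchett, Terry': quotes stripped, comma preserved
}
```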

In fact, the previous code structure ("read, parse, extract, repeat") is so common that it has been encapsulated into an even cleaner solution: the Text::CSV::Simple module. Using that module, the previous example becomes:

    use Text::CSV::Simple;

    # Specify format...
    my $csv_format = Text::CSV::Simple->new({
        sep_char    => q{,},     # Fields are comma-separated
        escape_char => q{\\},    # Backslashed chars are always data
        quote_char  => q{"},     # Fields can be double-quoted
    });

    # Specify field names in order (any other fields will be ignored)...
    $csv_format->field_map( qw( ident sales price ) );

    # Grab each line/record...
    for my $record_ref ($csv_format->read_file($sales_data)) {
        push @sales, {
            ident => translate_ID($record_ref->{ident}),
            sales => $record_ref->{sales} * 1000,
            price => $record_ref->{price},
        };
    }

This version first creates a Text::CSV::Simple object, passing it the same configuration arguments as before (because it's actually just a wrapper around a Text::CSV_XS object). The call to field_map() then tells the object the name of each field, in the order in which they occur within the data. A single call to read_file() then reads in the entire file and converts it to a list of hashes, one for each record that was read. Finally, the for loop processes each of those hashes, extracting the appropriately named fields ($record_ref->{ident}, $record_ref->{sales}, and $record_ref->{price}).

Note, however, that, as the name implies, the Text::CSV_XS module is written in C, compiled to a library, and made available to Perl via the "XS" bridging mechanism[*]. If you need to run your code on a system where compiled modules of this kind cannot be used, the Text::CSV module provides a (slower, much less configurable) alternative that is implemented in pure Perl.

[*] See the perlxs manpage for the horrifying details.
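Assuming a recent release of Text::CSV (which mirrors the Text::CSV_XS interface, and even delegates to the XS version when it happens to be available), the switch is typically just a change of module name:

```perl
use Text::CSV;    # pure-Perl fallback; recent releases accept the same arguments

my $csv_format = Text::CSV->new({
    sep_char    => q{,},
    escape_char => q{\\},
    quote_char  => q{"},
});

# parse() and fields() then work exactly as in the Text::CSV_XS example
```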



Perl Best Practices
ISBN: 0596001738
Year: 2004
Pages: 350
Authors: Damian Conway