Chapter 3. Data Munging


Hacks 19-27

Perl has always been in love with data. No matter where you find it, Perl happily processes and extracts and reports on files, databases, web pages, spreadsheets, other programs, and anything that produces data. Perl's so happy to do this that it even overlooks brute-force, rough manipulations. Hey, pragmatism works!

Perl can be gentle, too. A little subtlety, a little style and finesse, and you can write maintainable, easy-to-understand code that's just as powerful as the wild-eyed forge-ahead-at-all-costs just-do-the-job code. Why? It's often faster and more correct, as well as more secure, more powerful, and shorter.

Sure, slinging data between sources sounds about as glamorous as slinging hash at the local diner, but it doesn't have to be that way. Here are several ideas to munge that yummy data with all of the elegance and style and power and clarity that you know you have.



Hack 19. Treat a File As an Array

Pretend a big stream of data on disk is a nice, malleable Perl data structure.

One of the big disappointments in programming is realizing that, although you can think of a text file as a long list of properly terminated lines, to the computer, it's just a big blob of ones and zeroes. If all you need to do is read the lines of a file and process them in order, you're fine. If you have a big file that you can't load into memory and can't process each line in order...well, good luck.

Fortunately, Mark Jason Dominus's Tie::File module exists, and is even in the core as of Perl 5.8.0. What good is it?

The Hack

Imagine you have a million-line CSV file of inventory data from a customer that's just not quite right. You can't import it into a spreadsheet, because that's too much data. You need to do some processing, inserting some lines and rearranging others. Importing the data into a little SQLite database won't work either because trying to get the queries right is too troublesome.

Tie::File won't help you write the rules for transforming lines, but it will take the pain out of manipulating the lines of a file. Just:

use Tie::File;

tie my @csv_lines, 'Tie::File', 'big_file.csv'
    or die "Cannot open big_file.csv: $!\n";
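
Once tied, @csv_lines behaves like an ordinary array whose elements are the lines of the file. The following is a minimal sketch rather than part of the original hack: the temporary file and its three product rows are made-up stand-ins for big_file.csv, and it relies on Tie::File's default behavior of returning lines without their trailing newline.

```perl
use strict;
use warnings;
use File::Temp 'tempfile';
use Tie::File;

# Hypothetical sample data standing in for big_file.csv
my ($fh, $filename) = tempfile( SUFFIX => '.csv', UNLINK => 1 );
print {$fh} "1001,Widget,2.50,Acme\n",
            "1002,Gadget,4.00,Acme\n",
            "1003,Gizmo,7.25,Initech\n";
close $fh or die "Cannot close $filename: $!\n";

tie my @csv_lines, 'Tie::File', $filename
    or die "Cannot open $filename: $!\n";

my $count  = @csv_lines;       # number of lines in the file
my $second = $csv_lines[1];    # one line, fetched by index
my $last   = $csv_lines[-1];   # negative indices work as usual

# Assigning to an element rewrites that line on disk
$csv_lines[1] = '1002,Gadget,4.50,Acme';

# Release the file when finished with it
untie @csv_lines;
```

Only the lines you touch are read, so the same code should work whether the file has three lines or a million.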

Running the Hack

Suppose that your big CSV file contains a list of products and operations. That is, each line is either a list of product data (product id, name, price, supplier, et cetera) or some operation to perform on the previous n products. Operations take the form opname:number. Obviously the file would be easier to process if the operations appeared before the data on which to operate, but you can't always change customer data formats to something sane. In fact, this might be the easiest way to clean the data for other processes.

Tie::File makes this almost trivial:

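my $i = 0;

while ( $i < @csv_lines )
{
    # an operation line looks like opname:number; anything else is product data
    my ($op, $num) = $csv_lines[ $i ] =~ /^(\w+):(\d+)/;
    my $op_sub     = defined( $op ) && __PACKAGE__->can( 'op_' . $op );

    # skip product lines, unknown operations, and counts that would
    # reach back past the start of the file
    unless ( $op_sub and $num <= $i ) { $i++; next }

    my $start      = $i - $num;
    my $end        = $i - 1;
    my @lines      = @csv_lines[ $start .. $end ];
    my @newlines   = $op_sub->( @lines );

    # replace the products and the operation line with the results
    splice @csv_lines, $start, $num + 1, @newlines;

    # the file has changed length, so resume just past the new lines
    $i = $start + @newlines;
}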

Okay, there is a bit of cleverness in finding the right range of lines to modify, but consider how much trickier the code would have to be to do this while looping through the file a line at a time.
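
To watch this run end to end, here is a self-contained sketch under some loud assumptions. The handlers op_halve and op_reverse, the sample rows, and the temporary file are all invented for illustration; the source does not say what the customer's operations actually are. The loop steps its index by hand so that a splice that changes the number of lines does not cause the line after it to be skipped.

```perl
use strict;
use warnings;
use File::Temp 'tempfile';
use Tie::File;

# Hypothetical handler for "halve:N": halves the price (third field)
sub op_halve
{
    return map {
        my @fields = split /,/;
        $fields[2] = sprintf '%.2f', $fields[2] / 2;
        join ',', @fields;
    } @_;
}

# Hypothetical handler for "reverse:N": reverses the order of the lines
sub op_reverse { return reverse @_ }

# Made-up sample file: products followed by the operation to apply to them
my ($fh, $filename) = tempfile( SUFFIX => '.csv', UNLINK => 1 );
print {$fh} map { "$_\n" }
    '1001,Widget,2.50,Acme',
    '1002,Gadget,4.00,Acme',
    'halve:2',
    '1003,Gizmo,7.25,Initech',
    '1004,Doohickey,1.00,Initech',
    'reverse:2';
close $fh or die "Cannot close $filename: $!\n";

tie my @csv_lines, 'Tie::File', $filename
    or die "Cannot open $filename: $!\n";

my $i = 0;

while ( $i < @csv_lines )
{
    my ($op, $num) = $csv_lines[ $i ] =~ /^(\w+):(\d+)/;
    my $op_sub     = defined( $op ) && __PACKAGE__->can( 'op_' . $op );

    # not an operation we can perform; move on to the next line
    unless ( $op_sub and $num <= $i ) { $i++; next }

    my $start    = $i - $num;
    my @newlines = $op_sub->( @csv_lines[ $start .. $i - 1 ] );

    # swap the products and their operation line for the results
    splice @csv_lines, $start, $num + 1, @newlines;
    $i = $start + @newlines;
}

my @result = @csv_lines;    # copy the final lines before releasing the file
untie @csv_lines;

print "$_\n" for @result;
```

With this sample data, the two operation lines disappear from the file, the first two products come out at half price, and the last two swap places.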

Of course, you can use all of the standard array manipulation operations (push, pop, shift, unshift, and splice) as necessary.
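
As a rough illustration of those operations on a tied file, consider the sketch below. The file contents and the header line are made up for the example; each statement changes the file on disk as well as the array.

```perl
use strict;
use warnings;
use File::Temp 'tempfile';
use Tie::File;

# Made-up three-line file to experiment on
my ($fh, $filename) = tempfile( SUFFIX => '.csv', UNLINK => 1 );
print {$fh} "1001,Widget\n1002,Gadget\n1003,Gizmo\n";
close $fh or die "Cannot close $filename: $!\n";

tie my @lines, 'Tie::File', $filename
    or die "Cannot open $filename: $!\n";

unshift @lines, 'id,name';          # add a header line at the top
push    @lines, '1004,Doohickey';   # append a line at the end

my $removed = pop @lines;           # take the last line back off
splice @lines, 2, 1;                # delete the line at index 2
my $header  = shift @lines;         # remove the header again

my @remaining = @lines;             # copy what is left
untie @lines;
```

Inserting or deleting near the start of a very large file means Tie::File has to rewrite everything after that point, so operations near the end of the file are likely to be cheaper.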