Hack 21. Use Any Spreadsheet As a Data Source | Perl Hacks: Tips & Tools for Programming, Debugging, and Surviving

Make your data analysis independent of the spreadsheet program.

Spreadsheets are useful for holding structured data, usually based on columns and rows. Most often part of the data is calculated data from other cells in the same spreadsheet.

If you want to work with that data, you face the problem of too many standards and programs. Writing a script that has to read the data from the spreadsheet is more writing an interface to the spreadsheet than actually working with the interesting data.

Accessing Cell Data

The Spreadsheet::Read module gives you a single interface to the data of most spreadsheet formats available, hiding all the troublesome work that deals with the parsers and the portability stuff, yet being flexible enough to get to the guts of the spreadsheet.

It's easy to use:

use Spreadsheet::Read; my $ref  = ReadData( 'test.xls' ); my $fval = $ref->[1]{A3}; my $uval = $ref->[1]{cell}[1][3];

Here $ref is a reference to a structure that represents the data from the spreadsheet (test.xls). The reference points to a list (the worksheets) of hashes (the data).

Every cell has two representations: either access it by its name (A3), in which case the interface gives you the formatted value, or the cell hash, in which case you get the unformatted value of the cell.

Do I need Spreadsheet::Read for that?

No you don't, but it makes life easier. Setting aside all the good things of the various user interfaces for the available spreadsheets (95% probably Excel or OpenOffice.org), coding access to the cell data in the available native parsers is not always as easy as it should be. These interfaces try to give you full control, but you have no easy way to access the data.

Some examples for native equivalences of the previous code snippet are:

Microsoft Excel
OpenOffice.org
Comma Separated Values

They all show you the contents of cell A3, where you can interpret the CSV file as a collection of rows (the lines) and columns (the fields).

Spreadsheet::Read gives the same interface to all of these, but uses the native parser in the background. The only thing you have to alter if you change spreadsheet formats is the ReadData( ) call:

my $ref = ReadData( 'test.xls' ); # or my $ref = ReadData( 'test.sxc' ); # or my $ref = ReadData( 'test.csv' );

which will make your code much more readable, maintainable, and portable. Your code won't depend on the spreadsheet format used by the people shipping you the data.

Accessing a data column

Accessing a single field is good if you know the field you need to access, but quite often, your script has to analyze the data (structure) itself. For that you need a full set of data. Spreadsheet user interfaces always refer to the data location as a (column, row) pair, where (Perl) programmers more often use the (row, column) way of indexing. Perl starts indexing at 0, where spreadsheets usually start with 1. Spreadsheet::Read starts with 1 for the data and uses the zeroth field for internal control data. To fetch a complete column:

# Fetch me column "B" my @colB = @{ $ref->[1]{cell}[2] }; shift @colB;

or:

my @colB = @{$ref->[1]{cell}[2]}[1..$#{$ref->[1]{cell}[2]}];

Accessing a row of data

Likewise for fetching a complete row:

# Fetch me row 4 my @row4 = map { $ref->[1]{cell}[$_][4] } 1..$ref->[1]{maxcol};

Using programmer-style indexing

If you need to go over and through the complete set of data and prefer to have the data in a list of rows, instead of a list of columns, indexed from 0 not 1, Spreadsheet::Read offers a function to convert that for you:

use Spreadsheet::Read qw( rows ); # Get all data in a row oriented list my @rows = rows( $ref->[1] ); # A3 is now in $rows[2][0]

Showing all data in a spreadsheet

Want to show all of the data in a spreadsheet?

use Spreadsheet::Read; my $file        = 'test.xls'; my $spreadsheet = ReadData( $file )         or die "Cannot read $file\\n"; my $sheet_count = $spreadsheet->[0]{sheets} or die "No sheets in $file\\n"; for my $sheet_index (1 .. $sheet_count) {     # Skip empty worksheets     my $sheet = $spreadsheet->[$sheet_index] or next;     printf( "%s - %02d: [ %-12s ] %3d Cols, %5d Rows\\n", $file,         $sheet_index, $sheet->{label}, $sheet->{maxcol}, $sheet->{maxrow} );     for my $row ( 1 .. $sheet->{maxrow} )     {         print join "\\t" => map {         $sheet->{cell}[$_][$row] // "-" } 1 .. $sheet->{maxcol};         print "\\n";     } }

The output will be something like:

test.xls - 01: [ Sheet1       ]   4 Cols,     4 Rows A1      B1      -       D1 A2      B2      -       - A3      -       C3      D3 A4      B4      C4      - test.xls - 02: [ Second Sheet ]   5 Cols,     3 Rows x       -       x       -       x -       x       -       x x               x               x

Note that the example uses the defined-or operator (//) from Perl 6. This is available as a patch for Perl 5.8.x and will be available in Perl 5.10.

Empty cells are often undef values, which is not the same as an empty string "". If you use the above code with strict and warnings, there will be a warning for every empty cell if you do not use the defined-or. Showing empty fields as - is more visibly attractive than using whitespace.

Written in more portable code, this is equivalent to:

print join "\\t" => map {     my $val = $sheet->{cell}[$_][$row];     defined $val ? $val : "-"; } 1 .. $sheet->{maxcol};

How It Works

Spreadsheet::Read does no parsing of the spreadsheets itself, instead using the native parsers to do the hard work. For Microsoft Excel, it uses Spreadsheet::ParseExcel, for OpenOffice.org, Spreadsheet::ReadSXC, and for CSV, Text::CSV_XS.

Using Spreadsheet::Read, you do not have to worry about spreadsheet internal formats, or the way the native parser presents the data to the programmer. The interface is the same and is independent of the spreadsheet you use. If you need to get to the guts for anything this interface does not (yet) support, you can always fall back to the real parser, because without it, Spreadsheet::Read does not work anyway.

Spreadsheet::Read tries to achieve a set of commonly supported features of all of the parsers it can use and aims to extend that in the future to make the use of the native parsers unnecessary (such as color attributes, display formats, font face and sizes, and character encoding). All native parsers support that in a different way, if they support it at all. For example, CSV does not have a defined way of identifying the character encoding of the data.

Hacking the Hack

The more spreadsheet formats this module supports, the more value it gains in portability and eventually for your script's maintainability.

Currently, the module supports a hook for Spreadsheet::Perl, but there is no parser support for it yet. It would probably do this module well to isolate the conversions for the different parsers in separate modules, such as Spreadsheet::Read::Excel, to avoid cluttering the main interface.

The module comes with one conversion script: xlscat, which takes a file in any of the supported spreadsheet formats and converts it to either readable ASCII or CSV. Use xlscat -? to see the supported options. If you have useful scripts, they may be worth bundling with this module.

The module does not die if any of the parsers is not installed, making it useful if you only use OpenOffice.org and do not yet bother with Excel (or vice versa). It is quite easy and valuable to add your own parser support and supply a patch to the author to include it in future releases.