4.2 FileIO.pm: A Class to Read and Write Files | Mastering Perl for Bioinformatics

Even though you can easily obtain excellent modules for reading and writing files, this chapter shows you how to build a simple one from scratch. One reason for doing this is to better understand the issues every bioinformatics programmer needs to face, such as how to organize files and keep track of their contents. Another reason is so you can see how to extend the class to deal with the multiple file format problem that is peculiar to bioinformatics.

It's not uncommon for a biologist to use several different file formats for DNA or protein sequence data and to translate from one format to another. Doing these translations by hand is very tedious. It's also tedious to save alternate forms of the same sequence data in differently formatted files. You'll see how to alleviate some of this pain by automating some of these tasks in a new class called SeqFileIO.pm.

Class inheritance is one of the main reasons why object-oriented software is so reusable. In order to see clearly how it works, let's start with the simple class FileIO.pm and later use it to define a more complex class, SeqFileIO.pm.

FileIO is a simple class that reads and writes files, and stores simple information such as the file contents, date, and write permissions.

You know that it's often possible to modify existing code to create your own program. When I wrote FileIO.pm, I simply made a copy of the Gene.pm module from Chapter 3 and modified it.

On my Linux system, I started by copying Gene.pm to a new file named FileIO.pm:

 cp Gene.pm FileIO.pm

I then edited the new file FileIO.pm, changing the line near the top that says:

 package Gene;

to:

 package FileIO;

The filename must be the same as the class name, with the .pm extension added.

Though I now needed to modify the module to do what I want, a surprising amount of the overall framework of the code (its constructor, accessor and mutator methods, and its basic data structures) remains the same. Gene.pm already contained such useful parts as a new constructor, a hash-based object data structure, accessor methods to retrieve values of the attributes of the object, and mutator methods to alter attribute values. These are likely to be needed by most classes that you'll write in your own software projects.

4.2.1 Analysis of FileIO

Following is the code for FileIO, with commentary interspersed:

    package FileIO;

    #
    # A simple IO class for sequence data files
    #

    use strict;
    use warnings;
    our $AUTOLOAD; # before Perl 5.6.0 say "use vars '$AUTOLOAD';"
    use Carp;

    # Class data and methods
    {
        # A list of all attributes with defaults and read/write/required/noinit properties
        my %_attribute_properties = (
            _filename    => [ '',   'read.write.required'],
            _filedata    => [ [ ],  'read.write.noinit'],
            _date        => [ '',   'read.write.noinit'],
            _writemode   => [ '>',  'read.write.noinit'],
        );

        # Global variable to keep count of existing objects
        my $_count = 0;

        # Return a list of all attributes
        sub _all_attributes {
            keys %_attribute_properties;
        }

        # Check if a given property is set for a given attribute
        sub _permissions {
            my($self, $attribute, $permissions) = @_;
            $_attribute_properties{$attribute}[1] =~ /$permissions/;
        }

        # Return the default value for a given attribute
        sub _attribute_default {
            my($self, $attribute) = @_;
            $_attribute_properties{$attribute}[0];
        }

        # Manage the count of existing objects
        sub get_count {
            $_count;
        }
        sub _incr_count {
            ++$_count;
        }
        sub _decr_count {
            --$_count;
        }
    }

In this first part of the FileIO.pm file, the headers are exactly the same as in Gene.pm.

The opening block, which contains the class data and methods, also remains the same except for the hash %_attribute_properties. This new version of the hash has different attributes (the filename, the file data, the last modification date of the file, and the mode to use in writing a file) tailored to the needs of reading and writing files.

In addition to the read, write, and required properties, there is also a new "no initialization" (or noinit) property. An attribute with the noinit property cannot be given a value by the caller when the object is first initialized. In this module, attributes such as the data from the file, or the date on the file, are set only when the file is read or written. You may also have noticed that the default value for the filedata attribute is an anonymous array.

Note that each attribute has both read and write properties. This being the case, you could simply omit the listing of these two properties. However, in the interest of future modification, when I may want to add some attribute that won't have both properties, I've left in the specification of the read and write properties. (Note that one method name, get_count, doesn't start with an underscore; this encourages you to call this method to get a count of how many objects currently exist.)
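To make the property strings concrete, here is a minimal sketch of how a dot-separated string such as 'read.write.noinit' can be tested with a regular expression. The hash %properties and the subroutine has_property are illustrative names of my own, not part of FileIO.pm, and the pattern here adds word boundaries that the module's _permissions method does not use.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the class's %_attribute_properties hash:
# each attribute maps to [ default value, dot-separated property string ].
my %properties = (
    _filename  => [ '',  'read.write.required' ],
    _filedata  => [ [],  'read.write.noinit'   ],
    _writemode => [ '>', 'read.write.noinit'   ],
);

# Return 1 if the named attribute carries the given property, else 0.
# (has_property is an illustrative name, not a FileIO.pm method.)
sub has_property {
    my ($attribute, $property) = @_;
    return $properties{$attribute}[1] =~ /\b\Q$property\E\b/ ? 1 : 0;
}

print "_filename is required\n"
    if has_property('_filename', 'required');
print "_filedata may not be set by the caller at initialization\n"
    if has_property('_filedata', 'noinit');
print "_filename may be set by the caller at initialization\n"
    unless has_property('_filename', 'noinit');
```

Keeping the properties in one string per attribute means a new property can be added later without changing the shape of the data structure.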

4.2.1.1 The constructor method

You'll notice in the following code that I have cut the new constructor down to the bare bones.

    # The constructor method
    # Called from class, e.g. $obj = FileIO->new();
    sub new {
        my ($class, %arg) = @_;

        # Create a new object
        my $self = bless { }, $class;

        $class->_incr_count();
        return $self;
    }

Why did I do so? Read on.

The read method

The code continues with the read method:

    # Called from object, e.g. $obj->read();
    sub read {
        my ($self, %arg) = @_;

        # Set attributes
        foreach my $attribute ($self->_all_attributes()) {
            # E.g. attribute = "_filename",  argument = "filename"
            my($argument) = ($attribute =~ /^_(.*)/);

            # If explicitly given
            if (exists $arg{$argument}) {
                # If initialization is not allowed
                if($self->_permissions($attribute, 'noinit')) {
                    croak("Cannot set $argument from read: use set_$argument");
                }
                $self->{$attribute} = $arg{$argument};
            # If not given, but required
            }elsif($self->_permissions($attribute, 'required')) {
                croak("No $argument attribute as required");
            # Set to the default
            }else{
                $self->{$attribute} = $self->_attribute_default($attribute);
            }
        }

        # Read file data
        unless( open( FileIOFH, $self->{_filename} ) ) {
            croak("Cannot open file " .  $self->{_filename} );
        }
        $self->{'_filedata'} = [ <FileIOFH> ];
        $self->{'_date'} = localtime((stat FileIOFH)[9]);
        close(FileIOFH);
    }

This new read method has two parts. The first includes the initialization of the object's attributes from the arguments and the defaults as specified in the %_attribute_properties hash. The second includes the reading of the file and the setting of the _filedata and _date attributes from the file's contents and its last modification time.

The first loop in the program initializes the attributes. If an attribute is specified as an argument, the first test is to see if the noinit property is set. This forbids initializing the attribute, in which case the program croaks. Otherwise, the attribute is set.

If the attribute isn't passed as an argument but has a required property (only the _filename attribute has the required property), the program croaks.

Finally, if the argument isn't given and not required, the attribute is set to the default value.

After performing those initializations, the read method reads in the specified file. If it can't open the file, the program croaks. (See the exercises for a discussion of this use of croak.)
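Because croak raises an exception rather than returning an error code, a caller that wants to recover from a missing file has to trap it. The following is a minimal sketch of that pattern; read_or_croak is a hypothetical stand-in for the read method, and the path is deliberately one that I assume does not exist.

```perl
use strict;
use warnings;
use Carp;

# read_or_croak is a hypothetical stand-in for FileIO's read method.
sub read_or_croak {
    my ($filename) = @_;
    open(my $fh, '<', $filename)
        or croak("Cannot open file $filename");
    my @lines = <$fh>;
    close($fh);
    return @lines;
}

# eval traps the exception; afterwards $@ holds the message, or is
# empty if nothing was thrown. The path is assumed not to exist.
my @lines = eval { read_or_croak('/no/such/dir/no_such_file.txt') };
my $error = $@;

if ($error) {
    print "Caught: $error";
}
else {
    print "Read ", scalar(@lines), " lines\n";
}
```

Whether a failed open should end the program or be handled by the caller is a design decision; croak leaves that decision to the caller.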

The file is read by the line:

 $self->{'_filedata'} = [ <FileIOFH> ];

In list context, the input operator on the opened filehandle, written <FileIOFH>, reads in the entire file. Here the list context comes from the square brackets around the input operator, which build an anonymous array from the lines that are read. A reference to this anonymous array containing the file's contents is then assigned to the _filedata attribute.
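Here is a small self-contained sketch of the same idiom. It differs from the module in two ways that are my own choices: it uses a lexical filehandle with the three-argument form of open instead of the bareword FileIOFH, and it reads a scratch file created with File::Temp so that it runs anywhere.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small scratch file so the example is self-contained.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "> sample dna\n", "agatggcggc\n", "tttgcagcgg\n";
close($tmp_fh) or die "Cannot close $tmp_name: $!";

# Three-argument open with a lexical filehandle.
open(my $fh, '<', $tmp_name) or die "Cannot open $tmp_name: $!";

# The square brackets supply list context, so <$fh> returns every line;
# the result is a reference to an anonymous array of those lines.
my $filedata = [ <$fh> ];
close($fh);

print "Read ", scalar(@$filedata), " lines\n";
print "First line: $filedata->[0]";
```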

4.2.1.2 stat and localtime functions

Finally, the Perl stat and localtime functions are called to generate a string with the file's last modification time, which is assigned to the object attribute _date.

This way of reading a file involves several choices. For instance, stat returns a list of 13 items about a file, including its size, owner, and access permission modes; the modification time is the tenth item, at index 9. As you develop your programs, pay attention to details such as whether you need to save any of these additional attributes of a file, the last modification date, or notes about the kind of data in the file.
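As a short illustration of what stat makes available, the sketch below pulls out the size as well as the modification time. The scratch file and the variable names are my own; only the stat and localtime calls correspond to what the module does.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file to examine.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "acgt\n";
close($tmp_fh) or die "Cannot close $tmp_name: $!";

# stat returns a 13-element list; index 7 is the size in bytes and
# index 9 is the last modification time in seconds since the epoch.
my @info = stat($tmp_name) or die "Cannot stat $tmp_name: $!";
my ($size, $mtime) = @info[7, 9];

# In scalar context, localtime formats the time as a readable string.
my $date = localtime($mtime);

print "Size: $size bytes\n";
print "Last modified: $date\n";
```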

The next line of code in the program:

    #
    # N.B. no "clone" method is necessary
    #

is yet another choice to think about. Are there occasions when cloning a file object makes sense? Maybe I'd like to clone a file object, make some small change to the data, give it a new filename, and write it out. Why have I left this out?

4.2.1.3 The write method

The code continues:

    # Write files
    # Called from object, e.g. $obj->write();
    sub write {
        my ($self, %arg) = @_;

        foreach my $attribute ($self->_all_attributes()) {
            # E.g. attribute = "_filename",  argument = "filename"
            my($argument) = ($attribute =~ /^_(.*)/);

            # If explicitly given
            if (exists $arg{$argument}) {
                $self->{$attribute} = $arg{$argument};
            }
        }

        unless( open( FileIOFH, $self->get_writemode . $self->get_filename ) ) {
            croak("Cannot write to file " .  $self->get_filename);
        }
        unless( print FileIOFH $self->get_filedata ) {
            croak("Cannot write to file " .  $self->get_filename);
        }
        $self->set_date(scalar localtime((stat FileIOFH)[9]));
        close(FileIOFH);
        return 1;
    }

The write method handles writing a file object out to an actual file. First, all arguments corresponding to attributes are set as requested. The file is then opened for writing, using the _writemode attribute to specify the mode: for example, > to truncate the file before writing, or >> to append to the file. The print FileIOFH statement actually does the writing to the opened FileIOFH filehandle, retrieving the file data from the object with the get_filedata method defined by means of AUTOLOAD. Finally, the object's _date attribute is reset to the new modification time.
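The difference between the two write modes is easy to see in isolation. In this sketch, write_lines is a hypothetical helper of my own that takes the mode as a separate argument to the three-argument form of open, whereas the module concatenates the mode and the filename into one string.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/file2";
my @data = ("line1\n", "line2\n");

# write_lines is a hypothetical helper; $mode is '>' or '>>'.
sub write_lines {
    my ($mode, $filename, @lines) = @_;
    open(my $fh, $mode, $filename) or die "Cannot write to $filename: $!";
    print {$fh} @lines            or die "Cannot write to $filename: $!";
    close($fh)                    or die "Cannot close $filename: $!";
    return 1;
}

write_lines('>',  $file, @data);   # truncate, then write two lines
write_lines('>>', $file, @data);   # append the same two lines

# Read the file back to see the combined result.
open(my $in, '<', $file) or die "Cannot open $file: $!";
my @contents = <$in>;
close($in);

print "File now has ", scalar(@contents), " lines\n";
```

Calling write_lines with '>' a second time instead of '>>' would leave only two lines in the file.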

4.2.1.4 AUTOLOAD

The next section of code is the AUTOLOAD method itself:

    # This takes the place of such accessor definitions as:
    #  sub get_attribute { ... }
    # and of such mutator definitions as:
    #  sub set_attribute { ... }
    sub AUTOLOAD {
        my ($self, $newvalue) = @_;

        my ($operation, $attribute) = ($AUTOLOAD =~ /(get|set)(_\w+)$/);

        # Is this a legal method name?
        unless($operation && $attribute) {
            croak "Method name '$AUTOLOAD' is not in the recognized form\n";
        }
        unless(exists $self->{$attribute}) {
            croak "No such attribute '$attribute' exists in the class ", ref($self);
        }

        # AUTOLOAD accessors
        if($operation eq 'get') {
            unless($self->_permissions($attribute, 'read')) {
                croak "$attribute does not have read permission";
            }

            # Turn off strict references to enable symbol table manipulation
            no strict "refs";
            # Install this accessor definition in the symbol table
            *{$AUTOLOAD} = sub {
                my ($self) = @_;
                unless($self->_permissions($attribute, 'read')) {
                    croak "$attribute does not have read permission";
                }
                if(ref($self->{$attribute}) eq 'ARRAY') {
                    return @{$self->{$attribute}};
                }else{
                    return $self->{$attribute};
                }
            };
            # Turn strict references back on
            use strict "refs";

            # Return the attribute value
            # The attribute could be a scalar or a reference to an array
            if(ref($self->{$attribute}) eq 'ARRAY') {
                return @{$self->{$attribute}};
            }else{
                return $self->{$attribute};
            }

        # AUTOLOAD mutators
        }elsif($operation eq 'set') {
            unless($self->_permissions($attribute, 'write')) {
                croak "$attribute does not have write permission";
            }

            # Turn off strict references to enable symbol table manipulation
            no strict "refs";
            # Install this mutator definition in the symbol table
            *{$AUTOLOAD} = sub {
                my ($self, $newvalue) = @_;
                unless($self->_permissions($attribute, 'write')) {
                    croak "$attribute does not have write permission";
                }
                $self->{$attribute} = $newvalue;
            };
            # Turn strict references back on
            use strict "refs";

            # Set and return the attribute value
            $self->{$attribute} = $newvalue;
            return $self->{$attribute};
        }
    }

This AUTOLOAD method has grown! There's only one difference, however, between this code and the AUTOLOAD code for the Gene.pm class. The attributes for FileIO.pm don't all take simple scalar values, as was the case with Gene.pm. One attribute, _filedata, is a reference to an anonymous array. In order for the accessors to return the correct data, they must check to see if an attribute is a scalar or a reference to an array; the accessors can then dereference and return the data from the method call.

So the accessors, and the definitions of them installed into the symbol table, test for an array reference and dereference it accordingly. Other than that, this AUTOLOAD method is exactly the same as that defined for Gene.pm.
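The array-reference test can be shown on its own, apart from AUTOLOAD. In this sketch, get_value is a hypothetical plain subroutine standing in for the generated accessors, and %object is an ordinary hash rather than a blessed object.

```perl
use strict;
use warnings;

# get_value is a hypothetical stand-in for the generated accessors.
sub get_value {
    my ($object, $attribute) = @_;
    my $value = $object->{$attribute};

    # Dereference array references so the caller receives a list;
    # return anything else unchanged.
    return ref($value) eq 'ARRAY' ? @{$value} : $value;
}

my %object = (
    _filename => 'file1.txt',
    _filedata => [ "line1\n", "line2\n" ],
);

my $name  = get_value(\%object, '_filename');   # a plain scalar
my @lines = get_value(\%object, '_filedata');   # the dereferenced list

print "Name: $name\n";
print "Data: ", @lines;
```

One consequence worth knowing: if such an accessor is called in scalar context on an array attribute, the caller receives the number of elements rather than the data.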

You may also have noticed that sections of code in the AUTOLOAD method are almost identical to each other. Recall that AUTOLOAD is invoked when a method with no subroutine defining it is called. AUTOLOAD must do two things. First, it performs whatever method is requested; for example, if an accessor method is requested, it returns the appropriate value. Second, it defines the subroutine that implements the requested method and installs it in the symbol table so the next time the method is called, AUTOLOAD and its considerable overhead won't be necessary. Because of these two requirements, the code AUTOLOAD executes to handle the requested method is nearly identical to the method that AUTOLOAD also defines.
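The two jobs of AUTOLOAD are easier to follow in a stripped-down form. The following is a minimal sketch, assuming a hypothetical class Tiny with a single attribute and no permission checks; the counter $autoload_calls is added only to show that AUTOLOAD stops being invoked once the accessor has been installed.

```perl
use strict;
use warnings;

package Tiny;   # hypothetical class, used only for this sketch

our $AUTOLOAD;
my $autoload_calls = 0;   # counts how often AUTOLOAD itself runs

sub new {
    my ($class, %arg) = @_;
    return bless { _name => $arg{name} }, $class;
}

sub AUTOLOAD {
    my ($self) = @_;
    my ($attribute) = ($AUTOLOAD =~ /::get(_\w+)$/)
        or die "No such method: $AUTOLOAD";
    $autoload_calls++;

    # Install a real subroutine under the requested name, so that
    # later calls bypass AUTOLOAD entirely.
    no strict 'refs';
    *{$AUTOLOAD} = sub { my ($self) = @_; return $self->{$attribute}; };
    use strict 'refs';

    # Perform the request for this first call.
    return $self->{$attribute};
}

sub DESTROY { }   # defined so object destruction never reaches AUTOLOAD

sub autoload_calls { return $autoload_calls }

package main;

my $obj = Tiny->new(name => 'biggene');
print $obj->get_name, "\n";   # handled by AUTOLOAD, which installs get_name
print $obj->get_name, "\n";   # handled by the installed subroutine
print "AUTOLOAD ran ", Tiny->autoload_calls, " time(s)\n";
```

The anonymous subroutine and the final return statement do the same work, which is the duplication described above.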

Finally, here are the last sections of the FileIO.pm program:

    # When an object is no longer being used, this will be automatically called
    # and will adjust the count of existing objects
    sub DESTROY {
        my($self) = @_;
        $self->_decr_count();
    }

    # Other methods. They do not fall into the same form as the majority handled by AUTOLOAD

    #

    1;

The only change here is that there are no other methods (Gene.pm had a citation method).
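To see the object count and DESTROY working together, here is a minimal sketch using a hypothetical class Counted that keeps only the counting machinery. The variables $inside and $after are my own additions for capturing the count at two points.

```perl
use strict;
use warnings;

package Counted;   # hypothetical class, used only for this sketch

{
    my $_count = 0;                  # visible only inside this block
    sub get_count   { return $_count }
    sub _incr_count { ++$_count }
    sub _decr_count { --$_count }
}

sub new {
    my ($class) = @_;
    my $self = bless {}, $class;
    $class->_incr_count();
    return $self;
}

# Called automatically when the last reference to an object goes away.
sub DESTROY {
    my ($self) = @_;
    $self->_decr_count();
}

package main;

my $first = Counted->new();
my $inside;
{
    my $second = Counted->new();
    $inside = Counted->get_count();
}   # $second goes out of scope here, so DESTROY runs for it
my $after = Counted->get_count();

print "Inside the block: $inside\n";
print "After the block:  $after\n";
```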

4.2.2 Finishing FileIO

To finish FileIO.pm, here's some very terse (too terse for anything but a textbook) POD documentation:

    =head1 FileIO

    FileIO: read and write file data

    =head1 Synopsis

        use FileIO;

        my $obj = FileIO->new();
        $obj->read(
            filename => 'jkl'
        );

        print $obj->get_filename, "\n";
        print $obj->get_filedata;

        $obj->set_date('today');
        print $obj->get_date, "\n";

        print $obj->get_writemode, "\n";

        my @newdata = ("line1\n", "line2\n");
        $obj->set_filedata( \@newdata );

        $obj->write(filename => 'lkj');
        $obj->write(filename => 'lkj', writemode => '>>');

        my $o = FileIO->new();
        $o->read(filename => 'lkj');

        print $o->get_filename, "\n";
        print $o->get_filedata;

    =head1 AUTHOR

    James Tisdall

    =head1 COPYRIGHT

    Copyright (c) 2003, James Tisdall

    =cut

4.2.3 Testing the FileIO Class Module

Now that we've got a class module, complete with examples of its use, let's write a small test program and see how it works. Since the examples in the documentation are, in effect, a small test program, let's try running it. We'll use the file file1.txt I created with my text editor that contains:

    > sample dna  (This is a typical fasta header.)
    agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
    tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
    gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
    tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
    cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
    cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
    acacctgagccactctcagatgaggaccta

I'll take the code from the documentation pretty much as is, just adding strict and warnings. I'll also include a use lib directive that adds my development library directory to @INC, the list of directories in which Perl looks for modules. (Recall that you can either edit this line, override it with the PERL5LIB environment variable, or give your own directory on the command line.) I also add a few print statements to make the output easier to read:

    #!/usr/bin/perl

    use strict;
    use warnings;
    use lib "/home/tisdall/MasteringPerlBio/development/lib";

    use FileIO;

    my $obj = FileIO->new();

    $obj->read(
        filename => 'file1.txt'
    );

    print "The file name is ", $obj->get_filename, "\n";
    print "The contents of the file are:\n", $obj->get_filedata;
    print "\nThe date of the file is ", $obj->get_date, "\n";

    $obj->set_date('today');
    print "The reset date of the file is ", $obj->get_date, "\n";

    print "The write mode of the file is ", $obj->get_writemode, "\n";

    print "\nResetting the data and filename\n";
    my @newdata = ("line1\n", "line2\n");
    $obj->set_filedata( \@newdata );

    print "Writing a new file \"file2\"\n";
    $obj->write(filename => 'file2');

    print "Appending to the new file \"file2\"\n";
    $obj->write(filename => 'file2', writemode => '>>');

    print "Reading and printing the data from \"file2\":\n";
    my $file2 = FileIO->new();

    $file2->read(
        filename => 'file2'
    );

    print "The file name is ", $file2->get_filename, "\n";
    print "The contents of the file are:\n", $file2->get_filedata;
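If you're unsure where Perl will look for FileIO.pm, you can print @INC after a use lib directive. This is only a sketch; the directory name below is a placeholder of my own, to be replaced with wherever your copy of the module lives.

```perl
use strict;
use warnings;

# The directory name is a placeholder; substitute your own library path.
use lib '/home/yourname/perl/lib';

# use lib places its directories at the front of @INC at compile time,
# so modules found there take precedence over installed versions.
print "Perl will search these directories for modules:\n";
print "  $_\n" for @INC;
```

The same directory can instead be supplied without editing the program, either through the PERL5LIB environment variable or with the -I switch on the perl command line.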

Finally, I run the test program to get the following output:

    The file name is file1.txt
    The contents of the file are:
    > sample dna  (This is a typical fasta header.)
    agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
    tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
    gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
    tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
    cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
    cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
    acacctgagccactctcagatgaggaccta

    The date of the file is Thu Dec  5 11:22:56 2002
    The reset date of the file is today
    The write mode of the file is >

    Resetting the data and filename
    Writing a new file "file2"
    Appending to the new file "file2"
    Reading and printing the data from "file2":
    The file name is file2
    The contents of the file are:
    line1
    line2
    line1
    line2

The module seems to be performing as hoped. So, now we have a simple module that reads and writes files and provides a few options for the write mode.

But, frankly, this isn't too impressive. You've already been reading and writing files in Perl without the overhead of this FileIO module. The interface to the code is nice, and it's good to have objects that contain the file data, but what has really been accomplished?

The real power of this approach is coming up next. Using class inheritance, this simple module can be extended relatively easily in a very useful direction.

It's another case of the basic software engineering approach of making small, simple, generally useful tools, and then combining them into more powerful and specific applications. So, next, I'll take my simple FileIO class and use it as a base class for a bioinformatics-specific class.