4.3 SeqFileIO.pm: Sequence File Formats | Mastering Perl for Bioinformatics

Our primary interest is bioinformatics.Can we extend the FileIO class to handle biological sequence datafiles? For example, can a class be written that takes a GenBank file and writes the sequence out in FASTA format?

Using the technique of inheritance, in this section I present a module for a new class SeqFileIO that performs several basic functions on sequence files of various formats. When you call this module's read method, in addition to reading the file's contents and setting the name , date, and write mode of the file, it automatically determines the format of the sequence file, extracts the sequence, and when available, extracts the annotation, ID, and accession number. In addition, a set of put methods makes it easy to present the sequence and annotation in other formats. ^[1]

^[1] Don Gilbert's readseq package (see http://iobio.bio.indiana.edu/soft/molbio/readseq and ftp://ftp.bio.indiana.edu/molbio/readseq/classic/src) is the classic program (written in C) for reading and writing multiple sequence file formats.

4.3.1 Analysis of SeqFileIO.pm

The first part of the module SeqFileIO.pm contains the block with definitions of the new, or revised, class data and methods.

The first thing you should notice is the use command:

 use base ( "FileIO" );

This Perl command tells the current package SeqFileIO it's inheriting from the base class FileIO . Here's another statement that's often used for this purpose:

 @ISA = ( "FileIO" );

The @ISA predefined variable tells a package that it "is a" version of some base class; it then can inherit methods from that base class. The use base directive sets the @ISA array to the base class(es), plus a little else besides. (Check perldoc base for the whole story.) Without getting bogged down in details use base works a little more robustly than just setting the @ISA array, so that's what I'll use here:

 package SeqFileIO; use base ( "FileIO" ); use strict; use warnings; #use vars '$AUTOLOAD'; use Carp; # Class data and methods {     # A list of all attributes with defaults and read/write/required/noinit properties     my %_attribute_properties = (         _filename    => [ '',   'read.write.required'],         _filedata    => [ [ ],  'read.write.noinit'],         _date        => [ '',   'read.write.noinit'],         _writemode   => [ '>',  'read.write.noinit'],         _format      => [ '',   'read.write'],         _sequence    => [ '',   'read.write'],         _header      => [ '',   'read.write'],         _id          => [ '',   'read.write'],         _accession   => [ '',   'read.write'],     );              # Return a list of all attributes     sub _all_attributes {             keys %_attribute_properties;     }     # Check if a given property is set for a given attribute     sub _permissions {         my($self, $attribute, $permissions) = @_;         $_attribute_properties{$attribute}[1] =~ /$permissions/;     }     # Return the default value for a given attribute     sub _attribute_default {             my($self, $attribute) = @_;         $_attribute_properties{$attribute}[0];     }     my @_seqfileformats = qw(         _raw         _embl         _fasta         _gcg         _genbank         _pir         _staden     );     sub isformat {          my($self) = @_;         for my $format (@_seqfileformats) {             my $is_format = "is$format";             if($self->$is_format) {                 return $format;             }         }         return 'unknown';     } }

4.3.1.1 The power of inheritance

A comparison with the opening block in the base class FileIO is instructive. You'll see that I'm redefining the %_attribute_properties hash and the methods in the block that access the hash as closures. This is necessary to extend the new class to handle new attributes that relate specifically to sequence datafiles:

_format: The format of the sequence datafile, such as FASTA or GenBank
_sequence: The sequence data extracted from the sequence datafile as a scalar string
_header: The header part of the annotation; defined somewhat loosely in this module
_id: The ID of the sequence datafile, such as a gene name or other identifier
_accession: The accession number of the sequence datafile, when provided

You'll notice, in comparing this opening block with the opening block of the base class FileIO , that the methods and variables relating to counting the number of objects has been omitted in this new class.

Here's where you see the power of inheritance for the first time. When the code in the new SeqFileIO.pm module tries to call, say, the get_count method, and Perl sees that it's not defined in the module, it will go on a hunt for the method in the base class and use the definition it finds there. On the other hand, if it finds the method in SeqFileIO.pm , it just uses that; if the get_count method appeared in the base class but is redefined in SeqFileIO.pm , it uses that redefinition and ignores the older definition in the base class.

So, you don't have to rewrite methods in the base class(es); you can just call them. Perl not only finds them, it also calls them as if they were defined in the new class. For example, ref($self) returns SeqFileIO , not FileIO , without regard to whether the method is defined in the new class or inherited from the old class.

Finally, there are some new definitions in this new version of the opening block. An array @_seqfileformats lists the sequence file formats the module knows about, and a method isformat calls the methods associated with each format (such as is_fasta defined later in the module) until it either finds what format the file is in or returns unknown .

The @_seqfileformats array uses the qw operator. This splits the words on whitespace (including newlines) and returns a list of quoted words. It's a convenience for giving lists of words without having to quote each one or separate them by commas. (Check the perlop manpage for the section on "RegExp Quote-Like Operators" for all the variations and alternative quoting operators.)

Following the opening block, notice that there is no new method defined in this class. The simple, bare-bones new method from the FileIO class serves just as well for this new class; thanks to the inheritance mechanism, there's no need to write a new one. A program that calls SeqFileIO->new will use the same method from the FileIO class, but the class name will be properly set to SeqFileIO for the new object created.

4.3.1.2 A new read method

The next part of the program has a rewrite of the read method. As a result, the read method from the parent FileIO class is being overridden:

 # Called from object, e.g. $obj->read(  ); sub read {     my ($self, %arg) = @_;     # Set attributes     foreach my $attribute ($self->_all_attributes(  )) {         # E.g. attribute = "_filename",  argument = "filename"         my($argument) = ($attribute =~ /^_(.*)/);         # If explicitly given         if (exists $arg{$argument}) {             # If initialization is not allowed             if($self->_permissions($attribute, 'noinit')) {                 croak("Cannot set $argument from read: use set_$argument");             }             $self->{$attribute} = $arg{$argument};         # If not given, but required         }elsif($self->_permissions($attribute, 'required')) {             croak("No $argument attribute as required");         # Set to the default         }else{             $self->{$attribute} = $self->_attribute_default($attribute);         }     }     # Read file data     unless( open( FileIOFH, $self->{_filename} ) ) {         croak("Cannot open file " .  $self->{_filename} );     }     $self->{'_filedata'} = [ <FileIOFH> ];     $self->{'_date'} = localtime((stat FileIOFH)[9]);     $self->{'_format'} = $self->isformat;     my $parsemethod = 'parse' . $self->{'_format'};     $self->$parsemethod;     close(FileIOFH); }

The new read method just shown is almost exactly the same as the read function in the FileIO class. There are only three new lines of code, just before the end of the subroutine, preceding the call to the close function:

 $self->{'_format'} = $self->isformat; my $parsemethod = 'parse' . $self->{'_format'}; $self->$parsemethod;

These three new lines initialize the new attributes that were defined for this class. First, a call to isformat determines the format of the file and sets the _format attribute appropriately. The appropriate parse_ method name is then constructed , and finally, a call to that method is made. As you're about to see, the parse_ methods extract the sequence, header, ID, and accession number (or as many of these as possible) from the file data and set the object's attributes accordingly .

So, by the end of the read method, the file data has been read into the object, the format determined, and the interesting parts of the data (such as the sequence data) parsed out.

4.3.2 New Methods: is, parse, and put

The rest of the module consists of three groups of methods that handle the different sequence file formats. These methods didn't appear in the more simple and generic FileIO module and must be defined here:

is_: Tests to see if data is in a particular sequence file format
parse_: Parses out the sequence, and when possible the header, ID, and accession number
put_: Takes the sequence attribute, and when available the header, ID, and accession number attributes, and writes them in the sequence file format named in the format attribute

4.3.2.1 is_ methods

The first group of methods tests an object's file data to see if it conforms to a particular sequence file format:

 sub is_raw {   my($self) = @_;   my $seq = join('', @{$self->{_filedata}} );   ($seq =~ /^[ACGNT\s]+$/) ? return 'raw' : 0; } sub is_embl {   my($self) = @_;   my($begin,$seq,$end) = (0,0,0);   foreach( @{$self->{_filedata}} ) {     /^ID\s/ && $begin++;     /^SQ\s/ && $seq++;     /^\/\// && $end++;     (($begin =  = 1) && ($seq =  = 1) && ($end =  = 1)) && return 'embl';   }   return; } sub is_fasta {   my($self) = @_;   my($flag) = 0;   for(@{$self->{_filedata}}) {     #This to avoid confusion with Primer, which can have input beginning ">"     /^\*seq.*:/i && ($flag = 0) && last;     if( /^>/ && $flag =  = 1) {       last;     }elsif( /^>/ && $flag =  = 0) {       $flag = 1;     }elsif( (! /^>/) && $flag =  = 0) { #first line must start with ">"       last;     }   }   $flag ? return 'fasta' : return; } sub is_gcg {   my($self) = @_;   my($i,$j) = (0,0);   for(@{$self->{_filedata}}) {     /^\s*$/ && next;     /Length:.*Check:/ && ($i += 1);     /^\s*\d+\s*[a-zA-Z\s]+/ && ($j += 1);     ($i =  = 1) && ($j =  = 1) && return('gcg');   }   return; } sub is_genbank {   my($self) = @_;   my $Features = 0;   for(@{$self->{_filedata}}) {     /^LOCUS/ && ($Features += 1);     /^DEFINITION/ && ($Features += 2);     /^ACCESSION/ && ($Features += 4);     /^ORIGIN/ &&  ($Features += 8);     /^\/\// && ($Features += 16);     ($Features =  = 31) && return 'genbank';   }   return; } sub is_pir {   my($self) = @_;   my($ent,$ti,$date,$org,$ref,$sum,$seq,$end) = (0,0,0,0,0,0,0,0);   for(@{$self->{_filedata}}) {     /ENTRY/ && $ent++;     /TITLE/ && $ti++;     /DATE/ && $date++;     /ORGANISM/ && $org++;     /REFERENCE/ && $ref++;     /SUMMARY/ && $sum++;     /SEQUENCE/ && $seq++;     /\/\/\// && $end++;     $ent =  = 1 && $ti =  = 1 && $date >= 1 && $org >= 1       && $ref >= 1 && $sum =  = 1 && $seq =  = 1 && $end =  = 1       && return 'pir';   }   return; } sub is_staden {   my($self) = @_;   for(@{$self->{_filedata}}) {     /<-+([^-]*)-+>/ && return 'staden';   }   0; }

is_ methods are designed to be fast. They don't check to see if the file conforms to the official file format in every detail. Instead, they look to see if, given that the file is supposed to be a sequence file, there is a good indication that it is a particular file format. In other words, these methods perform heuristic, not rigorous , tests. If they are well conceived, these methods correctly identify the different formats without confusion and with a minimum of code and computation. However, they don't ensure that a format is conforming in every respect to the official format definition. (See the exercises for other approaches to file format recognition.)

Also, note that these are fairly simple tests; for example, the is_raw method doesn't check for the IUB ambiguity codes but only the four bases A, C, G, T, plus N for an undetermined base. Furthermore, some software systems have more than one format, for which only one format is shown here, such as PIR and GCG. The bottom line is that this code does what it is intended to do, but not more. It is, however, fairly short and easy to modify and extend, and I encourage you to try your hand at doing so in the exercises.

An interesting Perl note: the "leaning toothpick" regular expression notation for three forward slashes /// is a bit forbidding because each forward slash needs a backslash escape in front of it:

 /\/\/\// && $end++;

An alternative would be to use m and a separator other than /, which leads to more readable code:

 m!///! && $end++;

One more interesting Perl note: in this code, I return a false value from a subroutine like so:

 return;

This is a bit confusing because unless you already know or check in the documentation for return , it's not clear what's happening. In a scalar context, return; returns the undefined value undef ; in a list context, return; returns an empty list. So, no matter what context the subroutine is called from, a valid value is returned.

4.3.2.2 put_ methods

The next group of put_ methods returns an object's sequence and annotation data in a particular sequence file format:

 sub put_raw {   my($self) = @_;   my($out);   ($out = $self->{_sequence}) =~ tr/a-z/A-Z/;   return($out); } sub put_embl {   my($self) = @_;   my(@out,$tmp,$len,$i,$j,$a,$c,$g,$t,$o);   $len = length($self->{_sequence});   $a=($self->{_sequence} =~ tr/Aa//);   $c=($self->{_sequence} =~ tr/Cc//);   $g=($self->{_sequence} =~ tr/Gg//);   $t=($self->{_sequence} =~ tr/Tt//);   $o=($len - $a - $c - $g - $t);   $i=0;   $out[$i++] = sprintf("ID   %s %s\n",$self->{_header}, $self->{_id} );   $out[$i++] = "XX\n";   $out[$i++] = sprintf("SQ   sequence %d BP; %d A; %d C; %d G; %d T; %d other;\n",               $len, $a, $c, $g, $t, $o);   for($j = 0 ; $j < $len ; ) {       $out[$i] .= sprintf("%s",substr($self->{_sequence},$j,10));       $j += 10;       if( $j < $len && $j % 60 != 0 ) {         $out[$i] .= " ";       }elsif ($j % 60 =  = 0 ) {         $out[$i++] .= "\n";       }   }   if($j % 60 != 0 ) {     $out[$i++] .= "\n";   }   $out[$i] = "//\n";   return @out; } sub put_fasta {   my($self) = @_;   my(@out,$len,$i,$j);   $len = length($self->{_sequence});   $i = 0;   $out[$i++] = "> " . $self->{_header} . "\n";   for($j=0; $j<$len ; $j += 50) {     $out[$i++]=sprintf("%.50s\n",substr($self->{_sequence},$j,50));   }   return @out; } sub put_gcg {   my($self) = @_;   my(@out,$len,$i,$j,$cnt,$sum);   $len = length($self->{_sequence});   #calculate Checksum   for($i=0; $i<$len ;$i++) {     $cnt++;     $sum += $cnt * ord(substr($self->{_sequence},$i,1));     ($cnt =  = 57)&& ($cnt=0);   }   $sum %= 10000;   $i = 0;                $out[$i++] = sprintf("%s\n",$self->{_header});   $out[$i++] = sprintf("    %s Length: %d (today)  Check: %d  ..\n", $self->{_id},                       $len, $sum);   for($j = 0 ; $j < $len ; ) {       if( $j % 50 =  = 0) {         $out[$i] = sprintf("%8d  ",$j+1);       }       $out[$i] .= sprintf("%s",substr($self->{_sequence},$j,10));       $j += 10;       if( $j < $len && $j % 50 != 0 ) {         $out[$i] .= " ";       }elsif ($j % 50 =  = 0 ) {         $out[$i++] .= "\n";       }   }   if($j % 50 != 0 ) {     $out[$i] .= "\n";   }   return @out; } sub put_genbank {   my($self) = @_;   my(@out,$len,$i,$j,$cnt,$sum);   my($seq) = $self->{_sequence};   $seq =~ tr/A-Z/a-z/;   $len = length($seq);   for($i=0; $i<$len ;$i++) {     $cnt++;     $sum += $cnt * ord(substr($seq,$i,1));     ($cnt =  = 57) && ($cnt=0);   }   $sum %= 10000;   $i = 0;   $out[$i++] = sprintf("LOCUS       %s       %d bp\n",$self->{_id}, $len);   $out[$i++] = sprintf("DEFINITION  %s , %d bases, %d sum.\n", $self->{_header},                        $len, $sum);   $out[$i++] = sprintf("ACCESSION  %s\n", $self->{_accession}, );   $out[$i++] = sprintf("ORIGIN\n");   for($j = 0 ; $j < $len ; ) {       if( $j % 60 =  = 0) {         $out[$i] = sprintf("%8d  ",$j+1);       }       $out[$i] .= sprintf("%s",substr($seq,$j,10));       $j += 10;       if( $j < $len && $j % 60 != 0 ) {         $out[$i] .= " ";       }elsif($j % 60 =  = 0 ) {         $out[$i++] .= "\n";       }   }   if($j % 60 != 0 ) {     $out[$i] .= "\n";     ++i;   }   $out[$i] = "//\n";   return @out; } sub put_pir {   my($self) = @_;   my($seq) = $self->{_sequence};   my(@out,$len,$i,$j,$cnt,$sum);   $len = length($seq);   for($i=0; $i<$len ;$i++) {     $cnt++;     $sum += $cnt * ord(substr($seq,$i,1));     ($cnt=  =57) && ($cnt=0);   }   $sum %= 10000;   $i = 0;   $out[$i++] = sprintf("ENTRY           %s\n",$self->{_id});   $out[$i++] = sprintf("TITLE           %s\n",$self->{_header});   #JDT ACCESSION out if defined   $out[$i++] = sprintf("DATE            %s\n",'');   $out[$i++] = sprintf("REFERENCE       %s\n",'');   $out[$i++] = sprintf("SUMMARY #Molecular-weight %d  #Length %d  #Checksum %d\                       n",0,$len,$sum);   $out[$i++] = sprintf("SEQUENCE\n");   $out[$i++] = sprintf("              5      10      15      20      25      30\n");   for($j=1; $seq && $j < $len ; $j += 30) {     $out[$i++] = sprintf("%7d ",$j);     $out[$i++] = sprintf("%s\n", join(' ',split(//,substr($seq, $j - 1,length($seq)                        < 30 ? length($seq) : 30))) );   }   $out[$i++] = sprintf("///\n");   return @out; } sub put_staden {   my($self) = @_;   my($seq) = $self->{_sequence};   my($i,$j,$len,@out);   $i = 0;   $len = length($self->{_sequence});   $out[$i] = ";\<------------------\>\n";   substr($out[$i],int((20-length($self->{_id}))/2),length($self->{_id})) = $self->                           {_id};   $i++;   for($j=0; $j<$len ; $j+=60) {     $out[$i++]=sprintf("%s\n",substr($self->{_sequence},$j,60));   }   return @out; }

The put_ methods are, by necessity, more cognizant of the detailed rules of a particular file format.

Note the Perl function ord in put_gcg . This built-in function gives the numeric value of a character. Other Perl functions such as sprintf and substr can be reviewed as necessary by typing perldoc -f substr (for example) or by visiting the http://www.perdoc.com web page.

4.3.2.3 parse_ methods

parse_ , the third and final group of methods, parses the contents of a file in a particular format, extracting the sequence and, if possible, some additional header information.

 sub parse_raw {   my($self) = @_; ## Header and ID should be set in calling program after this    my($seq);      $seq = join('',@{$self->{_filedata}});   if( ($seq =~ /^([acgntACGNT\-\s]+)$/)) {     ($self->{_sequence} = $seq) =~ s/\s//g;   }else{     carp("parse_raw failed");   } } sub parse_embl {   my($self) = @_;   my($begin,$seq,$end,$count) = (0,0,0,0);   my($sequence,$head,$acc,$id);   for(@{$self->{_filedata}}) {     ++$count;     if(/^ID/) {       $begin++;       /^ID\s*(.*\S)\s*/ && ($id = ($head = )) =~ s/\s.*//;     }elsif(/^SQ\s/){       $seq++;     }elsif(/^\/\//){       $end++;     }elsif($seq =  = 1){       $sequence .= $_;     }elsif(/^AC\s*(.*(;\S)).*/){ #put this here - AC could be sequence       $acc .= ;     }     if($begin =  = 1 && $seq =  = 1 && $end =  = 1) {       $sequence =~ tr/a-zA-Z//cd;       $sequence =~ tr/a-z/A-Z/;       $self->{_sequence} = $sequence;       $self->{_header} = $head;        $self->{_id} = $id;       $self->{_accession} = $acc;       return 1;     }   }   return; } sub parse_fasta {   my($self) = @_;   my($flag,$count) = (0,0);   my($seq,$head,$id);   for(@{$self->{_filedata}}) {     #avoid confusion with Primer, which can have input beginning ">"     /^\*seq.*:/i && ($flag =  = 0) && last;     if(/^>/ && $flag =  = 1) {        last;     }elsif(/^>/ && $flag =  = 0){       /^>\s*(.*\S)\s*/ && ($id=($head=)) =~ s/\s.*//;       $flag=1;     }elsif( (! /^>/) && $flag =  = 1) {        $seq .= $_;     }elsif( (! /^>/) && $flag =  = 0) {        last;     }     ++$count;   }   if($flag) {     $seq =~ tr/a-zA-Z-//cd;     $seq =~ tr/a-z/A-Z/;     $self->{_sequence} = $seq;     $self->{_header} = $head;     $self->{_id} = $id;   } } sub parse_gcg {   my($self) = @_;   my($seq,$head,$id);   my($count,$flag) = (0,0);   for(@{$self->{_filedata}}) {     if(/^\s*$/) {        ;     }elsif($flag =  = 0 && /Length:.*Check:/){       /^\s*(\S+).*Length:.*Check:/;       $flag=1;       ($id=) =~ s/\s.*//;     }elsif($flag =  = 0 && /^\S/) {       ($head = $_) =~ s/\n//;      }elsif($flag =  = 1 && /^\s*[^\d\s]/) {       last;     }elsif($flag =  = 1 && /^\s*\d+\s*[a-zA-Z \t]+$/) {        $seq .= $_;     }     $count++;   }   $seq =~ tr/a-zA-Z//cd;   $seq =~ tr/a-z/A-Z/;   $head = $id unless $head;   $self->{_sequence} = $seq;   $self->{_header} = $head;   $self->{_id} = $id;   return 1; } # # sub parse_genbank {   my($self) = @_;   my($count,$features,$flag,$seqflag) = (0,0,0,0);   my($seq,$head,$id,$acc);   for(@{$self->{_filedata}}) {     if( /^LOCUS/ && $flag =  = 0 ) {       /^LOCUS\s*(.*\S)\s*$/;       ($id=($head=)) =~ s/\s.*//;       $features += 1;       $flag = 1;     }elsif( /^DEFINITION\s*(.*)/ && $flag =  = 1) {       $head .= " ";       $features += 2;     }elsif( /^ACCESSION/ && $flag =  = 1 ) {       /^ACCESSION\s*(.*\S)\s*$/;       $head .= " ".($acc=);       $features += 4;     }elsif( /^ORIGIN/ ) {       $seqflag = 1;       $features += 8;     }elsif( /^\/\// ) {       $features += 16;     }elsif( $seqflag =  = 0 ) {        ;      }elsif($seqflag =  = 1) {        $seq .= $_;     }     ++$count;     if($features =  = 31) {       $seq =~ tr/a-zA-Z//cd;       $seq =~ tr/a-z/A-Z/;       $self->{_sequence} = $seq;       $self->{_header} = $head;       $self->{_id} = $id;       $self->{_accession} = $acc;       return 1;     }   }   return; } sub parse_pir {   my($self) = @_;   my($begin,$tit,$date,$organism,$ref,$summary,$seq,$end,$count) =                       (0,0,0,0,0,0,0,0,0);   my($flag,$seqflag) = (0,0);   my($sequence,$header,$id,$acc);   for(@{$self->{_filedata}}) {     ++$count;     if( /^ENTRY\s*(.*\S)\s*$/ && $flag =  = 0 ) {       $header=;       $flag=1;       $begin++;     }elsif( /^TITLE\s*(.*\S)\s*$/ && $flag =  = 1 ) {       $header .= ;       $tit++;     }elsif( /ORGANISM/ ) {       $organism++;     }elsif( /^ACCESSIONS\s*(.*\S)\s*$/ && $flag =  = 1 ) {       ($id=($acc=)) =~ s/\s*//;     }elsif( /DATE/ ) {       $date++;     }elsif( /REFERENCE/ ) {       $ref++;     }elsif( /SUMMARY/ ) {       $summary++;     }elsif( /^SEQUENCE/ ) {       $seqflag = 1;       $seq++;     }elsif( /^\/\/\// && $flag =  = 1 ) {       $end++;     }elsif( $seqflag =  = 0) {       next;     }elsif( $seqflag =  = 1 && $flag =  = 1) {       $sequence .= $_;     }     if( $begin =  = 1 && $tit =  = 1 && $date >= 1 && $organism >= 1         && $ref >= 1 && $summary =  = 1 && $seq =  = 1 && $end =  = 1        ) {       $sequence =~ tr/a-zA-Z//cd;       $sequence =~ tr/a-z/A-Z/;       $self->{_sequence} = $seq;       $self->{_header} = $header;       $self->{_id} = $id;       $self->{_accession} = $acc;       return 1;     }   }   return; } sub parse_staden {   my($self) = @_;   my($flag,$count) = (0,0);   my($seq,$head,$id);   for(@{$self->{_filedata}}) {     if( /<---*\s*(.*[^-\s])\s*-*--->(.*)/ && $flag =  = 0 ) {       $id = $head = ;       $seq .= ;       $flag = 1;     }elsif( /<---*(.*)-*--->/ && $flag =  = 1 ) {       $count--;       last;     }elsif( $flag =  = 1 ) {       $seq .= $_;     }     ++$count;   }   if( $flag ) {     $seq =~ s/-/N/g;     $seq =~ tr/a-zA-Z-//cd;     $seq =~ tr/a-z/A-Z/;     $self->{_sequence} = $seq;     $self->{_header} = $head;     $self->{_id} = $id;     return 1;   }   return; } 1;

That's the end of the SeqFileIO.pm module that defines the SeqFileIO class.

You can see that to add the capability to handle a new sequence file format, you simply have to write new is_ , put_ , and parse_ methods, and add the name of the new format to the @_seqfiletypes array. So, extending this software to handle more sequence file formats is relatively easy. (See the exercises.)

To end this chapter, here is a small test program testSeqFileIO that exercises the main parts of the SeqFileIO.pm module, followed by its output. See the exercises for a discussion of ways to improve this test.

4.3.3 Testing SeqFileIO.pm

Here is the program testSeqFileIO to test the SeqFileIO module:

 #!/usr/bin/perl use strict; use warnings; use lib "/home/tisdall/MasteringPerlBio/development/lib"; use SeqFileIO; # # First test basic FileIO operations #  (plus filetype attribute) # my $obj = SeqFileIO->new(  ); $obj->read(   filename => 'file1.txt' ); print "The file name is ", $obj->get_filename, "\n"; print "The contents of the file are:\n", $obj->get_filedata; print "\nThe date of the file is ", $obj->get_date, "\n"; print "The filetype of the file is ", $obj->get_filetype, "\n"; $obj->set_date('today'); print "The reset date of the file is ", $obj->get_date, "\n"; print "The write mode of the file is ", $obj->get_writemode, "\n"; print "\nResetting the data and filename\n"; my @newdata = ("line1\n", "line2\n"); $obj->set_filedata( \@newdata ); print "Writing a new file \"file2.txt\"\n"; $obj->write(filename => 'file2.txt'); print "Appending to the new file \"file2.txt\"\n"; $obj->write(filename => 'file2.txt', writemode => '>>'); print "Reading and printing the data from \"file2.txt\":\n"; my $file2 = SeqFileIO->new(  ); $file2->read(   filename => 'file2.txt' ); print "The file name is ", $file2->get_filename, "\n"; print "The contents of the file are:\n", $file2->get_filedata; print "The filetype of the file is ", $file2->get_filetype, "\n"; print <<'HEADER'; ########################################## # # Test file format recognizing and reading # ########################################## HEADER my $genbank = SeqFileIO->new(  ); $genbank->read(   filename => 'record.gb' ); print "The file name is ", $genbank->get_filename, "\n"; print "\nThe date of the file is ", $genbank->get_date, "\n"; print "The filetype of the file is ", $genbank->get_filetype, "\n"; print "The contents of the file are:\n", $genbank->get_filedata; print "\n####################\n####################\n####################\n"; my $raw = SeqFileIO->new(  ); $raw->read(   filename => 'record.raw' ); print "The file name is ", $raw->get_filename, "\n"; print "\nThe date of the file is ", $raw->get_date, "\n"; print "The filetype of the file is ", $raw->get_filetype, "\n"; print "The contents of the file are:\n", $raw->get_filedata; print "\n####################\n####################\n####################\n"; my $embl = SeqFileIO->new(  ); $embl->read(   filename => 'record.embl' ); print "The file name is ", $embl->get_filename, "\n"; print "\nThe date of the file is ", $embl->get_date, "\n"; print "The filetype of the file is ", $embl->get_filetype, "\n"; print "The contents of the file are:\n", $embl->get_filedata; print "\n####################\n####################\n####################\n"; my $fasta = SeqFileIO->new(  ); $fasta->read(   filename => 'record.fasta' ); print "The file name is ", $fasta->get_filename, "\n"; print "\nThe date of the file is ", $fasta->get_date, "\n"; print "The filetype of the file is ", $fasta->get_filetype, "\n"; print "The contents of the file are:\n", $fasta->get_filedata; print "\n####################\n####################\n####################\n"; my $gcg = SeqFileIO->new(  ); $gcg->read(   filename => 'record.gcg' ); print "The file name is ", $gcg->get_filename, "\n"; print "\nThe date of the file is ", $gcg->get_date, "\n"; print "The filetype of the file is ", $gcg->get_filetype, "\n"; print "The contents of the file are:\n", $gcg->get_filedata; print "\n####################\n####################\n####################\n"; my $staden = SeqFileIO->new(  ); $staden->read(   filename => 'record.staden' ); print "The file name is ", $staden->get_filename, "\n"; print "\nThe date of the file is ", $staden->get_date, "\n"; print "The filetype of the file is ", $staden->get_filetype, "\n"; print "The contents of the file are:\n", $staden->get_filedata; print "\n####################\n####################\n####################\n"; print <<'REFORMAT'; ########################################## # # Test file format reformatting and writing # ########################################## REFORMAT print "At this point there are ", $staden->get_count, " objects.\n\n"; print "######\n###### Testing put methods\n######\n\n"; print "\nPrinting staden data in raw format:\n"; print $staden->put_raw; print "\nPrinting staden data in embl format:\n"; print $staden->put_embl; print "\nPrinting staden data in fasta format:\n"; print $staden->put_fasta; print "\nPrinting staden data in gcg format:\n"; print $staden->put_gcg; print "\nPrinting staden data in genbank format:\n"; print $staden->put_genbank; print "\nPrinting staden data in PIR format:\n"; print $staden->put_pir;

The test program depends on certain files being present on the system; the contents of the files are clear from the program output, and the files can be downloaded from this book's web site, along with the rest of the programs for this book.

4.3.4 Results

Here is the output from testSeqFileIO :

 The file name is file1.txt The contents of the file are: > sample dna  (This is a typical fasta header.) agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat acacctgagccactctcagatgaggaccta The date of the file is Thu Dec  5 11:22:56 2002 The filetype of the file is _fasta The reset date of the file is today The write mode of the file is > Resetting the data and filename Writing a new file "file2.txt" Appending to the new file "file2.txt" Reading and printing the data from "file2.txt": The file name is file2.txt The contents of the file are: line1 line2 line1 line2 The filetype of the file is _unknown ########################################## # # Test file format recognizing and reading # ########################################## The file name is record.gb The date of the file is Sun Mar 30 14:30:09 2003 The filetype of the file is _genbank The contents of the file are: LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000 DEFINITION  Sequence severely truncated for demonstration. ACCESSION   AB031069 VERSION     AB031069.1  GI:8100074 KEYWORDS    . SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to             mRNA.   ORGANISM  Homo sapiens             Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;             Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. REFERENCE   1  (sites)   AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and             Takano,T.   TITLE     PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain,             is regulated by proteolysis   JOURNAL   Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000)   MEDLINE   20261256 REFERENCE   2  (bases 1 to 2487)   AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and             Takano,T.   TITLE     Direct Submission   JOURNAL   Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.             Tadahiro Fujino, Keio University School of Medicine, Department of             Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan             (E-mail:fujino@microb.med.keio.ac.jp,             Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508) FEATURES             Location/Qualifiers      source          1..2487                      /organism="Homo sapiens"                      /db_xref="taxon:9606"                      /sex="male"                      /cell_line="HuS-L12"                      /cell_type="lung fibroblast"                      /dev_stage="embryo"      gene            229..2199                      /gene="PCCX1"      CDS             229..2199                      /gene="PCCX1"                      /note="a nuclear protein carrying a PHD finger and a CXXC                      domain"                      /codon_start=1                      /product="protein containing CXXC domain 1"                      /protein_id="BAA96307.1"                      /db_xref="GI:8100075"                      /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD                      NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP                      RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ                      QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY                      FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP                      EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE                      KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS                      DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR                      FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK                      YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC                      PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT                      AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR" BASE COUNT      564 a    715 c    768 g    440 t ORIGIN               1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg        61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg       121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt       181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat       241 aaaaaaaaaa aaaaaaaaaa aaaaaaa // #################### #################### #################### The file name is record.raw The date of the file is Sun Mar 30 14:30:39 2003 The filetype of the file is _raw The contents of the file are: AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGGTTTGCAGCGGAGACGACGCATGGGGCCTGCGCAAT AGGAGTACGCTGCCTGGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCCTGGGACGCCGCCGAGTGGTC TGTGCAGGTTCGCGGGTCGCTGGCGGGGGTCGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGATAAAAAAAAAAAAAAA AAAAAAAAAAAA #################### #################### #################### The file name is record.embl The date of the file is Sun Mar 30 14:31:23 2003 The filetype of the file is _embl The contents of the file are: ID   AB031069     2487 bp    mRNA            PRI       27-MAY-2000 Sequence severely  truncated for demonstration. AB031069 AB031069 XX SQ   sequence 267 BP; 65 A; 54 C; 106 G; 42 T; 0 other; AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG TTTGCAGCGG AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT GGGAGGCGTG ACTAGAAGCG GAAGTAGTTG TGGGCGCCTT TGCAACCGCC TGGGACGCCG CCGAGTGGTC TGTGCAGGTT CGCGGGTCGC TGGCGGGGGT CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT AAAAAAAAAA AAAAAAAAAA AAAAAAA // #################### #################### #################### The file name is record.fasta The date of the file is Sun Mar 30 14:31:40 2003 The filetype of the file is _fasta The contents of the file are: > AB031069     2487 bp    mRNA            PRI       27-MAY-2000 Sequence severely  truncated for demonstration. AB031069 AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGG TTTGCAGCGGAGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCT GGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCC TGGGACGCCGCCGAGTGGTCTGTGCAGGTTCGCGGGTCGCTGGCGGGGGT CGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGATAAAAAAAAAA AAAAAAAAAAAAAAAAA #################### #################### #################### The file name is record.gcg The date of the file is Sun Mar 30 14:31:47 2003 The filetype of the file is _gcg The contents of the file are: AB031069     2487 bp    mRNA            PRI       27-MAY-2000 Sequence severely truncated for demonstration. AB031069     AB031069 Length: 267 (today)  Check: 4285  ..        1  AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG       51  TTTGCAGCGG AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT      101  GGGAGGCGTG ACTAGAAGCG GAAGTAGTTG TGGGCGCCTT TGCAACCGCC      151  TGGGACGCCG CCGAGTGGTC TGTGCAGGTT CGCGGGTCGC TGGCGGGGGT      201  CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT AAAAAAAAAA      251  AAAAAAAAAA AAAAAAA #################### #################### #################### The file name is record.staden The date of the file is Sun Mar 30 14:32:01 2003 The filetype of the file is _staden The contents of the file are: ;<----AB031069------> AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGGTTTGCAGCGG AGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCTGGGAGGCGTGACTAGAAGCG GAAGTAGTTGTGGGCGCCTTTGCAACCGCCTGGGACGCCGCCGAGTGGTCTGTGCAGGTT CGCGGGTCGCTGGCGGGGGTCGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGAT AAAAAAAAAAAAAAAAAAAAAAAAAAA #################### #################### #################### ########################################## # # Test file format reformatting and writing # ########################################## At this point there are 8 objects. ###### ###### Testing put methods ###### Printing staden data in raw format: AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGGTTTGCAGCGGAGACGACGCATGGGGCCTGCGCAAT AGGAGTACGCTGCCTGGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCCTGGGACGCCGCCGAGTGGTC TGTGCAGGTTCGCGGGTCGCTGGCGGGGGTCGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGATAAAAAAAAAAAAAAAA AAAAAAAAAAA Printing staden data in embl format: ID   AB031069 AB031069 XX SQ   sequence 267 BP; 65 A; 54 C; 106 G; 42 T; 0 other; AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG TTTGCAGCGG AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT GGGAGGCGTG ACTAGAAGCG GAAGTAGTTG TGGGCGCCTT TGCAACCGCC TGGGACGCCG CCGAGTGGTC TGTGCAGGTT CGCGGGTCGC TGGCGGGGGT CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT AAAAAAAAAA AAAAAAAAAA AAAAAAA // Printing staden data in fasta format: > AB031069 AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGG TTTGCAGCGGAGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCT GGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCC TGGGACGCCGCCGAGTGGTCTGTGCAGGTTCGCGGGTCGCTGGCGGGGGT CGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGATAAAAAAAAAA AAAAAAAAAAAAAAAAA Printing staden data in gcg format: AB031069     AB031069 Length: 267 (today)  Check: 4285  ..        1  AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG       51  TTTGCAGCGG AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT      101  GGGAGGCGTG ACTAGAAGCG GAAGTAGTTG TGGGCGCCTT TGCAACCGCC      151  TGGGACGCCG CCGAGTGGTC TGTGCAGGTT CGCGGGTCGC TGGCGGGGGT      201  CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT AAAAAAAAAA      251  AAAAAAAAAA AAAAAAA Printing staden data in genbank format: LOCUS       AB031069       267 bp DEFINITION  AB031069 , 267 bases, 829 sum. ACCESSION   ORIGIN        1  agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg       61  agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg      121  gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt      181  cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat      241  aaaaaaaaaa aaaaaaaaaa aaaaaaa // Printing staden data in PIR format: ENTRY           AB031069 TITLE           AB031069 DATE             REFERENCE        SUMMARY         #Molecular-weight 0  #Length 267  #Checksum 4285 SEQUENCE                 5        10        15        20        25        30       1 A G A T G G C G G C G C T G A G G G G T C T T G G G G G C T      31 C T A G G C C G G C C A C C T A C T G G T T T G C A G C G G      61 A G A C G A C G C A T G G G G C C T G C G C A A T A G G A G      91 T A C G C T G C C T G G G A G G C G T G A C T A G A A G C G     121 G A A G T A G T T G T G G G C G C C T T T G C A A C C G C C     151 T G G G A C G C C G C C G A G T G G T C T G T G C A G G T T     181 C G C G G G T C G C T G G C G G G G G T C G T G A G G G A G     211 T G C G C C G G G A G C G G A G A T A T G G A G G G A G A T     241 A A A A A A A A A A A A A A A A A A A A A A A A A A A ///