9.7 bptutorial.pl: sequence_manipulation Demo


In this section, I'll go through the code for the demo subroutine sequence_manipulation that was shown in the last section.

The subroutine is actually an anonymous subroutine; a reference to the subroutine is saved in the scalar reference variable $sequence_manipulation :

 $sequence_manipulations = sub { ... } 

The first few lines of code declare some variables with my . Notice that these are not being passed in as arguments; this method uses no arguments but does occasionally use global variables such as $dna_seq_file , which, as you've just seen, contain the pathname of the input sequence file the demo will use:

 my ($infile, $in, $out, $seqobj); $infile = $dna_seq_file; print "\nBeginning sequence_manipulations and SeqIO example... \n"; 

The code is cross-referenced to the tutorial sections of the file. The next comment line refers to the part of the document:

 # III.3.1 Transforming sequence files (SeqIO) 

which can be looked up in the table of contents to the document for further reading:

 III.3 Manipulating sequences III.3.1 Manipulating sequence data with Seq methods (Seq) 

Now, I'll take a look at the first section of example code in the sequence_manipulations method:

 # III.3.1 Transforming sequence files (SeqIO) $in  = Bio::SeqIO->new('-file' => $infile ,'-format' => 'Fasta'); $seqobj = $in->next_seq(  ); # perl "tied filehandle" syntax is available to SeqIO, # allowing you to use the standard <> and print operations # to read and write sequence objects, eg: #$out = Bio::SeqIO->newFh('-format' => 'EMBL'); $out = Bio::SeqIO->newFh('-format' => 'fasta'); print "First sequence in fasta format... \n"; print $out $seqobj; 

The code starts with a call to the new object constructor of the Bio::SeqIO class. The new method is being passed the pathname to a FASTA file in $infile , and told that the format is FASTA.

A quick look at the Bio::SeqIO documentation explains that the call to Bio::SeqIO->new returns a stream object for the specified format. So, $out is a stream object (a stream is input or output of data) for FASTA-formatted data, and $in is a stream object for FASTA-formatted input from the file named in the $infile variable. These $in and $out objects are also filehandles.

After the $in object is initialized on the FASTA file named in $infile , it calls the next_seq method, which gets the next (in this case, the first and perhaps only) FASTA record from the file, and it creates a sequence object $seqobj . The output $out object is created. The Perl print statement is then called, using $out as a filehandle, and printing $seqobj . This prints the first FASTA record from that file, as shown in the demo output in the last section and repeated here:

 First sequence in fasta format...  >Test1 AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC ACAACATCCATGAAACGCATTAGCACCACC 

The next part of the sequence_manipulation demo code shows many of the methods for extracting information from a sequence object:

 # III.4 Manipulating individual sequences # The following methods return strings print "Seq object display id is ", $seqobj->display_id(  ), "\n"; # the human read-able id of the sequence print "Sequence is ", $seqobj->seq(  )," \n";        # string of sequence print "Sequence from 5 to 10 is ", $seqobj->subseq(5,10)," \n"; # part of the sequence as a string print "Acc num is ", $seqobj->accession_number(  ), " \n"; # when there, the accession number print "Moltype is ", $seqobj->alphabet(  ), " \n";    # one of 'dna','rna','protein' print "Primary id is ", $seqobj->primary_seq->primary_id(  )," \n"; # a unique id for this sequence irregardless #print "Primary id is ", $seqobj->primary_id(  ), " \n"; # a unique id for this sequence irregardless # of its display_id or accession number # The following methods return an array of  Bio::SeqFeature objects $seqobj->top_SeqFeatures; # The 'top level' sequence features $seqobj->all_SeqFeatures; # All sequence features, including sub # seq features # The following methods returns new sequence objects, # but do not transfer features across # truncation from 5 to 10 as new object print "Truncated Seq object sequence is ", $seqobj->trunc(5,10)->seq(  ), " \n"; # reverse complements sequence print "Reverse complemented sequence 5 to 10  is ", $seqobj->trunc(5,10)->revcom->seq, "  \n"; # translation of the sequence print "Translated sequence 6 to 15 is ", $seqobj->translate->subseq(6,15), " \n"; 

This section of the tutorial produced the following output (slightly reformatted to fit the pages of this book):

 Seq object display id is Test1 Sequence is AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGA TAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACC AATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC  Sequence from 5 to 10 is TTTCAT  Acc num is unknown  Moltype is dna  Primary id is Test1  Truncated Seq object sequence is TTTCAT  Reverse complemented sequence 5 to 10  is ATGAAA   Translated sequence 6 to 15 is LQRAICLCVD 

This part of the demo code is self-explanatory. As you can see, the methods return:

  • The FASTA ID (the part immediately following the > sign in the first line)

  • The sequence alone as a string (reformatted with line breaks to fit this book)

  • The accession number if it is known (FASTA format won't have this, but Genbank format will, for instance)

  • The type of molecule (dna, rna, or protein)

  • A unique ID compared to other IDs in the running program but drawn from the file itself if possible

  • The sequence of a new truncated sequence object defined from the original sequence object; a reverse complement

  • The peptide resulting from a translation of a specified part of a sequence

The final section of the sequence_manipulation demo demonstrates the ability to translate nucleotides into all six reading frames using alternate codon translation tables if desired:

 my $c = shift; $c = 'ctgagaaaataa'; print "\nBeginning 3-frame and alternate codon translation example... \n"; my $seq = new Bio::PrimarySeq('-SEQ' => $c, '-ID' => 'no.One'); print "$c translated using method defaults   : ", $seq->translate->seq, "\n"; # Bio::Seq uses same sequence methods as PrimarySeq my $seq2 = new Bio::Seq('-SEQ' => $c, '-ID' => 'no.Two'); print "$c translated as a coding region (CDS): ", $seq2->translate(undef, undef, undef, undef, 1)->seq, "\n"; print "\nTranslating in all six frames:\n"; my @frames = (0, 1, 2); foreach my $frame (@frames) {     print  " frame: ", $frame, " forward: ",     $seq->translate(undef, undef, $frame)->seq, "\n";     print  " frame: ", $frame, " reverse-complement: ",     $seq->revcom->translate(undef, undef, $frame)->seq, "\n"; } print "Translating with all codon tables using method defaults:\n"; my @codontables = qw( 1 2 3 4 5 6 9 10 11 12 13 14 15 16 21 ); foreach my $ct (@codontables) {     print $ct, " : ",     $seq->translate(undef, undef, undef, $ct)->seq, "\n"; } 

This produces the output:

 Beginning 3-frame and alternate codon translation example...  ctgagaaaataa translated using method defaults   : LRK* ctgagaaaataa translated as a coding region (CDS): MRK Translating in all six frames:  frame: 0 forward: LRK*  frame: 0 reverse-complement: LFSQ  frame: 1 forward: *EN  frame: 1 reverse-complement: YFL  frame: 2 forward: EKI  frame: 2 reverse-complement: IFS Translating with all codon tables using method defaults: 1 : LRK* 2 : L*K* 3 : TRK* 4 : LRK* 5 : LSK* 6 : LRKQ 9 : LSN* 10 : LRK* 11 : LRK* 12 : SRK* 13 : LGK* 14 : LSNY 15 : LRK* 16 : LRK* 21 : LSN* 

Again, this code is fairly easy to follow, although you may find yourself turning to the documentation for some of the finer points when you try to use these objects and methods in your own code.

This part of the code starts by getting an argument of sequence data if one is available to be shifted off of the @_ argument array and into the variable $c ; otherwise , it sets the variable $c to a preset sequence:

 my $c = shift; $c = 'ctgagaaaataa'; 

Briefly, the methods that follow in this part of the sequence_manipulation demo do the following:

  • Call the PrimarySeq::new constructor to create a lightweight sequence object from $c that doesn't have the extra annotation often present in a standard sequence file (see perldoc Bio::PrimarySe q for the whole story)

  • Translate the sequence (from $c )

  • Call the Bio::Seq constructor to create the $seq2 sequence object (from the same sequence data in $seq )

  • Translate the CDS in the sequence object $seq2

  • Translate the sequence in all six reading frames

  • Translate the sequence using all 21 defined codon tables

The translate method seems to have lots of interesting options. However, if you try to look up the documentation for this method, you may have difficulty finding it. The problem is that Bioperl classes often make considerable use of inheritance. Say that the code is calling a method on a Bio::PrimarySeq object, as do the following lines from the demo (somewhat separated in the original):

 my $seq = new Bio::PrimarySeq('-SEQ' => $c, '-ID' => 'no.One'); $seq->translate(undef, undef, $frame)->seq, "\n"; 

You may try perldoc Bio::PrimarySeq and not find a discussion of the translate method because it is being inherited from some other class. And since multiple inheritance is possible, it can take considerable effort to track down this method documentation by figuring out the parent classes and searching the documentation.

There's a very convenient way to find the documentation for a method, even if it's only inherited ”it's built into the bptutorial.pl script. In the example just mentioned, you ask for the list of methods used by the B io::PrimarySeq method.

 [tisdall]$ perl bptutorial.pl 100 Bio::PrimarySeq    ***Methods for Object Bio::PrimarySeq ********      Methods taken from package Bio::DescribableI    description   display_name     Methods taken from package Bio::IdentifiableI    authority   lsid_string   namespace   namespace_string   object_id   version       Methods taken from package Bio::PrimarySeq    accession   direct_seq_set   validate_seq     Methods taken from package Bio::PrimarySeqI    accession_number   alphabet   can_call_new   desc   display_id   id     is_circular   length   moltype   primary_id   revcom   seq     subseq   translate   trunc     Methods taken from package Bio::Root::Root    DESTROY   debug   verbose     Methods taken from package Bio::Root::RootI    carp   confess   deprecated   new   stack_trace   stack_trace_dump     throw   throw_not_implemented   warn   warn_not_implemented   [tisdall]$ 

Sure enough, there under the heading "Methods taken from package Bio::PrimarySeqI" appears the method translate . You should then look at the Bio::PrimarySeqI documentation and find the following method description:

 translate Title   : translate Usage   : $protein_seq_obj = $dna_seq_obj->translate           #if full CDS expected:           $protein_seq_obj = $cds_seq_obj->translate(undef,undef,undef,undef,1); Function:           Provides the translation of the DNA sequence using full           IUPAC ambiguities in DNA/RNA and amino acid codes.           The full CDS translation is identical to EMBL/TREMBL           database translation. Note that the trailing terminator           character is removed before returning the translation           object.           Note: if you set $dna_seq_obj->verbose(1) you will get a           warning if the first codon is not a valid initiator. Returns : A Bio::PrimarySeqI implementing object Args    : character for terminator (optional) defaults to '*'           character for unknown amino acid (optional) defaults to 'X'           frame (optional) valid values 0, 1, 2, defaults to 0           codon table id (optional) defaults to 1           complete coding sequence expected, defaults to 0 (false)           boolean, throw exception if not complete CDS (true) or           defaults to warning (false) 

You will want to explore at least a few more of the tutorials embedded in bptutorial.pl , as we've done here with the first of the tutorials.



Mastering Perl for Bioinformatics
Mastering Perl for Bioinformatics
ISBN: 0596003072
EAN: 2147483647
Year: 2003
Pages: 156

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net