1.7 Writing Your First Perl Module | Mastering Perl for Bioinformatics

Now that you've been introduced to the basic ideas of modules, it's time to actually examine a working example of a module.

In this section, we'll write a module called Geneticcode.pm , which implements the genetic code that maps DNA codons to amino acids and then translates a string of DNA sequence data to a protein fragment.

1.7.1 An Example: Geneticcode.pm

Let's start by creating a file called Geneticcode.pm and using it to define the mapping of codons to amino acids in a hash variable called %genetic_code . We'll also discuss a subroutine called codon2aa that uses the hash to translate its codon arguments into amino acid return values.

Here are the contents of the first module file Geneticcode.pm :

 package Geneticcode; use strict; use warnings; my(%genetic_code) = (         'TCA' => 'S',    # Serine     'TCC' => 'S',    # Serine     'TCG' => 'S',    # Serine     'TCT' => 'S',    # Serine     'TTC' => 'F',    # Phenylalanine     'TTT' => 'F',    # Phenylalanine     'TTA' => 'L',    # Leucine     'TTG' => 'L',    # Leucine     'TAC' => 'Y',    # Tyrosine     'TAT' => 'Y',    # Tyrosine     'TAA' => '_',    # Stop     'TAG' => '_',    # Stop     'TGC' => 'C',    # Cysteine     'TGT' => 'C',    # Cysteine     'TGA' => '_',    # Stop     'TGG' => 'W',    # Tryptophan     'CTA' => 'L',    # Leucine     'CTC' => 'L',    # Leucine     'CTG' => 'L',    # Leucine     'CTT' => 'L',    # Leucine     'CCA' => 'P',    # Proline     'CCC' => 'P',    # Proline     'CCG' => 'P',    # Proline     'CCT' => 'P',    # Proline     'CAC' => 'H',    # Histidine     'CAT' => 'H',    # Histidine     'CAA' => 'Q',    # Glutamine     'CAG' => 'Q',    # Glutamine     'CGA' => 'R',    # Arginine     'CGC' => 'R',    # Arginine     'CGG' => 'R',    # Arginine     'CGT' => 'R',    # Arginine     'ATA' => 'I',    # Isoleucine     'ATC' => 'I',    # Isoleucine     'ATT' => 'I',    # Isoleucine     'ATG' => 'M',    # Methionine     'ACA' => 'T',    # Threonine     'ACC' => 'T',    # Threonine     'ACG' => 'T',    # Threonine     'ACT' => 'T',    # Threonine     'AAC' => 'N',    # Asparagine     'AAT' => 'N',    # Asparagine     'AAA' => 'K',    # Lysine     'AAG' => 'K',    # Lysine     'AGC' => 'S',    # Serine     'AGT' => 'S',    # Serine     'AGA' => 'R',    # Arginine     'AGG' => 'R',    # Arginine     'GTA' => 'V',    # Valine     'GTC' => 'V',    # Valine     'GTG' => 'V',    # Valine     'GTT' => 'V',    # Valine     'GCA' => 'A',    # Alanine     'GCC' => 'A',    # Alanine     'GCG' => 'A',    # Alanine     'GCT' => 'A',    # Alanine     'GAC' => 'D',    # Aspartic Acid     'GAT' => 'D',    # Aspartic Acid     'GAA' => 'E',    # Glutamic Acid     'GAG' => 'E',    # Glutamic Acid     'GGA' => 'G',    # Glycine     'GGC' => 'G',    # Glycine     'GGG' => 'G',    # Glycine     'GGT' => 'G',    # Glycine ); # # codon2aa # # A subroutine to translate a DNA 3-character codon to an amino acid #   Version 3, using hash lookup sub codon2aa {         my($codon) = @_;         $codon = uc $codon;           if(exists $genetic_code{$codon}) {                 return $genetic_code{$codon};         }else{                 die "Bad codon '$codon'!!\n";         } } 1;

Now, let's examine the code. First, the module declares its package with a name ( Geneticcode ) that is the same as the file it is in ( Geneticcode.pm ), but minus the file extension .pm .

The directives:

 use strict; use warnings;

will appear in all the code. The use strict directive enforces the use of the my directive for all variables . The use warnings directive produces useful messages about potential problems in your code. (It is possible to turn both directives off when required ”to avoid annoying warnings in your program output, for instance. See the perldiag , perllexwarn , and perlmodlib sections of the Perl manual.)

Finally, there is a subroutine definition for codon2aa . As an argument, this subroutine takes a codon represented as a string of three DNA bases and returns the amino acid code corresponding to the codon. It accomplishes this by a simple lookup in the hash %genetic_code and returns the result from the subroutine using the return built-in function.

The codon2aa subroutine calls die and exits the program when it encounters an undefined codon. See the exercises at the end of this chapter for a discussion of the pros and cons of this behavior.

In my earlier book, I defined the hash %genetic_code within the subroutine codon2aa . That meant that every time the subroutine was called, the hash would have to be initialized, which took a bit of time. In this version, the hash only has to be initialized once, when the program is first called, which results in a significant speedup . The definition of the hash is outside of the subroutine definition, but in the namespace of the Geneticcode package. The hash is initialized when the Geneticcode.pm module is loaded by this statement:

 use Geneticcode;

Every subsequent call to the codon2aa subroutine simply accesses the hash without having to initialize it each time.

Here's an example that uses the new Geneticcode module, which is saved in a file called testGeneticcode and run by typing perl testGeneticcode :

 use strict; use warnings; use lib "/home/tisdall/MasteringPerlBio/development/lib"; use Geneticcode; my $dna = 'AACCTTCCTTCCGGAAGAGAG'; # Initialize variables my $protein = ''; # Translate each three-base codon to an amino acid, and append to a protein  for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {         $protein .= Geneticcode::codon2aa( substr($dna,$i,3) ); } print $protein, "\n";

Recall that the Perl built-in function substr can extract a portion of a string. In this case, substr extracts from $dna the three characters beginning at the position given in the counter variable $i ; this three-character codon is then passed as the argument to the subroutine codon2aa . This program produces the output:

 NLPSGRE

1.7.2 Expanding Geneticcode.pm

Now, let's expand our Geneticcode module example. This new version of the module includes a few short helper subroutines. The interest here lies in how the subroutines interact with each other in the module's namespace, and how to access the code within the module from a Perl program that uses the module.

Modules are a great way to organize code into logical collections of interacting parts . When you create modules, you need to decide how to organize your code into the appropriate collection of modules. Here, we have some subroutines that translate codons into amino acids; others read sequence data from files and print it to the screen. This is a fairly obvious division of functionality, so let's create two modules for this code. We'll expand the Geneticcode module; let's also create a SequenceIO module. Of course, the new module will be created in a file called SequenceIO.pm , and that file will be placed in a directory that Perl can find ”in this case, the same directory in which we've placed the Geneticcode module.

Here's the code for Geneticcode.pm :

 package Geneticcode; use strict; use warnings; my(%genetic_code) = (          'TCA' => 'S',    # Serine     'TCC' => 'S',    # Serine     'TCG' => 'S',    # Serine     'TCT' => 'S',    # Serine     'TTC' => 'F',    # Phenylalanine      ... as before ...     'GAG' => 'E',    # Glutamic Acid     'GGA' => 'G',    # Glycine     'GGC' => 'G',    # Glycine     'GGG' => 'G',    # Glycine     'GGT' => 'G',    # Glycine ); # # codon2aa # # A subroutine to translate a DNA 3-character codon to an amino acid #   Version 3, using hash lookup sub codon2aa {     my($codon) = @_;     $codon = uc $codon;       if(exists $genetic_code{$codon}) {         return $genetic_code{$codon};     }else{             die "Bad codon '$codon'!!\n";     } } # # dna2peptide  # # A subroutine to translate DNA sequence into a peptide sub dna2peptide {     my($dna) = @_;     # Initialize variables     my $protein = '';     # Translate each three-base codon to an amino acid, and append to a protein      for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {         $protein .= codon2aa( substr($dna,$i,3) );     }     return $protein; } # translate_frame # # A subroutine to translate a frame of DNA sub translate_frame {     my($seq, $start, $end) = @_;     my $protein;     # To make the subroutine easier to use, you won't need to specify     #  the end point-it will just go to the end of the sequence     #  by default.     unless($end) {         $end = length($seq);     }     # Finally, calculate and return the translation         return dna2peptide ( substr ( $seq, $start - 1, $end -$start + 1) ); } 1;

Now, we have in one module the code that accomplishes a translation from the genetic code. However, we also need to read sequence in from FASTA sequence files, and print out sequence (the translated protein) to the screen. Because these needs are likely to recur in many programs, it makes sense to make a separate module for just the file reading, sequence extraction, and sequence printing operations. (This may even be too much in one module; maybe there should be separate modules for each need? See the exercises at the end of the chapter.)

Here's the code for the second module SequenceIO.pm , which handles reading from a file, extracting FASTA sequence data, and printing sequence data:

 package SequenceIO; use strict; use warnings; # get_file_data # # A subroutine to get data from a file given its filename sub get_file_data {     my($filename) = @_;     # Initialize variables     my @filedata = (  );     open(GET_FILE_DATA, $filename) or die "Cannot open file '$filename':$!\n\n";     @filedata = <GET_FILE_DATA>;     close GET_FILE_DATA;     return @filedata; } # extract_sequence_from_fasta_data # # A subroutine to extract FASTA sequence data from an array sub extract_sequence_from_fasta_data {     my(@fasta_file_data) = @_;     # Declare and initialize variables     my $sequence = '';     foreach my $line (@fasta_file_data) {         # discard blank line         if ($line =~ /^\s*$/) {             next;         # discard comment line         } elsif($line =~ /^\s*#/) {             next;         # discard fasta header line         } elsif($line =~ /^>/) {             next;         # keep line, add to sequence string         } else {             $sequence .= $line;         }     }     # remove non-sequence data (in this case, whitespace) from $sequence string     $sequence =~ s/\s//g;     return $sequence; } # print_sequence # # A subroutine to format and print sequence data  sub print_sequence {     my($sequence, $length) = @_;     # Print sequence in lines of $length     for ( my $pos = 0 ; $pos < length($sequence) ; $pos += $length ) {         print substr($sequence, $pos, $length), "\n";     } } 1;

Before we discuss the code, let's see a small program that uses it:

 # Translate a DNA sequence into one of the six reading frames use strict; use warnings; use lib "/home/tisdall/MasteringPerlBio/development/lib"; use Geneticcode; use SequenceIO; # Initialize variables my @file_data = (  ); my $dna = ''; my $revcom = ''; my $protein = ''; # Read in the contents of the file "sample.dna" @file_data = SequenceIO::get_file_data("sample.dna"); # Extract the sequence data from the contents of the file "sample.dna" $dna = SequenceIO::extract_sequence_from_fasta_data(@file_data); # Translate the DNA to protein in one of the six reading frames #   and print the protein in lines 70 characters long print "\n -------Reading Frame 1--------\n\n"; $protein = Geneticcode::translate_frame($dna, 1); SequenceIO::print_sequence($protein, 70); exit;

Here's the input file:

 > sample dna  (This is a typical fasta header.) agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga gagaaagaccccaagctagagattcgctatcggcacaagaagtcacggga gcgggatggcaatgagcgggacagcagtgagccccgggatgagggtggag ggcgcaagaggcctgtccctgatccagacctgcagcgccgggcagggtca gggacaggggttggggccatgcttgctcggggctctgcttcgccccacaa atcctctccgcagcccttggtggccacacccagccagcatcaccagcagc agcagcagcagatcaaacggtcagcccgcatgtgtggtgagtgtgaggca tgtcggcgcactgaggactgtggtcactgtgatttctgtcgggacatgaa gaagttcgggggccccaacaagatccggcagaagtgccggctgcgccagt gccagctgcgggcccgggaatcgtacaagtacttcccttcctcgctctca ccagtgacgccctcagagtccctgccaaggccccgccggccactgcccac ccaacagcagccacagccatcacagaagttagggcgcatccgtgaagatg agggggcagtggcgtcatcaacagtcaaggagcctcctgaggctacagcc acacctgagccactctcagatgaggaccta

Here's the output of the program:

 -------Reading Frame 1--------  RWRR_GVLGALGRPPTGLQRRRRMGPAQ_EYAAWEA_LEAEVVVGAFATAWDAAEWSVQVRGSLAGVVRE  CAGSGDMEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCDNCNEWFHGDCIRITEKMAKA  IREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEPRDEGGGRKRPVPDPDLQRRAGSGTGVGAMLAR  GSASPHKSSPQPLVATPSQHHQQQQQQIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCR  LRQCQLRARESYKYFPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATA  TPEPLSDEDL

A few comments are in order. First, the subroutines for translating codons are in the Geneticcode module. They include the hash %genetic_code and the subroutines codon2aa , dna2peptide , and translate_frame , which are involved with translating DNA data to peptides. The subroutines for reading sequence data in from files, and for formatting and printing it to the screen, are in the SequenceIO module. They are the subroutines get_file_data , extract_sequence_from_fasta_data , and print_sequence .

Now, we have two modules and code that exercises them; let's look at some more facets of using modules.