emma | Sequence Analysis in a Nutshell: A Guide to Common Tools and Databases

emma

emma calculates the multiple alignment of nucleic acid or protein sequences according to the method of J.D. Thompson, D.G. Higgins, and T.J. Gibson. It is an interface to the ClustalW distribution.

Here is an example session with emma:

% emma
Input sequence: globins.fasta
Output sequence [hbahum.aln]:
Output file [hbahum.dnd]:
..clustalw17 -infile=5345A -outfile=5345B -align -type=protein ...

 CLUSTAL W (1.74) Multiple Sequence Alignments

Sequence type explicitly set to Protein
Sequence format is Pearson
Sequence 1: hbahum          141 aa
Sequence 2: hbbhum          146 aa
Sequence 3: hbghum          146 aa
Sequence 4: hbhagf          148 aa
Sequence 5: hbrlam          149 aa
Sequence 6: mycrhi          151 aa
Sequence 7: myohum          153 aa
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score:  41
Sequences (1:3) Aligned. Score:  39
Sequences (1:4) Aligned. Score:  21
Sequences (1:5) Aligned. Score:  27
Sequences (1:6) Aligned. Score:  13
Sequences (1:7) Aligned. Score:  26
Sequences (2:3) Aligned. Score:  73
Sequences (2:4) Aligned. Score:  19
Sequences (2:5) Aligned. Score:  19
Sequences (2:6) Aligned. Score:  15
Sequences (2:7) Aligned. Score:  24
Sequences (3:4) Aligned. Score:  21
Sequences (3:5) Aligned. Score:  21
Sequences (3:6) Aligned. Score:  15
Sequences (3:7) Aligned. Score:  23
Sequences (4:5) Aligned. Score:  41
Sequences (4:6) Aligned. Score:  12
Sequences (4:7) Aligned. Score:  16
Sequences (5:6) Aligned. Score:  17
Sequences (5:7) Aligned. Score:  18
Sequences (6:7) Aligned. Score:  11
Guide tree        file created:   [5345C]
Start of Multiple Alignment
There are 6 groups
Aligning...
Group 1: Sequences:   2      Score:883
Group 2: Sequences:   2      Score:2344
Group 3: Sequences:   3      Score:934
Group 4:                     Delayed
Group 5: Sequences:   5      Score:950
Group 6:                     Delayed
Sequence:7     Score:1046
Sequence:6     Score:986
Alignment Score 1746
GCG-Alignment file created      [5345B]

Mandatory qualifiers:

[-inseqs] (seqall): Sequence database USA.
[-outseq] (seqoutset): The sequence alignment output filename.
[-dendoutfile] (outfile): The dendrogram output filename.

Optional qualifiers (bold if not always prompted):

-onlydend (boolean)

Produce only a dendrogram file.

-dend (boolean)

Select if you want to perform alignment using an old dendrogram.

-dendfile (string)

Name of the old dendrogram file.

-insist (boolean)

Insist that the sequence type be changed to protein.

-slowfast (menu)

A distance is calculated between every pair of sequences, and these distances are used to construct a dendrogram that guides the final multiple alignment. The scores are calculated from separate pairwise alignments, using one of two methods: dynamic programming (slow but accurate) or the method of Wilbur and Lipman (extremely fast but approximate). The slow method is fine for short sequences, but will be extremely slow for many (e.g., more than 100) long (e.g., more than 1000 residues) sequences.
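To get a feel for why the slow method scales badly, the following sketch counts the pairwise alignments and the dynamic programming cells they would need. The function name and the cost model (one cell per residue pair) are assumptions made for illustration; they are not taken from the ClustalW source.

```python
from itertools import combinations

def pairwise_workload(lengths):
    """Return (number of sequence pairs, total dynamic programming cells).

    Assumes full dynamic programming fills one cell per residue pair,
    so a pair of lengths a and b costs a * b cells.
    """
    pairs = list(combinations(lengths, 2))
    cells = sum(a * b for a, b in pairs)
    return len(pairs), cells

# Lengths of the seven globin sequences from the example session.
globins = [141, 146, 146, 148, 149, 151, 153]
n_pairs, n_cells = pairwise_workload(globins)
print(n_pairs)  # 21, the number of pairwise scores listed in the session

# 100 sequences of 1000 residues each.
big_pairs, big_cells = pairwise_workload([1000] * 100)
print(big_pairs, big_cells)  # 4950 4950000000
```

The number of pairs grows with the square of the number of sequences, and each pair grows with the product of the two lengths, which is why the fast approximate method becomes attractive for large inputs.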

-pwgapc (float)

The penalty for opening a gap in the pairwise alignments.

-pwgapv (float)

The penalty for extending a gap by 1 residue in the pairwise alignments.

-pwmatrix (menu)

A scoring table that describes the similarity of each amino acid to every other. Three built-in series of weight matrices are offered. Each consists of several matrices that work differently at different evolutionary distances. For details, read the documentation. Crudely, we store several matrices in memory, spanning the full range of amino acid distance (from almost identical sequences to highly divergent ones). For very similar sequences, it is best to use a strict weight matrix that gives a high score only to identities and the most favoured conservative substitutions. For more divergent sequences, it is appropriate to use "softer" matrices that give a high score to many other frequent substitutions.

BLOSUM (Henikoff). These matrices appear to be the best available for carrying out database similarity (homology) searches. The matrices used are: Blosum 80, 62, 45 and 30.
PAM (Dayhoff). These have been extremely widely used since the late 1970s. We use the PAM 120, 160, 250 and 350 matrices.
GONNET. These matrices were derived using almost the same procedure as the Dayhoff one (above), but are much more up to date and are based on a far larger data set. They appear to be more sensitive than the Dayhoff series. We use the GONNET 40, 80, 120, 160, 250 and 350 matrices.

We also supply an identity matrix, which gives a score of 1.0 to two identical amino acids and a score of zero otherwise. This matrix is not very useful.

-pwdnamatrix (menu)

A scoring table that describes the scores assigned to matches and mismatches (including IUB ambiguity codes).
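As a rough illustration of scoring with IUB ambiguity codes, the sketch below treats two symbols as a match when the sets of bases they stand for overlap. The code table is the standard IUB nucleotide alphabet; the overlap rule and the match and mismatch values are placeholders chosen for this example, not the values of any matrix shipped with emma.

```python
# Standard IUB nucleotide codes and the bases each one represents.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iub_score(x, y, match=1.0, mismatch=0.0):
    """Score two IUB symbols (hypothetical rule: match if base sets overlap)."""
    shared = set(IUB[x.upper()]) & set(IUB[y.upper()])
    return match if shared else mismatch

print(iub_score("A", "R"))  # R stands for A or G, so this counts as a match
print(iub_score("A", "Y"))  # Y stands for C or T, so this is a mismatch
```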

-pairwisedata (string)

Filename of user pairwise matrix.

-ktup (integer)

This is the size of the exact matching fragment. Increase for speed (maximum is 2 for proteins, 4 for DNA); decrease for sensitivity. For longer sequences (e.g., more than 1000 residues), you may need to increase the default.
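The trade-off between speed and sensitivity can be seen in a small sketch that counts the distinct k-tuples two sequences share. The helper names and the two short peptide strings are illustrative assumptions; this is not the Wilbur and Lipman implementation used by ClustalW.

```python
def ktuples(seq, k):
    """Return the set of all distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_ktuples(a, b, k):
    """Count the distinct k-tuples that occur in both sequences."""
    return len(ktuples(a, k) & ktuples(b, k))

# Two short illustrative peptide strings.
a = "MVLSPADKTNVKAAW"
b = "MVHLTPEEKSAVTALW"

for k in (1, 2):
    print(k, shared_ktuples(a, b, k))
# prints "1 9" then "2 1"
```

A larger k leaves fewer exact matches to examine, which is faster but can miss weak similarity between divergent sequences.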

-gapw (integer)

A penalty for each gap in the fast alignments. It has little effect on the speed or sensitivity except in the case of extreme values.

-topdiags (integer)

The number of k-tuple matches on each diagonal (in an imaginary dot matrix plot) is calculated. Only the best ones (those with the most matches) are used in the alignment. Decrease for speed; increase for sensitivity.

-window (integer)

This is the number of diagonals around each of the best diagonals that will be used. Decrease for speed; increase for sensitivity.
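The roles of the top diagonals and the window around them can be sketched as follows. Diagonals are labelled by the offset i - j between matching positions; the function names, the default values, and the test strings are assumptions for this example only.

```python
from collections import Counter

def diagonal_counts(a, b, k):
    """Count k-tuple matches on each diagonal (i - j) of the dot matrix."""
    index = {}
    for j in range(len(b) - k + 1):
        index.setdefault(b[j:j + k], []).append(j)
    counts = Counter()
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], []):
            counts[i - j] += 1
    return counts

def candidate_diagonals(a, b, k=2, topdiags=1, window=1):
    """Pick the best diagonals, then widen each by `window` on both sides."""
    counts = diagonal_counts(a, b, k)
    best = [d for d, _ in counts.most_common(topdiags)]
    keep = set()
    for d in best:
        keep.update(range(d - window, d + window + 1))
    return best, sorted(keep)

# The second string is the first one shifted right by two positions.
best, keep = candidate_diagonals("ACGTACGT", "TTACGTACGT")
print(best)  # [-2]
print(keep)  # [-3, -2, -1]
```

Fewer top diagonals and a narrower window leave less of the dot matrix to search, which is the speed gain the two qualifiers describe.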

-nopercent (boolean)

For fast pairwise alignments, suppresses the percentage similarity score.
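As an illustration of what a percentage score can look like, the sketch below computes percent identity over the gap-free columns of an aligned pair. The function name and the choice to skip gapped columns are assumptions made for this example, not a description of how ClustalW computes its scores.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent of identical residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            continue  # skip columns that contain a gap
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0

print(percent_identity("AC-GT", "ACTGA"))  # 75.0
```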

-matrix (menu)

This gives a menu where you are offered a choice of weight matrices. The default for proteins is the PAM series derived by Gonnet and colleagues. Note that a series is used! The matrix used depends on the similarity of the sequences to be aligned at this alignment step. Different matrices work differently at each evolutionary distance. Three built-in series of weight matrices are offered. Each consists of several matrices that work differently at different evolutionary distances. For details, read the documentation. Crudely, we store several matrices in memory, spanning the full range of amino acid distance (from almost identical sequences to highly divergent ones). For very similar sequences, it is best to use a strict weight matrix that gives a high score only to identities and the most favoured conservative substitutions. For more divergent sequences, it is appropriate to use "softer" matrices that give a high score to many other frequent substitutions.

BLOSUM (Henikoff). These matrices appear to be the best available for carrying out database similarity (homology) searches. The matrices used are: Blosum 80, 62, 45 and 30.
PAM (Dayhoff). These have been widely used since the late 1970s. We use the PAM 120, 160, 250 and 350 matrices.
GONNET. These matrices were derived using almost the same procedure as Dayhoff (above), but are much more up to date and are based on a much larger data set. They appear to be more sensitive than the Dayhoff series. We use the GONNET 40, 80, 120, 160, 250 and 350 matrices.

We also supply an identity matrix, which gives a score of 1.0 to two identical amino acids and a score of zero otherwise. This matrix is not very useful. Alternatively, you can read in your own (just one matrix, not a series).
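The idea of choosing one matrix from a series according to how similar the sequences are can be sketched like this. The matrix names come from the BLOSUM series listed above, but the percent identity cutoffs are hypothetical values picked for illustration; consult the ClustalW documentation for the thresholds it actually applies.

```python
# (minimum percent identity, matrix name); cutoffs are hypothetical.
BLOSUM_SERIES = [
    (80, "BLOSUM80"),
    (60, "BLOSUM62"),
    (30, "BLOSUM45"),
    (0, "BLOSUM30"),
]

def pick_matrix(percent_identity, series=BLOSUM_SERIES):
    """Return the first matrix whose cutoff the identity reaches.

    Stricter matrices come first, so similar sequences get a strict
    matrix and divergent sequences fall through to a softer one.
    """
    for cutoff, name in series:
        if percent_identity >= cutoff:
            return name
    return series[-1][1]

print(pick_matrix(73))  # BLOSUM62 under these assumed cutoffs
print(pick_matrix(11))  # BLOSUM30 under these assumed cutoffs
```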

-dnamatrix (menu)

Provides a menu containing a submenu in which a single matrix (not a series) can be selected.

-mamatrix (string)

Filename of user multiple alignment matrix.

-gapc (float)

Penalty for opening a gap in the alignment. Increasing the gap opening penalty will make gaps less frequent.

-gapv (float)

Penalty for extending a gap by 1 residue. Increasing the gap extension penalty makes gaps shorter. Terminal gaps are not penalized.
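The combined effect of the opening and extension penalties, with terminal gaps left unpenalized, can be sketched for a single aligned row. The cost convention used here (opening penalty plus extension penalty times gap length) and the default values are assumptions for illustration; they are not emma's defaults.

```python
import re

def gap_cost(aligned_row, gap_open=10.0, gap_extend=0.5):
    """Sum gap penalties over internal gap runs; terminal gaps are free.

    Assumed convention: each run costs gap_open + gap_extend * run length.
    """
    core = aligned_row.strip("-")  # drop leading and trailing gaps
    total = 0.0
    for run in re.finditer(r"-+", core):
        total += gap_open + gap_extend * len(run.group())
    return total

# Internal runs of length 3 and 1; the terminal gaps are ignored.
print(gap_cost("--AC---GT-A--"))  # 22.0
```

Raising gap_open makes every extra gap run more expensive, so fewer gaps appear; raising gap_extend makes long runs more expensive, so gaps become shorter.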

-[no]endgaps (boolean)

"End gap separation" treats end gaps as internal gaps for the purposes of avoiding gaps that are too close (set by "gap separation distance"). If you turn this off, end gaps will be ignored. This is useful when you want to align fragments where the end gaps are not biologically meaningful.

-gapdist (integer)

"Gap separation distance" tries to decrease the chances of gaps being too close. Gaps that are less than this distance apart are penalized more than other gaps. This does not prevent close gaps; it only makes them less frequent, resulting in alignments that have a blocklike appearance.
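A minimal sketch of the check behind this qualifier: given the columns where gaps start, report neighbouring gaps that are closer than the separation distance. Measuring the distance between gap start columns is an assumption made here for simplicity, and the function name is hypothetical.

```python
def close_gap_pairs(gap_starts, gapdist):
    """Return neighbouring gap positions that are less than gapdist apart."""
    starts = sorted(gap_starts)
    return [(a, b) for a, b in zip(starts, starts[1:]) if b - a < gapdist]

# Gaps starting at these alignment columns, with a separation distance of 8.
print(close_gap_pairs([5, 9, 30, 34, 60], 8))  # [(5, 9), (30, 34)]
```

Pairs reported by such a check are the ones that would attract the extra penalty.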

-norgap (boolean)

"Residue specific penalties" are amino acid-specific gap penalties that reduce or increase the gap opening penalties at each position in the alignment or sequence. As an example, positions that are rich in glycine are more likely to have an adjacent gap than positions that are rich in valine. Setting this qualifier turns these penalties off.

-hgapres (string)

A set of the residues considered hydrophilic. It is used when introducing Hydrophilic gap penalties.

-nohgap (boolean)

"Hydrophilic gap penalties" are used to increase the chances of a gap within a run (5 or more residues) of hydrophilic amino acids; these are likely to be loop or random coil regions where gaps are more common. The residues that are considered hydrophilic are set by -hgapres. Setting this qualifier turns these penalties off.
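Finding the runs where such penalties would apply can be sketched with a regular expression. The residue set GPSNDQEKR is assumed here as a typical value for -hgapres, and the function name and example sequence are hypothetical.

```python
import re

def hydrophilic_runs(seq, hgapres="GPSNDQEKR", min_run=5):
    """Return (start, end) spans of runs of min_run or more hydrophilic residues.

    `end` is exclusive, as in Python slicing.
    """
    pattern = re.compile("[%s]{%d,}" % (re.escape(hgapres), min_run))
    return [(m.start(), m.end()) for m in pattern.finditer(seq.upper())]

# One run of seven hydrophilic residues (GSPNDKR); the later "GS" is too short.
print(hydrophilic_runs("MVLAGSPNDKRWLLVGSA"))  # [(4, 11)]
```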

-maxdiv (integer)

This switch delays the alignment of the most distantly related sequences until after the most closely related sequences are aligned. The setting shows the percent identity level required to delay the addition of a sequence.
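The pairwise scores from the example session can be used to sketch how such a threshold might pick out sequences to delay. The rule applied here (delay a sequence whose best score against every other sequence is below the threshold) and the threshold of 30 are assumptions for illustration; the function name is hypothetical.

```python
# Pairwise scores copied from the example session, keyed by sequence numbers.
scores = {
    (1, 2): 41, (1, 3): 39, (1, 4): 21, (1, 5): 27, (1, 6): 13, (1, 7): 26,
    (2, 3): 73, (2, 4): 19, (2, 5): 19, (2, 6): 15, (2, 7): 24,
    (3, 4): 21, (3, 5): 21, (3, 6): 15, (3, 7): 23,
    (4, 5): 41, (4, 6): 12, (4, 7): 16,
    (5, 6): 17, (5, 7): 18,
    (6, 7): 11,
}

def delayed_sequences(scores, maxdiv):
    """Return sequences whose best pairwise score is below maxdiv (assumed rule)."""
    best = {}
    for (i, j), s in scores.items():
        best[i] = max(best.get(i, 0), s)
        best[j] = max(best.get(j, 0), s)
    return sorted(seq for seq, top in best.items() if top < maxdiv)

print(delayed_sequences(scores, 30))  # [6, 7]
```

Under these assumptions the sketch singles out sequences 6 and 7, the same two that the session output reports as added after the delayed groups.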

Advanced qualifiers:

-prot (boolean): Do not change this value.