Sequence Analysis in a Nutshell: A Guide to Common Tools and Databases - page 65

hmmfetch Retrieve an HMM from an HMM database.

hmmfetch [options] database name

Options

-h

Print brief help.

hmmindex Create a binary SSI index for an HMM database.

hmmindex [options] database

Options

-h

Print brief help.

hmmpfam Search one or more sequences against an HMM database.

hmmpfam [options] hmmfile seqfile

Options

-h

Print brief help.

-n

Specify that models and sequence are nucleic acid, not protein.

-A n

Limits the alignment output to the n best scoring domains.

-E x

Set the E-value cutoff for the per-sequence ranked hit list to x, where x is a positive real number.

-T x

Set the bit score cutoff for the per-sequence ranked hit list to x, where x is a real number.

-Z n

Calculate the E-value scores as if we had seen a sequence database of n sequences.

hmmsearch Search a sequence database with a profile HMM.

hmmsearch [options] hmmfile seqfile

Options

-h

Print brief help; includes version number and summary of all options, including expert options.

-A n

Limits the alignment output to the n best scoring domains.

-E x

Set the E-value cutoff for the per-sequence ranked hit list to x, where x is a positive real number.

-T x

Set the bit score cutoff for the per-sequence ranked hit list to x, where x is a real number.

-Z n

Calculate the E-value scores as if we had seen a sequence database of n sequences.

10.1 References

  • Durbin, Richard, Sean Eddy, Anders Krogh, and Graeme Mitchison. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge: Cambridge University Press.

  • Eddy, S.R. 1998. Profile hidden Markov models. Bioinformatics 14:755-763.

    Main page

    http://hmmer.wustl.edu/

    README

    ftp://ftp.genetics.wustl.edu/pub/eddy/hmmer/CURRENT/00README

    Download

    ftp://ftp.genetics.wustl.edu/pub/eddy/hmmer/2.2g/hmmer-2.2g.tar.gz

Chapter 11. MEME/MAST

The MEME/MAST system allows you to:

  1. Discover motifs (highly conserved regions) in groups of related DNA or protein sequences using MEME.

  2. Search sequence databases using motifs using MAST.

MEME and MAST were developed by Timothy Bailey, Charles Elkan, and Bill Grundy at the UCSD Computer Science and Engineering department with input from Michael Gribskov at the San Diego Supercomputer Center.

We're using Version 3.0.4 of MEME/MAST.

11.1 MEME

MEME (Multiple EM for Motif Elicitation) is a tool for discovering motifs in a group of related DNA or protein sequences.

A motif is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences. MEME represents motifs as position-dependent letter-probability matrices which describe the probability of each possible letter at each position in the pattern. Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs.

MEME takes as input a group of DNA or protein sequences (the training set) and outputs as many motifs as requested. MEME uses statistical modeling techniques to automatically choose the best width, number of occurrences, and description for each motif. For details, see Section 11.3 at the end of this chapter.

11.1.1 Examples

The following examples use data files provided in this release of MEME. MEME writes its output to standard output, so you will want to redirect it to a file in order for use with MAST.

A simple DNA example:

meme crp0.s -dna -mod oops -pal > ex1.html

MEME looks for a single motif in the file crp0.s which contains DNA sequences in FASTA format. The OOPS model is used so MEME assumes that every sequence contains exactly one occurrence of the motif. The palindrome switch is given so the motif model (PSPM) is converted into a palindrome by combining corresponding frequency columns. MEME automatically chooses the best width for the motif in this example since no width was specified.

Searching for motifs on both DNA strands:

meme crp0.s -dna -mod oops -revcomp > ex2.html

This is like the previous example except that the -revcomp switch tells MEME to consider both DNA strands, and the -pal switch is absent so the palindrome conversion is omitted. When DNA uses both DNA strands, motif occurrences on the two strands may not overlap. That is, any position in the sequence given in the training set may be contained in an occurrence of a motif on the positive strand or the negative strand, but not both.

A fast DNA example:

meme crp0.s -dna -mod oops -revcomp -w 20 > ex3.html

This example differs from the first example in that MEME is told to only consider motifs of width 20. This causes MEME to execute about 10 times faster. The -w switch can also be used with protein datasets if the width of the motifs are known in advance.

Using a higher-order background model:

meme INO_up800.s -dna -mod tcm -revcomp -bfile yeast.nc.6.freq > ex4.html

In this example we use -mod tcm and -bfile yeast.nc.6.freq. This specifies that:

  • The motif may have any number of occurrences in each sequence.

  • The Markov model specified in yeast.nc.6.freq is used as the background model. This file contains a fifth-order Markov model for the non-coding regions in the yeast genome.

Using a higher-order background model can often result in more sensitive detection of motifs. This is because the background model more accurately models non-motif sequence, allowing MEME to discriminate against it and find the true motifs.

A simple protein example:

meme lipocalin.s -mod oops -maxw 20 -nmotifs 2 > ex5.html

The -dna switch is absent, so MEME assumes the file lipocalin.s contains protein sequences. MEME searches for two motifs each of width less than or equal to 20. (Specifying -maxw 20 makes MEME run faster, since it does not have to consider motifs longer than 20.) Each motif is assumed to occur in each of the sequences because the OOPS model is specified.

Another simple protein example:

meme farntrans5.s -mod tcm -maxw 40 -maxsites 50 > ex6.html

MEME searches for a motif of width up to 40, with up to 50 occurrences in the entire training set. The TCM sequence model is specified, which allows each motif to have any number of occurrences in each sequence. This dataset contains motifs with multiple repeats of motifs in each sequence. This example is fairly time consuming due to the fact that the time required to initialize the motif probability tables is proportional to maxw multiplied by maxsites. By default, MEME only looks for motifs up to 29 letters wide with a maximum total of number of occurrences equal to twice the number of sequences or 30, whichever is less.

A much faster protein example:

meme farntrans5.s -mod tcm -w 10 -maxsites 30 -nmotifs 3 > ex7.html

This time MEME is constrained to search for three motifs of width exactly ten. The effect is to break up the long motif found in the previous example. The -w switch forces motifs to be exactly ten letters wide. This example is much faster because, since only one width is considered, the time to build the motif probability tables is only proportional to maxsites.

Splitting the sites into three:

meme farntrans5.s -mod tcm -maxw 12 -nsites 24 -nmotifs 3 > ex8.html

This forces each motif to have exactly 24 occurrences, and be up to 12 letters wide.

A larger protein example with E-value cutoff:

meme adh.s -mod zoops -nmotifs 20 -evt 0.01 > ex9.html

In this example, MEME looks for up to 20 motifs, but stops when a motif is found with E-value greater than 0.01. Motifs with large E-values are likely to be statistical artifacts rather than biologically significant.

11.1.2 Command-Line Options

Usage for MEME is:

meme dataset optionalarguments

where dataset is a file containing sequences in FASTA format.

Table 11-1 summarizes the command-line options for MEME.

Table 11-1. MEME options

Option

Definition

[-h]

Print this message.

[-dna]

Sequences use DNA alphabet.

[-protein]

Sequences use protein alphabet.

[-mod oops|zoops|tcm]

Distribution of motifs.

[-nmotifs nmotifs]

Maximum number of motifs to find.

[-evt ev]

Stop if motif E-value greater than evt.

[-nsites sites]

Number of sites for each motif.

[-minsites minsites]

Minimum number of sites for each motif.

[-maxsites maxsites]

Maximum number of sites for each motif.

[-wnsites wnsites]

Weight on expected number of sites.

[-w w]

Motif width.

[-minw minw]

Minumum motif width.

[-maxw maxw]

Maximum motif width.

[-nomatrim]

Do not adjust motif width using multiple alignment.

[-wg wg]

Gap opening cost for multiple alignments.

[-ws ws]

Gap extension cost for multiple alignments.

[-noendgaps]

Do not count end gaps in multiple alignments.

[-bfile bfile]

Name of background Markov model file.

[-revcomp]

Allow sites on + or - DNA strands.

[-pal]

Force palindromes (requires -dna).

[-maxiter maxiter]

Maximum EM iterations to run.

[-distance distance]

EM convergence criterion.

[-prior dirichlet|dmix|mega|megap|addone]

Type of prior to use.

[-b b]

Strength of the prior.

[-plib plib]

Name of Dirichlet prior file.

[-spfuzz spfuzz]

Fuzziness of sequence to theta mapping.

[-spmap uni|pam]

Starting point seq to theta mapping type.

[-cons cons]

Consensus sequence to start EM from.

[-text]

Output in text format (default is HTML).

[-print_fasta]

Print sites in FASTA format (default BLOCKS).

[-maxsize maxsize]

Maximum dataset size in characters.

[-nostatus]

Do not print progress reports to terminal.

[-p np]

Use parallel version with np processors.

[-time t]

Quit before t CPU seconds consumed.

[-sf sf]

Print sf as name of sequence file.