11.3 References

  • Bailey, Timothy L., and Charles Elkan. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology 28-36. Menlo Park: AAAI Press.

  • Bailey, Timothy L., and Michael Gribskov. 1998. Combining evidence using p-values: application to sequence homology searches. Bioinformatics 14:48-54.

    Main page

    http://meme.sdsc.edu/meme/website/intro.html

    Manpages

    http://meme.sdsc.edu/meme/website/meme-download.html

    Download

    ftp://ftp.sdsc.edu/pub/sdsc/biology/meme

Chapter 12. EMBOSS

EMBOSS (European Molecular Biology Open Software Suite) is an open source package of sequence analysis tools. This software covers a wide range of functionality and can handle data in a variety of formats. Extensive libraries are provided with the package, allowing users to develop and release their own software. EMBOSS also integrates a range of currently available packages and tools for sequence analysis, such as BLAST and ClustalW. A Java API (Jemboss) is also available.

EMBOSS contains around 150 programs (applications). These are just some of the areas covered:

  • Sequence alignment.

  • Rapid database searching with sequence patterns.

  • Protein motif identification, including domain analysis.

  • Nucleotide sequence pattern analysis, for example to identify CpG islands or repeats.

  • Codon usage analysis for small genomes.

  • Rapid identification of sequence patterns in large scale sequence sets.

  • Presentation tools for publication.

. . . and much more.

For details, see Section 12.4 at the end of this chapter.

We're using Version 2.5.0 of EMBOSS.

12.1 Common Themes

Many EMBOSS programs have functionality in common. They all understand the same sorts of sequence addresses, sequence formats, output formats, and feature formats. The following sections describe some common themes in EMBOSS.

12.1.1 Uniform Sequence Address

The Uniform Sequence Address (USA) is a standard sequence-naming scheme used by all EMBOSS applications.

The USA syntax is one of:

  • "format::file"

  • "format::file:entry"

  • "dbname:entry"

  • "@listfile" (a file of filenames)

The "::" and ":" syntax is to allow, for example, "embl" and "pir" to be both database names and sequence formats. In addition, EMBOSS allows the command line to separately define the format and the entry name so that only the filename is required.

The "file" and "dbname" forms of USA may have "format::" in front of them, but because a database is aware of the format, this structure is redundant and not recommended.

Any USA may optionally take a subsequence specifier after the main body of the USA, either in the form "[start : end]" or "[start : end : r]", where start and end are the required start and end positions. Negative positions count from the end of the sequence. Using this subsequence specifier is equivalent to using the -sbegin, -send, and -sreverse command-line qualifiers.
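The subsequence specifier can be pulled off the end of a USA with a short regular expression. The following Python sketch is not part of EMBOSS; the function names split_usa_range and resolve_position are made up for illustration, and the rule that position -1 means the last residue is an assumption about how "count from the end" is applied.

```python
import re

# A trailing "[start:end]" or "[start:end:r]", with optional spaces around the colons.
_SUBSEQ = re.compile(r"\[\s*(-?\d+)\s*:\s*(-?\d+)\s*(?::\s*([rR])\s*)?\]$")


def split_usa_range(usa):
    """Split a USA into (body, start, end, reverse).

    start and end are None when the USA carries no subsequence specifier.
    """
    match = _SUBSEQ.search(usa)
    if match is None:
        return usa, None, None, False
    body = usa[:match.start()]
    start, end = int(match.group(1)), int(match.group(2))
    return body, start, end, match.group(3) is not None


def resolve_position(pos, length):
    """Map a 1-based position to an absolute one.

    Assumption: -1 is the last residue, -2 the one before it, and so on.
    """
    return pos if pos > 0 else length + pos + 1
```

For example, split_usa_range("embl:X13776[5:20:r]") separates the database entry from a reversed range covering positions 5 to 20.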

Table 12-1 contains some USA examples.

Table 12-1. EMBOSS Uniform Sequence Address (USA) examples

Type

Example

Comments

filename

xxx.seq

A sequence file xxx.seq in any format.

format::filename

fasta::xxx.seq

A sequence file xxx.seq in FASTA format.

db:IDname

embl:paamir

EMBL entry PAAMIR, using whatever access method is defined locally for the EMBL database.

db:AccessionNumber

embl:X13776

EMBL entry X13776, using whatever access method is defined locally for the EMBL database. Search by accession number and entry name. X13776 is the accession number in this case.

db-acc:AccessionNumber

embl-acc:X13776

EMBL entry X13776, using whatever access method is defined locally for the EMBL database. Search by accession number only.

db-id:IDname

embl-id:paamir

EMBL entry PAAMIR, using whatever access method is defined locally for the EMBL database. Search by ID only.

db-searchfield:word

embl-des:lectin

EMBL entries containing the word "lectin" in the Description line.

db-searchfield:wcardword

embl-org:*human*

EMBL entries containing the wildcarded word "human" in the Organism fields.

db:wildcard-ID

embl:paami*

EMBL entries PAAMIB, PAAMIE and so on, usually in alphabetical order, using whatever access method is defined locally for the EMBL database.

db or db:*

embl or EMBL:*

All sequences in the EMBL database.

@listfile

@mylist

Reads file mylist and uses each line as a separate USA. List files can contain references to other list files or any other standard USA.

list::listfile

list::mylist

Same as @mylist.

programparameters |

getz -e [embl-id:paamir] |

The pipe character "|" causes EMBOSS to fire up getz (the SRS sequence retrieval program) to extract entry PAAMIR from EMBL in EMBL format. Any application or script which writes one or more sequences to stdout can be used in this way.

asis::sequence

asis::atacgcagttatctgaccat

So far, the shortest USA we could invent. In "asis" format the name is the sequence, so no file needs to be opened. This is a special case. It was intended as a joke, but could be quite useful for generating command lines.
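To make the forms in Table 12-1 concrete, here is a rough Python sketch that sorts a USA string into one of the kinds shown above. It is an illustration only, not the EMBOSS parser: the function name classify_usa and its labels are invented, it assumes database names contain no hyphen (so that "embl-acc" splits into a database and a search field), and it cannot tell a bare filename from a bare database name because that depends on the local EMBOSS configuration.

```python
def classify_usa(usa):
    """Return (kind, parts) for one USA string; the kinds are this sketch's own labels."""
    usa = usa.strip()
    if usa.startswith("@"):
        return "listfile", {"file": usa[1:]}
    if usa.startswith("list:"):
        # Accept one or two colons after "list".
        return "listfile", {"file": usa[len("list:"):].lstrip(":")}
    if usa.endswith("|"):
        # A program whose stdout supplies the sequences.
        return "program", {"command": usa[:-1].strip()}
    if usa.startswith("asis::"):
        # The "name" is the sequence itself.
        return "asis", {"sequence": usa[len("asis::"):]}
    if "::" in usa:
        fmt, rest = usa.split("::", 1)
        filename, _, entry = rest.partition(":")
        return "file", {"format": fmt, "file": filename, "entry": entry or None}
    if ":" in usa:
        name, entry = usa.split(":", 1)
        # Assumption: a hyphen separates the database from a search field.
        db, _, field = name.partition("-")
        return "database", {"db": db, "field": field or None, "entry": entry}
    # A bare word may be a file or a database; only the local setup can say.
    return "file_or_db", {"name": usa}
```

Calling classify_usa("embl-acc:X13776") yields a "database" result with the search field "acc", while classify_usa("xxx.seq") is left as "file_or_db".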

12.1.2 Sequence Formats

You can specify the format to use on input by giving the format name with two colons before the file holding your sequences. For example:

embl::myfile.seq

The format is not required. When reading in a sequence, EMBOSS will guess the sequence format by trying all known formats until one succeeds.
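The "try every format until one succeeds" idea can be sketched in a few lines of Python. The detectors below are deliberately crude stand-ins (they only look at how the text begins), and the names guess_format and DETECTORS are hypothetical. The real EMBOSS readers parse each format properly, and the order in which EMBOSS tries them is not shown here.

```python
def looks_like_fasta(text):
    # FASTA records begin with a ">" header line.
    return text.lstrip().startswith(">")


def looks_like_embl(text):
    # EMBL entries begin with an "ID" line.
    return text.startswith("ID   ")


def looks_like_genbank(text):
    # GenBank entries begin with a "LOCUS" line.
    return text.startswith("LOCUS ")


# Ordered list of (format name, detector); the order is an arbitrary choice here.
DETECTORS = [
    ("fasta", looks_like_fasta),
    ("embl", looks_like_embl),
    ("genbank", looks_like_genbank),
]


def guess_format(text, detectors=DETECTORS):
    """Return the name of the first format whose detector accepts the text."""
    for name, accepts in detectors:
        if accepts(text):
            return name
    return None  # nothing matched
```

Giving the format explicitly, as in embl::myfile.seq, skips this kind of guessing altogether.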

When writing out a sequence, EMBOSS will use FASTA format by default. You can specify another format to use, for example:

gcg::myresults.seq

12.1.2.1 Input sequence formats

To date, the sequence formats in Table 12-2 are accepted as input. By default (i.e., no format is explicitly specified), EMBOSS tries each format in turn until one succeeds.

Table 12-2. EMBOSS input sequence formats

Input format

Comments

abi

ABI trace file format. This is the format of file produced by ABI sequencing machines. It contains the trace data, i.e., the probabilities of the 4 bases along the sequencing run, together with the sequence, as deduced from that data. The sequence information is what is normally read in and used by EMBOSS programs, although the trace data is available and may be utilized by some specialized EMBOSS programs. The code for this is heavily based on David Mathog's Fortran library with a description of ABI trace file format (abi.txt): ftp://saf.bio.caltech.edu/pub/software/molbio/abitools.zip.

acedb

ACeDB format.

clustal

aln

ClustalW ALN (multiple alignment) format.

codata

CODATA format.

dbid

Odd FASTA format with database name first, followed by ID name and an optional accession number, e.g.:

>database name description

or

>database name accession description

embl

em

EMBL entry format, or at least a minimal subset of the fields. The Staden package and others use EMBL or similar formats for sequence data.

fasta

pearson

FASTA format with an optional accession number after the sequence identifier, e.g.:

>name description

or

>name accession description

and with an optional database name in GCG style FASTA format included as part of the sequence identifier, e.g.:

>database:name accession description

gcg

gcg8

GCG 9.x and 10.x format with the format and sequence type identified on the first line of the file. GCG 8.x format where anything up to the first line containing ".." is considered as heading, and the remainder is sequence data.

genbank

gb

ddbj

GENBANK entry format, or at least a minimal subset of the fields.

gff

GFF format.

hennig86

Hennig86 format.

ig

IntelliGenetics format.

jackknifer

Jackknifer format.

jackknifernon

Jackknifernon format.

nbrf

pir

NBRF (PIR) format, as used in the PIR database sequence files.

nexus

paup

Nexus/PAUP format.

nexusnon

paupnon

Nexusnon/PAUPnon format.

treecon

Treecon format.

mega

Mega format.

meganon

Meganon format.

msf

Wisconsin Package GCG's MSF multiple sequence format.

ncbi

FASTA format with optional accession number and database name in NCBI style included as part of the sequence identifier, e.g.:

>database|accession|id description

(and other variants on this theme!)

pfam

stockholm

Pfam format.

phylip

PHYLIP interleaved multiple alignment format.

selex

SELEX format is used by Sean Eddy's HMMER package. It can store RNA secondary structure as part of the sequence annotation.

staden

experiment

The experiment file format used by the gap program in the Staden package, where the sequence identifier is optional and the remainder is plain text. Some alternative nucleotide ambiguity codes are used and must be converted.

strider

DNA Strider format.

swissprot

swiss

sw

SWISS-PROT entry format, or at least a minimal subset of the fields.

text

plain

Plain text. This is the format with no format. The whole of the file is read in as a sequence. No attempt is made to parse the file contents in any way. Anything is acceptable in this format. This means that any character will be included in the sequence, even digits and punctuation. Use this format only when you are sure that the input sequence file is correct and contains only what you want to be considered as your sequence.

raw

Similar to text or plain format. However, raw removes any whitespace or digits, accepts only alphabetic characters, and rejects anything else. This format is safer than plain format. Digits, spaces, and TAB characters are removed and ignored. If a sequence contains other non-alphabetic characters (e.g., punctuation characters), it is rejected as erroneous.

asis

Not a sequence format, but a quick way of entering a sequence on the command line. It is included here for completeness. In "asis" format, the actual sequence appears where a filename would normally be given.

asis::atacgcagttatctgacc

In "asis" format the name is the sequence, so no file needs to be opened. This is a special case. It was intended as a joke, but could be quite useful for generating command lines.
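The difference between plain and raw input in Table 12-2 comes down to a filtering rule, which the following Python sketch mirrors. It follows the description of raw given above (drop whitespace and digits, reject anything else that is not a letter). The function name read_raw is hypothetical, and treating an empty result as an error is an assumption of this sketch.

```python
def read_raw(text):
    """Clean text the way the raw format is described: keep letters only.

    Whitespace and digits are dropped silently; any other non-alphabetic
    character makes the whole input invalid.
    """
    cleaned = "".join(ch for ch in text if not (ch.isspace() or ch.isdigit()))
    bad = sorted({ch for ch in cleaned if not (ch.isascii() and ch.isalpha())})
    if bad:
        raise ValueError("unexpected characters in sequence: " + "".join(bad))
    if not cleaned:
        # Assumption: an input with no letters at all is treated as an error.
        raise ValueError("no sequence characters found")
    return cleaned
```

A numbered, space-separated listing such as "1 acgt acgt" therefore reads as "acgtacgt", whereas "acgt-acgt" is rejected because of the hyphen.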

12.1.2.2 Output sequence formats

To date, the sequence formats in Table 12-3 are available as output. Some sequence formats can hold multiple sequences in one file; these are marked as multiple in the table. Formats such as GCG, plain, and staden can hold only one sequence per file and are marked as single.

Table 12-3. EMBOSS output sequence formats

Output format

Single/multiple

Comments

gcg

gcg8

single

Wisconsin Package GCG 9.x and 10.x format with the sequence type on the first line of the file. GCG 8.x format where anything up to the first line containing ".." is considered as heading, and the remainder is sequence data.

embl

em

multiple

EMBL entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence.

swiss

sw

multiple

SWISS-PROT entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence.

fasta

pearson

multiple

Standard Pearson FASTA format, but with the accession number included after the identifier if available.

ncbi

multiple

NCBI style FASTA format with the database name, entry name and accession number separated by pipe ("|") characters.

nbrf

pir

multiple

NBRF (PIR) format, as used in the PIR database sequence files.

genbank

gb

multiple

GENBANK entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence.

gff

multiple

GFF format.

ig

multiple

IntelliGenetics format, as used by the IntelliGenetics package.

codata

multiple

CODATA format.

strider

multiple

DNA Strider format.

acedb

multiple

ACeDB format.

staden

experiment

single

The experiment file format used by the gap program in the Staden package. Some alternative nucleotide ambiguity codes are used and are converted.

text

plain

raw

single

Plain sequence, no annotation or heading.

fitch

multiple

Fitch format.

msf

multiple

Wisconsin Package GCG's MSF multiple sequence format.

clustal

aln

multiple

Clustal multiple sequence format.

selex

multiple

SELEX format.

phylip

multiple

PHYLIP interleaved format.

phylip3

multiple

PHYLIP non-interleaved format that was used in Phylip version 3.2.

asn1

multiple

A subset of ASN.1 containing entry name, accession number, description and sequence, similar to the current ASN.1 output of Readseq.

hennig86

multiple

Hennig86 format.

mega

multiple

Mega format.

meganon

multiple

Meganon format.

nexus

paup

multiple

Nexus/PAUP format.

nexusnon

paupnon

multiple

Nexusnon/PAUPnon format.

jackknifer

multiple

Jackknifer format.

jackknifernon

multiple

Jackknifernon format.

treecon

multiple

Treecon format.

debug

multiple

EMBOSS sequence object report for debugging showing all available fields. Not all fields will contain data—this depends very much on the input format used.

12.1.3 Alignment Formats

When writing out an alignment between two or more sequences, EMBOSS now uses a standard set of formats.

12.1.3.1 Multiple sequence alignment formats

Table 12-4 contains details about the current set of multiple sequence alignment formats available in EMBOSS.

Table 12-4. EMBOSS multiple sequence alignment formats

Name

Comments

unknown

multiple

simple

These are synonyms for simple format. This format displays the sequence names, positions and sequences, then puts the markup line underneath the sequences. When only two sequences are being aligned, the format is changed to that produced by pair.

fasta

This is the standard FASTA sequence format with gaps, where many sequences are concatenated one after the other.

msf

This is the standard MSF sequence format.

trace

This is a special verbose format for use in debugging. It is not intended for normal users.

srs

This shows the sequence ID name, the sequence position, the sequence and the sequence position for each line.

12.1.3.2 Pairwise sequence alignment formats

Table 12-5 contains details about the current set of pairwise sequence alignment formats available in EMBOSS.

Table 12-5. EMBOSS pairwise sequence alignment formats

Name

Comments

pair

This is the default format used when there are only 2 sequences. When simple format is selected but there are only 2 sequences, this format is used. The sequences have the markup line between them.

markx0

This is the standard default output format used by Bill Pearson's suite of FASTA programs.

markx1

This is an alternative output format used by Bill Pearson's suite of FASTA programs in which identities are not marked. Instead, conservative replacements are denoted by "x" and non-conservative substitutions by "X".

markx2

This is an alternative output format used by Bill Pearson's suite of FASTA programs in which the residues in the second sequence are only shown if they are different from the first.

markx3

This is an alternative output format used by Bill Pearson's suite of FASTA programs in which the aligned sequences are displayed in FASTA sequence format. These can be used to build a primitive multiple alignment.

markx10

This is an alternative output format used by Bill Pearson's suite of FASTA programs in which the aligned sequences are displayed in FASTA sequence format and the sequence length, alignment start and stop information is given in lines starting with a ";" character just after the title line for each sequence. It is intended to be easily parsed by other programs.

srspair

This is very similar in style to pair format.

score

This does not display the sequence alignment. It shows only the names of the sequences, the length of the alignment, and the score.

12.1.4 Feature Formats

When reading or writing features associated with a sequence, a standard set of formats is used. The feature files can either be a standard sequence format with a feature table as part of the sequence format, or the features can be held in a file without the associated sequence.

Table 12-6 contains details about the current set of feature formats available in EMBOSS.

Table 12-6. EMBOSS feature formats

Name

Comments

embl

em

The format used by the EMBL nucleic database.

gff

The General Feature Format defined by the Sanger Centre.

swissprot

swiss

sw

The format used by the SWISS-PROT protein database. The feature table keys are also defined.

pir

The format used by the PIR protein database.

nbrf

Only available for input—the same as PIR format.

12.1.5 Report Formats

There are many ways in which the results of an analysis can be reported. Many EMBOSS programs are now able to output their results in a standard report format—you can change the report format used by putting -rformat name on the command line, where name is the name of one of the standard report formats.
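When EMBOSS programs are driven from a script, it can help to check the report format name before building the command line. The Python sketch below only assembles the argument list and does not run anything. The helper name report_command is made up, the set of names is copied from Table 12-7, and passing the USA as the first positional argument is an assumption about how the chosen program takes its input.

```python
import shlex

# Report format names as listed in Table 12-7.
REPORT_FORMATS = {
    "embl", "genbank", "gff", "pir", "swiss", "trace", "listfile",
    "dbmotif", "diffseq", "excel", "feattable", "motif", "regions",
    "seqtable", "simple", "srs", "table", "tagseq",
}


def report_command(program, usa, rformat):
    """Build an argument list that asks program for the given report format."""
    if rformat not in REPORT_FORMATS:
        raise ValueError("unknown report format: " + rformat)
    # Assumption: the input USA can be given positionally after the program name.
    return [program, usa, "-rformat", rformat]


if __name__ == "__main__":
    # Print the command as it would be typed in a shell.
    print(shlex.join(report_command("garnier", "sw:100K_rat", "gff")))
```

The resulting list can be handed to subprocess.run on a machine where EMBOSS is installed; any qualifiers not supplied would still be prompted for by the program.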

Table 12-7 contains examples of garnier analyzing sw:100K_rat output in various report formats.

Table 12-7. EMBOSS report formats

Name

Comments

embl

Writes a report in EMBL feature table format.

genbank

Writes a report in Genbank feature table format.

gff

Writes a report in GFF feature table format.

pir

Writes a report in PIR feature table format.

swiss

Writes a report in SWISS-PROT feature table format.

trace

Of use only for debugging.

listfile

Writes out a list file with the start and end points of the motifs given by "[start:end]" after the sequence's full USA. This is useful as it is a true List File that can be read in by other EMBOSS programs using "@" or "list::" before the filename.

dbmotif

Writes a report in DbMotif format.

Format:

  Length = [length]
  Start = position [start] of sequence
  End = position [end] of sequence
  ... other tags ...
  [sequence]
  [start and end numbered below sequence with '|' marks]
  Blank line

Data reported: Length, Start, End, Sequence (5 bases around feature)

diffseq

This format is most useful when reporting the results of two aligned sequences, as in the program diffseq. The report describes matches, usually short, between two sequences and features which overlap them.

Format:

  [Sequence 1 Name] [start]-[end] Length: [length]
  Feature: first sequence feature(s)
  Sequence: motif in sequence 1
  Sequence: motif in sequence 2
  Feature: second sequence feature(s)
  [Sequence 2 Name] [start]-[end] Length: [length]
  Blank line

excel

A TAB-delimited table format suitable for reading into spreadsheet programs such as Excel. Name, start, end, and score are always reported. Other tags in the report definition are added as extra columns. All values are (for now) unquoted. Missing values are reported as ".".

feattable

Writes a report in FeatTable format. The report is an EMBL feature table using only the tags in the report definition. There is no requirement for tag names to match standards for the EMBL feature table. The original EMBOSS application for this format was cpgreport.

Format:

  FT [type] [start]..[end]
  FT        /[tagname]=[tagvalue]
  Blank line

Data reported: Type, Start, End

motif

Writes a report in Motif format. Based on the original output format of antigenic, helixturnhelix and sigcleave.

Format:

  (1) Score [score] length [length] at [name] [start]->[end]
              *  (marked at position pos)
            [sequence]
            |        |
      [start]        [end]
  [tagname]: tagvalue

Data reported: Name, Start, End, Length, Score, Sequence

regions

Writes a report in Regions format. The report (unusually for the current report formats) includes the feature type.

Format:

  [type] from [start] to [end] ([length] [name]) ([tagname]: [tagvalue], [tagname]: [tagvalue] ...)

Data reported: Type, Start, End, Length, Name

seqtable

Writes a report in SeqTable format. This is a simple table format that includes the feature sequence. See the following "table" entry for a version without the sequence. Missing tag values are reported as ".". The column width is 6, or longer if the name is longer.

Format:

  Start   End   [tagnames]  Sequence
  [start] [end] [tagvalues] [sequence]

simple

Writes a report in SRS simple format. This is a simple parsable format that does not include the feature sequence (see also SRS format) for applications where features can be large. Missing tag values are reported as ".".

Format:

  Feature [number]
  Name: [ID name]
  Start:  [start]
  End: [end]
  Length: [length]
  [tagnames:]  [tag values]
  Blank line

srs

Writes a report in SRS format. This is a simple parsable format that includes the feature sequence. Missing tag values are reported as ".".

Format:

  Feature [number]
  Name: [ID name]
  Start:  [start]
  End: [end]
  Length: [length]
  Sequence: [sequence]
  Score: [score]
  [tagnames:]  [tag values]
  Blank line

table

Writes a report in Table format. See previous "seqtable" entry for a version with the sequence. Missing tag values are reported as ".". The column width is 6, or longer if the name is longer.

Format:

  USA    Start   End   Score   [tagnames]
  [name] [start] [end] [score] [tagvalues]

tagseq

Writes a report in Tagseq format. Features are marked up below the sequence. Originally developed for the garnier application, this format also has general uses.

Format:

  Sequence position written every 10 bases/residues
  Sequence (50 residues)
  tagname        ++++++++++++    +++++++++
  Blank line

If the tag value is a 1-letter code, use it in place of "+".
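Of the report formats in Table 12-7, excel is the easiest to read back into a script, because it is a TAB-delimited table with unquoted values and "." for missing values. The Python sketch below is one way to do that. It assumes the report begins with a single header row naming the columns; the function name read_excel_report and the sample column names in the usage note are invented for illustration.

```python
import csv
import io


def read_excel_report(text):
    """Parse a TAB-delimited report into a list of dicts.

    Assumption: the first row is a header. Values stay as strings,
    except that "." (a missing value) becomes None.
    """
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # the report's values are unquoted
    )
    return [
        {key: (None if value == "." else value) for key, value in row.items()}
        for row in reader
    ]
```

Given a made-up report whose header is Name, Start, End, Score, Strand and whose only data row ends in ".", the single resulting dict has None under "Strand" and string values everywhere else.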

12.1.6 EMBOSS Application Groups

To aid users in finding programs of interest, the EMBOSS developers have clustered the programs into application groups. These groups are presented below.

12.1.6.1 Alignment consensus

cons
megamerger
merger

12.1.6.2 Alignment differences

diffseq

12.1.6.3 Alignment dot plots

dotmatcher
dotpath
dottup
polydot

12.1.6.4 Alignment global

alignwrap
est2genome
needle
stretcher

12.1.6.5 Alignment local

matcher
seqmatchall
supermatcher
water
wordmatch

12.1.6.6 Alignment multiple

emma
plotcon
showalign
infoalign
prettyplot
tranalign

12.1.6.7 Display

abiview
pepnet
prettyseq
showalign
showseq
cirdna
pepwheel
remap
showdb
textsearch
lindna
prettyplot
seealso
showfeat

12.1.6.8 Edit

cutseq
listor
nthseq
splitter
yank
biosed
extractseq
notseq
skipseq
vectorstrip
degapseq
maskfeat
pasteseq
swissparse
descseq
maskseq
revseq
trimest
entret
newseq
seqret
trimseq
extractfeat
noreturn
seqretsplit
union

12.1.6.9 Enzyme kinetics

findkm

12.1.6.10 Feature tables

coderet
extractfeat
maskfeat
showfeat
swissparse

12.1.6.11 Information

infoalign
seealso
textsearch
whichdb
wossname
infoseq
showdb
tfm

12.1.6.12 Menus

emnu

12.1.6.13 Nucleic 2d structure

einverted

12.1.6.14 Nucleic codon usage

cai
chips
codcmp
cusp
syco

12.1.6.15 Nucleic composition

banana
chaos
dan
isochore
btwisted
compseq
freak
wordcount

12.1.6.16 Nucleic cpg islands

cpgplot
cpgreport
geecee
newcpgreport
newcpgseek

12.1.6.17 Nucleic gene finding

getorf
marscan
plotorf
showorf
wobble

12.1.6.18 Nucleic motifs

dreg
fuzznuc
fuzztran
marscan

12.1.6.19 Nucleic mutation

msbar
shuffleseq

12.1.6.20 Nucleic primers

eprimer3
primersearch
stssearch

12.1.6.21 Nucleic profiles

profit
prophecy
prophet

12.1.6.22 Nucleic repeats

einverted
equicktandem
etandem
palindrome

12.1.6.23 Nucleic restriction

recoder
remap
restrict
silent
redata
restover
showseq

12.1.6.24 Nucleic transcription

tfscan

12.1.6.25 Nucleic translation

backtranseq
plotorf
remap
showseq
coderet
prettyseq
showorf
transeq

12.1.6.26 Phylogeny

distmat

12.1.6.27 Protein 2d structure

garnier
hmoment
pepnet
tmap
helixturnhelix
pepcoil
pepwheel

12.1.6.28 Protein 3d structure

contacts
interface
scopalign
seqalign
seqwords
dichet
profgen
scoprep
seqsearch
siggen
hmmgen
psiblasts
scopreso
seqsort
sigscan

12.1.6.29 Protein composition

backtranseq
compseq
iep
octanol
pepwindow
charge
emowse
mwcontam
pepinfo
pepwindowall
checktrans
freak
mwfilter
pepstats

12.1.6.30 Protein motifs

antigenic
fuzztran
patmatdb
preg
digest
helixturnhelix
patmatmotifs
pscan
fuzzpro
oddcomp
pepcoil
sigcleave

12.1.6.31 Protein mutation

msbar
shuffleseq

12.1.6.32 Protein profiles

profit
prophecy
prophet

12.1.6.33 Protein structure

seqsort

12.1.6.34 Test

histogramtest

12.1.6.35 Utilities—database creation

aaindexextract
groups
pdbtosp
scope
tfextract
cutgextract
hetparse
printsextract
scopnr
domainer
nrscope
prosextract
scopparse
funky
pdbparse
rebaseextract
scopseqs

12.1.6.36 Utilities—database indexing

dbiblast
dbifasta
dbiflat
dbigcg

12.1.6.37 Utilities—miscellaneous

embossdata
embossversion