Many EMBOSS programs have functionality in common. They all understand the same sorts of sequence addresses, sequence formats, output formats, and feature formats. The following sections describe some common themes in EMBOSS.

12.1.1 Uniform Sequence Address

The Uniform Sequence Address (USA) is a standard sequence naming scheme used by all EMBOSS applications. The USA syntax takes one of several forms (see Table 12-1).

The distinct "::" and ":" separators allow names such as "embl" and "pir" to serve as both database names and sequence formats. In addition, EMBOSS allows the command line to separately define the format and the entry name so that only the filename is required.

The "file" and "dbname" forms of USA may have "format::" in front of them, but because a database is aware of its format, this prefix is redundant for a database and is not recommended.

Any USA may optionally take a subsequence specifier after the main body of the USA, in the form "[start : end]" or "[start : end : r]", where start and end are the required start and end positions. Negative positions count from the end of the sequence. Use of this USA subsequence specifier is equivalent to using the -sbegin, -send, or -sreverse command-line qualifiers.

Table 12-1 contains some USA examples.

Table 12-1. EMBOSS Uniform Sequence Address (USA) examples
Type | Example | Comments
filename | xxx.seq | A sequence file xxx.seq in any format.
format::filename | fasta::xxx.seq | A sequence file xxx.seq in FASTA format.
db:IDname | embl:paamir | EMBL entry PAAMIR, using whatever access method is defined locally for the EMBL database.
db:AccessionNumber | embl:X13776 | EMBL entry X13776, using whatever access method is defined locally for the EMBL database. Search by accession number and entry name. X13776 is the accession number in this case.
db-acc:AccessionNumber | embl-acc:X13776 | EMBL entry X13776, using whatever access method is defined locally for the EMBL database. Search by accession number only.
db-id:IDname | embl-id:paamir | EMBL entry PAAMIR, using whatever access method is defined locally for the EMBL database. Search by ID only.
db-searchfield:word | embl-des:lectin | EMBL entries containing the word "lectin" in the Description line.
db-searchfield:wcardword | embl-org:*human* | EMBL entries containing the wildcarded word "human" in the Organism fields.
db:wildcard-ID | embl:paami* | EMBL entries PAAMIB, PAAMIE and so on, usually in alphabetical order, using whatever access method is defined locally for the EMBL database.
db or db:* | embl or EMBL:* | All sequences in the EMBL database.
@listfile | @mylist | Reads file mylist and uses each line as a separate USA. List files can contain references to other list files or any other standard USA.
list::listfile | list::mylist | Same as @mylist.
"program parameters |" | "getz -e [embl-id:paamir] |" | The trailing pipe character ("|") causes EMBOSS to fire up getz (the SRS sequence retrieval program) to extract entry PAAMIR from EMBL in EMBL format. Any application or script that writes one or more sequences to stdout can be used in this way.
asis::sequence | asis::atacgcagttatctgaccat | So far, the shortest USA we could invent. In "asis" format the name is the sequence, so no file needs to be opened. This is a special case. It was intended as a joke, but could be quite useful for generating command lines.

12.1.2 Sequence Formats

You can specify the format to use on input by giving the format name, followed by two colons, before the name of the file holding your sequences. For example:

embl::myfile.seq

The format is not required. When reading in a sequence, EMBOSS will guess the sequence format by trying all known formats until one succeeds. When writing out a sequence, EMBOSS will use FASTA format by default. You can specify another format to use, for example:

gcg::myresults.seq

12.1.2.1 Input sequence formats

To date, the sequence formats in Table 12-2 are accepted as input.
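The two behaviors just described, an explicit "format::" prefix versus guessing by trying formats in turn, can be sketched in Python. This is a toy illustration and not how EMBOSS itself is implemented: the function names (read_sequence, parse_fasta, parse_raw, parse_plain), the three-format list, and the trial order are all assumptions made for the example. The raw and plain behavior follows the descriptions in Table 12-2.

```python
from typing import Callable, Dict, Optional, Tuple

Parser = Callable[[str], Optional[str]]


def parse_fasta(text: str) -> Optional[str]:
    """Toy single-record FASTA reader: a '>' header line, then sequence lines."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        return None
    seq = "".join(line.strip() for line in lines[1:])
    return seq if seq.isascii() and seq.isalpha() else None


def parse_raw(text: str) -> Optional[str]:
    """raw: drop whitespace and digits, reject any other non-alphabetic character."""
    seq = "".join(c for c in text if not (c.isspace() or c.isdigit()))
    return seq if seq.isascii() and seq.isalpha() else None


def parse_plain(text: str) -> Optional[str]:
    """text/plain: the whole file is taken as the sequence, unparsed."""
    return text


# Trial order is an assumption: strictest first, "accept anything" last.
PARSERS: Dict[str, Parser] = {
    "fasta": parse_fasta,
    "raw": parse_raw,
    "plain": parse_plain,
}


def read_sequence(spec: str, load: Callable[[str], str]) -> Tuple[str, str]:
    """Return (format_used, sequence) for 'format::filename' or a bare 'filename'.

    `load` maps a filename to its text, so the sketch can run without real files.
    """
    fmt, sep, name = spec.partition("::")
    if sep:
        if fmt == "asis":
            # In "asis" format the name is the sequence; no file is opened.
            return "asis", name
        if fmt not in PARSERS:
            raise ValueError(f"unknown format: {fmt}")
        seq = PARSERS[fmt](load(name))
        if seq is None:
            raise ValueError(f"{name} is not valid {fmt}")
        return fmt, seq
    # No prefix: try each known format in turn until one succeeds.
    text = load(spec)
    for fmt, parser in PARSERS.items():
        seq = parser(text)
        if seq is not None:
            return fmt, seq
    raise ValueError(f"no known format matches {spec}")


files = {
    "a.seq": ">x demo\nACGT\nTTGA\n",
    "b.seq": "1 acgt acgt\n11 ttga\n",
}
print(read_sequence("a.seq", files.__getitem__))      # ('fasta', 'ACGTTTGA')
print(read_sequence("b.seq", files.__getitem__))      # ('raw', 'acgtacgtttga')
print(read_sequence("raw::b.seq", files.__getitem__)) # ('raw', 'acgtacgtttga')
```

Passing the loader as a parameter keeps the sketch independent of the filesystem; a real reader would open the named file instead.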
By default (i.e., when no format is explicitly specified), EMBOSS tries each format in turn until one succeeds.

Table 12-2. EMBOSS input sequence formats
Input format | Comments
abi | ABI trace file format. This is the file format produced by ABI sequencing machines. It contains the trace data, i.e., the probabilities of the four bases along the sequencing run, together with the sequence, as deduced from that data. The sequence information is what is normally read in and used by EMBOSS programs, although the trace data is available and may be utilized by some specialized EMBOSS programs. The code for this is heavily based on David Mathog's Fortran library with a description of ABI trace file format (abi.txt): ftp://saf.bio.caltech.edu/pub/software/molbio/abitools.zip.
acedb | ACeDB format.
clustal aln | ClustalW ALN (multiple alignment) format.
codata | CODATA format.
dbid | Odd FASTA format with database name first, followed by ID name and an optional accession number, e.g.: ">database name description" or ">database name accession description".
embl em | EMBL entry format, or at least a minimal subset of the fields. The Staden package and others use EMBL or similar formats for sequence data.
fasta pearson | FASTA format with an optional accession number after the sequence identifier, e.g.: ">name description" or ">name accession description", and with an optional database name in GCG style FASTA format included as part of the sequence identifier, e.g.: ">database:name accession description".
gcg gcg8 | gcg is GCG 9.x and 10.x format, with the format and sequence type identified on the first line of the file. gcg8 is GCG 8.x format, where anything up to the first line containing ".." is considered as heading, and the remainder is sequence data.
genbank gb ddbj | GENBANK entry format, or at least a minimal subset of the fields.
gff | GFF format.
hennig86 | Hennig86 format.
ig | IntelliGenetics format.
jackknifer | Jackknifer format.
jackknifernon | Jackknifernon format.
nbrf pir | NBRF (PIR) format, as used in the PIR database sequence files.
nexus paup | Nexus/PAUP format.
nexusnon paupnon | Nexusnon/PAUPnon format.
treecon | Treecon format.
mega | Mega format.
meganon | Meganon format.
msf | Wisconsin Package GCG's MSF multiple sequence format.
ncbi | FASTA format with optional accession number and database name in NCBI style included as part of the sequence identifier, e.g.: ">database|accession|id description" (and other variants on this theme!).
pfam stockholm | Pfam format.
phylip | PHYLIP interleaved multiple alignment format.
selex | SELEX format is used by Sean Eddy's HMMER package. It can store RNA secondary structure as part of the sequence annotation.
staden experiment | The experiment file format used by the gap program in the Staden package, where the sequence identifier is optional and the remainder is plain text. Some alternative nucleotide ambiguity codes are used and must be converted.
strider | DNA Strider format.
swissprot swiss sw | SWISS-PROT entry format, or at least a minimal subset of the fields.
text plain | Plain text. This is the format with no format. The whole of the file is read in as a sequence, and no attempt is made to parse the file contents in any way. Any character will be included in the sequence, even digits and punctuation. Use this format only when you are sure that the input sequence file is correct and contains only what you want to be considered as your sequence.
raw | Similar to text or plain format, but safer. Digits, spaces, and TAB characters are removed and ignored, and only alphabetic characters are accepted. If a sequence contains other non-alphabetic characters (e.g., punctuation characters), it is rejected as erroneous.
asis | Not a sequence format, but a quick way of entering a sequence on the command line. It is included here for completeness. In "asis" format, the actual sequence appears where a filename would normally be given, e.g.: asis::atacgcagttatctgacc. The name is the sequence, so no file needs to be opened. This is a special case. It was intended as a joke, but could be quite useful for generating command lines.

12.1.2.2 Output sequence formats

To date, the sequence formats in Table 12-3 are available as output. Some sequence formats can hold multiple sequences in one file; these are marked as multiple in the table. Formats such as GCG, plain, and staden can hold only one sequence per file and are marked as single.

Table 12-3. EMBOSS output sequence formats
Output format | Single/multiple | Comments
gcg gcg8 | single | gcg is Wisconsin Package GCG 9.x and 10.x format, with the sequence type on the first line of the file. gcg8 is GCG 8.x format, where anything up to the first line containing ".." is considered as heading, and the remainder is sequence data.
embl em | multiple | EMBL entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence.
swiss sw | multiple | SWISS-PROT entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence.
fasta pearson | multiple | Standard Pearson FASTA format, but with the accession number included after the identifier if available.
ncbi | multiple | NCBI style FASTA format with the database name, entry name and accession number separated by pipe ("|") characters.
nbrf pir | multiple | NBRF (PIR) format, as used in the PIR database sequence files.
genbank gb | multiple | GENBANK entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence.
gff | multiple | GFF format.
ig | multiple | IntelliGenetics format, as used by the IntelliGenetics package.
codata | multiple | CODATA format.
strider | multiple | DNA Strider format.
acedb | multiple | ACeDB format.
staden experiment | single | The experiment file format used by the gap program in the Staden package. Some alternative nucleotide ambiguity codes are used and are converted.
text plain raw | single | Plain sequence, no annotation or heading.
fitch | multiple | Fitch format.
msf | multiple | Wisconsin Package GCG's MSF multiple sequence format.
clustal aln | multiple | Clustal multiple sequence format.
selex | multiple | SELEX format.
phylip | multiple | PHYLIP interleaved format.
phylip3 | multiple | PHYLIP non-interleaved format that was used in PHYLIP version 3.2.
asn1 | multiple | A subset of ASN.1 containing entry name, accession number, description and sequence, similar to the current ASN.1 output of Readseq.
hennig86 | multiple | Hennig86 format.
mega | multiple | Mega format.
meganon | multiple | Meganon format.
nexus paup | multiple | Nexus/PAUP format.
nexusnon paupnon | multiple | Nexusnon/PAUPnon format.
jackknifer | multiple | Jackknifer format.
jackknifernon | multiple | Jackknifernon format.
treecon | multiple | Treecon format.
debug | multiple | EMBOSS sequence object report for debugging, showing all available fields. Not all fields will contain data; this depends very much on the input format used.

12.1.3 Alignment Formats

When writing out an alignment between two or more sequences, EMBOSS now uses a standard set of formats.
12.1.3.1 Multiple sequence alignment formats

Table 12-4 contains details about the current set of multiple sequence alignment formats available in EMBOSS.

Table 12-4. EMBOSS multiple sequence alignment formats
Name | Comments
unknown multiple simple | These are synonyms for simple format. This format displays the sequence names, positions and sequences, then puts the markup line underneath the sequences. When only two sequences are being aligned, the format is changed to that produced by pair.
fasta | This is the standard FASTA sequence format with gaps, where many sequences are concatenated one after the other.
msf | This is the standard MSF sequence format.
trace | This is a special verbose format for use in debugging. It is not intended for normal users.
srs | This shows the sequence ID name, the sequence position, the sequence and the sequence position for each line.

12.1.3.2 Pairwise sequence alignment formats

Table 12-5 contains details about the current set of pairwise sequence alignment formats available in EMBOSS.

Table 12-5. EMBOSS pairwise sequence alignment formats
Name | Comments
pair | This is the default format used when there are only 2 sequences. When simple format is selected but there are only 2 sequences, this format is used. The sequences have the markup line between them.
markx0 | This is the standard default output format used by Bill Pearson's suite of FASTA programs.
markx1 | This is an alternative output format used by Bill Pearson's suite of FASTA programs in which identities are not marked. Instead, conservative replacements are denoted by "x" and non-conservative substitutions by "X".
markx2 | This is an alternative output format used by Bill Pearson's suite of FASTA programs in which the residues in the second sequence are only shown if they are different from the first.
markx3 | This is an alternative output format used by Bill Pearson's suite of FASTA programs in which the aligned sequences are displayed in FASTA sequence format. These can be used to build a primitive multiple alignment.
markx10 | This is an alternative output format used by Bill Pearson's suite of FASTA programs in which the aligned sequences are displayed in FASTA sequence format and the sequence length, alignment start and stop information is given in lines starting with a ";" character just after the title line for each sequence. It is intended to be easily parsed by other programs.
srspair | This is very similar in style to pair format.
score | This does not display the sequence alignment. It shows only the names of the sequences, the length of the alignment, and the score.

12.1.4 Feature Formats

When reading or writing features associated with a sequence, a standard set of formats is used. The feature files can either be a standard sequence format with a feature table as part of the sequence format, or the features can be held in a file without the associated sequence. Table 12-6 contains details about the current set of feature formats available in EMBOSS.

Table 12-6. EMBOSS feature formats
Name | Comments
embl em | The format used by the EMBL nucleotide database.
gff | The General Feature Format defined by the Sanger Centre.
swissprot swiss sw | The format used by the SWISS-PROT protein database. The feature table keys are also defined.
pir | The format used by the PIR protein database.
nbrf | Only available for input; the same as PIR format.

12.1.5 Report Formats

There are many ways in which the results of an analysis can be reported. Many EMBOSS programs are now able to output their results in a standard report format. You can change the report format used by putting -rformat name on the command line, where name is the name of one of the standard report formats.
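As a rough illustration of the -rformat qualifier, the following Python sketch assembles a command line for an EMBOSS program and checks the requested name against the report formats listed in Table 12-7. The helper name report_command is hypothetical, and the sketch only builds the argument list; actually running the result (for example with subprocess.run) assumes EMBOSS is installed and that the chosen program supports report output.

```python
from typing import List, Sequence

# Report format names as listed in Table 12-7.
REPORT_FORMATS = frozenset({
    "embl", "genbank", "gff", "pir", "swiss", "trace", "listfile",
    "dbmotif", "diffseq", "excel", "feattable", "motif", "regions",
    "seqtable", "simple", "srs", "table", "tagseq",
})


def report_command(program: str, args: Sequence[str], rformat: str) -> List[str]:
    """Build an argument list that appends '-rformat <name>' to a program call.

    `args` holds whatever the program itself needs (for example a USA);
    it is passed through unchanged.
    """
    if rformat not in REPORT_FORMATS:
        raise ValueError(f"unknown report format: {rformat}")
    return [program, *args, "-rformat", rformat]


cmd = report_command("garnier", ["sw:100K_rat"], "gff")
print(" ".join(cmd))  # garnier sw:100K_rat -rformat gff
```

Checking the name before the program is launched gives an early, readable error for a mistyped format; EMBOSS performs its own validation as well.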
Table 12-7 contains examples of garnier analyzing sw:100K_rat output in various report formats.

Table 12-7. EMBOSS report formats
Name | Comments
embl | Writes a report in EMBL feature table format.
genbank | Writes a report in GenBank feature table format.
gff | Writes a report in GFF feature table format.
pir | Writes a report in PIR feature table format.
swiss | Writes a report in SWISS-PROT feature table format.
trace | Of use only for debugging.
listfile | Writes out a list file with the start and end points of the motifs given by "[start:end]" after the sequence's full USA. This is useful as it is a true list file that can be read in by other EMBOSS programs using "@" or "list::" before the filename.
dbmotif | Writes a report in DbMotif format. Format: Length = [length] Start = position [start] of sequence End = position [end] of sequence ... other tags ... [sequence] [start and end numbered below sequence with '|' marks] Blank line. Data reported: Length, Start, End, Sequence (5 bases around feature)
diffseq | This format is most useful when reporting the results of two aligned sequences, as in the program diffseq. The report describes matches, usually short, between two sequences and features which overlap them. Format: [Sequence 1 Name] [start]-[end] Length: [length] Feature: first sequence feature(s) Sequence: motif in sequence 1 Sequence: motif in sequence 2 Feature: second sequence feature(s) [Sequence 2 Name] [start]-[end] Length: [length] Blank line
excel | A TAB-delimited table format suitable for reading into spreadsheet programs such as Excel. Name, start, end, and score are always reported. Other tags in the report definition are added as extra columns. All values are (for now) unquoted. Missing values are reported as ".".
feattable | Writes a report in FeatTable format. The report is an EMBL feature table using only the tags in the report definition. There is no requirement for tag names to match standards for the EMBL feature table. The original EMBOSS application for this format was cpgreport. Format: FT [type] [start]..[end] FT /[tagname]=[tagvalue] Blank line. Data reported: Type, Start, End
motif | Writes a report in Motif format. Based on the original output format of antigenic, helixturnhelix and sigcleave. Format: (1) Score [score] length [length] at [name] [start]->[end] * (marked at position pos) [sequence] '|' '|' [start] [end] [tagname]: tagvalue. Data reported: Name, Start, End, Length, Score, Sequence
regions | Writes a report in Regions format. The report (unusually for the current report formats) includes the feature type. Format: [type] from [start] to [end] ([length] [name]) ([tagname]: [tagvalue], [tagname]: [tagvalue] ...). Data reported: Type, Start, End, Length, Name
seqtable | Writes a report in SeqTable format. This is a simple table format that includes the feature sequence. See the following "table" entry for a version without the sequence. Missing tag values are reported as ".". The column width is 6, or longer if the name is longer. Format: Start End [tagnames] Sequence [start] [end] [tagvalues] [sequence]
simple | Writes a report in SRS simple format. This is a simple parsable format that does not include the feature sequence (see also SRS format), for applications where features can be large. Missing tag values are reported as ".". Format: Feature [number] Name: [ID name] Start: [start] End: [end] Length: [length] [tagnames:] [tag values] Blank line
srs | Writes a report in SRS format. This is a simple parsable format that includes the feature sequence. Missing tag values are reported as ".". Format: Feature [number] Name: [ID name] Start: [start] End: [end] Length: [length] Sequence: [sequence] Score: [score] [tagnames:] [tag values] Blank line
table | Writes a report in Table format. See previous "seqtable" entry for a version with the sequence. Missing tag values are reported as ".". The column width is 6, or longer if the name is longer. Format: USA Start End Score [tagnames] [name] [start] [end] [score] [tagvalues]
tagseq | Writes a report in Tagseq format. Features are marked up below the sequence. Originally developed for the garnier application, this format also has general uses. Format: Sequence position written every 10 bases/residues Sequence (50 residues) tagname ++++++++++++ +++++++++ Blank line. If the tag value is a 1-letter code, it is used in place of "+".

12.1.6 EMBOSS Application Groups

To aid users in finding programs of interest, the EMBOSS developers have clustered the programs into application groups. These groups are presented below.

12.1.6.1 Alignment consensus

12.1.6.2 Alignment differences

12.1.6.3 Alignment dot plots
- dotmatcher
- dotpath
- dottup
- polydot

12.1.6.4 Alignment global
- alignwrap
- est2genome
- needle
- stretcher

12.1.6.5 Alignment local
- matcher
- seqmatchall
- supermatcher
- water
- wordmatch

12.1.6.6 Alignment multiple
- emma
- plotcon
- showalign
- infoalign
- prettyplot
- tranalign

12.1.6.7 Display
- abiview
- pepnet
- prettyseq
- showalign
- showseq
- cirdna
- pepwheel
- remap
- showdb
- textsearch
- lindna
- prettyplot
- seealso
- showfeat

12.1.6.8 Edit
- cutseq
- listor
- nthseq
- splitter
- yank
- biosed
- extractseq
- notseq
- skipseq
- vectorstrip
- degapseq
- maskfeat
- pasteseq
- swissparse
- descseq
- maskseq
- revseq
- trimest
- entret
- newseq
- seqret
- trimseq
- extractfeat
- noreturn
- seqretsplit
- union

12.1.6.9 Enzyme kinetics

12.1.6.10 Feature tables
- coderet
- extractfeat
- maskfeat
- showfeat
- swissparse

12.1.6.11 Information
- infoalign
- seealso
- textsearch
- whichdb
- wossname
- infoseq
- showdb
- tfm

12.1.6.12 Menus

12.1.6.13 Nucleic 2d structure

12.1.6.14 Nucleic codon usage
- cai
- chips
- codcmp
- cusp
- syco

12.1.6.15 Nucleic composition
- banana
- chaos
- dan
- isochore
- btwisted
- compseq
- freak
- wordcount

12.1.6.16 Nucleic cpg islands
- cpgplot
- cpgreport
- geecee
- newcpgreport
- newcpgseek

12.1.6.17 Nucleic gene finding
- getorf
- marscan
- plotorf
- showorf
- wobble

12.1.6.18 Nucleic motifs
- dreg
- fuzznuc
- fuzztran
- marscan

12.1.6.19 Nucleic mutation

12.1.6.20 Nucleic primers
- eprimer3
- primersearch
- stssearch

12.1.6.21 Nucleic profiles

12.1.6.22 Nucleic repeats
- einverted
- equicktandem
- etandem
- palindrome

12.1.6.23 Nucleic restriction
- recoder
- remap
- restrict
- silent
- redata
- restover
- showseq

12.1.6.24 Nucleic transcription

12.1.6.25 Nucleic translation
- backtranseq
- plotorf
- remap
- showseq
- coderet
- prettyseq
- showorf
- transeq

12.1.6.26 Phylogeny

12.1.6.27 Protein 2d structure
- garnier
- hmoment
- pepnet
- tmap
- helixturnhelix
- pepcoil
- pepwheel

12.1.6.28 Protein 3d structure
- contacts
- interface
- scopalign
- seqalign
- seqwords
- dichet
- profgen
- scoprep
- seqsearch
- siggen
- hmmgen
- psiblasts
- scopreso
- seqsort
- sigscan

12.1.6.29 Protein composition
- backtranseq
- compseq
- iep
- octanol
- pepwindow
- charge
- emowse
- mwcontam
- pepinfo
- pepwindowall
- checktrans
- freak
- mwfilter
- pepstats

12.1.6.30 Protein motifs
- antigenic
- fuzztran
- patmatdb
- preg
- digest
- helixturnhelix
- patmatmotifs
- pscan
- fuzzpro
- oddcomp
- pepcoil
- sigcleave

12.1.6.31 Protein mutation

12.1.6.32 Protein profiles

12.1.6.33 Protein structure

12.1.6.34 Test

12.1.6.35 Utilities database creation
- aaindexextract
- groups
- pdbtosp
- scope
- tfextract
- cutgextract
- hetparse
- printsextract
- scopnr
- domainer
- nrscope
- prosextract
- scopparse
- funky
- pdbparse
- rebaseextract
- scopseqs

12.1.6.36 Utilities database indexing
- dbiblast
- dbifasta
- dbiflat
- dbigcg

12.1.6.37 Utilities miscellaneous