2.6 EMBL Field Definitions

EMBL sequence flat files use field codes to organize information for both human readability and machine-based parsing. Each field code is a two-letter abbreviation. Table 2-2 summarizes the content of each field code.

Table 2-2. EMBL field definitions

Line code

Content

ID

Identification

AC

Accession number(s)

SV

New sequence identifier

DT

Date

DE

Description

KW

Keyword

OS

Organism species

OC

Organism classification

OG

Organelle

RN

Reference number

RC

Reference comment(s)

RP

Reference positions

RX

Reference cross-reference(s)

RA

Reference authors

RT

Reference title

RL

Reference location

DR

Database cross-references

FH

Feature table header

FT

Feature table data

CC

Comments or notes

XX

Spacer line

SQ

Sequence header

(Blanks)

Sequence data

//

Termination line
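Because every line begins with its code, a flat file can be split into fields with very little logic. The following is a minimal Python sketch rather than a complete parser. It assumes that the code occupies the first two characters of each line, that one entry is passed in at a time, and that continuation lines simply repeat the code. The function name parse_embl_entry and the sample entry (including its ID and accession number) are hypothetical and used only for illustration.

```python
from collections import defaultdict


def parse_embl_entry(text):
    """Group the lines of one EMBL entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code = line[:2]
        if code == "//":
            # Termination line: the entry ends here.
            break
        if code == "XX" or not line.strip():
            # Spacer lines carry no data.
            continue
        # A blank line code marks sequence data (see Table 2-2).
        key = "sequence" if code == "  " else code
        fields[key].append(line[2:].strip())
    return dict(fields)


# Hypothetical entry, invented for this example.
example_entry = "\n".join([
    "ID   EXAMPLE1   standard; DNA; UNC; 12 BP.",
    "XX",
    "AC   X00000;",
    "XX",
    "DE   Hypothetical example entry.",
    "XX",
    "SQ   Sequence 12 BP; 3 A; 3 C; 3 G; 3 T; 0 other;",
    "     acgtacgtac gt",
    "//",
])

record = parse_embl_entry(example_entry)
print(record["AC"])
print(record["sequence"])
```

Each value is a list because codes such as DE, KW, and FT commonly span several lines. A real parser would also need to group the reference block (RN through RL) per reference.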

2.7 DDBJ/EMBL/GenBank Feature Table

In February 1986, GenBank and EMBL (joined by DDBJ in 1987) started a collaborative effort to create a common feature table format. The overall objective of the feature table was to supply an in-depth vocabulary for describing nucleotide (and protein) features. The tables in this section follow Version 4 of the feature table.

2.7.1 Features

A feature is a single word or abbreviation indicating a functional role or region associated with a sequence. A list of DDBJ/EMBL/GenBank features is presented in Table 2-3. In the Definition column of the table, the appropriate qualifiers for each feature are in brackets. Mandatory qualifiers are highlighted in bold.

Table 2-3. DDBJ/EMBL/GenBank feature key table

Feature Key

Definition

attenuator

1) region of DNA at which regulation of termination of transcription occurs, which controls the expression of some bacterial operons.

2) sequence segment located between the promoter and the first structural gene that causes partial termination of transcription.

[citation, db_xref, evidence, gene, label, map, note, phenotype, usedin]

C_region

Constant region of immunoglobulin light and heavy chains, and T-cell receptor alpha, beta, and gamma chains; includes one or more exons depending on the particular chain.

[citation, db_xref, evidence, gene, label, map, note, product, pseudo, standard_name, usedin]

CAAT_signal

CAAT box; part of a conserved sequence located about 75 bp upstream of the start point of eukaryotic transcription units which may be involved in RNA polymerase binding; consensus=GG (C or T) CAATCT.

[citation, db_xref, evidence, gene, label, map, note, usedin]

CDS

Coding sequence; sequence of nucleotides that corresponds with the sequence of amino acids in a protein (location includes stop codon); feature includes amino acid conceptual translation.

[allele, citation, codon, codon_start, db_xref, EC_number, evidence, exception, function, gene, label, map, note, number, product, protein_id, pseudo, standard_name, translation, transl_except, transl_table, usedin]

conflict

Independent determinations of the "same" sequence differ at this site or region.

[citation, db_xref, evidence, label, map, note, gene, replace, usedin]

D-loop

Displacement loop; a region within mitochondrial DNA in which a short stretch of RNA is paired with one strand of DNA, displacing the original partner DNA strand in this region. Also used to describe the displacement of a region of one strand of duplex DNA by a single stranded invader in the reaction catalyzed by RecA protein.

[citation, db_xref, evidence, gene, label, map, note, usedin]

D_segment

Diversity segment of immunoglobulin heavy chain, and T-cell receptor beta chain.

[citation, db_xref, evidence, gene, label, map, note, product, pseudo, standard_name, usedin]

enhancer

A cis-acting sequence that increases the utilization of (some) eukaryotic promoters, and can function in either orientation and in any location (upstream or downstream) relative to the promoter.

[citation, db_xref, evidence, gene, label, map, note, standard_name, usedin]

exon

Region of genome that codes for portion of spliced mRNA, rRNA and tRNA; may contain 5' UTR, all CDSs, and 3' UTR.

[allele, citation, db_xref, EC_number, evidence, function, gene, label, map, note, number, product, pseudo, standard_name, usedin]

GC_signal

GC box; a conserved GC-rich region located upstream of the start point of eukaryotic transcription units which may occur in multiple copies or in either orientation; consensus=GGGCGG.

[citation, db_xref, evidence, gene, label, map, note, usedin]

gene

Region of biological interest identified as a gene and for which a name has been assigned.

[allele, citation, db_xref, evidence, function, label, map, note, product, pseudo, phenotype, standard_name, usedin]

iDNA

Intervening DNA; DNA which is eliminated through any of several kinds of recombination.

[citation, db_xref, evidence, function, label, gene, map, note, number, standard_name, usedin]

intron

A segment of DNA that is transcribed, but removed from within the transcript by splicing together the sequences (exons) on either side of it.

[allele, citation, cons_splice, db_xref, evidence, function, gene, label, map, note, number, standard_name, usedin]

J_segment

Joining segment of immunoglobulin light and heavy chains and T-cell receptor alpha, beta, and gamma chains.

[citation, db_xref, evidence, gene, map, note, product, pseudo, standard_name, usedin]

LTR

Long terminal repeat, a sequence directly repeated at both ends of a defined sequence, of the sort typically found in retroviruses.

[citation, db_xref, evidence, function, gene, label, map, note, standard_name, usedin]

mat_peptide

Mature peptide or protein coding sequence; coding sequence for the mature or final peptide or protein product following post-translational modification; the location does not include the stop codon (unlike the corresponding CDS).

[citation, db_xref, EC_number, evidence, function, gene, label, map, note, product, pseudo, standard_name, usedin]

misc_binding

Site in nucleic acid which covalently or non-covalently binds another moiety that cannot be described by any other binding key (primer_bind or protein_bind).

[citation, bound_moiety, db_xref, evidence, function, gene, label, map, note, usedin]

misc_difference

Feature sequence is different from that presented in the entry and cannot be described by any other Difference key (conflict, unsure, old_sequence, mutation, or modified_base).

[citation, clone, db_xref, evidence, gene, label, map, note, phenotype, replace, standard_name, usedin]

misc_feature

Region of biological interest which cannot be described by any other feature key; a new or rare feature.

[citation, db_xref, evidence, function, gene, label, map, note, number, phenotype, product, pseudo, standard_name, usedin]

misc_recomb

Site of any generalized, site-specific or replicative recombination event where there is a breakage and reunion of duplex DNA that cannot be described by other recombination keys (iDNA and virion) or qualifiers of source key (/insertion_seq, /transposon, /proviral).

[citation, db_xref, evidence, gene, label, map, note, organism, standard_name, usedin]

misc_RNA

Any transcript or RNA product that cannot be defined by other RNA keys (prim_transcript, precursor_RNA, mRNA, 5' clip, 3' clip, 5' UTR, 3' UTR, exon, CDS, sig_peptide, transit_peptide, mat_peptide, intron, polyA_site, rRNA, tRNA, scRNA, and snRNA).

[citation, db_xref, evidence, function, gene, label, map, note, product, standard_name, usedin]

misc_signal

Any region containing a signal controlling or altering gene function or expression that cannot be described by other signal keys (promoter, CAAT_signal, TATA_signal, -35_signal, -10_signal, GC_signal, RBS, polyA_signal, enhancer, attenuator, terminator, and rep_origin).

[citation, db_xref, evidence, function, gene, label, map, note, phenotype, standard_name, usedin]

misc_structure

Any secondary or tertiary nucleotide structure or conformation that cannot be described by other Structure keys (stem_loop and D-loop).

[citation, db_xref, evidence, function, gene, label, map, note, standard_name, usedin]

modified_base

The indicated nucleotide is a modified nucleotide and should be substituted for by the indicated molecule (given in the mod_base qualifier value).

[citation, db_xref, evidence, frequency, gene, label, map, mod_base, note, usedin]

mRNA

Messenger RNA; includes 5' untranslated region (5'UTR), coding sequences (CDS, exon) and 3' untranslated region (3'UTR).

[allele, citation, db_xref, evidence, function, gene, label, map, note, product, pseudo, standard_name, usedin]

N_region

Extra nucleotides inserted between rearranged immunoglobulin segments.

[citation, db_xref, evidence, gene, label, map, note, product, pseudo, standard_name, usedin]

old_sequence

The presented sequence revises a previous version of the sequence at this location.

[citation, db_xref, evidence, gene, label, map, note, replace, usedin]

polyA_signal

Recognition region necessary for endonuclease cleavage of an RNA transcript that is followed by polyadenylation; consensus=AATAAA.

[citation, db_xref, evidence, gene, label, map, note, usedin]

polyA_site

Site on an RNA transcript to which adenine residues will be added by post-transcriptional polyadenylation.

[citation, db_xref, evidence, gene, label, map, note, usedin]

precursor_RNA

Any RNA species that is not yet the mature RNA product; may include 5' clipped region (5'clip), 5' untranslated region (5'UTR), coding sequences (CDS, exon), intervening sequences (intron), 3' untranslated region (3'UTR), and 3' clipped region (3'clip).

[allele, citation, db_xref, evidence, function, gene, label, map, note, product, standard_name, usedin]

prim_transcript

Primary (initial, unprocessed) transcript; includes 5' clipped region (5'clip), 5' untranslated region (5'UTR), coding sequences (CDS, exon), intervening sequences (intron), 3' untranslated region (3'UTR), and 3' clipped region (3'clip).

[allele, citation, db_xref, evidence, function, gene, label, map, note, standard_name, usedin]

primer_bind

Non-covalent primer binding site for initiation of replication, transcription, or reverse transcription; includes site(s) for synthetic elements (e.g., PCR primers).

[citation, db_xref, evidence, gene, label, map, note, standard_name, PCR_conditions, usedin]

promoter

Region on a DNA molecule involved in RNA polymerase binding to initiate transcription.

[citation, db_xref, evidence, gene, function, label, map, note, phenotype, pseudo, standard_name, usedin]

protein_bind

Non-covalent protein binding site on nucleic acid.

[bound_moiety, citation, db_xref, evidence, function, gene, label, map, note, standard_name, usedin]

RBS

Ribosome binding site.

[citation, db_xref, evidence, gene, label, map, note, standard_name, usedin]

repeat_region

Region of genome containing repeating units.

[citation, db_xref, evidence, function, gene, insertion_seq, label, map, note, rpt_family, rpt_type, rpt_unit, standard_name, transposon, usedin]

repeat_unit

Single repeat element.

[citation, db_xref, evidence, function, gene, label, map, note, rpt_family, rpt_type, rpt_unit, usedin]

rep_origin

Origin of replication; starting site for duplication of nucleic acid to give two identical copies.

[citation, db_xref, direction, evidence, gene, label, map, note, standard_name, usedin]

rRNA

Mature ribosomal RNA; RNA component of the ribonucleoprotein particle (ribosome) which assembles amino acids into proteins.

[citation, db_xref, evidence, function, gene, label, map, note, product, pseudo, standard_name, usedin]

S_region

Switch region of immunoglobulin heavy chains; involved in the rearrangement of heavy chain DNA leading to the expression of a different immunoglobulin class from the same B-cell.

[citation, db_xref, evidence, gene, label, map, note, product, pseudo, standard_name, usedin]

satellite

Many tandem repeats (identical or related) of a short basic repeating unit; many have a base composition or other property different from the genome average that allows them to be separated from the bulk (main band) genomic DNA.

[citation, db_xref, evidence, gene, label, map, note, rpt_type, rpt_family, rpt_unit, standard_name, usedin]

scRNA

Small cytoplasmic RNA; any one of several small cytoplasmic RNA molecules present in the cytoplasm and (sometimes) nucleus of a eukaryote.

[citation, db_xref, evidence, function, gene, label, map, note, product, pseudo, standard_name, usedin]

sig_peptide

Signal peptide coding sequence; coding sequence for an N-terminal domain of a secreted protein; this domain is involved in attaching nascent polypeptide to the membrane leader sequence.

[citation, db_xref, evidence, function, gene, label, map, note, product, pseudo, standard_name, usedin]

snRNA

Small nuclear RNA molecules involved in pre-mRNA splicing and processing.

[citation, db_xref, evidence, function, gene, label, map, note, partial, product, pseudo, standard_name, usedin]

snoRNA

Small nucleolar RNA molecules mostly involved in rRNA modification and processing.

[citation, db_xref, evidence, function, gene, label, map, note, partial, product, pseudo, standard_name, usedin]

source

Identifies the biological source of the specified span of the sequence; this key is mandatory; more than one source key per sequence is permissible; every entry will have, as a minimum, a single source key spanning the entire sequence or multiple source keys together spanning the entire sequence.

[cell_line, cell_type, chromosome, citation, clone, clone_lib, country, cultivar, db_xref, dev_stage, environmental_sample, focus, frequency, germline, haplotype, lab_host, insertion_seq, isolate, isolation_source, label, macronuclear, map, note, organelle, organism, plasmid, pop_variant, proviral, rearranged, sequenced_mol, serotype, serovar, sex, specimen_voucher, specific_host, strain, sub_clone, sub_species, sub_strain, tissue_lib, tissue_type, transgenic, transposon, usedin, variety, virion]

stem_loop

Hairpin; a double-helical region formed by base-pairing between adjacent (inverted) complementary sequences in a single strand of RNA or DNA.

[citation, db_xref, evidence, function, gene, label, map, note, standard_name, usedin]

STS

Sequence tagged site; short, single-copy DNA sequence that characterizes a mapping landmark on the genome and can be detected by PCR; a region of the genome can be mapped by determining the order of a series of STSs.

[citation, db_xref, evidence, gene, label, note, map, standard_name, usedin]

TATA_signal

TATA box; Goldberg-Hogness box; a conserved AT-rich heptamer found about 25 bp before the start point of each eukaryotic RNA polymerase II transcript unit which may be involved in positioning the enzyme for correct initiation; consensus=TATA(A or T)A(A or T).

[citation, db_xref, evidence, gene, label, map, note, usedin]

terminator

Sequence of DNA located at the end of the transcript that causes RNA polymerase to terminate transcription.

[citation, db_xref, evidence, gene, label, map, note, standard_name, usedin]

transit_peptide

Transit peptide coding sequence; coding sequence for an N-terminal domain of a nuclear-encoded organellar protein; this domain is involved in post-translational import of the protein into the organelle.

[citation, db_xref, evidence, function, gene, label, map, note, product, pseudo, standard_name, usedin]

tRNA

Mature transfer RNA, a small RNA molecule (75-85 bases long) that mediates the translation of a nucleic acid sequence into an amino acid sequence.

[anticodon, citation, db_xref, evidence, function, gene, label, map, note, product, pseudo, standard_name, usedin]

unsure

Author is unsure of exact sequence in this region.

[citation, db_xref, evidence, gene, label, map, note, replace, usedin]

V_region

Variable region of immunoglobulin light and heavy chains, and T-cell receptor alpha, beta, and gamma chains; codes for the variable amino terminal portion; can be composed of V_segments, D_segments, N_regions, and J_segments.

[citation, db_xref, evidence, gene, label, map, note, product, pseudo, standard_name, usedin]

V_segment

Variable segment of immunoglobulin light and heavy chains, and T-cell receptor alpha, beta, and gamma chains; codes for most of the variable region (V_region) and the last few amino acids of the leader peptide.

[citation, db_xref, evidence, gene, label, map, note, product, pseudo, standard_name, usedin]

variation

A related strain contains stable mutations from the same gene (e.g., RFLPs, polymorphisms, etc.) which differ from the presented sequence at this location (and possibly others).

[allele, citation, db_xref, evidence, frequency, gene, label, map, note, phenotype, product, replace, standard_name, usedin]

3' clip

3'-most region of a precursor transcript that is clipped off during processing.

[allele, citation, db_xref, evidence, function, gene, label, map, note, standard_name, usedin]

3' UTR

Region at the 3' end of a mature transcript (following the stop codon) that is not translated into a protein.

[allele, citation, db_xref, evidence, function, gene, label, map, note, standard_name, usedin]

5' clip

5'-most region of a precursor transcript that is clipped off during processing.

[allele, citation, db_xref, evidence, function, gene, label, map, note, partial, standard_name, usedin]

5' UTR

Region at the 5' end of a mature transcript (preceding the initiation codon) that is not translated into a protein.

[allele, citation, db_xref, evidence, function, gene, label, map, note, partial, standard_name, usedin]

-10_signal

Pribnow box; a conserved region about 10 bp upstream of the start point of bacterial transcription units which may be involved in binding RNA polymerase; consensus=TAtAaT.

[citation, db_xref, evidence, gene, label, map, note, standard_name, usedin]

-35_signal

A conserved hexamer about 35 bp upstream of the start point of bacterial transcription units; consensus=TTGACa or TGTTGACA.

[citation, db_xref, evidence, gene, label, map, note, standard_name, usedin]

-

"-" is a placeholder for no key; it should be used when the need is merely to mark a region in order to comment on it or to use it in another feature's location.

[citation, db_xref, evidence, function, gene, label, map, note, number, phenotype, product, pseudo, standard_name, usedin]
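The bracketed lists in Table 2-3 lend themselves to a simple lookup for checking annotations. The Python sketch below copies the qualifier lists of three feature keys (polyA_signal, RBS, and tRNA) from the table and reports any qualifier that is not listed for a given key. The names ALLOWED_QUALIFIERS and unexpected_qualifiers are hypothetical, and the dictionary covers only these three keys; it is an illustration of the idea, not a validator for the full table.

```python
# Qualifier lists copied from Table 2-3 for three feature keys only.
ALLOWED_QUALIFIERS = {
    "polyA_signal": {
        "citation", "db_xref", "evidence", "gene", "label", "map",
        "note", "usedin",
    },
    "RBS": {
        "citation", "db_xref", "evidence", "gene", "label", "map",
        "note", "standard_name", "usedin",
    },
    "tRNA": {
        "anticodon", "citation", "db_xref", "evidence", "function",
        "gene", "label", "map", "note", "product", "pseudo",
        "standard_name", "usedin",
    },
}


def unexpected_qualifiers(feature_key, qualifiers):
    """Return the qualifiers not listed for feature_key, sorted by name."""
    if feature_key not in ALLOWED_QUALIFIERS:
        raise KeyError(f"feature key not in this sketch: {feature_key}")
    return sorted(set(qualifiers) - ALLOWED_QUALIFIERS[feature_key])


ok = unexpected_qualifiers("tRNA", ["gene", "anticodon"])
bad = unexpected_qualifiers("polyA_signal", ["gene", "standard_name", "anticodon"])
print(ok)
print(bad)
```

Using sets keeps the check to a single difference operation, and extending the sketch to more feature keys only requires adding entries to the dictionary.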

2.7.2 Qualifiers

A qualifier is auxiliary information about a feature. A feature can have one or more qualifiers. Some features have mandatory qualifiers, while others need no qualifier at all. Table 2-4 lists all DDBJ/EMBL/GenBank qualifiers.

Table 2-4. DDBJ/EMBL/GenBank qualifier table

/<qualifier>=

Description

/allele=

Name of the allele for the given gene.

/anticodon=

Location of the anticodon of tRNA and the amino acid for which it codes.

/bound_moiety=

Moiety bound.

/cell_line=

Cell line from which the sequence was obtained.

/cell_type=

Cell type from which the sequence was obtained.

/chromosome=

Chromosome (e.g., chromosome number) from which the sequence was obtained.

/citation=

Reference to a citation listed in the entry reference field.

/clone=

Clone from which the sequence was obtained.

/clone_lib=

Clone library from which the sequence was obtained.

/codon=

Specifies a codon which is different from any found in the reference genetic code.

/codon_start=

Indicates the offset at which the first complete codon of a coding feature can be found, relative to the first base of that feature.

/cons_splice=

Indicates whether the splice sites of an intron conform to the 5'-GT ... AG-3' splice site consensus.

/country=

Country of origin for DNA sample, intended for epidemiological or population studies.

/cultivar=

Cultivar (cultivated variety) of plant from which sequence was obtained.

/db_xref=

Database cross-reference: pointer to related information in another database.

/dev_stage=

If the sequence was obtained from an organism in a specific developmental stage, it is specified with this qualifier.

/direction=

Direction of DNA replication.

/EC_number=

Enzyme Commission number for enzyme product of sequence.

/environmental_sample

Identifies sequences derived by direct molecular isolation (PCR, DGGE, or other anonymous methods) from an environmental sample with no reliable identification of the source organism.

/evidence=

Value indicating the nature of supporting evidence, distinguishing between experimentally determined and theoretically derived data.

/exception=

Indicates that the amino acid or RNA sequence will not translate or agree with the DNA sequence according to standard biological rules.

/focus

Defines the source feature of primary biological interest for records that have multiple source features originating from different organisms.

/frequency=

Frequency of the occurrence of a feature.

/function=

Function attributed to a sequence.

/gene=

Symbol of the gene corresponding to a sequence region.

/germline

If the sequence shown is DNA and a member of the immunoglobulin family, this qualifier is used to denote that the sequence is from unrearranged DNA.

/haplotype=

Haplotype of organism from which the sequence was obtained.

/insertion_seq=

Insertion sequence element from which the sequence was obtained.

/isolate=

Individual isolate from which the sequence was obtained.

/isolation_source=

Describes the physical, environmental and/or local geographical source of the biological sample from which the sequence was derived.

/label=

A label used to permanently tag a feature.

/lab_host=

Laboratory host used to propagate the organism from which the sequence was obtained.

/map=

Genomic map position of feature.

/macronuclear

If the sequence shown is DNA and from an organism which undergoes chromosomal differentiation between macronuclear and micronuclear stages, this qualifier is used to denote that the sequence is from macronuclear DNA.

/mod_base=

Abbreviation for a modified nucleotide base.

/note=

Any comment or additional information.

/number=

A number to indicate the order of genetic elements (e.g., exons or introns) in the 5' to 3' direction.

/organelle=

Type of membrane-bound intracellular structure from which the sequence was obtained.

/organism=

Scientific name of the organism that provided the sequenced genetic material.

/partial

Differentiates between complete and partial regions.

/PCR_conditions=

Description of reaction conditions and components for PCR.

/phenotype=

Phenotype conferred by the feature.

/pop_variant=

Population variant from which the sequence was obtained.

/plasmid=

Name of plasmid from which sequence was obtained.

/product=

Name of a product encoded by a sequence.

/protein_id=

Protein identifier issued by the international collaborators; it consists of a stable ID portion (3+5 format: three letters followed by five digits) plus a version number after the decimal point.

/proviral

Denotes that the sequence shown is viral and integrated into another organism's genome.

/pseudo

Indicates that this feature is a non-functional version of the element named by the feature key.

/rearranged

If the sequence shown is DNA and a member of the immunoglobulin family, this qualifier denotes that the sequence is from rearranged DNA.

/replace=

Indicates that the sequence identified by a feature's intervals is replaced by the sequence shown in "text".

/rpt_family=

Type of repeated sequence; "Alu" or "Kpn", for example.

/rpt_type=

Organization of repeated sequence.

/rpt_unit=

Identity of repeat unit which constitutes a repeat_region.

/sequenced_mol=

Molecule from which the sequence was obtained.

/serotype=

Serological variety of a species characterized by its antigenic properties.

/serovar=

Serological variety of a species (usually a prokaryote) characterized by its antigenic properties.

/sex=

Sex of the organism from which the sequence was obtained.

/specific_host=

Natural host from which the sequence was obtained.

/specimen_voucher=

An identifier of the individual or collection of the source organism and the place where it is currently stored, usually an institution.

/standard_name=

Accepted standard name for this feature.

/strain=

Strain from which sequence was obtained.

/sub_clone=

Sub-clone from which sequence was obtained.

/sub_species=

Name of sub-species of organism from which sequence was obtained.

/sub_strain=

Sub-strain from which sequence was obtained.

/tissue_lib=

Tissue library from which sequence was obtained.

/tissue_type=

Tissue type from which the sequence was obtained.

/transgenic

Identifies the source feature of the organism which was the recipient of transgenic DNA.

/translation=

Automatically generated one-letter abbreviated amino acid sequence derived from either the universal genetic code or the table as specified in /transl_table and as determined by exceptions in the /transl_except and /codon qualifiers.

/transl_except=

Translational exception: single codon the translation of which does not conform to genetic code defined by Organism and /codon=.

/transl_table=

Definition of genetic code table used if other than universal genetic code table (Tables are described in Appendix B).

/transposon=

Transposable element from which the sequence was obtained.

/usedin=

Indicates that the feature is used in a compound feature in another entry.

/variety=

Name of variety (formal Linnean rank) of organism from which the sequence was obtained; use the /cultivar qualifier for cultivated plant varieties.

/virion

Viral genomic sequence as it is encapsidated (distinguished from its proviral form integrated in a host cell's chromosome).
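Table 2-4 shows two shapes of qualifier: those that take a value (written with a trailing "=") and those that act as flags (such as /pseudo or /germline). The Python sketch below separates the two and applies /codon_start to a coding sequence. It assumes each qualifier sits on a single line, that text values may be wrapped in double quotes, and that /codon_start counts from 1, so a value of 2 means the first complete codon begins at the second base. The function names and the sample values are hypothetical.

```python
import re

# "/name" optionally followed by "=value"; names use letters, digits, "_".
QUALIFIER_RE = re.compile(r"^/(\w+)(?:=(.*))?$")


def parse_qualifier(line):
    """Return (name, value); flag qualifiers get the value True."""
    match = QUALIFIER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a qualifier: {line!r}")
    name, value = match.group(1), match.group(2)
    if value is None:
        return name, True
    # Strip one pair of surrounding double quotes, if present.
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        value = value[1:-1]
    return name, value


def trim_to_first_codon(sequence, codon_start):
    """Drop the bases that precede the first complete codon."""
    # Assumption: codon_start is 1-based (1, 2, or 3).
    return sequence[int(codon_start) - 1:]


gene = parse_qualifier('/gene="lacZ"')
flag = parse_qualifier("/pseudo")
name, start = parse_qualifier("/codon_start=2")
trimmed = trim_to_first_codon("gatgaaa", start)
print(gene, flag, trimmed)
```

Qualifier values that wrap across several feature table lines, such as long /translation strings, would need to be joined before being passed to parse_qualifier.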

2.7.3 Locations

A location is an instruction for finding a feature in a sequence. A list of DDBJ/EMBL/GenBank locations is presented in Table 2-5.

Table 2-5. DDBJ/EMBL/GenBank location examples

Location

Description

467

Points to a single base in the presented sequence.

340..565

Points to a continuous range of bases bounded by and including the starting and ending bases.

<345..500

Indicates that the exact lower boundary point of a feature is unknown. The location begins at some base previous to the first base specified (which need not be contained in the presented sequence) and continues to and includes the ending base.

<1..888

The feature starts before the first sequenced base and continues to and includes base 888.

(102.110)

Indicates that the exact location is unknown but that it is one of the bases between bases 102 and 110, inclusive.

(23.45)..600

Specifies that the starting point is one of the bases between bases 23 and 45, inclusive, and the end point is base 600.

(122.133)..(204.221)

The feature starts at a base between 122 and 133, inclusive, and ends at a base between 204 and 221, inclusive.

123^124

Points to a site between bases 123 and 124.

145^177

Points to a site between two adjacent bases anywhere between bases 145 and 177.

join(12..78,134..202)

Regions 12 to 78 and 134 to 202 should be joined to form one contiguous sequence.

complement(join(2691..4571,4918..5163))

Joins regions 2691 to 4571 and 4918 to 5163, then complements the joined segments (the feature is on the strand complementary to the presented strand).

join(complement(4918..5163),complement(2691..4571))

Complements regions 4918 to 5163 and 2691 to 4571, then joins the complemented segments (the feature is on the strand complementary to the presented strand).

complement(34..(122.126))

Start at one of the bases complementary to those between 122 and 126 on the presented strand and finish at the base complementary to base 34 (the feature is on the strand complementary to the presented strand).

J00194:100..202

Points to bases 100 to 202, inclusive, in the entry (in this database) with primary accession number "J00194".
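Several of the location forms in Table 2-5 can be turned into explicit coordinates with a short recursive function. The Python sketch below handles single bases, ranges, a leading "<", join(), and complement(), and returns a list of (start, end, strand) tuples in the order in which the segments should be read. It is a partial illustration under stated assumptions: the "<" marker is accepted but discarded, and the uncertain forms such as (102.110), the between-base form 123^124, and references to other entries such as J00194:100..202 are deliberately rejected. The names parse_location and split_top_level are hypothetical.

```python
import re

# A single base ("467") or a range ("340..565"), optionally with "<".
SIMPLE_RE = re.compile(r"^<?(\d+)(?:\.\.(\d+))?$")


def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_location(location, strand=1):
    """Return [(start, end, strand), ...] in reading order."""
    location = location.strip()
    if location.startswith("complement(") and location.endswith(")"):
        inner = parse_location(location[len("complement("):-1], -strand)
        # On the complementary strand the segments are read in reverse.
        return inner[::-1]
    if location.startswith("join(") and location.endswith(")"):
        segments = []
        for part in split_top_level(location[len("join("):-1]):
            segments.extend(parse_location(part, strand))
        return segments
    match = SIMPLE_RE.match(location)
    if match is None:
        raise ValueError(f"location form not handled here: {location}")
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    return [(start, end, strand)]


joined = parse_location("join(12..78,134..202)")
form_a = parse_location("complement(join(2691..4571,4918..5163))")
form_b = parse_location("join(complement(4918..5163),complement(2691..4571))")
print(joined)
print(form_a == form_b)
```

The last two calls illustrate the point made in the table: complementing a join and joining the complemented segments in reverse order describe the same feature on the complementary strand.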