2.7 DDBJ/EMBL/GenBank Feature Table

In February 1986, GenBank and EMBL (joined by DDBJ in 1987) started a collaborative effort to create a common feature table format. The overall objective of the feature table was to supply an in-depth vocabulary for describing nucleotide (and protein) features. We're using Version 4 of the feature table.

2.7.1 Features

A feature is a single word or abbreviation indicating a functional role or region associated with a sequence. A list of DDBJ/EMBL/GenBank features is presented in Table 2-3. In the Definition column of the table, the appropriate qualifiers for each feature are in brackets. Mandatory qualifiers are highlighted in bold.

Table 2-3. DDBJ/EMBL/GenBank feature key table

Feature Key | Definition | attenuator | 1) region of DNA at which regulation of termination of transcription occurs, which controls the expression of some bacterial operons. 2) sequence segment located between the promoter and the first structural gene that causes partial termination of transcription. [citation, db_xref, evidence, gene, label, map, note, phenotype, usedin] | C_region | Constant region of immunoglobulin light and heavy chains, and T-cell receptor alpha, beta, and gamma chains; includes one or more exons depending on the particular chain. [citation, db_xref, evidence, gene, label, map, note, product, pseudo, standard_name, usedin] | CAAT_signal | CAAT box; part of a conserved sequence located about 75 bp upstream of the start point of eukaryotic transcription units which may be involved in RNA polymerase binding; consensus=GG (C or T) CAATCT. [citation, db_xref, evidence, gene, label, map, note, usedin] | CDS | Coding sequence; sequence of nucleotides that corresponds with the sequence of amino acids in a protein (location includes stop codon); feature includes amino acid conceptual translation. 
[allele, citation, codon, codon_start, db_xref, EC_number, evidence, exception, function, gene, label, map, note, number, product, protein_id, pseudo, standard_name, translation, transl_except, transl_table, usedin] | conflict | Independent determinations of the "same" sequence differ at this site or region. [citation, db_xref, evidence, label, map, note, gene, replace, usedin] | D-loop | Displacement loop; a region within mitochondrial DNA in which a short stretch of RNA is paired with one strand of DNA, displacing the original partner DNA strand in this region. Also used to describe the displacement of a region of one strand of duplex DNA by a single stranded invader in the reaction catalyzed by RecA protein. [citation, db_xref, evidence, gene, label, map, note, usedin] | D_segment | Diversity segment of immunoglobulin heavy chain, and T-cell receptor beta chain [citation, db_xref, evidence, gene, label, map, note, product, pseudo, standard_name, usedin] | enhancer | A cis-acting sequence that increases the utilization of (some) eukaryotic promoters, and can function in either orientation and in any location (upstream or downstream) relative to the promoter. [citation, db_xref, evidence, gene, label, map, note, standard_name, usedin] | exon | Region of genome that codes for portion of spliced mRNA, rRNA and tRNA; may contain 5' UTR, all CDSs, and 3' UTR. [allele, citation, db_xref, EC_number, evidence, function, gene, label, map, note, number, product, pseudo, standard_name, usedin] | GC_signal | GC box; a conserved GC-rich region located upstream of the start point of eukaryotic transcription units which may occur in multiple copies or in either orientation; consensus=GGGCGG. [citation, db_xref, evidence, gene, label, map, note, usedin] | gene | Region of biological interest identified as a gene and for which a name has been assigned. 
[allele, citation, db_xref, evidence, function, label, map, note, product, pseudo, phenotype, standard_name, usedin] | iDNA | Intervening DNA; DNA which is eliminated through any of several kinds of recombination. [citation, db_xref, evidence, function, label, gene, map, note, number, standard_name, usedin] | intron | A segment of DNA that is transcribed, but removed from within the transcript by splicing together the sequences (exons) on either side of it. [allele, citation, cons_splice, db_xref, evidence, function, gene, label, map, note, number, standard_name, usedin] | J_segment | Joining segment of immunoglobulin light and heavy chains and T-cell receptor alpha, beta, and gamma chains. [citation, db_xref, evidence, gene, map, note, product, pseudo, standard_name, usedin] | LTR | Long terminal repeat, a sequence directly repeated at both ends of a defined sequence, of the sort typically found in retroviruses. [citation, db_xref, evidence, function, gene, label, map, note, standard_name, usedin] | mat_peptide | Mature peptide or protein coding sequence; coding sequence for the mature or final peptide or protein product following post-translational modification; the location does not include the stop codon (unlike the corresponding CDS). [citation, db_xref, EC_number, evidence, function, gene, label, map, note, product, pseudo, standard_name, usedin] | misc_binding | Site in nucleic acid which covalently or non-covalently binds another moiety that cannot be described by any other binding key (primer_bind or protein_bind). [citation, bound_moiety, db_xref, evidence, function, gene, label, map, note, usedin] | misc_difference | Feature sequence is different from that presented in the entry and cannot be described by any other Difference key (conflict, unsure, old_sequence, mutation, or modified_base). 
[citation, clone, db_xref, evidence, gene, label, map, note, phenotype, replace, standard_name, usedin] | misc_feature | Region of biological interest which cannot be described by any other feature key; a new or rare feature. [citation, db_xref, evidence, function, gene, label, map, note, number, phenotype, product, pseudo, standard_name, usedin] | misc_recomb | Site of any generalized, site-specific or replicative recombination event where there is a breakage and reunion of duplex DNA that cannot be described by other recombination keys (iDNA and virion) or qualifiers of source key (/insertion seq, /transposon, /proviral). [citation, db_xref, evidence, gene, label, map, note, organism, standard_name, usedin] | misc_RNA | Any transcript or RNA product that cannot be defined by other RNA keys (prim_transcript, precursor_RNA, mRNA, 5' clip, 3' clip, 5' UTR, 3' UTR, exon, CDS, sig_peptide, transit_peptide, mat_peptide, intron, polyA_site, rRNA, tRNA, scRNA, and snRNA). [citation, db_xref, evidence, function, gene, label, map, note, product, standard_name, usedin] | misc_signal | Any region containing a signal controlling or altering gene function or expression that cannot be described by other signal keys (promoter, CAAT_signal, TATA_signal, -35_signal, -10_signal, GC_signal, RBS, polyA_signal, enhancer, attenuator, terminator, and rep_origin). [citation, db_xref, evidence, function, gene, label, map, note, phenotype, standard_name, usedin] | misc_structure | Any secondary or tertiary nucleotide structure or conformation that cannot be described by other Structure keys (stem_loop and D-loop). [citation, db_xref, evidence, function, gene, label, map, note, standard_name, usedin] | modified_base | The indicated nucleotide is a modified nucleotide and should be substituted for by the indicated molecule (given in the mod_base qualifier value). 
[citation, db_xref, evidence, frequency, gene, label, map, mod_base, note, usedin] | mRNA | Messenger RNA; includes 5' untranslated region (5'UTR), coding sequences (CDS, exon) and 3' untranslated region (3'UTR). [allele, citation, db_xref, evidence, function, gene, label, map, note, product, pseudo, standard_name, usedin] | N_region | Extra nucleotides inserted between rearranged immunoglobulin segments. [citation, db_xref, evidence, gene, label, map, note, product, pseudo, standard_name, usedin] | old_sequence | The presented sequence revises a previous version of the sequence at this location. [citation, db_xref, evidence, gene, label, map, note, replace, usedin] | polyA_signal | Recognition region necessary for endonuclease cleavage of an RNA transcript that is followed by polyadenylation; consensus=AATAAA. [citation, db_xref, evidence, gene, label, map, note, usedin] | polyA_site | Site on an RNA transcript to which adenine residues will be added by post-transcriptional polyadenylation. [citation, db_xref, evidence, gene, label, map, note, usedin] | precursor_RNA | Any RNA species that is not yet the mature RNA product; may include 5' clipped region (5'clip), 5' untranslated region (5'UTR), coding sequences (CDS, exon), intervening sequences (intron), 3' untranslated region (3'UTR), and 3' clipped region (3'clip). [allele, citation, db_xref, evidence, function, gene, label, map, note, product, standard_name, usedin] | prim_transcript | Primary (initial, unprocessed) transcript; includes 5' clipped region (5'clip), 5' untranslated region (5'UTR), coding sequences (CDS, exon), intervening sequences (intron), 3' untranslated region (3'UTR), and 3' clipped region (3'clip). [allele, citation, db_xref, evidence, function, gene, label, map, note, standard_name, usedin] | primer_bind | Non-covalent primer binding site for initiation of replication, transcription, or reverse transcription; includes site(s) for synthetic (e.g., PCR primer) elements. 
[citation, db_xref, evidence, gene, label, map, note, standard_name, PCR_conditions, usedin] | promoter | Region on a DNA molecule involved in RNA polymerase binding to initiate transcription. [citation, db_xref, evidence, gene, function, label, map, note, phenotype, pseudo, standard_name, usedin] | protein_bind | Non-covalent protein binding site on nucleic acid. [bound_moiety, citation, db_xref, evidence, function, gene, label, map, note, standard_name, usedin] | RBS | Ribosome binding site. [citation, db_xref, evidence, gene, label, map, note, standard_name, usedin] | repeat_region | Region of genome containing repeating units. [citation, db_xref, evidence, function, gene, insertion_seq, label, map, note, rpt_family, rpt_type, rpt_unit, standard_name, transposon, usedin] | repeat_unit | Single repeat element. [citation, db_xref, evidence, function, gene, label, map, note, rpt_family, rpt_type, rpt_unit, usedin] | rep_origin | Origin of replication; starting site for duplication of nucleic acid to give two identical copies. [citation, db_xref, direction, evidence, gene, label, map, note, standard_name, usedin] | rRNA | Mature ribosomal RNA ; RNA component of the ribonucleoprotein particle (ribosome) which assembles amino acids into proteins. [citation, db_xref, evidence, function, gene, label, map, note, product, pseudo, standard_name, usedin] | S_region | Switch region of immunoglobulin heavy chains; involved in the rearrangement of heavy chain DNA leading to the expression of a different immunoglobulin class from the same B-cell. [citation, db_xref, evidence, gene, label, map, note, product, pseudo, standard_name, usedin] | satellite | Many tandem repeats (identical or related) of a short basic repeating unit; many have a base composition or other property different from the genome average that allows them to be separated from the bulk (main band) genomic DNA. 
[citation, db_xref, evidence, gene, label, map, note, rpt_type, rpt_family, rpt_unit, standard_name, usedin] | scRNA | Small cytoplasmic RNA; any one of several small cytoplasmic RNA molecules present in the cytoplasm and (sometimes) nucleus of a eukaryote. [citation, db_xref, evidence, function, gene, label, map, note, product, pseudo, standard_name, usedin] | sig_peptide | Signal peptide coding sequence; coding sequence for an N-terminal domain of a secreted protein; this domain is involved in attaching nascent polypeptide to the membrane leader sequence. [citation, db_xref, evidence, function, gene, label, map, note, product, pseudo, standard_name, usedin] | snRNA | Small nuclear RNA molecules involved in pre-mRNA splicing and processing. [citation, db_xref, evidence, function, gene, label, map, note, partial, product, pseudo, standard_name, usedin] | snoRNA | Small nucleolar RNA molecules mostly involved in rRNA modification and processing. [citation, db_xref, evidence, function, gene, label, map, note, partial, product, pseudo, standard_name, usedin] | source | Identifies the biological source of the specified span of the sequence; this key is mandatory; more than one source key per sequence is permissible; every entry will have, as a minimum, a single source key spanning the entire sequence or multiple source keys together spanning the entire sequence. 
[cell_line, cell_type, chromosome, citation, clone, clone_lib, country, cultivar, db_xref, dev_stage, environmental_sample, focus, frequency, germline, haplotype, lab_host, insertion_seq, isolate, isolation_source, label, macronuclear, map, note, organelle, organism, plasmid, pop_variant, proviral, rearranged, sequenced_mol, serotype, serovar, sex, specimen_voucher, specific_host, strain, sub_clone, sub_species, sub_strain, tissue_lib, tissue_type, transgenic, transposon, usedin, variety, virion] | stem_loop | Hairpin; a double-helical region formed by base-pairing between adjacent (inverted) complementary sequences in a single strand of RNA or DNA. [citation, db_xref, evidence, function, gene, label, map, note, standard_name, usedin] | STS | Sequence tagged site; short, single-copy DNA sequence that characterizes a mapping landmark on the genome and can be detected by PCR; a region of the genome can be mapped by determining the order of a series of STSs. [citation, db_xref, evidence, gene, label, note, map, standard_name, usedin] | TATA_signal | TATA box; Goldberg-Hogness box; a conserved AT-rich heptamer found about 25 bp before the start point of each eukaryotic RNA polymerase II transcript unit which may be involved in positioning the enzyme for correct initiation; consensus=TATA(A or T)A(A or T). [citation, db_xref, evidence, gene, label, map, note, usedin] | terminator | Sequence of DNA located at the end of the transcript that causes RNA polymerase to terminate transcription. [citation, db_xref, evidence, gene, label, map, note, standard_name, usedin] | transit_peptide | Transit peptide coding sequence; coding sequence for an N-terminal domain of a nuclear-encoded organellar protein; this domain is involved in post-translational import of the protein into the organelle. 
[citation, db_xref, evidence, function, gene, label, map, note, product, pseudo, standard_name, usedin] | tRNA | Mature transfer RNA, a small RNA molecule (75-85 bases long) that mediates the translation of a nucleic acid sequence into an amino acid sequence. [anticodon, citation, db_xref, evidence, function, gene, label, map, note, product, pseudo, standard_name, usedin] | unsure | Author is unsure of exact sequence in this region. [citation, db_xref, evidence, gene, label, map, note, replace, usedin] | V_region | Variable region of immunoglobulin light and heavy chains, and T-cell receptor alpha, beta, and gamma chains; codes for the variable amino terminal portion; can be composed of V_segments, D_segments, N_regions, and J_segments. [citation, db_xref, evidence, gene, label, map, note, product, pseudo, standard_name, usedin] | V_segment | Variable segment of immunoglobulin light and heavy chains, and T-cell receptor alpha, beta, and gamma chains; codes for most of the variable region (V_region) and the last few amino acids of the leader peptide. [citation, db_xref, evidence, gene, label, map, note, product, pseudo, standard_name, usedin] | variation | A related strain contains stable mutations from the same gene (e.g., RFLPs, polymorphisms, etc.) which differ from the presented sequence at this location (and possibly others). [allele, citation, db_xref, evidence, frequency, gene, label, map, note, phenotype, product, replace, standard_name, usedin] | 3' clip | 3'-most region of a precursor transcript that is clipped off during processing. [allele, citation, db_xref, evidence, function, gene, label, map, note, standard_name, usedin] | 3' UTR | Region at the 3' end of a mature transcript (following the stop codon) that is not translated into a protein. [allele, citation, db_xref, evidence, function, gene, label, map, note, standard_name, usedin] | 5' clip | 5'-most region of a precursor transcript that is clipped off during processing. 
[allele, citation, db_xref, evidence, function, gene, label, map, note, partial, standard_name, usedin] | 5' UTR | Region at the 5' end of a mature transcript (preceding the initiation codon) that is not translated into a protein. [allele, citation, db_xref, evidence, function, gene, label, map, note, partial, standard_name, usedin] | -10_signal | Pribnow box; a conserved region about 10 bp upstream of the start point of bacterial transcription units which may be involved in binding RNA polymerase; consensus=TAtAaT. [citation, db_xref, evidence, gene, label, map, note, standard_name, usedin] | -35_signal | A conserved hexamer about 35 bp upstream of the start point of bacterial transcription units; consensus=TTGACa or TGTTGACA. [citation, db_xref, evidence, gene, label, map, note, standard_name, usedin] | - | "-" is a placeholder for no key; should be used when the need is merely to mark a region in order to comment on it or to use it in another feature's location. [citation, db_xref, evidence, function, gene, label, map, note, number, phenotype, product, pseudo, standard_name, usedin] |

2.7.2 Qualifiers

A qualifier is auxiliary information about a feature. A feature can have one or more qualifiers. Some features require mandatory qualifiers, while others need no qualifier at all. Table 2-4 lists all DDBJ/EMBL/GenBank qualifiers.

Table 2-4. DDBJ/EMBL/GenBank qualifier table

/<qualifier>= | Description | /allele= | Name of the allele for the given gene. | /anticodon= | Location of the anticodon of tRNA and the amino acid for which it codes. | /bound_moiety= | Moiety bound. | /cell_line= | Cell line from which the sequence was obtained. | /cell_type= | Cell type from which the sequence was obtained. | /chromosome= | Chromosome (e.g., Chromosome number) from which the sequence was obtained. | /citation= | Reference to a citation listed in the entry reference field. | /clone= | Clone from which the sequence was obtained. 
| /clone_lib= | Clone library from which the sequence was obtained. | /codon= | Specifies a codon which is different from any found in the reference genetic code. | /codon_start= | Indicates the offset at which the first complete codon of a coding feature can be found, relative to the first base of that feature. | /cons_splice= | Differentiates between intron splice sites that conform to the 5'-GT ... AG-3' splice site consensus. | /country= | Country of origin for DNA sample, intended for epidemiological or population studies. | /cultivar= | Cultivar (cultivated variety) of plant from which sequence was obtained. | /db_xref= | Database cross-reference: pointer to related information in another database. | /dev_stage= | If the sequence was obtained from an organism in a specific developmental stage, it is specified with this qualifier. | /direction= | Direction of DNA replication. | /EC_number= | Enzyme Commission number for enzyme product of sequence. | /environmental_sample | Identifies sequences derived by direct molecular isolation (PCR, DGGE, or other anonymous methods) from an environmental sample with no reliable identification of the source organism. | /evidence= | Value indicating the nature of supporting evidence, distinguishing between experimentally determined and theoretically derived data. | /exception= | Indicates that the amino acid or RNA sequence will not translate or agree with the DNA sequence according to standard biological rules | /focus | Defines the source feature of primary biological interest for records that have multiple source features originating from different organisms. | /frequency= | Frequency of the occurrence of a feature. | /function= | Function attributed to a sequence. | /gene= | Symbol of the gene corresponding to a sequence region. | /germline | If the sequence shown is DNA and a member of the immunoglobulin family, this qualifier is used to denote that the sequence is from unrearranged DNA. 
| /haplotype= | Haplotype of organism from which the sequence was obtained. | /insertion_seq= | Insertion sequence element from which the sequence was obtained. | /isolate= | Individual isolate from which the sequence was obtained. | /isolation_source= | Describes the physical, environmental and/or local geographical source of the biological sample from which the sequence was derived. | /label= | A label used to permanently tag a feature. | /lab_host= | Laboratory host used to propagate the organism from which the sequence was obtained | /map= | Genomic map position of feature. | /macronuclear | If the sequence shown is DNA and from an organism which undergoes chromosomal differentiation between macronuclear and micronuclear stages, this qualifier is used to denote that the sequence is from macronuclear DNA. | /mod_base= | Abbreviation for a modified nucleotide base. | /note= | Any comment or additional information. | /number= | A number to indicate the order of genetic elements (e.g., exons or introns) in the 5' to 3' direction. | /organelle= | Type of membrane-bound intracellular structure from which the sequence was obtained. | /organism= | Scientific name of the organism that provided the sequenced genetic material. | /partial | Differentiates between complete and partial regions. | /PCR_conditions= | Description of reaction conditions and components for PCR. | /phenotype= | Phenotype conferred by the feature. | /pop_variant= | Population variant from which the sequence was obtained. | /plasmid= | Name of plasmid from which sequence was obtained. | /product= | Name of a product encoded by a sequence. | /protein_id= | Protein identifier, issued by International collaborators, this qualifier consists of a stable ID portion (3+5 format with 3 position letters and 5 numbers) plus a version number after the decimal point. | /proviral | Denotes that the sequence shown is viral and integrated into another organism's genome. 
| /pseudo | Indicates that this feature is a non-functional version of the element named by the feature key. | /rearranged | If the sequence shown is DNA and a member of the immunoglobulin family, this qualifier denotes that the sequence is from rearranged DNA. | /replace= | Indicates that the sequence identified in a feature's intervals is replaced by the sequence shown in "text". | /rpt_family= | Type of repeated sequence; "Alu" or "Kpn", for example. | /rpt_type= | Organization of repeated sequence. | /rpt_unit= | Identity of repeat unit which constitutes a repeat_region. | /sequenced_mol= | Molecule from which the sequence was obtained. | /serotype= | Serological variety of a species characterized by its antigenic properties. | /serovar= | Serological variety of a species (usually a prokaryote) characterized by its antigenic properties. | /sex= | Sex of the organism from which the sequence was obtained. | /specific_host= | Natural host from which the sequence was obtained. | /specimen_voucher= | An identifier of the individual or collection of the source organism and the place where it is currently stored, usually an institution. | /standard_name= | Accepted standard name for this feature. | /strain= | Strain from which sequence was obtained. | /sub_clone= | Sub-clone from which sequence was obtained. | /sub_species= | Name of sub-species of organism from which sequence was obtained. | /sub_strain= | Sub-strain from which sequence was obtained. | /tissue_lib= | Tissue library from which sequence was obtained. | /tissue_type= | Tissue type from which the sequence was obtained. | /transgenic | Identifies the source feature of the organism which was the recipient of transgenic DNA. | /translation= | Automatically generated one-letter abbreviated amino acid sequence derived from either the universal genetic code or the table as specified in /transl_table and as determined by exceptions in the /transl_except and /codon qualifiers. 
| /transl_except= | Translational exception: single codon the translation of which does not conform to genetic code defined by Organism and /codon=. | /transl_table= | Definition of genetic code table used if other than universal genetic code table (Tables are described in Appendix B). | /transposon= | Transposable element from which the sequence was obtained. | /usedin= | Indicates that the feature is used in a compound feature in another entry. | /variety= | Name of variety (formal Linnean rank) of organism from which the sequence was obtained; use the /cultivar qualifier for cultivated plant varieties. | /virion | Viral genomic sequence as it is encapsidated (distinguished from its proviral form integrated in a host cell's chromosome). |

2.7.3 Locations

A location is an instruction for finding a feature in a sequence. A list of DDBJ/EMBL/GenBank locations is presented in Table 2-5.

Table 2-5. DDBJ/EMBL/GenBank location examples

Location | Description | 467 | Points to a single base in the presented sequence. | 340..565 | Points to a continuous range of bases bounded by and including the starting and ending bases. | <345..500 | Indicates that the exact lower boundary point of a feature is unknown. The location begins at some base previous to the first base specified (which need not be contained in the presented sequence) and continues to and includes the ending base. | <1..888 | The feature starts before the first sequenced base and continues to and includes base 888. | (102.110) | Indicates that the exact location is unknown but that it is one of the bases between bases 102 and 110, inclusive. | (23.45)..600 | Specifies that the starting point is one of the bases between bases 23 and 45, inclusive, and the end point is base 600. | (122.133)..(204.221) | The feature starts at a base between 122 and 133, inclusive, and ends at a base between 204 and 221, inclusive. | 123^124 | Points to a site between bases 123 and 124. 
| 145^177 | Points to a site between two adjacent bases anywhere between bases 145 and 177. | join(12..78,134..202) | Regions 12 to 78 and 134 to 202 should be joined to form one contiguous sequence. | complement(join(2691..4571,4918..5163)) | Joins regions 2691 to 4571 and 4918 to 5163, then complements the joined segments (the feature is on the strand complementary to the presented strand). | join(complement(4918..5163),complement(2691..4571)) | Complements regions 4918 to 5163 and 2691 to 4571, then joins the complemented segments (the feature is on the strand complementary to the presented strand). | complement(34..(122.126)) | Start at one of the bases complementary to those between 122 and 126 on the presented strand and finish at the base complementary to base 34 (the feature is on the strand complementary to the presented strand). | J00194:100..202 | Points to bases 100 to 202, inclusive, in the entry (in this database) with primary accession number "J00194". | |
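
To make the location syntax in Table 2-5 more concrete, here is a minimal Python sketch that reduces a location string to a list of outer-bound spans. The names Span and parse_location are hypothetical and are not part of any DDBJ/EMBL/GenBank tool. The sketch deliberately simplifies: uncertain positions such as (23.45) are collapsed to their outermost bases, the "<" flag for an unknown boundary is ignored, and only the operators shown in the table (join, complement, and a remote accession prefix) are handled.

```python
import re
from typing import List, NamedTuple, Optional


class Span(NamedTuple):
    start: int                       # lowest base the segment may cover
    end: int                         # highest base the segment may cover
    strand: int                      # +1 presented strand, -1 complementary strand
    accession: Optional[str] = None  # set for remote locations, e.g. J00194:100..202


def _split_top_level(text: str) -> List[str]:
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_location(loc: str) -> List[Span]:
    """Return the segments of a location in the order they are read."""
    loc = loc.strip()
    if loc.startswith("complement(") and loc.endswith(")"):
        inner = parse_location(loc[len("complement("):-1])
        # Complementing flips the strand and reverses the reading order.
        return [s._replace(strand=-s.strand) for s in reversed(inner)]
    if loc.startswith("join(") and loc.endswith(")"):
        spans: List[Span] = []
        for part in _split_top_level(loc[len("join("):-1]):
            spans.extend(parse_location(part))
        return spans
    # Leaf location: optional "ACCESSION:" prefix, then one or more base numbers.
    accession = None
    match = re.match(r"^([A-Za-z][A-Za-z0-9_.]*):(.+)$", loc)
    if match:
        accession, loc = match.group(1), match.group(2)
    numbers = [int(n) for n in re.findall(r"\d+", loc)]
    if not numbers:
        raise ValueError(f"no base positions found in {loc!r}")
    return [Span(min(numbers), max(numbers), 1, accession)]


if __name__ == "__main__":
    for example in ("467", "(23.45)..600", "join(12..78,134..202)",
                    "complement(join(2691..4571,4918..5163))",
                    "J00194:100..202"):
        print(example, "->", parse_location(example))
```

Under these simplifications, the two complement/join forms listed in the table parse to the same segments in the same order, which matches the table's statement that both describe a feature on the complementary strand.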