2.4 GenBank/DDBJ Field Definitions

The field terms found in GenBank/DDBJ sequence flat files help organize the information for both human readability and machine parsing. A single flat file contains several of these field terms, and the GenBank and DDBJ repositories share the same field definitions. Table 2-1 summarizes each of them.

Table 2-1. GenBank/DDBJ field definitions

| Field | Description |
|---|---|
| LOCUS | A short mnemonic name for the entry, chosen to suggest the sequence's definition. Mandatory keyword/exactly one record. |
| DEFINITION | A concise description of the sequence. Mandatory keyword/one or more records. |
| ACCESSION | The primary accession number is a unique, unchanging code assigned to each entry. Mandatory keyword/one or more records. |
| VERSION | A compound identifier consisting of the primary accession number and a numeric version number associated with the current version of the sequence data in the record. This is followed by an integer key (a "GI") assigned to the sequence by NCBI. Mandatory keyword/exactly one record. |
| NID | An alternative method of presenting the NCBI GI identifier (described under VERSION). The NID is obsolete and was removed from the GenBank flat file format in December 1999. |
| KEYWORDS | Short phrases describing gene products and other information about an entry. Mandatory keyword in all annotated entries/one or more records. |
| SEGMENT | Information on the order in which this entry appears in a series of discontinuous sequences from the same molecule. Optional keyword (only in segmented entries)/exactly one record. |
| SOURCE | Common name of the organism or the name most frequently used in the literature. Mandatory keyword in all annotated entries/one or more records/includes one subkeyword. |
| ORGANISM | Formal scientific name of the organism (first line) and taxonomic classification levels (second and subsequent lines). Mandatory subkeyword in all annotated entries/two or more records. |
| REFERENCE | Citations for all articles containing data reported in this entry. Includes six subkeywords (AUTHORS, TITLE, JOURNAL, MEDLINE, PUBMED, and REMARK) and may repeat. Mandatory keyword/one or more records. |
| AUTHORS | Lists the authors of the citation. Mandatory subkeyword/one or more records. |
| TITLE | Full title of citation. Optional subkeyword (present in all but unpublished citations)/one or more records. |
| JOURNAL | Lists the journal name, volume, year, and page numbers of the citation. Mandatory subkeyword/one or more records. |
| MEDLINE | Provides the Medline unique identifier for a citation. Optional subkeyword/one record. |
| PUBMED | Provides the PubMed unique identifier for a citation. Optional subkeyword/one record. |
| REMARK | Specifies the relevance of a citation to an entry. Optional subkeyword/one or more records. |
| COMMENT | Cross-references to other sequence entries, comparisons to other collections, notes of changes in LOCUS names, and other remarks. Optional keyword/one or more records/may include blank records. |
| FEATURES | Table containing information on portions of the sequence that code for proteins and RNA molecules and information on experimentally determined sites of biological significance. Optional keyword/one or more records. |
| BASE COUNT | Summary of the number of occurrences of each base code in the sequence. Mandatory keyword/exactly one record. |
| ORIGIN | Specification of how the first base of the reported sequence is operationally located within the genome. Where possible, this includes its location within a larger genetic map. Mandatory keyword/exactly one record. |
| // | Entry termination symbol. Mandatory at the end of an entry/exactly one record. |
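
As a rough illustration of how the keywords in Table 2-1 support machine parsing, the following Python sketch splits one entry into its top-level fields and unpacks the VERSION line. It assumes the conventional column layout, which this section does not spell out: keywords start in column 1, data start in column 13, and subkeywords and continuation lines are indented. The sample entry, the accession ZZ999999, and the function names split_fields and parse_version are hypothetical and exist only for this example.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical, abbreviated entry for illustration only; not a real record.
SAMPLE = """\
LOCUS       EXAMPLE1     24 bp    DNA             PLN       01-JAN-2000
DEFINITION  Hypothetical example sequence used only to illustrate
            the flat file layout.
ACCESSION   ZZ999999
VERSION     ZZ999999.1  GI:1234567
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi.
BASE COUNT        6 a      6 c      6 g      6 t
ORIGIN
        1 acgtacgtac gtacgtacgt acgt
//
"""


def split_fields(record: str) -> List[Tuple[str, str]]:
    """Return (keyword, text) pairs for the top-level fields of one entry.

    Assumption: a line that starts in column 1 opens a new keyword whose
    name occupies the first 12 columns. Indented lines (subkeywords,
    continuations, sequence data) are folded into the preceding keyword.
    """
    fields: List[Tuple[str, str]] = []
    for line in record.splitlines():
        if line.startswith("//"):  # entry termination symbol
            break
        if not line or line[0].isspace():
            if fields:  # append to the field currently being read
                key, text = fields[-1]
                fields[-1] = (key, (text + "\n" + line.strip()).strip())
            continue
        fields.append((line[:12].strip(), line[12:].strip()))
    return fields


def parse_version(text: str) -> Tuple[str, int, Optional[int]]:
    """Split 'ACCESSION.version  GI:number' into its parts.

    The GI part is treated as optional so that lines without it still parse.
    """
    match = re.fullmatch(r"(\S+)\.(\d+)(?:\s+GI:(\d+))?", text.strip())
    if match is None:
        raise ValueError("unrecognized VERSION line: %r" % text)
    accession, version, gi = match.groups()
    return accession, int(version), int(gi) if gi else None


if __name__ == "__main__":
    for keyword, text in split_fields(SAMPLE):
        print(keyword, "->", text.splitlines()[0] if text else "")
    print(parse_version(dict(split_fields(SAMPLE))["VERSION"]))
```

The function returns a list of pairs instead of a dictionary because REFERENCE may repeat within an entry; converting to a dictionary, as the last line does, is safe only when no keyword occurs more than once. Subkeywords such as ORGANISM are kept inside the text of their parent keyword, and the sequence lines end up under ORIGIN, so a fuller parser would need to handle both separately.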