Chapter 2. GenBank/EMBL/DDBJ

GenBank is maintained by the National Center for Biotechnology Information (NCBI). Together with the DNA Data Bank of Japan (DDBJ, in Mishima, Japan) and the European Molecular Biology Laboratory (EMBL, in Heidelberg, Germany) nucleotide database, which is maintained at the European Bioinformatics Institute (EBI, in Hinxton, UK), it forms the International Nucleotide Sequence Database Collaboration. Although the three repositories have separate sites for data submission, they share sequence data with one another and allow the public to download sequence files daily. The examples in this chapter are based on GenBank Release 132, EMBL Release 72, and DDBJ Release 51.

2.1 Example Flat Files

Many software tools read and write sequence flat files. GenBank, DDBJ, and EMBL each have their own flat file format. The next several sections show a flat file from each of these databases; the examples illustrate the field definitions and the feature table sections for each repository. The sequence of cyclin-dependent kinase 2 (CDK2) is used as the example for all of the sequence flat file entries and for the FASTA file.

2.2 GenBank Example Flat File

Example 2-1 contains a sample sequence entry from GenBank. This entry contains terms from the GenBank Field Definitions and the DDBJ/EMBL/GenBank Feature Table, discussed later in this chapter.

Example 2-1. Sample GenBank entry
LOCUS       HSCDK2MR                1476 bp    mRNA    linear   PRI 15-JAN-1992
DEFINITION  H.sapiens CDK2 mRNA.
ACCESSION   X61622
VERSION     X61622.1  GI:29848
KEYWORDS    CDK2 gene; cell cycle regulation protein; cyclin A binding; protein
            kinase.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1476)
  AUTHORS   Elledge,S.J. and Spottswood,M.R.
  TITLE     A new human p34 protein kinase, CDK2, identified by complementation
            of a cdc28 mutation in Saccharomyces cerevisiae, is a homolog of
            Xenopus Eg1
  JOURNAL   EMBO J. 10 (9), 2653-2659 (1991)
  MEDLINE   91330891
REFERENCE   2  (bases 1 to 1476)
  AUTHORS   Elledge,S.J.
  TITLE     Direct Submission
  JOURNAL   Submitted (28-NOV-1991) S.J. Elledge, Dept. of Biochemistry, Baylor
            College of Medicine, 1 Baylor Place, Houston, TX 77030, USA
FEATURES             Location/Qualifiers
     source          1..1476
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /clone="pSE1000"
                     /cell_line="EBV transformed Human peripheral lymphocyte
                     (B-cell)"
                     /clone_lib="lambda YES-R cDNA library"
     gene            1..1476
                     /gene="CDK2"
     CDS             1..897
                     /gene="CDK2"
                     /function="protein kinase"
                     /note="cell division kinase. CDC2 homolog"
                     /codon_start=1
                     /protein_
                     /db_xref="GI:29849"
                     /db_xref="SWISS-PROT:P24941"
                     /translation="MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGV
                     PSTAIREISLLKELNHPNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLP
                     LIKSYLFQLLQGLAFCHSHRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYT
                     HEVVTLWYRAPEILLGSKYYSTAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRT
                     LGTPDEVVWPGVTSMPDYKPSFPKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRIS
                     AKAALAHPFFQDVTKPVPHLRL"
BASE COUNT      368 a    372 c    351 g    385 t
ORIGIN      
        1 atggagaact tccaaaaggt ggaaaagatc ggagagggca cgtacggagt tgtgtacaaa
       61 gccagaaaca agttgacggg agaggtggtg gcgcttaaga aaatccgcct ggacactgag
      121 actgagggtg tgcccagtac tgccatccga gagatctctc tgcttaagga gcttaaccat
      181 cctaatattg tcaagctgct ggatgtcatt cacacagaaa ataaactcta cctggttttt
      241 gaatttctgc accaagatct caagaaattc atggatgcct ctgctctcac tggcattcct
      301 cttcccctca tcaagagcta tctgttccag ctgctccagg gcctagcttt ctgccattct
      361 catcgggtcc tccaccgaga ccttaaacct cagaatctgc ttattaacac agagggggcc
      421 atcaagctag cagactttgg actagccaga gcttttggag tccctgttcg tacttacacc
      481 catgaggtgg tgaccctgtg gtaccgagct cctgaaatcc tcctgggctc gaaatattat
      541 tccacagctg tggacatctg gagcctgggc tgcatctttg ctgagatggt gactcgccgg
      601 gccctgttcc ctggagattc tgagattgac cagctcttcc ggatctttcg gactctgggg
      661 accccagatg aggtggtgtg gccaggagtt acttctatgc ctgattacaa gccaagtttc
      721 cccaagtggg cccggcaaga ttttagtaaa gttgtacctc ccctggatga agatggacgg
      781 agcttgttat cgcaaatgct gcactacgac cctaacaagc ggatttcggc caaggcagcc
      841 ctggctcacc ctttcttcca ggatgtgacc aagccagtac cccatcttcg actctgatag
      901 ccttcttgaa gcccccgacc ctaatcggct caccctctcc tccagtgtgg gcttgaccag
      961 cttggccttg ggctatttgg actcaggtgg gccctctgaa cttgccttaa acactcacct
     1021 tctagtctta accagccaac tctgggaata caggggtgaa aggggggaac cagtgaaaat
     1081 gaaaggaagt ttcagtatta gatgcactta agttagcctc caccaccctt tcccccttct
     1141 cttagttatt gctgaagagg gttggtataa aaataatttt aaaaaagcct tcctacacgt
     1201 tagatttgcc gtaccaatct ctgaatgccc cataattatt atttccagtg tttgggatga
     1261 ccaggatccc aagcctcctg ctgccacaat gtttataaag gccaaatgat agcgggggct
     1321 aagttggtgc ttttgagaat taagtaaaac aaaaccactg ggaggagtct attttaaaga
     1381 attcggttaa aaaatagatc caatcagttt ataccctagt tagtgttttc ctcacctaat
     1441 aggctgggag actgaagact cagcccgggt gggggt

//
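
Because the layout of Example 2-1 is fixed (keywords in the first 12 columns, the sequence after ORIGIN, and // as the end-of-entry marker), a few top-level fields can be pulled out with very little code. The following Python sketch is an illustration only, not an official parser: the function name parse_genbank and the SAMPLE text are made up for this example, SAMPLE is an abridged copy of Example 2-1 (only the first 120 bases), and the code ignores the feature table and the reference blocks entirely.

```python
from collections import Counter

# Abridged copy of Example 2-1 (hypothetical test input, first 120 bases only).
SAMPLE = """\
LOCUS       HSCDK2MR                1476 bp    mRNA    linear   PRI 15-JAN-1992
DEFINITION  H.sapiens CDK2 mRNA.
ACCESSION   X61622
ORIGIN
        1 atggagaact tccaaaaggt ggaaaagatc ggagagggca cgtacggagt tgtgtacaaa
       61 gccagaaaca agttgacggg agaggtggtg gcgcttaaga aaatccgcct ggacactgag
//
"""


def parse_genbank(text):
    """Extract a few top-level fields and the sequence from one GenBank entry."""
    record = {}
    seq_parts = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-entry marker
            break
        if in_origin:
            # Sequence lines: a position number, then blocks of ten bases.
            seq_parts.extend(line.split()[1:])
            continue
        keyword = line[:12].strip()      # keywords occupy the first 12 columns
        value = line[12:].strip()
        if keyword == "LOCUS":
            fields = value.split()
            record["locus"] = fields[0]
            record["length"] = int(fields[1])   # length declared in the LOCUS line
        elif keyword == "DEFINITION":
            record["definition"] = value        # first line only; continuations are skipped
        elif keyword == "ACCESSION":
            record["accession"] = value.split()[0]
        elif keyword == "ORIGIN":
            in_origin = True
    record["sequence"] = "".join(seq_parts)
    return record


record = parse_genbank(SAMPLE)
counts = Counter(record["sequence"])     # per-base tally, comparable to BASE COUNT
print(record["locus"], record["accession"], record["length"])
print(len(record["sequence"]), dict(counts))
```

Run against the complete entry rather than the abridged sample, the length of the parsed sequence should agree with the length declared in the LOCUS line, and the per-base tally should agree with the BASE COUNT line. For real work, an established library parser is a safer choice than a hand-rolled one like this.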