1.1 NCBI's Sequence Identifier Syntax

The National Center for Biotechnology Information (NCBI) uses the following syntax for its BLAST server. NCBI is part of the National Library of Medicine (NLM) at the National Institutes of Health (NIH). The following (including the table) is NCBI's description. See ftp://ftp.ncbi.nih.gov/blast/db/README for details.

The syntax of sequence header lines used by the NCBI BLAST server depends on the database from which each sequence was obtained. The table below lists the identifiers for the databases from which the sequences were derived.

Database name                    Identifier syntax

GenBank                          gb|accession|locus
EMBL Data Library                emb|accession|locus
DDBJ, DNA Database of Japan      dbj|accession|locus
NBRF PIR                         pir||entry
Protein Research Foundation      prf||name
SWISS-PROT                       sp|accession|entry name
Brookhaven Protein Data Bank     pdb|entry|chain
Patents                          pat|country|number
GenInfo Backbone Id              bbs|number
General database identifier      gnl|database|identifier
NCBI Reference Sequence          ref|accession|locus
Local Sequence identifier        lcl|identifier

For example, an identifier might be "gb|M73307|AGMA13GT", where the "gb" tag indicates that the identifier refers to a GenBank sequence, "M73307" is its GenBank ACCESSION, and "AGMA13GT" is the GenBank LOCUS.
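As an illustration, the table can be turned into a small lookup that splits an identifier into named fields. The following Python sketch is not part of NCBI's description. The names SYNTAX and parse_identifier are made up for this example, and it assumes a single identifier written exactly as in the table, with every field position present (possibly empty, as in "pir||entry").

```python
# Hypothetical helper: field names per database tag, copied from the table.
# None marks a field position that the table leaves empty (e.g. "pir||entry").
SYNTAX = {
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": (None, "entry"),
    "prf": (None, "name"),
    "sp": ("accession", "entry name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "ref": ("accession", "locus"),
    "lcl": ("identifier",),
}


def parse_identifier(identifier):
    """Split one 'tag|field|field' identifier into a dict of named fields."""
    tag, *values = identifier.split("|")
    if tag not in SYNTAX:
        raise ValueError(f"unknown database tag: {tag!r}")
    names = SYNTAX[tag]
    # Assumption: the identifier has exactly as many fields as the table shows.
    if len(values) != len(names):
        raise ValueError(f"{tag!r} expects {len(names)} field(s), got {len(values)}")
    fields = {"tag": tag}
    for name, value in zip(names, values):
        if name is not None:  # skip the empty slot in pir|| and prf||
            fields[name] = value
    return fields


print(parse_identifier("gb|M73307|AGMA13GT"))
# {'tag': 'gb', 'accession': 'M73307', 'locus': 'AGMA13GT'}
```

Identifiers found in real databases may omit trailing fields or chain several identifiers together; the sketch rejects those rather than guessing at their meaning.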

"gi" identifiers are being assigned by NCBI for all sequences contained within NCBI's sequence databases. This identifier provides a uniform and stable naming convention whereby a specific sequence is assigned its unique gi identifier. If a nucleotide or protein sequence changes, however, a new gi identifier is assigned, even if the accession number of the record remains unchanged. Thus, gi identifiers provide a mechanism for identifying the exact sequence that was used or retrieved in a given search.

1.2 NCBI's Non-Redundant Database Syntax

You should be aware of one additional syntax, which NCBI uses for its non-redundant database. Because the whole point of that database is to list each sequence only once, the description line syntax allows for more than one set of identifier and description. The sets are delimited by Ctrl-A characters. Here is NCBI's description:

These files are all non-redundant; identical sequences are merged into one entry. To be merged two sequences must have identical lengths and every residue (or basepair) at every position must be the same. The FASTA deflines for the different entries that belong to one sequence are separated by control-A's (^A). In the following example, both entries gi|1469284 and gi|1477453 have the same sequence, in every respect.

>gi|1469284 (U05042) afuC gene product [Actinobacillus 
pleuropneumoniae]^Agi|1477453 (U04954) afuC gene product [Actinobacillus 
pleuropneumoniae]
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVT
KSSIQNRDICIVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQ
QQRVALARALVLKPKVLILDEPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMN
KGTIMQKARQKIFIYDRILYSLRNFMGESTICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPE
AIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLINANPDQFDPDATKAFIHFTEQGIFLLNKE
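A merged description line like the one in this example can be taken apart by splitting on the Ctrl-A character (byte value 1, written "\x01" in Python). The sketch below is an illustration only. The function name split_defline is made up, and it assumes the description line has already been read as a single line, with the first space in each set separating the identifier from the description.

```python
def split_defline(defline):
    """Return (identifier, description) pairs from a merged FASTA defline."""
    # Drop the leading '>' that marks a FASTA description line, if present.
    if defline.startswith(">"):
        defline = defline[1:]
    entries = []
    # Ctrl-A (0x01) separates the sets that belong to one merged sequence.
    for part in defline.split("\x01"):
        # Assumption: the identifier ends at the first space in each set.
        identifier, _, description = part.strip().partition(" ")
        entries.append((identifier, description.strip()))
    return entries


defline = (
    ">gi|1469284 (U05042) afuC gene product [Actinobacillus pleuropneumoniae]"
    "\x01"
    "gi|1477453 (U04954) afuC gene product [Actinobacillus pleuropneumoniae]"
)
for identifier, description in split_defline(defline):
    print(identifier, "->", description)
```

Run on the example, this yields two pairs, one for gi|1469284 and one for gi|1477453, each with its own description.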

1.3 References

  • Pearson, W. R., and D. J. Lipman. 1988. Improved Tools for Biological Sequence Analysis. Proceedings of the National Academy of Sciences 85:2444-2448.

  • NCBI Sequence Identifier Syntax: ftp://ftp.ncbi.nih.gov/blast/db/README

  • Non-redundant database: ftp://ftp.ncbi.nih.gov/blast/db/README