Chapter 1. FASTA Format

The most common sequence format you'll encounter is FASTA. The format is simple. The first line of a sequence entry consists of ">" followed by an identifier, which contains no whitespace. The identifier may be followed by whitespace and then a comment or description. This first line is referred to as the comment or description line. One or more lines of sequence data follow. The sequence data lines need not all be the same length, either within a file or from one file to the next; common line lengths are 60, 70, 72, and 80. For details, see Section 1.3 at the end of this chapter. Example 1-1 contains a sample FASTA entry.

Example 1-1. Sample FASTA entry
>gi|29848|emb|X61622.1|HSCDK2MR H.sapiens CDK2 mRNA
ATGGAGAACTTCCAAAAGGTGGAAAAGATCGGAGAGGGCACGTACGGAGTTGTGTACAAAGCCAGAAACA
AGTTGACGGGAGAGGTGGTGGCGCTTAAGAAAATCCGCCTGGACACTGAGACTGAGGGTGTGCCCAGTAC
TGCCATCCGAGAGATCTCTCTGCTTAAGGAGCTTAACCATCCTAATATTGTCAAGCTGCTGGATGTCATT
CACACAGAAAATAAACTCTACCTGGTTTTTGAATTTCTGCACCAAGATCTCAAGAAATTCATGGATGCCT
CTGCTCTCACTGGCATTCCTCTTCCCCTCATCAAGAGCTATCTGTTCCAGCTGCTCCAGGGCCTAGCTTT
CTGCCATTCTCATCGGGTCCTCCACCGAGACCTTAAACCTCAGAATCTGCTTATTAACACAGAGGGGGCC
ATCAAGCTAGCAGACTTTGGACTAGCCAGAGCTTTTGGAGTCCCTGTTCGTACTTACACCCATGAGGTGG
TGACCCTGTGGTACCGAGCTCCTGAAATCCTCCTGGGCTCGAAATATTATTCCACAGCTGTGGACATCTG
GAGCCTGGGCTGCATCTTTGCTGAGATGGTGACTCGCCGGGCCCTGTTCCCTGGAGATTCTGAGATTGAC
CAGCTCTTCCGGATCTTTCGGACTCTGGGGACCCCAGATGAGGTGGTGTGGCCAGGAGTTACTTCTATGC
CTGATTACAAGCCAAGTTTCCCCAAGTGGGCCCGGCAAGATTTTAGTAAAGTTGTACCTCCCCTGGATGA
AGATGGACGGAGCTTGTTATCGCAAATGCTGCACTACGACCCTAACAAGCGGATTTCGGCCAAGGCAGCC
CTGGCTCACCCTTTCTTCCAGGATGTGACCAAGCCAGTACCCCATCTTCGACTCTGATAGCCTTCTTGAA
GCCCCCGACCCTAATCGGCTCACCCTCTCCTCCAGTGTGGGCTTGACCAGCTTGGCCTTGGGCTATTTGG
ACTCAGGTGGGCCCTCTGAACTTGCCTTAAACACTCACCTTCTAGTCTTAACCAGCCAACTCTGGGAATA
CAGGGGTGAAAGGGGGGAACCAGTGAAAATGAAAGGAAGTTTCAGTATTAGATGCACTTAAGTTAGCCTC
CACCACCCTTTCCCCCTTCTCTTAGTTATTGCTGAAGAGGGTTGGTATAAAAATAATTTTAAAAAAGCCT
TCCTACACGTTAGATTTGCCGTACCAATCTCTGAATGCCCCATAATTATTATTTCCAGTGTTTGGGATGA
CCAGGATCCCAAGCCTCCTGCTGCCACAATGTTTATAAAGGCCAAATGATAGCGGGGGCTAAGTTGGTGC
TTTTGAGAATTAAGTAAAACAAAACCACTGGGAGGAGTCTATTTTAAAGAATTCGGTTAAAAAATAGATC
CAATCAGTTTATACCCTAGTTAGTGTTTTCCTCACCTAATAGGCTGGGAGACTGAAGACTCAGCCCGGGT
GGGGGT
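
As an illustration of the layout described above, the following Python sketch builds one FASTA entry from its parts. The function name format_fasta and its parameters are hypothetical, chosen for this example only, and the default width of 70 is just one of the common line lengths mentioned earlier.

```python
def format_fasta(identifier, description, sequence, width=70):
    """Return one FASTA entry as a string (hypothetical helper, not a standard API)."""
    # The identifier must not contain whitespace.
    if any(ch.isspace() for ch in identifier):
        raise ValueError("identifier must not contain whitespace")
    # Description line: ">" + identifier, optionally followed by a description.
    header = ">" + identifier
    if description:
        header += " " + description
    # Slice the sequence into lines of at most `width` characters;
    # the last line may be shorter than the others.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


entry = format_fasta("seq1", "test sequence", "ACGT" * 5, width=8)
print(entry, end="")
```

Because readers should not rely on a particular line length, the width here is only a formatting choice for the writer.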

Many organizations have their own syntax for the description line and have written their own code for parsing and writing FASTA files. Most open source tools parse only the identifier and treat the rest of the line as a single description string.
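
A minimal Python sketch of that generic treatment might look like the following. The function name split_header is hypothetical. Splitting the identifier on "|" at the end is shown only as an illustration, since the meaning of each field depends on the organization that wrote the file.

```python
def split_header(line):
    """Split a FASTA description line into (identifier, description)."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA description line")
    # Drop the ">" and surrounding whitespace, then split once on whitespace:
    # the first token is the identifier, the remainder is the description.
    parts = line[1:].strip().split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description


identifier, description = split_header(
    ">gi|29848|emb|X61622.1|HSCDK2MR H.sapiens CDK2 mRNA\n"
)
# Organization-specific syntax, if any, lives inside the identifier;
# the interpretation of these fields is an assumption left to the caller.
fields = identifier.split("|")
```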

A FASTA file may contain more than one sequence entry. The entries are simply concatenated, with each line that begins with ">" marking the start of a new sequence entry.
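
Reading a multi-entry file therefore comes down to watching for lines that begin with ">". The sketch below is one way to do that in Python, assuming the input is an open text handle. The name read_fasta is hypothetical, and the reader joins sequence lines without assuming any particular line length.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) for each entry in an open text handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new description line closes the previous entry, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line:
            # Sequence data line: collect it regardless of its length.
            chunks.append(line)
    # Emit the final entry once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


example = io.StringIO(">seq1 first entry\nACGT\nAC\n>seq2\nGGTT\n")
records = list(read_fasta(example))
```

Writing the reader as a generator means that only one entry is held in memory at a time, which matters for large files.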



Sequence Analysis in a Nutshell: A Guide to Common Tools and Databases
ISBN: 059600494X
Year: 2005
Pages: 312
