7.1 References

  • Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.

  • Altschul, S. F., T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.

  • Gish, W., and D. J. States. 1993. Identification of protein coding regions by database similarity search. Nature Genet. 3:266-272.

    Main page: http://www.ncbi.nlm.nih.gov/BLAST/

    Information guide: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

    Download: ftp://ftp.ncbi.nih.gov/blast/executables/

Chapter 8. BLAT

BLAT (BLAST-Like Alignment Tool) is a very fast sequence alignment tool similar to BLAST. It is newer than BLAST but is becoming very popular, and we like it a lot. BLAT is more accurate and can be hundreds of times faster than BLAST. Its speed comes from indexing, at run time, all nonoverlapping subsequences of a given length in the database (the tiles whose length the -tileSize option sets). This index is small enough to fit into computer memory and is typically computed only once for each genome assembly. Jim Kent developed BLAT while working on the human genome, specifically to help with genome assembly. For details, see Section 8.2 at the end of this chapter. We're using Version 16 of BLAT.
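
The indexing idea can be illustrated with a short Python sketch. This is not BLAT's implementation: the function names, the toy sequences, and the tile size of 4 are made up for illustration, and the sketch assumes exact tile matches only. The database is cut into nonoverlapping tiles that are stored in a lookup table, and every position of the query is then checked against that table.

```python
from collections import defaultdict


def build_tile_index(database, tile_size):
    """Map each nonoverlapping tile of the database to its start positions."""
    index = defaultdict(list)
    # Step through the database in strides of tile_size so tiles never overlap.
    for start in range(0, len(database) - tile_size + 1, tile_size):
        index[database[start:start + tile_size]].append(start)
    return index


def find_hits(index, query, tile_size):
    """Check every (overlapping) query tile against the index.

    Returns a list of (query_position, database_position) pairs.
    """
    hits = []
    for q_pos in range(len(query) - tile_size + 1):
        tile = query[q_pos:q_pos + tile_size]
        # index.get avoids adding empty entries to the defaultdict.
        for db_pos in index.get(tile, []):
            hits.append((q_pos, db_pos))
    return hits


# Toy sequences, chosen only for illustration.
database = "ACGTACGGTTCAGGCATTAC"
index = build_tile_index(database, tile_size=4)
hits = find_hits(index, "GGTTCAGGCA", tile_size=4)
print(hits)
```

Because the database tiles do not overlap, the index holds roughly one entry per tile_size bases, which is what keeps it small. The two hits found here lie on the same diagonal (database position minus query position is 6 for both), which is the kind of evidence a real aligner would use to start an alignment.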

An example of a BLAT command-line entry:

blat database query [-ooc=11.ooc] output.psl

where:

  • database is a .fa file, a .nib file, or a list of .fa or .nib files.

  • query is a .fa file, a .nib file, or a list of .fa or .nib files.

  • -ooc=11.ooc tells the program to load over-occurring 11-mers from an external file. This will increase the speed by a factor of 40 in many cases, but is not required.

  • output.psl is where to put the output.
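
When BLAT is driven from a script, the command line above can be assembled programmatically. The following Python sketch shows one way to do that. The helper name build_blat_command and the file names genome.fa and mrna.fa are hypothetical, and only the arguments described above are used.

```python
import shlex


def build_blat_command(database, query, output, ooc=None):
    """Assemble the argument list for: blat database query [-ooc=N.ooc] output.psl"""
    args = ["blat", database, query]
    if ooc is not None:
        # Optional file of over-occurring tiles, for example "11.ooc".
        args.append(f"-ooc={ooc}")
    args.append(output)
    return args


# Hypothetical file names, used only for illustration.
command = build_blat_command("genome.fa", "mrna.fa", "output.psl", ooc="11.ooc")
print(shlex.join(command))
```

If blat is installed and on the search path, the resulting list can be handed to subprocess.run(command, check=True). Keeping the arguments in a list rather than a single string avoids shell quoting problems with unusual file names.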

8.1 Command-Line Options

Table 8-1 summarizes the BLAT options.

Table 8-1. BLAT options

| Option | Definition | Default |
|---|---|---|
| -dots=N | Output a dot every N sequences to show the program's progress. | |
| -makeOoc=N.ooc | Make overused tile file. | |
| -mask=type | Mask out repeats. Alignments won't be started in a masked region but may extend through it in nucleotide searches. Masked areas are ignored entirely in protein or translated searches. Types are: lower = mask out lowercased sequence; upper = mask out uppercased sequence; out = mask according to the database.out RepeatMasker .out file; file.out = mask the database according to RepeatMasker file.out. | |
| -maxGap=N | Sets the size of the maximum gap between tiles in a clump. Usually set from 0 to 3. Only relevant for minMatch > 1. | 2 |
| -minIdentity=N | Sets the minimum sequence identity (in percent). | 90 (nucleotide); 25 (protein); 25 (translated) |
| -minMatch=N | Sets the number of tile matches. Usually set from 2 to 4. | 2 (nucleotide); 1 (protein) |
| -minScore=N | Sets the minimum score. This is twice the matches minus the mismatches minus some sort of gap penalty. | 30 |
| -minRepDivergence=NN | Minimum percent divergence of repeats to allow them to be unmasked. Only relevant for masking using RepeatMasker .out files. | 15 |
| -noHead | Suppress the .psl header (so it's just a tab-separated file). | |
| -noTrimA | Don't trim trailing poly-A. | |
| -oneOff=N | If set to 1, allows one mismatch in a tile and still triggers an alignment. | 0 |
| -ooc=N.ooc | Use overused tile file N.ooc. N should correspond to the tileSize. | |
| -out=type | Controls the output file format. Type is one of: psl = tab-separated format without actual sequence; pslx = tab-separated format with sequence; axt = blastz-associated axt format; maf = multiz-associated maf format; wublast = similar to wublast format; blast = similar to NCBI blast format. | psl |
| -prot | Synonymous with -t=prot -q=prot. | |
| -qMask=type | Mask out repeats in the query sequence. Similar to -mask, but for query rather than target sequences. | |
| -q=type | Query type. Type is one of: dna = DNA sequence; rna = RNA sequence; prot = protein sequence; dnax = DNA sequence translated in six frames to protein; rnax = DNA sequence translated in three frames to protein. | dna |
| -repMatch=N | Sets the number of repetitions of a tile allowed before it is marked as overused. Typically this is 256 for tileSize 12, 1024 for tileSize 11, and 4096 for tileSize 10. Typically comes into play only with makeOoc. | 1024 |
| -t=type | Database type. Type is one of: dna = DNA sequence; prot = protein sequence; dnax = DNA sequence translated in six frames to protein. | dna |
| -tileSize=N | Sets the size of match that triggers an alignment. Usually between 8 and 12. | 11 (DNA); 5 (protein) |
| -trimHardA | Removes poly-A tail from qSize and alignments in psl output. | |
| -trimT | Trims leading poly-T. | |
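
The -repMatch, -makeOoc, and -ooc options all revolve around overused tiles, that is, tiles that occur so often in the database that they are poor seeds for an alignment. The Python sketch below shows the general idea only. It assumes that tiles are counted without overlap and that a tile is overused once its count exceeds the threshold; BLAT's own counting rules and file format may differ, and the function name and toy values are hypothetical.

```python
from collections import Counter


def find_overused_tiles(sequence, tile_size, rep_match):
    """Return the set of tiles that occur more than rep_match times.

    Assumption: tiles are taken without overlap, in strides of tile_size.
    """
    counts = Counter(
        sequence[start:start + tile_size]
        for start in range(0, len(sequence) - tile_size + 1, tile_size)
    )
    return {tile for tile, count in counts.items() if count > rep_match}


# Toy values so the effect is visible; realistic thresholds are far larger
# (for example, 1024 for a tile size of 11).
overused = find_overused_tiles("ACACACACACGTTGCA", tile_size=2, rep_match=3)
print(overused)
```

In this toy sequence the tile AC occurs five times and is flagged, while GT, TG, and CA occur once each and are kept. A search that skips flagged tiles avoids seeding alignments in highly repetitive sequence, which is where the speed gain from an .ooc file comes from.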