Blat-FAQ: Difference between revisions
Revision as of 21:35, 21 January 2009
This email from Galt Barber to the mailing list in January 2009 should answer most questions about the inner workings of BLAT:
Here is my understanding of how BLAT works. This may be flawed or incomplete, and cannot be taken as official. However, it should still be helpful.
BLAT walks along the target sequence(s) in stepSize increments and records the position of each tile it finds, where a tile is a run of tileSize nucleotides or amino acids. This builds an index of the target, which is usually a genome.
With 4 possible nucleotides at each of 11 (tileSize) positions, there are 4^11 different tiles. This is implemented as an array with 4,194,304 tile slots. Each array element is a list of genome positions where the tile's sequence occurs. For nucleotides, only the positive target strand is indexed.
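To make the indexing step concrete, here is a minimal sketch in Python. It is not BLAT's actual code: the names tile_to_int and build_index are made up for illustration, and a dictionary stands in for the fixed array of 4^11 slots. Tiles containing anything other than A, C, G, or T are simply skipped, which is an assumption of this sketch.

```python
from collections import defaultdict

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def tile_to_int(tile):
    """Encode a nucleotide tile as a base-4 integer (the slot number).

    Returns None if the tile contains a character other than A, C, G, T.
    """
    value = 0
    for base in tile:
        code = BASE_CODE.get(base)
        if code is None:
            return None
        value = value * 4 + code
    return value

def build_index(target, tile_size=11, step_size=11):
    """Map each tile's slot number to the target positions where it starts."""
    index = defaultdict(list)
    # Walk the target in step_size increments, as described above.
    for pos in range(0, len(target) - tile_size + 1, step_size):
        slot = tile_to_int(target[pos:pos + tile_size])
        if slot is not None:
            index[slot].append(pos)
    return index

# Small tiles keep the example readable; BLAT's nucleotide default is 11.
index = build_index("ACGTACGTACGT", tile_size=4, step_size=4)
print(index[tile_to_int("ACGT")])
```

With tile_size=11 the largest slot number is 4**11 - 1, which matches the 4,194,304 slots mentioned above.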
Standalone BLAT creates the index in memory and uses it for one batch run. It is also helpful in low-memory situations because you can run just one chromosome at a time if you want.
gfServer is designed to keep the sequence in memory in a server that can be queried interactively with gfClient. However, doing this may require more memory.
Batch jobs are usually best done with standalone BLAT.
To query nucleotides, BLAT walks along the query sequence looking up tiles of size tileSize, but stepping along the query with stepSize=1. It uses the target index to find groups of hits (2 or more by default) that lie along the same diagonal; each such group corresponds to a seed or high-scoring pair.
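The lookup step can be sketched roughly as follows. This is a simplified illustration, not BLAT's implementation: the function names and the min_match parameter are hypothetical, tiles are keyed by their string rather than a slot number, and the sketch only checks that hits share a diagonal (target position minus query position). It does not check how close together the hits are.

```python
from collections import defaultdict

def build_index(target, tile_size, step_size):
    """Index the target: tile string -> list of start positions."""
    index = defaultdict(list)
    for pos in range(0, len(target) - tile_size + 1, step_size):
        index[target[pos:pos + tile_size]].append(pos)
    return index

def find_seeds(query, index, tile_size, min_match=2):
    """Group tile hits by diagonal and keep diagonals with enough hits."""
    diagonals = defaultdict(list)
    # The query is walked one base at a time (stepSize=1 on the query side).
    for q_pos in range(len(query) - tile_size + 1):
        tile = query[q_pos:q_pos + tile_size]
        for t_pos in index.get(tile, []):
            diagonals[t_pos - q_pos].append((q_pos, t_pos))
    return {d: hits for d, hits in diagonals.items() if len(hits) >= min_match}

target = "GGGGACGTTGCAGGGG"
index = build_index(target, tile_size=4, step_size=4)
print(find_seeds("ACGTTGCA", index, tile_size=4))
```

Here the query tiles ACGT and TGCA both hit the target on diagonal 4, so together they form one seed.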
Then this seed is run through a banded dynamic programming algorithm, similar to Smith-Waterman/Needleman-Wunsch, to extend the alignment. Further filtering eliminates low-scoring alignments. Exons are chained together and the ends are fine-tuned. (In contrast, the tool BLAST just gives the exons alone.)
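For readers unfamiliar with banded dynamic programming, here is a generic textbook-style sketch of a banded Smith-Waterman score. It is not BLAT's extension code: the function name, the scoring values, and the band width are all assumptions chosen for illustration. The point is only that cells far from the main diagonal are never computed, which keeps the work proportional to sequence length times band width.

```python
def banded_local_score(a, b, band=3, match=1, mismatch=-1, gap=-2):
    """Best local alignment score using only cells with |i - j| <= band."""
    best = 0
    prev = {}  # scores of the previous row, keyed by column j
    for i in range(1, len(a) + 1):
        cur = {}
        # Only visit columns inside the band around the diagonal.
        for j in range(max(1, i - band), min(len(b), i + band) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score = max(
                0,                          # local alignment may restart
                prev.get(j - 1, 0) + sub,   # match or mismatch
                prev.get(j, 0) + gap,       # gap in b
                cur.get(j - 1, 0) + gap,    # gap in a
            )
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best

print(banded_local_score("ACGTACGT", "ACGTACGT"))
```

Cells outside the band are treated as score 0, which can never win over the explicit 0 once the negative gap penalty is added, so they have no effect on the result.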
Then BLAT reverse-complements the query and repeats the same process. If a hit is found, the strand is reported as (-) meaning the query was reverse-complemented.
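As a small illustration of the reverse-complement step, assuming plain A/C/G/T sequences (the function name is made up for this sketch, and other characters are passed through unchanged):

```python
# Translation table for complementing bases, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("AACG"))
```

A match found with the reverse-complemented query is what gets reported with strand (-).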
Normally, for nucleotides, stepSize=tileSize=11. This is the optimal setting for a wide variety of uses.
However, because PCR sequences are so short, we can gain some extra sensitivity by setting stepSize to 5. This means that there are more than twice as many tiles (11/5 = 2.2 times as many) in the target or genome to index, which takes more memory.
BLAT automatically ignores tiles that are "over-used", so it will not index a short sequence that appears too many (thousands of) times.
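The memory cost can be estimated with simple arithmetic. The helper below is a hypothetical illustration, not part of BLAT; it just counts how many tile start positions fit in a target of a given length.

```python
def indexed_tile_count(target_length, tile_size=11, step_size=11):
    """Number of tiles indexed when stepping through the target."""
    if target_length < tile_size:
        return 0
    return (target_length - tile_size) // step_size + 1

length = 1_000_000  # an arbitrary example target length
default_tiles = indexed_tile_count(length, step_size=11)
pcr_tiles = indexed_tile_count(length, step_size=5)
print(default_tiles, pcr_tiles, pcr_tiles / default_tiles)
```

For long targets the ratio approaches 11/5, so the index holds a little over twice as many positions.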
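The idea of dropping over-used tiles can be sketched like this. The function name and the max_occurrences threshold are assumptions for illustration; BLAT's own cutoff and how it applies it may differ.

```python
def drop_overused_tiles(index, max_occurrences):
    """Keep only tiles that occur at most max_occurrences times."""
    return {
        tile: positions
        for tile, positions in index.items()
        if len(positions) <= max_occurrences
    }

# A toy index: tile -> positions in the target.
index = {"ACGT": [0, 4, 8], "TTTT": [1]}
print(drop_overused_tiles(index, max_occurrences=2))
```

Removing such tiles keeps highly repeated sequence from producing huge numbers of uninformative hits.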
So for PCR work you should probably use tileSize=11, stepSize=5.
isPCR has some extensions to support PCR applications, such as the ability to calculate melting points and deal with the alignment job as a pair of ends.
Protein or translated BLAT works a little differently. Various default parameters differ. It has to index both the positive and negative strand of the target. It will still reverse-complement the query and search again. This means it reports the "strand" as any of ++, +-, -+, --. If the first character is (-), this means the query was reverse-complemented. If the second character is (-), it means the target's negative strand index was searched.
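The strand codes described above can be decoded with a small helper. This is only an illustration of the rule as stated here; the function and key names are hypothetical.

```python
def describe_strand(strand):
    """Interpret a strand string such as '+', '-', '++', '+-', '-+', '--'."""
    if len(strand) not in (1, 2) or any(c not in "+-" for c in strand):
        raise ValueError(f"unexpected strand value: {strand!r}")
    # First character: was the query reverse-complemented?
    result = {"query_reverse_complemented": strand[0] == "-"}
    # Second character (translated searches only): which target strand index?
    if len(strand) == 2:
        result["target_negative_strand_searched"] = strand[1] == "-"
    return result

print(describe_strand("-+"))
```

A one-character strand comes from a plain nucleotide search, where only the positive target strand is indexed.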
Although BLAT is typically used for nucleotide or protein targets, recent extensions at UCSC have adapted gfServer/hgBlat to use mRNA targets. This can be useful for a variety of purposes. One of the obvious advantages is that query and target can match better without intron breaks. This can increase sensitivity, and is especially useful with PCR since the probes are often short and if a probe is broken by an intron in the target, it might not be findable in the index. Several human and mouse genomes at UCSC now have PCR searching for genes. Look for target "UCSC genes".