Blat-FAQ

From genomewiki
Revision as of 19:45, 6 January 2010 by Max

This email from Galt Barber to the mailing list in January 2009 should answer most questions about the inner workings of BLAT:

Here is my understanding of how BLAT works. This may be flawed or incomplete, and cannot be taken as official. However, it should still be helpful.

BLAT walks along the target sequence(s) in stepSize increments. At each step it takes a tile, i.e. a run of tileSize nucleotides or amino acids, and adds the current position to that tile's list of hits. This indexes the target, which is usually a genome.

With 4 possible nucleotides at each of 11 (tileSize) positions, there are 4^11 different tiles. This is implemented as an array with 4,194,304 tile slots. Each array element is a list of genome positions where the tile's sequence occurs. For nucleotides, only the positive target strand is indexed.
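
The indexing step can be sketched in Python as follows. This is only an illustration of the idea, not BLAT's actual code: the names build_index and tile_to_slot are made up for this example, and a dictionary stands in for the 4^11-slot array so that the sketch stays small.

```python
from collections import defaultdict

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def tile_to_slot(tile):
    """Encode a nucleotide tile as an integer in [0, 4**len(tile))."""
    slot = 0
    for base in tile:
        slot = slot * 4 + BASE_CODE[base]
    return slot

def build_index(target, tile_size=11, step_size=11):
    """Map each tile slot to the list of target positions where the tile starts.

    Hypothetical helper: a dict replaces the fixed array described in the text.
    """
    index = defaultdict(list)
    for pos in range(0, len(target) - tile_size + 1, step_size):
        tile = target[pos:pos + tile_size]
        # Skip tiles containing anything other than A, C, G, T (e.g. N).
        if all(base in BASE_CODE for base in tile):
            index[tile_to_slot(tile)].append(pos)
    return index
```

With tileSize=11 the slot numbers run from 0 (all A) to 4^11 - 1 (all T), which matches the 4,194,304 slots mentioned above.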

Standalone BLAT creates the index in memory and uses it for one batch run. It is also helpful in low-memory situations because you can run just one chromosome at a time if you want.

gfServer is designed to keep the target index in memory in a server that can be queried interactively with gfClient. However, doing this may require more memory.

Batch jobs are usually best done with standalone blat.

To query nucleotides, BLAT walks along the query sequence looking up tiles of size tileSize, but steps along the query one base at a time (stepSize=1). It uses the target index to find groups of hits (by default 2 or more) along a diagonal; each such group corresponds to a seed or high-scoring pair.
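
A rough Python sketch of this lookup is shown here. It is an assumption-laden illustration rather than BLAT's implementation: index_target and find_seeds are hypothetical names, tiles are keyed by their string, and a "diagonal" is taken to be target position minus query position.

```python
from collections import defaultdict

def index_target(target, tile_size, step_size):
    """Index the target: tile string -> list of start positions."""
    index = defaultdict(list)
    for t_pos in range(0, len(target) - tile_size + 1, step_size):
        index[target[t_pos:t_pos + tile_size]].append(t_pos)
    return index

def find_seeds(query, index, tile_size, min_match=2):
    """Return {diagonal: [(q_pos, t_pos), ...]} for diagonals with >= min_match hits."""
    by_diagonal = defaultdict(list)
    # The query is always stepped one base at a time.
    for q_pos in range(len(query) - tile_size + 1):
        for t_pos in index.get(query[q_pos:q_pos + tile_size], ()):
            by_diagonal[t_pos - q_pos].append((q_pos, t_pos))
    return {d: hits for d, hits in by_diagonal.items() if len(hits) >= min_match}
```

For example, with a target of "GGGGACGTTGCAGGGG" indexed at tile_size=4 and step_size=4, the query "ACGTTGCA" hits the tiles ACGT and TGCA on the same diagonal, so it yields one seed.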

Then this seed is run through a banded dynamic programming algorithm similar to Smith-Waterman/Needleman-Wunsch to extend the alignment. Further filtering eliminates low-scoring alignments. Exons are chained together and the ends are fine-tuned. (In contrast, the tool BLAST just gives the exons alone.)
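
To show what "banded" means, here is a minimal banded Needleman-Wunsch score in Python. It is a generic textbook-style sketch, not BLAT's extension code; the function name, the scoring values and the band width are all assumptions chosen for the example.

```python
def banded_global_score(a, b, band=2, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b, restricted to cells with |i - j| <= band.

    Returns -inf if no alignment fits inside the band.
    """
    neg = float("-inf")
    n, m = len(a), len(b)
    # Row 0: only leading gaps that stay inside the band are allowed.
    prev = [j * gap if j <= band else neg for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [neg] * (m + 1)
        if i <= band:
            cur[0] = i * gap
        # Only fill the cells near the main diagonal.
        for j in range(max(1, i - band), min(m, i + band) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[m]
```

Restricting the computation to a band around the seed's diagonal is what keeps the extension step cheap: the cost grows with the sequence length times the band width rather than with the product of both lengths.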

Then BLAT reverse-complements the query and repeats the same process. If a hit is found, the strand is reported as (-) meaning the query was reverse-complemented.
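
The second pass can be sketched like this in Python. The search argument is a hypothetical stand-in for whatever routine looks the query up in the target index; only the reverse-complement step and the strand labelling are the point here.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def search_both_strands(query, search):
    """Run `search` on the query and on its reverse complement.

    `search` is any callable returning a list of hits (hypothetical here).
    Hits found with the reverse-complemented query are labelled '-'.
    """
    results = [("+", hit) for hit in search(query)]
    results += [("-", hit) for hit in search(reverse_complement(query))]
    return results
```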

Normally, for nucleotides, stepSize=tileSize=11. This is the optimal setting for a wide variety of uses.

However, because PCR sequences are so short, we can gain some extra sensitivity by setting stepSize to 5. This means that there are more than twice as many tiles in the target or genome to index, which takes more memory.
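
The "more than twice as many tiles" figure follows from simple arithmetic, sketched below in Python. The helper name tiles_indexed is made up for this example and it counts tile start positions only, ignoring masking of over-used tiles.

```python
def tiles_indexed(target_length, tile_size=11, step_size=11):
    """Number of tile start positions visited when indexing a target."""
    if target_length < tile_size:
        return 0
    return (target_length - tile_size) // step_size + 1

# For a 1,000,000-base target:
non_overlapping = tiles_indexed(1_000_000, tile_size=11, step_size=11)
overlapping = tiles_indexed(1_000_000, tile_size=11, step_size=5)
ratio = overlapping / non_overlapping  # close to 11 / 5 = 2.2
```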

BLAT automatically ignores tiles that are "over-used", so it will not index a short sequence that appears too many (thousands of) times.

So for PCR work you should probably use tileSize=11, stepSize=5.
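
As a sketch, a standalone run with these settings could be assembled from Python like this. The file names are placeholders, blat_command is a hypothetical helper, and it assumes the usual "blat [options] database query output.psl" form with options written as -name=value.

```python
def blat_command(database, query, output, tile_size=11, step_size=5):
    """Build an argument list for a standalone blat run (file names are placeholders)."""
    return [
        "blat",
        f"-tileSize={tile_size}",
        f"-stepSize={step_size}",
        database,
        query,
        output,
    ]

# Example with made-up file names; pass the list to subprocess.run() to execute it.
cmd = blat_command("genome.2bit", "primers.fa", "primers.psl")
```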

isPCR has some extensions to support PCR applications, such as the ability to calculate melting points and deal with the alignment job as a pair of ends.

Protein or translated BLAT works a little differently. Various default parameters differ. It has to index both the positive and negative strand of the target. It will still reverse-complement the query and search again. This means it reports the "strand" as any of ++, +-, -+, --. If the first character is (-), this means the query was reverse-complemented. If the second character is (-), it means the target's negative strand index was searched.
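
The strand field can be decoded with a few lines of Python. This is just a reading aid based on the description above; describe_strand and its dictionary keys are invented for the example.

```python
def describe_strand(strand):
    """Interpret a strand field: '+' or '-' (nucleotide), or '++', '+-', '-+', '--' (translated)."""
    if len(strand) not in (1, 2) or any(c not in "+-" for c in strand):
        raise ValueError(f"unexpected strand field: {strand!r}")
    # First character: was the query reverse-complemented?
    info = {"query_reverse_complemented": strand[0] == "-"}
    # Second character (translated searches only): which target strand index was searched?
    if len(strand) == 2:
        info["target_negative_strand_index"] = strand[1] == "-"
    return info
```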

Although BLAT is typically used for nucleotide or protein targets, recent extensions at UCSC have adapted gfServer/hgBlat to use RNA targets. This can be useful for a variety of purposes. One of the obvious advantages is that query and target can match better without intron breaks. This can increase sensitivity, and is especially useful with PCR since the probes are often short and if a probe is broken by an intron in the target, it might not be findable in the index. This is also helpful because introns are not considered in the maximum product size (i.e. the maximum distance between probe pairs).

Several human and mouse genomes at UCSC now have PCR searching for genes. Look for target "UCSC genes".

-- Some more info from a post in 2010:

tileSize means the number of contiguous bases in the target database (often a genome assembly) that are used as an index key. A longer tile is more specific, but also uses more RAM. The default value, good for many nucleotide alignments, is 11.

stepSize means the number of bases to shift forward to read the next tile when building the target database index.

Of course the query side is processed as tiles too, but in that case the stepSize is always 1.

Rather than index the negative nucleotide strand, the query is simply reverse-complemented and then run again on the target index.

repMatch refers to the number of hits on a tile before the tile is marked as over-used and becomes masked out.
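
A minimal Python sketch of this masking is given here. The helper name mask_overused is hypothetical, and treating a tile as over-used once it reaches repMatch hits (rather than once it exceeds that count) is an assumption about the exact threshold.

```python
def mask_overused(index, rep_match):
    """Drop tiles whose hit count has reached rep_match.

    `index` maps tile -> list of target positions. The '>=' threshold is assumed.
    """
    return {
        tile: positions
        for tile, positions in index.items()
        if len(positions) < rep_match
    }
```

For example, with rep_match=3 a tile that occurs at five positions is removed from the index, while a tile that occurs twice is kept.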

 -maxGap=N   sets the size of the maximum gap between tiles in a clump. Usually set from 0 to 3.
             Default is 2. Only relevant for minMatch > 1.

 -minMatch=N sets the number of tile matches. Usually set from 2 to 4.
             Default is 2 for nucleotide, 1 for protein.

This is a useful way to filter out noisy, time-consuming hits that are of low value. By requiring two hits along the same diagonal for the query before further expensive processing is done, speed is greatly increased with only a small decrease in sensitivity.

A maxGap of 0 means that the two hits have to be exactly on the same diagonal, whereas increasing maxGap allows it to tolerate small indels of 1 to 3 bp.
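
The interplay of minMatch and maxGap can be sketched in Python as below. This follows the description above, where maxGap is read as the tolerated drift between the diagonals of the hits in a clump; it is an interpretation for illustration, not BLAT's clumping code, and has_clump is a made-up name.

```python
def has_clump(hits, min_match=2, max_gap=2):
    """True if at least min_match hits lie on diagonals within max_gap of each other.

    `hits` is a list of (query_pos, target_pos); diagonal = target_pos - query_pos.
    """
    diagonals = sorted(t_pos - q_pos for q_pos, t_pos in hits)
    # Slide a window of min_match consecutive diagonals over the sorted list.
    for i in range(len(diagonals) - min_match + 1):
        if diagonals[i + min_match - 1] - diagonals[i] <= max_gap:
            return True
    return False
```

Two hits at (0, 100) and (11, 112) sit on diagonals 100 and 101, as would happen with a 1 bp insertion in the target between them. They form a clump with max_gap=1 but not with max_gap=0.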