Blat-FAQ: Difference between revisions

From genomewiki

Revision as of 18:19, 8 March 2010

All of these are from Galt Barber, in emails to the mailing list:

Galt's mental model (Jan 2009)

Here is my understanding of how BLAT works. This may be flawed or incomplete, and cannot be taken as official. However, it should still be helpful.

BLAT walks along the target sequence(s) in stepSize increments. At each step it reads a tile, i.e. a run of tileSize nucleotides or amino acids, and records the position under that tile. This builds an index of the target, which is usually a genome.

With 4 possible nucleotides at each of 11 (tileSize) positions, there are 4^11 different tiles. This is implemented as an array with 4,194,304 tile slots. Each array element is a list of genome positions where the tile's sequence occurs. For nucleotides, only the positive target strand is indexed.
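The indexing step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not BLAT's actual code: the function name `build_tile_index` is made up here, and a dictionary stands in for the fixed array of 4^11 slots.

```python
from collections import defaultdict

def build_tile_index(target, tile_size=11, step_size=11):
    """Map each tile (a substring of length tile_size) to its start positions.

    Hypothetical helper for illustration; BLAT itself uses an array of
    4**tile_size slots rather than a dictionary.
    """
    index = defaultdict(list)
    for pos in range(0, len(target) - tile_size + 1, step_size):
        tile = target[pos:pos + tile_size]
        # Skip tiles containing anything other than the four nucleotides.
        if set(tile) <= set("ACGT"):
            index[tile].append(pos)
    return index

# Tiny example with a short tile so the result is easy to read.
index = build_tile_index("ACGTACGTACGT", tile_size=4, step_size=4)
print(dict(index))   # {'ACGT': [0, 4, 8]}
print(4 ** 11)       # 4194304 possible tiles at tileSize=11
```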

Standalone BLAT creates the index in memory and uses it for one batch run. It is also helpful in low-memory situations because you can run just one chromosome at a time if you want.

gfServer is designed to keep the target index in memory in a server that can be queried interactively with gfClient. However, doing this may require more memory. Batch jobs are usually best done with standalone blat.

To query nucleotides, BLAT walks along the query sequence looking up tiles of size tileSize, but stepping along the query with stepSize=1. It uses the target index to find sets of hits (by default 2 or more) that lie along the same diagonal; each such set corresponds to a seed or high-scoring pair.
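As a rough sketch of the lookup described above: a hit on query position q and target position t lies on diagonal t - q, so hits can be grouped by that difference and kept only when enough of them share a diagonal. The function name `find_seeds` and the hand-built index are assumptions made for this example, not part of BLAT.

```python
from collections import defaultdict

def find_seeds(query, index, tile_size, min_match=2):
    """Group tile hits by diagonal and keep diagonals with >= min_match hits.

    `index` maps a tile string to a list of target start positions.
    Hypothetical helper; exact-diagonal grouping only (no maxGap tolerance).
    """
    diagonals = defaultdict(list)
    # The query is scanned with a step of 1, as described in the text.
    for q_pos in range(len(query) - tile_size + 1):
        tile = query[q_pos:q_pos + tile_size]
        for t_pos in index.get(tile, []):
            diagonals[t_pos - q_pos].append((q_pos, t_pos))
    return {d: hits for d, hits in diagonals.items() if len(hits) >= min_match}

# Target "TTACGTGGCATT" indexed with tile_size=4 and step_size=4 (built by hand).
index = {"TTAC": [0], "GTGG": [4], "CATT": [8]}
seeds = find_seeds("ACGTGGCATT", index, tile_size=4)
print(seeds)  # {2: [(2, 4), (6, 8)]}
```

Both hits fall on diagonal 2 because the query is the target with its first two bases removed.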

Then this seed is run through a banded dynamic programming algorithm similar to Smith-Waterman/Needleman-Wunsch to extend the alignment. Further filtering eliminates low-scoring alignments. Exons are chained together and the ends are fine-tuned. (In contrast, the tool BLAST just gives the exons alone.)

Then BLAT reverse-complements the query and repeats the same process. If a hit is found, the strand is reported as (-) meaning the query was reverse-complemented.
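For readers unfamiliar with the operation, reverse-complementing a nucleotide query can be written as below. The helper name `reverse_complement` is just for illustration, and only the four plain bases are handled (other characters pass through unchanged).

```python
# Translation table for complementing plain DNA bases, upper and lower case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

print(reverse_complement("AACG"))  # CGTT
```

Searching the reverse-complemented query against the same positive-strand index is what lets BLAT avoid indexing the negative nucleotide strand.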

Normally, for nucleotides, stepSize=tileSize=11. This is the optimal setting for a wide variety of uses.

However, because PCR sequences are so short, we can gain some extra sensitivity by setting stepSize to 5. This means that there are more than twice as many tiles in the target or genome to index, which takes more memory.

BLAT automatically ignores tiles that are "over-used", so it will not index a short sequence that appears too many (thousands of) times.

So for PCR work you should probably use tileSize=11, stepSize=5.
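The "more than twice as many tiles" figure follows from simple arithmetic: one tile is indexed every stepSize bases. A small sketch, assuming a made-up target length of one million bases and a hypothetical helper name:

```python
def indexed_tile_count(genome_length, tile_size=11, step_size=11):
    """Number of tile start positions visited when indexing a single sequence."""
    if genome_length < tile_size:
        return 0
    return (genome_length - tile_size) // step_size + 1

default_tiles = indexed_tile_count(1_000_000, tile_size=11, step_size=11)
pcr_tiles = indexed_tile_count(1_000_000, tile_size=11, step_size=5)
print(default_tiles)  # 90909
print(pcr_tiles)      # 199998
print(pcr_tiles / default_tiles > 2)  # True
```

The ratio approaches 11/5 = 2.2 for long targets, which is where the extra memory for stepSize=5 comes from.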

isPCR has some extensions to support PCR applications, such as the ability to calculate melting points and deal with the alignment job as a pair of ends.

Protein or translated blat works a little differently. Various default parameters differ. It has to index both the positive and negative strand of the target. It will still reverse complement the query and search again. This means it reports the "strand" as any of ++, +-, -+, --. If the first character is (-), this means the query was RC'd. If the second char is (-) it means the target's negative strand index was searched.
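The strand codes described above can be decoded mechanically. This is a small sketch with a hypothetical function name; it covers both the one-character codes from nucleotide searches and the two-character codes from protein or translated searches.

```python
def describe_strand(strand):
    """Interpret a BLAT strand field as described in the text above."""
    if strand not in {"+", "-", "++", "+-", "-+", "--"}:
        raise ValueError(f"unexpected strand value: {strand!r}")
    # First character: was the query reverse-complemented?
    info = {"query_reverse_complemented": strand[0] == "-"}
    # Second character (translated searches only): which target strand index was hit?
    if len(strand) == 2:
        info["target_negative_strand"] = strand[1] == "-"
    return info

print(describe_strand("-"))
# {'query_reverse_complemented': True}
print(describe_strand("+-"))
# {'query_reverse_complemented': False, 'target_negative_strand': True}
```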

Although BLAT is typically used for nucleotide or protein targets, recent extensions at UCSC have adapted gfServer/hgBlat to use RNA targets. This can be useful for a variety of purposes. One of the obvious advantages is that query and target can match better without intron breaks. This can increase sensitivity, and is especially useful with PCR since the probes are often short and if a probe is broken by an intron in the target, it might not be findable in the index. This is also helpful because introns are not considered in the maximum product size (i.e. the maximum distance between probe pairs).

Several human and mouse genomes at UCSC now have PCR searching for genes. Look for target "UCSC genes".

Memory consumption

Some more info from a post in 2010 (https://lists.soe.ucsc.edu/pipermail/genome/2009-December/020859.html):

  • tileSize means the number of contiguous bases in the target database (often a genome assembly) that are used as an index key. If it is longer it is more specific, but also uses more RAM. The default value, good for many nucleotide alignments, is 11. stepSize means the number of bases to shift forward to read the next tile when building the target database index. Of course the query side is processed as tiles too, but in that case the stepSize is always 1. Rather than index the negative nucleotide strand, the query is simply reverse-complemented and then run again on the target index.

  • repMatch refers to the number of hits on a tile before the tile is marked as over-used and becomes masked out.

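  • maxGap and minMatch: -minMatch=N sets the number of tile matches required; it is usually set from 2 to 4, and the default is 2 for nucleotide and 1 for protein. -maxGap=N sets the maximum gap between tiles in a clump; it is usually set from 0 to 3, the default is 2, and it is only relevant for minMatch > 1. Together they are a useful way to filter out noisy, time-consuming hits that are of low value. By requiring that there be two hits along the same diagonal for the query before further expensive processing is done, speed is greatly increased with only a small decrease in sensitivity.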

  • A maxGap of 0 means that the two hits have to be exactly on the same diagonal, whereas increasing maxGap allows it to tolerate small indels of 1 to 3 bases.
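One way to picture repMatch and maxGap from the list above is the sketch below. Both helpers are hypothetical and simplified: the exact threshold comparison for repMatch and the real clumping logic inside BLAT are not spelled out in this FAQ, so the choices here (a tile is dropped once it has rep_match or more hits; two hits belong together when their diagonals differ by at most max_gap) are assumptions for illustration only.

```python
def mask_overused_tiles(index, rep_match):
    """Drop tiles whose hit count reaches rep_match (assumed threshold rule)."""
    return {tile: pos for tile, pos in index.items() if len(pos) < rep_match}

def same_clump(hit_a, hit_b, max_gap=2):
    """Return True if two (query_pos, target_pos) hits lie on nearby diagonals.

    With max_gap=0 the diagonals must be identical; a larger max_gap
    tolerates a small indel between the two tiles (assumed interpretation).
    """
    diag_a = hit_a[1] - hit_a[0]
    diag_b = hit_b[1] - hit_b[0]
    return abs(diag_a - diag_b) <= max_gap

index = {"ACGT": [0, 4, 8, 12], "GGCA": [20]}
print(mask_overused_tiles(index, rep_match=3))    # {'GGCA': [20]}

print(same_clump((2, 4), (6, 8), max_gap=0))      # True: same diagonal
print(same_clump((2, 4), (6, 9), max_gap=0))      # False: diagonals differ by 1
print(same_clump((2, 4), (6, 9), max_gap=2))      # True: a 1-base shift is tolerated
```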

Memory consumption II

From an email in March 2010.

  • Higher tileSize increases memory, increases speed, decreases sensitivity slightly.
  • The default tileSize 11 is very good. On rare occasions you see 10 or 12 used. Smaller tileSizes tend to lead to dramatically longer runtime. It's a little complex to state easily in a formula because there are multiple phases internally that each have different characteristics.
  • The default stepSize is just tileSize. This means that you are sampling a position of the genome every stepSize bases.
  • For PCR primer searching, we leave tileSize at 11 and lower stepSize to 5 for increased sensitivity. Of course this will also cause the runtime to grow.
  • Increasing sensitivity means increasing the number of hits, and each hit that has to be explored can take a lot of processing.
  • And of course, whatever generalizations one would make, the real power, speed, and memory required will depend on the characteristics of the genome and the queries, not to mention the several command-line switches that are available.

But luckily the defaults have good performance and sensitivity for a wide range of applications.
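To make the PCR-style settings from the list above concrete, here is a sketch that assembles a standalone blat command line with tileSize=11 and stepSize=5. The file names are placeholders, the helper `build_blat_command` is hypothetical, and the placement of the options after the three positional arguments is an assumption; the command is only built and printed, not run.

```python
import shlex

def build_blat_command(database, query, output, tile_size=11, step_size=None):
    """Assemble an argument list for standalone blat (nothing is executed)."""
    cmd = ["blat", database, query, output, f"-tileSize={tile_size}"]
    # stepSize defaults to tileSize, so only pass it when it differs.
    if step_size is not None and step_size != tile_size:
        cmd.append(f"-stepSize={step_size}")
    return cmd

# Placeholder file names for illustration only.
cmd = build_blat_command("genome.fa", "primers.fa", "out.psl", step_size=5)
print(shlex.join(cmd))
# blat genome.fa primers.fa out.psl -tileSize=11 -stepSize=5
```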

If you are working with short reads, then perhaps one of the many good freely available short-read aligners would be useful.