Blat-FAQ
Latest revision as of 16:52, 5 August 2010
All of these are from Galt Barber, in emails to the mailing list. The formatting is bad, the emails were simply copied in here without too much wiki-formatting...
What is a tile?
A tile is a contiguous set of nucleotides (or amino acids with translated blat). The default DNA tileSize is 11, which means that 11 nucleotides in a row are read and used as a key, either to store or to read information.
When indexing a DNA target genome database, BLAT reads the first tile from position 0, then steps stepSize bases along and reads the next tile (index-key) at position 11. This continues with 22, 33, etc. The default stepSize is set to tileSize. So the default is non-overlapping tiles.
But for extra sensitivity with short primer probes we set stepSize to 5, so in that case the tiles actually overlap: you are taking a key of 11 nucleotides from each of the positions 0, 5, 10, 15, 20, 25, etc.
BLAT does not use "spaced-seeds".
Similarly, when processing the query, BLAT turns it into tiles and positions, but for the query the stepSize is always 1. For each tile of the query, blat does a lookup in the target database index.
And then for most uses, the query is reverse-complemented and the process repeats.
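The tiling described above can be sketched in a few lines of Python. This is only an illustration, assuming a plain dictionary as the index; the function names `build_index` and `lookup_query` are made up for this example and are not part of BLAT.

```python
from collections import defaultdict

def build_index(target, tile_size=11, step_size=11):
    """Map each tile to the list of target positions where it starts.

    Tiles are read at positions 0, step_size, 2*step_size, ...
    """
    index = defaultdict(list)
    for pos in range(0, len(target) - tile_size + 1, step_size):
        index[target[pos:pos + tile_size]].append(pos)
    return index

def lookup_query(index, query, tile_size=11):
    """Step along the query one base at a time (stepSize 1) and
    collect (query_pos, target_pos) pairs for every tile found in the index."""
    hits = []
    for q_pos in range(len(query) - tile_size + 1):
        tile = query[q_pos:q_pos + tile_size]
        for t_pos in index.get(tile, []):
            hits.append((q_pos, t_pos))
    return hits
```

With a small tile size for readability, `build_index("AACCGGTTACGT", tile_size=4, step_size=4)` stores tiles from positions 0, 4 and 8, while `step_size=2` stores overlapping tiles from 0, 2, 4, 6 and 8.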
mental model (July 2010)
This is (...) the way BLAT works.
As you pointed out, this is the smallest size for which blat can find an exact match:
tileSize + stepSize - 1 = 11 + 5 - 1 = 15
- This provides the seed location.
- Then blat takes a block of dna around the seed.
- It searches for exact matches of size minPerfect in the block by direct memory compares.
- Then for each exact match it searches the remaining "good" match, allowing 1 out of 3 bases to be substitutions (but no indels).
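One way to read the formula tileSize + stepSize - 1: if target tiles start at every stepSize-th position (counting from 0), an exact match of that length always fully contains at least one indexed tile, wherever the match starts. The sketch below checks this by brute force. The helper names are made up for illustration, and it looks only at tile placement, not at the later minPerfect or extension steps.

```python
def min_guaranteed_match(tile_size=11, step_size=5):
    """Shortest exact-match length that must contain a whole indexed tile."""
    return tile_size + step_size - 1

def contains_indexed_tile(start, length, tile_size, step_size):
    """True if the window [start, start + length) fully contains a tile
    that begins at a multiple of step_size."""
    # Smallest multiple of step_size that is >= start (ceiling division).
    first_tile = -(-start // step_size) * step_size
    return first_tile + tile_size <= start + length

def always_hits(length, tile_size, step_size):
    """Check every possible offset of the window relative to the tile grid."""
    return all(contains_indexed_tile(s, length, tile_size, step_size)
               for s in range(step_size))
```

For tileSize 11 and stepSize 5, `always_hits(15, 11, 5)` is true while `always_hits(14, 11, 5)` is false, which matches 11 + 5 - 1 = 15.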
mental model (Jan 2009)
Here is my understanding of how BLAT works. This may be flawed or incomplete, and cannot be taken as official. However, it should still be helpful.
BLAT walks along the target sequence(s) in stepSize increments, adding position hits to each tile of size tileSize, e.g. a sequence of nucleotides or amino acids, thus indexing the target, usually a genome.
Each of the 11 (tileSize) positions in a tile holds one of 4 nucleotides, so you can have 4^11 different tiles. This is implemented as an array with 4,194,304 tile slots. Each array element is a list of genome positions where the tile's sequence occurs. For nucleotides, only the positive target strand is indexed.
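To see where the 4,194,304 slots come from, a tile can be treated as an 11-digit number in base 4. The sketch below uses an assumed mapping A=0, C=1, G=2, T=3; the actual encoding inside BLAT may well be different, and `tile_to_slot` is a made-up name.

```python
# Assumption: this base-to-number mapping is for illustration only.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def tile_to_slot(tile):
    """Turn a tile such as 'ACGTACGTACG' into an array slot number
    by reading it as a base-4 number (2 bits per nucleotide)."""
    slot = 0
    for base in tile:
        slot = slot * 4 + BASE_CODE[base]
    return slot

NUM_SLOTS = 4 ** 11  # 4,194,304 possible tiles for tileSize 11
```

Every 11-base tile then maps to a slot between 0 and `NUM_SLOTS - 1`, and the list of genome positions for that tile would live in that slot.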
Standalone BLAT creates the index in memory and uses it for one batch run. It is also helpful in low-memory situations because you can run just one chromosome at a time if you want.
gfServer is designed to keep the target index in memory in a server that can be queried interactively with gfClient. However, doing this may require more memory. Batch jobs are usually best done with standalone blat.
To query nucleotides, BLAT walks along the query sequence looking up tiles of size tileSize, but stepping along the query with stepSize=1. It uses the target index to find lists of (default 2 or more) hits along a diagonal and this corresponds to a seed or high-scoring pair.
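A hit lies on the diagonal given by target position minus query position, so "2 or more hits along a diagonal" can be sketched as a simple count per diagonal. This is a simplified illustration with a made-up function name; it requires the hits to be on exactly the same diagonal and ignores maxGap.

```python
from collections import Counter

def diagonals_with_min_hits(hits, min_match=2):
    """Given (query_pos, target_pos) tile hits, return the diagonals
    (target_pos - query_pos) that collected at least min_match hits."""
    counts = Counter(t_pos - q_pos for q_pos, t_pos in hits)
    return sorted(diag for diag, n in counts.items() if n >= min_match)
```

For example, the hits (0, 100) and (11, 111) both fall on diagonal 100 and would form a seed, whereas a lone hit at (3, 50) would be dropped.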
Then this seed is run through a banded dynamic programming algorithm similar to Smith-Waterman/Needleman-Wunsch to extend the alignment. Further filtering eliminates low-scoring alignments. Exons are chained together and the ends are fine-tuned. (In contrast, the tool BLAST just gives the exons alone.)
Then BLAT reverse-complements the query and repeats the same process. If a hit is found, the strand is reported as (-) meaning the query was reverse-complemented.
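Reverse-complementing the query is a small operation. Here is a minimal sketch for plain A/C/G/T sequences (other characters are left unchanged, which is an assumption of this example rather than a statement about BLAT):

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]
```

Searching `reverse_complement(query)` against the same positive-strand index is what makes a separate negative-strand nucleotide index unnecessary.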
Normally, for nucleotides, stepSize=tileSize=11. This is the optimal setting for a wide variety of uses.
However, because PCR sequences are so short, we can gain some extra sensitivity by setting stepSize to 5. This means that there are more than twice as many tiles in the target or genome to index, which takes more memory.
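The "more than twice as many tiles" figure is just 11/5 = 2.2. A quick way to see it is to count tile start positions for a target of a given length. The function name is made up and the 1,000,000-base target is an arbitrary example.

```python
def count_index_tiles(target_len, tile_size=11, step_size=11):
    """Number of tiles read when stepping through a target of target_len bases."""
    if target_len < tile_size:
        return 0
    return (target_len - tile_size) // step_size + 1

default_tiles = count_index_tiles(1_000_000, 11, 11)  # stepSize = tileSize
pcr_tiles = count_index_tiles(1_000_000, 11, 5)       # stepSize = 5
ratio = pcr_tiles / default_tiles                     # close to 2.2
```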
BLAT automatically ignores tiles that are "over-used", so it will not index a short sequence that appears too many (thousands of) times.
So for PCR work you should probably use tileSize=11, stepSize=5.
isPCR has some extensions to support PCR applications, such as the ability to calculate melting points and deal with the alignment job as a pair of ends.
Protein or translated blat works a little differently. Various default parameters differ. It has to index both the positive and negative strand of the target. It will still reverse complement the query and search again. This means it reports the "strand" as any of ++, +-, -+, --. If the first character is (-), this means the query was RC'd. If the second char is (-) it means the target's negative strand index was searched.
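The strand rules above can be written down as a small helper. This is only a reading aid with a made-up function name; it handles the one-character strand reported for nucleotide searches and the two-character strand reported for translated searches.

```python
def describe_strand(strand):
    """Interpret a strand string such as '-' or '+-' as described in the text."""
    info = {"query_reverse_complemented": strand[0] == "-"}
    if len(strand) == 2:
        # Second character: which target strand index was searched.
        info["target_negative_strand_index"] = strand[1] == "-"
    return info
```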
Although BLAT is typically used for nucleotide or protein targets, recent extensions at UCSC have adapted gfServer/hgBlat to use RNA targets. This can be useful for a variety of purposes. One of the obvious advantages is that query and target can match better without intron breaks. This can increase sensitivity, and is especially useful with PCR since the probes are often short and if a probe is broken by an intron in the target, it might not be findable in the index. This is also helpful because introns are not considered in the maximum product size (i.e. the maximum distance between probe pairs).
Several human and mouse genomes at UCSC now have PCR searching for genes. Look for target "UCSC genes".
Memory consumption
Some more info from a [post in 2010] (https://lists.soe.ucsc.edu/pipermail/genome/2009-December/020859.html):
- tileSize means the number of contiguous bases in the target database (often a genome assembly) that are used as an index key. If it is longer it is more specific, but also uses more RAM. The default value, good for many nucleotide alignments, is 11.
- stepSize means the number of bases to shift forward to read the next tile when building the target database index. Of course the query side is processed as tiles too, but in that case the stepSize is always 1. Rather than index the negative nucleotide strand, the query is simply reverse-complemented and then run again on the target index.
- repMatch refers to the number of hits on a tile before the tile is marked as over-used and becomes masked out.
- maxGap and minMatch: This is a useful way to filter out noisy, time-consuming hits that are of low value. By requiring that there be two hits along the same diagonal for the query before further expensive processing is done, speed is greatly increased with only a small decrease in sensitivity.
- A maxGap of 0 means that the two hits have to be exactly on the same diagonal, whereas increasing maxGap allows it to tolerate small indels of 1 to 3 bases.
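The repMatch idea can be sketched as a filter over a tile index (a dictionary from tile to list of positions, as an assumed data structure). Whether the real cutoff is "at least repMatch" or "more than repMatch" hits is not stated here, so the comparison below is an assumption, and `mask_overused` is a made-up name.

```python
def mask_overused(index, rep_match):
    """Drop tiles whose hit count reaches rep_match.

    index: dict mapping tile -> list of target positions.
    Assumption: a tile is treated as over-used at rep_match or more hits.
    """
    return {tile: positions
            for tile, positions in index.items()
            if len(positions) < rep_match}
```

After masking, a query tile that matches an over-used sequence simply finds nothing in the index, which keeps highly repetitive regions from generating huge numbers of hits.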
Memory consumption II
From an email in March 2010.
- Higher tileSize increases memory, increases speed, decreases sensitivity slightly.
- The default tileSize 11 is very good. On rare occasions you see 10 or 12 used. Smaller tileSizes tend to lead to dramatically longer runtime. It's a little complex to state easily in a formula because there are multiple phases internally that each have different characteristics.
- The default stepSize is just tileSize. This means that you are sampling a position of the genome every stepSize bases.
- For PCR primer searching, we leave tileSize at 11 and lower stepSize to 5 for increased sensitivity. Of course this will also cause the runtime to grow.
- Increasing sensitivity means increasing the number of hits, and each hit that has to be explored can take a lot of processing.
- And of course, whatever generalizations one would make, the real power, speed, and memory required will depend on the characteristics of the genome and the queries, not to mention several command-line switches that are available.
But luckily the defaults have good performance and sensitivity for a wide range of applications.
If you are doing short reads, then perhaps one of the many good freely available short-read aligners would be useful.
Sensitivity
>> -stepSize=5 is less sensitive than the default stepSize.
This does not seem generally true. Of course it may be that blat sees many new things at stepSize 5 compared to 11, but misses a few old things that it used to see. It is after all sampling every 5th position of the target genome instead of every 11th position. That is all.
In general, blat is good for cDNA and RNA of the size you mentioned (100-500bp). However, as Jim pointed out, as the %Identity drops over greater evolutionary distance, it's harder for BLAT to find the exact tile hits, which reduces its sensitivity. Lastz tends to do better for human-rodent distances or greater.
You can try various things to increase BLAT's sensitivity, but you may find that it runs much slower at high-sensitivity settings. This could make it 10x to 100x slower than the default.
Certainly setting -repMatch higher may help with borderline repetitive regions, but again at a time cost.
Here is the default formula for repMatch:
repMatch = 1024 * (tileSize/stepSize).
You can increase it from there.
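As a quick worked example of the formula above (the function name is made up, and whether BLAT rounds the result is not stated here, so the value is left as a float):

```python
def default_rep_match(tile_size=11, step_size=11):
    """repMatch = 1024 * (tileSize / stepSize), as given in the text.

    Assumption: no rounding is applied; round as needed before passing
    the value to -repMatch.
    """
    return 1024 * (tile_size / step_size)
```

With the defaults (tileSize 11, stepSize 11) this gives 1024, and for the PCR-style setting of stepSize 5 it gives about 2252.8.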
You might also run it with or without -fine and see if that helps you get more exons.
You could also try these.
-oneOff=N    If set to 1, this allows one mismatch in a tile and still triggers an alignment. Default is 0.
-minMatch=N  Sets the number of tile matches. Usually set from 2 to 4. Default is 2 for nucleotide, 1 for protein.
-maxGap=N    Sets the size of the maximum gap between tiles in a clump. Usually set from 0 to 3. Default is 2. Only relevant for minMatch > 1.
As noted before, extra sensitivity runs slower:
oneOff=1
minMatch=1
minMatch=2 maxGap=3
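If you want to try several of these settings in a batch, it can help to build the command line programmatically. The sketch below only assembles the argument list and does not run anything. It assumes the usual standalone form of `blat database query output.psl` with `-name=value` options; the file names are placeholders and `build_blat_command` is a made-up helper.

```python
def build_blat_command(database, query, output, **options):
    """Assemble a blat argument list with -name=value options.

    Assumption: positional order is database, query, output, with
    options placed before the output file.
    """
    args = ["blat", database, query]
    for name, value in options.items():
        args.append(f"-{name}={value}")
    args.append(output)
    return args

# Placeholder file names; a higher-sensitivity (and slower) setting.
cmd = build_blat_command("target.fa", "query.fa", "out.psl",
                         stepSize=5, minMatch=1, oneOff=1)
```

The resulting list can be passed to `subprocess.run` once the paths point at real files.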