Blat-FAQ: Difference between revisions

From genomewiki

Revision as of 18:16, 8 March 2010

This email from Galt Barber to the mailing list in January 2009 should answer most questions about the inner workings of BLAT:

Here is my understanding of how BLAT works. This may be flawed or incomplete, and cannot be taken as official. However, it should still be helpful.

BLAT walks along the target sequence(s) in stepSize increments and records the position of each tile it reads. A tile is a run of tileSize nucleotides or amino acids. Collecting these positions per tile indexes the target, which is usually a genome.
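The indexing step described above can be sketched roughly as follows. This is only an illustration, not BLAT's actual code: the function name build_tile_index is hypothetical, tiles are stored as plain strings in a dictionary, and the toy example uses a tile size of 4 to keep the output readable.

```python
from collections import defaultdict


def build_tile_index(target, tile_size=11, step_size=11):
    """Map each tile (substring of length tile_size) to its start positions.

    Hypothetical helper for illustration; BLAT itself uses a packed array.
    """
    index = defaultdict(list)
    # Read one tile every step_size bases and stop once a full tile no longer fits.
    for pos in range(0, len(target) - tile_size + 1, step_size):
        index[target[pos:pos + tile_size]].append(pos)
    return index


# Toy target: the same 4-base tile occurs at positions 0, 4 and 8.
toy_index = build_tile_index("ACGTACGTACGT", tile_size=4, step_size=4)
```

With stepSize equal to tileSize the sampled tiles do not overlap, so each target base is covered by at most one index entry.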

With 4 possible nucleotides at each of 11 (tileSize) positions, there are 4^11 different tiles. This is implemented as an array with 4,194,304 tile slots. Each array element is a list of genome positions where the tile's sequence occurs. For nucleotides, only the positive target strand is indexed.
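The slot count can be checked with a small sketch that packs a tile into an array slot number using two bits per base. The base-to-number mapping below is an assumption chosen for illustration and may not match the order BLAT uses internally; tile_slot is a hypothetical name.

```python
# Assumed 2-bit code per base; the real ordering inside BLAT may differ.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

TILE_SIZE = 11
SLOT_COUNT = 4 ** TILE_SIZE  # 4,194,304 possible tiles of 11 bases


def tile_slot(tile):
    """Return the array slot for a tile by reading it as a base-4 number."""
    slot = 0
    for base in tile:
        slot = slot * 4 + BASE_CODE[base]
    return slot


first_slot = tile_slot("A" * TILE_SIZE)  # lowest slot
last_slot = tile_slot("T" * TILE_SIZE)   # highest slot, SLOT_COUNT - 1
```

Because every tile maps to a distinct slot, looking up a query tile is a single array access rather than a search.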

Standalone BLAT creates the index in memory and uses it for one batch run. It is also helpful in low-memory situations because you can run just one chromosome at a time if you want.

gfServer is designed to keep the target index in memory in a server that can be queried interactively with gfClient. However, doing this may require more memory. Batch jobs are usually best done with standalone BLAT.

To query nucleotides, BLAT walks along the query sequence looking up tiles of size tileSize, but stepping along the query with stepSize=1. It uses the target index to find lists of (default 2 or more) hits along a diagonal and this corresponds to a seed or high-scoring pair.
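A rough sketch of this lookup, under stated assumptions: the function name find_seed_diagonals is hypothetical, a diagonal is defined here as target position minus query position, and the "2 or more" threshold is passed as min_match. Tiles are matched as plain strings, and the toy example uses a tile size of 4.

```python
from collections import defaultdict


def find_seed_diagonals(target, query, tile_size=11, step_size=11, min_match=2):
    """Return {diagonal: [(query_pos, target_pos), ...]} for candidate seeds."""
    # Index the target every step_size bases.
    index = defaultdict(list)
    for t_pos in range(0, len(target) - tile_size + 1, step_size):
        index[target[t_pos:t_pos + tile_size]].append(t_pos)

    # Walk the query one base at a time and look up every query tile.
    hits_by_diagonal = defaultdict(list)
    for q_pos in range(len(query) - tile_size + 1):
        for t_pos in index.get(query[q_pos:q_pos + tile_size], ()):
            hits_by_diagonal[t_pos - q_pos].append((q_pos, t_pos))

    # Keep only diagonals that collected at least min_match tile hits.
    return {d: h for d, h in hits_by_diagonal.items() if len(h) >= min_match}


seeds = find_seed_diagonals("GGCCACGTTGCAAATT", "ACGTTGCA",
                            tile_size=4, step_size=4)
```

In the toy example the query tiles at offsets 0 and 4 hit target positions 4 and 8, so both hits share diagonal 4 and form one seed. Stepping the query by 1 is what lets it line up with target tiles that were sampled only every stepSize bases.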

Then this seed is run through a banded dynamic programming algorithm similar to Smith-Waterman/Needleman-Wunsch to extend the alignment. Further filtering eliminates low-scoring alignments. Exons are chained together and the ends are fine-tuned. (In contrast, the tool BLAST just gives the exons alone.)
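To give a feel for what "banded" means, here is a minimal Smith-Waterman-style local scoring routine that only fills cells within a fixed distance of the main diagonal. It is a generic textbook-style sketch, not BLAT's extension code; the function name, the band width, and the scoring values (match 1, mismatch -1, gap -2) are all assumptions made for the example.

```python
def banded_local_score(a, b, band=3, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b, restricted to |i - j| <= band."""
    best = 0
    prev_row = {}  # scores for row i - 1, keyed by column j
    for i in range(1, len(a) + 1):
        cur_row = {}
        # Only visit columns inside the band around the diagonal.
        for j in range(max(1, i - band), min(len(b), i + band) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            diag = prev_row.get(j - 1, 0) + step  # extend along the diagonal
            up = prev_row.get(j, 0) + gap         # gap in b
            left = cur_row.get(j - 1, 0) + gap    # gap in a
            # Cells outside the band count as 0, i.e. a fresh local start.
            cur_row[j] = max(0, diag, up, left)
            best = max(best, cur_row[j])
        prev_row = cur_row
    return best


identical = banded_local_score("ACGT", "ACGT")
with_insertion = banded_local_score("ACGTACGT", "ACGTTACGT")
```

Restricting the work to a band keeps the cost proportional to sequence length times band width instead of the full product of the two lengths, which is why a seed diagonal is needed first.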

Then BLAT reverse-complements the query and repeats the same process. If a hit is found, the strand is reported as (-) meaning the query was reverse-complemented.
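The strand logic can be illustrated with a small sketch. Exact substring matching stands in for the real tile search here, and both function names are hypothetical.

```python
# Translation table that swaps each base for its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_strand(target, query):
    """Report '+' or '-' depending on which query orientation matches.

    Exact matching is a stand-in for the index search; only the positive
    target strand is ever consulted.
    """
    if query in target:
        return "+"
    if reverse_complement(query) in target:
        return "-"
    return None


strand = find_strand("GGACGTTCC", "AACGT")
```

Here the query AACGT does not occur in the target, but its reverse complement ACGTT does, so the strand is reported as "-". Searching twice with one index avoids storing a second index for the negative strand.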

Normally, for nucleotides, stepSize=tileSize=11. This is the optimal setting for a wide variety of uses.

However, because PCR sequences are so short, we can gain some extra sensitivity by setting stepSize to 5. This means that there are more than twice as many tiles in the target or genome to index, which takes more memory.
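The "more than twice as many tiles" figure follows from simple arithmetic, sketched below for a hypothetical target of one million bases. The helper name is an assumption, and the count ignores tiles skipped for other reasons such as masking.

```python
def indexed_tile_count(target_length, tile_size=11, step_size=11):
    """Number of tiles read when sampling one tile every step_size bases."""
    if target_length < tile_size:
        return 0
    return (target_length - tile_size) // step_size + 1


TARGET_LENGTH = 1_000_000  # hypothetical target size for illustration
default_tiles = indexed_tile_count(TARGET_LENGTH, step_size=11)
pcr_tiles = indexed_tile_count(TARGET_LENGTH, step_size=5)
ratio = pcr_tiles / default_tiles  # close to 11 / 5 = 2.2
```

The ratio approaches 11/5, so the index holds roughly 2.2 times as many position entries when stepSize drops from 11 to 5.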

BLAT automatically ignores tiles that are "over-used", so it will not index a short sequence that appears too many (thousands of) times.

So for PCR work you should probably use tileSize=11, stepSize=5.

isPCR has some extensions to support PCR applications, such as the ability to calculate melting points and deal with the alignment job as a pair of ends.

Protein or translated BLAT works a little differently. Various default parameters differ. It has to index both the positive and negative strand of the target. It will still reverse-complement the query and search again. This means it reports the "strand" as any of ++, +-, -+, --. If the first character is (-), the query was reverse-complemented. If the second character is (-), the target's negative strand index was searched.
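The two-character strand field can be decoded as described above. The function below is a hypothetical helper written for illustration; it is not part of any BLAT tool.

```python
def describe_translated_strand(strand):
    """Decode a two-character strand such as '+-' from translated output."""
    if len(strand) != 2 or any(ch not in "+-" for ch in strand):
        raise ValueError(f"expected two characters from '+-', got {strand!r}")
    return {
        # First character: was the query reverse-complemented?
        "query_reverse_complemented": strand[0] == "-",
        # Second character: was the target's negative strand index searched?
        "target_negative_strand": strand[1] == "-",
    }


decoded = describe_translated_strand("-+")
```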

Although BLAT is typically used for nucleotide or protein targets, recent extensions at UCSC have adapted gfServer/hgBlat to use RNA targets. This can be useful for a variety of purposes. One of the obvious advantages is that query and target can match better without intron breaks. This can increase sensitivity, and is especially useful with PCR since the probes are often short and if a probe is broken by an intron in the target, it might not be findable in the index. This is also helpful because introns are not considered in the maximum product size (i.e. the maximum distance between probe pairs).

Several human and mouse genomes at UCSC now have PCR searching for genes. Look for target "UCSC genes".

--

Some more info from a post in 2010 (https://lists.soe.ucsc.edu/pipermail/genome/2009-December/020859.html):

tileSize means the number of contiguous bases in the target database (often a genome assembly) that are used as an index key. If it is longer it is more specific, but also uses more RAM. The default value, which is good for many nucleotide alignments, is 11. stepSize means the number of bases to shift forward to read the next tile when building the target database index. Of course the query side is processed as tiles too, but in that case the stepSize is always 1. Rather than index the negative nucleotide strand, the query is simply reverse-complemented and then run again on the target index.

repMatch refers to the number of hits on a tile before the tile is marked as over-used and becomes masked out.
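A minimal sketch of this masking, assuming the index is a dictionary from tile to position list. Whether the real cutoff is "more than repMatch" or "repMatch or more" is not stated in the email, so the comparison below is an assumption, as is the function name.

```python
def mask_overused_tiles(index, rep_match):
    """Drop tiles with more than rep_match hits (assumed cutoff rule)."""
    return {
        tile: positions
        for tile, positions in index.items()
        if len(positions) <= rep_match
    }


# Toy index: one tile is highly repeated, the other is nearly unique.
toy_index = {"ACGT": [0, 4, 8, 12, 16], "GGCC": [20]}
masked = mask_overused_tiles(toy_index, rep_match=3)
```

Dropping repeated tiles trades a little sensitivity in repetitive regions for far fewer candidate hits to explore.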

 -maxGap=N   sets the size of maximum gap between tiles in a clump.  Usually set
             from 0 to 3.  Default is 2. Only relevant for minMatch > 1.
 -minMatch=N sets the number of tile matches.  Usually set from 2 to 4.
             Default is 2 for nucleotide, 1 for protein.

Setting minMatch to 2 or more is a useful way to filter out noisy, time-consuming hits that are of low value. By requiring two hits along the same diagonal for the query before further expensive processing is done, speed is greatly increased with only a small decrease in sensitivity.

A maxGap of 0 means that the two hits have to be exactly on the same diagonal, whereas increasing maxGap allows it to tolerate small indels of 1 to 3 bp.
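The interaction of minMatch and maxGap can be sketched as a simple grouping of tile hits by diagonal. This is a simplified model for illustration and not BLAT's clumping code: the function name is hypothetical, hits are (query position, target position) pairs, and a new clump starts whenever the diagonal jumps by more than max_gap from the previous hit.

```python
def clump_hits(hits, min_match=2, max_gap=2):
    """Group (query_pos, target_pos) hits whose diagonals are close together."""
    # Sort by diagonal (target minus query), then by query position.
    ordered = sorted(hits, key=lambda h: (h[1] - h[0], h[0]))
    clumps, current = [], []
    for q_pos, t_pos in ordered:
        diagonal = t_pos - q_pos
        if current:
            last_diagonal = current[-1][1] - current[-1][0]
            if diagonal - last_diagonal > max_gap:
                # Diagonal moved too far: close the clump if it is big enough.
                if len(current) >= min_match:
                    clumps.append(current)
                current = []
        current.append((q_pos, t_pos))
    if len(current) >= min_match:
        clumps.append(current)
    return clumps


# Two tile hits separated by a 2 bp insertion in the target.
indel_hits = [(0, 100), (11, 113)]
strict = clump_hits(indel_hits, max_gap=0)    # diagonals 100 and 102 differ
tolerant = clump_hits(indel_hits, max_gap=2)  # a 2 bp shift is allowed
```

With max_gap=0 the two hits sit on different diagonals and no clump reaches min_match, while max_gap=2 accepts them as one clump.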


Another email from Galt, early 2010:

Higher tileSize increases memory, increases speed, decreases sensitivity slightly.

The default tileSize of 11 is very good. On rare occasions you see 10 or 12 used. Smaller tileSizes tend to lead to dramatically longer runtime. It's a little complex to state easily in a formula because there are multiple phases internally that each have different characteristics.

The default stepSize is just tileSize. This means that you are sampling a position of the genome every stepSize bases.

For PCR primer searching, we leave tileSize at 11 and lower stepSize to 5 for increased sensitivity. Of course this will also cause the runtime to grow.

Increasing sensitivity means increasing the number of hits, and each hit that has to be explored can take a lot of processing.

And of course, whatever generalizations one would make, the real power, speed, and memory required will depend on the characteristics of the genome and the queries, not to mention several command-line switches that are available.

But luckily the defaults have good performance and sensitivity for a wide range of applications.

If you are aligning short reads, then perhaps one of the many good freely available short-read aligners would be useful.

BLAT is free for non-commercial use.