Blat All Genomes: Difference between revisions

From genomewiki
Jump to navigationJump to search
(Created page with " Background: When we use Blat All Genomes feature on hgBlat, it simultaneously searches in parallel with multiple threads the query and possibly the reverse-complemented quer...")
 
No edit summary
 
(7 intermediate revisions by the same user not shown)
Line 6: Line 6:
One of the reasons that it is fast is that most of those gfServer get relatively little traffic. Human and mouse might be busy, but the rest tend to be idle much of the time. So that gives us quick responses from the gfServers.  On top of that, because we avoid doing full alignments that fetch the actual sequence from the target genome (.2bit) and the make an expensive dynamic programming local alignment, instead we process the raw hits that we get back from Blat locally. So it is very fast, although ultimately only produces an approximation of the full alignment that the BLAT client can make.  
One of the reasons that it is fast is that most of those gfServer get relatively little traffic. Human and mouse might be busy, but the rest tend to be idle much of the time. So that gives us quick responses from the gfServers.  On top of that, because we avoid doing full alignments that fetch the actual sequence from the target genome (.2bit) and the make an expensive dynamic programming local alignment, instead we process the raw hits that we get back from Blat locally. So it is very fast, although ultimately only produces an approximation of the full alignment that the BLAT client can make.  


COMBINE ACROSS MULTIPLE QUERY-STRANDS.
COMBINE ACROSS MULTIPLE QUERY-STRANDS
If the result required reverse complement and resubmit,
 
If the result required reverse complement of the query and resubmit,
now there is a best scoring alignment for each strand.
now there is a best scoring alignment for each strand.
Pick the best of these and hide the other result.
Pick the best of these and hide the other result.


SINGLE BIGGEST EXON
SINGLE BIGGEST EXON
At first, we tried just finding the biggest scoring exon, but that did not produce satisfactory results.  
At first, we tried just finding the biggest scoring exon, but that did not produce satisfactory results.  
For instance, a primitive organism which happens to have fewer introns would score higher with a single largest exon.
For instance, a primitive organism which happens to have fewer introns would score higher with a single largest exon.


COUNT ALL HITS
COUNT ALL HITS
Then we tried counting tile hits for all chromosomes no matter what.
Then we tried counting tile hits for all chromosomes no matter what.
Ends up being massive overkill whenenver the user's query happens  
Ends up being massive overkill whenenver the user's query happens  
Line 21: Line 24:


GENE MODEL
GENE MODEL
Breaking it up by chromosome is a start in the right direction,
Breaking it up by chromosome is a start in the right direction,
but we needed to just go ahead and try to construct a simple gene model.
but we needed to just go ahead and try to construct a simple gene model.
Line 34: Line 38:


TRANSLATED SPACE
TRANSLATED SPACE
For handling protein and rnax and dnax, we have multiple frames in protein space.
For handling protein and rnax and dnax, we have multiple frames in protein space.
To deal with merging the results of the 3 frames, I divided the counts by 3 since
To deal with merging the results of the 3 frames, I divided the counts by 3 since
Line 42: Line 47:


WE HIDE THE APPROXIMATE POSITION
WE HIDE THE APPROXIMATE POSITION
We actually have a surprisingly accurate position where the gene lies,
We actually have a surprisingly accurate position where the gene lies,
but it can be truncated by having a missing bit if DNA at
but it can be truncated by having a missing bit if DNA at
Line 47: Line 53:
Rather than confuse users with these approximate positions which users
Rather than confuse users with these approximate positions which users
might have mistakenly cut-and-pasted as final results,
might have mistakenly cut-and-pasted as final results,
we hide the real position and only reveal the actual chromosome
we hide the approximate position and only reveal the actual chromosome
since we know that for sure.
since we know that for sure.


WE ONLY SHOW THE BEST ALIGNMENT FOUND  
WE ONLY SHOW THE BEST ALIGNMENT FOUND  
Just like blat, we find multiple gene alignments, but
Just like blat, we find multiple gene alignments, but
for the Blat All Genomes output, we only show the best single alignment.
for the Blat All Genomes output, we only show the best single alignment.
Line 56: Line 63:
But we would not want anyone to think that the single best alignment we show
But we would not want anyone to think that the single best alignment we show
is the only one that exists.
is the only one that exists.
USERS SEE ALIGNMENT WITH SMALL TILE HIT COUNT
BUT FULL BLAT ALIGNMENT SHOWS NO RESULTS
In a typical case, there were a couple of tiles that were enough
to get BLAT to return some DNA hits, but when it when to actually
make an alignment, it found gaps and mismatches so that the final
score was < 20, and then hgBlat filters those results out, showing the
user nothing.


EXPECTED MAXIMUM TILE COUNTS
EXPECTED MAXIMUM TILE COUNTS
You can figure out from the query size and type and tileSize and stepSize,
You can figure out from the query size and type and tileSize and stepSize,
what the maximum tile count could possibly be for your query.
what the maximum tile count could possibly be for your query.
Line 64: Line 81:


--
--
DNA
DNA
For DNA, our gfServers use tileSize=11 and stepSize=5,
For DNA, our gfServers use tileSize=11 and stepSize=5,
so if L is our DNA length of the input query,
so if L is our DNA length of the input query,
the maximum tile count T we could expect to see is:
the maximum tile count T we could expect to see is:
<pre>
T <= (L - (tileSize-stepSize)) / stepSize
T <= (L - (tileSize-stepSize)) / stepSize
or
or
Line 73: Line 93:


If L = 4017 then T <= (4017-6)/5 == 802.
If L = 4017 then T <= (4017-6)/5 == 802.
 
</pre>
tileSize-stepSize = amount of the last tile that extends beyond the previous tile.
tileSize-stepSize = amount of the last tile that extends beyond the previous tile.


--
--
PROTEIN
PROTEIN


Line 82: Line 103:
we have the tileSize=stepSize=4.
we have the tileSize=stepSize=4.
Each position is 1 of 22 amino acids.
Each position is 1 of 22 amino acids.
 
<pre>
T <= L / stepSize
T <= L / stepSize
or
or
Line 88: Line 109:


If L = 1339 then T <= 1339/4 == 334.
If L = 1339 then T <= 1339/4 == 334.
</pre>


--
--


RNAX, DNAX
RNAX, DNAX
DNAX is just RNAX repeated  with a reverse-complemented query.
DNAX is just RNAX repeated  with a reverse-complemented query.
So the results are the same for both.
So the results are the same for both.
Line 97: Line 120:
You have to convert the query to protein though, which means dividing
You have to convert the query to protein though, which means dividing
the query DNA Length L by 3.
the query DNA Length L by 3.
 
<pre>
T <= (L/3)/stepSize = (L/3)/4 = L/12
T <= (L/3)/stepSize = (L/3)/4 = L/12


T <= 4017/12 = 334
T <= 4017/12 = 334
</pre>
---
Normalizing between dna and dnax?
We could normalize them to DNA span.
<pre>
DNA
T * stepSize = 802*5 = 4010
DNAX,RNAX,PROT
T * stepSize * bases-per-codon = ((334)*4) * 3 = 4008
</pre>

Latest revision as of 19:47, 1 May 2019

Background:

When we use Blat All Genomes feature on hgBlat, it simultaneously searches in parallel with multiple threads the query and possibly the reverse-complemented query against all of the default assemblies which have gfServers.

One of the reasons that it is fast is that most of those gfServer get relatively little traffic. Human and mouse might be busy, but the rest tend to be idle much of the time. So that gives us quick responses from the gfServers. On top of that, because we avoid doing full alignments that fetch the actual sequence from the target genome (.2bit) and the make an expensive dynamic programming local alignment, instead we process the raw hits that we get back from Blat locally. So it is very fast, although ultimately only produces an approximation of the full alignment that the BLAT client can make.

COMBINE ACROSS MULTIPLE QUERY-STRANDS

If the result required reverse complement of the query and resubmit, now there is a best scoring alignment for each strand. Pick the best of these and hide the other result.

SINGLE BIGGEST EXON

At first, we tried just finding the biggest scoring exon, but that did not produce satisfactory results. For instance, a primitive organism which happens to have fewer introns would score higher with a single largest exon.

COUNT ALL HITS

Then we tried counting tile hits for all chromosomes no matter what. Ends up being massive overkill whenenver the user's query happens to contain some repetitive tiles.

GENE MODEL

Breaking it up by chromosome is a start in the right direction, but we needed to just go ahead and try to construct a simple gene model. Most real genes have coordinates which are increasing and non-overlapping for the exons in a gene, and we already in BLAT use a gap of more than 1MB to terminate a gene. So it was fairly easy to construct that algorithm, and it was working fairly well. The find-the-best gene-model code is ripping fast and runs in parallel too.

I discovered that the algorithm is so effective that small remaining inaccuracy in the gene model could only be improved by loading actual DNA so it could see mismatches while scoring one possible exon over another -- however we do not want to do that because of speed issues.

TRANSLATED SPACE

For handling protein and rnax and dnax, we have multiple frames in protein space. To deal with merging the results of the 3 frames, I divided the counts by 3 since otherwise you would get nearly all of the output triplicated if there were only perfect matching tiles. Also, it keeps track of the position on the chromosome where the gene model lies, especially the span from the start of the first exon to the end of the last exon. So it pick the lowest start value and the highest end value among frames to figure out the position.

WE HIDE THE APPROXIMATE POSITION

We actually have a surprisingly accurate position where the gene lies, but it can be truncated by having a missing bit if DNA at the start and end that the fuller dynamic alignment would have found. Rather than confuse users with these approximate positions which users might have mistakenly cut-and-pasted as final results, we hide the approximate position and only reveal the actual chromosome since we know that for sure.

WE ONLY SHOW THE BEST ALIGNMENT FOUND

Just like blat, we find multiple gene alignments, but for the Blat All Genomes output, we only show the best single alignment. It is a little bit like the Feeling Lucky button, in that limited sense. But we would not want anyone to think that the single best alignment we show is the only one that exists.

USERS SEE ALIGNMENT WITH SMALL TILE HIT COUNT BUT FULL BLAT ALIGNMENT SHOWS NO RESULTS

In a typical case, there were a couple of tiles that were enough to get BLAT to return some DNA hits, but when it when to actually make an alignment, it found gaps and mismatches so that the final score was < 20, and then hgBlat filters those results out, showing the user nothing.

EXPECTED MAXIMUM TILE COUNTS

You can figure out from the query size and type and tileSize and stepSize, what the maximum tile count could possibly be for your query. Here we are assuming that there are NO GAPS and NO MISMATCHES, and NO REPEAT MASKING, all of which lower the score.

--

DNA

For DNA, our gfServers use tileSize=11 and stepSize=5, so if L is our DNA length of the input query, the maximum tile count T we could expect to see is:

T <= (L - (tileSize-stepSize)) / stepSize
or
T <= (L - (11 - 5))/5  =  (L-6)/5.

If L = 4017 then T <= (4017-6)/5 == 802.

tileSize-stepSize = amount of the last tile that extends beyond the previous tile.

--

PROTEIN

For TRANSLATED queries like protein, we have the tileSize=stepSize=4. Each position is 1 of 22 amino acids.

T <= L / stepSize
or
T <= L / 4.

If L = 1339 then T <= 1339/4 == 334.

--

RNAX, DNAX

DNAX is just RNAX repeated with a reverse-complemented query. So the results are the same for both.

You have to convert the query to protein though, which means dividing the query DNA Length L by 3.

T <= (L/3)/stepSize = (L/3)/4 = L/12

T <= 4017/12 = 334

---

Normalizing between dna and dnax? We could normalize them to DNA span.

DNA
T * stepSize = 802*5 = 4010

DNAX,RNAX,PROT
T * stepSize * bases-per-codon = ((334)*4) * 3 = 4008