Blat All Genomes: Difference between revisions

Latest revision as of 19:47, 1 May 2019

Background:

When we use Blat All Genomes feature on hgBlat, it simultaneously searches in parallel with multiple threads the query and possibly the reverse-complemented query against all of the default assemblies which have gfServers.

One of the reasons that it is fast is that most of those gfServer get relatively little traffic. Human and mouse might be busy, but the rest tend to be idle much of the time. So that gives us quick responses from the gfServers. On top of that, because we avoid doing full alignments that fetch the actual sequence from the target genome (.2bit) and the make an expensive dynamic programming local alignment, instead we process the raw hits that we get back from Blat locally. So it is very fast, although ultimately only produces an approximation of the full alignment that the BLAT client can make.

COMBINE ACROSS MULTIPLE QUERY-STRANDS

If the result required reverse complement of the query and resubmit, now there is a best scoring alignment for each strand. Pick the best of these and hide the other result.

SINGLE BIGGEST EXON

At first, we tried just finding the biggest scoring exon, but that did not produce satisfactory results. For instance, a primitive organism which happens to have fewer introns would score higher with a single largest exon.

COUNT ALL HITS

Then we tried counting tile hits for all chromosomes no matter what. Ends up being massive overkill whenenver the user's query happens to contain some repetitive tiles.

GENE MODEL

Breaking it up by chromosome is a start in the right direction, but we needed to just go ahead and try to construct a simple gene model. Most real genes have coordinates which are increasing and non-overlapping for the exons in a gene, and we already in BLAT use a gap of more than 1MB to terminate a gene. So it was fairly easy to construct that algorithm, and it was working fairly well. The find-the-best gene-model code is ripping fast and runs in parallel too.

I discovered that the algorithm is so effective that small remaining inaccuracy in the gene model could only be improved by loading actual DNA so it could see mismatches while scoring one possible exon over another -- however we do not want to do that because of speed issues.

TRANSLATED SPACE

For handling protein and rnax and dnax, we have multiple frames in protein space. To deal with merging the results of the 3 frames, I divided the counts by 3 since otherwise you would get nearly all of the output triplicated if there were only perfect matching tiles. Also, it keeps track of the position on the chromosome where the gene model lies, especially the span from the start of the first exon to the end of the last exon. So it pick the lowest start value and the highest end value among frames to figure out the position.

WE HIDE THE APPROXIMATE POSITION

We actually have a surprisingly accurate position where the gene lies, but it can be truncated by having a missing bit if DNA at the start and end that the fuller dynamic alignment would have found. Rather than confuse users with these approximate positions which users might have mistakenly cut-and-pasted as final results, we hide the approximate position and only reveal the actual chromosome since we know that for sure.

WE ONLY SHOW THE BEST ALIGNMENT FOUND

Just like blat, we find multiple gene alignments, but for the Blat All Genomes output, we only show the best single alignment. It is a little bit like the Feeling Lucky button, in that limited sense. But we would not want anyone to think that the single best alignment we show is the only one that exists.

USERS SEE ALIGNMENT WITH SMALL TILE HIT COUNT BUT FULL BLAT ALIGNMENT SHOWS NO RESULTS

In a typical case, there were a couple of tiles that were enough to get BLAT to return some DNA hits, but when it when to actually make an alignment, it found gaps and mismatches so that the final score was < 20, and then hgBlat filters those results out, showing the user nothing.

EXPECTED MAXIMUM TILE COUNTS

You can figure out from the query size and type and tileSize and stepSize, what the maximum tile count could possibly be for your query. Here we are assuming that there are NO GAPS and NO MISMATCHES, and NO REPEAT MASKING, all of which lower the score.

--

DNA

For DNA, our gfServers use tileSize=11 and stepSize=5, so if L is our DNA length of the input query, the maximum tile count T we could expect to see is:

T <= (L - (tileSize-stepSize)) / stepSize
or
T <= (L - (11 - 5))/5  =  (L-6)/5.

If L = 4017 then T <= (4017-6)/5 == 802.

tileSize-stepSize = amount of the last tile that extends beyond the previous tile.

--

PROTEIN

For TRANSLATED queries like protein, we have the tileSize=stepSize=4. Each position is 1 of 22 amino acids.

T <= L / stepSize
or
T <= L / 4.

If L = 1339 then T <= 1339/4 == 334.

--

RNAX, DNAX

DNAX is just RNAX repeated with a reverse-complemented query. So the results are the same for both.

You have to convert the query to protein though, which means dividing the query DNA Length L by 3.

T <= (L/3)/stepSize = (L/3)/4 = L/12

T <= 4017/12 = 334

---

Normalizing between dna and dnax? We could normalize them to DNA span.

DNA
T * stepSize = 802*5 = 4010

DNAX,RNAX,PROT
T * stepSize * bases-per-codon = ((334)*4) * 3 = 4008

Blat All Genomes: Difference between revisions

Latest revision as of 19:47, 1 May 2019

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

related sites

hosted projects

Tools

@@ Line 6: / Line 6: @@
 One of the reasons that it is fast is that most of those gfServer get relatively little traffic. Human and mouse might be busy, but the rest tend to be idle much of the time. So that gives us quick responses from the gfServers.  On top of that, because we avoid doing full alignments that fetch the actual sequence from the target genome (.2bit) and the make an expensive dynamic programming local alignment, instead we process the raw hits that we get back from Blat locally. So it is very fast, although ultimately only produces an approximation of the full alignment that the BLAT client can make.
-COMBINE ACROSS MULTIPLE QUERY-STRANDS.
+COMBINE ACROSS MULTIPLE QUERY-STRANDS
-If the result required reverse complement and resubmit,
+If the result required reverse complement of the query and resubmit,
 now there is a best scoring alignment for each strand.
 Pick the best of these and hide the other result.
 SINGLE BIGGEST EXON
 At first, we tried just finding the biggest scoring exon, but that did not produce satisfactory results.
 For instance, a primitive organism which happens to have fewer introns would score higher with a single largest exon.
 COUNT ALL HITS
 Then we tried counting tile hits for all chromosomes no matter what.
 Ends up being massive overkill whenenver the user's query happens
@@ Line 21: / Line 24: @@
 GENE MODEL
 Breaking it up by chromosome is a start in the right direction,
 but we needed to just go ahead and try to construct a simple gene model.
@@ Line 34: / Line 38: @@
 TRANSLATED SPACE
 For handling protein and rnax and dnax, we have multiple frames in protein space.
 To deal with merging the results of the 3 frames, I divided the counts by 3 since
@@ Line 42: / Line 47: @@
 WE HIDE THE APPROXIMATE POSITION
 We actually have a surprisingly accurate position where the gene lies,
 but it can be truncated by having a missing bit if DNA at
@@ Line 47: / Line 53: @@
 Rather than confuse users with these approximate positions which users
 might have mistakenly cut-and-pasted as final results,
-we hide the real position and only reveal the actual chromosome
+we hide the approximate position and only reveal the actual chromosome
 since we know that for sure.
 WE ONLY SHOW THE BEST ALIGNMENT FOUND
 Just like blat, we find multiple gene alignments, but
 for the Blat All Genomes output, we only show the best single alignment.
@@ Line 56: / Line 63: @@
 But we would not want anyone to think that the single best alignment we show
 is the only one that exists.
+USERS SEE ALIGNMENT WITH SMALL TILE HIT COUNT
+BUT FULL BLAT ALIGNMENT SHOWS NO RESULTS
+In a typical case, there were a couple of tiles that were enough
+to get BLAT to return some DNA hits, but when it when to actually
+make an alignment, it found gaps and mismatches so that the final
+score was < 20, and then hgBlat filters those results out, showing the
+user nothing.
 EXPECTED MAXIMUM TILE COUNTS
 You can figure out from the query size and type and tileSize and stepSize,
 what the maximum tile count could possibly be for your query.
@@ Line 64: / Line 81: @@
 --
 DNA
 For DNA, our gfServers use tileSize=11 and stepSize=5,
 so if L is our DNA length of the input query,
 the maximum tile count T we could expect to see is:
+<pre>
 T <= (L - (tileSize-stepSize)) / stepSize
 or
@@ Line 73: / Line 93: @@
 If L = 4017 then T <= (4017-6)/5 == 802.
+</pre>
 tileSize-stepSize = amount of the last tile that extends beyond the previous tile.
 --
 PROTEIN
@@ Line 82: / Line 103: @@
 we have the tileSize=stepSize=4.
 Each position is 1 of 22 amino acids.
+<pre>
 T <= L / stepSize
 or
@@ Line 88: / Line 109: @@
 If L = 1339 then T <= 1339/4 == 334.
+</pre>
 --
 RNAX, DNAX
 DNAX is just RNAX repeated  with a reverse-complemented query.
 So the results are the same for both.
@@ Line 97: / Line 120: @@
 You have to convert the query to protein though, which means dividing
 the query DNA Length L by 3.
+<pre>
 T <= (L/3)/stepSize = (L/3)/4 = L/12
 T <= 4017/12 = 334
+</pre>
+---
+Normalizing between dna and dnax?
+We could normalize them to DNA span.
+<pre>
+DNA
+T * stepSize = 802*5 = 4010
+DNAX,RNAX,PROT
+T * stepSize * bases-per-codon = ((334)*4) * 3 = 4008
+</pre>