ExonMostlyInitialDesignMeeting: Difference between revisions

(At this point just pointing to ExonMostlyInitialDesignMeetingWhiteboard which I'm about to add...)
 
(Some higher-level issues, things that didn't make it onto the whiteboard, and follow-on thoughts from the initial meeting.)
Revision as of 19:10, 9 September 2014

On Sep 3, 2014, Jim, Galt and Angie met to discuss some implementation issues for the "exon-mostly" display. ExonMostlyInitialDesignMeetingWhiteboard has fuzzy snapshots of the whiteboard and a transcription of the text (as well as Angie could make out).

Performance considerations

When viewing a transcript with N exons, if each track's loading code makes a separate query for each exon region, then there will be N times as many queries as before. For hg19.knownGene, average N is ~9.

  • Will an order of magnitude increase in the number of mysql queries and/or big file queries cause performance problems?
  • What is the distance between regions at which it becomes more efficient to do one query over the regions and everything between them than to do separate queries per region? This distance will most likely be different for mysql vs. bigFile and for huge-table mysql (like hg19.snp138) vs. small-table mysql (like hg19.gap).

Galt has the most experience with the thread-unsafeness of mysql and the incompatibilities of pthreads and forking, and will do some experiments to characterize how many simultaneous threads, processes, mysql requests, etc. are optimal for performance.

Display

We will use some kind of vertical marks to show the boundaries of regions. We most likely will want to show a few bases of padding on either side of exons, to see splice sites and have a little visual separation.

  • Would a user sometimes want to see 500 bp upstream of the TSS too (or even 2 kb)?
  • What is the upper limit for how many distinct regions we can display? If the image (ignoring left label) is 1000 px wide and each separator is 3 px wide, then at 167 regions roughly half of the width is taken up by separators. Extreme cases: hg19.knownGene item uc031qqx.1 (antibody parts) on chr14 has exonCount=5065! hg19.refGene's current max exonCount is 363 (NM_001267550/TTN on chr2).
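
The separator arithmetic in the last bullet can be checked with a small helper. This is only a sketch: the function name separatorFraction is made up for illustration, and it assumes one separator between each adjacent pair of regions (none at the outer edges).

```c
#include <assert.h>

/* Fraction of the image width (excluding the left label area) taken up by
 * separators when regionCount regions are drawn side by side.
 * Assumption: one separator between each adjacent pair of regions. */
double separatorFraction(int regionCount, int separatorWidth, int pixelWidth)
{
    assert(regionCount >= 1 && separatorWidth >= 0 && pixelWidth > 0);
    return (double)separatorWidth * (regionCount - 1) / pixelWidth;
}
```

With 167 regions, 3 px separators and a 1000 px image this gives 498/1000, i.e. just under half. A display limit could be expressed as a cap on this fraction rather than as a fixed region count.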

Regions, regions, regions

There may be several incarnations of "the region list", and different parts of the code will have to be sure to use the right one:

  1. user/logical regions: exons (or if we're ambitious, unconstrained genomic regions)
  2. displayed regions: we might want to pad exons with a few bases on either side, and then merge regions that overlap (or that are so close to each other that the separator would waste space), and possibly clip to a zoomed subset of user/logical regions
  3. fetched regions: some displayed regions may be close enough so that it would be more efficient to do one mysql query covering multiple regions instead of per-region mysql queries
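
Going from one region list to the next is essentially "pad, then merge anything that overlaps or is close enough". Below is a minimal sketch of that step under some assumptions: struct region and padAndMerge are hypothetical names (not existing library code), coordinates are half-open, the input is sorted by start, and clipping to the chromosome end is left out.

```c
#include <assert.h>

/* Half-open genomic interval [start, end) in bases. */
struct region { unsigned start, end; };

/* Pad each region by `pad` bases on both sides, then merge, in place, any
 * regions that overlap or lie within `maxGap` bases of each other.
 * The input must be sorted by start.  Returns the new region count. */
int padAndMerge(struct region *regions, int count, unsigned pad, unsigned maxGap)
{
    if (count <= 0)
        return 0;
    int out = 0;    /* index of the merged region currently being built */
    for (int i = 0; i < count; i++) {
        /* Read the input region before any write can touch slot i. */
        unsigned start = regions[i].start > pad ? regions[i].start - pad : 0;
        unsigned end = regions[i].end + pad;
        if (i > 0 && start <= regions[out].end + maxGap) {
            /* Close enough to the previous merged region: extend it. */
            if (end > regions[out].end)
                regions[out].end = end;
        } else {
            if (i > 0)
                out++;
            regions[out].start = start;
            regions[out].end = end;
        }
    }
    return out + 1;
}
```

The same routine could serve twice: once with a small pad and a small maxGap to turn user/logical regions into displayed regions, and once with pad 0 and a larger maxGap (whatever the performance experiments suggest) to turn displayed regions into fetched regions.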

Text

If it weren't for left item labels, we could just render the entire transcript region and then display only some vertical slices of that image. However, we still need those labels and they may extend into region(s) to the left of an item. (Consider a gene's left label, or a SNP that falls near the beginning of an exon.) Therefore rendering of text, and packing of items, must be done in post-slice pixel coordinates.

For labels that appear outside the item, it would be good to have a function that takes chromStart, chromEnd, and text, and then translates chromStart (or chromEnd) into a post-slice pixel offset and draws the text relative to that offset. For labels that appear inside an item, a similar function could center the text on the post-slice pixel center between post-slice pixel offsets for chromStart and chromEnd.
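
Once chromStart and chromEnd have been translated to post-slice pixel offsets, the label placement itself is simple arithmetic. The sketch below covers only that last step; the function names, the gapPx parameter and the assumption that the text width in pixels is already known are all illustrative.

```c
#include <assert.h>

/* X coordinate at which to start drawing a label placed to the left of an
 * item, right-aligned against the item's post-slice start pixel with a
 * gap of gapPx pixels.  The result may land in a displayed region to the
 * left of the item, which is why this works in post-slice coordinates. */
int labelLeftOfItemX(int itemStartPx, int textWidthPx, int gapPx)
{
    return itemStartPx - gapPx - textWidthPx;
}

/* X coordinate at which to start drawing a label centered inside an item
 * whose post-slice pixel extent is [itemStartPx, itemEndPx). */
int labelInsideItemX(int itemStartPx, int itemEndPx, int textWidthPx)
{
    assert(itemEndPx >= itemStartPx);
    int centerPx = itemStartPx + (itemEndPx - itemStartPx) / 2;
    return centerPx - textWidthPx / 2;
}
```

For an item spanning several displayed regions, the "inside" label would presumably be centered within one of the item's pixel ranges rather than across a separator; that choice is not settled here.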

Regions to pixels

Displayed regions will have a well-defined mapping to pixel X coordinate ranges. This mapping could be implemented efficiently using a chromosome-range tree structure. Any given genomic position range could map to 0 pixel ranges (not in any displayed region), one pixel range when the position range is a subset of one displayed region, or more than one pixel range when the position range spans multiple displayed regions. For example, if an assembly contig spans all exons of a gene, then it would be rendered in all displayed regions / pixel ranges.
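
To make the zero / one / many cases concrete, here is a sketch of the mapping using a plain linear scan over the displayed regions instead of a range tree (reasonable for tens of regions; a tree would matter for thousands). The struct and function names are hypothetical, regions are assumed sorted and non-overlapping, and pixel positions are kept as doubles so rounding can be decided later.

```c
#include <assert.h>

struct region { unsigned start, end; };   /* half-open [start, end) in bases */
struct pixelRange { double x1, x2; };     /* half-open [x1, x2) in pixels */

/* Map the genomic range [start, end) onto the displayed regions.
 * Writes up to maxOut pixel ranges to `out` and returns how many were
 * written: 0 if the range misses every displayed region, 1 if it touches
 * one region, more if it spans several. */
int rangeToPixels(const struct region *regions, int regionCount,
                  double pixelsPerBase, int separatorWidth,
                  unsigned start, unsigned end,
                  struct pixelRange *out, int maxOut)
{
    int found = 0;
    double regionX = 0.0;   /* left pixel edge of the current region */
    for (int i = 0; i < regionCount; i++) {
        /* Intersect the query range with this region. */
        unsigned s = start > regions[i].start ? start : regions[i].start;
        unsigned e = end < regions[i].end ? end : regions[i].end;
        if (s < e && found < maxOut) {
            out[found].x1 = regionX + (s - regions[i].start) * pixelsPerBase;
            out[found].x2 = regionX + (e - regions[i].start) * pixelsPerBase;
            found++;
        }
        /* Advance past this region and the separator that follows it. */
        regionX += (regions[i].end - regions[i].start) * pixelsPerBase
                   + separatorWidth;
    }
    return found;
}
```

An assembly contig spanning every exon would come back with one pixel range per displayed region, each covering that region's full width.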

The same pixel-scaling factor should be used in all displayed regions so all items are drawn at the same scale even if the regions are of different sizes. With separatorWidth being the width of whatever vertical separator we draw between displayed regions, and pixelWidth being the width of the image excluding the left label area:

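 uint totalSeparatorWidth = separatorWidth * (slCount(displayedRegions) - 1);
 uint totalBasesInRegions = sumLengths(displayedRegions);
 /* Cast before dividing: with integer operands the division would truncate (to 0 whenever bases outnumber pixels). */
 double pixelsPerBase = (double)(pixelWidth - totalSeparatorWidth) / totalBasesInRegions;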

So for example, if our separator is 3px wide, the image area is 1000 pixels wide, and there are 10 displayedRegions that sum to 25000 bases, then pixelsPerBase works out like this:

 totalSeparatorWidth = 3 * (10 - 1) = 27;
 totalBasesInRegions = 25000;
 pixelsPerBase = (1000 - 27) / 25000 = 0.038920;
  • Will we need to account for rounding at edges of displayed regions? Small items at the end of the position range sometimes fall past the rightmost pixel in hgTracks and don't appear in the image; could we have the same problem here?
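
One possible way to handle the rounding question is to clamp each item to its region's pixel width and force a minimum width of one pixel. The sketch below assumes the item has already been clipped to a single displayed region; the function name and parameters are hypothetical.

```c
#include <assert.h>

/* Convert an item's extent [itemStart, itemEnd), which must lie within a
 * displayed region starting at regionStart, to whole pixel columns
 * [*x1, *x2) relative to the region's left edge.  The result is clamped
 * to the region's pixel width and widened to at least one pixel, so a
 * small item at the far right edge is still drawn. */
void itemToPixelColumns(unsigned regionStart, int regionWidthPx,
                        unsigned itemStart, unsigned itemEnd,
                        double pixelsPerBase, int *x1, int *x2)
{
    assert(regionWidthPx >= 1);
    assert(itemStart >= regionStart && itemEnd > itemStart);
    /* Casting a non-negative double to int truncates, i.e. rounds down. */
    int left = (int)((itemStart - regionStart) * pixelsPerBase);
    int right = (int)((itemEnd - regionStart) * pixelsPerBase);
    if (right > regionWidthPx)
        right = regionWidthPx;
    if (left >= regionWidthPx)
        left = regionWidthPx - 1;
    if (right <= left)
        right = left + 1;
    *x1 = left;
    *x2 = right;
}
```

Using the scale from the example above (0.03892 pixels per base), a 2500-base region is about 97.3 px wide, so a SNP on its last base would compute to column 97 of a 97-column region and fall off the edge; the clamp pulls it back to column 96.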