Graph data format discussion

Latest revision as of 19:44, 5 November 2007

The wiggle format designed by UCSC for graphical display in the browser provides high performance and configurability to the user. However, it does not maintain the original data beyond a few digits of precision. Although most data has no significance beyond these digits, users have been confused when data extracted from a UCSC wiggle table or file does not match the submitted data.

It has been proposed (e.g. in the ENCODE DCC grant proposal) that we design a new graphing data format that will preserve the submitted data exactly. Lacking this, we have promised to provide for download an additional copy of each ENCODE dataset as originally submitted. This has the disadvantage of doubling the disk space required, and also raises the risk of out-of-sync copies on the site (already a problem with our existing wiggle downloads).

A few thoughts on what a new data format might look like (from recent discussions) are included here:

  • A simple binary file that represents each data point in 16 bits would capture all the precision necessary.
  • Memory loading for this would be efficient, but to assure adequate performance for large regions (e.g. >1MB), a summary table or file could be precomputed using our existing graph averaging methods (e.g. mean, median, max).
  • The existing ASCII input wiggle formats could be maintained; wigEncode could be expanded to generate the simple graph binary format instead of wiggle binary.
  • Code with similar functionality already exists in hgGenome.
  • NOTE: Nearly all of the high-density ENCODE graphing data provided so far fits one of two patterns:
      - one value per base across the genome (or ENCODE region)
      - fixed-width regions (e.g. 22mers) tiled across the genome
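
To make the 16-bit idea and the precomputed summary concrete, here is a minimal Python sketch. It is not an existing UCSC format or tool: the linear mapping of values onto 0..65535, the little-endian byte order, and the function names `encode16`, `decode16`, and `summarize` are all assumptions made for illustration. The discussion above does not specify how values would be mapped to 16 bits.

```python
import struct

MAX_CODE = 65535  # largest value an unsigned 16-bit integer can hold


def encode16(values, lo, hi):
    """Pack floats as little-endian uint16 codes (assumed layout).

    Each value is clamped to [lo, hi] and mapped linearly onto 0..65535.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scale = MAX_CODE / (hi - lo)
    codes = [min(MAX_CODE, max(0, round((v - lo) * scale))) for v in values]
    return struct.pack("<%dH" % len(codes), *codes)


def decode16(blob, lo, hi):
    """Recover approximate floats from a blob written by encode16."""
    count = len(blob) // 2
    step = (hi - lo) / MAX_CODE
    return [lo + code * step for code in struct.unpack("<%dH" % count, blob)]


def summarize(values, bin_size):
    """Precompute a (mean, max) pair for each bin of bin_size values."""
    summary = []
    for i in range(0, len(values), bin_size):
        chunk = values[i:i + bin_size]
        summary.append((sum(chunk) / len(chunk), max(chunk)))
    return summary
```

Under this assumed mapping, each decoded value differs from the original by at most half a quantization step, (hi - lo) / 131070, provided the original lies within [lo, hi]. A display of a large region would read the much smaller output of `summarize` rather than every data point.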

Variable-width data (e.g. sites) tends to be sparser.

Some design issues are:

  • run-length encoding - how to specify that a number applies to N bases
  • sparse data
  • fully populated data (a value at every base)
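
One possible way to address these three issues together is to store runs of the form (start, length, value). The Python sketch below is only an illustration under that assumption. The tuple layout and the names `rle_encode` and `rle_decode` are hypothetical and not taken from any existing UCSC code.

```python
def rle_encode(points):
    """Collapse (position, value) pairs, sorted by position, into runs.

    Each run is (start, length, value): the value applies to `length`
    consecutive bases beginning at `start`.
    """
    runs = []
    for pos, val in points:
        if runs:
            start, length, last = runs[-1]
            # Extend the current run only if this base is adjacent
            # to it and carries the same value.
            if val == last and pos == start + length:
                runs[-1] = (start, length + 1, last)
                continue
        runs.append((pos, 1, val))
    return runs


def rle_decode(runs):
    """Expand runs back into one (position, value) pair per base."""
    return [(start + i, val)
            for start, length, val in runs
            for i in range(length)]
```

In this layout, sparse data simply leaves gaps between runs, while fully populated data with a distinct value at every base degenerates to runs of length 1. In that second case a flat array of values would be more compact than runs.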

And questions to answer:

  • Does it participate in hgc click-throughs?
  • Can it be a custom track?
  • How does it function in the table browser?
  • Which correlation functions would be available there?
  • How do you get it to be fast enough for a browser display on a whole chrom?
  • Are there different graphing options (bar graph, points graph, line through points)?