Graph data format discussion

From genomewiki
Revision as of 19:44, 5 November 2007 by Kate (talk | contribs)

The wiggle format designed by UCSC for graphical display in the browser provides high performance and configurability to the user. However, it does not maintain the original data beyond a few digits of precision. Although most data has no significance beyond these digits, users have been confused when data extracted from a UCSC wiggle table or file does not match the data they submitted.

It has been proposed (e.g. in the ENCODE DCC grant proposal) that we design a new graphing data format that will preserve the submitted data exactly. Lacking this, we have promised to provide for download an additional copy of each ENCODE dataset as originally submitted. This has the disadvantage of doubling the disk space required, and also raises the risk of out-of-sync copies on the site (already a problem with our existing wiggle downloads).

A few thoughts on what a new data format might look like (from recent discussions) are included here:

  • A simple binary file that represents each data point in 16 bits would capture all the precision necessary.
  • Memory loading for this would be efficient, but to ensure adequate performance for large regions (e.g. >1MB), a summary table or file could be precomputed using our existing graph averaging methods (e.g. mean, median, max).
  • The existing ascii input wiggle formats could be maintained -- wigEncode could be expanded to generate the simple graph binary format instead of wiggle binary.
  • Code with similar functionality already exists in hgGenome.
  • NOTE: Nearly all of the high-density ENCODE graphing data provided so far fits one of two patterns:
  - one value per base across the genome (or ENCODE region)
  - fixed-width regions (e.g. 22mers) tiled across the genome

Variable-width data (e.g. sites) tends to be sparser.
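
To make the 16-bit idea above concrete, here is a minimal Python sketch. It is not the proposed format: the linear scaling between a lower and upper limit, the little-endian layout, and the function names encode_16bit and decode_16bit are all assumptions made for illustration. Note that any such scaling is exact only if the submitted values happen to fall on the 65536 available levels; otherwise the round trip is accurate to within half a quantization step.

```python
import struct

LEVELS = 65535  # highest code representable in an unsigned 16-bit integer


def encode_16bit(values, lo, hi):
    """Map floats in [lo, hi] linearly onto 0..65535 and pack them as
    little-endian unsigned 16-bit integers (hypothetical layout)."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    for v in values:
        if not lo <= v <= hi:
            raise ValueError("value %r outside [%r, %r]" % (v, lo, hi))
    scale = LEVELS / (hi - lo)
    codes = [round((v - lo) * scale) for v in values]
    return struct.pack("<%dH" % len(codes), *codes)


def decode_16bit(blob, lo, hi):
    """Inverse of encode_16bit: unpack the codes and rescale to floats."""
    count = len(blob) // 2
    codes = struct.unpack("<%dH" % count, blob)
    step = (hi - lo) / LEVELS
    return [lo + code * step for code in codes]
```

With one value per base, the byte offset of a base is simply twice its position, so a region can be read with a single seek. The limits lo and hi would have to be stored somewhere alongside the data, for example in a small header (again an assumption).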

Some design issues are:

  • run-length encoding - how to specify that a number applies to N bases
  • sparse data
  • fully populated, every base, data
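
As one illustration of the run-length question, the sketch below collapses consecutive identical values into (value, count) pairs. The pair representation and the function names are assumptions for discussion only. Fixed-width tiled data (e.g. 22mers) compresses well this way, while fully populated per-base data with few repeats gains little or nothing.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into one value per base."""
    expanded = []
    for value, length in runs:
        expanded.extend([value] * length)
    return expanded
```

Sparse data could be handled in the same scheme by reserving a "no data" marker (None in Python, or a sentinel code in a binary file) so that gaps become a single long run; whether that is preferable to storing explicit start positions is one of the open design issues.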

And questions to answer:

  • Does it participate in hgc click-throughs?
  • Can it be a custom track?
  • How does it function in the table browser?
  • Which correlation functions would be available there?
  • How do you get it to be fast enough for a browser display on a whole chrom?
  • Are there different graphing options (e.g. bar graph, points graph, line through points)?
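
On the whole-chromosome speed question, one possible approach is the precomputed summary suggested earlier. The following Python sketch reduces per-base values to one (mean, max) pair per fixed-size bin. The bin size, the choice of statistics, and the name summarize are illustrative assumptions; this is not the existing graph averaging code.

```python
def summarize(values, bin_size):
    """Return one (mean, max) pair for each consecutive bin of bin_size values.

    The last bin may be shorter than bin_size if the data does not divide evenly.
    """
    if bin_size < 1:
        raise ValueError("bin_size must be at least 1")
    summary = []
    for start in range(0, len(values), bin_size):
        chunk = values[start:start + bin_size]
        summary.append((sum(chunk) / len(chunk), max(chunk)))
    return summary
```

A display of a whole chromosome would then read the summary instead of the per-base file, touching roughly one record per bin rather than one per base.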