Graph data format discussion
The wiggle format designed by UCSC for graphical display in the browser provides high performance and configurability to the user. However, it does not preserve the original data beyond a few digits of precision. Although most data are not significant beyond these digits, users have been confused when data extracted from a UCSC wiggle table or file do not match the data they submitted.
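The mismatch users see can be illustrated with a minimal sketch of lossy binning. The bin count of 128 and the helper names quantize and dequantize are assumptions made for illustration only; this is not the actual wiggle encoder.

```python
def quantize(values, bins=128):
    """Map each value to an integer bin index in [0, bins - 1].

    The bin count is an illustrative assumption, not the wiggle spec.
    """
    lower = min(values)
    data_range = max(values) - lower
    if data_range == 0:
        return [0] * len(values), lower, data_range
    scale = (bins - 1) / data_range
    return [round((v - lower) * scale) for v in values], lower, data_range


def dequantize(indices, lower, data_range, bins=128):
    """Recover approximate values from bin indices."""
    step = data_range / (bins - 1)
    return [lower + i * step for i in indices]


submitted = [0.0, 0.1234, 0.5, 0.98765, 1.0]
indices, lower, data_range = quantize(submitted)
recovered = dequantize(indices, lower, data_range)

# Print each submitted value next to what a user would get back.
for original, approx in zip(submitted, recovered):
    print(f"{original:.5f} -> {approx:.5f}")
```

Under these assumptions the extracted values differ from the submitted ones by up to half a bin width, which is the kind of discrepancy described above.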
It has been proposed (e.g. in the ENCODE DCC grant proposal) that we design a new graphing data format that will preserve the submitted data exactly. Lacking this, we have promised to provide for download an additional copy of each ENCODE dataset as originally submitted. This has the disadvantage of doubling the disk space required, and also raises the risk of out-of-sync copies on the site (already a problem with our existing wiggle downloads).
A few thoughts on what a new data format might look like (from recent discussions) are included here:
- A simple binary file that represents each data point in
16 bits would capture all the precision necessary.
- Memory loading for this would be efficient, but to ensure adequate
performance for large regions (e.g. >1MB), a summary table or file could be precomputed using our existing graph averaging methods (e.g. mean, median, max).
- The existing ASCII input wiggle formats could be maintained;
wigEncode could be extended to generate the simple graph binary format instead of wiggle binary.
- Code with similar functionality already exists in
hgGenome.
- NOTE: Nearly all of the high-density ENCODE graphing data provided
so far fits one of two patterns:
  - one value per base across the genome (or ENCODE region)
  - fixed-width regions (e.g. 22mers) tiled across the genome
Variable-width data (e.g. sites) tends to be sparser.
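
The ideas in the list above can be sketched as follows, for the fixed-step case only. Everything here is an assumption for illustration: the header layout (lower limit, data range, value count), the scaling of values onto unsigned 16-bit codes, and the function names encode, decode and summarize are hypothetical and do not describe wigEncode or hgGenome.

```python
import struct
from statistics import mean, median

# Hypothetical header: lower limit, data range, number of values.
HEADER = struct.Struct("<ddI")
MAX_CODE = 0xFFFF  # largest unsigned 16-bit code


def encode(values):
    """Pack a non-empty list of fixed-step values as 16-bit codes."""
    lower = min(values)
    data_range = max(values) - lower
    scale = MAX_CODE / data_range if data_range else 0.0
    codes = [min(MAX_CODE, round((v - lower) * scale)) for v in values]
    header = HEADER.pack(lower, data_range, len(codes))
    return header + struct.pack(f"<{len(codes)}H", *codes)


def decode(blob):
    """Unpack a blob produced by encode() back into floats."""
    lower, data_range, count = HEADER.unpack_from(blob)
    codes = struct.unpack_from(f"<{count}H", blob, HEADER.size)
    step = data_range / MAX_CODE if data_range else 0.0
    return [lower + c * step for c in codes]


def summarize(values, window, reducer=max):
    """Precompute one summary value per window of consecutive points."""
    return [reducer(values[i:i + window])
            for i in range(0, len(values), window)]


values = [0.0, 0.25, 0.5, 0.75, 1.0, 0.1, 0.9, 0.3]
blob = encode(values)
decoded = decode(blob)

# Summary levels for zoomed-out display, one value per 4 data points.
summary_max = summarize(decoded, 4, max)
summary_mean = summarize(decoded, 4, mean)
summary_median = summarize(decoded, 4, median)
```

In this sketch each data point costs two bytes after a 20-byte header, and the summaries would be stored alongside the full-resolution data so that large regions can be drawn without reading every point. Scaled 16-bit codes are still an approximation of the submitted values; a different 16-bit representation could be swapped in without changing the overall structure.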