Selecting a graphing track data format

From genomewiki
Revision as of 20:10, 14 August 2008 by Hiram (talk | contribs) (initial discussion)

Introduction

Several data submission formats are available for drawing graphs in the genome browser. Consider the structure of the data when selecting a format. Proper selection of the data format is important to avoid very large data submission files and to allow efficient display in the genome browser. Proper selection of graphing options is critical to accurately portray the intended meaning of the data.

Genome Graphs

  1. Draws a line graph through the specified chromosome positions
  2. Best used for sparse genome-wide data points (sparse == fewer than three hundred thousand)
  3. Not recommended for dense data sets (dense == values closer than 100,000 bases to each other)
  4. See also: Genome Graphs

Bed Graph

  1. Draws a bar graph over each specified chromosome region
  2. Best used for genome-wide data sets on the order of several million to perhaps 10 million positions
  3. Best used when the data is not spaced at regular intervals and the specified regions are not of constant size
  4. See also: Bed Graph
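
As an illustration of the Bed Graph layout described above, here is a minimal Python sketch that renders regions of varying size at irregular positions. The function name `format_bedgraph` and the sample values are hypothetical. The sketch assumes the usual bedGraph conventions: a `track type=bedGraph` declaration line followed by one line per region holding chromosome, start, end and value, with zero-based, half-open coordinates.

```python
def format_bedgraph(records, name="example"):
    """Render (chrom, start, end, value) records as bedGraph text.

    Assumes zero-based, half-open coordinates, as in other BED-style formats.
    Regions may differ in size and need not be evenly spaced.
    """
    lines = [f'track type=bedGraph name="{name}"']
    for chrom, start, end, value in records:
        if end <= start:
            raise ValueError(f"empty or reversed region: {chrom}:{start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}\t{value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample data: two regions of different sizes.
    print(format_bedgraph([("chr1", 100, 250, 1.5), ("chr1", 900, 1000, -0.25)],
                          name="demo"), end="")
```

Every region carries its own chromosome name and both coordinates, which is why this format costs more space per data point than the wiggle formats.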

Wiggle Variable Step

  1. Draws a bar graph over each specified chromosome region
  2. Best used for genome-wide data sets on the order of several tens of millions of data points
  3. Specified regions must be a constant size (specified by the span argument)
  4. Chromosome positions can be at irregular intervals, but see item 6 for a caveat
  5. This is the second most space-efficient format for wiggle data input
  6. Encoding and display can become inefficient when the spacing of the data points is extremely irregular. In that case, the Bed Graph is the backup format.
  7. See also: Wiggle Formats
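
A rough Python sketch of the variable step layout follows. The helper name `format_variable_step` and the sample values are made up for illustration. It assumes the standard wiggle conventions: a `track type=wiggle_0` line, a `variableStep chrom=... span=...` declaration, then one `position value` pair per line with one-based positions.

```python
def format_variable_step(chrom, points, span=1, name="example"):
    """Render (position, value) pairs as wiggle variableStep text.

    Positions are assumed to be one-based. Every point covers the same
    number of bases (span), but the gaps between points may vary.
    """
    lines = [
        f'track type=wiggle_0 name="{name}"',
        f"variableStep chrom={chrom} span={span}",
    ]
    # Emit points in ascending position order.
    for position, value in sorted(points):
        lines.append(f"{position} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample data: irregular spacing, constant span of 25 bases.
    print(format_variable_step("chr2", [(300701, 12.5), (300601, 11.0)],
                               span=25, name="demo"), end="")
```

Because the chromosome and span are stated once in the declaration line, each data line needs only a position and a value.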

Wiggle Fixed Step

  1. Draws a bar graph over each specified chromosome region
  2. Best used for genome-wide data sets on the order of several tens of millions of data points
  3. Specified regions must be a constant size (specified by the span argument)
  4. Chromosome positions are at precisely regular intervals (specified by the step argument)
  5. This is the most space-efficient format for wiggle data input
  6. See also: Wiggle Formats
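
The fixed step layout can be sketched in the same way. The helper name `format_fixed_step` and the sample values are hypothetical. The sketch assumes the standard wiggle declaration `fixedStep chrom=... start=... step=... span=...` with a one-based start, followed by one value per line. The check that span does not exceed step is an added sanity check, not something this page prescribes.

```python
def format_fixed_step(chrom, start, step, values, span=1, name="example"):
    """Render evenly spaced values as wiggle fixedStep text.

    The first value sits at `start` (assumed one-based); each following
    value sits `step` bases further along and covers `span` bases.
    """
    if span > step:
        # Assumption: a span wider than the step would make regions overlap.
        raise ValueError("span must not exceed step")
    lines = [
        f'track type=wiggle_0 name="{name}"',
        f"fixedStep chrom={chrom} start={start} step={step} span={span}",
    ]
    lines.extend(str(value) for value in values)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample data: three values, 100 bases apart, 5 bases wide.
    print(format_fixed_step("chr3", 400601, 100, [11, 22, 33],
                            span=5, name="demo"), end="")
```

Only the values appear on the data lines, since every position is implied by the declaration. This is what makes fixed step the most compact of the formats discussed here.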

Wiggle Bed Graph

  1. Obsolete data format; use the Bed Graph instead
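
The structural criteria in the sections above (constant or varying region size, regular or irregular spacing) can be turned into a simple check. The Python sketch below is one possible reading of those guidelines, not an official tool. The function name `suggest_format` is hypothetical, and the sketch looks only at the shape of the data, ignoring data set size and the sparse case better served by Genome Graphs.

```python
from itertools import groupby


def suggest_format(intervals):
    """Suggest a format for (chrom, start, end) tuples sorted by chrom, start.

    Varying region sizes              -> bedGraph
    Constant size, irregular spacing  -> variableStep
    Constant size, regular spacing    -> fixedStep
    """
    if not intervals:
        raise ValueError("no data points given")
    sizes = {end - start for _, start, end in intervals}
    if len(sizes) != 1:
        return "bedGraph"
    # Collect the distinct gaps between consecutive starts on each chromosome.
    steps = set()
    for _, group in groupby(intervals, key=lambda iv: iv[0]):
        starts = [iv[1] for iv in group]
        steps.update(b - a for a, b in zip(starts, starts[1:]))
    return "fixedStep" if len(steps) <= 1 else "variableStep"


if __name__ == "__main__":
    # Hypothetical sample data.
    print(suggest_format([("chr1", 0, 10), ("chr1", 10, 20), ("chr1", 20, 30)]))
    print(suggest_format([("chr1", 0, 10), ("chr1", 50, 60), ("chr1", 70, 80)]))
    print(suggest_format([("chr1", 0, 10), ("chr1", 50, 75)]))
```

The sketch does not measure how irregular the spacing is, so it cannot tell when variable step data is irregular enough to fall back to the Bed Graph.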

Notes

  1. Currently, data sets with more than 100 million data points are impractical because of network transmission time, data transformation time, and database loading time. Larger data sets can be attempted, but are not guaranteed to survive the various time-out mechanisms in the pipeline. The visible symptom of a timeout during loading is a blank web browser screen.
  2. Compressing the submitted data file with gzip helps by reducing network transmission time.
  3. Pseudo line graphs can be drawn with wiggle tracks by setting the track's optional display parameters to draw points instead of bars (graphType) and turning smoothing on (smoothingWindow), which smears the points together into a line.
  4. Beware of the optional graphing parameters when viewing the resulting data track. The settings for windowingFunction, viewLimits, autoScale, etc. can dramatically change the apparent meaning of the data display.
  5. No graphing format draws multiple data values at identical chromosome locations. The loading mechanism attempts to prevent this situation, but not all cases of overlapping data values can be detected. If multiple data values at identical chromosome locations slip past the detection mechanisms, graphing behavior is not guaranteed.
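
Notes 2 and 5 suggest two checks that are easy to run locally before submitting a file. The Python sketch below is illustrative only: the function names `find_overlaps` and `compress_track` are hypothetical, the records are assumed to use zero-based, half-open coordinates, and the overlap check is not the loader's own detection mechanism.

```python
import gzip


def find_overlaps(records):
    """Return (earlier, later) pairs of records whose regions overlap.

    records: iterable of (chrom, start, end, value) tuples.
    """
    overlaps = []
    furthest = {}  # per chromosome: the record reaching furthest so far
    for rec in sorted(records, key=lambda r: (r[0], r[1], r[2])):
        chrom, start, end, _ = rec
        prev = furthest.get(chrom)
        if prev is not None and start < prev[2]:
            overlaps.append((prev, rec))
        if prev is None or end > prev[2]:
            furthest[chrom] = rec
    return overlaps


def compress_track(text):
    """Gzip-compress track text to cut network transmission time."""
    return gzip.compress(text.encode("ascii"))


if __name__ == "__main__":
    # Hypothetical sample data: the second region overlaps the first.
    data = [("chr1", 0, 10, 1.0), ("chr1", 5, 15, 2.0),
            ("chr1", 15, 20, 3.0), ("chr2", 0, 10, 4.0)]
    for earlier, later in find_overlaps(data):
        print("overlap:", earlier, later)
    payload = compress_track("chr1\t0\t10\t1.0\n" * 1000)
    print(len(payload), "compressed bytes")
```

Resolving any reported overlaps before submission avoids relying on the undefined graphing behavior described in note 5.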