Selecting a graphing track data format

Introduction

Several data submission formats are available for drawing graphs in the genome browser. Choose the format according to the structure of the data: the right choice avoids very large data submission files and allows efficient display in the genome browser. Careful selection of graphing options is equally critical to portray the intended meaning of the data accurately.

bigWig/bigBed

See also: Bioinformatics vol 26 no 17 pp 2204-2207

As of mid-2009, the bigWig and bigBed data types are a better option than the data types described below, and using them instead is highly recommended. See also: bigBed documentation and bigWig documentation. These formats remove the need to send very large data sets to UCSC. The bigWig encoder also overcomes previous limitations in the wiggle encoding scheme and is efficient with sparse data sets and with data sets whose data points vary in size.

If your data set is so large that the upload times out, use these formats instead. Your data set remains on your web server, and only the portions of the files needed to display a particular region are transferred to UCSC. The bigBed and bigWig formats are much more efficient than their bed/wig equivalents.

Genome Graphs

  1. Draws line graph through specified chromosome positions
  2. Best used for genome-wide sparse data points (sparse == fewer than 300,000)
  3. Not recommended for dense data sets (dense == values closer than 100,000 bases to each other)
  4. See also: Genome Graphs

Bed Graph

  1. Draws bar graph at specified chromosome segment region
  2. Best used for genome-wide data sets on the order of several million to perhaps 10 million positions
  3. Best used when data is not spaced at regular intervals, and the size of the specified regions is not a constant
  4. See also: Bed Graph
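
As an illustration of the points above, here is a minimal Python sketch that formats irregularly spaced, variable-size regions as bedGraph text. The function name `bedgraph_lines`, the track name, and the sample values are hypothetical; the sketch assumes the usual bedGraph conventions of a `track type=bedGraph` line followed by chrom, start, end, and value columns, with 0-based starts and exclusive ends.

```python
def bedgraph_lines(records, name="example"):
    """Format (chrom, start, end, value) tuples as bedGraph text lines.

    Assumes BED-style coordinates: start is 0-based, end is exclusive.
    """
    lines = [f'track type=bedGraph name="{name}"']
    for chrom, start, end, value in records:
        if end <= start:
            raise ValueError(f"empty or inverted interval: {chrom}:{start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}\t{value}")
    return lines


# Hypothetical data: irregular spacing and regions of different sizes.
records = [
    ("chr1", 1000, 1250, 0.5),
    ("chr1", 4000, 4010, 1.75),
    ("chr2", 200, 900, -0.25),
]
print("\n".join(bedgraph_lines(records)))
```

Every line repeats the chromosome name and both coordinates, which is why this format tolerates any spacing and region size but takes more space than the wiggle formats.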

Wiggle Variable Step

  1. Draws bar graph at specified chromosome segment region
  2. Best used for genome-wide data sets on the order of several tens of millions of data points
  3. Specified regions must be a constant size (specified by the span argument)
  4. Chromosome positions can be at irregular intervals, but caution is advised in certain cases
  5. This is the second most space-efficient format for wiggle data input
  6. This format can be inefficient during encoding and display if the spacing of the data points is extremely irregular. In this case, use Bed Graph as the fallback format.
  7. See also: Wiggle Formats
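
A minimal Python sketch of the variableStep layout described above, assuming the standard declaration line `variableStep chrom=... span=...` followed by one "position value" pair per line, with 1-based positions. The function name `variable_step_lines` and the sample points are hypothetical.

```python
def variable_step_lines(chrom, points, span=1):
    """Format (position, value) pairs as one wiggle variableStep block.

    Positions are 1-based; every data point covers the same `span` bases.
    """
    lines = [f"variableStep chrom={chrom} span={span}"]
    for position, value in sorted(points):
        lines.append(f"{position} {value}")
    return lines


# Hypothetical data: constant 25-base regions at irregular positions.
points = [(10001, 2.0), (10050, 3.5), (12000, 1.25)]
print("\n".join(variable_step_lines("chr1", points, span=25)))
```

The chromosome and span are stated once in the declaration line, so each data line carries only a position and a value.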

Wiggle Fixed Step

  1. Draws bar graph at specified chromosome segment region
  2. Best used for genome-wide data sets on the order of several tens of millions of data points
  3. Specified regions must be a constant size (specified by the span argument)
  4. Chromosome positions are precisely at regular intervals (specified by step argument)
  5. This is the most space-efficient format for wiggle data input
  6. See also: Wiggle Formats
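
A minimal Python sketch of the fixedStep layout described above, assuming the standard declaration line `fixedStep chrom=... start=... step=... span=...` followed by one value per line. The function name `fixed_step_lines` and the sample values are hypothetical, and the check that span does not exceed step is an assumption added here to keep regions from overlapping.

```python
def fixed_step_lines(chrom, start, step, values, span=1):
    """Format evenly spaced values as one wiggle fixedStep block.

    `start` is the 1-based position of the first value; each later value
    sits `step` bases further along and covers `span` bases.
    """
    if span > step:
        # Assumption: a span wider than the step would make regions overlap.
        raise ValueError("span must not exceed step")
    lines = [f"fixedStep chrom={chrom} start={start} step={step} span={span}"]
    lines.extend(str(value) for value in values)
    return lines


# Hypothetical data: one value every 100 bases, each covering 20 bases.
values = [0.5, 0.75, 1.0, 0.25]
print("\n".join(fixed_step_lines("chr1", 5001, 100, values, span=20)))
```

Because positions are implied by the start and step arguments, each data line holds only a value, which is what makes this the most compact of the wiggle input formats.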

Wiggle Bed Graph

  1. Obsolete data format; use Bed Graph instead


Notes

  1. Data sets with more than 100 million data points are impractical due to network transmission time and data transformation and database loading times. Larger sets of data can be attempted, but are not guaranteed to survive the various time-out mechanisms in the pipeline. The visible symptom of a timeout during loading is a blank web browser screen. If this happens to you, consider converting your data into bigWig or bigBed format.
  2. Compressing (gzip) the submitted data file helps, because it shortens network transmission time.
  3. Pseudo line graphs can be drawn with wiggle tracks by setting the track's optional drawing parameters to draw points instead of bars, with smoothing turned on to smear the points together into a line.
  4. Beware of optional data graphing parameters when viewing the resulting data track. The selection of windowingFunction, viewLimits, autoScaling, etc. can dramatically change the apparent meaning of the data display.
  5. There is no graphing format that draws multiple data values at identical chromosome locations. The loading mechanism attempts to prevent this situation, but not all cases of overlapping data values can be detected. If multiple data values at identical chromosome locations slip past the detection mechanisms, graphing behavior is not guaranteed.
  6. See also: Wiggle Bed to variableStep format conversion
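
Notes 2 and 5 can be checked locally before submission. The following Python sketch looks for overlapping regions in bedGraph-style records and gzip-compresses the resulting text. The helper names `find_overlaps` and `gzip_track` and the sample records are hypothetical, and the check is independent of whatever detection the loading mechanism itself performs.

```python
import gzip


def find_overlaps(records):
    """Return (earlier, later) pairs where a record overlaps an earlier one.

    Records are (chrom, start, end, value) tuples with 0-based starts and
    exclusive ends. Each record is compared with the earlier record on the
    same chromosome that reaches furthest to the right.
    """
    overlaps = []
    previous = None
    for record in sorted(records, key=lambda r: (r[0], r[1], r[2])):
        same_chrom = previous is not None and previous[0] == record[0]
        if same_chrom and record[1] < previous[2]:
            overlaps.append((previous, record))
        if not same_chrom or record[2] > previous[2]:
            previous = record
    return overlaps


def gzip_track(lines):
    """Join text lines and return them gzip-compressed as bytes."""
    text = "\n".join(lines) + "\n"
    return gzip.compress(text.encode("ascii"))


# Hypothetical records: the second region overlaps the first.
records = [
    ("chr1", 100, 200, 1.0),
    ("chr1", 150, 250, 2.0),
    ("chr1", 250, 300, 3.0),
]
for earlier, later in find_overlaps(records):
    print("overlap:", earlier, later)

lines = [f"{c}\t{s}\t{e}\t{v}" for c, s, e, v in records]
compressed = gzip_track(lines)
print(len(compressed), "bytes after compression")
```

Adjacent regions that merely touch, such as 150-250 and 250-300 here, are not reported, because the end coordinate is exclusive.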