Big file converters

Some brief measurements of the memory consumed by the big file conversion commands:

These measurements use a fixedStep wiggle file covering most of the hg19 human sequence: the phyloP data for the 46-way vertebrate conservation track on hg19, a data set that covers 2,845,303,719 bases.

The worst-case memory usage occurs with a variableStep wiggle file in which the specified coordinates happen to be consecutive. Normally the most efficient encoding for data like that would be fixedStep.
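
For illustration, here is the same short run of consecutive single-base values in each of the three encodings (the chromosome, positions, and values are hypothetical; note that wiggle coordinates are 1-based, while bedGraph coordinates are 0-based, half-open):

  # fixedStep: one declaration line, then bare values
  fixedStep chrom=chr1 start=10001 step=1
  0.5
  0.4
  0.3

  # variableStep: every line repeats a position
  variableStep chrom=chr1 span=1
  10001 0.5
  10002 0.4
  10003 0.3

  # bedGraph: chrom, 0-based start, end, value
  chr1 10000 10001 0.5
  chr1 10001 10002 0.4
  chr1 10002 10003 0.3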

This phyloP data set, in its original fixedStep ASCII encoding, takes wigToBigWig 32 GB of memory and 35 minutes of run time. The same data in variableStep format takes wigToBigWig 60 GB of memory and 2 hours 20 minutes of run time. In bedGraph format, the bedGraphToBigWig converter consumes only 3 GB of memory, with a run time of 1 hour 40 minutes.
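
A minimal sketch of the corresponding invocations, assuming hypothetical input file names; the argument order shown is the standard one for these kent utilities, and the chrom.sizes file can be obtained with the fetchChromSizes script:

  fetchChromSizes hg19 > hg19.chrom.sizes
  # fixedStep or variableStep wiggle input
  wigToBigWig phyloP46way.wig hg19.chrom.sizes phyloP46way.bw
  # bedGraph input
  bedGraphToBigWig phyloP46way.bedGraph hg19.chrom.sizes phyloP46way.bw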

Using that same bedGraph file as an ordinary BED file, the bedToBigBed converter consumes 19 GB of memory in 1 hour 15 minutes of run time to produce a bigBed file.
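
The corresponding bedToBigBed sketch, again with hypothetical file names; a bedGraph file can pass as BED input because its first three columns are an ordinary chrom/start/end, though the exact -type setting may vary with the input's column count:

  # treat the 4-column bedGraph as bed4 (value column read as a name field)
  bedToBigBed -type=bed4 phyloP46way.bedGraph hg19.chrom.sizes phyloP46way.bb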