Big file converters

Some brief measurements of big file conversion commands and the memory they consume:

These measurements use a fixedStep wiggle file covering most of the hg19 human sequence: the phyloP data for the 46-way vertebrate track on hg19, a data set that covers 2,845,303,719 bases of hg19.

The worst-case memory usage is for a variableStep wiggle file in which the specified coordinates happen to be consecutive. Normally the most efficient encoding for this type of data would be fixedStep, as illustrated below.
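For illustration, the same three consecutive bases (the values here are made up) can be encoded either way; variableStep repeats the coordinate on every line, while fixedStep records only the starting coordinate:

variableStep chrom=chr1
10001 0.117
10002 0.103
10003 0.098

fixedStep chrom=chr1 start=10001 step=1
0.117
0.103
0.098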

This phyloP data set, in its original fixedStep ASCII encoding, consumes 32 GB of memory with wigToBigWig in 35 minutes of run time. With the same data in variableStep format, wigToBigWig consumes 60 GB of memory in 2 hours 20 minutes of run time. With the data in bedGraph format, the bedGraphToBigWig converter consumes 3 GB of memory in 1 hour 40 minutes of run time.
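For reference, the converters are invoked as shown below; the input and output file names are hypothetical, and the hg19.chrom.sizes file can be obtained with the fetchChromSizes utility:

fetchChromSizes hg19 > hg19.chrom.sizes
# fixedStep or variableStep wiggle input
wigToBigWig phyloP46way.wigFix hg19.chrom.sizes phyloP46way.bw
# bedGraph input
bedGraphToBigWig phyloP46way.bedGraph hg19.chrom.sizes phyloP46way.bw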

Using that bedGraph file as an ordinary bed file, the bedToBigBed converter consumes 19 GB of memory in 1 hour 15 minutes of run time to produce a bigBed file.
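A bedToBigBed run follows the same argument pattern (file names again hypothetical), treating the four-column bedGraph as a plain bed file:

bedToBigBed phyloP46way.bedGraph hg19.chrom.sizes phyloP46way.bb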

Shell ulimit

To configure your shell environment to allow the use of all the memory in your system, change the ulimit parameters. For example, in the bash shell, to allow the use of 64 GB of memory, the command is:

ulimit -S -m 67108864 -v 67108864

The numbers are in units of 1 KB (1024 bytes), thus: 64*1024*1024 == 67108864
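A quick way to compute that value and confirm the new soft limits in bash:

# 64 GB expressed in 1 KB units
echo $((64 * 1024 * 1024))
# apply and then display the soft limits
ulimit -S -m 67108864 -v 67108864
ulimit -S -a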

The equivalent csh/tcsh shell commands are:

limit datasize 65536m
limit vmemoryuse 65536m
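In csh/tcsh, running limit with no arguments prints the current settings, which is a convenient way to verify the change:

limit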