Bin indexing system

From genomewiki
Jump to navigationJump to search

Introduction

The binning index system used in the genome browser is a mechanism used in concert with MySQL indexes to speed up selection of MySQL rows for genome coordinate overlapping items. This type of search is sometimes called a range request. The system as first used in the genome browser is described in: "The Human Genome Browser at UCSC" Kent, et. al. Genome Research 2002.12:996-1006, see Figure 7, quote:

We settled on a binning scheme suggested by Lincoln Stein and Richard Durbin. A simple version of this scheme is shown in
Figure7. In the browser itself, we use five different sizes of bins: 128 kb, 1 Mb, 8 Mb, 64 Mb, and 512 Mb.

That initial implementation has since been enhanced by an additional level of bins to allow items of size up to 4 Gb (actually only to 2Gb given integer size limits). The new and the old system coexist together. Given an item with a chromEnd coordinate of less than or equal to 512 Mb, a bin number in the old system will be used. An item with a chromEnd coordinate greater than 512 Mb, a bin number in the new system will be used.

Since all of these bins are in sizes of powers of two, the calculation of the bin number is a simple matter of bit shifting of the chromStart and chromEnd coordinates. The C code for the bin calculation can be seen in the kent source tree in src/lib/binRange.c.

Initial implementation

Used when chromEnd is less than or equal to 536,870,912 = 229

 bin numbersbin
level#binsstartendsize
0100512 Mb
181864 Mb
2649728 Mb
3512735841 Mb
440965854680128 kb

Extended implementation

Used when chromEnd is greater than 536,870,912 = 229 and less than 2,147,483,647 = 231 - 1

 bin numbersbin
level#binsstartendsize
01469146912 Gb
1846834685512 Mb
2644698472164 Mb
3512481850098 Mb
44,096577873131 Mb
532,7681345825745128 kb

Initial implementation C code

/* This file is copyright 2002 Jim Kent, but license is hereby
 * granted for all use - public, private or commercial. */

static int binOffsets[] = {512+64+8+1, 64+8+1, 8+1, 1, 0};
#define _binFirstShift 17       /* How much to shift to get to finest bin. */
#define _binNextShift 3         /* How much to shift to get to next larger bin.

static int binFromRangeStandard(int start, int end)
/* Given start,end in chromosome coordinates assign it
 * a bin.   There's a bin for each 128k segment, for each
 * 1M segment, for each 8M segment, for each 64M segment,
 * and for each chromosome (which is assumed to be less than
 * 512M.)  A range goes into the smallest bin it will fit in. */
{
int startBin = start, endBin = end-1, i;
startBin >>= _binFirstShift;
endBin >>= _binFirstShift;
for (i=0; i<ArraySize(binOffsets); ++i)
    {
    if (startBin == endBin)
        return binOffsets[i] + startBin;
    startBin >>= _binNextShift;
    endBin >>= _binNextShift;
    }
errAbort("start %d, end %d out of range in findBin (max is 512M)", start, end);
return 0;
}

Extended implementation C code

/* This file is copyright 2002 Jim Kent, but license is hereby
 * granted for all use - public, private or commercial. */

/* add one new level to get coverage past chrom sizes of 512 Mb
 *      effective limit is now the size of an integer since chrom start
 *      and end coordinates are always being used in int's == 2Gb-1 */
static int binOffsetsExtended[] =
        {4096+512+64+8+1, 512+64+8+1, 64+8+1, 8+1, 1, 0};

static int binFromRangeExtended(int start, int end)
/* Given start,end in chromosome coordinates assign it
 * a bin.   There's a bin for each 128k segment, for each
 * 1M segment, for each 8M segment, for each 64M segment,
 * for each 512M segment, and one top level bin for 4Gb.
 *      Note, since start and end are int's, the practical limit
 *      is up to 2Gb-1, and thus, only four result bins on the second
 *      level.
 * A range goes into the smallest bin it will fit in. */
{
int startBin = start, endBin = end-1, i;
startBin >>= _binFirstShift;
endBin >>= _binFirstShift;
for (i=0; i<ArraySize(binOffsetsExtended); ++i)
    {
    if (startBin == endBin)
        return _binOffsetOldToExtended + binOffsetsExtended[i] + startBin;
    startBin >>= _binNextShift;
    endBin >>= _binNextShift;
    }
errAbort("start %d, end %d out of range in findBin (max is 2Gb)", start, end);
return 0;
}

Unix Command Line, Perl, Python and Ruby

A generic Perl script for any Unix command line, by Heng Li:

http://lh3lh3.users.sourceforge.net/ucsc-mysql.shtml

For Python, see:

https://github.com/brentp/cruzdb/blob/master/cruzdb/__init__.py#L489

For Ruby, see:

https://github.com/misshie/bioruby-ucsc-api/blob/1915b710a9064209dcffd8eef39bd548ad199fc6/lib/bio-ucsc/ucsc_bin.rb