Bin indexing system

From genomewiki
Jump to navigationJump to search
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

Introduction

The binning index system used in the genome browser is a mechanism used in concert with MySQL indexes to speed up selection of MySQL rows for genome coordinate overlapping items. This type of search is sometimes called a range request. The system as first used in the genome browser is described in: "The Human Genome Browser at UCSC" Kent, et. al. Genome Research 2002.12:996-1006, see Figure 7, quote:

We settled on a binning scheme suggested by Lincoln Stein and Richard Durbin. A simple version of this scheme is shown in
Figure7. In the browser itself, we use five different sizes of bins: 128 kb, 1 Mb, 8 Mb, 64 Mb, and 512 Mb.

That initial implementation has since been enhanced by an additional level of bins to allow items of size up to 4 Gb (actually only to 2Gb given integer size limits). The new and the old system coexist together. Given an item with a chromEnd coordinate of less than or equal to 512 Mb, a bin number in the old system will be used. An item with a chromEnd coordinate greater than 512 Mb, a bin number in the new system will be used.

Since all of these bins are in sizes of powers of two, the calculation of the bin number is a simple matter of bit shifting of the chromStart and chromEnd coordinates. The C code for the bin calculation can be seen in the kent source tree in src/lib/binRange.c.

Initial implementation

Used when chromEnd is less than or equal to 536,870,912 = 229

 bin numbersbin
level#binsstartendsize
0100512 Mb
181864 Mb
2649728 Mb
3512735841 Mb
440965854680128 kb

Extended implementation

Used when chromEnd is greater than 536,870,912 = 229 and less than 2,147,483,647 = 231 - 1

 bin numbersbin
level#binsstartendsize
01469146912 Gb
1846834685512 Mb
2644698472164 Mb
3512481850098 Mb
44,096577873131 Mb
532,7681345825745128 kb

Initial implementation C code

/* This file is copyright 2002 Jim Kent, but license is hereby
 * granted for all use - public, private or commercial. */

static int binOffsets[] = {512+64+8+1, 64+8+1, 8+1, 1, 0};
#define _binFirstShift 17       /* How much to shift to get to finest bin. */
#define _binNextShift 3         /* How much to shift to get to next larger bin.

static int binFromRangeStandard(int start, int end)
/* Given start,end in chromosome coordinates assign it
 * a bin.   There's a bin for each 128k segment, for each
 * 1M segment, for each 8M segment, for each 64M segment,
 * and for each chromosome (which is assumed to be less than
 * 512M.)  A range goes into the smallest bin it will fit in. */
{
int startBin = start, endBin = end-1, i;
startBin >>= _binFirstShift;
endBin >>= _binFirstShift;
for (i=0; i<ArraySize(binOffsets); ++i)
    {
    if (startBin == endBin)
        return binOffsets[i] + startBin;
    startBin >>= _binNextShift;
    endBin >>= _binNextShift;
    }
errAbort("start %d, end %d out of range in findBin (max is 512M)", start, end);
return 0;
}

Extended implementation C code

/* This file is copyright 2002 Jim Kent, but license is hereby
 * granted for all use - public, private or commercial. */

/* add one new level to get coverage past chrom sizes of 512 Mb
 *      effective limit is now the size of an integer since chrom start
 *      and end coordinates are always being used in int's == 2Gb-1 */
static int binOffsetsExtended[] =
        {4096+512+64+8+1, 512+64+8+1, 64+8+1, 8+1, 1, 0};

static int binFromRangeExtended(int start, int end)
/* Given start,end in chromosome coordinates assign it
 * a bin.   There's a bin for each 128k segment, for each
 * 1M segment, for each 8M segment, for each 64M segment,
 * for each 512M segment, and one top level bin for 4Gb.
 *      Note, since start and end are int's, the practical limit
 *      is up to 2Gb-1, and thus, only four result bins on the second
 *      level.
 * A range goes into the smallest bin it will fit in. */
{
int startBin = start, endBin = end-1, i;
startBin >>= _binFirstShift;
endBin >>= _binFirstShift;
for (i=0; i<ArraySize(binOffsetsExtended); ++i)
    {
    if (startBin == endBin)
        return _binOffsetOldToExtended + binOffsetsExtended[i] + startBin;
    startBin >>= _binNextShift;
    endBin >>= _binNextShift;
    }
errAbort("start %d, end %d out of range in findBin (max is 2Gb)", start, end);
return 0;
}

Unix Command Line, Perl, Python and Ruby

A generic Perl script for any Unix command line, by Heng Li:

http://lh3lh3.users.sourceforge.net/ucsc-mysql.shtml

For Python, see:

https://github.com/brentp/cruzdb/blob/master/cruzdb/__init__.py#L489

For Ruby, see:

https://github.com/misshie/bioruby-ucsc-api/blob/1915b710a9064209dcffd8eef39bd548ad199fc6/lib/bio-ucsc/ucsc_bin.rb