Orrb

From genomewiki

Basically, you are looking for k-mers (11-mers in the example below) that are overrepresented in the genome, similar to blat's 11-mer ooc mask.

The difference is that it masks only the 5 positions in the middle of each word (the 1s in the pattern), so seeds can still use the ends of the masked region. threshalpha acts like a p-value cutoff (1x10E13 here) and can be changed.
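As a rough illustration of what the pattern does, the sketch below lowercases only the positions where the pattern has a 1, for every occurrence of a given word. The function name `apply_pattern_mask` and the sample sequence are made up for this example. This is not orrb's code, and it ignores reverse complements for brevity.

```python
def apply_pattern_mask(seq, word, pattern):
    """Lowercase each occurrence of `word` in `seq` at positions where `pattern` has a '1'.

    Hypothetical helper for illustration only; `word` is expected in uppercase.
    """
    if len(word) != len(pattern):
        raise ValueError("word and pattern must have the same length")
    k = len(word)
    out = list(seq)
    upper = seq.upper()
    for start in range(len(seq) - k + 1):
        if upper[start:start + k] == word:
            for offset, bit in enumerate(pattern):
                if bit == "1":
                    out[start + offset] = out[start + offset].lower()
    return "".join(out)


# An 11-mer embedded in a short made-up sequence; only its middle 5 letters are masked.
masked = apply_pattern_mask("GGACGTACGTTAGCC", "ACGTACGTTAG", "00011111000")
print(masked)
```

The three letters on either side of the masked core stay uppercase, so a seed that overlaps only the ends of the word can still be used.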

With a little tweaking, this may outperform WM, with execution times about 10 times faster.

Command-line parameters that I tried:

 orrb -mask $input -target $input -out $input.out -wordsize 11 -pattern 00011111000 -threshalpha 13

http://www.drive5.com/orrb/

Below is a blurb from the source:

Two genomes, a threshold and a word size k (k should normally be odd).

Create a kmer histogram for each genome. That is, for each genome, count how many times each unique word (kmer) appears. Note that a word and its reverse complement are counted together.
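The counting step can be sketched in a few lines of Python. This is a minimal sketch, not the orrb implementation. The names `reverse_complement`, `canonical` and `kmer_histogram` are invented here, and picking the lexicographically smaller of a word and its reverse complement as the shared key is an assumption about how the two could be counted together.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the reverse complement of an uppercase DNA word."""
    return word.translate(_COMPLEMENT)[::-1]


def canonical(word):
    """Pick one representative for a word and its reverse complement (assumed convention)."""
    return min(word, reverse_complement(word))


def kmer_histogram(seq, k):
    """Count every k-mer in seq, pooling each word with its reverse complement."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):  # skip words containing N or other symbols
            counts[canonical(word)] += 1
    return counts


# Made-up sequence containing AATGACA and its reverse complement TGTCATT.
hist = kmer_histogram("AATGACATTGTCATT", 7)
print(hist["AATGACA"])
```

Here the forward word and its reverse complement each occur once, so they share a single histogram entry with a count of 2.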

Now for each kmer you have two numbers: N1, the number of times it appears in the first genome, and N2, the number of times it appears in the second genome. Those two numbers are multiplied; the result R = N1*N2 is the number of seeds that a blast-like program would initiate because of that kmer. If f(R) > threshold for some function f(), then all instances of that kmer are masked by lowercasing (soft-masking) its center letter.

For example, if AATGACA should be masked, then wherever AATGACA (or its reverse complement TGTCATT) appears in the genomes, the letter in the middle is masked: AATgACA and TGTcATT, or, for hardmasking, AATNACA and TGTNATT. To prevent seeding, it suffices to mask the middle letter; this allows the other letters to participate in potentially useful seeds. This is why k should be odd: there is no middle letter if k is even.
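The thresholding and center-letter masking described in the blurb could look roughly like the following sketch. The blurb does not say what f() is, so the identity function is used here purely as a placeholder. All function names, the two toy genomes and the threshold of 1 are assumptions for illustration, and input sequences are assumed to be uppercase ACGT.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(word):
    """Shared key for a word and its reverse complement (assumed convention)."""
    return min(word, word.translate(_COMPLEMENT)[::-1])


def count_kmers(seq, k):
    """Histogram of canonical k-mers in an uppercase sequence."""
    return Counter(canonical(seq[i:i + k]) for i in range(len(seq) - k + 1))


def overrepresented_kmers(counts1, counts2, threshold, f=lambda r: r):
    """Return k-mers whose seed count R = N1*N2 satisfies f(R) > threshold.

    f defaults to the identity, which is only a placeholder for the real f().
    """
    return {w for w in counts1 if f(counts1[w] * counts2.get(w, 0)) > threshold}


def mask_centers(seq, words, k, hard=False):
    """Mask the middle letter of every occurrence of a flagged k-mer (k odd)."""
    out = list(seq)
    mid = k // 2
    for i in range(len(seq) - k + 1):
        if canonical(seq[i:i + k]) in words:
            out[i + mid] = "N" if hard else seq[i + mid].lower()
    return "".join(out)


k = 7
genome1 = "AATGACATTGTCATT"  # toy data: AATGACA plus its reverse complement
genome2 = "CCAATGACACC"      # toy data: one copy of AATGACA
words = overrepresented_kmers(count_kmers(genome1, k), count_kmers(genome2, k), threshold=1)
soft = mask_centers(genome1, words, k)
hard = mask_centers(genome1, words, k, hard=True)
print(words, soft, hard)
```

In this toy case AATGACA has N1 = 2 and N2 = 1, so R = 2 exceeds the threshold of 1, and only the middle letter of each occurrence (forward or reverse complement) is masked.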

Usually, because wublast doesn't support masked databases and the process is symmetrical, it suffices to mask the query genome.