CpG Islands
The CpG Islands track, cpgIslandExt, shows islands found by a program originally written by Gos Micklem. We got the program from WUSTL, where LaDeana Hillier had made some edits to it, and then Angie edited it again to ensure that the ratio of observed to expected CpGs is calculated as stated in Gardiner-Garden and Frommer, J. Mol. Biol. (1987) 196 (2), 261-282:
observed/expected CpG = number of CpG * length / (number of C * number of G)
The "(AL)" track shows islands found by a program written by Andy Law of the Roslin Institute (again with small corrections by Angie).
Andy's program performs a sliding window search on the locations of CG's in the genome (as opposed to a sliding window search over all bases). It simply finds all stretches of sequence that meet the parameters that both programs claim to use:
length >= 200bp, %GC >= 50%, observed/expected CpG >= 0.6.
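For illustration, here is a minimal Python sketch of checking one candidate stretch against those three stated parameters, using the Gardiner-Garden and Frommer ratio. This is not code from either program; the function names (`island_stats`, `meets_stated_params`) and the handling of sequences with no C or no G (O/E treated as 0) are assumptions made for the example.

```python
def island_stats(seq):
    """Return (length, %GC, observed/expected CpG) for a sequence.

    Hypothetical helper for illustration only; not taken from either program.
    """
    seq = seq.upper()
    length = len(seq)
    num_c = seq.count("C")
    num_g = seq.count("G")
    num_cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_percent = 100.0 * (num_c + num_g) / length if length else 0.0
    # Gardiner-Garden and Frommer: CpG count * length / (C count * G count).
    # Assumption: report 0 when the expected value is undefined (no C or no G).
    if num_c == 0 or num_g == 0:
        obs_exp = 0.0
    else:
        obs_exp = num_cpg * length / (num_c * num_g)
    return length, gc_percent, obs_exp


def meets_stated_params(seq, min_len=200, min_gc=50.0, min_oe=0.6):
    """True if seq satisfies length >= 200bp, %GC >= 50%, O/E >= 0.6."""
    length, gc_percent, obs_exp = island_stats(seq)
    return length >= min_len and gc_percent >= min_gc and obs_exp >= min_oe
```

For example, `meets_stated_params("CG" * 100)` is true (200 bases, 100% GC, O/E of 2.0), while a 200-base stretch that is 50% GC but contains no CG dinucleotide fails on O/E alone.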
Gos's program is pickier about which islands it reports, for two reasons:
- it imposes an additional constraint, that a certain running score *must* remain above 0 for the entire length of an island, and
- it also chops up islands at their max-running-score point and evaluates the two halves separately.
The running score is computed as follows: it starts at 0; every time a CG is encountered, it's incremented by 17; at every other base, it's decremented by 1, but never allowed to fall below 0. The running score is used to identify stretches of sequence to evaluate according to length, %GC and O/E, but the running score constraint itself precludes a lot of stretches that would qualify by those 3 stated params.
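The running score described above can be sketched in Python roughly as follows. This is a reading of the description, not a port of Gos's source: in particular it assumes the score is updated once per base, that the +17 is applied at the G of a CG dinucleotide, and that the C of that pair counts as an "other base" (so it gets the -1). The function names `running_scores` and `positive_runs` are made up for the example.

```python
def running_scores(seq, cpg_bonus=17, penalty=1):
    """Return the running score after each base of seq.

    Assumptions (not confirmed against the original source): one update per
    base, +cpg_bonus at the G of a CG, -penalty at every other base, and the
    score is never allowed to fall below 0.
    """
    score = 0
    scores = []
    prev = ""
    for base in seq.upper():
        if prev == "C" and base == "G":
            score += cpg_bonus
        else:
            score = max(0, score - penalty)
        scores.append(score)
        prev = base
    return scores


def positive_runs(scores):
    """Return half-open (start, end) index ranges where the score stays above 0."""
    runs = []
    start = None
    for i, score in enumerate(scores):
        if score > 0 and start is None:
            start = i
        elif score <= 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(scores)))
    return runs
```

Under these assumptions, a single isolated CG keeps the score above 0 for only 17 bases, so a stretch has to keep hitting CGs at least that often to survive as one candidate. That is why this constraint throws out many stretches that would pass the length, %GC and O/E tests on their own.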
We've been using Gos's program for years. Then Andy wrote his program as part of the chicken analysis project (2004), and it found about 3 times as many islands as Gos's. I found that kind of alarming, so I dug into the source code to find out how the two programs differed. I told Jim about the discrepancy in the number of islands found, and expected that he would want to show the track that identifies all stretches that meet the stated params. However, Jim still finds Gos's track more pleasing because its number of islands is closer to the number of genes and it gets a better enrichment score for upstream regions of known genes, i.e. Gos's picky islands are more likely to intersect with promoters than Andy's comprehensive islands.
Terry Furey got some interesting results for an even simpler method of identifying CpG-rich regions during the 2005 ENCODE analysis fair, but I don't think that should go on a public wiki page before publication of the ENCODE analysis papers, so ask Terry or me if you're curious. We may eventually want to offer 3 versions of CpG islands at different points on the sensitivity/specificity curve.