CpG Islands: Difference between revisions
Revision as of 23:56, 7 April 2006
The CpG Islands track, cpgIslandExt, shows islands found by a program originally written by Gos Micklem. We got the program from WUSTL, where LaDeana Hillier made some edits to it, and then Angie edited it again to ensure that the ratio of observed to expected CpGs was calculated as stated in
Gardiner-Garden and Frommer, J. Mol. Biol. (1987) 196 (2), 261-282: observed/expected = (number of CpGs * length) / (number of C * number of G)
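As a rough illustration of the ratio quoted above, here is a minimal Python sketch. It is not taken from either program's source; the function name `cpg_obs_exp` and the choice to return 0.0 when a window has no C or no G are assumptions made for the example.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio for one stretch of sequence.

    Follows the formula quoted above:
    (number of CpGs * length) / (number of C * number of G).
    """
    s = seq.upper()
    n_c = s.count("C")
    n_g = s.count("G")
    # "CG" cannot overlap itself, so str.count sees every CpG.
    n_cpg = s.count("CG")
    if n_c == 0 or n_g == 0:
        # Assumption: treat the ratio as 0 when it would be undefined.
        return 0.0
    return n_cpg * len(s) / (n_c * n_g)
```

For example, `cpg_obs_exp("CGCG")` has 2 CpGs, length 4, 2 Cs and 2 Gs, giving 2 * 4 / (2 * 2) = 2.0.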
The "(AL)" track shows islands found by a program written by Andy Law of the Roslin Institute (again with small corrections by Angie). It is relatively new: developed in 2004, initially pushed only for chicken, but starting to appear as an alternative in other genomes, especially on hgwdev.
Andy's program performs a sliding window search on the locations of CGs in the genome (as opposed to a sliding window search over all bases). It simply finds all stretches of sequence that meet the parameters that both programs claim to use:
length >= 200bp, %GC >= 50%, observed/expected CpG >= 0.6.
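The three stated parameters can be written as a simple check on a candidate stretch. This is only a sketch of the criteria themselves, not of Andy's sliding window search; the constant names and the function `meets_stated_params` are hypothetical.

```python
MIN_LENGTH = 200        # bp
MIN_GC_FRACTION = 0.5   # %GC >= 50%
MIN_OBS_EXP = 0.6       # observed/expected CpG

def meets_stated_params(seq: str) -> bool:
    """True if a stretch satisfies length, %GC and O/E thresholds."""
    s = seq.upper()
    length = len(s)
    n_c, n_g = s.count("C"), s.count("G")
    if length < MIN_LENGTH or n_c == 0 or n_g == 0:
        # Too short, or O/E undefined (assumed to fail).
        return False
    gc_fraction = (n_c + n_g) / length
    obs_exp = s.count("CG") * length / (n_c * n_g)
    return gc_fraction >= MIN_GC_FRACTION and obs_exp >= MIN_OBS_EXP
```

A stretch such as `"C" * 100 + "G" * 100` is long enough and 100% GC but contains a single CpG, so its O/E is 0.02 and it fails the check.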
Gos's program is pickier about which islands it reports, for two reasons:
- it imposes an additional constraint, that a certain running score *must* remain above 0 for the entire length of an island, and
- it also chops up islands at their max-running-score point and evaluates the two halves separately.
The running score is computed as follows: it starts at 0; every time a CG is encountered, it's incremented by 17; at every other base, it's decremented by 1, but never allowed to fall below 0. The running score is used to identify stretches of sequence to evaluate according to length, %GC and O/E, but the running score constraint itself excludes a lot of stretches that would otherwise qualify by those 3 stated params.
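The running score described above might look something like the following Python sketch. It is an illustration of the description, not a transcription of Gos's source: in particular, crediting the +17 at the G of each CG (so the preceding C is scored as an ordinary base) is an assumption, and the names `running_scores` and `positive_stretches` are hypothetical.

```python
CG_BONUS = 17      # added when a CG is encountered
BASE_PENALTY = 1   # subtracted at every other base

def running_scores(seq: str) -> list[int]:
    """Running score after each base, never allowed below 0."""
    s = seq.upper()
    scores = []
    score = 0
    for i, base in enumerate(s):
        # Assumption: a CG is credited when its G is reached.
        if base == "G" and i > 0 and s[i - 1] == "C":
            score += CG_BONUS
        else:
            score = max(0, score - BASE_PENALTY)
        scores.append(score)
    return scores

def positive_stretches(scores: list[int]) -> list[tuple[int, int]]:
    """Half-open (start, end) index ranges where the score stays above 0."""
    stretches = []
    start = None
    for i, sc in enumerate(scores):
        if sc > 0 and start is None:
            start = i
        elif sc == 0 and start is not None:
            stretches.append((start, i))
            start = None
    if start is not None:
        stretches.append((start, len(scores)))
    return stretches
```

Under these assumptions a single CG followed by 17 or more non-CG bases lets the score decay back to 0, which ends the candidate stretch. Each stretch would still have to pass the length, %GC and O/E tests before being reported.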
The following is mostly Angie's opinion...
We have used Gos's program to generate the CpG Islands track since ~2002; before that, WUSTL folks used the program to generate islands, which we loaded. Then Andy wrote his program as part of the chicken analysis project (2004), and suddenly it found about 3 times as many islands as Gos's, which I found kind of alarming, so I dug into the source code to find out how they differed. I told Jim about the discrepancy in the number of islands found, and expected that he would want to show the track that identified all stretches that meet the stated params. However, Jim still finds Gos's track more pleasing because its number of islands is closer to the number of genes and it gets a better enrichment score for upstream regions of known genes, i.e. Gos's picky islands are more likely to intersect with promoters than Andy's comprehensive islands.
Terry Furey got some interesting results for an even simpler method of identifying CpG-rich regions during the 2005 ENCODE analysis fair, but I don't think that should go on a public wiki page before publication of the ENCODE analysis papers, so ask Terry or me if you're curious. We may eventually want to offer 3 versions of CpG islands at different points on the sensitivity/specificity curve.