CPG Islands: Difference between revisions
From genomewiki
Jump to navigationJump to search
(initial contents) |
(add category tags) |
||
(3 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
UCSC uses a local copy of the [http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=tree;f=src/utils/cpgIslandExt CPG Island Ext] software. | UCSC uses a local copy of the [http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=tree;f=src/utils/cpgIslandExt CPG Island Ext] software. | ||
The same as with the [[Genscan]] this works with hard masked sequence. | The same as with the [[Genscan]] this works with hard masked sequence. | ||
==Hard mask== | |||
All masked sequence is converted to '''N's'''. | |||
UCSC runs this as a cluster job, although it is a simple loop if you are willing to wait. The gensub2 template is: | |||
#LOOP | |||
./runOne.csh $(root1) ../hardMaskedFa/$(lastDir1)/$(file1) | |||
#ENDLOOP | |||
The '''runOne.csh''' script: | |||
#!/bin/csh -ef | |||
set chrom = $1 | |||
set result = $2 | |||
twoBitToFa /data/genomes/ricCom1/ricCom1.2bit:$chrom stdout \ | |||
| maskOutFa stdin hard $result | |||
C-shell script syntax here to manually construct a directory hierarchy to contain all the resulting | |||
hard masked fasta files and the ''chr.list'' file to use with gensub2: | |||
mkdir -p /data/genomes/ricCom1/bed/cpgIslands/run.hardMask | |||
cd /data/genomes/ricCom1/bed/cpgIslands/run.hardMask | |||
mkdir -p ../hardMaskedFa | |||
set perDirLimit = 4000 | |||
set ctgCount = `twoBitInfo /data/genomes/ricCom1/ricCom1.2bit stdout | wc -l` | |||
set subDirCount = `echo $ctgCount | awk '{printf "%d", 1+$1/4000}'` | |||
@ dirCount = 0 | |||
set dirName = `echo $dirCount | awk '{printf "%03d", $1}'` | |||
@ perDirCount = 0 | |||
mkdir ../hardMaskedFa/$dirName | |||
/bin/rm -f chr.list | |||
/bin/touch chr.list | |||
foreach chrom ( `twoBitInfo /data/genomes/ricCom1/ricCom1.2bit stdout | cut -f1` ) | |||
if ($perDirCount < $perDirLimit) then | |||
@ perDirCount += 1 | |||
else | |||
@ dirCount += 1 | |||
set dirName = `echo $dirCount | awk '{printf "%03d", $1}'` | |||
set perDirCount = 1 | |||
mkdir ../hardMaskedFa/$dirName | |||
endif | |||
echo $dirName/$chrom.fa >> chr.list | |||
end | |||
The gensub2 constructs the ''jobList'': | |||
gensub2 chr.list single template jobList | |||
Typical parasol cluster run: | |||
para make jobList | |||
para check | |||
para time > run.time | |||
cat run.time | |||
Completed: 25763 of 25763 jobs | |||
CPU time in finished jobs: 86s 1.43m 0.02h 0.00d 0.000 y | |||
IO & Wait Time: 65591s 1093.18m 18.22h 0.76d 0.002 y | |||
Average job time: 3s 0.04m 0.00h 0.00d | |||
Longest finished job: 6s 0.10m 0.00h 0.00d | |||
Submission to last job: 261s 4.35m 0.07h 0.00d | |||
==CPG Island cluster run== | |||
The gensub2 template: | |||
#LOOP | |||
./runCpg.csh $(root1) $(lastDir1) {check out exists results/$(lastDir1)/$(root1).cpg} | |||
#ENDLOOP | |||
The '''runCpg.csh''' script: | |||
#!/bin/csh -ef | |||
set chrom = $1 | |||
set dir = $2 | |||
set result = $3 | |||
set resultDir = $result:h | |||
mkdir -p $resultDir | |||
set seqFile = hardMaskedFa/$dir/$chrom.fa | |||
set C = `faCount $seqFile | egrep -v "^#seq|^total" | awk '{print $2 - $7}'` | |||
if ( $C > 200 ) then | |||
cpg_lh $seqFile > $result | |||
else | |||
touch $result | |||
endif | |||
Typical parasol operation: | |||
cd /data/genomes/ricCom1/bed/cpgIslands | |||
mkdir -p results | |||
rm -f file.list | |||
find ./hardMaskedFa -type f > file.list | |||
gensub2 file.list single gsub jobList | |||
para make jobList | |||
para check | |||
para time > run.time | |||
cat run.time | |||
Completed: 25763 of 25763 jobs | |||
CPU time in finished jobs: 73s 1.21m 0.02h 0.00d 0.000 y | |||
IO & Wait Time: 66879s 1114.66m 18.58h 0.77d 0.002 y | |||
Average job time: 3s 0.04m 0.00h 0.00d | |||
Longest finished job: 7s 0.12m 0.00h 0.00d | |||
Submission to last job: 353s 5.88m 0.10h 0.00d | |||
==Collect results, create bigBed file== | |||
cd /data/genomes/ricCom1/bed/cpgIslands | |||
catDir -r results \ | |||
| awk '{$2 = $2 - 1; width = $3 - $2; printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \ | |||
| sort -k1,1 -k2,2n > cpgIsland.bed | |||
bedToBigBed -tab -type=bed4+6 -as=cpgIslandExt.as cpgIsland.bed \ | |||
../../chrom.sizes ricCom1.cpgIslandExt.bb | |||
Using the [http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob_plain;f=src/hg/lib/cpgIslandExt.as cpgIslandExt.as] table definition | |||
from the source tree | |||
table cpgIslandExt | |||
"Describes the CpG Islands (includes observed/expected ratio)" | |||
( | |||
string chrom; "Reference sequence chromosome or scaffold" | |||
uint chromStart; "Start position in chromosome" | |||
uint chromEnd; "End position in chromosome" | |||
string name; "CpG Island" | |||
uint length; "Island Length" | |||
uint cpgNum; "Number of CpGs in island" | |||
uint gcNum; "Number of C and G in island" | |||
float perCpg; "Percentage of island that is CpG" | |||
float perGc; "Percentage of island that is C or G" | |||
float obsExp; "Ratio of observed(cpgNum) to expected(numC*numG/length) CpG in island" | |||
) | |||
==Track hub trackDb.txt== | |||
track cpgIslandExt_ | |||
shortLabel CpG Islands | |||
longLabel CpG Islands (Islands < 300 Bases are Light Green) | |||
group regulation | |||
priority 90 | |||
visibility pack | |||
color 0,100,0 | |||
altColor 128,228,128 | |||
type bigBed 4 + | |||
bigDataUrl bbi/ricCom1.cpgIslandExt.bb | |||
html ../trackDescriptions/cpgIslandExt | |||
[[Category:Technical FAQ]] | |||
[[Category:User Developed Scripts]] | |||
[[Category:Assembly/Track Hubs]] |
Latest revision as of 21:02, 23 April 2013
UCSC uses a local copy of the CPG Island Ext software. The same as with the Genscan this works with hard masked sequence.
Hard mask
All masked sequence is converted to N's. UCSC runs this as a cluster job, although it is a simple loop if you are willing to wait. The gensub2 template is:
#LOOP ./runOne.csh $(root1) ../hardMaskedFa/$(lastDir1)/$(file1) #ENDLOOP
The runOne.csh script:
#!/bin/csh -ef set chrom = $1 set result = $2 twoBitToFa /data/genomes/ricCom1/ricCom1.2bit:$chrom stdout \ | maskOutFa stdin hard $result
C-shell script syntax here to manually construct a directory hierarchy to contain all the resulting hard masked fasta files and the chr.list file to use with gensub2:
mkdir -p /data/genomes/ricCom1/bed/cpgIslands/run.hardMask cd /data/genomes/ricCom1/bed/cpgIslands/run.hardMask mkdir -p ../hardMaskedFa set perDirLimit = 4000 set ctgCount = `twoBitInfo /data/genomes/ricCom1/ricCom1.2bit stdout | wc -l` set subDirCount = `echo $ctgCount | awk '{printf "%d", 1+$1/4000}'` @ dirCount = 0 set dirName = `echo $dirCount | awk '{printf "%03d", $1}'` @ perDirCount = 0 mkdir ../hardMaskedFa/$dirName /bin/rm -f chr.list /bin/touch chr.list foreach chrom ( `twoBitInfo /data/genomes/ricCom1/ricCom1.2bit stdout | cut -f1` ) if ($perDirCount < $perDirLimit) then @ perDirCount += 1 else @ dirCount += 1 set dirName = `echo $dirCount | awk '{printf "%03d", $1}'` set perDirCount = 1 mkdir ../hardMaskedFa/$dirName endif echo $dirName/$chrom.fa >> chr.list end
The gensub2 constructs the jobList:
gensub2 chr.list single template jobList
Typical parasol cluster run:
para make jobList para check para time > run.time cat run.time Completed: 25763 of 25763 jobs CPU time in finished jobs: 86s 1.43m 0.02h 0.00d 0.000 y IO & Wait Time: 65591s 1093.18m 18.22h 0.76d 0.002 y Average job time: 3s 0.04m 0.00h 0.00d Longest finished job: 6s 0.10m 0.00h 0.00d Submission to last job: 261s 4.35m 0.07h 0.00d
CPG Island cluster run
The gensub2 template:
#LOOP ./runCpg.csh $(root1) $(lastDir1) {check out exists results/$(lastDir1)/$(root1).cpg} #ENDLOOP
The runCpg.csh script:
#!/bin/csh -ef set chrom = $1 set dir = $2 set result = $3 set resultDir = $result:h mkdir -p $resultDir set seqFile = hardMaskedFa/$dir/$chrom.fa set C = `faCount $seqFile | egrep -v "^#seq|^total" | awk '{print $2 - $7}'` if ( $C > 200 ) then cpg_lh $seqFile > $result else touch $result endif
Typical parasol operation:
cd /data/genomes/ricCom1/bed/cpgIslands mkdir -p results rm -f file.list find ./hardMaskedFa -type f > file.list gensub2 file.list single gsub jobList para make jobList para check para time > run.time cat run.time Completed: 25763 of 25763 jobs CPU time in finished jobs: 73s 1.21m 0.02h 0.00d 0.000 y IO & Wait Time: 66879s 1114.66m 18.58h 0.77d 0.002 y Average job time: 3s 0.04m 0.00h 0.00d Longest finished job: 7s 0.12m 0.00h 0.00d Submission to last job: 353s 5.88m 0.10h 0.00d
Collect results, create bigBed file
cd /data/genomes/ricCom1/bed/cpgIslands catDir -r results \ | awk '{$2 = $2 - 1; width = $3 - $2; printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \ | sort -k1,1 -k2,2n > cpgIsland.bed bedToBigBed -tab -type=bed4+6 -as=cpgIslandExt.as cpgIsland.bed \ ../../chrom.sizes ricCom1.cpgIslandExt.bb
Using the cpgIslandExt.as table definition from the source tree
table cpgIslandExt "Describes the CpG Islands (includes observed/expected ratio)" ( string chrom; "Reference sequence chromosome or scaffold" uint chromStart; "Start position in chromosome" uint chromEnd; "End position in chromosome" string name; "CpG Island" uint length; "Island Length" uint cpgNum; "Number of CpGs in island" uint gcNum; "Number of C and G in island" float perCpg; "Percentage of island that is CpG" float perGc; "Percentage of island that is C or G" float obsExp; "Ratio of observed(cpgNum) to expected(numC*numG/length) CpG in island" )
Track hub trackDb.txt
track cpgIslandExt_ shortLabel CpG Islands longLabel CpG Islands (Islands < 300 Bases are Light Green) group regulation priority 90 visibility pack color 0,100,0 altColor 128,228,128 type bigBed 4 + bigDataUrl bbi/ricCom1.cpgIslandExt.bb html ../trackDescriptions/cpgIslandExt