CPG Islands
From genomewiki
Jump to navigationJump to search
UCSC uses a local copy of the CPG Island Ext software. The same as with the Genscan this works with hard masked sequence.
Hard mask
All masked sequence is converted to N's. UCSC runs this as a cluster job, although it is a simple loop if you are willing to wait. The gensub2 template is:
#LOOP ./runOne.csh $(root1) ../hardMaskedFa/$(lastDir1)/$(file1) #ENDLOOP
The runOne.csh script:
#!/bin/csh -ef set chrom = $1 set result = $2 twoBitToFa /data/genomes/ricCom1/ricCom1.2bit:$chrom stdout \ | maskOutFa stdin hard $result
C-shell script syntax here to manually construct a directory hierarchy to contain all the resulting hard masked fasta files and the chr.list file to use with gensub2:
mkdir -p /data/genomes/ricCom1/bed/cpgIslands/run.hardMask cd /data/genomes/ricCom1/bed/cpgIslands/run.hardMask mkdir -p ../hardMaskedFa set perDirLimit = 4000 set ctgCount = `twoBitInfo /data/genomes/ricCom1/ricCom1.2bit stdout | wc -l` set subDirCount = `echo $ctgCount | awk '{printf "%d", 1+$1/4000}'` @ dirCount = 0 set dirName = `echo $dirCount | awk '{printf "%03d", $1}'` @ perDirCount = 0 mkdir ../hardMaskedFa/$dirName /bin/rm -f chr.list /bin/touch chr.list foreach chrom ( `twoBitInfo /data/genomes/ricCom1/ricCom1.2bit stdout | cut -f1` ) if ($perDirCount < $perDirLimit) then @ perDirCount += 1 else @ dirCount += 1 set dirName = `echo $dirCount | awk '{printf "%03d", $1}'` set perDirCount = 1 mkdir ../hardMaskedFa/$dirName endif echo $dirName/$chrom.fa >> chr.list end
The gensub2 constructs the jobList:
gensub2 chr.list single template jobList
Typical parasol cluster run:
para make jobList para check para time > run.time cat run.time Completed: 25763 of 25763 jobs CPU time in finished jobs: 86s 1.43m 0.02h 0.00d 0.000 y IO & Wait Time: 65591s 1093.18m 18.22h 0.76d 0.002 y Average job time: 3s 0.04m 0.00h 0.00d Longest finished job: 6s 0.10m 0.00h 0.00d Submission to last job: 261s 4.35m 0.07h 0.00d