Genscan

From genomewiki
Jump to: navigation, search

Hard mask

Genscan works with hard masked sequence. All masked sequence is converted to N's. UCSC runs this as a cluster job, although it is a simple loop if you are willing to wait. The gensub2 template is:

#LOOP
./runOne.csh $(root1) ../hardMaskedFa/$(lastDir1)/$(file1)
#ENDLOOP

The runOne.csh script:

#!/bin/csh -ef
set chrom = $1
set result = $2
twoBitToFa /data/genomes/ricCom1/ricCom1.2bit:$chrom stdout \
 | maskOutFa stdin hard $result

C-shell script syntax here to manually construct a directory hierarchy to contain all the resulting hard masked fasta files and the chr.list file to use with gensub2:

mkdir -p /data/genomes/ricCom1/bed/genscan/run.hardMask
cd /data/genomes/ricCom1/bed/genscan/run.hardMask
mkdir -p ../hardMaskedFa
set perDirLimit = 4000
set ctgCount = `twoBitInfo /data/genomes/ricCom1/ricCom1.2bit stdout | wc -l`
set subDirCount = `echo $ctgCount | awk '{printf "%d", 1+$1/4000}'`
@ dirCount = 0
set dirName = `echo $dirCount | awk '{printf "%03d", $1}'`
@ perDirCount = 0
mkdir ../hardMaskedFa/$dirName
/bin/rm -f chr.list
/bin/touch chr.list
foreach chrom ( `twoBitInfo /data/genomes/ricCom1/ricCom1.2bit stdout | cut -f1` )
  if ($perDirCount < $perDirLimit) then
    @ perDirCount += 1
  else
    @ dirCount += 1
    set dirName = `echo $dirCount | awk '{printf "%03d", $1}'`
    set perDirCount = 1
    mkdir ../hardMaskedFa/$dirName
  endif
  echo $dirName/$chrom.fa >> chr.list
end

The gensub2 constructs the jobList:

gensub2 chr.list single template jobList

Typical parasol cluster run:

para make jobList
para check
para time > run.time
cat run.time
Completed: 25763 of 25763 jobs
CPU time in finished jobs:         90s       1.50m     0.02h    0.00d  0.000 y
IO & Wait Time:                 65875s    1097.92m    18.30h    0.76d  0.002 y
Average job time:                   3s       0.04m     0.00h    0.00d
Longest finished job:               6s       0.10m     0.00h    0.00d
Submission to last job:           259s       4.32m     0.07h    0.00d

Genscan cluster run

The gensub2 template:

#LOOP
./runGsBig.csh $(root1) $(lastDir1) {check out exists gtf/$(lastDir1)/$(root1).gtf} {check out exists pep/$(lastDir1)/$(root1).pep} {check out exists subopt/$(lastDir1)/$(root1).bed}
#ENDLOOP

The runGsBig.csh script:

#!/bin/csh -ef
set chrom = $1
set dir = $2
set resultGtf = $3
set resultPep = $4
set resultSubopt = $5
mkdir -p gtf/$dir pep/$dir subopt/$dir
set seqFile = hardMaskedFa/$dir/$chrom.fa
gsBig $seqFile $resultGtf -trans=$resultPep -subopt=$resultSubopt -exe=/scratch/data/genscan/genscan -par=/scratch/data/genscan/HumanIso.smat -tmp=/tmp -window=2400000

The typical parasol cluster run:

cd /data/genomes/ricCom1/bed/genscan
rm -f file.list
find ./hardMaskedFa -type f > file.list
gensub2 file.list single template jobList
para make jobList
para check
para time > run.time
cat run.time
Completed: 25763 of 25763 jobs
CPU time in finished jobs:       5327s      88.78m     1.48h    0.06d  0.000 y
IO & Wait Time:                 70708s    1178.47m    19.64h    0.82d  0.002 y
Average job time:                   3s       0.05m     0.00h    0.00d
Longest finished job:             145s       2.42m     0.04h    0.00d
Submission to last job:           478s       7.97m     0.13h    0.01d

Collect results together into single files:

cd /data/genomes/ricCom1/bed/genscan
catDir -r gtf > genscan.gtf
catDir -r pep > genscan.pep
catDir -r subopt > genscanSubopt.bed

Big Bed track hub files

cd /data/genomes/ricCom1/bed/genscan
gtfToGenePred genscan.gtf genscan.gp
genePredCheck genscan.gp
genePredToBed genscan.gp genscan.bed
bedToBigBed genscan.bed ../../chrom.sizes genscan.bb
bedToBigBed genscanSubopt.bed ../../chrom.sizes genscanSubopt.bb

trackDb.txt entry

Your track hub trackDb.txt stanzas:

track genscan_
shortLabel Genscan Genes
longLabel Genscan Gene Predictions
group genes
priority 50
visibility pack
color 170,100,0
type bigBed 12 .
bigDataUrl bbi/ricCom1.genscan.bb
html ../trackDescriptions/genscan
track genscanSubopt_
shortLabel Genscan Subopt
longLabel Genscan Suboptimal Exon Predictions
group genes
priority 51
visibility hide
color 180,90,0
type bigBed 6 .
bigDataUrl bbi/ricCom1.genscanSubopt.bb