Genscan

From genomewiki
Jump to navigationJump to search
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

Hard mask

Genscan works with hard masked sequence. All masked sequence is converted to N's. UCSC runs this as a cluster job, although it is a simple loop if you are willing to wait. The gensub2 template is:

#LOOP
./runOne.csh $(root1) ../hardMaskedFa/$(lastDir1)/$(file1)
#ENDLOOP

The runOne.csh script:

#!/bin/csh -ef
set chrom = $1
set result = $2
twoBitToFa /data/genomes/ricCom1/ricCom1.2bit:$chrom stdout \
 | maskOutFa stdin hard $result

C-shell script syntax here to manually construct a directory hierarchy to contain all the resulting hard masked fasta files and the chr.list file to use with gensub2:

mkdir -p /data/genomes/ricCom1/bed/genscan/run.hardMask
cd /data/genomes/ricCom1/bed/genscan/run.hardMask
mkdir -p ../hardMaskedFa
set perDirLimit = 4000
set ctgCount = `twoBitInfo /data/genomes/ricCom1/ricCom1.2bit stdout | wc -l`
set subDirCount = `echo $ctgCount | awk '{printf "%d", 1+$1/4000}'`
@ dirCount = 0
set dirName = `echo $dirCount | awk '{printf "%03d", $1}'`
@ perDirCount = 0
mkdir ../hardMaskedFa/$dirName
/bin/rm -f chr.list
/bin/touch chr.list
foreach chrom ( `twoBitInfo /data/genomes/ricCom1/ricCom1.2bit stdout | cut -f1` )
  if ($perDirCount < $perDirLimit) then
    @ perDirCount += 1
  else
    @ dirCount += 1
    set dirName = `echo $dirCount | awk '{printf "%03d", $1}'`
    set perDirCount = 1
    mkdir ../hardMaskedFa/$dirName
  endif
  echo $dirName/$chrom.fa >> chr.list
end

The gensub2 constructs the jobList:

gensub2 chr.list single template jobList

Typical parasol cluster run:

para make jobList
para check
para time > run.time
cat run.time
Completed: 25763 of 25763 jobs
CPU time in finished jobs:         90s       1.50m     0.02h    0.00d  0.000 y
IO & Wait Time:                 65875s    1097.92m    18.30h    0.76d  0.002 y
Average job time:                   3s       0.04m     0.00h    0.00d
Longest finished job:               6s       0.10m     0.00h    0.00d
Submission to last job:           259s       4.32m     0.07h    0.00d

Genscan cluster run

The gensub2 template:

#LOOP
./runGsBig.csh $(root1) $(lastDir1) {check out exists gtf/$(lastDir1)/$(root1).gtf} {check out exists pep/$(lastDir1)/$(root1).pep} {check out exists subopt/$(lastDir1)/$(root1).bed}
#ENDLOOP

The runGsBig.csh script:

#!/bin/csh -ef
set chrom = $1
set dir = $2
set resultGtf = $3
set resultPep = $4
set resultSubopt = $5
mkdir -p gtf/$dir pep/$dir subopt/$dir
set seqFile = hardMaskedFa/$dir/$chrom.fa
gsBig $seqFile $resultGtf -trans=$resultPep -subopt=$resultSubopt -exe=/scratch/data/genscan/genscan -par=/scratch/data/genscan/HumanIso.smat -tmp=/tmp -window=2400000

The typical parasol cluster run:

cd /data/genomes/ricCom1/bed/genscan
rm -f file.list
find ./hardMaskedFa -type f > file.list
gensub2 file.list single template jobList
para make jobList
para check
para time > run.time
cat run.time
Completed: 25763 of 25763 jobs
CPU time in finished jobs:       5327s      88.78m     1.48h    0.06d  0.000 y
IO & Wait Time:                 70708s    1178.47m    19.64h    0.82d  0.002 y
Average job time:                   3s       0.05m     0.00h    0.00d
Longest finished job:             145s       2.42m     0.04h    0.00d
Submission to last job:           478s       7.97m     0.13h    0.01d

Collect results together into single files:

cd /data/genomes/ricCom1/bed/genscan
catDir -r gtf > genscan.gtf
catDir -r pep > genscan.pep
catDir -r subopt > genscanSubopt.bed

Big Bed track hub files

cd /data/genomes/ricCom1/bed/genscan
gtfToGenePred genscan.gtf genscan.gp
genePredCheck genscan.gp
genePredToBed genscan.gp genscan.bed
bedToBigBed genscan.bed ../../chrom.sizes genscan.bb
bedToBigBed genscanSubopt.bed ../../chrom.sizes genscanSubopt.bb

trackDb.txt entry

Your track hub trackDb.txt stanzas:

track genscan_
shortLabel Genscan Genes
longLabel Genscan Gene Predictions
group genes
priority 50
visibility pack
color 170,100,0
type bigBed 12 .
bigDataUrl bbi/ricCom1.genscan.bb
html ../trackDescriptions/genscan
track genscanSubopt_
shortLabel Genscan Subopt
longLabel Genscan Suboptimal Exon Predictions
group genes
priority 51
visibility hide
color 180,90,0
type bigBed 6 .
bigDataUrl bbi/ricCom1.genscanSubopt.bb