RepeatMasker: Difference between revisions
(initial contents) |
(complete partition operation, starting cluster run) |
||
Line 5: | Line 5: | ||
The source tree script [http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob;f=src/hg/utils/automation/simplePartition.pl simplePartition.pl] can | The source tree script [http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob;f=src/hg/utils/automation/simplePartition.pl simplePartition.pl] can | ||
be used to partition the genome. The output of the script is merely a listing of twoBit file references representing the partitions. A specific bit of | be used to partition the genome. The output of the script is merely a listing of twoBit file references representing the partitions. A specific bit of | ||
sequence is obtained from the twoBit file at the time it is needed for the RepeatMasker operation. | sequence is obtained from the twoBit file at the time it is needed for the RepeatMasker operation. UCSC uses partitions of 500,000 bases when using | ||
the ''cross_match'' alignment engine with RepeatMasker. That size of partition often results in a single job run-time of about 1.5 hours. When using | |||
the ''rmblast'' alignment engine with RepeatMasker, the typical single job run-time is on the order of 10 minutes. The resulting masking is very similar. | |||
This example constructs partition lists in the specified RMPart directory hierarchy: | |||
mkdir -p /data/genomes/ricCom1/bed/repeatMasker/ | |||
cd /data/genomes/ricCom1/bed/repeatMasker | |||
simplePartition.pl ../../ricCom1.unmasked.2bit 500000 /data/genomes/ricCom1/RMPart | |||
==Cluster run== | |||
The ''template'' file used here: | |||
#LOOP | |||
./RMRun.csh /data/genomes/ricCom1/RMPart/$(path1).out | |||
#ENDLOOP | |||
with the gensub2 command constructs the command list ''jobList'' to run on the cluster. | |||
mkdir -p /data/genomes/ricCom1/bed/repeatMasker/run.cluster | |||
cd /data/genomes/ricCom1/bed/repeatMasker/run.cluster | |||
ln -s /data/genomes/ricCom1/RMPart ./RMPart | |||
gensub2/data/genomes/ricCom1/RMPart/partitions.lst single template jobList | |||
This example constructs 880 commands to run in the ''jobList'' file, typically: | |||
./RMRun.csh /data/genomes/ricCom1/RMPart/000/000.lst.out | |||
./RMRun.csh /data/genomes/ricCom1/RMPart/001/001.lst.out | |||
./RMRun.csh /data/genomes/ricCom1/RMPart/002/002.lst.out | |||
... etc. | |||
The ''RMRun.csh'' script performs the RepeatMasker operation on each partition: | |||
#!/bin/csh -ef | |||
set finalOut = $1 | |||
set inLst = $finalOut:r | |||
set inLft = $inLst:r.lft | |||
set alignOut = $finalOut:r.align | |||
set catOut = $finalOut:r.cat | |||
# Use local disk for output, and move the final result to $outPsl | |||
# when done, to minimize I/O. | |||
set tmpDir = `mktemp -d -p /scratch/tmp doRepeatMasker.cluster.XXXXXX` | |||
pushd $tmpDir | |||
# Initialize local library | |||
RepeatMasker -engine rmblast -pa 1 -species 'Ricinus communis' /dev/null | |||
foreach spec (`cat $inLst`) | |||
# Remove path and .2bit filename to get just the seq:start-end spec: | |||
set base = `echo $spec | sed -r -e 's/^[^:]+://'` | |||
# If $spec is the whole sequence, twoBitToFa removes the :start-end part, | |||
# which causes liftUp to barf later. So tweak the header back to | |||
# seq:start-end for liftUp's sake: | |||
twoBitToFa $spec stdout \ | |||
| sed -e "s/^>.*/>$base/" > $base.fa | |||
RepeatMasker -engine rmblast -pa 1 -align -species 'Ricinus communis' $base.fa | |||
if (-e $base.fa.cat) then | |||
mv $base.fa.cat $catOut | |||
endif | |||
end | |||
# Lift up (leave the RepeatMasker header in place because we'll liftUp | |||
# again later): | |||
liftUp -type=.out stdout $inLft error *.fa.out > tmpOut__out | |||
set nonomatch | |||
set alignFiles = ( *.align ) | |||
if ( ${#alignFiles} && -e $alignFiles[1] ) then | |||
liftUp -type=.align stdout $inLft error *.align \ | |||
> tmpOut__align | |||
else | |||
touch tmpOut__align | |||
endif | |||
# Move final result into place: | |||
mv tmpOut__out $finalOut | |||
mv tmpOut__align $alignOut | |||
popd | |||
rm -rf $tmpDir | |||
If you have a ''parasol'' cluster, this operation is performed: | |||
para -ram=2g make jobList | |||
para check | |||
para time > run.time | |||
cat run.time | |||
Completed: 880 of 880 jobs | |||
CPU time in finished jobs: 202665s 3377.75m 56.30h 2.35d 0.006 y | |||
IO & Wait Time: 119279s 1987.98m 33.13h 1.38d 0.004 y | |||
Average job time: 366s 6.10m 0.10h 0.00d | |||
Longest finished job: 6789s 113.15m 1.89h 0.08d | |||
Submission to last job: 6938s 115.63m 1.93h 0.08d |
Revision as of 17:20, 23 April 2013
Example procedure to operate RepeatMasker on a typical genome assembly sequence.
Partition
The source tree script simplePartition.pl can be used to partition the genome. The output of the script is merely a listing of twoBit file references representing the partitions. A specific bit of sequence is obtained from the twoBit file at the time it is needed for the RepeatMasker operation. UCSC uses partitions of 500,000 bases when using the cross_match alignment engine with RepeatMasker. That size of partition often results in a single job run-time of about 1.5 hours. When using the rmblast alignment engine with RepeatMasker, the typical single job run-time is on the order of 10 minutes. The resulting masking is very similar. This example constructs partition lists in the specified RMPart directory hierarchy:
mkdir -p /data/genomes/ricCom1/bed/repeatMasker/ cd /data/genomes/ricCom1/bed/repeatMasker simplePartition.pl ../../ricCom1.unmasked.2bit 500000 /data/genomes/ricCom1/RMPart
Cluster run
The template file used here:
#LOOP ./RMRun.csh /data/genomes/ricCom1/RMPart/$(path1).out #ENDLOOP
with the gensub2 command constructs the command list jobList to run on the cluster.
mkdir -p /data/genomes/ricCom1/bed/repeatMasker/run.cluster cd /data/genomes/ricCom1/bed/repeatMasker/run.cluster ln -s /data/genomes/ricCom1/RMPart ./RMPart gensub2/data/genomes/ricCom1/RMPart/partitions.lst single template jobList
This example constructs 880 commands to run in the jobList file, typically:
./RMRun.csh /data/genomes/ricCom1/RMPart/000/000.lst.out ./RMRun.csh /data/genomes/ricCom1/RMPart/001/001.lst.out ./RMRun.csh /data/genomes/ricCom1/RMPart/002/002.lst.out ... etc.
The RMRun.csh script performs the RepeatMasker operation on each partition:
#!/bin/csh -ef set finalOut = $1 set inLst = $finalOut:r set inLft = $inLst:r.lft set alignOut = $finalOut:r.align set catOut = $finalOut:r.cat # Use local disk for output, and move the final result to $outPsl # when done, to minimize I/O. set tmpDir = `mktemp -d -p /scratch/tmp doRepeatMasker.cluster.XXXXXX` pushd $tmpDir # Initialize local library RepeatMasker -engine rmblast -pa 1 -species 'Ricinus communis' /dev/null foreach spec (`cat $inLst`) # Remove path and .2bit filename to get just the seq:start-end spec: set base = `echo $spec | sed -r -e 's/^[^:]+://'` # If $spec is the whole sequence, twoBitToFa removes the :start-end part, # which causes liftUp to barf later. So tweak the header back to # seq:start-end for liftUp's sake: twoBitToFa $spec stdout \ | sed -e "s/^>.*/>$base/" > $base.fa RepeatMasker -engine rmblast -pa 1 -align -species 'Ricinus communis' $base.fa if (-e $base.fa.cat) then mv $base.fa.cat $catOut endif end # Lift up (leave the RepeatMasker header in place because we'll liftUp # again later): liftUp -type=.out stdout $inLft error *.fa.out > tmpOut__out set nonomatch set alignFiles = ( *.align ) if ( ${#alignFiles} && -e $alignFiles[1] ) then liftUp -type=.align stdout $inLft error *.align \ > tmpOut__align else touch tmpOut__align endif # Move final result into place: mv tmpOut__out $finalOut mv tmpOut__align $alignOut popd rm -rf $tmpDir
If you have a parasol cluster, this operation is performed:
para -ram=2g make jobList para check para time > run.time cat run.time Completed: 880 of 880 jobs CPU time in finished jobs: 202665s 3377.75m 56.30h 2.35d 0.006 y IO & Wait Time: 119279s 1987.98m 33.13h 1.38d 0.004 y Average job time: 366s 6.10m 0.10h 0.00d Longest finished job: 6789s 113.15m 1.89h 0.08d Submission to last job: 6938s 115.63m 1.93h 0.08d