RepeatMasker: Difference between revisions

From genomewiki
Jump to navigationJump to search
(initial contents)
 
(complete partition operation, starting cluster run)
Line 5: Line 5:
The source tree script [http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob;f=src/hg/utils/automation/simplePartition.pl simplePartition.pl] can
The source tree script [http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob;f=src/hg/utils/automation/simplePartition.pl simplePartition.pl] can
be used to partition the genome.  The output of the script is merely a listing of twoBit file references representing the partitions.  A specific bit of
be used to partition the genome.  The output of the script is merely a listing of twoBit file references representing the partitions.  A specific bit of
sequence is obtained from the twoBit file at the time it is needed for the RepeatMasker operation.
sequence is obtained from the twoBit file at the time it is needed for the RepeatMasker operation. UCSC uses partitions of 500,000 bases when using
the ''cross_match'' alignment engine with RepeatMasker.  That size of partition often results in a single job run-time of about 1.5 hours.  When using
the ''rmblast'' alignment engine with RepeatMasker, the typical single job run-time is on the order of 10 minutes.  The resulting masking is very similar.
This example constructs partition lists in the specified RMPart directory hierarchy:
 
mkdir -p /data/genomes/ricCom1/bed/repeatMasker/
cd /data/genomes/ricCom1/bed/repeatMasker
simplePartition.pl ../../ricCom1.unmasked.2bit 500000 /data/genomes/ricCom1/RMPart
 
==Cluster run==
The ''template'' file used here:
 
#LOOP
./RMRun.csh /data/genomes/ricCom1/RMPart/$(path1).out
#ENDLOOP
 
with the gensub2 command constructs the command list ''jobList'' to run on the cluster.
 
mkdir -p /data/genomes/ricCom1/bed/repeatMasker/run.cluster
cd /data/genomes/ricCom1/bed/repeatMasker/run.cluster
ln -s /data/genomes/ricCom1/RMPart ./RMPart
gensub2/data/genomes/ricCom1/RMPart/partitions.lst single template jobList
 
This example constructs 880 commands to run in the ''jobList'' file, typically:
 
./RMRun.csh /data/genomes/ricCom1/RMPart/000/000.lst.out
./RMRun.csh /data/genomes/ricCom1/RMPart/001/001.lst.out
./RMRun.csh /data/genomes/ricCom1/RMPart/002/002.lst.out
... etc.
 
The ''RMRun.csh'' script performs the RepeatMasker operation on each partition:
 
#!/bin/csh -ef
set finalOut = $1
set inLst = $finalOut:r
set inLft = $inLst:r.lft
set alignOut = $finalOut:r.align
set catOut = $finalOut:r.cat
# Use local disk for output, and move the final result to $outPsl
# when done, to minimize I/O.
set tmpDir = `mktemp -d -p /scratch/tmp doRepeatMasker.cluster.XXXXXX`
pushd $tmpDir
# Initialize local library
RepeatMasker -engine rmblast -pa 1 -species 'Ricinus communis' /dev/null
foreach spec (`cat $inLst`)
  # Remove path and .2bit filename to get just the seq:start-end spec:
  set base = `echo $spec | sed -r -e 's/^[^:]+://'`
  # If $spec is the whole sequence, twoBitToFa removes the :start-end part,
  # which causes liftUp to barf later.  So tweak the header back to
  # seq:start-end for liftUp's sake:
  twoBitToFa $spec stdout \
  | sed -e "s/^>.*/>$base/" > $base.fa
  RepeatMasker -engine rmblast -pa 1 -align -species 'Ricinus communis' $base.fa
  if (-e $base.fa.cat) then
    mv $base.fa.cat $catOut
  endif
end
# Lift up (leave the RepeatMasker header in place because we'll liftUp
# again later):
liftUp -type=.out stdout $inLft error *.fa.out > tmpOut__out
set nonomatch
set alignFiles = ( *.align )
if ( ${#alignFiles} && -e $alignFiles[1] ) then
  liftUp -type=.align stdout $inLft error *.align \
    > tmpOut__align
else
  touch tmpOut__align
endif
# Move final result into place:
mv tmpOut__out $finalOut
mv tmpOut__align $alignOut
popd
rm -rf $tmpDir
 
If you have a ''parasol'' cluster, this operation is performed:
 
para -ram=2g make jobList
para check
para time > run.time
cat run.time
Completed: 880 of 880 jobs
CPU time in finished jobs:    202665s    3377.75m    56.30h    2.35d  0.006 y
IO & Wait Time:                119279s    1987.98m    33.13h    1.38d  0.004 y
Average job time:                366s      6.10m    0.10h    0.00d
Longest finished job:            6789s    113.15m    1.89h    0.08d
Submission to last job:          6938s    115.63m    1.93h    0.08d

Revision as of 17:20, 23 April 2013

Example procedure to operate RepeatMasker on a typical genome assembly sequence.

Partition

The source tree script simplePartition.pl can be used to partition the genome. The output of the script is merely a listing of twoBit file references representing the partitions. A specific bit of sequence is obtained from the twoBit file at the time it is needed for the RepeatMasker operation. UCSC uses partitions of 500,000 bases when using the cross_match alignment engine with RepeatMasker. That size of partition often results in a single job run-time of about 1.5 hours. When using the rmblast alignment engine with RepeatMasker, the typical single job run-time is on the order of 10 minutes. The resulting masking is very similar. This example constructs partition lists in the specified RMPart directory hierarchy:

mkdir -p /data/genomes/ricCom1/bed/repeatMasker/
cd /data/genomes/ricCom1/bed/repeatMasker
simplePartition.pl ../../ricCom1.unmasked.2bit 500000 /data/genomes/ricCom1/RMPart

Cluster run

The template file used here:

#LOOP
./RMRun.csh /data/genomes/ricCom1/RMPart/$(path1).out
#ENDLOOP

with the gensub2 command constructs the command list jobList to run on the cluster.

mkdir -p /data/genomes/ricCom1/bed/repeatMasker/run.cluster
cd /data/genomes/ricCom1/bed/repeatMasker/run.cluster
ln -s /data/genomes/ricCom1/RMPart ./RMPart
gensub2/data/genomes/ricCom1/RMPart/partitions.lst single template jobList

This example constructs 880 commands to run in the jobList file, typically:

./RMRun.csh /data/genomes/ricCom1/RMPart/000/000.lst.out
./RMRun.csh /data/genomes/ricCom1/RMPart/001/001.lst.out
./RMRun.csh /data/genomes/ricCom1/RMPart/002/002.lst.out
... etc.

The RMRun.csh script performs the RepeatMasker operation on each partition:

#!/bin/csh -ef

set finalOut = $1

set inLst = $finalOut:r
set inLft = $inLst:r.lft
set alignOut = $finalOut:r.align
set catOut = $finalOut:r.cat

# Use local disk for output, and move the final result to $outPsl
# when done, to minimize I/O.
set tmpDir = `mktemp -d -p /scratch/tmp doRepeatMasker.cluster.XXXXXX`
pushd $tmpDir

# Initialize local library
RepeatMasker -engine rmblast -pa 1 -species 'Ricinus communis' /dev/null

foreach spec (`cat $inLst`)
  # Remove path and .2bit filename to get just the seq:start-end spec:
  set base = `echo $spec | sed -r -e 's/^[^:]+://'`
  # If $spec is the whole sequence, twoBitToFa removes the :start-end part,
  # which causes liftUp to barf later.  So tweak the header back to
  # seq:start-end for liftUp's sake:
  twoBitToFa $spec stdout \
  | sed -e "s/^>.*/>$base/" > $base.fa
  RepeatMasker -engine rmblast -pa 1 -align -species 'Ricinus communis' $base.fa
  if (-e $base.fa.cat) then
    mv $base.fa.cat $catOut
  endif
end

# Lift up (leave the RepeatMasker header in place because we'll liftUp
# again later):
liftUp -type=.out stdout $inLft error *.fa.out > tmpOut__out

set nonomatch
set alignFiles = ( *.align )
if ( ${#alignFiles} && -e $alignFiles[1] ) then
  liftUp -type=.align stdout $inLft error *.align \
    > tmpOut__align
else
  touch tmpOut__align
endif

# Move final result into place:
mv tmpOut__out $finalOut
mv tmpOut__align $alignOut

popd
rm -rf $tmpDir

If you have a parasol cluster, this operation is performed:

para -ram=2g make jobList
para check
para time > run.time
cat run.time
Completed: 880 of 880 jobs
CPU time in finished jobs:     202665s    3377.75m    56.30h    2.35d  0.006 y
IO & Wait Time:                119279s    1987.98m    33.13h    1.38d  0.004 y
Average job time:                 366s       6.10m     0.10h    0.00d
Longest finished job:            6789s     113.15m     1.89h    0.08d
Submission to last job:          6938s     115.63m     1.93h    0.08d