RepeatMasker
Example procedure to operate RepeatMasker on a typical genome assembly sequence.
Partition
The source tree script simplePartition.pl can be used to partition the genome. The output of the script is merely a listing of twoBit file references representing the partitions. A specific bit of sequence is obtained from the twoBit file at the time it is needed for the RepeatMasker operation. UCSC uses partitions of 500,000 bases when using the cross_match alignment engine with RepeatMasker. That size of partition often results in a single job run-time of about 1.5 hours. When using the rmblast alignment engine with RepeatMasker, the typical single job run-time is on the order of 10 minutes. The resulting masking is very similar. This example constructs partition lists in the specified RMPart directory hierarchy:
mkdir -p /data/genomes/ricCom1/bed/repeatMasker/ cd /data/genomes/ricCom1/bed/repeatMasker simplePartition.pl ../../ricCom1.unmasked.2bit 500000 /data/genomes/ricCom1/RMPart
Cluster run
The template file used here:
#LOOP ./RMRun.csh /data/genomes/ricCom1/RMPart/$(path1).out #ENDLOOP
with the gensub2 command constructs the command list jobList to run on the cluster.
mkdir -p /data/genomes/ricCom1/bed/repeatMasker/run.cluster cd /data/genomes/ricCom1/bed/repeatMasker/run.cluster ln -s /data/genomes/ricCom1/RMPart ./RMPart gensub2/data/genomes/ricCom1/RMPart/partitions.lst single template jobList
This example constructs 880 commands to run in the jobList file, typically:
./RMRun.csh /data/genomes/ricCom1/RMPart/000/000.lst.out ./RMRun.csh /data/genomes/ricCom1/RMPart/001/001.lst.out ./RMRun.csh /data/genomes/ricCom1/RMPart/002/002.lst.out ... etc.
The RMRun.csh script performs the RepeatMasker operation on each partition:
#!/bin/csh -ef set finalOut = $1 set inLst = $finalOut:r set inLft = $inLst:r.lft set alignOut = $finalOut:r.align set catOut = $finalOut:r.cat # Use local disk for output, and move the final result to $outPsl # when done, to minimize I/O. set tmpDir = `mktemp -d -p /scratch/tmp doRepeatMasker.cluster.XXXXXX` pushd $tmpDir # Initialize local library RepeatMasker -engine rmblast -pa 1 -species 'Ricinus communis' /dev/null foreach spec (`cat $inLst`) # Remove path and .2bit filename to get just the seq:start-end spec: set base = `echo $spec | sed -r -e 's/^[^:]+://'` # If $spec is the whole sequence, twoBitToFa removes the :start-end part, # which causes liftUp to barf later. So tweak the header back to # seq:start-end for liftUp's sake: twoBitToFa $spec stdout \ | sed -e "s/^>.*/>$base/" > $base.fa RepeatMasker -engine rmblast -pa 1 -align -species 'Ricinus communis' $base.fa if (-e $base.fa.cat) then mv $base.fa.cat $catOut endif end # Lift up (leave the RepeatMasker header in place because we'll liftUp # again later): liftUp -type=.out stdout $inLft error *.fa.out > tmpOut__out set nonomatch set alignFiles = ( *.align ) if ( ${#alignFiles} && -e $alignFiles[1] ) then liftUp -type=.align stdout $inLft error *.align \ > tmpOut__align else touch tmpOut__align endif # Move final result into place: mv tmpOut__out $finalOut mv tmpOut__align $alignOut popd rm -rf $tmpDir
If you have a parasol cluster, this operation is performed.
para -ram=2g make jobList para check para time > run.time cat run.time Completed: 880 of 880 jobs CPU time in finished jobs: 202665s 3377.75m 56.30h 2.35d 0.006 y IO & Wait Time: 119279s 1987.98m 33.13h 1.38d 0.004 y Average job time: 366s 6.10m 0.10h 0.00d Longest finished job: 6789s 113.15m 1.89h 0.08d Submission to last job: 6938s 115.63m 1.93h 0.08d
Collect cluster run results
Gather the partition results into a single file:
cd /data/genomes/ricCom1/bed/repeatMasker liftUp ricCom1.fa.out /dev/null carry /data/genomes/ricCom1/bed/repeatMasker/RMPart/???/*.out liftUp ricCom1.fa.align /dev/null carry /data/genomes/ricCom1/bed/repeatMasker/RMPart/???/*.align head -3 ricCom1.fa.out > ricCom1.sorted.fa.out tail -n +4 ricCom1.fa.out | sort -k5,5 -k6,6n >> ricCom1.sorted.fa.out
In addition, using the UCSC script: extractNestedRepeats.pl for the nestedRepeats track:
# Use the ID column to join up fragments of interrupted repeats for the # Nested Repeats track. extractNestedRepeats.pl ricCom1.fa.out | sort -k1,1 -k2,2n > ricCom1.nestedRepeats.bed
Masking sequence
The unmasked.2bit file is masked with the RepeatMasker results:
cd /data/genomes/ricCom1/bed/repeatMasker twoBitMask /data/genomes/ricCom1/ricCom1.unmasked.2bit ricCom1.sorted.fa.out ricCom1.rmsk.2bit
Measure the masking:
twoBitToFa ricCom1.rmsk.2bit stdout | faSize stdin > faSize.rmsk.txt
Results in the faSize.rmsk.txt file:
350621860 bases (13662715 N's 336959145 real 326566002 upper 10393143 lower) in 25763 sequences in 1 files Total size: mean 13609.5 sd 106411.1 min 202 (EQ999533) max 4693355 (EQ973772) median 1094 %2.96 masked total, %3.08 masked real
This example indicates a very low masking result from RepeatMasker. The Window Masker result masks the sequence much better and will be used for the eventual final masking.
See also
Please note specific details for running each of the masking operations: