RepeatMasker

From genomewiki
Revision as of 19:40, 23 April 2013 by Hiram (talk | contribs) (add links to other instructions)
Jump to navigationJump to search

Example procedure to operate RepeatMasker on a typical genome assembly sequence.

Partition

The source tree script simplePartition.pl can be used to partition the genome. The output of the script is merely a listing of twoBit file references representing the partitions. A specific bit of sequence is obtained from the twoBit file at the time it is needed for the RepeatMasker operation. UCSC uses partitions of 500,000 bases when using the cross_match alignment engine with RepeatMasker. That size of partition often results in a single job run-time of about 1.5 hours. When using the rmblast alignment engine with RepeatMasker, the typical single job run-time is on the order of 10 minutes. The resulting masking is very similar. This example constructs partition lists in the specified RMPart directory hierarchy:

mkdir -p /data/genomes/ricCom1/bed/repeatMasker/
cd /data/genomes/ricCom1/bed/repeatMasker
simplePartition.pl ../../ricCom1.unmasked.2bit 500000 /data/genomes/ricCom1/RMPart

Cluster run

The template file used here:

#LOOP
./RMRun.csh /data/genomes/ricCom1/RMPart/$(path1).out
#ENDLOOP

with the gensub2 command constructs the command list jobList to run on the cluster.

mkdir -p /data/genomes/ricCom1/bed/repeatMasker/run.cluster
cd /data/genomes/ricCom1/bed/repeatMasker/run.cluster
ln -s /data/genomes/ricCom1/RMPart ./RMPart
gensub2/data/genomes/ricCom1/RMPart/partitions.lst single template jobList

This example constructs 880 commands to run in the jobList file, typically:

./RMRun.csh /data/genomes/ricCom1/RMPart/000/000.lst.out
./RMRun.csh /data/genomes/ricCom1/RMPart/001/001.lst.out
./RMRun.csh /data/genomes/ricCom1/RMPart/002/002.lst.out
... etc.

The RMRun.csh script performs the RepeatMasker operation on each partition:

#!/bin/csh -ef

set finalOut = $1

set inLst = $finalOut:r
set inLft = $inLst:r.lft
set alignOut = $finalOut:r.align
set catOut = $finalOut:r.cat

# Use local disk for output, and move the final result to $outPsl
# when done, to minimize I/O.
set tmpDir = `mktemp -d -p /scratch/tmp doRepeatMasker.cluster.XXXXXX`
pushd $tmpDir

# Initialize local library
RepeatMasker -engine rmblast -pa 1 -species 'Ricinus communis' /dev/null

foreach spec (`cat $inLst`)
  # Remove path and .2bit filename to get just the seq:start-end spec:
  set base = `echo $spec | sed -r -e 's/^[^:]+://'`
  # If $spec is the whole sequence, twoBitToFa removes the :start-end part,
  # which causes liftUp to barf later.  So tweak the header back to
  # seq:start-end for liftUp's sake:
  twoBitToFa $spec stdout \
  | sed -e "s/^>.*/>$base/" > $base.fa
  RepeatMasker -engine rmblast -pa 1 -align -species 'Ricinus communis' $base.fa
  if (-e $base.fa.cat) then
    mv $base.fa.cat $catOut
  endif
end

# Lift up (leave the RepeatMasker header in place because we'll liftUp
# again later):
liftUp -type=.out stdout $inLft error *.fa.out > tmpOut__out

set nonomatch
set alignFiles = ( *.align )
if ( ${#alignFiles} && -e $alignFiles[1] ) then
  liftUp -type=.align stdout $inLft error *.align \
    > tmpOut__align
else
  touch tmpOut__align
endif

# Move final result into place:
mv tmpOut__out $finalOut
mv tmpOut__align $alignOut

popd
rm -rf $tmpDir

If you have a parasol cluster, this operation is performed.

para -ram=2g make jobList
para check
para time > run.time
cat run.time
Completed: 880 of 880 jobs
CPU time in finished jobs:     202665s    3377.75m    56.30h    2.35d  0.006 y
IO & Wait Time:                119279s    1987.98m    33.13h    1.38d  0.004 y
Average job time:                 366s       6.10m     0.10h    0.00d
Longest finished job:            6789s     113.15m     1.89h    0.08d
Submission to last job:          6938s     115.63m     1.93h    0.08d

Collect cluster run results

Gather the partition results into a single file:

cd /data/genomes/ricCom1/bed/repeatMasker
liftUp ricCom1.fa.out /dev/null carry /data/genomes/ricCom1/bed/repeatMasker/RMPart/???/*.out
liftUp ricCom1.fa.align /dev/null carry /data/genomes/ricCom1/bed/repeatMasker/RMPart/???/*.align
head -3 ricCom1.fa.out > ricCom1.sorted.fa.out
tail -n +4 ricCom1.fa.out | sort -k5,5 -k6,6n >> ricCom1.sorted.fa.out

In addition, using the UCSC script: extractNestedRepeats.pl for the nestedRepeats track:

# Use the ID column to join up fragments of interrupted repeats for the
# Nested Repeats track.
extractNestedRepeats.pl ricCom1.fa.out | sort -k1,1 -k2,2n > ricCom1.nestedRepeats.bed

Masking sequence

The unmasked.2bit file is masked with the RepeatMasker results:

cd /data/genomes/ricCom1/bed/repeatMasker
twoBitMask /data/genomes/ricCom1/ricCom1.unmasked.2bit ricCom1.sorted.fa.out ricCom1.rmsk.2bit

Measure the masking:

twoBitToFa ricCom1.rmsk.2bit stdout | faSize stdin > faSize.rmsk.txt

Results in the faSize.rmsk.txt file:

350621860 bases (13662715 N's 336959145 real 326566002 upper 10393143 lower) in 25763 sequences in 1 files
Total size: mean 13609.5 sd 106411.1 min 202 (EQ999533) max 4693355 (EQ973772) median 1094
%2.96 masked total, %3.08 masked real

This example indicates a very low masking result from RepeatMasker. The Window Masker result masks the sequence much better and will be used for the eventual final masking.

See also

Please note specific details for running each of the masking operations: