RepeatMasker

From genomewiki
Revision as of 17:34, 23 April 2013 by Hiram (talk | contribs) (adding masking sequence)
Jump to navigationJump to search

Example procedure to operate RepeatMasker on a typical genome assembly sequence.

Partition

The source tree script simplePartition.pl can be used to partition the genome. The output of the script is merely a listing of twoBit file references representing the partitions. A specific bit of sequence is obtained from the twoBit file at the time it is needed for the RepeatMasker operation. UCSC uses partitions of 500,000 bases when using the cross_match alignment engine with RepeatMasker. That size of partition often results in a single job run-time of about 1.5 hours. When using the rmblast alignment engine with RepeatMasker, the typical single job run-time is on the order of 10 minutes. The resulting masking is very similar. This example constructs partition lists in the specified RMPart directory hierarchy:

mkdir -p /data/genomes/ricCom1/bed/repeatMasker/
cd /data/genomes/ricCom1/bed/repeatMasker
simplePartition.pl ../../ricCom1.unmasked.2bit 500000 /data/genomes/ricCom1/RMPart

Cluster run

The template file used here:

#LOOP
./RMRun.csh /data/genomes/ricCom1/RMPart/$(path1).out
#ENDLOOP

with the gensub2 command constructs the command list jobList to run on the cluster.

mkdir -p /data/genomes/ricCom1/bed/repeatMasker/run.cluster
cd /data/genomes/ricCom1/bed/repeatMasker/run.cluster
ln -s /data/genomes/ricCom1/RMPart ./RMPart
gensub2/data/genomes/ricCom1/RMPart/partitions.lst single template jobList

This example constructs 880 commands to run in the jobList file, typically:

./RMRun.csh /data/genomes/ricCom1/RMPart/000/000.lst.out
./RMRun.csh /data/genomes/ricCom1/RMPart/001/001.lst.out
./RMRun.csh /data/genomes/ricCom1/RMPart/002/002.lst.out
... etc.

The RMRun.csh script performs the RepeatMasker operation on each partition:

#!/bin/csh -ef

set finalOut = $1

set inLst = $finalOut:r
set inLft = $inLst:r.lft
set alignOut = $finalOut:r.align
set catOut = $finalOut:r.cat

# Use local disk for output, and move the final result to $outPsl
# when done, to minimize I/O.
set tmpDir = `mktemp -d -p /scratch/tmp doRepeatMasker.cluster.XXXXXX`
pushd $tmpDir

# Initialize local library
RepeatMasker -engine rmblast -pa 1 -species 'Ricinus communis' /dev/null

foreach spec (`cat $inLst`)
  # Remove path and .2bit filename to get just the seq:start-end spec:
  set base = `echo $spec | sed -r -e 's/^[^:]+://'`
  # If $spec is the whole sequence, twoBitToFa removes the :start-end part,
  # which causes liftUp to barf later.  So tweak the header back to
  # seq:start-end for liftUp's sake:
  twoBitToFa $spec stdout \
  | sed -e "s/^>.*/>$base/" > $base.fa
  RepeatMasker -engine rmblast -pa 1 -align -species 'Ricinus communis' $base.fa
  if (-e $base.fa.cat) then
    mv $base.fa.cat $catOut
  endif
end

# Lift up (leave the RepeatMasker header in place because we'll liftUp
# again later):
liftUp -type=.out stdout $inLft error *.fa.out > tmpOut__out

set nonomatch
set alignFiles = ( *.align )
if ( ${#alignFiles} && -e $alignFiles[1] ) then
  liftUp -type=.align stdout $inLft error *.align \
    > tmpOut__align
else
  touch tmpOut__align
endif

# Move final result into place:
mv tmpOut__out $finalOut
mv tmpOut__align $alignOut

popd
rm -rf $tmpDir

If you have a parasol cluster, this operation is performed.

para -ram=2g make jobList
para check
para time > run.time
cat run.time
Completed: 880 of 880 jobs
CPU time in finished jobs:     202665s    3377.75m    56.30h    2.35d  0.006 y
IO & Wait Time:                119279s    1987.98m    33.13h    1.38d  0.004 y
Average job time:                 366s       6.10m     0.10h    0.00d
Longest finished job:            6789s     113.15m     1.89h    0.08d
Submission to last job:          6938s     115.63m     1.93h    0.08d

Collect cluster run results

Gather the partition results into a single file:

cd /data/genomes/ricCom1/bed/repeatMasker
liftUp ricCom1.fa.out /dev/null carry /data/genomes/ricCom1/bed/repeatMasker/RMPart/???/*.out
liftUp ricCom1.fa.align /dev/null carry /data/genomes/ricCom1/bed/repeatMasker/RMPart/???/*.align
head -3 ricCom1.fa.out > ricCom1.sorted.fa.out
tail -n +4 ricCom1.fa.out | sort -k5,5 -k6,6n >> ricCom1.sorted.fa.out

In addition, using the UCSC script: extractNestedRepeats.pl for the nestedRepeats track:

# Use the ID column to join up fragments of interrupted repeats for the
# Nested Repeats track.
extractNestedRepeats.pl ricCom1.fa.out | sort -k1,1 -k2,2n > ricCom1.nestedRepeats.bed

Masking sequence

The unmasked.2bit file is masked with the RepeatMasker results:

cd /data/genomes/ricCom1/bed/repeatMasker
twoBitMask /data/genomes/ricCom1/ricCom1.unmasked.2bit ricCom1.sorted.fa.out ricCom1.rmsk.2bit

Measure the masking:

twoBitToFa ricCom1.rmsk.2bit stdout | faSize stdin > faSize.rmsk.txt

Results in the faSize.rmsk.txt file:

350621860 bases (13662715 N's 336959145 real 326566002 upper 10393143 lower) in 25763 sequences in 1 files
Total size: mean 13609.5 sd 106411.1 min 202 (EQ999533) max 4693355 (EQ973772) median 1094
%2.96 masked total, %3.08 masked real

This example indicates a very low masking result from RepeatMasker. The WindowMasker result masks the sequence much better and will be used for the eventual final masking.