RepeatMasker
Example procedure to operate RepeatMasker on a typical genome assembly sequence.
Partition
The source tree script simplePartition.pl can be used to partition the genome. The output of the script is merely a listing of twoBit file references representing the partitions. A specific bit of sequence is obtained from the twoBit file at the time it is needed for the RepeatMasker operation. UCSC uses partitions of 500,000 bases when using the cross_match alignment engine with RepeatMasker. That size of partition often results in a single job run-time of about 1.5 hours. When using the rmblast alignment engine with RepeatMasker, the typical single job run-time is on the order of 10 minutes. The resulting masking is very similar. This example constructs partition lists in the specified RMPart directory hierarchy:
mkdir -p /data/genomes/ricCom1/bed/repeatMasker/ cd /data/genomes/ricCom1/bed/repeatMasker simplePartition.pl ../../ricCom1.unmasked.2bit 500000 /data/genomes/ricCom1/RMPart
Cluster run
The template file used here:
#LOOP ./RMRun.csh /data/genomes/ricCom1/RMPart/$(path1).out #ENDLOOP
with the gensub2 command constructs the command list jobList to run on the cluster.
mkdir -p /data/genomes/ricCom1/bed/repeatMasker/run.cluster cd /data/genomes/ricCom1/bed/repeatMasker/run.cluster ln -s /data/genomes/ricCom1/RMPart ./RMPart gensub2/data/genomes/ricCom1/RMPart/partitions.lst single template jobList
This example constructs 880 commands to run in the jobList file, typically:
./RMRun.csh /data/genomes/ricCom1/RMPart/000/000.lst.out ./RMRun.csh /data/genomes/ricCom1/RMPart/001/001.lst.out ./RMRun.csh /data/genomes/ricCom1/RMPart/002/002.lst.out ... etc.
The RMRun.csh script performs the RepeatMasker operation on each partition:
#!/bin/csh -ef set finalOut = $1 set inLst = $finalOut:r set inLft = $inLst:r.lft set alignOut = $finalOut:r.align set catOut = $finalOut:r.cat # Use local disk for output, and move the final result to $outPsl # when done, to minimize I/O. set tmpDir = `mktemp -d -p /scratch/tmp doRepeatMasker.cluster.XXXXXX` pushd $tmpDir # Initialize local library RepeatMasker -engine rmblast -pa 1 -species 'Ricinus communis' /dev/null foreach spec (`cat $inLst`) # Remove path and .2bit filename to get just the seq:start-end spec: set base = `echo $spec | sed -r -e 's/^[^:]+://'` # If $spec is the whole sequence, twoBitToFa removes the :start-end part, # which causes liftUp to barf later. So tweak the header back to # seq:start-end for liftUp's sake: twoBitToFa $spec stdout \ | sed -e "s/^>.*/>$base/" > $base.fa RepeatMasker -engine rmblast -pa 1 -align -species 'Ricinus communis' $base.fa if (-e $base.fa.cat) then mv $base.fa.cat $catOut endif end # Lift up (leave the RepeatMasker header in place because we'll liftUp # again later): liftUp -type=.out stdout $inLft error *.fa.out > tmpOut__out set nonomatch set alignFiles = ( *.align ) if ( ${#alignFiles} && -e $alignFiles[1] ) then liftUp -type=.align stdout $inLft error *.align \ > tmpOut__align else touch tmpOut__align endif # Move final result into place: mv tmpOut__out $finalOut mv tmpOut__align $alignOut popd rm -rf $tmpDir
If you have a parasol cluster, this operation is performed.
para -ram=2g make jobList para check para time > run.time cat run.time Completed: 880 of 880 jobs CPU time in finished jobs: 202665s 3377.75m 56.30h 2.35d 0.006 y IO & Wait Time: 119279s 1987.98m 33.13h 1.38d 0.004 y Average job time: 366s 6.10m 0.10h 0.00d Longest finished job: 6789s 113.15m 1.89h 0.08d Submission to last job: 6938s 115.63m 1.93h 0.08d
Collect cluster run results
Gather the partition results into a single file:
cd /data/genomes/ricCom1/bed/repeatMasker liftUp ricCom1.fa.out /dev/null carry /data/genomes/ricCom1/bed/repeatMasker/RMPart/???/*.out liftUp ricCom1.fa.align /dev/null carry /data/genomes/ricCom1/bed/repeatMasker/RMPart/???/*.align head -3 ricCom1.fa.out > ricCom1.sorted.fa.out tail -n +4 ricCom1.fa.out | sort -k5,5 -k6,6n >> ricCom1.sorted.fa.out
In addition, using the UCSC script: extractNestedRepeats.pl for the nestedRepeats track:
# Use the ID column to join up fragments of interrupted repeats for the # Nested Repeats track. extractNestedRepeats.pl ricCom1.fa.out | sort -k1,1 -k2,2n > ricCom1.nestedRepeats.bed
Masking sequence
The unmasked.2bit file is masked with the RepeatMasker results:
cd /data/genomes/ricCom1/bed/repeatMasker twoBitMask /data/genomes/ricCom1/ricCom1.unmasked.2bit ricCom1.sorted.fa.out ricCom1.rmsk.2bit
Measure the masking:
twoBitToFa ricCom1.rmsk.2bit stdout | faSize stdin > faSize.rmsk.txt
Results in the faSize.rmsk.txt file:
350621860 bases (13662715 N's 336959145 real 326566002 upper 10393143 lower) in 25763 sequences in 1 files Total size: mean 13609.5 sd 106411.1 min 202 (EQ999533) max 4693355 (EQ973772) median 1094 %2.96 masked total, %3.08 masked real
This example indicates a very low masking result from RepeatMasker. The Window Masker result masks the sequence much better and will be used for the eventual final masking.
Convert RM .out to bigBed
The standard RepeatMasker track in the genome browser is a special unique track type that displays several different aspects of the RepeatMasker data in side by side displays. Such a track type is not available in a track hub, but an approximation of the data display can be displayed with several different bigBed tracks in the track hub.
First, convert the RepeatMasker .out file format to the UCSC track table format:
hgLoadOut -tabFile=ricCom1.rmsk.tab -nosplit test ricCom1.sorted.fa.out
Split the resulting table rows by their RepeatMasker class attribute:
mkdir rmskClass sort -k12,12 ricCom1.rmsk.tab | splitFileByColumn -ending=tab -col=12 -tab stdin rmskClass
This produces a number of files in the directory ./rmskClass/ - for example
ls -ogrt rmskClass total 12096 -rw-rw-r-- 1 694267 Apr 18 10:31 DNA.tab -rw-rw-r-- 1 8842 Apr 18 10:31 DNA?.tab -rw-rw-r-- 1 129866 Apr 18 10:31 LINE.tab -rw-rw-r-- 1 6208 Apr 18 10:31 LINE?.tab -rw-rw-r-- 1 690106 Apr 18 10:31 LTR.tab -rw-rw-r-- 1 857390 Apr 18 10:31 Low_complexity.tab -rw-rw-r-- 1 14911 Apr 18 10:31 Other.tab ... etc.
Convert those files in the same groupings as the standard UCSC RepeatMasker class displays to bigBed file types:
for T in SINE LINE LTR DNA Simple Low_complexity Satellite do ./toBed6+10.pl rmskClass/${T}*.tab | sort -k1,1 -k2,2n > ${DB}.rmsk.${T}.bed done ./toBed6+10.pl rmskClass/rRNA.tab | sort -k1,1 -k2,2n > ${DB}.rmsk.RNA.bed ./toBed6+10.pl rmskClass/Other.tab rmskClass/RC*.tab | sort -k1,1 -k2,2n > ricCom1.rmsk.Other.bed for T in DNA LINE LTR Low_complexity Other RNA SINE Satellite Simple do bedToBigBed -tab -type=bed6+10 -as=rmsk16.as ricCom1.rmsk.${T}.bed \ ../../chrom.sizes ricCom1.rmsk.${T}.bb done
Where the script toBed6+10.pl is:
#!/usr/bin/env perl use strict; use warnings; my $sixteenSamples = 0; my $fifteenSamples = 0; my $argc = scalar(@ARGV); if ($argc < 1) { printf STDERR "usage: toBed6+10.pl rmskClass/file.tab > ricCom1.class.tab\n"; printf STDERR "e.g.: ./toBed6+10.pl rmskClass/rRNA.tab > ricCom1.rmsk.RNA.tab\n"; exit 255; } while (my $file = shift) { open (FH, "<$file") or die "can not read $file"; while (my $line = <FH>) { next if ($line =~ m/^\s+SW\s+perc.*/); next if ($line =~ m/^score\s+div.*/); next if ($line =~ m/^$/); chomp $line; my @a = split('\t', $line); shift @a; die "ERROR: not finding 16 fields in a line" if (scalar(@a) != 16); printf "%s\t%d\t%d\t%s\t0\t%s\t%d\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n", $a[4], $a[5], $a[6], $a[9], $a[8], $a[0], $a[1], $a[2], $a[3], $a[7], $a[10], $a[11], $a[12], $a[13], $a[14]; } close (FH); }
And the rmsk16.as bigBed structure definition is:
table repeatMasker "RepeatMasker .out record" ( string chrom; "Genomic sequence name" uint chromStart; "Start in genomic sequence" uint chromEnd; "End in genomic sequence" string name; "Name of repeat" uint score; "always 0 place holder" char[1] strand; "Relative orientation + or -" uint swScore; "Smith Waterman alignment score" uint milliDiv; "Base mismatches in parts per thousand" uint milliDel; "Bases deleted in parts per thousand" uint milliIns; "Bases inserted in parts per thousand" int genoLeft; "-#bases after match in genomic sequence" string repClass; "Class of repeat" string repFamily; "Family of repeat" int repStart; "Start (if strand is +) or -#bases after match (if strand is -) in repeat sequence" uint repEnd; "End in repeat sequence" int repLeft; "-#bases after match (if strand is +) or start (if strand is -) in repeat sequence" )
Composite RMSK trackDb.txt
Please note the composite track arrangement to display the resulting class files in a fashion resembling the standard UCSC RepeatMasker track display:
http://genome-test.soe.ucsc.edu/~hiram/hubs/Plants/araTha1/trackDb.txt
In the section starting:
track repeatMasker_ compositeTrack on shortLabel RepeatMasker longLabel Repeating Elements by RepeatMasker group varRep priority 149.1 visibility dense type bed 3 . noInherit on html ../trackDescriptions/repeatMasker
See also
Please note specific details for running each of the masking operations: