RepeatMasker: Difference between revisions

From genomewiki
Jump to navigationJump to search
(add category tags)
(changed cse to soe)
 
(3 intermediate revisions by one other user not shown)
Line 3: Line 3:
==Partition==
==Partition==


The source tree script [http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob;f=src/hg/utils/automation/simplePartition.pl simplePartition.pl] can
The source tree script [http://genome-source.soe.ucsc.edu/gitlist/kent.git/blob/master/src/hg/utils/automation/simplePartition.pl simplePartition.pl] can
be used to partition the genome.  The output of the script is merely a listing of twoBit file references representing the partitions.  A specific bit of
be used to partition the genome.  The output of the script is merely a listing of twoBit file references representing the partitions.  A specific bit of
sequence is obtained from the twoBit file at the time it is needed for the RepeatMasker operation.  UCSC uses partitions of 500,000 bases when using
sequence is obtained from the twoBit file at the time it is needed for the RepeatMasker operation.  UCSC uses partitions of 500,000 bases when using
Line 109: Line 109:
  tail -n +4 ricCom1.fa.out | sort -k5,5 -k6,6n >> ricCom1.sorted.fa.out
  tail -n +4 ricCom1.fa.out | sort -k5,5 -k6,6n >> ricCom1.sorted.fa.out
   
   
In addition, using the UCSC script: [http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob;f=src/hg/utils/automation/extractNestedRepeats.pl extractNestedRepeats.pl]
In addition, using the UCSC script: [http://genome-source.soe.ucsc.edu/gitlist/kent.git/blob/master/src/hg/utils/automation/extractNestedRepeats.pl extractNestedRepeats.pl]
for the nestedRepeats track:
for the nestedRepeats track:


Line 134: Line 134:
This example indicates a very low masking result from RepeatMasker.  The [[Window Masker]] result masks the sequence
This example indicates a very low masking result from RepeatMasker.  The [[Window Masker]] result masks the sequence
much better and will be used for the eventual final masking.
much better and will be used for the eventual final masking.
==Convert RM .out to bigBed==
The standard RepeatMasker track in the genome browser is a special unique track type that displays several
different aspects of the RepeatMasker data in side by side displays.  Such a track type is not available in
a track hub, but an approximation of the data display can be displayed with several different bigBed
tracks in the track hub.
First, convert the RepeatMasker .out file format to the UCSC track table format:
  hgLoadOut -tabFile=ricCom1.rmsk.tab -nosplit test ricCom1.sorted.fa.out
Split the resulting table rows by their RepeatMasker '''class''' attribute:
mkdir rmskClass
sort -k12,12 ricCom1.rmsk.tab | splitFileByColumn -ending=tab -col=12 -tab stdin rmskClass
This produces a number of files in the directory ./rmskClass/ - for example
ls -ogrt rmskClass
total 12096
-rw-rw-r-- 1  694267 Apr 18 10:31 DNA.tab
-rw-rw-r-- 1    8842 Apr 18 10:31 DNA?.tab
-rw-rw-r-- 1  129866 Apr 18 10:31 LINE.tab
-rw-rw-r-- 1    6208 Apr 18 10:31 LINE?.tab
-rw-rw-r-- 1  690106 Apr 18 10:31 LTR.tab
-rw-rw-r-- 1  857390 Apr 18 10:31 Low_complexity.tab
-rw-rw-r-- 1  14911 Apr 18 10:31 Other.tab
... etc.
Convert those files in the same groupings as the standard UCSC RepeatMasker class displays to bigBed file types:
for T in SINE LINE LTR DNA Simple Low_complexity Satellite
do
  ./toBed6+10.pl rmskClass/${T}*.tab | sort -k1,1 -k2,2n > ${DB}.rmsk.${T}.bed
done
./toBed6+10.pl rmskClass/rRNA.tab | sort -k1,1 -k2,2n > ${DB}.rmsk.RNA.bed
./toBed6+10.pl rmskClass/Other.tab rmskClass/RC*.tab | sort -k1,1 -k2,2n > ricCom1.rmsk.Other.bed
for T in DNA LINE LTR Low_complexity Other RNA SINE Satellite Simple
do
  bedToBigBed -tab -type=bed6+10 -as=rmsk16.as ricCom1.rmsk.${T}.bed \
    ../../chrom.sizes ricCom1.rmsk.${T}.bb
done
Where the script '''toBed6+10.pl''' is:
#!/usr/bin/env perl
use strict;
use warnings;
my $sixteenSamples = 0;
my $fifteenSamples = 0;
my $argc = scalar(@ARGV);
if ($argc < 1) {
  printf STDERR "usage: toBed6+10.pl rmskClass/file.tab > ricCom1.class.tab\n";
  printf STDERR "e.g.: ./toBed6+10.pl rmskClass/rRNA.tab > ricCom1.rmsk.RNA.tab\n";
  exit 255;
}
while (my $file = shift) {
open (FH, "<$file") or die "can not read $file";
while (my $line = <FH>) {
  next if ($line =~ m/^\s+SW\s+perc.*/);
  next if ($line =~ m/^score\s+div.*/);
  next if ($line =~ m/^$/);
  chomp $line;
  my @a = split('\t', $line);
  shift @a;
  die "ERROR: not finding 16 fields in a line" if (scalar(@a) != 16);
  printf "%s\t%d\t%d\t%s\t0\t%s\t%d\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n",
    $a[4], $a[5], $a[6], $a[9], $a[8], $a[0],
    $a[1], $a[2], $a[3], $a[7], $a[10], $a[11], $a[12], $a[13], $a[14];
}
close (FH);
}
And the '''rmsk16.as''' bigBed structure definition is:
table repeatMasker
"RepeatMasker .out record"
    (
    string chrom;    "Genomic sequence name"
    uint chromStart;    "Start in genomic sequence"
    uint chromEnd;      "End in genomic sequence"
    string name;    "Name of repeat"
    uint score;      "always 0 place holder"
    char[1] strand;    "Relative orientation + or -"
    uint swScore;      "Smith Waterman alignment score"
    uint milliDiv;      "Base mismatches in parts per thousand"
    uint milliDel;      "Bases deleted in parts per thousand"
    uint milliIns;      "Bases inserted in parts per thousand"
    int  genoLeft;      "-#bases after match in genomic sequence"
    string repClass;    "Class of repeat"
    string repFamily;  "Family of repeat"
    int repStart;      "Start (if strand is +) or -#bases after match (if strand is -) in repeat sequence"
    uint repEnd;        "End in repeat sequence"
    int repLeft;        "-#bases after match (if strand is +) or start (if strand is -) in repeat sequence"
    )
==Composite RMSK trackDb.txt==
Please note the composite track arrangement to display the resulting class files in a fashion resembling the standard
UCSC RepeatMasker track display:
http://genome-test.soe.ucsc.edu/~hiram/hubs/Plants/araTha1/trackDb.txt
In the section starting:
track repeatMasker_
compositeTrack on
shortLabel RepeatMasker
longLabel Repeating Elements by RepeatMasker
group varRep
priority 149.1
visibility dense
type bed 3 .
noInherit on
html ../trackDescriptions/repeatMasker


==See also==
==See also==

Latest revision as of 08:09, 1 September 2018

Example procedure to operate RepeatMasker on a typical genome assembly sequence.

Partition

The source tree script simplePartition.pl can be used to partition the genome. The output of the script is merely a listing of twoBit file references representing the partitions. A specific bit of sequence is obtained from the twoBit file at the time it is needed for the RepeatMasker operation. UCSC uses partitions of 500,000 bases when using the cross_match alignment engine with RepeatMasker. That size of partition often results in a single job run-time of about 1.5 hours. When using the rmblast alignment engine with RepeatMasker, the typical single job run-time is on the order of 10 minutes. The resulting masking is very similar. This example constructs partition lists in the specified RMPart directory hierarchy:

mkdir -p /data/genomes/ricCom1/bed/repeatMasker/
cd /data/genomes/ricCom1/bed/repeatMasker
simplePartition.pl ../../ricCom1.unmasked.2bit 500000 /data/genomes/ricCom1/RMPart

Cluster run

The template file used here:

#LOOP
./RMRun.csh /data/genomes/ricCom1/RMPart/$(path1).out
#ENDLOOP

with the gensub2 command constructs the command list jobList to run on the cluster.

mkdir -p /data/genomes/ricCom1/bed/repeatMasker/run.cluster
cd /data/genomes/ricCom1/bed/repeatMasker/run.cluster
ln -s /data/genomes/ricCom1/RMPart ./RMPart
gensub2/data/genomes/ricCom1/RMPart/partitions.lst single template jobList

This example constructs 880 commands to run in the jobList file, typically:

./RMRun.csh /data/genomes/ricCom1/RMPart/000/000.lst.out
./RMRun.csh /data/genomes/ricCom1/RMPart/001/001.lst.out
./RMRun.csh /data/genomes/ricCom1/RMPart/002/002.lst.out
... etc.

The RMRun.csh script performs the RepeatMasker operation on each partition:

#!/bin/csh -ef

set finalOut = $1

set inLst = $finalOut:r
set inLft = $inLst:r.lft
set alignOut = $finalOut:r.align
set catOut = $finalOut:r.cat

# Use local disk for output, and move the final result to $outPsl
# when done, to minimize I/O.
set tmpDir = `mktemp -d -p /scratch/tmp doRepeatMasker.cluster.XXXXXX`
pushd $tmpDir

# Initialize local library
RepeatMasker -engine rmblast -pa 1 -species 'Ricinus communis' /dev/null

foreach spec (`cat $inLst`)
  # Remove path and .2bit filename to get just the seq:start-end spec:
  set base = `echo $spec | sed -r -e 's/^[^:]+://'`
  # If $spec is the whole sequence, twoBitToFa removes the :start-end part,
  # which causes liftUp to barf later.  So tweak the header back to
  # seq:start-end for liftUp's sake:
  twoBitToFa $spec stdout \
  | sed -e "s/^>.*/>$base/" > $base.fa
  RepeatMasker -engine rmblast -pa 1 -align -species 'Ricinus communis' $base.fa
  if (-e $base.fa.cat) then
    mv $base.fa.cat $catOut
  endif
end

# Lift up (leave the RepeatMasker header in place because we'll liftUp
# again later):
liftUp -type=.out stdout $inLft error *.fa.out > tmpOut__out

set nonomatch
set alignFiles = ( *.align )
if ( ${#alignFiles} && -e $alignFiles[1] ) then
  liftUp -type=.align stdout $inLft error *.align \
    > tmpOut__align
else
  touch tmpOut__align
endif

# Move final result into place:
mv tmpOut__out $finalOut
mv tmpOut__align $alignOut

popd
rm -rf $tmpDir

If you have a parasol cluster, this operation is performed.

para -ram=2g make jobList
para check
para time > run.time
cat run.time
Completed: 880 of 880 jobs
CPU time in finished jobs:     202665s    3377.75m    56.30h    2.35d  0.006 y
IO & Wait Time:                119279s    1987.98m    33.13h    1.38d  0.004 y
Average job time:                 366s       6.10m     0.10h    0.00d
Longest finished job:            6789s     113.15m     1.89h    0.08d
Submission to last job:          6938s     115.63m     1.93h    0.08d

Collect cluster run results

Gather the partition results into a single file:

cd /data/genomes/ricCom1/bed/repeatMasker
liftUp ricCom1.fa.out /dev/null carry /data/genomes/ricCom1/bed/repeatMasker/RMPart/???/*.out
liftUp ricCom1.fa.align /dev/null carry /data/genomes/ricCom1/bed/repeatMasker/RMPart/???/*.align
head -3 ricCom1.fa.out > ricCom1.sorted.fa.out
tail -n +4 ricCom1.fa.out | sort -k5,5 -k6,6n >> ricCom1.sorted.fa.out

In addition, using the UCSC script: extractNestedRepeats.pl for the nestedRepeats track:

# Use the ID column to join up fragments of interrupted repeats for the
# Nested Repeats track.
extractNestedRepeats.pl ricCom1.fa.out | sort -k1,1 -k2,2n > ricCom1.nestedRepeats.bed

Masking sequence

The unmasked.2bit file is masked with the RepeatMasker results:

cd /data/genomes/ricCom1/bed/repeatMasker
twoBitMask /data/genomes/ricCom1/ricCom1.unmasked.2bit ricCom1.sorted.fa.out ricCom1.rmsk.2bit

Measure the masking:

twoBitToFa ricCom1.rmsk.2bit stdout | faSize stdin > faSize.rmsk.txt

Results in the faSize.rmsk.txt file:

350621860 bases (13662715 N's 336959145 real 326566002 upper 10393143 lower) in 25763 sequences in 1 files
Total size: mean 13609.5 sd 106411.1 min 202 (EQ999533) max 4693355 (EQ973772) median 1094
%2.96 masked total, %3.08 masked real

This example indicates a very low masking result from RepeatMasker. The Window Masker result masks the sequence much better and will be used for the eventual final masking.

Convert RM .out to bigBed

The standard RepeatMasker track in the genome browser is a special unique track type that displays several different aspects of the RepeatMasker data in side by side displays. Such a track type is not available in a track hub, but an approximation of the data display can be displayed with several different bigBed tracks in the track hub.

First, convert the RepeatMasker .out file format to the UCSC track table format:

 hgLoadOut -tabFile=ricCom1.rmsk.tab -nosplit test ricCom1.sorted.fa.out

Split the resulting table rows by their RepeatMasker class attribute:

mkdir rmskClass
sort -k12,12 ricCom1.rmsk.tab | splitFileByColumn -ending=tab -col=12 -tab stdin rmskClass

This produces a number of files in the directory ./rmskClass/ - for example

ls -ogrt rmskClass
total 12096
-rw-rw-r-- 1  694267 Apr 18 10:31 DNA.tab
-rw-rw-r-- 1    8842 Apr 18 10:31 DNA?.tab
-rw-rw-r-- 1  129866 Apr 18 10:31 LINE.tab
-rw-rw-r-- 1    6208 Apr 18 10:31 LINE?.tab
-rw-rw-r-- 1  690106 Apr 18 10:31 LTR.tab
-rw-rw-r-- 1  857390 Apr 18 10:31 Low_complexity.tab
-rw-rw-r-- 1   14911 Apr 18 10:31 Other.tab
... etc.

Convert those files in the same groupings as the standard UCSC RepeatMasker class displays to bigBed file types:

for T in SINE LINE LTR DNA Simple Low_complexity Satellite
do
 ./toBed6+10.pl rmskClass/${T}*.tab | sort -k1,1 -k2,2n > ${DB}.rmsk.${T}.bed
done
./toBed6+10.pl rmskClass/rRNA.tab | sort -k1,1 -k2,2n > ${DB}.rmsk.RNA.bed
./toBed6+10.pl rmskClass/Other.tab rmskClass/RC*.tab | sort -k1,1 -k2,2n > ricCom1.rmsk.Other.bed
for T in DNA LINE LTR Low_complexity Other RNA SINE Satellite Simple
do
  bedToBigBed -tab -type=bed6+10 -as=rmsk16.as ricCom1.rmsk.${T}.bed \
   ../../chrom.sizes ricCom1.rmsk.${T}.bb
done

Where the script toBed6+10.pl is:

#!/usr/bin/env perl

use strict;
use warnings;

my $sixteenSamples = 0;
my $fifteenSamples = 0;

my $argc = scalar(@ARGV);
if ($argc < 1) {
  printf STDERR "usage: toBed6+10.pl rmskClass/file.tab > ricCom1.class.tab\n";
  printf STDERR "e.g.: ./toBed6+10.pl rmskClass/rRNA.tab > ricCom1.rmsk.RNA.tab\n";
  exit 255;
}

while (my $file = shift) {

open (FH, "<$file") or die "can not read $file";
while (my $line = <FH>) {
  next if ($line =~ m/^\s+SW\s+perc.*/);
  next if ($line =~ m/^score\s+div.*/);
  next if ($line =~ m/^$/);
  chomp $line;
  my @a = split('\t', $line);
  shift @a;
  die "ERROR: not finding 16 fields in a line" if (scalar(@a) != 16);
  printf "%s\t%d\t%d\t%s\t0\t%s\t%d\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n",
    $a[4], $a[5], $a[6], $a[9], $a[8], $a[0],
    $a[1], $a[2], $a[3], $a[7], $a[10], $a[11], $a[12], $a[13], $a[14];
}
close (FH);
}

And the rmsk16.as bigBed structure definition is:

table repeatMasker
"RepeatMasker .out record"
    (
    string chrom;    "Genomic sequence name"
    uint chromStart;     "Start in genomic sequence"
    uint chromEnd;       "End in genomic sequence"
    string name;     "Name of repeat"
    uint score;       "always 0 place holder"
    char[1] strand;     "Relative orientation + or -"
    uint swScore;       "Smith Waterman alignment score"
    uint milliDiv;      "Base mismatches in parts per thousand"
    uint milliDel;      "Bases deleted in parts per thousand"
    uint milliIns;      "Bases inserted in parts per thousand"
    int  genoLeft;      "-#bases after match in genomic sequence"
    string repClass;    "Class of repeat"
    string repFamily;   "Family of repeat"
    int repStart;       "Start (if strand is +) or -#bases after match (if strand is -) in repeat sequence"
    uint repEnd;        "End in repeat sequence"
    int repLeft;        "-#bases after match (if strand is +) or start (if strand is -) in repeat sequence"
    )

Composite RMSK trackDb.txt

Please note the composite track arrangement to display the resulting class files in a fashion resembling the standard UCSC RepeatMasker track display:

http://genome-test.soe.ucsc.edu/~hiram/hubs/Plants/araTha1/trackDb.txt

In the section starting:

track repeatMasker_
compositeTrack on
shortLabel RepeatMasker
longLabel Repeating Elements by RepeatMasker
group varRep
priority 149.1
visibility dense
type bed 3 .
noInherit on
html ../trackDescriptions/repeatMasker

See also

Please note specific details for running each of the masking operations: