TRF Simple Repeats

From genomewiki
Jump to navigationJump to search
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

Partition

The TRF program can be obtained from [TRF/Gary Benson]

The TRF/SimpleRepeats operations is similar to the Repeat Masker procedure. Partition the sequence into bits, run trf on each bit, collect the results:

mkdir -p /data/genomes/ricCom1/bed/simpleRepeat/run.cluster
cd /data/genomes/ricCom1/bed/simpleRepeat/run.cluster

rm -rf /data/genomes/ricCom1/TrfPart
simplePartition.pl /data/genomes/ricCom1/ricCom1.unmasked.2bit 50000000 /data/genomes/ricCom1/TrfPart
rm -f /data/genomes/ricCom1/bed/simpleRepeat/TrfPart
ln -s /data/genomes/ricCom1/TrfPart /data/genomes/ricCom1/bed/simpleRepeat/TrfPart

Cluster run

gensub2 template file:

#LOOP
./TrfRun.csh /data/genomes/ricCom1/TrfPart/$(path1).bed
#ENDLOOP

gensub2 constructs the jobList:

gensub2 /data/genomes/ricCom1/TrfPart/partitions.lst single template jobList

Typical jobList commands:

./TrfRun.csh /data/genomes/ricCom1/TrfPart/000/000.lst.bed
./TrfRun.csh /data/genomes/ricCom1/TrfPart/001/001.lst.bed
./TrfRun.csh /data/genomes/ricCom1/TrfPart/002/002.lst.bed
... etc.

Typical parasol operation:

para make jobList
para check
para time > run.time
cat run.time
Completed: 8 of 8 jobs
CPU time in finished jobs:       4059s      67.66m     1.13h    0.05d  0.000 y
IO & Wait Time:                   407s       6.78m     0.11h    0.00d  0.000 y
Average job time:                 558s       9.30m     0.16h    0.01d
Longest finished job:            1391s      23.18m     0.39h    0.02d
Submission to last job:          1395s      23.25m     0.39h    0.02d

Each partition bit is processed by the script TrfRun.csh:

#!/bin/csh -ef

set finalOut = $1

set inLst = $finalOut:r
set inLft = $inLst:r.lft

# Use local disk for output, and move the final result to $finalOut
# when done, to minimize I/O.
set tmpDir = `mktemp -d -p /scratch/tmp doSimpleRepeat.cluster.XXXXXX`
pushd $tmpDir

foreach spec (`cat $inLst`)
  # Remove path and .2bit filename to get just the seq:start-end spec:
  set base = `echo $spec | sed -r -e 's/^[^:]+://'`

  # If $spec is the whole sequence, twoBitToFa removes the :start-end part,
  # which causes liftUp to barf later.  So tweak the header back to
  # seq:start-end for liftUp's sake:
  twoBitToFa $spec stdout \
  | sed -e "s/^>.*/>$base/" \
  | trfBig -trf=/cluster/bin/trf stdin /dev/null -bedAt=$base.bed -tempDir=/scratch/tmp
end

# Due to the large chunk size, .lft files can have thousands of lines.
# Even though the liftUp code does &lineFileClose, somehow we still
# run out of filehandles.  So limit the size of liftSpecs:
split -a 3 -d -l 500 $inLft SplitLft.

# Lift up:
foreach splitLft (SplitLft.*)
  set bedFiles = `awk '{print $2 ".bed"};' $splitLft`
  endsInLf -zeroOk $bedFiles
  cat $bedFiles \
  | liftUp -type=.bed tmpOut.$splitLft $splitLft error stdin
end
cat tmpOut.* > tmpOut__bed

# endsInLf is much faster than using para to {check out line}:
endsInLf -zeroOk tmpOut*

# Move final result into place:
mv tmpOut__bed $finalOut

popd
rm -rf $tmpDir

post-process Filter

UCSC uses the repeats of period less than or equal to 12 for the masking of the genome sequence:

cd /data/genomes/ricCom1/bed/simpleRepeat
cat /data/genomes/ricCom1/bed/simpleRepeat/TrfPart/???/*.bed > simpleRepeat.bed
endsInLf simpleRepeat.bed
if ($status) then
  echo Uh-oh -- simpleRepeat.bed fails endsInLf.  Look at /hive/data/genomes/ricCom1/bed/simpleRepeat/TrfPart/ bed files.
  exit 1
endif
awk '{if ($5 <= 12) print;}' simpleRepeat.bed > trfMask.bed

Final 2bit masking

WIth the trfMask.bed file and the decision of whether to use the RepeatMasker or Window Masker result:

twoBitMask -type=.bed ricCom1.unmasked.2bit bed/windowMasker/cleanWMask.bed.gz \
         ricCom1.wmsk.sdust.2bit
twoBitMask ricCom1.wmsk.sdust.2bit -add bed/simpleRepeat/trfMask.bed ricCom1.2bit
#   you can safely ignore the warning about fields >= 13
twoBitToFa ricCom1.2bit stdout | faSize stdin > faSize.ricCom1.2bit.txt

Resulting masking:

350621860 bases (13662715 N's 336959145 real 185500504 upper 151458641 lower) in 25763 sequences in 1 files
Total size: mean 13609.5 sd 106411.1 min 202 (EQ999533) max 4693355 (EQ973772) median 1094
%43.20 masked total, %44.95 masked real

Construct bigBed track files

Rearranging the simpleRepeat.bed file a bit with this perl script bedToBed4+.pl.

It truncates the repeat string to 16 characters to use as the name of the item in the track:

#!/usr/bin/env perl

use strict;
use warnings;

open (FH, "<simpleRepeat.bed") or die "can not read simpleRepeat.bed";
while (my $line = <FH>) {
    chomp $line;
    my @a = split('\s+', $line);
    my $name = substr($a[15],0,16);
    printf "%s\t%d\t%d\t%s\t%d\t%s\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%s\t%s\n",
      $a[0], $a[1], $a[2], $name, $a[4], $a[5], $a[6], $a[7], $a[8], $a[9], $a[10], $a[11], $a[12], $a[13], $a[14], $a[15];
}
close (FH);

Sort the output of that script into the result file:

./bedToBed4+.pl  | sort -k1,1 -k2,2n > ricCom1.bed16.bed
bedToBigBed -tab -type=bed4+12 -as=simpleRepeat.as ricCom1.bed16.bed \
   ../../chrom.sizes ricCom1.simpleRepeat.bb

Using the source tree simpleRepeat.as definition of the fields in the track.

trackDb.txt

The track hub trackDb.txt entry:

track simpleRepeat_
shortLabel Simple Repeats
longLabel Simple Tandem Repeats by TRF
group varRep
priority 149.3
visibility dense
type bigBed 4 +
bigDataUrl bbi/ricCom1.simpleRepeat.bb
html ../trackDescriptions/simpleRepeat

See also

Please note specific details for running each of the masking operations: