Window Masker

From genomewiki
Revision as of 19:41, 23 April 2013 by Hiram (talk | contribs) (add see also to other masking)
Jump to navigationJump to search

Construct counts file

cd /data/genomes/ricCom1/bed/windowMasker
set fa = ricCom1.fa
set tmpDir = `mktemp -d -p /scratch/tmp doWindowMasker.XXXXXX`
chmod 775 $tmpDir
set inputTwoBit = /data/genomes/ricCom1/ricCom1.unmasked.2bit
pushd $tmpDir
twoBitToFa $inputTwoBit $fa
windowMasker -mk_counts true -input $fa -output windowmasker.counts
popd 
cp $tmpDir/windowmasker.counts .
rm -rf $tmpDir

Window Masker with sdust option

mkdir /data/genomes/ricCom1/bed/windowMasker
cd /data/genomes/ricCom1/bed/windowMasker
set fa = ricCom1.fa
set tmpDir = `mktemp -d -p /scratch/tmp doWindowMasker.XXXXXX`
chmod 775 $tmpDir
set inputTwoBit = /data/genomes/ricCom1/ricCom1.unmasked.2bit
cp windowmasker.counts $tmpDir
pushd $tmpDir
twoBitToFa $inputTwoBit $fa
windowMasker -ustat windowmasker.counts -sdust true -input $fa -output windowmasker.intervals
perl -wpe 'if (s/^>lcl\|(.*)\n$//) { $chr = $1; } \
   if (/^(\d+) - (\d+)/) { \
   $s=$1; $e=$2+1; s/(\d+) - (\d+)/$chr\t$s\t$e/; \
   }' windowmasker.intervals > windowmasker.sdust.bed
popd 
cp $tmpDir/windowmasker.sdust.bed .
rm -rf $tmpDir

Construct masked 2bit file

cd /data/genomes/ricCom1/bed/windowMasker
twoBitMask /data/genomes/ricCom1/ricCom1.unmasked.2bit windowmasker.sdust.bed ricCom1.wmsk.sdust.2bit

Measure masking result:

twoBitToFa ricCom1.wmsk.sdust.2bit stdout | faSize stdin > faSize.ricCom1.wmsk.sdust.txt 2>&1

Results:

350621860 bases (13662715 N's 336959145 real 185528058 upper 151431087 lower) in 25763 sequences in 1 files
Total size: mean 13609.5 sd 106411.1 min 202 (EQ999533) max 4693355 (EQ973772) median 1094
%43.19 masked total, %44.94 masked real

Clean mask result

Window Masker masks the N's in the sequence. I would prefer to have a clean mask result without these gaps masked. The construction of the clean Window Masker result is done with featureBits to intersect with 'not gaps' so that only the sequence is masked. To construct the gap and not gap bed files for this intersection:

findMotif -motif=gattaca -verbose=4 -strand=+ ../../ricCom1.unmasked.2bit > findMotif.txt 2>&1
grep "^#GAP " findMotif.txt | sed -e "s/^#GAP //" | sort -k1,1 -k2,2n > allGaps.bed
awk '{printf "%s\t%d\tricCom1.2bit\n", $1, $2}' ../../chrom.sizes > chromInfo.tab
featureBits test -chromSize=chromInfo.tab -not -bed=notGap.bed allGaps.bed

Using featureBits requires the $HOME/.hg.conf file setup to use the public (or local) MySQL server even though there is no database involved in this operation. It is a drawback of featureBits that it connects to the MySQL server even though it isn't needed.

Using the notGap.bed file, intersect with the Window Masker result:

featureBits test -chromSize=chromInfo.tab windowmasker.sdust.bed notGap.bed -bed=stdout 2> /dev/null \
 | sort -k1,1 -k2,2n | gzip -c > cleanWMask.bed.gz

Construct bigBed file to display in a track/assembly hub genome browser. Do not need the names in the bed file, just the position:

zcat cleanWMask.bed.gz | cut -f1-3 > cleanWMask.bed3.bed
bedToBigBed -tab cleanWMask.bed3.bed ../../chrom.sizes ricCom1.windowMasker.bb
rm -f cleanWMask.bed3.bed

trackDb.txt

The track hub trackDb.txt entry is:

track windowMasker
shortLabel WM + SDust
longLabel Genomic Intervals Masked by WindowMasker + SDust
group varRep
priority 149.26
visibility dense
type bigBed 3 .
bigDataUrl bbi/ricCom1.windowMasker.bb
html ../trackDescriptions/windowMasker

See also

Please note specific details for running each of the masking operations: