Window Masker
Construct counts file
cd /data/genomes/ricCom1/bed/windowMasker set fa = ricCom1.fa set tmpDir = `mktemp -d -p /scratch/tmp doWindowMasker.XXXXXX` chmod 775 $tmpDir set inputTwoBit = /data/genomes/ricCom1/ricCom1.unmasked.2bit pushd $tmpDir twoBitToFa $inputTwoBit $fa windowMasker -mk_counts true -input $fa -output windowmasker.counts popd cp $tmpDir/windowmasker.counts . rm -rf $tmpDir
Window Masker with sdust option
mkdir /data/genomes/ricCom1/bed/windowMasker cd /data/genomes/ricCom1/bed/windowMasker set fa = ricCom1.fa set tmpDir = `mktemp -d -p /scratch/tmp doWindowMasker.XXXXXX` chmod 775 $tmpDir set inputTwoBit = /data/genomes/ricCom1/ricCom1.unmasked.2bit cp windowmasker.counts $tmpDir pushd $tmpDir twoBitToFa $inputTwoBit $fa windowMasker -ustat windowmasker.counts -sdust true -input $fa -output windowmasker.intervals perl -wpe 'if (s/^>lcl\|(.*)\n$//) { $chr = $1; } \ if (/^(\d+) - (\d+)/) { \ $s=$1; $e=$2+1; s/(\d+) - (\d+)/$chr\t$s\t$e/; \ }' windowmasker.intervals > windowmasker.sdust.bed popd cp $tmpDir/windowmasker.sdust.bed . rm -rf $tmpDir
Construct masked 2bit file
cd /data/genomes/ricCom1/bed/windowMasker twoBitMask /data/genomes/ricCom1/ricCom1.unmasked.2bit windowmasker.sdust.bed ricCom1.wmsk.sdust.2bit
Measure masking result:
twoBitToFa ricCom1.wmsk.sdust.2bit stdout | faSize stdin > faSize.ricCom1.wmsk.sdust.txt 2>&1
Results:
350621860 bases (13662715 N's 336959145 real 185528058 upper 151431087 lower) in 25763 sequences in 1 files Total size: mean 13609.5 sd 106411.1 min 202 (EQ999533) max 4693355 (EQ973772) median 1094 %43.19 masked total, %44.94 masked real
Clean mask result
Window Masker masks the N's in the sequence. I would prefer to have a clean mask result without these gaps masked. The construction of the clean Window Masker result is done with featureBits to intersect with 'not gaps' so that only the sequence is masked. To construct the gap and not gap bed files for this intersection:
findMotif -motif=gattaca -verbose=4 -strand=+ ../../ricCom1.unmasked.2bit > findMotif.txt 2>&1 grep "^#GAP " findMotif.txt | sed -e "s/^#GAP //" | sort -k1,1 -k2,2n > allGaps.bed awk '{printf "%s\t%d\tricCom1.2bit\n", $1, $2}' ../../chrom.sizes > chromInfo.tab featureBits test -chromSize=chromInfo.tab -not -bed=notGap.bed allGaps.bed
Using featureBits requires the $HOME/.hg.conf file setup to use the public (or local) MySQL server even though there is no database involved in this operation. It is a drawback of featureBits that it connects to the MySQL server even though it isn't needed.
Using the notGap.bed file, intersect with the Window Masker result:
featureBits test -chromSize=chromInfo.tab windowmasker.sdust.bed notGap.bed -bed=stdout 2> /dev/null \ | sort -k1,1 -k2,2n | gzip -c > cleanWMask.bed.gz
Construct bigBed file to display in a track/assembly hub genome browser. Do not need the names in the bed file, just the position:
zcat cleanWMask.bed.gz | cut -f1-3 > cleanWMask.bed3.bed bedToBigBed -tab cleanWMask.bed3.bed ../../chrom.sizes ricCom1.windowMasker.bb rm -f cleanWMask.bed3.bed
trackDb.txt
The track hub trackDb.txt entry is:
track windowMasker shortLabel WM + SDust longLabel Genomic Intervals Masked by WindowMasker + SDust group varRep priority 149.26 visibility dense type bigBed 3 . bigDataUrl bbi/ricCom1.windowMasker.bb html ../trackDescriptions/windowMasker
See also
Please note specific details for running each of the masking operations: