Window Masker: Difference between revisions

From genomewiki
Jump to navigationJump to search
(initial contents)
 
(clean masking)
Line 43: Line 43:
  Total size: mean 13609.5 sd 106411.1 min 202 (EQ999533) max 4693355 (EQ973772) median 1094
  Total size: mean 13609.5 sd 106411.1 min 202 (EQ999533) max 4693355 (EQ973772) median 1094
  %43.19 masked total, %44.94 masked real
  %43.19 masked total, %44.94 masked real
==Clean mask result==
Window Masker masks the '''N's''' in the sequence.  I would prefer to have a ''clean'' mask result without these gaps masked.
The construction of the ''clean'' Window Masker result is done with ''featureBits'' to intersect with 'not gaps' so that
only the sequence is masked.  To construct the ''gap'' and ''not gap'' bed files for this intersection:
findMotif -motif=gattaca -verbose=4 -strand=+ ../../ricCom1.unmasked.2bit > findMotif.txt 2>&1
grep "^#GAP " findMotif.txt | sed -e "s/^#GAP //" | sort -k1,1 -k2,2n > allGaps.bed
awk '{printf "%s\t%d\tricCom1.2bit\n", $1, $2}' ../../chrom.sizes > chromInfo.tab
featureBits test -chromSize=chromInfo.tab -not -bed=notGap.bed allGaps.bed
Using featureBits requires the '''$HOME/.hg.conf''' [http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob_plain;f=src/product/README.mysql.setup file setup]
to use the public (or local) MySQL server even though
there is no database involved in this operation.  It is a drawback of ''featureBits'' that it connects to the MySQL server
even though it isn't needed.
Using the ''notGap.bed'' file, intersect with the Window Masker result:
featureBits test -chromSize=chromInfo.tab windowmasker.sdust.bed notGap.bed -bed=stdout 2> /dev/null \
  | sort -k1,1 -k2,2n | gzip -c > cleanWMask.bed.gz
Construct bigBed file to display in a track/assembly hub genome browser.
Do not need the names in the bed file, just the position:
zcat cleanWMask.bed.gz | cut -f1-3 > cleanWMask.bed3.bed
bedToBigBed -tab cleanWMask.bed3.bed ../../chrom.sizes ricCom1.windowMasker.bb
rm -f cleanWMask.bed3.bed

Revision as of 18:00, 23 April 2013

Construct counts file

cd /data/genomes/ricCom1/bed/windowMasker
set fa = ricCom1.fa
set tmpDir = `mktemp -d -p /scratch/tmp doWindowMasker.XXXXXX`
chmod 775 $tmpDir
set inputTwoBit = /data/genomes/ricCom1/ricCom1.unmasked.2bit
pushd $tmpDir
twoBitToFa $inputTwoBit $fa
windowMasker -mk_counts true -input $fa -output windowmasker.counts
popd 
cp $tmpDir/windowmasker.counts .
rm -rf $tmpDir

Window Masker with sdust option

mkdir /data/genomes/ricCom1/bed/windowMasker
cd /data/genomes/ricCom1/bed/windowMasker
set fa = ricCom1.fa
set tmpDir = `mktemp -d -p /scratch/tmp doWindowMasker.XXXXXX`
chmod 775 $tmpDir
set inputTwoBit = /data/genomes/ricCom1/ricCom1.unmasked.2bit
cp windowmasker.counts $tmpDir
pushd $tmpDir
twoBitToFa $inputTwoBit $fa
windowMasker -ustat windowmasker.counts -sdust true -input $fa -output windowmasker.intervals
perl -wpe 'if (s/^>lcl\|(.*)\n$//) { $chr = $1; } \
   if (/^(\d+) - (\d+)/) { \
   $s=$1; $e=$2+1; s/(\d+) - (\d+)/$chr\t$s\t$e/; \
   }' windowmasker.intervals > windowmasker.sdust.bed
popd 
cp $tmpDir/windowmasker.sdust.bed .
rm -rf $tmpDir

Construct masked 2bit file

cd /data/genomes/ricCom1/bed/windowMasker
twoBitMask /data/genomes/ricCom1/ricCom1.unmasked.2bit windowmasker.sdust.bed ricCom1.wmsk.sdust.2bit

Measure masking result:

twoBitToFa ricCom1.wmsk.sdust.2bit stdout | faSize stdin > faSize.ricCom1.wmsk.sdust.txt 2>&1

Results:

350621860 bases (13662715 N's 336959145 real 185528058 upper 151431087 lower) in 25763 sequences in 1 files
Total size: mean 13609.5 sd 106411.1 min 202 (EQ999533) max 4693355 (EQ973772) median 1094
%43.19 masked total, %44.94 masked real

Clean mask result

Window Masker masks the N's in the sequence. I would prefer to have a clean mask result without these gaps masked. The construction of the clean Window Masker result is done with featureBits to intersect with 'not gaps' so that only the sequence is masked. To construct the gap and not gap bed files for this intersection:

findMotif -motif=gattaca -verbose=4 -strand=+ ../../ricCom1.unmasked.2bit > findMotif.txt 2>&1
grep "^#GAP " findMotif.txt | sed -e "s/^#GAP //" | sort -k1,1 -k2,2n > allGaps.bed
awk '{printf "%s\t%d\tricCom1.2bit\n", $1, $2}' ../../chrom.sizes > chromInfo.tab
featureBits test -chromSize=chromInfo.tab -not -bed=notGap.bed allGaps.bed

Using featureBits requires the $HOME/.hg.conf file setup to use the public (or local) MySQL server even though there is no database involved in this operation. It is a drawback of featureBits that it connects to the MySQL server even though it isn't needed.

Using the notGap.bed file, intersect with the Window Masker result:

featureBits test -chromSize=chromInfo.tab windowmasker.sdust.bed notGap.bed -bed=stdout 2> /dev/null \
 | sort -k1,1 -k2,2n | gzip -c > cleanWMask.bed.gz

Construct bigBed file to display in a track/assembly hub genome browser. Do not need the names in the bed file, just the position:

zcat cleanWMask.bed.gz | cut -f1-3 > cleanWMask.bed3.bed
bedToBigBed -tab cleanWMask.bed3.bed ../../chrom.sizes ricCom1.windowMasker.bb
rm -f cleanWMask.bed3.bed