Window Masker: Difference between revisions
(initial contents) |
(clean masking) |
||
Line 43: | Line 43: | ||
Total size: mean 13609.5 sd 106411.1 min 202 (EQ999533) max 4693355 (EQ973772) median 1094 | Total size: mean 13609.5 sd 106411.1 min 202 (EQ999533) max 4693355 (EQ973772) median 1094 | ||
%43.19 masked total, %44.94 masked real | %43.19 masked total, %44.94 masked real | ||
==Clean mask result== | |||
Window Masker masks the '''N's''' in the sequence. I would prefer to have a ''clean'' mask result without these gaps masked. | |||
The construction of the ''clean'' Window Masker result is done with ''featureBits'' to intersect with 'not gaps' so that | |||
only the sequence is masked. To construct the ''gap'' and ''not gap'' bed files for this intersection: | |||
findMotif -motif=gattaca -verbose=4 -strand=+ ../../ricCom1.unmasked.2bit > findMotif.txt 2>&1 | |||
grep "^#GAP " findMotif.txt | sed -e "s/^#GAP //" | sort -k1,1 -k2,2n > allGaps.bed | |||
awk '{printf "%s\t%d\tricCom1.2bit\n", $1, $2}' ../../chrom.sizes > chromInfo.tab | |||
featureBits test -chromSize=chromInfo.tab -not -bed=notGap.bed allGaps.bed | |||
Using featureBits requires the '''$HOME/.hg.conf''' [http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob_plain;f=src/product/README.mysql.setup file setup] | |||
to use the public (or local) MySQL server even though | |||
there is no database involved in this operation. It is a drawback of ''featureBits'' that it connects to the MySQL server | |||
even though it isn't needed. | |||
Using the ''notGap.bed'' file, intersect with the Window Masker result: | |||
featureBits test -chromSize=chromInfo.tab windowmasker.sdust.bed notGap.bed -bed=stdout 2> /dev/null \ | |||
| sort -k1,1 -k2,2n | gzip -c > cleanWMask.bed.gz | |||
Construct bigBed file to display in a track/assembly hub genome browser. | |||
Do not need the names in the bed file, just the position: | |||
zcat cleanWMask.bed.gz | cut -f1-3 > cleanWMask.bed3.bed | |||
bedToBigBed -tab cleanWMask.bed3.bed ../../chrom.sizes ricCom1.windowMasker.bb | |||
rm -f cleanWMask.bed3.bed |
Revision as of 18:00, 23 April 2013
Construct counts file
cd /data/genomes/ricCom1/bed/windowMasker set fa = ricCom1.fa set tmpDir = `mktemp -d -p /scratch/tmp doWindowMasker.XXXXXX` chmod 775 $tmpDir set inputTwoBit = /data/genomes/ricCom1/ricCom1.unmasked.2bit pushd $tmpDir twoBitToFa $inputTwoBit $fa windowMasker -mk_counts true -input $fa -output windowmasker.counts popd cp $tmpDir/windowmasker.counts . rm -rf $tmpDir
Window Masker with sdust option
mkdir /data/genomes/ricCom1/bed/windowMasker cd /data/genomes/ricCom1/bed/windowMasker set fa = ricCom1.fa set tmpDir = `mktemp -d -p /scratch/tmp doWindowMasker.XXXXXX` chmod 775 $tmpDir set inputTwoBit = /data/genomes/ricCom1/ricCom1.unmasked.2bit cp windowmasker.counts $tmpDir pushd $tmpDir twoBitToFa $inputTwoBit $fa windowMasker -ustat windowmasker.counts -sdust true -input $fa -output windowmasker.intervals perl -wpe 'if (s/^>lcl\|(.*)\n$//) { $chr = $1; } \ if (/^(\d+) - (\d+)/) { \ $s=$1; $e=$2+1; s/(\d+) - (\d+)/$chr\t$s\t$e/; \ }' windowmasker.intervals > windowmasker.sdust.bed popd cp $tmpDir/windowmasker.sdust.bed . rm -rf $tmpDir
Construct masked 2bit file
cd /data/genomes/ricCom1/bed/windowMasker twoBitMask /data/genomes/ricCom1/ricCom1.unmasked.2bit windowmasker.sdust.bed ricCom1.wmsk.sdust.2bit
Measure masking result:
twoBitToFa ricCom1.wmsk.sdust.2bit stdout | faSize stdin > faSize.ricCom1.wmsk.sdust.txt 2>&1
Results:
350621860 bases (13662715 N's 336959145 real 185528058 upper 151431087 lower) in 25763 sequences in 1 files Total size: mean 13609.5 sd 106411.1 min 202 (EQ999533) max 4693355 (EQ973772) median 1094 %43.19 masked total, %44.94 masked real
Clean mask result
Window Masker masks the N's in the sequence. I would prefer to have a clean mask result without these gaps masked. The construction of the clean Window Masker result is done with featureBits to intersect with 'not gaps' so that only the sequence is masked. To construct the gap and not gap bed files for this intersection:
findMotif -motif=gattaca -verbose=4 -strand=+ ../../ricCom1.unmasked.2bit > findMotif.txt 2>&1 grep "^#GAP " findMotif.txt | sed -e "s/^#GAP //" | sort -k1,1 -k2,2n > allGaps.bed awk '{printf "%s\t%d\tricCom1.2bit\n", $1, $2}' ../../chrom.sizes > chromInfo.tab featureBits test -chromSize=chromInfo.tab -not -bed=notGap.bed allGaps.bed
Using featureBits requires the $HOME/.hg.conf file setup to use the public (or local) MySQL server even though there is no database involved in this operation. It is a drawback of featureBits that it connects to the MySQL server even though it isn't needed.
Using the notGap.bed file, intersect with the Window Masker result:
featureBits test -chromSize=chromInfo.tab windowmasker.sdust.bed notGap.bed -bed=stdout 2> /dev/null \ | sort -k1,1 -k2,2n | gzip -c > cleanWMask.bed.gz
Construct bigBed file to display in a track/assembly hub genome browser. Do not need the names in the bed file, just the position:
zcat cleanWMask.bed.gz | cut -f1-3 > cleanWMask.bed3.bed bedToBigBed -tab cleanWMask.bed3.bed ../../chrom.sizes ricCom1.windowMasker.bb rm -f cleanWMask.bed3.bed