Mm9 multiple alignment: Difference between revisions

From genomewiki
Jump to navigationJump to search
(add link to other multiple alignments)
 
(63 intermediate revisions by 3 users not shown)
Line 1: Line 1:
==See also:==
<UL>
<LI>all other [[UCSC Multiple Alignments]]</LI>
</UL>
==Mouse mm9 multiple alignment/conservation track==
<TABLE BORDER=0>
<TR><TD>
To avoid artifacts in downstream processing of the UCSC multiple alignments, it is important to be careful on the use of the parameters used in the blastz processing pipeline.  There are a number of steps in the pipeline and a variety of tunable parameters involved.  This page will track the various parameters used in the alignments as they proceed toward the completion of a multiple alignment conservation track on the mm9 mouse (NCBI build 37) assembly
To avoid artifacts in downstream processing of the UCSC multiple alignments, it is important to be careful on the use of the parameters used in the blastz processing pipeline.  There are a number of steps in the pipeline and a variety of tunable parameters involved.  This page will track the various parameters used in the alignments as they proceed toward the completion of a multiple alignment conservation track on the mm9 mouse (NCBI build 37) assembly
The chrom count in the table below does not include haplotypes, chr*_random, chrUn or chrM unless chrUn or scaffolds are the only sequences for that assembly.
The genome size has the same limitation as the chrom count, no randoms.
Tree distances are from the hg18 28-way measurements, with ponAbe2 and calJac1 manually inserted into the tree.
1 Mb = 1000 * 1000 = 1,000,000 bases
Types of mafs used in the multiple alignment:
<UL>
<LI> 2X mammal assemblies == reciprocal best net<BR>
calJac1, cavPor2, tupBel1, otoGar1, dasNov1,
oryCun1, felCat3, loxAfr1, eriEur1, sorAra1, echTel1
<LI> high coverage placental mammals == syntenic net<BR>
rn4, hg18, rheMac2, ponAbe2, panTro2, equCab1, canFam2, bosTau3
<LI> all others == standard net<BR>
monDom4, ornAna1, galGal3, anoCar1, xenTro2, gasAcu1,
danRer5, tetNig1, fr2, oryLat1
</UL>
</TD><TD>
[[Image:Mm9_30way.gif|frame|30-way phylogenetic tree]]
</TD></TR></TABLE>


==axtChain parameters and end results==
==axtChain parameters and end results==
Line 14: Line 47:
   <TH>% of other<BR>matched by mm9</TH>
   <TH>% of other<BR>matched by mm9</TH>
   <TH>done</TH>
   <TH>done</TH>
</TR>
<TR>
  <TH>mouse mm9<BR>self alignment</TH>
  <TD>21</TD>
  <TD>2654 Mb</TD>
  <TD>0.0</TD>
  <TD>10000</TD>
  <TD>medium</TD>
  <TD>14.458</TD>
  <TD>N/A</TD>
  <TD>2007-08-31</TD>
</TR>
</TR>


Line 19: Line 64:
   <TH>rat rn4</TH>
   <TH>rat rn4</TH>
   <TD>21</TD>
   <TD>21</TD>
   <TD>2,702 Mb</TD>
   <TD>2718 Mb</TD>
   <TD>0.1587</TD>  
   <TD>0.160657</TD>  
   <TD>3000</TD>
   <TD>5000</TD>
   <TD>medium</TD>
   <TD>medium</TD>
   <TD>68.357</TD>
   <TD>65.380</TD>
   <TD>69.541</TD>
   <TD>66.538</TD>
   <TD>16 August</TD>
   <TD>2007-08-31</TD>
</TR>
</TR>


Line 31: Line 76:
   <TH>human hg18</TH>
   <TH>human hg18</TH>
   <TD>24</TD>
   <TD>24</TD>
   <TD>2,963 Mb</TD>
   <TD>3080 Mb</TD>
   <TD>0.4667</TD>  
   <TD>0.452619</TD>  
   <TD>3000</TD>
   <TD>3000</TD>
   <TD>medium</TD>
   <TD>medium</TD>
   <TD>38.499</TD>
   <TD>38.499</TD>
   <TD>35.201</TD>
   <TD>35.201</TD>
   <TD>16 August</TD>
   <TD>2007-08-16</TD>
<TR>
<TR>


<TR>
<TR>
   <TD>Chimp panTro2</TD>
   <TH>Rhesus rheMac2</TH>
   <TD>25</TD>
   <TD>22</TD>
   <TD>3028 Mb</TD>
   <TD>2864 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>0.452745</TD>
   <TD>3000</TD>
   <TD>3000</TD>
   <TD>medium</TD>
   <TD>medium</TD>
   <TD>xx.123</TD>
   <TD>38.087</TD>
   <TD>xx.456</TD>
   <TD>41.335</TD>
   <TD>tbd</TD>
   <TD>2007-09-07</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>tbd ponAbe1</TD>
   <TH>Orangutan ponAbe2</TH>
   <TD>79553</TD>
   <TD>25</TD>
   <TD>3090 Mb</TD>
   <TD>3029 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>0.453809</TD>
   <TD>3000</TD>
   <TD>3000</TD>
   <TD>medium</TD>
   <TD>medium</TD>
   <TD>xx.123</TD>
   <TD>34.902</TD>
   <TD>xx.456</TD>
   <TD>30.659</TD>
   <TD>tbd</TD>
   <TD>2007-09-21</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Rhesus rheMac2</TD>
   <TH>Marmoset calJac1</TH>
   <TD>22</TD>
   <TD>49724</TD>
   <TD>2731 Mb</TD>
   <TD>3029 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>0.454272</TD>
   <TD>3000</TD>
   <TD>3000</TD>
   <TD>medium</TD>
   <TD>medium</TD>
   <TD>xx.123</TD>
   <TD>32.971</TD>
   <TD>xx.456</TD>
   <TD>30.302</TD>
   <TD>tbd</TD>
   <TD>2007-09-07</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Bushbaby otoGar1</TD>
   <TH>Chimp panTro2</TH>
   <TD>120882</TD>
   <TD>25</TD>
   <TD>3261 Mb</TD>
   <TD>3175 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>0.454514</TD>
   <TD>3000</TD>
   <TD>3000</TD>
   <TD>medium</TD>
   <TD>medium</TD>
   <TD>xx.123</TD>
   <TD>37.674</TD>
   <TD>xx.456</TD>
   <TD>34.269</TD>
   <TD>tbd</TD>
   <TD>2007-09-25</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>tbd calJac1</TD>
   <TH>GuineaPig cavPor2</TH>
   <TD>49724</TD>
   <TD>295514</TD>
   <TD>2889 Mb</TD>
   <TD>3403 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>0.479871</TD>
   <TD>3000</TD>
   <TD>3000</TD>
   <TD>medium</TD>
   <TD>medium</TD>
   <TD>xx.123</TD>
   <TD>18.326</TD>
   <TD>xx.456</TD>
   <TD>N/A</TD>
   <TD>tbd</TD>
   <TD>2007-09-21</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>TreeShrew tupBel1</TD>
   <TH>Horse equCab1</TH>
   <TD>150851</TD>
   <TD>32</TD>
   <TD>3491 Mb</TD>
   <TD>2056 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>0.479871</TD>
   <TD>3000</TD>
   <TD>3000</TD>
   <TD>medium</TD>
   <TD>medium</TD>
   <TD>xx.123</TD>
   <TD>34.782</TD>
   <TD>xx.456</TD>
   <TD>37.217</TD>
   <TD>tbd</TD>
   <TD>2007-09-25</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>GuineaPig cavPor2</TD>
   <TH>TreeShrew tupBel1</TH>
   <TD>295514</TD>
   <TD>150851</TD>
   <TD>3246 Mb</TD>
   <TD>3660 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>0.494934</TD>
   <TD>3000</TD>
   <TD>3000</TD>
   <TD>medium</TD>
   <TD>medium</TD>
   <TD>xx.123</TD>
   <TD>21.099</TD>
   <TD>xx.456</TD>
   <TD>N/A</TD>
   <TD>tbd</TD>
   <TD>2007-10-01</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Rabbit oryCun1</TD>
   <TH>Bushbaby otoGar1</TH>
   <TD>215471</TD>
   <TD>120882</TD>
   <TD>3303 Mb</TD>
   <TD>3420 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>0.498957</TD>
   <TD>3000</TD>
   <TD>3000</TD>
   <TD>medium</TD>
   <TD>medium</TD>
   <TD>xx.123</TD>
   <TD>22.972</TD>
   <TD>xx.456</TD>
   <TD>N/A</TD>
   <TD>tbd</TD>
   <TD>2007-09-28</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Shrew sorAra1</TD>
   <TH>Armadillo dasNov1</TH>
   <TD>262057</TD>
   <TD>304391</TD>
   <TD>2800 Mb</TD>
   <TD>3856 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>0.517360</TD>
   <TD>3000</TD>
   <TD>3000</TD>
   <TD>medium</TD>
   <TD>medium</TD>
   <TD>xx.123</TD>
   <TD>16.547</TD>
   <TD>xx.456</TD>
   <TD>N/A</TD>
   <TD>tbd</TD>
   <TD>2007-10-02</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Hedgehog eriEur1</TD>
   <TH>Rabbit oryCun1</TH>
   <TD>379801</TD>
   <TD>215471</TD>
   <TD>3211 Mb</TD>
   <TD>3464 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>0.519779</TD>
   <TD>3000</TD>
   <TD>3000</TD>
   <TD>medium</TD>
   <TD>medium</TD>
   <TD>xx.123</TD>
   <TD>18.945</TD>
   <TD>xx.456</TD>
   <TD>N/A</TD>
   <TD>tbd</TD>
   <TD>2007-09-29</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Dog canFam2</TD>
   <TH>Cat felCat3</TH>
   <TD>39</TD>
   <TD>217790</TD>
   <TD>2331 Mb</TD>
   <TD>4045 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>0.530610</TD>
   <TD>3000</TD>
   <TD>3000</TD>
   <TD>medium</TD>
   <TD>medium</TD>
   <TD>xx.123</TD>
   <TD>19.077</TD>
   <TD>xx.456</TD>
   <TD>N/A</TD>
   <TD>tbd</TD>
   <TD>2007-09-29</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Cat felCat3</TD>
   <TH>Dog canFam2</TH>
   <TD>217790</TD>
   <TD>39</TD>
   <TD>3858 Mb</TD>
   <TD>2445 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>0.533544</TD>
   <TD>3000</TD>
   <TD>3000</TD>
   <TD>medium</TD>
   <TD>medium</TD>
   <TD>xx.123</TD>
   <TD>32.362</TD>
   <TD>xx.456</TD>
   <TD>34.891</TD>
   <TD>tbd</TD>
   <TD>2007-09-05</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Horse equCab1</TD>
   <TH>Elephant loxAfr1</TH>
   <TD>32</TD>
   <TD>233134</TD>
   <TD>1961 Mb</TD>
   <TD>3707 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>0.536627</TD>
   <TD>3000</TD>
   <TD>3000</TD>
   <TD>medium</TD>
   <TD>medium</TD>
   <TD>xx.123</TD>
   <TD>18.052</TD>
   <TD>xx.456</TD>
   <TD>N/A</TD>
   <TD>tbd</TD>
   <TD>2007-10-01</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Cow bosTau3</TD>
   <TH>Cow bosTau3</TH>
   <TD>30</TD>
   <TD>30</TD>
   <TD>2321 Mb</TD>
   <TD>2434 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>0.540852</TD>
   <TD>3000</TD>
   <TD>3000</TD>
   <TD>medium</TD>
   <TD>medium</TD>
   <TD>xx.123</TD>
   <TD>26.352</TD>
   <TD>xx.456</TD>
   <TD>25.909</TD>
   <TD>tbd</TD>
   <TD>2007-09-25</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Armadillo dasNov1</TD>
   <TH>Hedgehog eriEur1</TH>
   <TD>304391</TD>
   <TD>379801</TD>
   <TD>3678 Mb</TD>
   <TD>3367 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>0.632457</TD>
   <TD>3000</TD>
   <TD>3000</TD>
   <TD>medium</TD>
   <TD>medium</TD>
   <TD>xx.123</TD>
   <TD>10.022</TD>
   <TD>xx.456</TD>
   <TD>N/A</TD>
   <TD>tbd</TD>
   <TD>2007-10-02</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Elephant loxAfr1</TD>
   <TH>Shrew sorAra1</TH>
   <TD>233134</TD>
   <TD>262057</TD>
   <TD>3535 Mb</TD>
   <TD>2936 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>0.658734</TD>
   <TD>3000</TD>
   <TD>3000</TD>
   <TD>medium</TD>
   <TD>medium</TD>
   <TD>xx.123</TD>
   <TD>9.556</TD>
   <TD>xx.456</TD>
   <TD>N/A</TD>
   <TD>tbd</TD>
   <TD>2007-10-01</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Tenrec echTel1</TD>
   <TH>Tenrec echTel1</TH>
   <TD>325491</TD>
   <TD>325491</TD>
   <TD>3646 Mb</TD>
   <TD>3823 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>0.666303</TD>
   <TD>3000</TD>
   <TD>3000</TD>
   <TD>medium</TD>
   <TD>medium</TD>
   <TD>xx.123</TD>
   <TD>11.141</TD>
   <TD>xx.456</TD>
   <TD>14.144</TD>
   <TD>tbd</TD>
   <TD>2007-09-27</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Opossum monDom4</TD>
   <TH>Opossum monDom4</TH>
   <TD>9</TD>
   <TD>9</TD>
   <TD>3272 Mb</TD>
   <TD>3431 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>0.909852</TD>
   <TD>3000</TD>
   <TD>5000</TD>
   <TD>medium</TD>
   <TD>loose</TD>
   <TD>xx.123</TD>
   <TD>9.752</TD>
   <TD>xx.456</TD>
   <TD>7.254</TD>
   <TD>tbd</TD>
   <TD>2007-09-27</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Platypus ornAna1</TD>
   <TH>Platypus ornAna1</TH>
   <TD>201522</TD>
   <TD>201522</TD>
   <TD>1904 Mb</TD>
   <TD>1996 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>1.165888</TD>
   <TD>3000</TD>
   <TD>5000</TD>
   <TD>medium</TD>
   <TD>loose</TD>
   <TD>xx.123</TD>
   <TD>5.417</TD>
   <TD>xx.456</TD>
   <TD>7.359</TD>
   <TD>tbd</TD>
   <TD>2007-09-25</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Chicken galGal3</TD>
   <TH>Chicken galGal3</TH>
   <TD>33</TD>
   <TD>33</TD>
   <TD>984 Mb</TD>
   <TD>1032 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>1.285399</TD>
   <TD>3000</TD>
   <TD>5000</TD>
   <TD>medium</TD>
   <TD>loose</TD>
   <TD>xx.123</TD>
   <TD>3.729</TD>
   <TD>xx.456</TD>
   <TD>8.152</TD>
   <TD>tbd</TD>
   <TD>2007-09-25</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Lizard anoCar1</TD>
   <TH>Lizard anoCar1</TH>
   <TD>7233</TD>
   <TD>7233</TD>
   <TD>1699 Mb</TD>
   <TD>1781 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>1.404225</TD>
   <TD>3000</TD>
   <TD>5000</TD>
   <TD>medium</TD>
   <TD>loose</TD>
   <TD>xx.123</TD>
   <TD>3.406</TD>
   <TD>xx.456</TD>
   <TD>4.934</TD>
   <TD>tbd</TD>
   <TD>2007-09-21</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>X. tropicalis xenTro2</TD>
   <TH>X. tropicalis xenTro2</TH>
   <TD>19759</TD>
   <TD>19759</TD>
   <TD>1443 Mb</TD>
   <TD>1513 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>1.726205</TD>
   <TD>3000</TD>
   <TD>5000</TD>
   <TD>medium</TD>
   <TD>loose</TD>
   <TD>xx.123</TD>
   <TD>3.131</TD>
   <TD>xx.456</TD>
   <TD>2.651</TD>
   <TD>tbd</TD>
   <TD>2007-09-24</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Tetraodon tetNig1</TD>
   <TH>Stickleback gasAcu1</TH>
   <TD>21</TD>
   <TD>21</TD>
   <TD>207 Mb</TD>
   <TD>400 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>2.012649</TD>
   <TD>3000</TD>
   <TD>5000</TD>
   <TD>medium</TD>
   <TD>loose</TD>
   <TD>xx.123</TD>
   <TD>1.849</TD>
   <TD>xx.456</TD>
   <TD>9.791</TD>
   <TD>tbd</TD>
   <TD>2007-09-07</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Fugu fr2</TD>
   <TH>Zebrafish danRer5</TH>
   <TD>1</TD>
   <TD>25</TD>
   <TD>381 Mb</TD>
   <TD>1547 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>2.027153</TD>
   <TD>3000</TD>
   <TD>5000</TD>
   <TD>medium</TD>
   <TD>loose</TD>
   <TD>xx.123</TD>
   <TD>3.255</TD>
   <TD>xx.456</TD>
   <TD>4.625</TD>
   <TD>tbd</TD>
   <TD>2007-09-13</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Stickleback gasAcu1</TD>
   <TH>Tetraodon tetNig1</TH>
   <TD>21</TD>
   <TD>21</TD>
   <TD>382 Mb</TD>
   <TD>217 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>2.051015</TD>
   <TD>3000</TD>
   <TD>5000</TD>
   <TD>medium</TD>
   <TD>loose</TD>
   <TD>xx.123</TD>
   <TD>1.763</TD>
   <TD>xx.456</TD>
   <TD>12.341</TD>
   <TD>tbd</TD>
   <TD>2007-09-07</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Medaka oryLat1</TD>
   <TH>Fugu fr2</TH>
   <TD>24</TD>
   <TD>1</TD>
   <TD>690 Mb</TD>
   <TD>400 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>2.086669</TD>
   <TD>3000</TD>
   <TD>5000</TD>
   <TD>medium</TD>
   <TD>loose</TD>
   <TD>xx.123</TD>
   <TD>1.794</TD>
   <TD>xx.456</TD>
   <TD>10.784</TD>
   <TD>tbd</TD>
   <TD>2007-09-07</TD>
</TR>
</TR>


<TR>
<TR>
   <TD>Zebrafish danRer4</TD>
   <TH>Medaka oryLat1</TH>
   <TD>25</TD>
   <TD>24</TD>
   <TD>1475 Mb</TD>
   <TD>724 Mb</TD>
   <TD>X.xxxx</TD>
   <TD>2.200402</TD>
   <TD>3000</TD>
   <TD>5000</TD>
   <TD>medium</TD>
   <TD>loose</TD>
   <TD>xx.123</TD>
   <TD>1.933</TD>
   <TD>xx.456</TD>
   <TD>6.495</TD>
   <TD>tbd</TD>
   <TD>2007-08-31</TD>
</TR>
</TR>


Line 369: Line 414:
(*) chrom count does not include haplotypes, chr*_random, chrUn or chrM unless chrUn or scaffolds are the only sequences for that assembly.
(*) chrom count does not include haplotypes, chr*_random, chrUn or chrM unless chrUn or scaffolds are the only sequences for that assembly.


chrom size has the same limitation as the chrom count, no randoms.
The genome size has the same limitation as the chrom count, no randoms.
 
Tree distances are from the hg18 28-way measurements, with ponAbe1 and calJac1 manually inserted into the tree.


==blastz alignment parameters details==
==blastz alignment parameters details==
One special run, Rat rn4:
<PRE>
M=40M, K=4500, L=4500, Y=15000, T=2 O=600, Q=mouse_rat.q E=55
</PRE>


<TABLE BORDER=1>
<TABLE BORDER=1>
<TR>
<TR>
   <TH>target</TH>
   <TH>query </TH>
   <TH>query</TH>
  <TD>abridged<BR>repeats</TD>
   <TH>abridged<BR>repeats</TH>
  <TD>M</TD>
   <TH>target<BR>size<BR>(overlap)</TH>
  <TD>K</TD>
   <TH>query<BR>size<BR>(overlap)</TH>
  <TD>L</TD>
   <TH>H</TH>
  <TD>Q</TD>
   <TH>M</TH>
  <TD>Y</TD>
</TR>
<TR>
   <TH>Rat rn4</TH>
  <TD>yes</TD>
  <TD>40M</TD>
  <TD>4500</TD>
  <TD>4500</TD>
  <TD>mouse-rat</TD>
  <TD>15000</TD>
</TR>
<TR>
  <TH>Human hg18</TH>
  <TD>yes</TD>
  <TD>40M</TD>
  <TD>3K</TD>
  <TD>3K</TD>
  <TD>default</TD>
  <TD>9400</TD>
</TR>
<TR>
  <TH>Rhesus rheMac2</TH>
  <TD>yes</TD>
  <TD>40M</TD>
  <TD>3K</TD>
  <TD>3K</TD>
  <TD>default</TD>
  <TD>9400</TD>
</TR>
<TR>
  <TH>Orangutan ponAbe2</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>3K</TD>
  <TD>3K</TD>
  <TD>default</TD>
  <TD>9400</TD>
</TR>
<TR>
  <TH>Marmoset calJac1</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>3K</TD>
  <TD>3K</TD>
  <TD>default</TD>
  <TD>9400</TD>
</TR>
<TR>
  <TH>Chimp panTro2</TH>
  <TD>yes</TD>
  <TD>40M</TD>
  <TD>3K</TD>
  <TD>3K</TD>
  <TD>default</TD>
  <TD>9400</TD>
</TR>
<TR>
   <TH>GuineaPig cavPor2</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>3K</TD>
  <TD>3K</TD>
  <TD>default</TD>
  <TD>9400</TD>
</TR>
<TR>
  <TH>Horse equCab1</TH>
  <TD>no</TD>
  <TD>40M</TD>
  <TD>3K</TD>
  <TD>3K</TD>
  <TD>default</TD>
  <TD>9400</TD>
</TR>
<TR>
  <TH>TreeShrew tupBel1</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>3K</TD>
  <TD>3K</TD>
  <TD>default</TD>
  <TD>9400</TD>
</TR>
<TR>
   <TH>Bushbaby otoGar1</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>3K</TD>
  <TD>3K</TD>
  <TD>default</TD>
  <TD>9400</TD>
</TR>
<TR>
  <TH>Armadillo dasNov1</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>3K</TD>
  <TD>3K</TD>
  <TD>default</TD>
  <TD>9400</TD>
</TR>
<TR>
   <TH>Rabbit oryCun1</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>3K</TD>
  <TD>3K</TD>
  <TD>default</TD>
  <TD>9400</TD>
</TR>
<TR>
  <TH>Cat felCat3</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>3K</TD>
  <TD>3K</TD>
  <TD>default</TD>
  <TD>9400</TD>
</TR>
<TR>
   <TH>Dog canFam2</TH>
  <TD>yes</TD>
  <TD>40M</TD>
  <TD>3K</TD>
  <TD>3K</TD>
  <TD>default</TD>
  <TD>9400</TD>
</TR>
<TR>
   <TH>Elephant loxAfr1</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>3K</TD>
  <TD>3K</TD>
  <TD>default</TD>
  <TD>9400</TD>
</TR>
</TR>
<TR>
<TR>
   <TH>mm9</TH>
   <TH>Cow bosTau3</TH>
   <TD>rat rn4</TD>
   <TD>no</TD>
  <TD>yes<BR>B=0</TD>
  <TD>10M (10K)</TD>
  <TD>10M (0)</TD>
  <TD>2000</TD>
   <TD>40M</TD>
   <TD>40M</TD>
  <TD>3K</TD>
  <TD>3K</TD>
  <TD>default</TD>
  <TD>9400</TD>
</TR>
<TR>
  <TH>Hedgehog eriEur1</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>3K</TD>
  <TD>3K</TD>
  <TD>default</TD>
  <TD>9400</TD>
</TR>
<TR>
  <TH>Shrew sorAra1</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>3K</TD>
  <TD>3K</TD>
  <TD>default</TD>
  <TD>9400</TD>
</TR>
<TR>
  <TH>Tenrec echTel1</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>3K</TD>
  <TD>3K</TD>
  <TD>default</TD>
  <TD>9400</TD>
</TR>
<TR>
  <TH>Opossum monDom4</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>2200</TD>
  <TD>6000</TD>
  <TD>HoxD55</TD>
  <TD>3400</TD>
</TR>
<TR>
  <TH>Platypus ornAna1</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>2200</TD>
  <TD>6000</TD>
  <TD>HoxD55</TD>
  <TD>3400</TD>
</TR>
</TR>
<TR>
<TR>
   <TH>human hg18</TH>
   <TH>Chicken galGal3</TH>
  <TD>mm9</TD>
   <TD>yes</TD>
   <TD>yes<BR>B=0</TD>
  <TD>10M (0)</TD>
  <TD>10M (10K)</TD>
  <TD>2000</TD>
   <TD>40M</TD>
   <TD>40M</TD>
  <TD>2200</TD>
  <TD>6000</TD>
  <TD>HoxD55</TD>
  <TD>3400</TD>
</TR>
<TR>
  <TH>Lizard anoCar1</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>2200</TD>
  <TD>6000</TD>
  <TD>HoxD55</TD>
  <TD>3400</TD>
</TR>
<TR>
  <TH>X_tropicalis xenTro2</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>2200</TD>
  <TD>6000</TD>
  <TD>HoxD55</TD>
  <TD>3400</TD>
</TR>
<TR>
  <TH>Stickleback gasAcu1</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>2200</TD>
  <TD>6000</TD>
  <TD>HoxD55</TD>
  <TD>3400</TD>
</TR>
<TR>
  <TH>Zebrafish danRer5</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>2200</TD>
  <TD>6000</TD>
  <TD>HoxD55</TD>
  <TD>3400</TD>
</TR>
<TR>
  <TH>Tetraodon tetNig1</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>2200</TD>
  <TD>6000</TD>
  <TD>HoxD55</TD>
  <TD>3400</TD>
</TR>
<TR>
  <TH>Fugu fr2</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>2200</TD>
  <TD>6000</TD>
  <TD>HoxD55</TD>
  <TD>3400</TD>
</TR>
<TR>
<TR>
  <TH>Medaka oryLat1</TH>
  <TD>no</TD>
  <TD>50</TD>
  <TD>2200</TD>
  <TD>6000</TD>
  <TD>HoxD55</TD>
  <TD>3400</TD>
</TR>
</TABLE>
</TABLE>


==default blastz parameters==
==default blastz parameters==
Line 438: Line 730:
</PRE>
</PRE>


==matrix parameters==
<TABLE BORDER=0>
<TR><TD>The default<BR>scoring matrix is:</TD><TD WIDTH=42>&nbsp;</TD>
<TD>The HoxD55<BR>scoring matrix is:</TD><TD WIDTH=42>&nbsp;</TD><TD>The mouse-rat<BR>scoring matrix is:</TD></TR>
 
<TR><TD>
<TABLE BORDER=1>
<TR><TH>&nbsp;</TH><TH>A</TH><TH>C</TH><TH>G</TH><TH>T</TH></TR>
<TR><TH>A</TH><TD>91</TD><TD>-114</TD><TD>-31</TD><TD>-123</TD></TR>
<TR><TH>C</TH><TD>-114</TD><TD>100</TD><TD>-125</TD><TD>-31</TD></TR>
<TR><TH>G</TH><TD>-31</TD><TD>-125</TD><TD>100</TD><TD>-114</TD></TR>
<TR><TH>T</TH><TD>-123</TD><TD>-31</TD><TD>-114</TD><TD>91</TD></TR>
</TABLE>
O=400 E=30
</TD><TD WIDTH=42>&nbsp;</TD><TD><TABLE BORDER=1>
<TR><TH>&nbsp;</TH><TH>A</TH><TH>C</TH><TH>G</TH><TH>T</TH></TR>
<TR><TH>A</TH><TD>91</TD><TD>-90</TD><TD>-25</TD><TD>-100</TD></TR>
<TR><TH>C</TH><TD>-90</TD><TD>100</TD><TD>-100</TD><TD>-25</TD></TR>
<TR><TH>G</TH><TD>-25</TD><TD>-100</TD><TD>100</TD><TD>-90</TD></TR>
<TR><TH>T</TH><TD>-100</TD><TD>-25</TD><TD>-90</TD><TD>91</TD></TR>
</TABLE>
O=400 E=30
</TD><TD WIDTH=42>&nbsp;</TD><TD><TABLE BORDER=1>
<TR><TH>&nbsp;</TH><TH>A</TH><TH>C</TH><TH>G</TH><TH>T</TH></TR>
<TR><TH>A</TH><TD>56</TD><TD>-109</TD><TD>-45</TD><TD>-137</TD></TR>
<TR><TH>C</TH><TD>-109</TD><TD>100</TD><TD>-103</TD><TD>-45</TD></TR>
<TR><TH>G</TH><TD>-45</TD><TD>-103</TD><TD>100</TD><TD>-109</TD></TR>
<TR><TH>T</TH><TD>-137</TD><TD>-45</TD><TD>-109</TD><TD>56</TD></TR>
</TABLE>
O=600 E=55
</TD>
</TR></TABLE>
 
 
==Additional doBlastzChainNet.pl parameters to tune==
 
These parameters for blastz are placed in the '''DEF''' file used by the <code>doBlastzChainNet.pl</code> perl script.  There are some
additional parameters worth discussing:
* SEQ1* variables describe the target/reference assembly, while SEQ2* variables describe the query assembly.
* SEQn_CHUNK and SEQn_LAP determine the step size and overlap size of chunks into which large target or query sequences are to be split before alignment. SEQ1_LAP is always set to 10000 for reasons described in the blastz paper, while SEQ2_LAP is traditionaly set to 0. Since the CHUNK size will determine how a genome is partitioned, smaller CHUNKS result in more blastz jobs and larger CHUNKS result in fewer jobs which will take longer to run. 
* SEQn_LIMIT decides what the maximum number of sequences should be for any partitioned file. This limit only effects SEQ1 or SEQ2 when they are 2bit files.  Some 2bit files have too many contigs which will result in ''blastz hippos'' (jobs taking forever compared to the other jobs).  SEQ1_LIMIT defaults to 30 and SEQ2_LIMIT defaults to 100.  By setting a lower limit, ''hippos'' may be avoided but more jobs will result.  By setting this limit higher, fewer but longer running jobs may result. ''Apologies to Hippopotamus amphibius.''
* The specifics of the target and query genomes may result in very different blastz run sizes, but CHUNK and LIMIT can be used to tune the size of the blastz run. One means of tuning is to run <CODE>doBlastzChainNet.pl</CODE> with the <CODE>-stop=partition</CODE> flag to test the results of the partition step until satisfied, then continue with the <code>-continue=blastz</code>.  To determine the number of blastz jobs that will result from a given partition, multiply the number of lines in <i>{target}</i>.lst by the number of lines in the <i>{query}</i>.lst which will be found in the <code>run.blastz</code> directory.  Expect a tuned blastz run for two 3Mb genomes will have in 50K-100K jobs in the blastz batch.
* While tuning the number of jobs is easy, what the goal should be is to tune the time that a job runs.  Angie recommends an *average* job time to be ~3-7 minutes, no less than 45s and no greater than 20 minutes.  Jim probably favors smaller numbers for each of those.
 
==axtChain matrix parameters==
The "medium" gap score matrix, tuned for the mouse-human distance is:
The "medium" gap score matrix, tuned for the mouse-human distance is:


Line 460: Line 795:
bothGap 625 660 700  750  900  1400  4000  8000  16000  32000  57000
bothGap 625 660 700  750  900  1400  4000  8000  16000  32000  57000
</PRE>
</PRE>
==the tree diagram==
Created from the Hg18 28-way diagram, with the two new species, ponAbe2 and calJac1 manually inserted:
<PRE>
((((((((
(((Mouse_mm9:0.076274,Rat_rn4:0.084383):0.200607,
    GuineaPig_cavPor2:0.202990):0.034350,
        Rabbit_oryCun1:0.208548):0.014587,
((((((Human_hg18:0.005873,Chimp_panTro2:0.007668):0.013037,
  Orangutan_ponAbe2:0.02):0.013037,Rhesus_rheMac2:0.031973):0.0365,
        Marmoset_calJac1:0.07):0.0365,Bushbaby_otoGar1:0.151185):0.015682,
          TreeShrew_tupBel1:0.162844):0.006272):0.019763,
((Shrew_sorAra1:0.248532,Hedgehog_eriEur1:0.222255):0.045693,
(((Dog_canFam2:0.101137,Cat_felCat3:0.098203):0.048213,
    Horse_equCab1:0.099323):0.007287,
        Cow_bosTau3:0.163945):0.012398):0.018928):0.030081,
(Armadillo_dasNov1:0.133274,(Elephant_loxAfr1:0.103030,
        Tenrec_echTel1:0.232706):0.049511):0.008424):0.213469,
Opossum_monDom4:0.320721):0.088647,
    Platypus_ornAna1:0.488110):0.118797,
        (Chicken_galGal3:0.395136,Lizard_anoCar1:0.513962):0.093688):0.151358,
            Frog_xenTro2:0.778272):0.174596,
(((Tetraodon_tetNig1:0.203933,Fugu_fr2:0.239587):0.203949,
    (Stickleback_gasAcu1:0.314162,Medaka_oryLat1:0.501915):0.055354):0.346008,
Zebrafish_danRer5:0.730028):0.174596);
</PRE>
==phastCons graph procedure==
The maf files from the 30-way alignment are split for phastCons with msa_split and the following parameters:
<PRE>
msa_split <eachChrom>.maf -i MAF -M mm9.thisChr.fa -o SS -w 10000000,0 -I 1000 -B 5000
</PRE>
An attempt was made to run phyloFit on some of the smaller chromosome .ss files, but this procedure consumes too much CPU time and did not give a good result for a starting-tree to phastCons.  Instead, using the starting-tree from the Hg18 28-way phastCons operation, insert the two new organisms into the tree diagram, with similar distances as seen in the failed attempt of using phyloFit on several small chromosomes, to arrive at a starting-tree.mod file:
<PRE>
ALPHABET: A C G T
ORDER: 0
SUBST_MOD: REV
BACKGROUND: 0.295000 0.205000 0.205000 0.295000
RATE_MAT:
  -0.990634    0.178819    0.490738    0.321078
  0.257325  -1.001677    0.186563    0.557790
  0.706184    0.186563  -1.161873    0.269126
  0.321078    0.387615    0.187019  -0.895713
TREE: (((((((((((mm9:0.076274,rn4:0.084383):0.200607,cavPor2:0.202990):0.034350,
oryCun1:0.208548):0.014587,((((((hg18:0.005873,panTro2:0.007668):0.013037,
ponAbe2:0.019969):0.013037,rheMac2:0.031973):0.013300,calJac1:0.074420):0.068818,
otoGar1:0.151185):0.015682,tupBel1:0.162844):0.006272):0.019763,
((sorAra1:0.248532,eriEur1:0.222255):0.045693,
(((canFam2:0.101137,felCat3:0.098203):0.048213,equCab1:0.099323):0.007287,
bosTau3:0.163945):0.012398):0.018928):0.030081,(dasNov1:0.133274,
(loxAfr1:0.103030,echTel1:0.232706):0.049511):0.008424):0.213469,
monDom4:0.320721):0.088647,ornAna1:0.488110):0.118797,
(galGal3:0.395136,anoCar1:0.513962):0.093688):0.151358,xenTro2:0.778272):0.174596,
(((tetNig1:0.203933,fr2:0.239587):0.203949,
(gasAcu1:0.314162,oryLat1:0.501915):0.055354):0.346008,danRer5:0.730028):0.174596);
</PRE>
Then, running phastCons on each .ss file with the parameters:
<PRE>
phastCons <oneFile>.ss starting-tree.mod --rho .31 --expected-length 45
  --target-coverage .3 --quiet --seqname <chrName> --idpref <chrName>
  --most-conserved <oneFile>.bed --score
</PRE>
Combine all the most-conserved .bed files into a single mostConserved.bed file and add lod score calculation:
<PRE>
cat bed/*/chr*.bed | sort -k1,1 -k2,2n | \
        awk '{printf "%s\t%d\t%d\tlod=%d\t%s\n", $1, $2, $3, $5, $5;}' | \
            /cluster/bin/scripts/lodToBedScore /dev/stdin > mostConserved.bed
</PRE>
That result is loaded for the ''Most Conserved'' track in the table phastConsElements30way.  A featureBits measurement verifies a target coverage of refGene CDS near 70%:
<PRE>
$  featureBits mm9 -enrichment refGene:cds phastConsElements30way
refGene:cds 1.167%, phastConsElements30way 5.841%, both 0.863%, cover 73.90%, enrich 12.65x
</PRE>
Encoding the wiggle data from the pp files from the phastCons run:
<PRE>
for D in pp/chr*
do
    ls $D/*.pp | sort -n -t\. -k2
done | xargs cat \
        | wigEncode -noOverlap stdin phastCons30way.wig phastCons30way.wib
</PRE>
The ''sort'' of the files names gets them into chrom position numerical order for input to the wigEncode.  Any data input to wigEncode must be in chrom position numerical order for correct encoding.
==alternative phastCons graphs==
Two subsets of the organisms were used to compute alternative phastCons graphs.
<OL>
<LI> Euarchontoglires - mm9,rn4,cavPor2,oryCun1,hg18,panTro2,ponAbe2,
rheMac2,calJac1,otoGar1,tupBel1
<LI> Placentals - mm9,rn4,cavPor2,oryCun1,hg18,panTro2,ponAbe2,
rheMac2,calJac1,otoGar1,tupBel1,sorAra1,eriEur1,canFam2,felCat3,
equCab1,bosTau3,dasNov1,loxAfr1,echTel1
</OL>
The same operating parameters were used during the phastCons operation.
The starting-tree.mod file for each of these was created from the overall 30-way starting-tree.mod file trimmed down with tree_doctor.  Defining trees of:
Euarchontoglires:
<PRE>
TREE: ((((mm9:0.076274,rn4:0.084383):0.200607,cavPor2:0.202990):0.034350,
oryCun1:0.208548):0.014587,((((((hg18:0.005873,panTro2:0.007668):0.013037,
ponAbe2:0.019969):0.013037,rheMac2:0.031973):0.013300,calJac1:0.074420):0.068818,
otoGar1:0.151185):0.015682,tupBel1:0.162844):0.006272);
</PRE>
Placentals:
<PRE>
TREE: ((((((mm9:0.076274,rn4:0.084383):0.200607,cavPor2:0.202990):0.034350,
oryCun1:0.208548):0.014587,((((((hg18:0.005873,panTro2:0.007668):0.013037,
ponAbe2:0.019969):0.013037,rheMac2:0.031973):0.013300,calJac1:0.074420):0.068818,
otoGar1:0.151185):0.015682,tupBel1:0.162844):0.006272):0.019763,
((sorAra1:0.248532,eriEur1:0.222255):0.045693,
(((canFam2:0.101137,felCat3:0.098203):0.048213,equCab1:0.099323):0.007287,
bosTau3:0.163945):0.012398):0.018928):0.030081,(dasNov1:0.133274,
(loxAfr1:0.103030,echTel1:0.232706):0.049511):0.008424);
</PRE>
==phastCons histograms==
Histogram data is created from the loaded track data with the ''hgWiggle'' command:
<PRE>
$ hgWiggle -doHistogram -hBinSize=0.001 -hBinCount=1000 -hMinVal=0.0 -verbose=2 \
            -db=mm9 phastCons30way > histogram.data 2>&1
</PRE>
[[Image:PhastCons30way.png|frame|phastCons Histogram]]
Euarchontoglires subset:
[[Image:Mm9Euarchontoglires.png|frame|Euarchontoglires phastCons Histogram]]
Placental subset:
[[Image:Mm9Placental.png|frame|Placental phastCons Histogram]]
[[Category:Technical FAQ]]
[[Category:Comparative Genomics]]

Latest revision as of 18:45, 21 July 2011

See also:

Mouse mm9 multiple alignment/conservation track

To avoid artifacts in downstream processing of the UCSC multiple alignments, it is important to be careful on the use of the parameters used in the blastz processing pipeline. There are a number of steps in the pipeline and a variety of tunable parameters involved. This page will track the various parameters used in the alignments as they proceed toward the completion of a multiple alignment conservation track on the mm9 mouse (NCBI build 37) assembly

The chrom count in the table below does not include haplotypes, chr*_random, chrUn or chrM unless chrUn or scaffolds are the only sequences for that assembly.

The genome size has the same limitation as the chrom count, no randoms.

Tree distances are from the hg18 28-way measurements, with ponAbe2 and calJac1 manually inserted into the tree.

1 Mb = 1000 * 1000 = 1,000,000 bases

Types of mafs used in the multiple alignment:

  • 2X mammal assemblies == reciprocal best net
    calJac1, cavPor2, tupBel1, otoGar1, dasNov1, oryCun1, felCat3, loxAfr1, eriEur1, sorAra1, echTel1
  • high coverage placental mammals == syntenic net
    rn4, hg18, rheMac2, ponAbe2, panTro2, equCab1, canFam2, bosTau3
  • all others == standard net
    monDom4, ornAna1, galGal3, anoCar1, xenTro2, gasAcu1, danRer5, tetNig1, fr2, oryLat1
30-way phylogenetic tree

axtChain parameters and end results

name db chrom
count (*)
genome
size
tree
distance
axtChain
minScore
axtChain
linearGap
% of mm9
matched
% of other
matched by mm9
done
mouse mm9
self alignment
21 2654 Mb 0.0 10000 medium 14.458 N/A 2007-08-31
rat rn4 21 2718 Mb 0.160657 5000 medium 65.380 66.538 2007-08-31
human hg18 24 3080 Mb 0.452619 3000 medium 38.499 35.201 2007-08-16
Rhesus rheMac2 22 2864 Mb 0.452745 3000 medium 38.087 41.335 2007-09-07
Orangutan ponAbe2 25 3029 Mb 0.453809 3000 medium 34.902 30.659 2007-09-21
Marmoset calJac1 49724 3029 Mb 0.454272 3000 medium 32.971 30.302 2007-09-07
Chimp panTro2 25 3175 Mb 0.454514 3000 medium 37.674 34.269 2007-09-25
GuineaPig cavPor2 295514 3403 Mb 0.479871 3000 medium 18.326 N/A 2007-09-21
Horse equCab1 32 2056 Mb 0.479871 3000 medium 34.782 37.217 2007-09-25
TreeShrew tupBel1 150851 3660 Mb 0.494934 3000 medium 21.099 N/A 2007-10-01
Bushbaby otoGar1 120882 3420 Mb 0.498957 3000 medium 22.972 N/A 2007-09-28
Armadillo dasNov1 304391 3856 Mb 0.517360 3000 medium 16.547 N/A 2007-10-02
Rabbit oryCun1 215471 3464 Mb 0.519779 3000 medium 18.945 N/A 2007-09-29
Cat felCat3 217790 4045 Mb 0.530610 3000 medium 19.077 N/A 2007-09-29
Dog canFam2 39 2445 Mb 0.533544 3000 medium 32.362 34.891 2007-09-05
Elephant loxAfr1 233134 3707 Mb 0.536627 3000 medium 18.052 N/A 2007-10-01
Cow bosTau3 30 2434 Mb 0.540852 3000 medium 26.352 25.909 2007-09-25
Hedgehog eriEur1 379801 3367 Mb 0.632457 3000 medium 10.022 N/A 2007-10-02
Shrew sorAra1 262057 2936 Mb 0.658734 3000 medium 9.556 N/A 2007-10-01
Tenrec echTel1 325491 3823 Mb 0.666303 3000 medium 11.141 14.144 2007-09-27
Opossum monDom4 9 3431 Mb 0.909852 5000 loose 9.752 7.254 2007-09-27
Platypus ornAna1 201522 1996 Mb 1.165888 5000 loose 5.417 7.359 2007-09-25
Chicken galGal3 33 1032 Mb 1.285399 5000 loose 3.729 8.152 2007-09-25
Lizard anoCar1 7233 1781 Mb 1.404225 5000 loose 3.406 4.934 2007-09-21
X. tropicalis xenTro2 19759 1513 Mb 1.726205 5000 loose 3.131 2.651 2007-09-24
Stickleback gasAcu1 21 400 Mb 2.012649 5000 loose 1.849 9.791 2007-09-07
Zebrafish danRer5 25 1547 Mb 2.027153 5000 loose 3.255 4.625 2007-09-13
Tetraodon tetNig1 21 217 Mb 2.051015 5000 loose 1.763 12.341 2007-09-07
Fugu fr2 1 400 Mb 2.086669 5000 loose 1.794 10.784 2007-09-07
Medaka oryLat1 24 724 Mb 2.200402 5000 loose 1.933 6.495 2007-08-31


(*) chrom count does not include haplotypes, chr*_random, chrUn or chrM unless chrUn or scaffolds are the only sequences for that assembly.

The genome size has the same limitation as the chrom count, no randoms.

Tree distances are from the hg18 28-way measurements, with ponAbe1 and calJac1 manually inserted into the tree.

blastz alignment parameters details

One special run, Rat rn4:

M=40M, K=4500, L=4500, Y=15000, T=2 O=600, Q=mouse_rat.q E=55
query abridged
repeats
M K L Q Y
Rat rn4 yes 40M 4500 4500 mouse-rat 15000
Human hg18 yes 40M 3K 3K default 9400
Rhesus rheMac2 yes 40M 3K 3K default 9400
Orangutan ponAbe2 no 50 3K 3K default 9400
Marmoset calJac1 no 50 3K 3K default 9400
Chimp panTro2 yes 40M 3K 3K default 9400
GuineaPig cavPor2 no 50 3K 3K default 9400
Horse equCab1 no 40M 3K 3K default 9400
TreeShrew tupBel1 no 50 3K 3K default 9400
Bushbaby otoGar1 no 50 3K 3K default 9400
Armadillo dasNov1 no 50 3K 3K default 9400
Rabbit oryCun1 no 50 3K 3K default 9400
Cat felCat3 no 50 3K 3K default 9400
Dog canFam2 yes 40M 3K 3K default 9400
Elephant loxAfr1 no 50 3K 3K default 9400
Cow bosTau3 no 40M 3K 3K default 9400
Hedgehog eriEur1 no 50 3K 3K default 9400
Shrew sorAra1 no 50 3K 3K default 9400
Tenrec echTel1 no 50 3K 3K default 9400
Opossum monDom4 no 50 2200 6000 HoxD55 3400
Platypus ornAna1 no 50 2200 6000 HoxD55 3400
Chicken galGal3 yes 40M 2200 6000 HoxD55 3400
Lizard anoCar1 no 50 2200 6000 HoxD55 3400
X_tropicalis xenTro2 no 50 2200 6000 HoxD55 3400
Stickleback gasAcu1 no 50 2200 6000 HoxD55 3400
Zebrafish danRer5 no 50 2200 6000 HoxD55 3400
Tetraodon tetNig1 no 50 2200 6000 HoxD55 3400
Fugu fr2 no 50 2200 6000 HoxD55 3400
Medaka oryLat1 no 50 2200 6000 HoxD55 3400

default blastz parameters

m=80  v=0  B=2  C=0  E=30  G=0  H=0  K=3000 L=K
M=0 O=400 P=1 R=0 T=1 W=8 X=10*(A-to-A match score)
Y=O+300*E Z=1

From the blastz usage message:

Default values are given in parentheses.
  m(80M) bytes of space for trace-back information
  v(0) 0: quiet; 1: verbose progress reports to stderr
  B(2) 0: single strand; >0: both strands
  C(0) 0: no chaining; 1: just output chain; 2: chain and extend;
       3: just output HSPs
  E(30) gap-extension penalty.
  G(0) diagonal chaining penalty.
  H(0) interpolate between alignments at threshold K = argument.
  K(3000) threshold for MSPs
  L(K) threshold for gapped alignments
  M(0) mask any base in seq1 hit this many times; 0 = no dynamic masking
  O(400) gap-open penalty.
  P(1) 0: entropy not used; 1: entropy used; >1 entropy with feedback.
  Q load the scoring matrix from a file.
  R(0) antidiagonal chaining penalty.
  T(1) 0: W-bp words;  1: 12of19;  2: 12of19 without transitions.
                       3: 14of22;  4: 14of22 without transitions.
  W(8) word size (unused unless T=0)
  X(10*(A-to-A match score)) X-drop parameter for ungapped extension.
  Y(O+300E) X-drop parameter for gapped extension.
  Z(1) increment between successive words in sequence 1.
The default
scoring matrix is:
  The HoxD55
scoring matrix is:
 The mouse-rat
scoring matrix is:
 ACGT
A91-114-31-123
C-114100-125-31
G-31-125100-114
T-123-31-11491

O=400 E=30

 
 ACGT
A91-90-25-100
C-90100-100-25
G-25-100100-90
T-100-25-9091

O=400 E=30

 
 ACGT
A56-109-45-137
C-109100-103-45
G-45-103100-109
T-137-45-10956

O=600 E=55


Additional doBlastzChainNet.pl parameters to tune

These parameters for blastz are placed in the DEF file used by the doBlastzChainNet.pl perl script. There are some additional parameters worth discussing:

  • SEQ1* variables describe the target/reference assembly, while SEQ2* variables describe the query assembly.
  • SEQn_CHUNK and SEQn_LAP determine the step size and overlap size of chunks into which large target or query sequences are to be split before alignment. SEQ1_LAP is always set to 10000 for reasons described in the blastz paper, while SEQ2_LAP is traditionaly set to 0. Since the CHUNK size will determine how a genome is partitioned, smaller CHUNKS result in more blastz jobs and larger CHUNKS result in fewer jobs which will take longer to run.
  • SEQn_LIMIT decides what the maximum number of sequences should be for any partitioned file. This limit only effects SEQ1 or SEQ2 when they are 2bit files. Some 2bit files have too many contigs which will result in blastz hippos (jobs taking forever compared to the other jobs). SEQ1_LIMIT defaults to 30 and SEQ2_LIMIT defaults to 100. By setting a lower limit, hippos may be avoided but more jobs will result. By setting this limit higher, fewer but longer running jobs may result. Apologies to Hippopotamus amphibius.
  • The specifics of the target and query genomes may result in very different blastz run sizes, but CHUNK and LIMIT can be used to tune the size of the blastz run. One means of tuning is to run doBlastzChainNet.pl with the -stop=partition flag to test the results of the partition step until satisfied, then continue with the -continue=blastz. To determine the number of blastz jobs that will result from a given partition, multiply the number of lines in {target}.lst by the number of lines in the {query}.lst which will be found in the run.blastz directory. Expect a tuned blastz run for two 3Mb genomes will have in 50K-100K jobs in the blastz batch.
  • While tuning the number of jobs is easy, what the goal should be is to tune the time that a job runs. Angie recommends an *average* job time to be ~3-7 minutes, no less than 45s and no greater than 20 minutes. Jim probably favors smaller numbers for each of those.

axtChain matrix parameters

The "medium" gap score matrix, tuned for the mouse-human distance is:

tableSize    11
smallSize   111
position  1   2   3   11  111  2111  12111  32111   72111  152111  252111
qGap    350 425 450  600  900  2900  22900  57900  117900  217900  317900
tGap    350 425 450  600  900  2900  22900  57900  117900  217900  317900
bothGap 750 825 850 1000 1300  3300  23300  58300  118300  218300  318300

The "loose" gap score matrix, tuned for the chicken-human distance is:

tablesize    11
smallSize   111
position  1   2   3   11  111  2111  12111  32111  72111  152111  252111
qGap    325 360 400  450  600  1100   3600   7600  15600   31600   56600
tGap    325 360 400  450  600  1100   3600   7600  15600   31600   56600
bothGap 625 660 700  750  900  1400   4000   8000  16000   32000   57000

the tree diagram

Created from the Hg18 28-way diagram, with the two new species, ponAbe2 and calJac1 manually inserted:

((((((((

 (((Mouse_mm9:0.076274,Rat_rn4:0.084383):0.200607,
    GuineaPig_cavPor2:0.202990):0.034350,
        Rabbit_oryCun1:0.208548):0.014587,

((((((Human_hg18:0.005873,Chimp_panTro2:0.007668):0.013037,
   Orangutan_ponAbe2:0.02):0.013037,Rhesus_rheMac2:0.031973):0.0365,
        Marmoset_calJac1:0.07):0.0365,Bushbaby_otoGar1:0.151185):0.015682,
           TreeShrew_tupBel1:0.162844):0.006272):0.019763,

 ((Shrew_sorAra1:0.248532,Hedgehog_eriEur1:0.222255):0.045693,

 (((Dog_canFam2:0.101137,Cat_felCat3:0.098203):0.048213,
    Horse_equCab1:0.099323):0.007287,
        Cow_bosTau3:0.163945):0.012398):0.018928):0.030081,

 (Armadillo_dasNov1:0.133274,(Elephant_loxAfr1:0.103030,
        Tenrec_echTel1:0.232706):0.049511):0.008424):0.213469,

 Opossum_monDom4:0.320721):0.088647,
    Platypus_ornAna1:0.488110):0.118797,
        (Chicken_galGal3:0.395136,Lizard_anoCar1:0.513962):0.093688):0.151358,
            Frog_xenTro2:0.778272):0.174596,

 (((Tetraodon_tetNig1:0.203933,Fugu_fr2:0.239587):0.203949,
    (Stickleback_gasAcu1:0.314162,Medaka_oryLat1:0.501915):0.055354):0.346008,
Zebrafish_danRer5:0.730028):0.174596);

phastCons graph procedure

The maf files from the 30-way alignment are split for phastCons with msa_split and the following parameters:

msa_split <eachChrom>.maf -i MAF -M mm9.thisChr.fa -o SS -w 10000000,0 -I 1000 -B 5000

An attempt was made to run phyloFit on some of the smaller chromosome .ss files, but this procedure consumes too much CPU time and did not give a good result for a starting-tree to phastCons. Instead, using the starting-tree from the Hg18 28-way phastCons operation, insert the two new organisms into the tree diagram, with similar distances as seen in the failed attempt of using phyloFit on several small chromosomes, to arrive at a starting-tree.mod file:

ALPHABET: A C G T 
ORDER: 0
SUBST_MOD: REV
BACKGROUND: 0.295000 0.205000 0.205000 0.295000 
RATE_MAT:
  -0.990634    0.178819    0.490738    0.321078 
   0.257325   -1.001677    0.186563    0.557790 
   0.706184    0.186563   -1.161873    0.269126 
   0.321078    0.387615    0.187019   -0.895713 
TREE: (((((((((((mm9:0.076274,rn4:0.084383):0.200607,cavPor2:0.202990):0.034350,
oryCun1:0.208548):0.014587,((((((hg18:0.005873,panTro2:0.007668):0.013037,
ponAbe2:0.019969):0.013037,rheMac2:0.031973):0.013300,calJac1:0.074420):0.068818,
otoGar1:0.151185):0.015682,tupBel1:0.162844):0.006272):0.019763,
((sorAra1:0.248532,eriEur1:0.222255):0.045693,
(((canFam2:0.101137,felCat3:0.098203):0.048213,equCab1:0.099323):0.007287,
bosTau3:0.163945):0.012398):0.018928):0.030081,(dasNov1:0.133274,
(loxAfr1:0.103030,echTel1:0.232706):0.049511):0.008424):0.213469,
monDom4:0.320721):0.088647,ornAna1:0.488110):0.118797,
(galGal3:0.395136,anoCar1:0.513962):0.093688):0.151358,xenTro2:0.778272):0.174596,
(((tetNig1:0.203933,fr2:0.239587):0.203949,
(gasAcu1:0.314162,oryLat1:0.501915):0.055354):0.346008,danRer5:0.730028):0.174596);

Then, running phastCons on each .ss file with the parameters:

phastCons <oneFile>.ss starting-tree.mod --rho .31 --expected-length 45
   --target-coverage .3 --quiet --seqname <chrName> --idpref <chrName>
   --most-conserved <oneFile>.bed --score

Combine all the most-conserved .bed files into a single mostConserved.bed file and add lod score calculation:

cat bed/*/chr*.bed | sort -k1,1 -k2,2n | \
        awk '{printf "%s\t%d\t%d\tlod=%d\t%s\n", $1, $2, $3, $5, $5;}' | \
            /cluster/bin/scripts/lodToBedScore /dev/stdin > mostConserved.bed

That result is loaded for the Most Conserved track in the table phastConsElements30way. A featureBits measurement verifies a target coverage of refGene CDS near 70%:

$  featureBits mm9 -enrichment refGene:cds phastConsElements30way
refGene:cds 1.167%, phastConsElements30way 5.841%, both 0.863%, cover 73.90%, enrich 12.65x

Encoding the wiggle data from the pp files from the phastCons run:

for D in pp/chr*
do
    ls $D/*.pp | sort -n -t\. -k2
done | xargs cat \
        | wigEncode -noOverlap stdin phastCons30way.wig phastCons30way.wib

The sort of the files names gets them into chrom position numerical order for input to the wigEncode. Any data input to wigEncode must be in chrom position numerical order for correct encoding.

alternative phastCons graphs

Two subsets of the organisms were used to compute alternative phastCons graphs.

  1. Euarchontoglires - mm9,rn4,cavPor2,oryCun1,hg18,panTro2,ponAbe2, rheMac2,calJac1,otoGar1,tupBel1
  2. Placentals - mm9,rn4,cavPor2,oryCun1,hg18,panTro2,ponAbe2, rheMac2,calJac1,otoGar1,tupBel1,sorAra1,eriEur1,canFam2,felCat3, equCab1,bosTau3,dasNov1,loxAfr1,echTel1

The same operating parameters were used during the phastCons operation.

The starting-tree.mod file for each of these was created from the overall 30-way starting-tree.mod file trimmed down with tree_doctor. Defining trees of:

Euarchontoglires:

TREE: ((((mm9:0.076274,rn4:0.084383):0.200607,cavPor2:0.202990):0.034350,
oryCun1:0.208548):0.014587,((((((hg18:0.005873,panTro2:0.007668):0.013037,
ponAbe2:0.019969):0.013037,rheMac2:0.031973):0.013300,calJac1:0.074420):0.068818,
otoGar1:0.151185):0.015682,tupBel1:0.162844):0.006272);

Placentals:

TREE: ((((((mm9:0.076274,rn4:0.084383):0.200607,cavPor2:0.202990):0.034350,
oryCun1:0.208548):0.014587,((((((hg18:0.005873,panTro2:0.007668):0.013037,
ponAbe2:0.019969):0.013037,rheMac2:0.031973):0.013300,calJac1:0.074420):0.068818,
otoGar1:0.151185):0.015682,tupBel1:0.162844):0.006272):0.019763,
((sorAra1:0.248532,eriEur1:0.222255):0.045693,
(((canFam2:0.101137,felCat3:0.098203):0.048213,equCab1:0.099323):0.007287,
bosTau3:0.163945):0.012398):0.018928):0.030081,(dasNov1:0.133274,
(loxAfr1:0.103030,echTel1:0.232706):0.049511):0.008424);

phastCons histograms

Histogram data is created from the loaded track data with the hgWiggle command:

$ hgWiggle -doHistogram -hBinSize=0.001 -hBinCount=1000 -hMinVal=0.0 -verbose=2 \
            -db=mm9 phastCons30way > histogram.data 2>&1
phastCons Histogram

Euarchontoglires subset:

Euarchontoglires phastCons Histogram

Placental subset:

Placental phastCons Histogram