Mm9 multiple alignment

From genomewiki
Revision as of 21:53, 17 August 2007

To avoid artifacts in downstream processing of the UCSC multiple alignments, the parameters used in the blastz processing pipeline must be chosen carefully. The pipeline has a number of steps and a variety of tunable parameters. This page tracks the parameters used in the alignments as they proceed toward completion of a multiple alignment conservation track on the mm9 mouse (NCBI build 37) assembly.

axtChain parameters and end results

sequence    tree      genome    axtChain  axtChain   % of mm9  % of other      done
            distance  size      minScore  linearGap  matched   matched by mm9
rat rn4     0.1587    2,702 Mb  3000      medium     68.357    69.541          16 August
human hg18  0.4667    2,963 Mb  3000      medium     38.499    35.201          16 August

blastz alignment parameters details

target      query    abridged   target size  query size  H     M
                     repeats    (overlap)    (overlap)
mm9         rat rn4  yes, B=0   10M (10K)    10M (0)     2000  40M
human hg18  mm9      yes, B=0   10M (0)      10M (10K)   2000  40M
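
The size and overlap columns describe how each sequence is cut into pieces before blastz runs. The following Python fragment is only a sketch of that idea, under an assumed convention: each piece starts every `chunk` bases and extends `overlap` bases past its nominal end. The function name `chunk_ranges` and the 25,000,000 base sequence length are hypothetical, and the actual partitioning script in the pipeline may handle boundaries differently.

```python
def chunk_ranges(seq_len, chunk=10_000_000, overlap=10_000):
    """Return half-open (start, end) ranges covering [0, seq_len).

    Assumed convention (not taken from the pipeline itself): pieces start
    every `chunk` bases and run `overlap` bases past their nominal end.
    """
    ranges = []
    start = 0
    while start < seq_len:
        end = min(start + chunk + overlap, seq_len)
        ranges.append((start, end))
        if end == seq_len:
            break  # the last piece already reaches the end of the sequence
        start += chunk
    return ranges


# Hypothetical 25 Mb sequence, cut as "10M (10K)" and as "10M (0)".
with_overlap = chunk_ranges(25_000_000, chunk=10_000_000, overlap=10_000)
no_overlap = chunk_ranges(25_000_000, chunk=10_000_000, overlap=0)
print(with_overlap)
print(no_overlap)
```

With a 10K overlap, neighbouring pieces share 10,000 bases, so an alignment that crosses a nominal piece boundary can still be found inside one piece.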


default blastz parameters

m=80  v=0  B=2  C=0  E=30  G=0  H=0  K=3000 L=K
M=0 O=400 P=1 R=0 T=1 W=8 X=10*(A-to-A match score)
Y=O+300*E Z=1

From the blastz usage message:

Default values are given in parentheses.
  m(80M) bytes of space for trace-back information
  v(0) 0: quiet; 1: verbose progress reports to stderr
  B(2) 0: single strand; >0: both strands
  C(0) 0: no chaining; 1: just output chain; 2: chain and extend;
       3: just output HSPs
  E(30) gap-extension penalty.
  G(0) diagonal chaining penalty.
  H(0) interpolate between alignments at threshold K = argument.
  K(3000) threshold for MSPs
  L(K) threshold for gapped alignments
  M(0) mask any base in seq1 hit this many times; 0 = no dynamic masking
  O(400) gap-open penalty.
  P(1) 0: entropy not used; 1: entropy used; >1 entropy with feedback.
  Q load the scoring matrix from a file.
  R(0) antidiagonal chaining penalty.
  T(1) 0: W-bp words;  1: 12of19;  2: 12of19 without transitions.
                       3: 14of22;  4: 14of22 without transitions.
  W(8) word size (unused unless T=0)
  X(10*(A-to-A match score)) X-drop parameter for ungapped extension.
  Y(O+300E) X-drop parameter for gapped extension.
  Z(1) increment between successive words in sequence 1.
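
Several of the defaults above are defined in terms of other parameters (L=K, X=10*(A-to-A match score), Y=O+300*E). The Python sketch below shows how those derived values could be worked out for a given set of overrides. It is an illustration only: the function name `resolve_blastz_defaults` is hypothetical, and the A-to-A match score of 91 is an assumption about the default scoring matrix that is not stated on this page.

```python
def resolve_blastz_defaults(overrides=None, match_score=91):
    """Fill in blastz defaults, including the derived ones (L, X, Y).

    match_score=91 is an assumed A-to-A score for the default matrix;
    pass a different value if a matrix is loaded with Q.
    """
    params = {
        "B": 2, "C": 0, "E": 30, "G": 0, "H": 0, "K": 3000,
        "M": 0, "O": 400, "P": 1, "R": 0, "T": 1, "W": 8, "Z": 1,
    }
    params.update(overrides or {})
    # Derived defaults apply only when the caller did not set them.
    params.setdefault("L", params["K"])                      # L = K
    params.setdefault("X", 10 * match_score)                 # X = 10 * (A-to-A score)
    params.setdefault("Y", params["O"] + 300 * params["E"])  # Y = O + 300*E
    return params


# The B=0 and H=2000 settings listed in the alignment table above.
resolved = resolve_blastz_defaults({"B": 0, "H": 2000})
print(resolved["L"], resolved["X"], resolved["Y"])
```

With the default O=400 and E=30, Y works out to 400 + 300*30 = 9400, and L follows K at 3000.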

matrix parameters

The "medium" gap score matrix, tuned for the mouse-human distance, is:

tableSize    11
smallSize   111
position  1   2   3   11  111  2111  12111  32111   72111  152111  252111
qGap    350 425 450  600  900  2900  22900  57900  117900  217900  317900
tGap    350 425 450  600  900  2900  22900  57900  117900  217900  317900
bothGap 750 825 850 1000 1300  3300  23300  58300  118300  218300  318300

The "loose" gap score matrix, tuned for the chicken-human distance, is:

tablesize    11
smallSize   111
position  1   2   3   11  111  2111  12111  32111  72111  152111  252111
qGap    325 360 400  450  600  1100   3600   7600  15600   31600   56600
tGap    325 360 400  450  600  1100   3600   7600  15600   31600   56600
bothGap 625 660 700  750  900  1400   4000   8000  16000   32000   57000
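
To compare the two matrices, it can help to look up the cost of a gap whose size falls between the tabulated positions. The Python sketch below assumes the cost is piecewise linear between positions and continues with the slope of the last segment beyond the final position. That interpolation rule is an assumption suggested by the name linearGap, not a statement of how axtChain computes it, and the names `MEDIUM`, `LOOSE`, and `gap_cost` are hypothetical.

```python
from bisect import bisect_right

POSITIONS = [1, 2, 3, 11, 111, 2111, 12111, 32111, 72111, 152111, 252111]

# Values copied from the tables above (tGap equals qGap in both matrices).
MEDIUM = {
    "position": POSITIONS,
    "qGap": [350, 425, 450, 600, 900, 2900, 22900, 57900, 117900, 217900, 317900],
    "bothGap": [750, 825, 850, 1000, 1300, 3300, 23300, 58300, 118300, 218300, 318300],
}
LOOSE = {
    "position": POSITIONS,
    "qGap": [325, 360, 400, 450, 600, 1100, 3600, 7600, 15600, 31600, 56600],
    "bothGap": [625, 660, 700, 750, 900, 1400, 4000, 8000, 16000, 32000, 57000],
}


def gap_cost(table, row, size):
    """Cost of a gap of `size` bases, assuming linear interpolation."""
    pos, cost = table["position"], table[row]
    if size <= pos[0]:
        return float(cost[0])
    i = bisect_right(pos, size)
    if i == len(pos):
        i -= 1  # past the last position: extend the final segment
    x0, x1 = pos[i - 1], pos[i]
    y0, y1 = cost[i - 1], cost[i]
    return y0 + (y1 - y0) * (size - x0) / (x1 - x0)


for size in (1, 61, 2111, 252111):
    print(size, gap_cost(MEDIUM, "qGap", size), gap_cost(LOOSE, "qGap", size))
```

Under this assumption, a 2111-base gap in one sequence costs 2900 with the medium matrix and 1100 with the loose matrix, which shows how much more tolerant the loose matrix is of long gaps.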