Whole genome alignment howto

From genomewiki

Revision as of 14:42, 19 September 2007

The whole genome alignments are definitely the biggest mystery of the UCSC browser for me. doBlastz and colleagues don't really make the system any easier to understand, as everything is now buried even deeper and full of parasol statements. So I'm trying to re-create whole genome alignments to understand them better. Please correct my mistakes in the following. This page is far from finished, and I hope to evolve it into a real howto.

File formats we have to know:

  1. lav: a compact form to store genomic pairwise alignments, using only numbers (pos of match + identities)
  2. axt: a more human-readable way to store *pairwise* alignments: positions + aligned sequences
  3. maf: an extended version of axt, *multiple* genomic alignments: assemblies + positions + aligned sequences

So to convert from lav to axt/maf we need to get those genomic sequences from the nib/fa files.
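To make it concrete why the sequences are needed, here is a minimal Python sketch of the idea, not of lavToAxt itself. It assumes the alignment has already been reduced to a list of ungapped blocks given as hypothetical (t_start, q_start, length) tuples in 0-based coordinates; real lav files use their own record layout and coordinate conventions, which are not reproduced here. The function name `blocks_to_aligned` is made up for this illustration.

```python
def blocks_to_aligned(t_seq, q_seq, blocks):
    """Rebuild gapped alignment text from ungapped blocks plus the sequences.

    blocks: list of (t_start, q_start, length), 0-based, sorted and
    non-overlapping in both sequences (a simplifying assumption).
    """
    t_out, q_out = [], []
    prev_t = prev_q = None
    for t_start, q_start, length in blocks:
        if prev_t is not None:
            # Bases between two blocks have no partner in the other sequence.
            t_gap = t_seq[prev_t:t_start]
            q_gap = q_seq[prev_q:q_start]
            # Unaligned target bases face dashes in the query, and vice versa.
            t_out.append(t_gap + "-" * len(q_gap))
            q_out.append("-" * len(t_gap) + q_gap)
        # The block itself is copied base for base from both sequences.
        t_out.append(t_seq[t_start:t_start + length])
        q_out.append(q_seq[q_start:q_start + length])
        prev_t, prev_q = t_start + length, q_start + length
    return "".join(t_out), "".join(q_out)


if __name__ == "__main__":
    target, query = blocks_to_aligned("aaactg", "aatg", [(0, 0, 2), (4, 2, 2)])
    print(target)
    print(query)
```

The numbers alone only say where the blocks are; the letters in the output have to be copied from the sequence files, which is why the conversion cannot work without the nib/fa files.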

For the historical record: I assume axt started in mouseStuff and maf started with ratStuff. "axtBest is ancient history. It has been replaced by the chaining and netting process, which does a better job of finding the "best" alignment to cover a given region." (angie)


The simplest case: Two genomes, first one is the reference

  • The two genomes are aligned with BLASTZ (we don't use blastz's own chaining, see the discussion at http://www.soe.ucsc.edu/pipermail/genome/2007-March/013151.html (angie)). That generates lav files, which have to be converted to axt (lavToAxt)
  • As every genomic fragment can match several others, we keep only the best match for a given part: first do axtSort, then filter with axtBest.
  • Axt can then be converted to maf with axtToMaf (needing faSize, why the heck do mafs include the chromosome sizes?)
  • I think we don't need any part of multiz for all of this
  • (Multiz contains a tool that converts lav to maf, lav2maf, and UCSC includes one called lavToMaf, but neither takes care of fragments that match twice)
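The axt-to-maf step can be sketched in a few lines of Python. This is an illustration of the two formats as I understand them, not the axtToMaf source: an axt header carries 1-based, inclusive coordinates, while a maf "s" line carries a 0-based start, a length, the strand and the total size of the source sequence. The function names and the size dictionaries (name to length, the kind of information faSize reports) are assumptions made for this example.

```python
def axt_to_maf(header, t_text, q_text, t_sizes, q_sizes):
    """Turn one axt record into one maf block (simplified sketch).

    header:  axt summary line, e.g. "0 chrA 1 6 chrB 1 4 - 100"
    t_sizes, q_sizes: hypothetical dicts mapping sequence name -> length
    """
    (_, t_name, t_start, t_end,
     q_name, q_start, q_end, q_strand, score) = header.split()
    t_start, t_end, q_start, q_end = map(int, (t_start, t_end, q_start, q_end))
    rows = [
        # axt is 1-based inclusive, maf wants a 0-based start plus a size.
        (t_name, t_start - 1, t_end - t_start + 1, "+", t_sizes[t_name], t_text),
        (q_name, q_start - 1, q_end - q_start + 1, q_strand, q_sizes[q_name], q_text),
    ]
    lines = [f"a score={score}"]
    for name, start, size, strand, src_size, text in rows:
        lines.append(f"s {name} {start} {size} {strand} {src_size} {text}")
    return "\n".join(lines) + "\n"


def plus_strand_start(start, size, src_size):
    """Map a minus-strand maf start back to plus-strand coordinates."""
    return src_size - (start + size)


if __name__ == "__main__":
    print(axt_to_maf("0 chrA 1 6 chrB 1 4 - 100", "aaactg", "aa--tg",
                     {"chrA": 10}, {"chrB": 8}), end="")
```

The second helper hints at one use of the size column: minus-strand positions are counted from the end of the reverse-complemented sequence, so a reader of the maf can only translate them back to the forward strand if the total length is stored next to them.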

The normal case: Many genomes, one is the reference

  • The reference genome is aligned with all others with BLASTZ. That creates lav-files. They are converted to psl.
  • Two matching fragments next to each other are joined into one fragment (axtChain, "chaining")
  • Chains are simply better alignments (why can't we simply use blastz's chains?). If several alignments overlap, we still don't know which one is the best. This filtering used to be done with axtBest; now we use chainNet + netToAxt. (I have no clue why Kate mentioned netFilter here.)
  • We feed the tree into multiz. Multiz will use many local alignments to generate a multiple local alignment.
 Example:
    A aaactg  \
    B aa--tg   \     A aaa-ctg
                ->   B aa---tg
    A aaa-ctg  /     C aattt-g
    C aattt-g /
  • I think that multiz is not an aligner at all. It's just a "reformatter", rewriting pairwise alignments into multiple alignments.
  • Does multiz need a phylogenetic tree?
  • TBA is probably a real aligner.
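The small A/B/C example in the list above can be reproduced with a short Python sketch. It only implements the "rewrite two pairwise alignments that share a reference into one multiple alignment" idea from that example; it is not a description of how multiz works internally, and the function names are invented for this page. It assumes both pairwise alignments cover exactly the same reference bases.

```python
def split_by_ref(ref_row, other_row):
    """Split a pairwise alignment into per-reference-base pieces.

    Returns (inserts, bases): bases[i] is the character aligned to
    reference base i, inserts[i] is the text inserted before base i
    (inserts has one extra entry for text after the last base).
    """
    inserts, bases = [""], []
    for r, o in zip(ref_row, other_row):
        if r == "-":
            inserts[-1] += o      # column exists only in the other sequence
        else:
            bases.append(o)
            inserts.append("")
    return inserts, bases


def merge_on_reference(pair_ab, pair_ac):
    """Merge (A, B) and (A, C) alignment rows into rows for A, B and C."""
    (a1, b), (a2, c) = pair_ab, pair_ac
    ref = a1.replace("-", "")
    if ref != a2.replace("-", ""):
        raise ValueError("both alignments must cover the same reference bases")
    ins_b, base_b = split_by_ref(a1, b)
    ins_c, base_c = split_by_ref(a2, c)
    out_a, out_b, out_c = [], [], []
    for i in range(len(ref) + 1):
        # Insertions relative to the reference get columns of their own.
        out_a.append("-" * (len(ins_b[i]) + len(ins_c[i])))
        out_b.append(ins_b[i] + "-" * len(ins_c[i]))
        out_c.append("-" * len(ins_b[i]) + ins_c[i])
        if i < len(ref):
            out_a.append(ref[i])
            out_b.append(base_b[i])
            out_c.append(base_c[i])
    return "".join(out_a), "".join(out_b), "".join(out_c)


if __name__ == "__main__":
    rows = merge_on_reference(("aaactg", "aa--tg"), ("aaa-ctg", "aattt-g"))
    for name, row in zip("ABC", rows):
        print(name, row)
```

Note that B and C are never compared with each other here; their relationship is only implied through A. Whether a real multiple aligner does more than this is exactly the open question in the bullets above.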

(to be continued...)