Whole genome alignment howto
From genomewiki
The whole genome alignments are definitely the biggest mystery of the UCSC browser for me. The scripts like doBlastz and collegues automize the whole stuff for the folks working at UCSC but doesn't really make it any easier to fully understand the system as everything is buried now even more and full of parasol statements.
So I'm trying to re-create whole genome alignments to better understand how my whole-genome alignments are created. Thanks a lot to Angie for answering my questions!
Fileformats we have to know
- lav: a compact form to store genomic pairwise alignments created by Blastz, using only numbers (pos of match + identities)
- psl: similar to lav, but the most compact format at UCSC to store alignments, easier to parse than lav (positions of match + identities)
- axt: a much more readable way to store *pairwise* alignments: positions + aligned sequences (you can forget about this format, it's not used anymore, but the tools still contain axt... in their name although most also accept psl)
- maf: an extended version of axt, *multiple* genomic alignments: assemblies + positions + aligned sequences
You see: To convert from lav to axt/maf we need the genomic sequences from the nib/fa files.
Outline
- MASKING: Both genomes have to be repeatmasked and masked Tandem Repeat Finder (trf) first (thanks to Hiram for pointing this out)
- ALIGNING: The two genomes are aligned with BLASTZ (we don't use blastz's own chaining, see discussion (angie)). This generates lav-files, which have to be converted to psl (lavToPsl)
- CHAINING: As every genomic fragment can match with several others, we keep only the best match for a given part : first do axtSort, then filter with axtBest
- NETTING: Group blocks of chained alignments into longer stretches of synteny (netChain)
- MAF'ING: From the synteny-files (positions), get the sequences and re-create alignments
Outline - Normal case: Many genomes, one is the reference
- BLASTZ: The reference genome is aligned with all others with BLASTZ. That creates lav-files. They are converted to psl.
- Angie says: Our hgdownload $db/vs$OtherDb/README.txt pages give the blastz parameters used for a given run.
- It's a good idea to do this on a compute cluster. It will take ages if you try to do this for a larger genome on one single computer. If you're Europen, you can get an account on vital-it. There are national compute cluster also in the Netherlands and the UK that I know of.
- Two matching fragments next to each other are joined into one fragment (axtChain, "chaining")
- Chains are simply better alignments: If several alignments overlap, we still don't know which the best one is. This filtering was done with axtBest before, now we use chainNet + netToAxt . (I have no clue why Kate mentioned netFilter here.)
- We feed the tree into multiz. Multiz will use many local alignments to generate a multiple local alignment.
Example: A aaactg \ B aa--tg \ A aaa-ctg -> B aa---tg A aaa-ctg / C aattt-g C aattt-g /
- Multiz is not an aligner at all. It's just a "reformatter", rewriting pairwise alignments into multiple alignments. It needs a phylogenetic tree.
- TBA is more like a real aligner. But it's too slow currently for massiv genome-wide alignments.
Example
- I want to do a whole genome alignment of C. intestinalis V2 and C.savignyi V2.
- Downloaded ci2 from hgdownload.cse.ucsc.edu into the directory ci2
- Downloaded cs2 (only visible on hgwtest.cse.ucsc.edu, not on the real server nor the download server) from Ensembl into the directory cs2
- Sometimes genomes are as one big fasta file, we want to split it into smaller ones:
faSplit byName cs2/cs2.fa cs2/
rm -f cs2/cs2.fa
- The Ensembl genome cs2 does not seem to be tandem repeat masked yet, so I do:
for i in cs2/*.fa; do trfBig $i trf/`basename $i`; done
- Blastz seems to ignore lower case characters (took me one afternoon to figure this out), we simply convert everything to nib-format:
for i in in ci2/* cs2/*; do faToNib $i `echo $i | sed -e s/.fa/.nib/`; done
- Now blastz everythin onto everything, in theory we could issue this command:
for i in cs2/*.nib; do echo 'for j in ci2/*.nib; do blastz '$i' $j H=2000 Y=3400 L=6000 K=2200 Q=HoxD55.q > lav/`basename '$i' .nib`-`base name $j .nib`.lav; done'; done
- If you have a cluster, you can rsync the nib files onto it and add some cluster instructions:
for i in cs2/*.nib; do echo 'for j in ci2/*.nib; do bsub blastz '$i' $j H=2000 Y=3400 L=6000 K=2200 Q=HoxD55.q > lav/`basename '$i' .nib`-`base name $j .nib`.lav; done'; done
- If both genomes are spread across many scaffolds you have to play around with the sequences. Hiram proposed to join the smaller scaffolds into one big chromsome first, that's what they do at UCSC, and then run blastz on the "ChrUn"-virtual-chromosome. (this will create some false nets, going from one scaffold to another, but we don't have a choice here)
However, it will take ages if you are working on many scaffolds in both species
Remarks
- For the history records: Axt started in mouseStuff and maf started with the ratStuff. There is an older tool called *axtBest* but Angies says: "axtBest is ancient history. It has been replaced by the chaining and netting process, which does a better job of finding the "best" alignment to cover a given region."
- Multiz contains a tool that converts lav2maf directly and UCSC includes one with lavToMaf. However, we don't care about fragments that match two times. For a whole genome, you really want "chains" of best-matching fragments, therefore we don't use these tools.
- Why the heck do mafs include the chrosome sizes?