Whole genome alignment howto: Difference between revisions
Line 29: | Line 29: | ||
** Angie says: ''Our hgdownload $db/vs$OtherDb/README.txt pages give the blastz parameters used for a given run.'' | ** Angie says: ''Our hgdownload $db/vs$OtherDb/README.txt pages give the blastz parameters used for a given run.'' | ||
** It's a good idea to do this on a compute cluster. It will take ages if you try to do this for a larger genome on one single computer. If you're Europen, you can get an account on [http://www.vital-it.ch vital-it]. There are national compute cluster also in the Netherlands and the UK that I know of. | ** It's a good idea to do this on a compute cluster. It will take ages if you try to do this for a larger genome on one single computer. If you're Europen, you can get an account on [http://www.vital-it.ch vital-it]. There are national compute cluster also in the Netherlands and the UK that I know of. | ||
* Chains are simply better alignments: If several alignments overlap, we still don't know which the best one is. This filtering was done with axtBest before, now we use chainNet + netToAxt . ( | * Chains are simply better alignments: If several alignments overlap, we still don't know which the best one is. This filtering was done with axtBest before, now we use chainNet + netToAxt . (Sometimes UCSC also uses netfilter to filter out nets with a gapsize <= 12) | ||
* We feed the tree into multiz. Multiz will use many local alignments to generate a multiple local alignment. | * We feed the tree into multiz. Multiz will use many local alignments to generate a multiple local alignment. | ||
Example: A-B and A-C have been aligned with blastz... | Example: A-B and A-C have been aligned with blastz... |
Revision as of 09:28, 22 September 2007
The whole genome alignments are definitely the biggest mystery of the UCSC browser for me. The scripts like doBlastzChainNet.pl and collegues automize the whole stuff for the folks working at UCSC but doesn't really make it any easier to fully understand the system as everything is buried now even more and full of parasol statements. In addition, it's using the new hgAutomate libraries: As I find Perl by itself disgusting and don't now anything about the idea behind hgAutomate doBlastzChainNet.pl is really confusing, even more as it permanently mixing shell-code with perl-code. That said, doBlastzChainNet.pl was very helpful when writing this howto just as was Conservation Track whereas this page was very confusing.
Thanks a lot to Angie and Hiram for answering all my questions!
Before you start
To understand the following you need to:
- Know about shell scripting. Have a look on the Bash Scripting Guide
- Have your own browser installed and running, see Learn_about_the_Browser and AngieTest
- Have created some tracks yourself, loaded them with hgLoadBed and configured them in trackDb.ra, see Learn_about_the_Browser and Adding New Tracks to a browser installation
Fileformats we have to know
You already know the basic file formats like bed/gff/etc, but might not have crossed some of these yet:
- lav: a compact form to store genomic pairwise alignments created by Blastz, using only numbers (pos of match + identities)
- psl: similar to lav, but the most compact format at UCSC to store alignments, easier to parse than lav (positions of match + identities), e.g. blat is creating this
- axt: a much more readable way to store *pairwise* alignments: positions + aligned sequences (It seems that you can forget about this format, it's rarely used, but the tools still contain axt... in their name although most also accept psl. )
- maf: an extended version of axt, *multiple* genomic alignments: assemblies + positions + aligned sequences
You see: To convert from lav to axt or maf we need the genomic sequences from the nib/fa files.
Outline
- MASKING: Both genomes have to be repeatmasked and masked Tandem Repeat Finder (trf) first (thanks to Hiram for pointing this out)
- ALIGNING: The two genomes are aligned with BLASTZ (we don't use blastz's own chaining, see discussion (angie)). This generates lav-files, which have to be converted to psl (lavToPsl)
- CHAINING: Two matching fragments next to each other are joined into one fragment (axtChain). As every genomic fragment can match with several others, we keep only the best match for a given part : first do axtSort, then filter with axtBest (more info on the mailing list)
- NETTING: Group blocks of chained alignments into longer stretches of synteny (netChain)
- MAF'ING: From the synteny-files (positions), get the sequences and re-create alignments
Outline - More Details
- BLASTZ: The reference genome is aligned with all others with BLASTZ. That creates lav-files. They are converted to psl.
- Angie says: Our hgdownload $db/vs$OtherDb/README.txt pages give the blastz parameters used for a given run.
- It's a good idea to do this on a compute cluster. It will take ages if you try to do this for a larger genome on one single computer. If you're Europen, you can get an account on vital-it. There are national compute cluster also in the Netherlands and the UK that I know of.
- Chains are simply better alignments: If several alignments overlap, we still don't know which the best one is. This filtering was done with axtBest before, now we use chainNet + netToAxt . (Sometimes UCSC also uses netfilter to filter out nets with a gapsize <= 12)
- We feed the tree into multiz. Multiz will use many local alignments to generate a multiple local alignment.
Example: A-B and A-C have been aligned with blastz... A aaactg \ B aa--tg \ A aaa-ctg -> B aa---tg A aaa-ctg / C aattt-g C aattt-g /
- Multiz is not a global aligner, as it uses Blatz
- Multiz is not an aligner at all. It's just a "reformatter", rewriting pairwise alignments into multiple alignments. It needs a phylogenetic tree.
- TBA is more like a real aligner. But it's too slow currently for massiv genome-wide alignments. (according to Angie)
Example, step 1: Alignments with Blastz
- I want to do a whole genome alignment of C. intestinalis V2 and C.savignyi V2. This is rather small genome (150 MB).
- Downloaded ci2 from hgdownload.cse.ucsc.edu into the directory ci2
- Downloaded cs2 (only visible on hgwtest.cse.ucsc.edu, not on the real server nor the download server) from Ensembl into the directory cs2
- Sometimes genomes come in one big fasta file, we want to split it into smaller ones:
faSplit byName cs2/cs2.fa cs2/
rm -f cs2/cs2.fa
- The Ensembl genome cs2 does not seem to be tandem repeat masked yet, so I do:
for i in cs2/*.fa; do trfBig $i trf/`basename $i`; done
- Blastz seems to ignore lower case characters (took me one afternoon to figure this out), we simply convert everything to nib-format:
for i in in ci2/* cs2/*; do faToNib $i `echo $i | sed -e s/.fa/.nib/`; done
- Then, to blastz everything onto everything, in theory we could issue this command:
for i in cs2/*.nib; do echo 'for j in ci2/*.nib; do blastz '$i' $j H=2000 Y=3400 L=6000 K=2200 Q=HoxD55.q > lav/`basename '$i' .nib`-`base name $j .nib`.lav; done'; done
- However, I better use a cluster: I rsync the nib files onto it and add a cluster instruction:
for i in cs2/*.nib; do echo 'for j in ci2/*.nib; do bsub blastz '$i' $j H=2000 Y=3400 L=6000 K=2200 Q=HoxD55.q > lav/`basename '$i' .nib`-`base name $j .nib`.lav; done'; done
- In my case I have 4000 scaffolds on one genome and 400 reftigs on the other. That was still doable with the cluster.
- However, if both genomes are spread across many more scaffolds you have to play around with the sequences, otherwise it will be millions of blastz-runs. At UCSC they first join the smaller scaffolds into one big chromsome and then run blastz on the "ChrUn"-virtual-chromosome. (this will create some false nets, going from one scaffold to another, but we don't have a choice here (hiram))
Example, step 2: Chaining
- Convert the lav files into the more compact psl format:
for i in *.lav; do echo $i; lavToPsl $i `basename $i .lav`.psl; done;
rm -f *.lav
less chr01p.psl
- Oups. Now, once again, I've messed up target and query. My psl-files actually look like "... chrom01 ... reftig1..." (the format is query, target). However, since I want to annotate Ci2, which is assembled into chromosomes, chrom01 should really be the target. I have to swap these fields in the current psl files and re-create new psl-files, split by the name of the target:
cat *.psl > ../all.psl
pslSwap ../all.psl ../all-swap.psl
pslSplitOnTarget all-swap.psl psl/ -lump
- Now the chaining (this is quite fast):
for i in psl/*.psl; do echo $i; axtChain $i ci2 cs2 chain/`basename $i .psl`.chain -linearGap=loose -psl; done
- For the filtering of the chains, we need the size of each chromosome:
faSize ../genome/ci2.fa -detailed > ci2.sizes
faSize ../../cs/cs2/cioSav2.fa -detailed > cs2.sizes
- Sort and filter the chains (more info in this post of the mailing list):
chainMergeSort chain/*.chain > all.chain
chainPreNet all.chain ci2.sizes cs2.sizes all.pre.chain
- We load these chains into the browser:
hgLoadChain ci2 chainCioSav2 all.pre.chain
- You can look at them by adding a section similar to the following to your trackDb.ra, running make alpha DBS=<yourdbname> ZOO= there, the track should then appear on your browser:
track chainCioSav2
shortLabel C. savignyi chain
longLabel C. savignyi chain
group compGeno
priority 125
visibility hide
color 100,50,0
altColor 255,240,200
spectrum on
type chain cioSav2
otherDb cioSav2
Example, step 3: Netting
- Then the netting itself, combine the chains into nets and add synteny information (I have no clue what this is) to them:
chainNet all.pre.chain -minSpace=1 ci2.sizes cs2.sizes stdout /dev/null | netSyntenic stdin noClass.net
- We need to add additional information to this nets (shown on the details page):
netClass -noAr noClass.net ci2 cioSav2 cioSav2.net
NB: I got an error message here because of the gap.bin field in cioSav2 which does not seem to exist in lib/agpGap.c, function agpGapLoad, but is retrieved by "select * from gap order by chrom". I don't get it, I just updated my source tree. I hacked around the problem but there shouldn't really be any error message.
- Finally, we can load these nets into our own browser:
hgLoadNet ci2 netCioSav2 cioSav2.net
adding something like this to trackDb.ra:
track netCioSav2
shortLabel C. savignyi Net
longLabel $o_Organism ($o_date/$o_db) Alignment Net
group compGeno
priority 134
visibility dense
spectrum on
type netAlign cioSav2 chainCioSav2
otherDb cioSav2
Example, step 4: Maffing
- We needs mafs for the browser:
netToAxt cioSav2.net all.pre.chain ci2/ cs2/ ci2-cioSav2.axt
axtToMaf ci2-cioSav2.axt ci2.sizes cs2.sizes ci2-cioSav2.maf -tPrefix=ci2. -qPrefix=cioSav2.
- The Browser needs an mysql-index to find the right position and the files in gbdb:
hgLoadMafSummary ci2 mafCioSav2Summary ci2-cioSav2.maf
sudo mkdir /gbdb/ci2/mafCioSav2
sudo cp ci2-cioSav2.maf /gbdb/ci2/mafCioSav2/mafCioSav2.maf
hgLoadMaf ci2 mafCioSav2
</code
- ... and something like this entry in trackDb.ra:
track mafCioSav2
shortLabel test
longLabel test
group compGeno
visibility full
type wigmaf 0.0 1.0
speciesOrder ci2 cioSav2
wiggle vista80
Remarks
- The original authors already wrote similar tools: Multiz contains a tool that converts lav2maf directly and UCSC includes one with lavToMaf. However, we don't care about fragments that match two times. For a whole genome, you really want "chains" of best-matching fragments, therefore we don't use these tools.
- Later on, axt started in mouseStuff and maf started with the ratStuff. There is an older tool called *axtBest* but Angies says: "axtBest is ancient history. It has been replaced by the chaining and netting process, which does a better job of finding the "best" alignment to cover a given region."
- I'm curious: Why the heck do mafs include the chrosome sizes?