Whole genome alignment howto: Difference between revisions

From genomewiki
Jump to navigationJump to search
(→‎doBlastzChainNet.pl: Removing broken link)
 
(83 intermediate revisions by 11 users not shown)
Line 1: Line 1:
The whole genome alignments are definitely the biggest mystery of the UCSC browser for me. The scripts like ''doBlastzChainNet.pl'' and collegues automize the whole stuff for the folks working at UCSC but doesn't really make it any easier to fully understand the system as everything is buried now even more and full of parasol statements. In addition, it's using the new hgAutomate libraries: As I find Perl by itself disgusting and don't now anything about the idea behind hgAutomate ''doBlastzChainNet.pl'' is really confusing.
==2018 UPDATE NOTE==


So I'm trying to re-create whole genome alignments to better understand how my whole-genome alignments are created. Thanks a lot to Angie for answering my questions!
This page is an interesting historical discussion and well worth the read.
 
<b><span style="color:#bb0000">
HOWEVER, please note</span></b>,
the UCSC tool chain command: [[DoBlastzChainNet.pl]] can now perform this entire sequence of events
in your environment with your selected genome sequences.
 
== doBlastzChainNet.pl==
The procedure described here is for a lift over construction between two different species genomes. There is a different procedure for [[Same_species_lift_over_construction]].
 
Scripts like ''doBlastzChainNet.pl'' and colleagues automate the whole process for the folks working at UCSC but don't really make it any easier to fully understand the system for an outsider, as everything is buried in hgAutomate-Perl and full of parasol statements. I'm doing a whole genome alignment in the following with bash as the only tool.
 
This page is based on doBlastzChainNet.pl, [[Conservation Track]], [http://compgen.bscb.cornell.edu/wiki/index.php?title=Creating_and_adding_a_conservation_track_with_UCSC_Chaining&printable=yes this page] which is very similar, and [[Chains Nets|Angie's mental model]] of chains/nets.  See also [[Blastz]] and  [[Conservation_Track]].
 
Thanks a lot to Angie and Hiram for answering all my questions!
 
[https://groups.google.com/a/soe.ucsc.edu/d/msg/genome/FhFTAIJ8Z2Q/8en43ZQ5_WIJ Some updates] (12/2009) from Angie on the pipeline.


== Before you start ==
== Before you start ==
To understand the following you need:
To understand the following you need to:
# Know about shell scripting. Have a look on the [http://tldp.org/LDP/abs/html/ Bash Scripting Guide]
# Know about shell scripting. In case of doubt, use the [http://tldp.org/LDP/abs/html/ Bash Scripting Guide] as a reference.
# Have your own browser installed and running, see [[Learn_about_the_Browser]] and [[AngieTest]]
# Have your own UCSC genome browser installed and running, see [[Learn_about_the_Browser]].  There is also some helpful information about UCSC's comparative genomics track data in http://www.gmod.org/wiki/index.php/GBrowse_UCSC_Plugin_Install_HOWTO.
# Have created some tracks yourself, loaded them with hgLoadBed and configured them in trackDb.ra, see [[Learn_about_the_Browser]] and [[Adding New Tracks to a browser installation]]
# If is helpful if you have have created a UCSC browser tracks yourself, loaded it with hgLoadBed and configured it in trackDb.ra, see [[Learn_about_the_Browser]] and [[Adding New Tracks to a browser installation]]


== Fileformats we have to know ==
You will need a compute cluster for bigger genomes. Depending on where you work, one of these might be free for you:
Ok, I guess most people have left now. The others already know the [http://genome.ucsc.edu/FAQ/FAQformat.html basic file format], but might not have crossed some of these yet:
# Vital-IT in Switzerland, available for most European users
# lav: a compact form to store genomic pairwise alignments created by Blastz, using only numbers (pos of match + identities)
# [https://portal.xsede.org XSede], for the USA
# psl: similar to lav, but the most compact format at UCSC to store alignments, easier to parse than lav (positions of match + identities)
# [http://ngs.ac.uk National Grid Services] in the UK
# axt: a much more readable way to store *pairwise* alignments: positions + aligned '''sequences''' (you can forget about this format, it's not used anymore, but the tools still contain axt... in their name although most also accept psl. )
# maf: an extended version of axt, *multiple* genomic alignments: '''assemblies''' + positions + aligned sequences


You see: To convert from lav to axt/maf we need the genomic sequences from the nib/fa files.
== Fileformats  ==
I already knew the [http://genome.ucsc.edu/FAQ/FAQformat.html basic file formats] like bed/gff/etc, but hadn't crossed some of these yet:
# [http://www.bx.psu.edu/miller_lab/dist/lav_format.html lav]: a compact form to store genomic pairwise alignments created by [[Blastz]], using only numbers (pos of match + identities). As of late 2009, the pipeline has replaced blastz with lastz (see http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.01.50/README.lastz-1.01.50.html], but this is almost the same.
# [http://genome.ucsc.edu/FAQ/FAQformat#format2 psl]: similar to lav, but the most compact format at UCSC to store alignments, easier to parse than lav (positions of match + identities), e.g. blat is creating this
# [http://www.bioperl.org/wiki/AXT_alignment_format axt]: a much more readable way to store *pairwise* alignments: positions + aligned '''sequences''' (It seems that you can forget about this format, it's rarely used, but the tools still contain axt... in their name although most also accept psl. )
# [http://genome.ucsc.edu/FAQ/FAQformat#format5 maf]: an extended version of axt, *multiple* genomic alignments: '''assemblies''' + positions + aligned sequences
 
As you can see from the above, to convert from lav to maf we need the genomic sequences from the nib/fa files.


== Outline ==
== Outline ==
# MASKING: Both genomes have to be repeatmasked and masked Tandem Repeat Finder (trf) first (thanks to Hiram for pointing this out)
# MASKING: Both genomes have to be repeatmasked and masked Tandem Repeat Finder (trf) first (thanks to Hiram for pointing this out)
# ALIGNING: The two genomes are aligned with BLASTZ (we don't use blastz's own chaining, see [http://www.soe.ucsc.edu/pipermail/genome/2007-March/013151.html discussion] (angie)). This generates lav-files, which have to be converted to psl (lavToPsl)
# ALIGNING: The two genomes are aligned with BLASTZ (we don't use blastz's own chaining, see [http://www.soe.ucsc.edu/pipermail/genome/2007-March/013151.html discussion] (angie)). This generates lav-files, which have to be converted to psl (lavToPsl)
# CHAINING: Two matching fragments next to each other are joined into one fragment (axtChain). As every genomic fragment can match with several others, we keep only the best match for a given part : first do axtSort, then filter with axtBest (more info on the [http://www.soe.ucsc.edu/pipermail/genome/2005-October/008710.html mailing list])
# CHAINING: Two matching alignments next to each other are joined into one fragment if they are close enough (axtChain). As every genomic fragment can match with several others, we keep only the longest chains : first do axtSort then filter with axtBest (more info on the [http://www.soe.ucsc.edu/pipermail/genome/2005-October/008710.html mailing list])
# NETTING: Group blocks of chained alignments into longer stretches of synteny (netChain)
# NETTING: Group blocks of chained alignments into longer stretches of synteny (netChain)
# MAF'ING: From the synteny-files (positions), get the sequences and re-create alignments
# MAF'ING: From the synteny-files (positions), get the sequences and re-create alignments
# PhastCons: Using the maf-files, calculate the strength of conservation for every base, similar to a Vista- or protein Conservation plot, but applicable to multiple alignments


== Outline - More Details ==
== Outline - Notes ==
* BLASTZ: The reference genome is aligned with all others with BLASTZ. That creates lav-files. They are converted to psl.
* BLASTZ: The reference genome is aligned with all others with BLASTZ. That creates lav-files. They are converted to psl.
** Angie says: ''Our hgdownload $db/vs$OtherDb/README.txt pages give the blastz parameters used for a given run.''
** Angie says: ''Our hgdownload $db/vs$OtherDb/README.txt pages give the blastz parameters used for a given run.''
** It's a good idea to do this on a compute cluster. It will take ages if you try to do this for a larger genome on one single computer. If you're Europen, you can get an account on [http://www.vital-it.ch vital-it]. There are national compute cluster also in the Netherlands and the UK that I know of.  
** It's a good idea to do this on a compute cluster. It took ages on my own computer. I'm European, so I got an account [http://www.vital-it.ch vital-it], anyone can apply who is working in the EU. There are national compute cluster also in the Netherlands and the UK that I know of.  
* Chains are simply better alignments: If several alignments overlap, we still don't know which the best one is. This filtering was done with axtBest before, now we use chainNet + netToAxt . (I have no clue why Kate mentioned netFilter here.)
* Chains are simply better alignments: If several alignments overlap, we still don't know which the best one is. This filtering was done with axtBest before, now we use chainNet + netToAxt. (Sometimes UCSC also uses netfilter to filter out nets with a gapsize <= 12)
* We feed the tree into multiz. Multiz will use many local alignments to generate a multiple local alignment.
* We feed the tree into multiz. Multiz will use many local alignments to generate a multiple local alignment.
   Example: A-B and A-C have been aligned with blastz...
   Example: A-B and A-C have been aligned with blastz...
Line 37: Line 59:
     A aaa-ctg  /    C aattt-g
     A aaa-ctg  /    C aattt-g
     C aattt-g /
     C aattt-g /
* Multiz is not a global aligner, as it uses Blatz
* Multiz is not a real aligner, as it uses Blastz
* Multiz is not an aligner at all. It's just a "reformatter", rewriting pairwise alignments into multiple alignments. It needs a phylogenetic tree.
* Multiz is "a phylogenetic tree directed multiple alignment. It starts with subsets of the phylo tree with sequences that are already close to each other.  As they are combined, new gaps will appear and poorly aligned paired sequences will disappear to produce a better multiple alignment at a higher scale." (thanks to Hiram, replied to me on the mailing list in 2011)
* TBA is more like a real aligner. But it's too slow currently for massiv genome-wide alignments. (according to Angie)
* Multiz is not needed if you only have two species, as the chain information from blastz/lastz will contain the same information.
* TBA is more like a real aligner. But it's too slow currently for massiv genome-wide alignments. (according to Angie) and is only used for the encode regions.
* The only other aligner on whole genomes that I know of is SLAGAN. In theory, its output can be converted to lav but I doubt that anyone has ever tried this. SLAGAN is not related to Blastz. I have alignments for Ci/Cs made by both algorithms and they differ quite a lot, however, I've never explored the subject in detail in the end and will probably stick with Blastz


== Example, step 1: Alignments with Blastz  ==
== Example, step 1: Alignments with Blastz  ==
* I want to do a whole genome alignment of C. intestinalis V2 and C.savignyi V2. This is rather small genome (150 MB).
 
** Downloaded ci2 from hgdownload.cse.ucsc.edu into the directory ci2
<span style="color:red">Please note the script runLastzChain.sh: [[Image:RunLastzChain_sh.txt]] which encapsulates the procedures to generate the required steps for the lastz and chain operations. </span> The lastz parameters used by this script are just examples, and should be fine-tuned for your particular alignment problem. This script also requires [[Image:ConstructLiftFile_pl.txt]] from the genomewiki, and [http://genome-source.soe.ucsc.edu/gitlist/kent.git/raw/mastersrc/hg/utils/automation/partitionSequence.pl partitionSequence.pl] from the kent source tree.
** Downloaded cs2 (only visible on hgwtest.cse.ucsc.edu, not on the real server nor the download server) from Ensembl into the directory cs2
 
* I want to do a whole genome alignment of C. intestinalis V2 and C.savignyi V2. This is rather small genome (150 MB). Note that ci2 stands for Ciona intestinalis and cs2 stands for Ciona savignyi
** Downloaded ci2 from hgdownload.soe.ucsc.edu into the directory ci2
** Downloaded cs2 (only visible on hgwtest.soe.ucsc.edu, not on the real server nor the download server) from Ensembl into the directory cs2
* Sometimes genomes come in one big fasta file, we want to split it into smaller ones:
* Sometimes genomes come in one big fasta file, we want to split it into smaller ones:
<code>
<pre>
   faSplit byName cs2/cs2.fa cs2/
   faSplit byName cs2/cs2.fa cs2/
   rm -f cs2/cs2.fa
   rm -f cs2/cs2.fa
</code>
</pre>
* The Ensembl genome cs2 does not seem to be tandem repeat masked yet, so I do:
* The Ensembl genome cs2 does not seem to be tandem repeat masked yet, so I do:
<code>
<pre>
   for i in cs2/*.fa; do trfBig $i trf/`basename $i`; done
   for i in cs2/*.fa; do trfBig $i trf/`basename $i`; done
</code>
</pre>
* Blastz seems to ignore lower case characters (took me one afternoon to figure this out), we simply convert everything to nib-format:
* Blastz seems to ignore lower case characters (took me one afternoon to figure this out), we simply convert everything to nib-format:
<code>
<pre>
   for i in in ci2/* cs2/*; do faToNib $i `echo $i | sed -e s/.fa/.nib/`; done
   for i in in ci2/* cs2/*; do faToNib $i `echo $i | sed -e s/.fa/.nib/`; done
</code>
</pre>
* Then, to blastz everything onto everything, in theory we could issue this command:
* Then, to blastz everything onto everything, in theory we could issue this command:
<code>
<pre>
   for i in cs2/*.nib; do echo 'for j in ci2/*.nib; do blastz '$i' $j  H=2000 Y=3400 L=6000 K=2200 Q=HoxD55.q > lav/`basename '$i' .nib`-`base name $j .nib`.lav; done'; done
   for i in cs2/*.nib; do echo 'for j in ci2/*.nib; do blastz '$i' $j  H=2000 Y=3400 L=6000 K=2200 Q=HoxD55.q > lav/`basename '$i' .nib`-`base name $j .nib`.lav; done'; done
</code>
</pre>
* However, I better use a cluster: I rsync the nib files onto it and add a cluster instruction:
* However, I better use the cluster at vital-it.ch: I rsync the nib files onto it and add a cluster instruction:
<code>
<pre>
   for i in cs2/*.nib; do echo 'for j in ci2/*.nib; do '''bsub''' blastz '$i' $j  H=2000 Y=3400 L=6000 K=2200 Q=HoxD55.q > lav/`basename '$i' .nib`-`base name $j .nib`.lav; done'; done
   for i in cs2/*.nib; do echo 'for j in ci2/*.nib; do '''bsub''' blastz '$i' $j  H=2000 Y=3400 L=6000 K=2200 Q=HoxD55.q > lav/`basename '$i' .nib`-`base name $j .nib`.lav; done'; done
</code>
</pre>
** In my case I have 4000 scaffolds on one genome and 400 reftigs on the other. That was still doable with the cluster.
** In my case I have 4000 scaffolds on one genome and 400 reftigs on the other. That was still doable with the cluster.
** However, if both genomes are spread across many more scaffolds you have to play around with the sequences, otherwise it will be millions of blastz-runs. At UCSC they first join the smaller scaffolds into one big chromsome and then run blastz on the "ChrUn"-virtual-chromosome. (this will create some false nets, going from one scaffold to another, but we don't have a choice here (hiram))
** However, if both genomes are spread across many more scaffolds you have to play around with the sequences, otherwise it will be millions of blastz-runs. At UCSC they first join the smaller scaffolds into one big chromsome and then run blastz on the "ChrUn"-virtual-chromosome. (this will create some false nets, going from one scaffold to another, but we don't have a choice here (hiram))
Line 71: Line 98:
== Example, step 2: Chaining ==
== Example, step 2: Chaining ==
* Convert the lav files into the more compact psl format:
* Convert the lav files into the more compact psl format:
<code>
<pre>
   for i in *.lav; do echo $i; lavToPsl $i `basename $i .lav`.psl; done;  
   for i in *.lav; do echo $i; lavToPsl $i `basename $i .lav`.psl; done;  
   rm -f *.lav
   rm -f *.lav
   less chr01p.psl
   less chr01p.psl
</code>
</pre>
* Oups. Now, once again, I've messed up target and query. My psl-files actually look like "... chrom01 ... reftig1..." (the format is query, target). However, since I want to annotate Ci2, which is assembled into chromosomes, chrom01 should really be the target. I have to swap these fields in the current psl files and re-create new psl-files, split by the name of the target:
* Oups. Now, once again, I've messed up target and query. My psl-files actually look like "... chrom01 ... reftig1..." (the format is query, target). However, since I want to annotate Ci2, which is assembled into chromosomes, chrom01 should really be the target. I have to swap these fields in the current psl files and re-create new psl-files, split by the name of the target:
<code>
<pre>
   cat *.psl > ../all.psl
   cat *.psl > ../all.psl
   pslSwap ../all.psl ../all-swap.psl
   pslSwap ../all.psl ../all-swap.psl
   pslSplitOnTarget all-swap.psl psl/ -lump
   pslSplitOnTarget all-swap.psl psl/ -lump
</code>
</pre>
* Now the chaining (this is quite fast):
* Now the chaining (this is quite fast for my little ciona-genome):
<code>
<pre>
   for i in psl/*.psl; do echo $i; axtChain $i ci2 cs2 chain/`basename $i .psl`.chain -linearGap=loose -psl; done
   for i in psl/*.psl; do echo $i; axtChain $i ci2 cs2 chain/`basename $i .psl`.chain -linearGap=loose -psl; done
</code>
</pre>
* For the filtering of the chains, we need the size of each chromosome:
* For the filtering of the chains, we need the size of each chromosome:
<code>
<pre>
   faSize ../genome/ci2.fa -detailed > ci2.sizes
   faSize ../genome/ci2.fa -detailed > ci2.sizes
   faSize ../../cs/cs2/cioSav2.fa -detailed > cs2.sizes
   faSize ../../cs/cs2/cioSav2.fa -detailed > cs2.sizes
</code>
</pre>
* Sort and filter the chains (more info in [http://www.soe.ucsc.edu/pipermail/genome/2005-October/008753.html this post] of the mailing list):
* Sort and filter the chains (more info in [https://groups.google.com/a/soe.ucsc.edu/d/msg/genome/zcHuWtmw-LE/M4kdIInQ7TQJ this post] of the mailing list):
<code>
<pre>
   chainMergeSort chain/*.chain > all.chain
   chainMergeSort chain/*.chain > all.chain
   chainPreNet all.chain ci2.sizes cs2.sizes all.pre.chain
   chainPreNet all.chain ci2.sizes cs2.sizes all.pre.chain
</code>
</pre>
* We load these chains into the browser:
* We load these chains into the browser (hm...wouldn't it be more interesting to have the complete chains in the browser?):
<code>
<pre>
   hgLoadChain ci2 chainCioSav2 all.pre.chain
   hgLoadChain ci2 chainCioSav2 all.pre.chain
</code>
</pre>
* You can look at them by adding a section similar to the following to your ''trackDb.ra'', running ''make alpha DBS=<yourdbname> ZOO='' there, the track should then appear on your browser:
* Adding a section similar to the following to ''trackDb.ra'', running ''make alpha DBS=<yourdbname> ZOO='' there, the track should then appear on the browser:
<code>
<pre>
   track chainCioSav2
   track chainCioSav2
   shortLabel C. savignyi chain
   shortLabel C. savignyi chain
Line 113: Line 140:
   type chain cioSav2
   type chain cioSav2
   otherDb cioSav2
   otherDb cioSav2
</code>
</pre>


== Example, step 3: Netting ==
== Example, step 3: Netting ==


* Then the netting itself, combine the chains into nets and add synteny information (I have no clue what this is) to them:
* Then the netting itself, combine the chains into nets and add synteny information (not sure where this is displayed, probably needed for double/single-lines display of net-tracks) to them:
<code>
<pre>
   chainNet all.pre.chain -minSpace=1 ci2.sizes cs2.sizes stdout /dev/null | netSyntenic stdin noClass.net
   chainNet all.pre.chain -minSpace=1 ci2.sizes cs2.sizes stdout /dev/null | netSyntenic stdin noClass.net
</code>
</pre>
* We need to add additional information to this nets (shown on the details page):
* We need to add additional information to this nets (shown on the details page):
<code>
<pre>
   netClass -noAr noClass.net ci2 cioSav2 cioSav2.net
   netClass -noAr noClass.net ci2 cioSav2 cioSav2.net
</code>
</pre>
<small>NB: I got an error message here because of the gap.bin field in cioSav2 which does not seem to exist in lib/agpGap.c, function agpGapLoad, but is retrieved by "select * from gap order by chrom". I don't get it, I just updated my source tree. I hacked around the problem but there shouldn't really be any error message.</small>
<small>NB: I got an error message here because of the gap.bin field in cioSav2 which does not seem to exist in lib/agpGap.c, function agpGapLoad, but is retrieved by "select * from gap order by chrom". I hacked around the problem but there shouldn't really be any error message.</small>
* Finally, we can load these nets into our own browser:
* Finally, we can load these nets into our own browser:
<code>
<pre>
   hgLoadNet ci2 netCioSav2 cioSav2.net
   hgLoadNet ci2 netCioSav2 cioSav2.net
</code>
</pre>
adding something like this to ''trackDb.ra'':
adding something like this to ''trackDb.ra'':
<code>
<pre>
   track netCioSav2
   track netCioSav2
   shortLabel C. savignyi Net
   shortLabel C. savignyi Net
Line 141: Line 168:
   type netAlign cioSav2 chainCioSav2
   type netAlign cioSav2 chainCioSav2
   otherDb cioSav2
   otherDb cioSav2
</code>
</pre>
 
== Example, step 4: Maffing ==
* I needs mafs for the browser as I want to display the alignments. This is also the right time to sort the alignments as they need to be sorted for downstream tools like mafsInRange (cut out only selected blocks).
<pre>
  netToAxt cioSav2.net all.pre.chain ci2/ cs2/ stdout | axtSort stdin ci2-cioSav2.axt
  axtToMaf ci2-cioSav2.axt ci2.sizes cs2.sizes ci2-cioSav2.maf -tPrefix=ci2. -qPrefix=cioSav2.
</pre>
* Load the data into the brower. It needs an mysql-index to find the right position. The alignment files go into /gbdb:
<pre>
  hgLoadMafSummary ci2 mafCioSav2Summary ci2-cioSav2.maf
  sudo mkdir /gbdb/ci2/mafCioSav2
  sudo cp ci2-cioSav2.maf /gbdb/ci2/mafCioSav2/mafCioSav2.maf
  hgLoadMaf ci2 mafCioSav2
</pre>
 
== Example, step 5: Phastcons ==
* Phastcons doesn't seem to like big maf files that include many chromsomes, so I split the maf files by chromosome:
<pre>
  mkdir maf
  mafSplit -byTarget dummy.bed maf/ ci2-cioSav2.maf
</pre>
* I take the biggest chromsome and calculate the background model from it:
<pre>
  PHAST=~/bio/phastCons
  ${PHAST}/bin/phyloFit -i MAF maf/002.maf
</pre>
* Run phastCons on every maf file with this background model and some obscure parameters that phastCons needs. I'm parsing the chrosomome name here from the maf as it was lost during the mafSplitByTarget (there should be an easier way but it works):
<pre>
  mkdir wig
  mkdir mostCons
  for i in maf/*.maf; do \
          x=`basename $i .maf`;
          ~/bio/phastCons/bin/phastCons --target-coverage 0.25 --expected-length 12 \
          --rho 0.4 --msa-format MAF $i phyloFit.mod \
          --seqname `cat $i | head -n 3  | tail -n 1 | tr -s ' ' | cut -f 2 -d ' ' | cut -d. -f2` \
          --most-conserved mostCons/$x.bed > wig/$x.wig; \
  done
</pre>
This took 7 minutes for a maf file of 113 MB
 
<small>Q: How can I check if these parameters are "good"? Use CDS overlap? </small>
* Load the wiggle files into the browser:
<pre>
  cat wig/*.wig > ../ci2-cioSav2.pp
  wigEncode ci2-cioSav2.pp ci2-cioSav2.wig ci2-cioSav2.wib
  sudo mkdir /gbdb/ci2/wib/
  sudo cp ci2-cioSav2.wib /gbdb/ci2/wib/
  hgLoadWiggle  -pathPrefix=/gbdb/ci2/wib ci2 phastCons ci2-cioSav2.wig
</pre>
* Load the "most Conserved" regions into the browser:
<pre>
  cat mostCons/*.bed > mostCons.bed
  hgLoadBed ci2 mostConserved mostCons.bed
</pre>
* ... add something like this entry in trackDb.ra:
<pre>
  track mafCioSav2
  shortLabel Conservation
  longLabel Blastz/Phastcons conservation C. savignyi
  group compGeno
  visibility full
  type wigmaf 0.0 1.0
  speciesOrder ci2 cioSav2
  wiggle phastCons
  viewLimits 0:1
  autoScale off
  maxHeightPixels 50:50:11
</pre>
* Refreshing my trackDb (make ...) will then display the track


== Remarks ==
== Remarks ==


* The original authors already wrote similar tools: Multiz contains a tool that converts lav2maf directly and UCSC includes one with lavToMaf. However, we don't care about fragments that match two times. For a whole genome, you really want "chains" of best-matching fragments, therefore we don't use these tools.
* The original authors already wrote similar tools: Multiz contains a tool that converts lav2maf directly and UCSC includes one with lavToMaf. However, we don't care about fragments that match two times. For a whole genome, you really want "chains" of best-matching fragments, therefore we don't use these tools.
* Later on, axt started in mouseStuff and maf started with the ratStuff. There is an older tool called *axtBest* but Angies says: "axtBest is ancient history.  It has been replaced by the chaining and netting process, which does a better job of finding the "best" alignment to cover a given region."  
* For the records: In the UCSC source tree, the AXT format started in the mouseStuff directory and MAF started in the ratStuff directory. There is an older tool called *axtBest* but Angies says: "axtBest is ancient history.  It has been replaced by the chaining and netting process, which does a better job of finding the "best" alignment to cover a given region."  
* I'm curious: Why the heck do mafs include the chrosome sizes?
* Why do mafs include the chrosome sizes? I don't know.
* GERP is an alternative for phastCons from the Sidow lab and used by the Ensembl browser.


[[Category:Comparative Genomics]]
[[Category:Comparative Genomics]]

Latest revision as of 23:14, 22 February 2019

2018 UPDATE NOTE

This page is an interesting historical discussion and well worth the read.

HOWEVER, please note, the UCSC tool chain command: DoBlastzChainNet.pl can now perform this entire sequence of events in your environment with your selected genome sequences.

doBlastzChainNet.pl

The procedure described here is for a lift over construction between two different species genomes. There is a different procedure for Same_species_lift_over_construction.

Scripts like doBlastzChainNet.pl and colleagues automate the whole process for the folks working at UCSC but don't really make it any easier to fully understand the system for an outsider, as everything is buried in hgAutomate-Perl and full of parasol statements. I'm doing a whole genome alignment in the following with bash as the only tool.

This page is based on doBlastzChainNet.pl, Conservation Track, this page which is very similar, and Angie's mental model of chains/nets. See also Blastz and Conservation_Track.

Thanks a lot to Angie and Hiram for answering all my questions!

Some updates (12/2009) from Angie on the pipeline.

Before you start

To understand the following you need to:

  1. Know about shell scripting. In case of doubt, use the Bash Scripting Guide as a reference.
  2. Have your own UCSC genome browser installed and running, see Learn_about_the_Browser. There is also some helpful information about UCSC's comparative genomics track data in http://www.gmod.org/wiki/index.php/GBrowse_UCSC_Plugin_Install_HOWTO.
  3. If is helpful if you have have created a UCSC browser tracks yourself, loaded it with hgLoadBed and configured it in trackDb.ra, see Learn_about_the_Browser and Adding New Tracks to a browser installation

You will need a compute cluster for bigger genomes. Depending on where you work, one of these might be free for you:

  1. Vital-IT in Switzerland, available for most European users
  2. XSede, for the USA
  3. National Grid Services in the UK

Fileformats

I already knew the basic file formats like bed/gff/etc, but hadn't crossed some of these yet:

  1. lav: a compact form to store genomic pairwise alignments created by Blastz, using only numbers (pos of match + identities). As of late 2009, the pipeline has replaced blastz with lastz (see http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.01.50/README.lastz-1.01.50.html], but this is almost the same.
  2. psl: similar to lav, but the most compact format at UCSC to store alignments, easier to parse than lav (positions of match + identities), e.g. blat is creating this
  3. axt: a much more readable way to store *pairwise* alignments: positions + aligned sequences (It seems that you can forget about this format, it's rarely used, but the tools still contain axt... in their name although most also accept psl. )
  4. maf: an extended version of axt, *multiple* genomic alignments: assemblies + positions + aligned sequences

As you can see from the above, to convert from lav to maf we need the genomic sequences from the nib/fa files.

Outline

  1. MASKING: Both genomes have to be repeatmasked and masked Tandem Repeat Finder (trf) first (thanks to Hiram for pointing this out)
  2. ALIGNING: The two genomes are aligned with BLASTZ (we don't use blastz's own chaining, see discussion (angie)). This generates lav-files, which have to be converted to psl (lavToPsl)
  3. CHAINING: Two matching alignments next to each other are joined into one fragment if they are close enough (axtChain). As every genomic fragment can match with several others, we keep only the longest chains : first do axtSort then filter with axtBest (more info on the mailing list)
  4. NETTING: Group blocks of chained alignments into longer stretches of synteny (netChain)
  5. MAF'ING: From the synteny-files (positions), get the sequences and re-create alignments
  6. PhastCons: Using the maf-files, calculate the strength of conservation for every base, similar to a Vista- or protein Conservation plot, but applicable to multiple alignments

Outline - Notes

  • BLASTZ: The reference genome is aligned with all others with BLASTZ. That creates lav-files. They are converted to psl.
    • Angie says: Our hgdownload $db/vs$OtherDb/README.txt pages give the blastz parameters used for a given run.
    • It's a good idea to do this on a compute cluster. It took ages on my own computer. I'm European, so I got an account vital-it, anyone can apply who is working in the EU. There are national compute cluster also in the Netherlands and the UK that I know of.
  • Chains are simply better alignments: If several alignments overlap, we still don't know which the best one is. This filtering was done with axtBest before, now we use chainNet + netToAxt. (Sometimes UCSC also uses netfilter to filter out nets with a gapsize <= 12)
  • We feed the tree into multiz. Multiz will use many local alignments to generate a multiple local alignment.
 Example: A-B and A-C have been aligned with blastz...
    A aaactg  \
    B aa--tg   \     A aaa-ctg
                ->   B aa---tg
    A aaa-ctg  /     C aattt-g
    C aattt-g /
  • Multiz is not a real aligner, as it uses Blastz
  • Multiz is "a phylogenetic tree directed multiple alignment. It starts with subsets of the phylo tree with sequences that are already close to each other. As they are combined, new gaps will appear and poorly aligned paired sequences will disappear to produce a better multiple alignment at a higher scale." (thanks to Hiram, replied to me on the mailing list in 2011)
  • Multiz is not needed if you only have two species, as the chain information from blastz/lastz will contain the same information.
  • TBA is more like a real aligner. But it's too slow currently for massiv genome-wide alignments. (according to Angie) and is only used for the encode regions.
  • The only other aligner on whole genomes that I know of is SLAGAN. In theory, its output can be converted to lav but I doubt that anyone has ever tried this. SLAGAN is not related to Blastz. I have alignments for Ci/Cs made by both algorithms and they differ quite a lot, however, I've never explored the subject in detail in the end and will probably stick with Blastz

Example, step 1: Alignments with Blastz

Please note the script runLastzChain.sh: File:RunLastzChain sh.txt which encapsulates the procedures to generate the required steps for the lastz and chain operations. The lastz parameters used by this script are just examples, and should be fine-tuned for your particular alignment problem. This script also requires File:ConstructLiftFile pl.txt from the genomewiki, and partitionSequence.pl from the kent source tree.

  • I want to do a whole genome alignment of C. intestinalis V2 and C.savignyi V2. This is rather small genome (150 MB). Note that ci2 stands for Ciona intestinalis and cs2 stands for Ciona savignyi
    • Downloaded ci2 from hgdownload.soe.ucsc.edu into the directory ci2
    • Downloaded cs2 (only visible on hgwtest.soe.ucsc.edu, not on the real server nor the download server) from Ensembl into the directory cs2
  • Sometimes genomes come in one big fasta file, we want to split it into smaller ones:
   faSplit byName cs2/cs2.fa cs2/
   rm -f cs2/cs2.fa
  • The Ensembl genome cs2 does not seem to be tandem repeat masked yet, so I do:
   for i in cs2/*.fa; do trfBig $i trf/`basename $i`; done
  • Blastz seems to ignore lower case characters (took me one afternoon to figure this out), we simply convert everything to nib-format:
   for i in in ci2/* cs2/*; do faToNib $i `echo $i | sed -e s/.fa/.nib/`; done
  • Then, to blastz everything onto everything, in theory we could issue this command:
   for i in cs2/*.nib; do echo 'for j in ci2/*.nib; do blastz '$i' $j  H=2000 Y=3400 L=6000 K=2200 Q=HoxD55.q > lav/`basename '$i' .nib`-`base name $j .nib`.lav; done'; done
  • However, I better use the cluster at vital-it.ch: I rsync the nib files onto it and add a cluster instruction:
   for i in cs2/*.nib; do echo 'for j in ci2/*.nib; do '''bsub''' blastz '$i' $j  H=2000 Y=3400 L=6000 K=2200 Q=HoxD55.q > lav/`basename '$i' .nib`-`base name $j .nib`.lav; done'; done
    • In my case I have 4000 scaffolds on one genome and 400 reftigs on the other. That was still doable with the cluster.
    • However, if both genomes are spread across many more scaffolds you have to play around with the sequences, otherwise it will be millions of blastz-runs. At UCSC they first join the smaller scaffolds into one big chromsome and then run blastz on the "ChrUn"-virtual-chromosome. (this will create some false nets, going from one scaffold to another, but we don't have a choice here (hiram))

Example, step 2: Chaining

  • Convert the lav files into the more compact psl format:
   for i in *.lav; do echo $i; lavToPsl $i `basename $i .lav`.psl; done; 
   rm -f *.lav
   less chr01p.psl
  • Oups. Now, once again, I've messed up target and query. My psl-files actually look like "... chrom01 ... reftig1..." (the format is query, target). However, since I want to annotate Ci2, which is assembled into chromosomes, chrom01 should really be the target. I have to swap these fields in the current psl files and re-create new psl-files, split by the name of the target:
  cat *.psl > ../all.psl
  pslSwap ../all.psl ../all-swap.psl
  pslSplitOnTarget all-swap.psl psl/ -lump
  • Now the chaining (this is quite fast for my little ciona-genome):
  for i in psl/*.psl; do echo $i; axtChain $i ci2 cs2 chain/`basename $i .psl`.chain -linearGap=loose -psl; done
  • For the filtering of the chains, we need the size of each chromosome:
  faSize ../genome/ci2.fa -detailed > ci2.sizes
  faSize ../../cs/cs2/cioSav2.fa -detailed > cs2.sizes
  • Sort and filter the chains (more info in this post of the mailing list):
   chainMergeSort chain/*.chain > all.chain
   chainPreNet all.chain ci2.sizes cs2.sizes all.pre.chain
  • We load these chains into the browser (hm...wouldn't it be more interesting to have the complete chains in the browser?):
  hgLoadChain ci2 chainCioSav2 all.pre.chain
  • Adding a section similar to the following to trackDb.ra, running make alpha DBS=<yourdbname> ZOO= there, the track should then appear on the browser:
  track chainCioSav2
  shortLabel C. savignyi chain
  longLabel C. savignyi chain
  group compGeno
  priority 125
  visibility hide
  color 100,50,0
  altColor 255,240,200
  spectrum on
  type chain cioSav2
  otherDb cioSav2

Example, step 3: Netting

  • Then the netting itself, combine the chains into nets and add synteny information (not sure where this is displayed, probably needed for double/single-lines display of net-tracks) to them:
   chainNet all.pre.chain -minSpace=1 ci2.sizes cs2.sizes stdout /dev/null | netSyntenic stdin noClass.net
  • We need to add additional information to this nets (shown on the details page):
   netClass -noAr noClass.net ci2 cioSav2 cioSav2.net

NB: I got an error message here because of the gap.bin field in cioSav2 which does not seem to exist in lib/agpGap.c, function agpGapLoad, but is retrieved by "select * from gap order by chrom". I hacked around the problem but there shouldn't really be any error message.

  • Finally, we can load these nets into our own browser:
   hgLoadNet ci2 netCioSav2 cioSav2.net

adding something like this to trackDb.ra:

  track netCioSav2
  shortLabel C. savignyi Net
  longLabel $o_Organism ($o_date/$o_db) Alignment Net
  group compGeno
  priority 134
  visibility dense
  spectrum on
  type netAlign cioSav2 chainCioSav2
  otherDb cioSav2

Example, step 4: Maffing

  • I needs mafs for the browser as I want to display the alignments. This is also the right time to sort the alignments as they need to be sorted for downstream tools like mafsInRange (cut out only selected blocks).
  netToAxt cioSav2.net all.pre.chain ci2/ cs2/ stdout | axtSort stdin ci2-cioSav2.axt
  axtToMaf ci2-cioSav2.axt ci2.sizes cs2.sizes ci2-cioSav2.maf -tPrefix=ci2. -qPrefix=cioSav2.
  • Load the data into the brower. It needs an mysql-index to find the right position. The alignment files go into /gbdb:
  hgLoadMafSummary ci2 mafCioSav2Summary ci2-cioSav2.maf
  sudo mkdir /gbdb/ci2/mafCioSav2
  sudo cp ci2-cioSav2.maf /gbdb/ci2/mafCioSav2/mafCioSav2.maf
  hgLoadMaf ci2 mafCioSav2

Example, step 5: Phastcons

  • Phastcons doesn't seem to like big maf files that include many chromsomes, so I split the maf files by chromosome:
   mkdir maf
   mafSplit -byTarget dummy.bed maf/ ci2-cioSav2.maf
  • I take the biggest chromsome and calculate the background model from it:
   PHAST=~/bio/phastCons
   ${PHAST}/bin/phyloFit -i MAF maf/002.maf
  • Run phastCons on every maf file with this background model and some obscure parameters that phastCons needs. I'm parsing the chrosomome name here from the maf as it was lost during the mafSplitByTarget (there should be an easier way but it works):
  mkdir wig
  mkdir mostCons
  for i in maf/*.maf; do \
          x=`basename $i .maf`;
          ~/bio/phastCons/bin/phastCons --target-coverage 0.25 --expected-length 12 \
           --rho 0.4 --msa-format MAF $i phyloFit.mod \
           --seqname `cat $i | head -n 3  | tail -n 1 | tr -s ' ' | cut -f 2 -d ' ' | cut -d. -f2` \
           --most-conserved mostCons/$x.bed > wig/$x.wig; \
  done

This took 7 minutes for a maf file of 113 MB

Q: How can I check if these parameters are "good"? Use CDS overlap?

  • Load the wiggle files into the browser:
  cat wig/*.wig > ../ci2-cioSav2.pp
  wigEncode ci2-cioSav2.pp ci2-cioSav2.wig ci2-cioSav2.wib
  sudo mkdir /gbdb/ci2/wib/
  sudo cp ci2-cioSav2.wib /gbdb/ci2/wib/
  hgLoadWiggle  -pathPrefix=/gbdb/ci2/wib ci2 phastCons ci2-cioSav2.wig
  • Load the "most Conserved" regions into the browser:
  cat mostCons/*.bed > mostCons.bed
  hgLoadBed ci2 mostConserved mostCons.bed
  • ... add something like this entry in trackDb.ra:
  track mafCioSav2
  shortLabel Conservation
  longLabel Blastz/Phastcons conservation C. savignyi
  group compGeno
  visibility full
  type wigmaf 0.0 1.0
  speciesOrder ci2 cioSav2
  wiggle phastCons
  viewLimits 0:1
  autoScale off
  maxHeightPixels 50:50:11
  • Refreshing my trackDb (make ...) will then display the track

Remarks

  • The original authors already wrote similar tools: Multiz contains a tool that converts lav2maf directly and UCSC includes one with lavToMaf. However, we don't care about fragments that match two times. For a whole genome, you really want "chains" of best-matching fragments, therefore we don't use these tools.
  • For the records: In the UCSC source tree, the AXT format started in the mouseStuff directory and MAF started in the ratStuff directory. There is an older tool called *axtBest* but Angies says: "axtBest is ancient history. It has been replaced by the chaining and netting process, which does a better job of finding the "best" alignment to cover a given region."
  • Why do mafs include the chrosome sizes? I don't know.
  • GERP is an alternative for phastCons from the Sidow lab and used by the Ensembl browser.