Human/hg19/GRCh37 46-way multiple alignment

From genomewiki
Revision as of 18:47, 7 December 2009 by Hiram (talk | contribs) (adding corrected trees)

The 46 species multiple alignment on human/hg19/GRCh37 is an extra large bit of work. A discussion of the phylogenetic trees used in the alignment is included here.

Errata

The initial release of this track included a phylogenetic tree (46way.nh) that had two small errors in it.
Namely: Baboon (papHam1) and Rhesus (rheMac2) were specified as separate nodes instead of, correctly, as sister species. The same problem was present for Wallaby (macEug1) and Opossum (monDom5). The discussion below includes a corrected phylogenetic tree.

Corrected Tree

1. all 46 species:

(((((((((((((((((hg19,panTro2),gorGor1), ponAbe2), (rheMac2,papHam1)),
calJac1),tarSyr1), (micMur1,otoGar1)), tupBel1),(((((mm9,rn4),
dipOrd1),cavPor3), speTri1), (oryCun2,ochPri2))),
(((vicPac1,(turTru1,bosTau4)), ((equCab2,(felCat3,canFam2)),
(myoLuc1,pteVam1))), (eriEur1,sorAra1))), (((loxAfr3,proCap1),echTel1),
(dasNov2,choHof1))), (monDom5,macEug1)),ornAna1), ((galGal3,taeGut1),
anoCar1)),xenTro2), (((tetNig2,fr2), (gasAcu1,oryLat2)), danRer6)),petMar1)

2. placental only subset:

(((((((((((hg19,panTro2),gorGor1),ponAbe2),(rheMac2,papHam1)),calJac1),
tarSyr1),(micMur1,otoGar1)),tupBel1),(((((mm9,rn4),dipOrd1),
cavPor3),speTri1),(oryCun2,ochPri2))),(((vicPac1,(turTru1,bosTau4)),
((equCab2,(felCat3,canFam2)),(myoLuc1,pteVam1))),(eriEur1,sorAra1))),
(((loxAfr3,proCap1),echTel1),(dasNov2,choHof1)))

3. primate only subset:

(((((((hg19,panTro2),gorGor1),ponAbe2),(rheMac2,papHam1)),calJac1),tarSyr1),(micMur1,otoGar1))
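The two errata can be checked mechanically on any of the trees on this page. The following is a minimal sketch using only standard shell tools; the function names newick_topology and are_sisters are made up for this illustration, and the sed pattern assumes branch lengths written as plain decimals (no scientific notation).

```shell
# Reduce a Newick tree on stdin to its bare topology:
# whitespace and ":length" annotations are removed.
newick_topology() {
    tr -d ' \t\n' | sed -e 's/:[0-9.]*//g'
}

# Succeed if the two named species form a two-leaf clade, i.e. appear
# as "(A,B)" or "(B,A)" in the topology of the tree on stdin.
are_sisters() {
    topo=$(newick_topology)
    case "$topo" in
        *"($1,$2)"*|*"($2,$1)"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `are_sisters rheMac2 papHam1 < tree.all.nh` and `are_sisters monDom5 macEug1 < tree.all.nh` should both succeed on a corrected tree.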

As an experiment, I reran the entire multiple alignment with the corrected tree and recalculated the phastCons and phyloP tracks. There are slight differences in the resulting multiple alignment, but nothing significant. It may be interesting to construct a difference track marking the places where the phastCons and phyloP tracks do differ.

4D sites branch length calculations

Branch lengths were estimated by taking a subset of the refSeq track:

hgsql hg19 -Ne \
   "select * from refGene,refSeqStatus where refGene.name=refSeqStatus.mrnaAcc
         and refSeqStatus.status='Reviewed' and mol='mRNA'" \
   | cut -f 2-20 | egrep -v "chrM|chrUn|random|_hap|chrX" \
   | genePredSingleCover stdin stdout | sort > refSeqReviewedNR.gp

This restricts the gene set to the following chromosomes only:

chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9
chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17
chr18 chr19 chr20 chr21 chr22 chrY
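To see why only these names remain, the exclusion pattern from the hgsql pipeline can be applied to a list of sequence names. This is a small sketch; filter_chroms is a name invented here, and the sample names in the usage note are only examples of the hg19 naming styles the pattern is meant to catch.

```shell
# Drop every line that mentions an excluded sequence:
# chrM, chrUn_*, *_random, *_hap* and chrX.
filter_chroms() {
    grep -E -v 'chrM|chrUn|random|_hap|chrX'
}
```

For instance, `printf '%s\n' chr1 chrX chrY chr6_cox_hap2 | filter_chroms` keeps chr1 and chrY. Note that in the pipeline the pattern is matched against the whole genePred line, not just the chromosome column.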

This used the 2009-10-21 version of the PHAST package.

For each chromosome c in the above list, run msa_view on the corresponding $c.maf from the multiz alignment:

awk -v C=$c '$2 == C {print}' refSeqReviewedNR.gp > $c.gp
msa_view --4d --features $c.gp --do-cats 3 -i MAF $c.maf -o SS > $c.ss
msa_view -i SS --tuple-size 1 $c.ss > mfa/$c.mfa
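The per-chromosome steps can be wrapped in a function so they are easy to run in a loop. This is only a sketch: extract_4d is a name made up here, the output name mfa/$c.mfa is a choice of this sketch, and it assumes refSeqReviewedNR.gp and the $c.maf files sit in the current directory with msa_view on the PATH.

```shell
# Run the 4D-site extraction for one chromosome, e.g. "extract_4d chr1".
extract_4d() {
    c=$1
    mkdir -p mfa
    # Genes on this chromosome only (column 2 of the genePred is the chromosome).
    awk -v C="$c" '$2 == C' refSeqReviewedNR.gp > "$c.gp"
    # Extract 4D codons as sufficient statistics (SS format) ...
    msa_view --4d --features "$c.gp" --do-cats 3 -i MAF "$c.maf" -o SS > "$c.ss" || return 1
    # ... then reduce the codon tuples to single 4D sites.
    msa_view -i SS --tuple-size 1 "$c.ss" > "mfa/$c.mfa" || return 1
}
```

A loop such as `for c in chr1 chr2 chr3; do extract_4d "$c" || break; done` would then stop at the first chromosome that fails.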

Then, putting all the chr*.mfa files together:

msa_view  --aggregate `cat species.lst` mfa/*.mfa | sed s/"> "/">"/ > all.mfa
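The contents of species.lst are not shown here. Assuming it holds the species names as a single comma-separated list, which is the form the PHAST documentation describes for the --aggregate argument, one way to produce such a list is to pull the leaf names out of the tree file. A sketch, with species_csv being a name invented for this example:

```shell
# Print the leaf names of the Newick tree on stdin as one
# comma-separated line, in the order they appear in the tree.
species_csv() {
    tr -d ' \t\n' | tr '(),;' '\n\n\n\n' | sed -e 's/:.*//' \
        | grep -v '^$' | paste -s -d, -
}
```

Something like `species_csv < tree.all.nh > species.lst` would then give a list that matches the tree passed to phyloFit.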

Using phyloFit to construct a tree model:

phyloFit --EM --precision MED --msa-format FASTA --subst-mod REV --tree tree.all.nh all.mfa

Adjust the frequencies back to a genome-wide GC percent of 0.41 with:

modFreqs phyloFit.mod 0.41 > vertebrate.mod
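A quick sanity check on the adjusted model is to read the background frequencies back and add up G and C. The sketch below assumes the usual .mod layout with a line of the form "BACKGROUND: fA fC fG fT" (alphabet order A C G T); mod_gc is a name made up for this example.

```shell
# Print the G+C fraction implied by the BACKGROUND line of a .mod file.
# Fields after "BACKGROUND:" are assumed to be A, C, G, T in that order.
mod_gc() {
    awk '/^BACKGROUND:/ { printf "%.3f\n", $3 + $4; exit }' "$1"
}
```

Running `mod_gc vertebrate.mod` should then report a value close to 0.41, while `mod_gc phyloFit.mod` shows the GC fraction of the 4D sites before adjustment.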

The resulting tree is for all 46 species. The same procedure was run for the primate and placental subsets, and all three were repeated using only chrX. The resulting six trees are listed below.

Trees without chrX

1. primate only subset

(((((((hg19:0.006036,panTro2:0.006817):0.002265,gorGor1:0.008230):0.008764,
ponAbe2:0.016911):0.012657,(rheMac2:0.006959,papHam1:0.006378):0.027277):0.019382,
calJac1:0.064230):0.054029,tarSyr1:0.132418):0.018980,
(micMur1:0.087093,otoGar1:0.129282):0.018980);

2. placental only subset

(((((((((((hg19:0.006024,panTro2:0.006789):0.002262,gorGor1:0.008208):0.008765,
ponAbe2:0.016837):0.012535,(rheMac2:0.006872,
papHam1:0.006439):0.027303):0.019859,calJac1:0.063414):0.055057,
tarSyr1:0.129930):0.009838,(micMur1:0.085989,
otoGar1:0.128778):0.033658):0.012238,tupBel1:0.194945):0.004481,
(((((mm9:0.086189,rn4:0.089247):0.195157,dipOrd1:0.208591):0.023521,
cavPor3:0.217198):0.010584,speTri1:0.151978):0.024663,(oryCun2:0.114630,
ochPri2:0.198507):0.094820):0.012067):0.019462,(((vicPac1:0.107714,
(turTru1:0.063700,bosTau4:0.114890):0.023250):0.035925,
((equCab2:0.104886,(felCat3:0.103989,canFam2:0.108561):0.053212):0.005425,
(myoLuc1:0.163463,pteVam1:0.131116):0.042793):0.004204):0.010077,
(eriEur1:0.219045,sorAra1:0.286631):0.054487):0.020551):0.011417,
(((loxAfr3:0.076308,proCap1:0.143785):0.026717,echTel1:0.221431):0.042789,
(dasNov2:0.110105,choHof1:0.085867):0.045250):0.011417);

3. all 46 species

(((((((((((((((((hg19:0.006700,panTro2:0.006667):0.002250,
gorGor1:0.008825):0.009680,ponAbe2:0.018318):0.014340,
(rheMac2:0.007853,papHam1:0.007637):0.029618):0.021965,
calJac1:0.066131):0.057590,tarSyr1:0.137823):0.011062,(micMur1:0.092749,
otoGar1:0.129725):0.035463):0.015494,tupBel1:0.186203):0.004937,
(((((mm9:0.084509,rn4:0.091589):0.197773,dipOrd1:0.211609):0.022992,
cavPor3:0.225629):0.010150,speTri1:0.148468):0.025746,(oryCun2:0.114227,
ochPri2:0.201069):0.101463):0.015313):0.020593,(((vicPac1:0.107275,
(turTru1:0.064688,bosTau4:0.123592):0.025153):0.040335,((equCab2:0.109397,
(felCat3:0.098612,canFam2:0.102458):0.049845):0.006219,(myoLuc1:0.142540,
pteVam1:0.113399):0.033706):0.004508):0.011671,(eriEur1:0.221785,
sorAra1:0.269562):0.056393):0.021227):0.023664,(((loxAfr3:0.082242,
proCap1:0.155358):0.026990,echTel1:0.245936):0.049697,
(dasNov2:0.116664,choHof1:0.096357):0.053145):0.006717):0.234728,
(monDom5:0.125686,macEug1:0.122008):0.215100):0.071664,
ornAna1:0.456592):0.109504,((galGal3:0.165536,taeGut1:0.171542):0.199223,
anoCar1:0.489241):0.105143):0.172371,xenTro2:0.855573):0.311354,
(((tetNig2:0.224159,fr2:0.203847):0.195181,(gasAcu1:0.316413,
oryLat2:0.481970):0.059150):0.325640,danRer6:0.730752):0.147949):0.526688,
petMar1:0.526688);

Trees on chrX only

1. primate only subset

(((((((hg19:0.003917,panTro2:0.005184):0.002146,gorGor1:0.008108):0.007057,
ponAbe2:0.015569):0.013208,(rheMac2:0.004711,
papHam1:0.004180):0.023970):0.018430,calJac1:0.058028):0.053927,
tarSyr1:0.096237):0.019719,(micMur1:0.074162,otoGar1:0.118457):0.019719);

2. placental only subset

(((((((((((hg19:0.003913,panTro2:0.005197):0.002196,gorGor1:0.008068):0.006904,
ponAbe2:0.015655):0.013435,(rheMac2:0.004704,
papHam1:0.004228):0.023895):0.019027,calJac1:0.057101):0.054600,
tarSyr1:0.096642):0.013924,(micMur1:0.074221,
otoGar1:0.117211):0.029832):0.013436,tupBel1:0.153211):0.002109,
(((((mm9:0.063891,rn4:0.066094):0.167668,dipOrd1:0.175669):0.023604,
cavPor3:0.171594):0.005607,speTri1:0.125382):0.026739,(oryCun2:0.083723,
ochPri2:0.168135):0.075730):0.008263):0.019786,(((vicPac1:0.081343,
(turTru1:0.056118,bosTau4:0.102627):0.021578):0.029857,((equCab2:0.087934,
(felCat3:0.097379,canFam2:0.091434):0.043427):0.006482,(myoLuc1:0.117475,
pteVam1:0.106041):0.027660):0.003592):0.010126,(eriEur1:0.212782,
sorAra1:0.234802):0.043917):0.021099):0.013238,(((loxAfr3:0.066021,
proCap1:0.128897):0.023867,echTel1:0.212319):0.046738,
(dasNov2:0.101972,choHof1:0.102076):0.046533):0.013238);

3. all 46 species

(((((((((((((((((hg19:0.003795,panTro2:0.005196):0.002735,
gorGor1:0.008038):0.006805,ponAbe2:0.015616):0.013581,
(rheMac2:0.004670,papHam1:0.004195):0.023755):0.019288,
calJac1:0.056549):0.053796,tarSyr1:0.095597):0.014122,
(micMur1:0.074107,otoGar1:0.115630):0.029705):0.012941,
tupBel1:0.151971):0.002729,(((((mm9:0.063333,rn4:0.065446):0.163518,
dipOrd1:0.172634):0.023201,cavPor3:0.169031):0.005188,
speTri1:0.123749):0.026667,(oryCun2:0.083602,
ochPri2:0.165502):0.074856):0.008279):0.018611,
(((vicPac1:0.080894,(turTru1:0.056486,bosTau4:0.101570):0.021353):0.030026,
((equCab2:0.087517,(felCat3:0.097110,canFam2:0.090575):0.043182):0.006555,
(myoLuc1:0.116436,pteVam1:0.105246):0.027654):0.003743):0.010755,
(eriEur1:0.209272,sorAra1:0.229678):0.042690):0.021348):0.019835,
(((loxAfr3:0.066090,proCap1:0.127321):0.023966,echTel1:0.208813):0.045208,
(dasNov2:0.100868,choHof1:0.100549):0.045734):0.008854):0.264749,
(monDom5:0.131898,macEug1:0.143442):0.220509):0.085972,
ornAna1:0.483509):0.097702,((galGal3:0.177448,taeGut1:0.162788):0.246554,
anoCar1:0.557249):0.117029):0.144329,xenTro2:0.880060):0.301475,
(((tetNig2:0.229555,fr2:0.206494):0.159973,(gasAcu1:0.307725,
oryLat2:0.476387):0.072242):0.286820,danRer6:0.792960):0.189983):0.438700,
petMar1:0.438700);
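To compare individual branches between the trees, for example the hg19 leaf at 0.006700 in the all-species tree without chrX against 0.003795 in the chrX-only tree, the length of a named leaf can be pulled out with standard tools. A sketch, assuming each tree has been saved to its own file; leaf_length and the file names in the usage note are hypothetical.

```shell
# Print the branch length of one leaf in the Newick tree on stdin.
# $1 is the species name, e.g. hg19.
leaf_length() {
    tr -d ' \t\n' | tr '(),;' '\n\n\n\n' \
        | awk -F: -v s="$1" '$1 == s { print $2 }'
}
```

With the trees saved as, say, all.noX.nh and all.chrX.nh, `leaf_length hg19 < all.noX.nh` and `leaf_length hg19 < all.chrX.nh` print the two values side by side for comparison.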

Multiple Trees

For the phyloP/phastCons calculations, there are a number of trees that were used.

There is a set of trees with branch lengths calculated based only on the ordinary chromosomes without chrX, and a set of trees calculated based only on chrX.

Within those two categories, there are three trees with branch lengths calculated from subsets of the 46 species:

  1. primate subset only
  2. placental mammal subset only
  3. all 46 vertebrates

Thus, there are six different phylogenetic trees.