Bison: mitochondrial genomics

From genomewiki
Revision as of 15:27, 1 December 2010 by Tomemerald (talk | contribs) (New page: == Introduction to bison conservation genomics == (to be continued) === Phylogeny: bison and yak are sister groups === (to be continued) === Interpreting bison CYTB variation === (to...)

Introduction to bison conservation genomics

(to be continued)

Phylogeny: bison and yak are sister groups

(to be continued)


Interpreting bison CYTB variation

(to be continued)

Interpreting yak CYTB variation

Although the mitochondrial genome encodes proteins built from the usual 20 amino acids, only a subset of physico-chemically similar residues (the reduced alphabet) ever appears at a given position in a given protein. This subset describes the acceptable substitutions that do not significantly disrupt protein function. The reduced alphabet can be discovered with greater sensitivity when many species are available and each is represented by many individual sequences. For mitochondrial proteins, that sensitivity is 1 in 10,000 (0.01% occurrence frequency) for a given amino acid.
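As a rough illustration of how a reduced alphabet might be tallied from one column of a protein alignment, here is a minimal Python sketch. The function name `reduced_alphabet`, the `min_fraction` parameter, and the toy column are assumptions made for illustration only; the default cutoff simply mirrors the 1 in 10,000 sensitivity mentioned above.

```python
from collections import Counter

def reduced_alphabet(column, min_fraction=0.0001):
    """Residues seen in one alignment column at or above min_fraction.

    column: string of one-letter residues, one per aligned sequence.
    Gap ('-') and unknown ('X') characters are ignored (an assumption).
    Returns (residue, count) pairs, most frequent first.
    """
    counts = Counter(r for r in column if r not in "-X")
    total = sum(counts.values())
    if total == 0:
        return []
    kept = [(r, n) for r, n in counts.items() if n / total >= min_fraction]
    # sorted() is stable, so ties keep their first-seen order
    return sorted(kept, key=lambda item: -item[1])

# Toy column: 8 aligned sequences, one of them gapped at this position
toy_column = "VVVVIVA-"
print(reduced_alphabet(toy_column, min_fraction=0.1))
print(reduced_alphabet(toy_column, min_fraction=0.2))
```

With thousands of real sequences the default cutoff keeps even singletons, so the interesting step is inspecting the rare residues by hand rather than discarding them automatically.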

Interpretive certainty is never attained without experimentation but improves (up to a point) with more sequence data. Here it is important to check whether certain less common substitutions have persisted over evolutionary time in a phylogenetically coherent manner (i.e. within a sub-clade) or are novel adaptations, perhaps in conjunction with a co-evolving residue at another site (or in another protein, perhaps even a nuclear-encoded one). After these considerations, the remaining rare changes are either deleterious or sequencing errors. Polymorphism significance can be pursued at the X-ray structural level for only 3 of the 13 mitochondrial proteins (CYTB, COX2, COX1), and even this is complicated in the case of CYTB by its oligomeric association with 3 nuclear-encoded proteins.

Aligning CYTB from the 70 complete yak mitochondrial genomes available on 1 Dec 2010 shows variation at just 9 sites along the protein (i.e. 9 nsSNPs). These are quickly found when the web alignment tool retains input sequence order, displays residues identical to the top sequence as dots, handles gaps in fragmentary data correctly, and allows a wide display permitting effective cross-species comparisons.
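The dot-style display and the search for variable sites can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sequences are already aligned to equal length, gaps are written as '-', and the function names and toy sequences are made up for the example rather than taken from any particular alignment tool.

```python
def variable_sites(seqs):
    """1-based positions where aligned sequences differ, ignoring '-' and 'X'."""
    sites = []
    for pos, column in enumerate(zip(*seqs), start=1):
        residues = {r for r in column if r not in "-X"}
        if len(residues) > 1:
            sites.append(pos)
    return sites

def dot_display(seqs):
    """Keep the top sequence as is; show matches to it in later rows as dots."""
    top = seqs[0]
    rows = [top]
    for seq in seqs[1:]:
        rows.append("".join("." if a == b else b for a, b in zip(top, seq)))
    return rows

# Toy alignment (hypothetical data): input order is preserved, top row is the reference
toy = ["MTNIRKSHPL",
       "MTNIRKTHPL",
       "MTN-RKSHPM"]
print(variable_sites(toy))
for row in dot_display(toy):
    print(row)
```

Gapped positions are printed as '-' but are not counted as variation, so fragmentary entries do not create spurious polymorphic sites.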

Yak and bison, despite being sister species, share variation at only one site, position 98. Here yak is exclusively valine with the exception of a single deleterious occurrence (see below) of leucine, whereas bison have a mix of valine and alanine (which is otherwise very rare at this position in mammals), i.e. the ancestral residue was valine. Thus no lineage sorting occurred at any amino acid position in CYTB at the time these species diverged. Lineage sorting may however be important in the overall evolution of the Bovini: 53 ancient polymorphisms (at the DNA level) are said to have persisted since Bos and Bison diverged from Bubalus 5–8 million years ago.

The summary table of yak CYTB amino acid polymorphisms below arises from alignment of 5000 full-length mammalian cytochrome b orthologs. Red indicates a deleterious mutation, green a possibly acceptable change of restricted distribution, and blue a near-neutral substitution. The smallish yak population sampled (73 animals: 21 wild, 52 domestic) contains 5 deleterious alleles in CYTB, a protein that represents only 10% of the mitochondrial proteome.

   A017T      A084T      V098L      I188T      I192T      V195A      D214N      V329M      I348F
  927  A    4994  A    4522  V    4309  I      94  I    4528  V    4429  D    4610  V    4232  I
 4018  S       3  T     430  I     667  S    4353  L     427  I     512  N     188  T     651  V
   46  T       1  P      34  M      14  I     505  M      25  T      43  E     133  A      63  T
    3  L       1  V      11  A       1  T      31  T       4  G       8  S      44  I      45  M
    3  M                  1  L                  3  F       4  M       2  Y      22  M       4  N
    1  F                  1  N                  2  V       1  A       1  H       2  G       2  F
    1  P                                        1  A                             1  E       1  A
                                                1  S
(analysis to be continued) 
A017T
  927  A
 4018  S
   46  T
    3  L
    3  M
    1  F
    1  P

(analysis to be continued)
A084T
 4994  A
    3  T
    1  P
    1  V

At position 98 of the 379 residues of mitochondrially encoded CYTB (cytochrome b), using the top 5000 blastp matches to a yak query (all full-length orthologs), the reduced alphabet consists of valine 90% of the time regardless of mammalian clade, with the similar (branched-chain aliphatic) isoleucine having substantial representation at nearly 9%. The 430 species in which isoleucine occurs are scattered incoherently within mammal clades, meaning that it has arisen independently many times. V098I may be slightly suboptimal, as there is an evident bias (at some level) against equal occurrence. It likely co-exists with valine in most non-bottlenecked populations of mammals and would be observed if enough individuals of a given species were sequenced.

However leucine, the seemingly similar third aliphatic residue, occurs only once despite being just a single base change (a first-position transversion) away from the dominant residue. Were leucine a near-neutral substitution, its incidence would be vastly higher. Thus the change V098L reported for yak represents a deleterious mutation, an unprecedented adaptation (e.g. to high altitude), or a sequencing error in GenBank entry ACU82101. The same can be said for the more overtly radical change V098N in lemur AAS00156.

V098L
4522	V most common amino acid at position 98 of CYTB
 430	I
  34	M
  11	A bison
   1	L yak
   1	N lemur
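The percentages quoted for position 98 follow directly from the counts in the table above. A small Python sketch of that arithmetic is given here; the function name and the choice to flag residues seen exactly once are illustrative assumptions, not part of the original protocol.

```python
def residue_frequencies(counts):
    """Convert residue counts at one site into percentages of the total."""
    total = sum(counts.values())
    return {residue: 100.0 * n / total for residue, n in counts.items()}

# Counts at CYTB position 98, copied from the V098L table above
pos98 = {"V": 4522, "I": 430, "M": 34, "A": 11, "L": 1, "N": 1}

total = sum(pos98.values())
freqs = residue_frequencies(pos98)
for residue, pct in freqs.items():
    print(f"{residue}  {pos98[residue]:5d}  {pct:7.3f}%")

# Residues observed only once are the candidates for deleterious change,
# novel adaptation, or sequencing error discussed in the text
singletons = [r for r, n in pos98.items() if n == 1]
print("total sequences:", total)
print("singletons:", singletons)
```

The six counts sum to 4999, so valine comes out at roughly 90% and isoleucine at a little under 9%, in line with the figures given in the prose.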

(analysis to be continued)
I188T  
 4309  I
  667  S
   14  I
    1  T

(analysis to be continued)
I192T  
   94  I
 4353  L
  505  M
   31  T
    3  F
    2  V
    1  A
    1  S

(analysis to be continued)
V195A  
 4528  V
  427  I
   25  T
   11  X
    4  G
    4  M
    1  A

(analysis to be continued)
D214N  
 4429  D
  512  N
   43  E
    8  S
    4  X
    2  Y
    1  H

(analysis to be continued)
V329M  
 4610  V
  188  T
  133  A
   44  I
   22  M
    2  G
    1  E

(analysis to be continued)
I348F  
 4232  I
  651  V
   63  T
   45  M
    4  N
    2  F
    1  A

Kilo-sequence alignment tricks

New sequencing technologies have greatly increased the amount of mammalian mitochondrial genomic data available at GenBank. Five years ago, it was acceptable to publish population-level D-loop sequences accompanied by a few fragmentary coding reads; today, a publication might offer 60-70 entire mitochondrial genomes. This favors evolutionary study of mitochondrial proteins over comparative genomics of nuclear genome products, because the latter is still restricted to around 50 species (Dec 2010), almost all incompletely sequenced.

Many long-standing issues such as introgression, historic bottlenecks, population mixing, accrual of deleterious coding variants, hard polytomies, and lineage sorting during speciation can now be approached and resolved, especially with the increasing sequencing of end-Pleistocene frozen DNA. This may allow more enlightened management of endangered species such as bison, whose populations reached rock bottom: recovering numbers is not enough if genomic integrity is still at risk.

However, the flood of data raises significant issues in extracting meaningful information: it is not instructive to align the tens of thousands of sequences available for each of the 13 mitochondrial proteins. That gives an intractable array of 3789 amino acids by 12500 sequences, enough to fill 20 x 100 = 2000 screens on the largest possible computer monitor. That data must be distilled down somehow to take-away information.

This section explains a practical desktop protocol for extracting the 'reduced phylogenetic alphabet' at each residue of the mitochondrial proteome. The method depends heavily on current capabilities of Blastp at NCBI and so may not be completely stable to changes made there over time.

First note that tBlastn cannot be used against the nr or wgs nucleotide databases at NCBI (nor can Blat at UCSC), since the significantly different genetic code of mammalian mitochondria is no longer supported as a parameter option. Other oddities involve missing terminal nucleotides that are added before translation. However, mitochondrial DNA is usually translated sensibly in GenBank protein entries.

The vertebrate mitochondrial code:

TTT F Phe      TCT S Ser      TAT Y Tyr      TGT C Cys  
TTC F Phe      TCC S Ser      TAC Y Tyr      TGC C Cys  
TTA L Leu      TCA S Ser      TAA * Ter      TGA W Trp  
TTG L Leu      TCG S Ser      TAG * Ter      TGG W Trp  

CTT L Leu      CCT P Pro      CAT H His      CGT R Arg  
CTC L Leu      CCC P Pro      CAC H His      CGC R Arg  
CTA L Leu      CCA P Pro      CAA Q Gln      CGA R Arg  
CTG L Leu      CCG P Pro      CAG Q Gln      CGG R Arg  

ATT I Ile i    ACT T Thr      AAT N Asn      AGT S Ser  
ATC I Ile i    ACC T Thr      AAC N Asn      AGC S Ser  
ATA M Met i    ACA T Thr      AAA K Lys      AGA * Ter  (i = initiation codon; Bos can use ATA as initiation codon)
ATG M Met i    ACG T Thr      AAG K Lys      AGG * Ter  

GTT V Val      GCT A Ala      GAT D Asp      GGT G Gly  
GTC V Val      GCC A Ala      GAC D Asp      GGC G Gly  
GTA V Val      GCA A Ala      GAA E Glu      GGA G Gly  
GTG V Val i    GCG A Ala      GAG E Glu      GGG G Gly  

    AAs  = FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG
  Start  = --------------------------------MMMM---------------M------------
  Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
  Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
  Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
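When a nucleotide entry has to be translated locally, the compact AAs/Base1/Base2/Base3 lines above are enough to build a lookup table. The following Python sketch does this; the function name `translate_mito` and the test sequences are made up for illustration. It assumes an in-frame coding sequence on the given strand and does not attempt to restore missing terminal nucleotides, so any trailing partial codon is simply ignored.

```python
# The four lines of the vertebrate mitochondrial code table, copied from above
AAS   = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
BASE1 = "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG"
BASE2 = "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG"
BASE3 = "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"

# Map each of the 64 codons to its one-letter amino acid ('*' = stop)
CODON_TABLE = {b1 + b2 + b3: aa
               for b1, b2, b3, aa in zip(BASE1, BASE2, BASE3, AAS)}

def translate_mito(dna):
    """Translate an in-frame DNA string with the vertebrate mitochondrial code.

    Codons containing anything other than A, C, G, T become 'X';
    a trailing partial codon is dropped.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, usable, 3))

# TGA reads as tryptophan here, ATA as methionine, and AGA as a stop
print(translate_mito("ATGACCAACATTCGAAAATGA"))
print(translate_mito("ATAAGA"))
```

The same input translated with the standard nuclear code would stop at TGA and read ATA as isoleucine, which is why a tool lacking this table mistranslates mitochondrial genes.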