Bison: mitochondrial genomics

Introduction to bison conservation genomics

(to be continued)

Phylogeny: bison and yak are sister groups

(to be continued)

Interpreting bison CYTB variation

Bison mitochondrial genomes became well-represented at GenBank withthe 1 Dec 10 release of 31 complete genomes from 6 herds by the Derr group which includes two woods bison (Bison bison athabascae) from the non-admixed Elk Island herd along with various cow-bison hybrid and cow breed genomes). The cow-bison hybrids represent crossing of a bison male with a domestic cow (or rather a continuous line of female descent from such a cross) and so have strictly cow mitochondrial dna, not relevent to this section.

Bison accession numbers:
GU946976 GU946977 GU946978 GU946979 GU946980 GU946981 GU946982 GU946983 GU946984 GU946985 GU946986
GU946987 GU946988 GU946989 GU946990 GU946991 GU946992 GU946993 GU946994 GU946995 GU946996 GU946997
GU946998 GU946999 GU947000 GU947001 GU947002 GU947003 GU947004 GU947005 GU947006

Interpreting yak CYTB variation

Although the mitochondria encodes the usual 20 amino acids, only a subset of physio-chemically similar residues (the reduced alphabet) ever appear at a given position in a given protein. This subset describes the acceptable substitutions that do not significantly disrupt protein functionality. Discovery of this reduced alphabet can be achieved with greater sensitivity when the number of available species and their individual sequences multiplicities are high. For mitochondrial proteins, that sensitivity is 1 in 10,000 (0.01% occurrence frequency) for a given amino acid.

Interpretive certainty is never attained without experimentation but improves (up to a point) with more sequence data. Here it is important to check whether certain less common substitutions have persisted over evolutionary time in a phylogenetically coherent manner (ie a sub-clade) or are novel adaptations perhaps in conjunction with a co-evolving residue at another site (or another protein, perhaps even nuclear-encoded). After these considerations, the remaining rare changes are either deleterious or sequencing error. Polymorphism significance can be pursued at the xray structural level for only 3 of the 13 mitochondrial proteins (CYTB, COX2, COX1) and even this is complicated in the case of CYTB by its oligomeric association with 3 nuclear encoded proteins.

Aligning CTYB from the 70 complete yak mitochondrial genomes available on 1 Dec 10 shows variation at just 9 sites along the protein (ie 9 nsSNPs). These are quickly found when the web alignment tool retains input sequence order, displays residues identical to the top sequence as dots, gaps fragmentary data correctly, and allows a wide display permitting effective cross-species comparisons.

Yak and bison -- despite being sister species -- share variation only at one site, position 98. Here yak is exclusively valine with the exception of a single deleterious occurrence (see below) of leucine, whereas bison have a mix of valine and alanine (which otherwise is very rare at this position in mammals), ie the ancestral residue was valine. Thus no lineage sorting occurred at any amino acid position in CYTB at the time these species diverged. Lineage sorting however may be important in the overall evolution of the Bovini: 53 ancient polymorphisms (at the dna level) are said to have persisted since Bos and Bison diverged from Bubalus 5–8 million years ago.

The summary table of yak CYTB amino acid polymorphisms below arises from alignment of 5000 full-length mammalian cytochrome b orthologs. Red indicates deleterious mutation, green a possibly acceptable change but of restricted distribution, and blue a near-neutral substitution. It can be seen that the smallish yak population sampled (21 wild, 48 domestic added in Aug 10 to 3-4 previously available) already contains 5 deleterious alleles in CYTB which represents only 10% of the mitochondrial proteome.

The changes can also be displayed in context by coloring the appropriate residues in a reference sequence relative to a composite sequence consolidating all the polymorphisms from distinct animals (no one animal has more than two of the 9; V195A + I348F occurs in two animals). This latter sequence is quite useful in comparing polymorphism sites across species as explained in the annotation tricks section.

>CYTB_bosGruR Bos grunniens cytochrome b ref seq taken as gi|147744503 
MTNIRKSHPLMKIVNNAFIDLPAPSNISSWWNFGSLLGVCLILQILTGLFLAMHYTSDTTTAFSSVAHICRDVNYGWIIRYMHANGASMFFICLYMHVGRGLYYGSYTFLETWNIGVILLLTVMATAFMGYVLPWGQMSF
WGATVITNLLSAIPYIGTNLVEWIWGGFSVDKATLTRFFAFHFILPFIITAIAMVHLLFLHETGSNNPTGISSDADKIPFHPYYTIKDILGALLLILALMLLVLFTPDLLGDPDNYTPANPLNTPPHIKPEWYFLFAYAI
LRSIPNKLGGVLALAFSILILALIPLLHTSKQRSMIFRPLSQCLFWTLVADLLTLTWIGGQPVEHPYIIIGQLASIMYFLLILVLMPTAGTIENKLLKW

>CYTB_bosGruP Bos grunniens composite polymorphisms: A017T A084T V098L I188T I192T V195A D214N V329M I348F
MTNIRKSHPLMKIVNNTFIDLPAPSNISSWWNFGSLLGVCLILQILTGLFLAMHYTSDTTTAFSSVAHICRDVNYGWIIRYMHTNGASMFFICLYMHLGRGLYYGSYTFLETWNIGVTLLLTVMATAFMGYVLPWGQMSF
WGATVITNLLSAIPYIGTNLVEWIWGGFSVDKATLTRFFAFHFILPFIITATAMAHLLFLHETGSNNPTGISSNADKIPFHPYYTIKDILGALLLILALMLLVLFTPDLLGDPDNYTPANPLNTPPHIKPEWYFLFAYAI
LRSIPNKLGGVLALAFSILILALIPLLHTSKQRSMIFRPLSQCLFWTLMADLLTLTWIGGQPVEHPYFIIGQLASIMYFLLILVLMPTAGTIENKLLKW

   A017T       A084T     V098L     I188T     I192T     V195A     D214N     V329M     I348F  
  927  A    4,994  A   4522  V   4309  I     94  I   4528  V   4429  D   4610  V   4232  I
 4018  S        3  T    430  I    667  S   4353  L    427  I    512  N    188  T    651  V
   46  T        1  P     34  M     14  I    505  M     25  T     43  E    133  A     63  T
    3  L        1  V     11  A      1  T     31  T      4  G      8  S     44  I     45  M
    3  M                  1  L                3  F      4  M      2  Y     22  M      4  N
    1  F                  1  N                2  V      1  A      1  H      2  G      2  F
    1  P                                      1  A                          1  E      1  A                       
                                              1  S        
(analysis to be continued) 
A017T
  927  A
 4018  S
   46  T
    3  L
    3  M
    1  F
    1  P

(analysis to be continued)
A084T  
 4,994  A
     3  T
     1  P
     1  V

V098L: At position 98, the reduced alphabet consists of valine 90% of the time regardless of mammalian clade with the similar (branched chain aliphatic) isoleucine having substantial dispersed representation at nearly 9%. The 430 species in which it occurs are scattered incoherently within mammal clades, meaning that it has arisen independently many times. V098I may be slightly suboptimal as there is an evident bias (at some level) against equal occurence. It likely co-exists with valine in most non-bottlenecked populations of mammals, observed if enough individuals of a given species are sequenced.

However leucine, the seemingly similar third aliphatic residue, occurs one once despite being but a single base change transition away from the dominant residue. Were leucine a near-neutral substitution, its incidence would be vastly higher. Thus the change V098L reported for yak represents either a deleterious mutation or an unprecedented adaptation (eg to high altitude) or sequencing error in GenBank entry ACU82101. The same can be said for the more overtly radical change V098N in lemur AAS00156.

V098L
4522	V most common amino acid at position 98 of CYTB
 430	I
  34	M
  11	A bison
   1	L yak
   1	N lemur

(analysis to be continued)
V098L  
 4522  V
  430  I
   34  M
   11  A
    1  L
    1  N

(analysis to be continued)
I188T  
 4309  I
  667  S
   14  I
    1  T

(analysis to be continued)
I192T  
   94  I
 4353  L
  505  M
   31  T
    3  F
    2  V
    1  A
    1  S

(analysis to be continued)
V195A  
 4528  V
  427  I
   25  T
   11  X
    4  G
    4  M
    1  A

(analysis to be continued)
D214N  
 4429  D
  512  N
   43  E
    8  S
    4  X
    2  Y
    1  H

(analysis to be continued)
V329M  
 4610  V
  188  T
  133  A
   44  I
   22  M
    2  G
    1  E

(analysis to be continued)
I348F  
 4232  I
  651  V
   63  T
   45  M
    4  N
    2  F
    1  A

Kilo-sequence alignment tricks

New sequencing technologies have greatly affected the amount of mammalian mitochondrial genomic data available at GenBank. Five years ago, it was acceptable to publish population-level D loop sequences accompanied by a few fragmentary coding reads; today, a publication might offer 60-70 entire mitochondrial genomes. This favors evolutionary study of mitochondrial proteins over comparative genomics of nuclear genome products because the latter is still restricted to around 50 species (Dec 2010) almost all incompletely sequenced.

Many long-standing issues such as introgression, historic bottlenecks, population mixing, accrual of deleterious coding variants, hard polytomies, and lineage sorting during speciation can now be approached and resolved, especially with the increasing sequencing of end-Pleistocene frozen dna. This may allow more enlightened management of endangered species such as bison where populations reached rock bottom -- recovering numbers is not enough if genomic integrity is still at risk.

However, the flood of data raises significant issues in extraction of significant information: it is not instructive to align the tens of thousands of sequences available for each of 13 mitochondrial proteins -- that give a an intractable array of 3789 amino acids by 12500 sequences, enough to fill 20 x 100 = 2000 screens on the largest possible computer monitor. That data must be distilled down somehow to take-away information.

This section explains a practical desktop protocol for extracting the 'reduced phylogenetic alphabet' at each residue of the mitochondrial proteome. The method depends heavily on current capabilities of Blastp at NCBI and so may not be completely stable to changes made there over time.

First note that tBlastn cannot be used against the nr or wgs nucleotide databases at NCBI (or with Blat at UCSC) since the significantly different genetic code of mammalian mitochondria is no longer supported as a parameter option. Other oddities involve missing terminal nucleotides that are added before translation. However mitochondrial dna is usually translated sensibly at GenBank protein entries.

The vertebrate mitochondrial code:

TTT F Phe      TCT S Ser      TAT Y Tyr      TGT C Cys  
TTC F Phe      TCC S Ser      TAC Y Tyr      TGC C Cys  
TTA L Leu      TCA S Ser      TAA * Ter      TGA W Trp  
TTG L Leu      TCG S Ser      TAG * Ter      TGG W Trp  

CTT L Leu      CCT P Pro      CAT H His      CGT R Arg  
CTC L Leu      CCC P Pro      CAC H His      CGC R Arg  
CTA L Leu      CCA P Pro      CAA Q Gln      CGA R Arg  
CTG L Leu      CCG P Pro      CAG Q Gln      CGG R Arg  

ATT I Ile      ACT T Thr      AAT N Asn      AGT S Ser  
ATC I Ile i    ACC T Thr      AAC N Asn      AGC S Ser  
ATA M Met i    ACA T Thr      AAA K Lys      AGA * Ter  Bos can use ATA as initiation codon
ATG M Met i    ACG T Thr      AAG K Lys      AGG * Ter  

GTT V Val      GCT A Ala      GAT D Asp      GGT G Gly  
GTC V Val      GCC A Ala      GAC D Asp      GGC G Gly  
GTA V Val      GCA A Ala      GAA E Glu      GGA G Gly  
GTG V Val i    GCG A Ala      GAG E Glu      GGG G Gly  

    AAs  = FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG
  Start  = --------------------------------MMMM---------------M------------
  Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
  Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
  Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

Bison: mitochondrial genomics

Contents

Introduction to bison conservation genomics

Phylogeny: bison and yak are sister groups

Interpreting bison CYTB variation

Interpreting yak CYTB variation

Kilo-sequence alignment tricks

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

related sites

hosted projects

Tools