Opsin evolution: informative indels: Difference between revisions

From genomewiki
Revision as of 00:04, 7 December 2009

Introduction to indels

Insertions and deletions of amino acids (together called coding indels) are a class of genetic event rarely fixed in conserved protein sequence regions. It is not immediately clear whether a given indel represents an insertion or a deletion. The process of deciding is called indel resolution; it requires a phylogenetic tree allowing determination of ancestral length. If outgroups are consistently short, then by parsimony the ingroup clade with longer length experienced an insertion. Indels are unresolvable when outgroup data are not available. Two or more consistent outgroup nodes establish a period of length stability.
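
The outgroup rule just described can be written down in a few lines. What follows is only a minimal sketch under stated assumptions: the function name resolve_indel and its arguments are hypothetical, lengths are taken as already measured between fixed flanking residues, and any disagreement among outgroups is simply reported as unresolvable.

```python
def resolve_indel(ingroup_len, outgroup_lens):
    """Classify an ingroup length difference by simple outgroup parsimony.

    ingroup_len   -- length of the region in the ingroup clade
    outgroup_lens -- lengths of the same region in successive outgroups
    (both arguments are hypothetical inputs, not data from this page)
    """
    if not outgroup_lens:
        return "unresolvable"      # no outgroup data available
    if len(set(outgroup_lens)) != 1:
        return "unresolvable"      # outgroups disagree on ancestral length
    ancestral = outgroup_lens[0]
    if ingroup_len > ancestral:
        return "insertion"
    if ingroup_len < ancestral:
        return "deletion"
    return "no change"


# Two consistently short outgroups imply an insertion in the longer ingroup.
example = resolve_indel(12, [10, 10])
```

A single outgroup still yields an answer here; requiring two or more consistent outgroups before trusting it would be a stricter variant of the same rule.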

It is implausible -- rarity cubed -- that multiple outgroups plus an ingroup experienced independent deletions of the same length at the same site (though the exact site can be difficult to evaluate if flanking residues were also affected by the original genetic event or subsequently by accelerated compensatory mutation). Advanced statistical methods can provide only illusory gains over simple parsimony because the underlying required models of indel formation are entirely speculative.

Nonetheless, examples of homoplasy are easy to come by, especially in repetitive nucleotide regions encoding runs of compositionally simple amino acids subject to the mutational mechanism of replication slippage. Homoplasy at longer time scales manifests itself by incoherent distribution over a known phylogenetic tree. Convergent evolution can also be driven by selective advantage for altered length.
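
One way to make "incoherent distribution over a known tree" concrete is to count the minimum number of length changes the tree requires. The sketch below uses Fitch parsimony, a standard technique brought in here for illustration rather than a method named on this page. The taxon labels, tree shape and length states are all hypothetical placeholders. A count above one for a two-state site is the signature of homoplasy.

```python
def fitch_changes(tree, states):
    """Return (candidate_states, min_changes) for a rooted binary tree.

    tree   -- a leaf name (str) or a 2-tuple of subtrees
    states -- dict mapping each leaf name to its observed length state
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_changes(tree[0], states)
    right_set, right_cost = fitch_changes(tree[1], states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # Children disagree: one change must have occurred on a child branch.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical five-taxon tree: ((A,B),(C,D)) with outgroup E.
tree = ((("A", "B"), ("C", "D")), "E")

# Coherent: the short state is confined to one clade (a single event).
coherent = {"A": "short", "B": "short", "C": "long", "D": "long", "E": "long"}
# Scattered: the short state appears in two separate clades (homoplasy).
scattered = {"A": "short", "B": "long", "C": "short", "D": "long", "E": "long"}

coherent_changes = fitch_changes(tree, coherent)[1]
scattered_changes = fitch_changes(tree, scattered)[1]
```

The count treats every length change as equally costly, which is a simplification.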

Indels occur very unevenly across the length of a given protein homology class. The rate might be high in terminal regions if the amino or carboxy termini are unimportant to the fold or function of the mature protein. Within folded regions of soluble proteins, indels are greatly concentrated in loop regions of the 3D structure where a change in length can be accommodated without structural disruption. The distribution of indels even allows prediction of loop regions.

For integral membrane proteins such as GPCR, deletions are very rarely fixed in the transmembrane helical regions because a shortened length would no longer span the membrane at the same angle, thus pulling in inappropriate non-hydrophobic residues from soluble loops. Insertions too are rare because they push hydrophobic and boundary turn residues out into soluble compartments and distort connecting loops, perhaps altering insertion angles of adjacent transmembrane regions. Such mutations arise frequently enough but are rarely fixed at the population level or hang on as balanced alleles over timescales commensurate with ordinal speciations.

In massively expanded gene families such as GPCR, a coherently fixed indel in one descendent clade of the gene tree suggests adaptive sub- or neo-functionalisation: if the indel were merely tolerated as near-neutral change, over geological timescales homoplasy at that site would occur. A remarkable site in transmembrane helix 2 was proposed in May 2009:

'Class A GPCR constitute a large family of transmembrane receptors. Helical distortions play a major role in the overall fold of these receptors. Most are related to conserved proline residues. However, in transmembrane helix 2, the proline pattern is not conserved, and when present, proline may be located at position TM 2.58, 2.59, or 2.60 yielding a bulged structure in P2.59 and P2.60 receptors or a more typical proline kink in P2.58 receptors. The proline pattern of helix 2 can be used as an evolutionary marker of molecular divergence of class A GPCRs.

At this site, two independent indel events occurred. One [unresolvable] indel arose very early in GPCR evolution in a bilaterian ancestor before protostome-deuterostome divergence. This indel led to the split between the P2.58 somatostatin/opioid receptors and peptide receptors with the P2.59 pattern. Subfamilies with proline at position 2.59 or no proline expanded earlier, whereas P2.60 receptors remained marginal throughout evolution. P2.58 receptors underwent later rapid expansion in vertebrates with the development of the chemokine and purinergic receptor subfamilies from somatostatin/opioid-related ancestors. A second indel, resolvable as a deletion, occurred in insect melanopsins.'

This result refines the classification of Class A GPCR, which might be quite indecisive at certain gene tree nodes from sequence alignment alone. Timing of the insect deletion can be done better (below) because the SwissProt collection used by the authors carries only 20% of the melanopsins actually available. Note the structural significance of length and bulge changes can be examined in available 3D determinations. The functional effect of this shift in TM2 remains obscure but must be important.

Class   Gene            PDB             Protein                     PubMed    Best human opsin  Next best          Signaling
T.60.1  RHO1_bosTau     1JFP 3C9M 2J4Y  bovine rod rhodopsin        17825322  RHO1_homSap 93%   SWS1_homSap   45%  Gt GNAT1 raises cGMP
P.60.0  MEL1_todPac     2Z73 2ZIY       squid melanopsin            18480818  MEL1_homSap 43%   PER1_homSap   30%  Gq GNAQ? inositol trisphosphate
P.59.3  ADORA2A_homSap  3EML            adenosine receptor 2A       18832607  MEL1_homSap 27%   ENCEPH_homSap 27%  Gs GNAT3 raises cAMP
P.59.1  ADRB1_melGal    2VT4            beta 1 adrenergic receptor  18594507  MEL1_homSap 29%   ENCEPH_homSap 25%  Gs GNAT3 raises cAMP
P.59.1  ADRB2_homSap    2R4R            beta 2 adrenergic receptor  17962520  MEL1_homSap 28%   PER1_homSap   29%  Gs GNAT3 raises cAMP

Thus indels in opsins -- when they occur in a conserved region -- are potentially very informative as rare genetic events not appreciably subject to homoplasy in defining orthology classes and higher order clusterings of them, hopefully corroborating or even refining trees derived from sequence clustering by alignment. While precious, such data is limited because physiological and structural constraints have prevented most regions of opsins from ever accommodating an indel.

Indels in ciliary opsins

The tertiary structural integrity requirements of a 7-transmembrane opsin, along with tuned binding of retinal, isomerization cycle conformational shifts and binding to secondary protein contributors to the photoreception cycle, conspire to greatly constrain admissible locations for ciliary opsin indels. Indeed this varies greatly by region, with indels never seen in the transmembrane regions themselves (despite tens of billions of branch length years) and restricted in connecting cytoplasmic and extracellular loops to EC2 and IC3 and IC7. Indel incidence is much higher in amino and carboxy terminal tails but not useful there because of gapping ambiguity issues.

The distribution of fixed indels is quite peculiar: almost all occur in gene family stems (ie shortly after gene duplication in one branch), hardly any occur mid-history. For vertebrate imaging opsins, this means prior to lamprey divergence. In other words, not only had all the classes of imaging opsins emerged post-tunicate/amphioxus pre-lamprey but (neglecting tails) also all their indels. No further indels arose in the subsequent 500 million years in any of these opsins, as if these opsins were already optimized from the length perspective.

Consequently the rate of indel occurrence per billion years of branch length -- and so the frequency of multiple independent events near a given site -- is highly correlated to region, ie each region has a characteristic time scale over which it can be informative: too long and the risk of homoplasy (convergent evolution) is too high. That risk is exacerbated by uncertainty in gap placement within an alignment, which first requires delimitation by flanking invariant residues. Gap length per se is ambiguous: an indel of 3 residues shared by two extant species might have arisen once as a single event in the first species or as two events (one and two residues successively) in the other. Thus any phylogenetic interpretation of indels must be tempered by knowledge of the regional indel susceptibilities and the assumption that these remain fairly constant across lineages and time.

Informative indels show up as readily apparent columns of gaps in large-scale alignments. If present across a single opsin orthology class, that merely validates prior blast clustering and other rare genomic events in establishing those classes in the first place. Sporadic indels, defined here as indels found within a single opsin gene, often arise from sequencing errors but, if genuine, might represent an adaptive specialization. It's very rare to see a ciliary opsin indel restricted to a phylogenetic subclade but examples exist, such as the post-marsupial loss of 5 residues of RHO1 in the distal arrestin binding region.
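
Spotting gap columns and separating sporadic from shared indels is easy to automate. The following is a rough sketch, not the procedure used for the alignment page: the alignment is an invented four-sequence toy (not real opsin data), the helper names gap_runs and classify are hypothetical, and columns are numbered from zero.

```python
def gap_runs(alignment):
    """Return (start, end, gapped_names) for each maximal run of adjacent
    columns in which exactly the same set of sequences carries a gap."""
    names = list(alignment)
    width = len(alignment[names[0]])
    runs = []
    current = None  # [start, end, frozenset of gapped names]
    for col in range(width):
        gapped = frozenset(n for n in names if alignment[n][col] == "-")
        if current is not None and gapped == current[2]:
            current[1] = col           # extend the open run
            continue
        if current is not None:
            runs.append(tuple(current))
        current = [col, col, gapped] if gapped else None
    if current is not None:
        runs.append(tuple(current))
    return runs


def classify(run):
    """Label a run as sporadic (one sequence) or shared (two or more)."""
    start, end, gapped = run
    kind = "sporadic" if len(gapped) == 1 else "shared"
    return kind, start, end - start + 1, sorted(gapped)


# Invented toy alignment; equal-length rows, '-' marks a gap.
toy_alignment = {
    "seq1": "MKTAYIAKQR",
    "seq2": "MKT--IAKQR",
    "seq3": "MKT--IAKQR",
    "seq4": "MKTAYIA-QR",
}

report = [classify(run) for run in gap_runs(toy_alignment)]
```

Each report entry gives the kind, start column, gap length and affected sequences. A shared run only becomes phylogenetically interesting once the affected sequences are checked against the gene tree.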

We're concerned here primarily with non-sporadic indels that span two or more orthology classes and that speak to unresolved dating and topological issues in the gene tree. Significant individual indels are visible on the alignment page. These give rise to a table sortable by position along the opsin sequence, indel length, region (eg 3rd cytoplasmic loop), higher taxonomic clade, and phylogenetic depth. Specific goals are dating indel events, characterizing remote opsins in pre-vertebrate deuterostomes, correctly placing cnidarian opsins, disambiguating opsins from non-opsin GPCR, and establishing ancestral lengths.

For deuterostome ciliary opsins, the story is fairly simple up to encephalopsin. None of the transmembrane helices have indels. That holds also for the first two cytoplasmic loops and first and last extracellular loops. Structural constraints can be too rigid to tolerate indels, as illustrated by the well-known hydrogen bond chain of extremely conserved residues that holds the transmembrane helices in a fixed relative position: N55 in TM1 hydrogen bonded to D83 in TM2, which is in turn bonded to peptide A299 in TM7. Indels that altered the position of these residues within the respective helical wheels would cause the whole arrangement to become unglued. The asparagine and aspartate are deeply invariant not only in opsins but also in GPCR.

The second extracellular loop has a two residue insert in all rod and cone opsins in a region so far not attributed functional significance; this may have been a near-neutral event in the ancestral stem protein (ie in a gene duplicate of pinopsin). The cytoplasmic side has all the protein-protein interactions but length of the extracellular loops can still be important in tensioning of transmembrane helices that sets their angles of insertion and relative orientation.

The third cytoplasmic loop has variable length distally. Length is constant within orthology classes with parietopsin having full length, parapinopsin one residue shorter, and all others two residues fewer. This is a region of high beta factor in bovine rhodopsin crystals, ie has too much movement to be assigned a conformation. Unsurprisingly no function has been assigned. While the indel pattern supports the conventional gene tree, evidently this indel hotspot has fixed at least three separate events. While that hasn't resulted in overt homoplasy in terms of length, additional events could be masked. This weakens interpretive certainty of indels in this region.

The amino terminus has 4 informative indels, all deletions. The first unites RHO1 and RHO2 to the exclusion of all other opsins (as does the short highly conserved N-terminus with two glycosylation sites). No indel or intron distinguishes them. RHO2 has an odd phylogenetic distribution -- it seems to occur in one species of lamprey but not in genomic lamprey (despite 19 million traces) nor in cartilaginous nor ray-finned fish, but seemingly arises again in lungfish, coelacanth, lizards, and chicken but not frog nor any mammal. Possibly the lamprey RHO2 is a lineage-specific duplication of lamprey RHO1. A later independent duplication in lobe-finned fish persisted until the mammalian nocturnal loss era. It may be missing in frog because of an incomplete genome.

Indels in melanopsins

It quickly emerges from a much larger data set that the mid-helix region preceding the proline in TM2 -- the only opsin transmembrane helix to ever experience an indel in 100 billion years of available branch length -- has experienced numerous independent insertions and deletions. That would seem to undercut efforts to make the length a definitive fundamental classifying tool among GPCR. The situation is compounded by separate indels following the proline that, depending on gap placement, might affect the extracellular loop connecting TM2 and TM3.

However, with care the homoplasy is manageable and the locus is quite informative for opsins, though quite a detailed argument is necessary to fully exploit it.

An 'iron triangle' provides a fixed upstream frame of reference critical to reliable gapping of indels in this region. This consists of a very conserved asparagine Asn55 in TM1 hydrogen bonded to an almost universal charged aspartate Asp83 internal to TM2 which itself is hydrogen bonded to a peptide amide Ala299 in TM7 (bovine rhodopsin numbering). The iron triangle is central to the proper associative bundling and relative orientation of the seven transmembrane helices in the vicinity of the Schiff base K296. No indels occur in any opsin or GPCR between this N and D (meaning cytoplasmic loop CL1 is of fixed length, namely 12 aa).

Downstream, the reference frame is augmented by the first cysteine C110 of the universal GPCR disulfide linking TM3 to EC2. This is preceded by an easily recognized motif WIFG (squid melanopsin), which forces all gaps to be placed between the iron triangle D and WIFGFAAC (FVFGPTGC in bovine rhodopsin). This 8 residue motif is well conserved in GPCR as well, so it is very ancient. Thus post-proline gapping is quite constrained by reliable anchors.
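
With both anchors fixed, comparing lengths reduces to counting the residues between the iron triangle D and the start of the downstream motif. The sketch below illustrates only that bookkeeping. The function name anchored_span is hypothetical and the two test strings are invented placeholders (poly-alanine filler around the two motifs quoted above), not real opsin sequence. The position of the anchoring D is assumed to be known already from an alignment.

```python
def anchored_span(seq, d_index, motif):
    """Count residues strictly between the anchor D and the downstream motif.

    seq     -- ungapped protein sequence (string)
    d_index -- zero-based index of the anchoring aspartate (assumed known)
    motif   -- downstream anchor motif, eg FVFGPTGC or WIFGFAAC
    """
    if seq[d_index] != "D":
        raise ValueError("anchor residue is not aspartate")
    motif_start = seq.find(motif, d_index + 1)
    if motif_start == -1:
        raise ValueError("downstream motif not found")
    return motif_start - d_index - 1


# Invented placeholder strings: filler residues are NOT real sequence.
toy_short = "LLD" + "AAAAAAAPAA" + "FVFGPTGC"
toy_long = "LLD" + "AAAAAAAAPAA" + "WIFGFAAC"

# Positive difference means the second string is longer between anchors.
length_difference = anchored_span(toy_long, 2, "WIFGFAAC") - anchored_span(toy_short, 2, "FVFGPTGC")
```

The difference says only that the anchored interval differs in length; deciding where within that interval the gap belongs still requires the alignment argument made in the text.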

The 185 ciliary opsins (which include 5 basal cnidarian opsins) in the reference sequence collection are all of the same length in this region, as are 65 peropsins, RGR and neuropsins and the vast majority of near-opsin GPCR. Consequently this length, denoted P.59.2 (for proline in position 59 bovine rhodopsin numbering with 2 aa missing post-proline), is ancestral for melanopsins, which vary in length.

Deuterostome melanopsins are all of P.59.2 type, as are LMS and BCR arthropod melanopsins as well as a subclass of lophotrochozoan melanopsins and the one known cnidarian melanopsin. The remaining dozen known lophotrochozoan melanopsins are of type P.60.2. This class -- which includes the structurally determined squid melanopsin -- thus has a one residue insertion whose location appears to be 5 residues after the D and 4 before the P. Thus lophotrochozoan melanopsins had ancestral length up to a gene duplication which acquired this insertion (in the default parsimonious scenario).

The three classes of arthropod ultraviolet opsin genes (represented by 44 genes) all share a one residue deletion in this same region, approximately at the 4th post-D residue. This event affected insects, crustaceans and chelicerates, ie occurred deep in the stem lineage of ecdysozoa.

Opsin_bovRHO1.png

Indels in other opsins

Informative indels would be very helpful in this class of opsins because their sequence relationships to ciliary opsins and melanopsins are too weak. Note that intron patterns -- another class of even rarer genetic event and so even better suited for deep time scales -- have already illuminated branching relationships to a certain extent.

(to be continued)