Opsin evolution: informative indels

From genomewiki

Introduction to indels

Insertions and deletions of amino acids (together called coding indels) are a class of genetic event rarely fixed in conserved protein sequence regions. It is not immediately clear whether a given indel represents an insertion or a deletion. Deciding between the two, a process called indel resolution, requires a phylogenetic tree from which the ancestral length can be determined. If outgroups are consistently short, then by parsimony the longer ingroup clade experienced an insertion; conversely, if outgroups are consistently long, the shorter ingroup clade experienced a deletion. Indels are unresolvable when outgroup data are not available.
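The parsimony rule just described can be written down in a few lines. This is only a minimal sketch: the function name `resolve_indel` and its inputs (region lengths in residues for the ingroup clade and for each outgroup) are hypothetical, and treating disagreeing outgroups as unresolvable is an assumption made here, not something taken from a published method.

```python
from typing import Optional, Sequence


def resolve_indel(ingroup_len: int, outgroup_lens: Sequence[int]) -> Optional[str]:
    """Polarize a length difference by simple parsimony.

    ingroup_len   -- region length in the ingroup clade (hypothetical input)
    outgroup_lens -- length of the same region in each available outgroup
    Returns 'insertion', 'deletion', 'none', or None when unresolvable.
    """
    if not outgroup_lens:
        return None  # no outgroup data: the indel cannot be resolved
    if len(set(outgroup_lens)) != 1:
        return None  # assumption: inconsistent outgroups leave ancestral length ambiguous
    ancestral = outgroup_lens[0]  # consistent outgroups give the ancestral length
    if ingroup_len > ancestral:
        return "insertion"
    if ingroup_len < ancestral:
        return "deletion"
    return "none"


# Outgroups consistently short, ingroup longer: an insertion in the ingroup.
print(resolve_indel(12, [10, 10]))
# Outgroups consistently long, ingroup shorter: a deletion in the ingroup.
print(resolve_indel(8, [10, 10, 10]))
```
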

It is implausible -- rarity cubed -- that two or more outgroups plus an ingroup experienced independent deletions of the same length at the same site (though the exact site can be difficult to evaluate if flanking residues were also affected by the original genetic event or subsequent accelerated compensatory mutation). Advanced statistical methods can provide only illusory gains over simple parsimony because the underlying required models of indel formation are highly uncertain.

Nonetheless, examples of homoplasy are easy to come by, especially in repetitive nucleotide regions (encoding runs of compositionally simple amino acids) subject to the mutational mechanism of replication slippage. Homoplasy at longer time scales manifests itself by incoherent distribution over a known phylogenetic tree. Convergent evolution can also be driven by selective advantage for altered length.

Indels occur very unevenly across the length of a given protein homology class. The rate might be high in terminal regions if the amino or carboxy termini are unimportant to the fold or function of matured protein. Within folded regions of soluble proteins, indels are greatly concentrated in loop regions of the 3D structure where a change in length can be accommodated without structural disruption.

For integral membrane proteins such as GPCR, deletions are very rarely fixed in the transmembrane helical regions because a shortened helix would no longer span the membrane at the same angle, pulling in inappropriate non-hydrophobic residues from soluble loops. Insertions too are rare because they push hydrophobic and boundary turn residues into soluble compartments and distort connecting loops, perhaps altering insertion angles of other transmembrane regions. Such mutations occur frequently enough but are rarely fixed at the population level or as balanced alleles over timescales commensurate with major speciations.

In massively expanded gene families such as GPCR, a coherently fixed indel in one descendent clade of the gene tree suggests adaptive sub- or neo-functionalisation: if the indel were merely tolerated as near-neutral change, over geological timescales homoplasy would arise. A remarkable site in transmembrane helix 2 was asserted in May 2009:

"Class A GPCR constitute a large family of transmembrane receptors. Helical distortions play a major role in the overall fold of these receptors. Most are related to conserved proline residues. However, in transmembrane helix 2, the proline pattern is not conserved, and when present, proline may be located at position 2.58, 2.59, or 2.60 [yielding] a bulged structure in P2.59 and P2.60 receptors or a [more] typical proline kink in P2.58 receptors. The proline pattern of helix 2 can be used as an evolutionary marker and helps to trace the molecular evolution of class A GPCRs.

Two indel events yielding functional receptors occurred independently. One [unresolvable] indel arose very early in GPCR evolution, in a bilaterian ancestor, before the protostome-deuterostome divergence. This indel led to the split between the P2.58 somatostatin/opioid receptors and other peptide receptors with the P2.59 pattern. A second indel occurred in insect [melan]opsins corresponding to a deletion. Subfamilies with proline at position 2.59 or no proline expanded earlier, whereas P2.60 receptors remained marginal throughout evolution. P2.58 receptors underwent rapid expansion in vertebrates with the development of the chemokine and purinergic receptor subfamilies from somatostatin/opioid-related ancestors."

This result thus refines the classification of Class A GPCR, which might be quite indecisive at certain gene tree nodes from sequence alignment alone. The insect deletion can be dated more precisely (below) because these authors did not access the full collection of genomic melanopsins available. Note that the functional significance of the length and bulge changes remains obscure.

Thus indels in opsins -- when they occur in a conserved region -- are potentially very informative (as rare genetic events not appreciably subject to homoplasy) in defining orthology classes and higher order clusterings of them, hopefully corroborating or even refining results of sequence clustering by alignment. But because of physiological and structural constraints, few regions of the opsin molecule have ever accommodated indels.

Indels in melanopsins

(to be continued)

Indels in ciliary opsins

We shall see below that the distribution of fixed indels is very peculiar: almost all occur in gene family stems (ie shortly after gene duplication in one branch), hardly any occur mid-history. For vertebrate imaging opsins, this means prior to lamprey divergence. In other words, not only had all the classes of imaging opsins emerged post-tunicate/amphioxus pre-lamprey but also all their indels. No further indels arose in the subsequent 500 million years in any of these opsins (apart from little-selected leading and terminal domains) -- these opsins were already optimized from the length perspective too.

The tertiary structural integrity requirements of a 7-transmembrane opsin, along with its tuned binding of retinal, isomerization cycle conformational shifts, and binding to secondary protein contributors to the photoreception cycle, conspire to greatly constrain admissible locations for indels. Indeed this varies greatly by region, with indels scarcely seen in the transmembrane regions themselves and rare in most cytoplasmic and extracellular loops but moderately common in others and more freely occurring in amino and carboxy terminal tail regions.

Consequently the rate of indel occurrence per billion years of branch length -- and so the frequency of multiple independent events near a given site -- depends strongly on region. This means each region has a characteristic time scale over which it can be informative: too long and the risk of homoplasy (convergent evolution) is too high. That risk is exacerbated by uncertainty in gap placement within an alignment. Gap length carries information, but especially in a high-incidence region an indel of 3 residues shared by two extant species might have arisen as a single event in the first species but as two events (of one and two residues successively) in the other. Thus any phylogenetic interpretation of indels must be tempered by knowledge of the regional indel susceptibilities and by the assumption that these remain fairly constant across lineages and time.
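To make the idea of a characteristic time scale concrete, here is a rough sketch that treats indel fixation near a site as a Poisson process with a constant regional rate. Both the Poisson model and the rates in the usage lines are assumptions chosen purely for illustration. They are not measured values for any opsin region, and real indel formation is far less well behaved.

```python
import math


def prob_multiple_events(rate_per_gyr: float, branch_gyr: float) -> float:
    """Probability of two or more independent indel events near one site.

    rate_per_gyr -- assumed regional indel rate per billion years of branch length
    branch_gyr   -- total branch length considered, in billions of years
    Assumes a constant-rate Poisson process (an illustrative simplification).
    """
    expected = rate_per_gyr * branch_gyr
    # P(N >= 2) = 1 - P(N = 0) - P(N = 1) for a Poisson count N
    return 1.0 - math.exp(-expected) * (1.0 + expected)


# Hypothetical rates: a tightly constrained region versus an indel hotspot,
# both evaluated over the same total branch length.
for label, rate in [("constrained region", 0.05), ("hotspot", 2.0)]:
    print(label, round(prob_multiple_events(rate, 1.0), 4))
```

Under these assumptions the chance of masked or convergent events grows quickly with the product of rate and branch length, which is why a hotspot stops being informative over a much shorter span than a constrained region.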


Informative indels show up as readily apparent columns of gaps in large-scale alignments, minimally consistent across an opsin orthology class, and indeed supplement blast clustering and other rare genomic events in establishing these classes. Sporadic indels are defined here as indels found within a single opsin or subclade of a class (for example the post-marsupial loss of 5 residues in the arrestin binding region of RHO1). We're concerned here primarily with non-sporadic indels that span two or more orthology classes because these speak to not fully resolved issues in opsin research.
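The distinction between sporadic, class-wide, and multi-class indels can be sketched as a scan for shared gap columns. Everything in this example is hypothetical: the toy sequences, the sequence names, and the labels returned by `classify_indels`. It also assumes that gaps are placed identically in every carrier, which real alignments do not guarantee.

```python
import re
from collections import defaultdict


def gap_runs(aligned):
    """Half-open (start, end) column ranges of each run of '-' in one aligned sequence."""
    return [m.span() for m in re.finditer(r"-+", aligned)]


def classify_indels(alignment, classes):
    """Label each distinct gap run by how coherently it tracks orthology classes."""
    carriers = defaultdict(set)  # gap span -> names of sequences carrying it
    for name, seq in alignment.items():
        for span in gap_runs(seq):
            carriers[span].add(name)
    members = defaultdict(set)  # orthology class -> names of its sequences
    for name in alignment:
        members[classes[name]].add(name)
    report = {}
    for span, names in sorted(carriers.items()):
        # classes in which every member carries this gap
        full = sorted(c for c, m in members.items() if m <= names)
        covered = set().union(*(members[c] for c in full))
        if names != covered:
            label = "sporadic"  # some carriers do not fill out a whole class
        elif len(full) >= 2:
            label = "multi-class"  # spans two or more orthology classes
        else:
            label = "class-wide"  # consistent across exactly one class
        report[span] = (label, full)
    return report


# Toy alignment with made-up sequences, for illustration only.
alignment = {
    "RHO1_a": "MKT--LLA",
    "RHO1_b": "MKT--LLA",
    "RHO2_a": "MKT--LIA",
    "PIN_a": "MKTGGLLA",
    "PIN_b": "MKTGGL-A",
}
classes = {"RHO1_a": "RHO1", "RHO1_b": "RHO1", "RHO2_a": "RHO2",
           "PIN_a": "PIN", "PIN_b": "PIN"}
report = classify_indels(alignment, classes)
for span, (label, full) in report.items():
    print(span, label, full)
```
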

Let's now discuss significant individual indels visible on the alignment page. These could result in a simple flatfile database sortable by position along the opsin sequence, indel length, region (eg 3rd cytoplasmic loop), higher taxonomic clade, and phylogenetic depth. We're specifically interested in validating the current reference set collection (as carried in the fasta headers), dating indel events, characterizing very remote opsins in cnidarians, disambiguating opsins from generic GPCR, establishing ancestral lengths, and evaluating overall the usefulness of this type of rare genomic event for this gene family.
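Such a flatfile might look like the sketch below, which writes tab-separated records sortable on any of the fields just listed. The record type `IndelRecord`, the field names, and especially the position values are placeholders invented for illustration. They are not coordinates read off the alignment page.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class IndelRecord:
    position: int  # position along the opsin sequence (placeholder values below)
    length: int    # residues gained (+) or lost (-)
    region: str    # e.g. "3rd cytoplasmic loop"
    clade: str     # higher taxonomic clade or orthology classes carrying the indel
    depth: str     # phylogenetic depth, e.g. "pre-lamprey"


def to_flatfile(records, sort_key="position"):
    """Return the records as tab-separated text, sorted on one field."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=[f.name for f in fields(IndelRecord)],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for rec in sorted(records, key=lambda r: getattr(r, sort_key)):
        writer.writerow(asdict(rec))
    return buf.getvalue()


# Placeholder records: the positions are invented, not real coordinates.
records = [
    IndelRecord(240, -2, "3rd cytoplasmic loop", "example clade B", "pre-lamprey"),
    IndelRecord(180, 2, "2nd extracellular loop", "example clade A", "pre-lamprey"),
]
flat = to_flatfile(records)
print(flat)
```

Passing `sort_key="length"` or `sort_key="region"` re-sorts the same records, which is all a flatfile of this size needs.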

For deuterostome ciliary opsins, the story is fairly simple up to encephalopsin. None of the transmembrane helices have indels. That holds also for the first two cytoplasmic loops and first and last extracellular loops. Evidently structural constraints are too rigid. The second extracellular loop has a two residue insert in all rod and cone opsins in a region so far not attributed functional significance; this may have been a near-neutral event in the ancestral stem protein (ie in a gene duplicate of pinopsin).

The third cytoplasmic loop has variable length distally. Length is constant within orthology classes with parietopsin having full length, parapinopsin one residue shorter, and all others two residues fewer. This is a region of high B-factor in bovine rhodopsin crystals, ie it has too much movement to be assigned a conformation. Unsurprisingly no function has been assigned. While the indel pattern supports the conventional gene tree, evidently this indel hotspot has fixed at least three separate events. While that hasn't resulted in overt homoplasy in terms of length, additional events could be masked. This reduces the value of the indel.

The amino terminus has 4 informative indels, all deletions. The first unites RHO1 and RHO2 to the exclusion of all other opsins (as does the short highly conserved N-terminus with two glycosylation sites). No indel or intron distinguishes them. RHO2 has an odd phylogenetic distribution -- it seems to occur in one species of lamprey but not in genomic lamprey (despite 19 million traces) nor in cartilaginous nor ray-finned fish, but seemingly arises again in lungfish, coelacanth, lizards, and chicken but not frog nor any mammal. Possibly the lamprey RHO2 is a lineage-specific duplication of lamprey RHO1. A later independent duplication in lobe-finned fish persisted until the mammalian nocturnal loss era. It may be missing in frog because of an incomplete genome.