Opsin evolution: informative indels
Insertions and deletions of amino acids (together called coding indels) are quite informative in defining orthology classes and higher-order clusterings in opsins. Because of many physiological and structural constraints, few regions of the opsin molecule can accommodate indels. No doubt such mutations occur frequently enough, but only rarely are they fixed at the population level. Fixation does not necessarily imply improved properties of the opsin, as the indel might simply be tolerated as an approximately neutral change.
The tertiary structural integrity requirements of a 7-transmembrane opsin, along with its tuned binding of retinal, the conformational shifts of its isomerization cycle, and its binding to secondary protein contributors to the photoreception cycle, conspire to greatly constrain admissible locations for indels. Indeed this varies greatly by region: indels are scarcely seen in the transmembrane regions themselves and are rare in most cytoplasmic and extracellular loops, but they are moderately common in other loops and occur more freely in the amino and carboxy terminal tail regions.
Consequently the rate of indel occurrence per billion years of branch length, and so the frequency of multiple independent events near a given site, is highly correlated with region. This means each region has a characteristic time scale over which it can be informative: too long and the risk of homoplasy (convergent evolution) is too high. That risk is exacerbated by uncertainty in gap placement within an alignment. Gap length has value, but especially in a high-incidence region, an indel of 3 residues shared by two extant species might have arisen as a single event in the first species but as two events (one and two residues successively) in the other. Thus any phylogenetic interpretation of indels must be tempered by knowledge of the regional indel susceptibilities and by the assumption that these remain fairly constant across lineages and time.
It is not immediately clear whether a given indel represents an insertion or a deletion. The process of deciding is called 'indel resolution' and requires determining the immediately ancestral length. If outgroups are consistently short, then the ingroup clade with the longer length experienced an insertion (and conversely, a deletion if the outgroups are consistently long). This is tantamount to an operative assumption of parsimony, i.e. the implausibility that two outgroups and the ingroup experienced independent deletions of the same length, characterized probabilistically as a rare event cubed. Advanced statistical methods can provide only illusory gains over simple parsimony because the underlying models of indel formation are highly uncertain.
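The outgroup comparison described above can be sketched in a few lines of Python. This is only an illustration of the parsimony rule, not an established tool: the function name resolve_indel, the requirement of at least two outgroups, and the "unresolved" outcome for disagreeing outgroups are all assumptions made for the example.

```python
def resolve_indel(ingroup_len, outgroup_lens):
    """Polarize an indel by simple parsimony (hypothetical helper).

    ingroup_len   -- length in residues of the region in the ingroup clade
    outgroup_lens -- lengths of the same region in the outgroups
    """
    # Assumption: a single outgroup is not enough to fix the ancestral length.
    if len(outgroup_lens) < 2:
        return "unresolved"
    # Outgroups must be consistent; otherwise the ancestral length is unclear.
    if len(set(outgroup_lens)) != 1:
        return "unresolved"
    ancestral_len = outgroup_lens[0]
    if ingroup_len > ancestral_len:
        return "insertion"
    if ingroup_len < ancestral_len:
        return "deletion"
    return "no change"
```

For instance, resolve_indel(12, [9, 9]) reports an insertion in the ingroup, while outgroups of differing lengths leave the event unresolved rather than guessing at the ancestral state.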
Informative indels show up as readily apparent columns of gaps in large-scale alignments. They are minimally consistent across an opsin orthology class, and indeed supplement BLAST clustering and other rare genomic events in establishing these classes. Sporadic indels are defined here as indels found within a single opsin or a subclade of a class (for example the post-marsupial loss of 5 residues in the arrestin binding region of RHO1). We're concerned here primarily with non-sporadic indels that span two or more orthology classes, because these speak to not fully resolved issues in opsin research.
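As a rough illustration of how shared gap columns might be pulled out of an alignment, the Python sketch below lists runs of gap characters that occur at identical positions in every member of a chosen group. The function names, the use of '-' as the gap character, and the toy sequences are assumptions for the example, not real opsin data. Requiring identical placement is deliberately strict, so alternative gap placements of the same event would be missed.

```python
def gap_runs(aligned_seq):
    """Return (start, length) for each run of '-' in one aligned sequence.

    Start positions are 0-based alignment columns.
    """
    runs = []
    start = None
    for i, ch in enumerate(aligned_seq):
        if ch == "-":
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:  # run extends to the end of the sequence
        runs.append((start, len(aligned_seq) - start))
    return runs


def shared_gap_runs(alignment, members):
    """Gap runs found, with identical start and length, in every member."""
    run_sets = [set(gap_runs(alignment[name])) for name in members]
    return sorted(set.intersection(*run_sets)) if run_sets else []


# Toy alignment with made-up names and residues, for illustration only.
toy_alignment = {
    "classA_1": "MKT---LVA",
    "classA_2": "MRT---LIA",
    "classB_1": "MKTGSPLVA",
}
shared = shared_gap_runs(toy_alignment, ["classA_1", "classA_2"])
```

Here shared holds one run of three gap columns starting at column 3, the kind of class-consistent gap that the text treats as a candidate informative indel.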
Let's now discuss significant individual indels visible on the alignment page. The discussion yields a simple flat-file database that can be sorted by numbered position along the opsin sequence, length, region (e.g. 3rd cytoplasmic loop), higher taxonomic clade, and phylogenetic depth. We're specifically interested in validating the current reference set collection (as carried in the fasta headers), dating indel events, characterizing very remote opsins in cnidarians, disambiguating opsins from generic GPCRs, establishing ancestral lengths, and evaluating the overall usefulness of this type of rare genomic event for this gene family.
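One possible shape for such a flat file is a tab-delimited table with one indel per row, sketched below in Python. The column names, the integer encoding of phylogenetic depth, and every row value are placeholders invented for illustration; they do not describe real indels from the alignment.

```python
import csv
import io

# Placeholder rows for illustration only; not real opsin data.
FLATFILE = (
    "position\tlength\tregion\tclade\tdepth\n"
    "210\t3\tloop_example_B\tclade_Y\t2\n"
    "45\t5\ttail_example_A\tclade_X\t1\n"
    "210\t1\tloop_example_B\tclade_Z\t3\n"
)


def load_indels(text):
    """Parse tab-delimited indel records, converting numeric columns to int."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    for row in rows:
        for key in ("position", "length", "depth"):
            row[key] = int(row[key])
    return rows


def sort_indels(rows, *keys):
    """Sort records by one or more columns, in the order given."""
    return sorted(rows, key=lambda row: tuple(row[k] for k in keys))


records = load_indels(FLATFILE)
by_position = sort_indels(records, "position", "length")
by_region = sort_indels(records, "region", "depth")
```

Keeping the table as plain tab-delimited text means it can equally be sorted in a spreadsheet or with command-line tools; the Python wrapper only shows that any combination of columns can serve as the sort key.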
(to be continued)