Opsin evolution: ancestral sequences

From genomewiki

See also: Curated Sequences | Alignment | Ancestral Introns | Informative Indels | Cytoplasmic face | Update Blog

Introduction to Ancestral Sequences

Reconstruction of ancestral genes -- indeed whole genomes -- is useful in a variety of contexts. Although this is widely done in opsins to reconstruct historical spectral sensitivities, our purpose here is primarily to reduce the excessive number of available opsin sequences to representative ones that still carry all the information of the opsin class but without the idiosyncrasies that might have developed in particular clades. An ancestral ciliary opsin sequence at the agnathan divergence node takes away the subsequent 500 million years of sequence divergence. Suppose the same is done for rhabdomeric opsins. Comparing these to an uncharacterized extant (contemporary) lophotrochozoan or cnidarian opsin greatly sharpens the alignment, which previously involved a billion years of round-trip evolutionary divergence. It further facilitates comparison of diagnostic signature residues and patches and rare genetic events such as intron gain or loss.

These considerations are very important in opsins, which are embedded in the largest and most complex of all gene families, the GPCR, because the critical events in the evolutionary origin of the eye are quite old, certainly predating the Cambrian. Opsin sequences are well conserved: less so than histones or ribosomal proteins but far more than the median protein. Over the time scales involved, however, percent identity has dropped into the unreliable Blast twilight zone (below 30%), where a faster evolving opsin might be confused with a slower evolving GPCR not involved in photoreception. Ancestral sequences thus greatly improve the placement of opsins within their correct homology class.

However the utility really depends on the accuracy of ancestral sequence reconstruction. We hear this or that maximum likelihood or Bayesian methodology "should" work -- but are these assertions really testable or just self-serving bioinformatic blather? Ancient DNA has two problems -- the sequenceable component is never that ancient and the fossil that it came from is never exactly from the divergence node. On the protein side, even if collagens and other structural proteins can be sequenced from dinosaur femur, that won't help with soft-tissue membrane-bound opsins. Fossil eyes in trilobites are much studied but equally uninformative at the molecular level. So direct tests of reconstructed opsins are not imminent.

Another dimensionality to testing accuracy of reconstructed sequences involves physical construction of the gene and its expression in a contemporary host. If the gene were an enzyme, we might gain confidence looking at binding constants, catalytic efficiency, and substrate specificity. For opsins we have covalent binding of retinal, spectral sensitivity, 7-transmembrane topology, and signaling capability. Unfortunately we don't know what the ancestral lambda max should be. These functionalities won't prove stringent enough because even the sloppiest reconstruction will get invariant and near-invariant residues correct. These may be quite adequate to produce a satisfactorily functioning opsin that bears little relationship to true ancestral sequence.

It could be argued that the ancestral residues with the most variation are the least important, so it doesn't really matter if the reconstruction gets them right. It's abundantly clear from a quick alignment that amino and carboxy termini are under very relaxed constraints sometimes even within a single orthology class (but sometimes not), making reconstruction outside the core opsin essentially hopeless. There's also markedly less conservation in some of the extracellular and cytoplasmic connecting loops. Note though that a highly organized portion in the extracellular region, including a conserved disulfide bridge, actually guides the arrangement of the seven-helix transmembrane motif.

Thus the focus of the reconstruction effort lies between the hopeless and the slam-dunk invariant residues. Here we don't want to use reconstruction methods developed and vetted for cytoplasmic proteins. The rules -- such as what constitutes a conservative substitution -- are very different for integral membrane proteins such as opsins, where the exterior of the alpha helices is exposed to hydrophobic lipid rather than hydrophilic water except at their cap residues. We know from the determined 3D structure of bovine RHO1 (intradiskal loop 1 and loop 2 have been studied separately) that opsins will have significant co-evolution, that is ectopic (non-adjacent) residue pairs that shift in a coordinated manner according to their own reduced alphabet despite the disparity in linear position, the best known case being the retinal-bearing lysine and its negative counterion glutamate (where the reduced alphabet admits aspartate at the counterion). These issues -- and opsin dimerization surfaces -- are not considered in residue-by-residue and local patch reconstruction methods.

Residue constraints on opsin evolution

Opsin counterion.png

Variations in that counterion residue, seemingly so important at the retinal Schiff base, seem barely discussed from a modern comparative genomics perspective. It's known though that upon photoactivation, rhodopsin relaxes slightly and loses its initial counterion (but possibly picking up a second Glu181 in EXC2, a residue affecting lambda max and likely the ancestral counterion; E108 may also have a role in SWS1). GLY90 is also critical as seen from directed mutagenesis and causal role in human night blindness.

From a quick alignment of our phylogenetically dispersed 228 opsins covering this region, we can recover all sorts of new results, using a Multalin annotation trick (isolate E113 by setting line break to position 167, graduation to 1, etc to add G90 and E181, sort covariation in a spreadsheet.) We see initial received wisdom is wrong: the most common alternatives are not charged and could not provide the required offset of the Schiff base with a salt bridge. (Chloride ion is often discussed here.) Not all are potential hydrogen-bond donors -- all 7 insect UV opsins use phenylalanine. Surely tyrosine, not glutamate, is ancestral at position 113 (as deduced by Terakita et al in the pre-genomic era) and hydrogen-bonding is the usual structural contribution to agonist binding pocket. This illustrates the limitations of bovine rhodopsin as proxy to all opsins. Non-opsin GPCR are again required here as controls on specificity.


-- All 78 opsins from RHO1 to PPIN (in the gene tree) carry glutamate E113 with the exception of frog SWS1 aspartate.
-- Of the remaining 150 opsins, only MEL1b_braFlo and MEL1b_braBel have glutamate suggesting their classification needs revisiting. All other melanopsins have tyrosine.
-- Vertebrate encephalopsins only have the similarly charged but shorter sidechain aspartate.
-- All 6 parietopsins uniquely carry glutamine. This one residue is completely diagnostic for parietopsins and may be functionally discriminatory.
-- All 6 insect UV opsins uniquely carry phenylalanine which is neither charged nor polar and cannot hydrogen-bond.
-- All other protostomal opsins -- both ciliary encephalopsins and rhabdomeric melanopsins -- have tyrosine, as do neuropsins and peropsins.
-- The putative opsin in cnidarian MEL_nemVec has a supporting tyrosine whereas another putative cnidarian LWS_nemVec has questionable asparagine.
-- Echinoderm PIN_stoPur has tyrosine; this protein also classifies with encephalopsins so tyrosine may have persisted up to chordate.
-- Only RGR opsins have histidine modulo some obscure exceptions.
-- A few anomalies may represent sequence errors, inclusion of non-opsins, or misclassifications: RGR2_cioInt glycine, ENCEPH_xen histidine, and PER1_strPur serine.

We're also interested in what specifies cognate heterotrimeric G-protein binding. That binding (and hence signalling) is quenched by an equally important quasi-cognate arrestin binding to phosphorylated opsin. Vertebrates have 4 multi-exon arrestin paralogs on 4 distinct chromosomes (indicating segmental duplication) considered further below. The role of arrestin in newer opsin classes and non-imaging eyes is not so clear; between-species comparisons are complicated by the timing of lineage-specific gene duplications.

All in all there is so much going on with opsins that it is a wonder their sequences can evolve at all. The mutations known in human -- a complex subject best reviewed at OMIM -- primarily affect RHO1 and cone opsins, illuminating key residues such as Gly90 and T94. Humans are not the best species for this having lost 5 of the 14 standard vertebrate opsins. Knockouts in mice, also summarized there for each opsin gene, are also instructive but suffer from a similar problem.

The aromatic rotameric lock of TM5-TM6 in opsin evolution

A very detailed 2010 study of the ghrelin receptor GPCR tryptophan rotameric lock in the sixth transmembrane helix is relevant to opsins, even though these receptors classify quite differently within the GRAFS system. The reason: all GPCR are descended from a common ancestral core protein in which the critical features of the ligand-induced conformational shift needed for signaling had already been established. Since the chemistry of GPCR ligands varies immensely, residues conserved across a broad spectrum of GPCR cannot be concerned with ligand binding.

AromatLock.jpg

The basic concept here is that the large indole side chain of a specific conserved tryptophan (in the CWxP motif at the bottom of the ligand binding pocket) rotates about its beta carbon, finding a physically adjacent phenylalanine in the fifth transmembrane helix with which it can stack, forming a fairly strong bond involving pairing of the two delocalized pi-electron systems. This lock is necessary for both agonist-induced and constitutive signalling (the latter much suppressed in opsins). The system can be influenced by a residue just across in TM3 (a threonine in ghrelin). stabilizing TM7 plays a key role in activation by rocking forward about a central pivot, tilting its extracellular extension inward into the ligand pocket and its intracellular segment outward allowing binding of Galpha. Here the kink induced by the motif's conserved proline is taken up at the tryptophan position.

Note this is but one of three switches involved in GPCR activation: the DRY motif shifts from DR hydrogen bonding to RY (a tyrosine in TM5) and the NPxxY shifts from an inactive YF stack (phenylalanine of intracellular helix 8) to an active hydrophobic cluster stabilizing the tilt here of TM6.

If these switches are not thought of as sequential (falling dominoes) but rather as a concerted mechanism for stabilizing a switching conformational state, then some departures from absolute conservation in a given switch can be tolerated if compensatory strengthening occurs elsewhere. Note here that a seemingly mild substitution of an aliphatic residue such as leucine for either the tryptophan or phenylalanine breaks the aromatic interaction; a tyrosine substitution would not though the rotamer alternative is gone.

Thus it is instructive to consider the comparative genomics of this switch within the 455 curated opsins with adequate coverage and their 28 closest neighbors among GPCR. Among observed departures, some are idiosyncratic (suggesting sequencing error or early stages of pseudogenization) while others cut a systematic swath across a whole opsin class (suggesting some functional significance by conservation over billions of years of branch length).

First note that the intervening residue in the CWxP is not conserved, incoherently uninformative even within an orthology class. Although non-random compositional abundances (T:146, L:71, S:59, V:41, M:33, A:33, G:33, I:14, C:4, F:3) suggest constraints do exist on the reduced alphabet at this position, they are weak and not significant to broader issues in opsin evolution. Similarly the C in CWxP dominates the reduced alphabet but A and S though not T or G are acceptable (implying a size constraint) and at places characteristic of various unrelated opsin classes (eg RHO2 and peropsin) over long time spans.
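The compositional tally quoted above can be reproduced from the dataset given further down this page. The following is a minimal Python sketch, not the procedure actually used here: the function name `motif_position_counts` is hypothetical, and the sample holds only six rows copied from that dataset, so paste in the full text to approach the quoted abundances.

```python
from collections import Counter

# Six rows copied from the dataset on this page: name, TM5 residue, TM6 motif.
SAMPLE = (
    "RHO1_bosTau,F,CWLP;RHO1_homSap,F,CWVP;RHO2_galGal,F,AWTP;"
    "SWS1_homSap,F,CYVP;LWS_homSap,C,CWGP;MEL1_homSap,F,SWAP"
)


def motif_position_counts(dataset: str, index: int) -> Counter:
    """Count residues at one position (0 to 3) of the four-letter TM6 motif.

    index 0 is the C position, 1 the W position, 2 the intervening x
    and 3 the P position of CWxP.
    """
    counts = Counter()
    # Drop all whitespace first so that wrapped lines do not split a record.
    for record in "".join(dataset.split()).split(";"):
        if not record:
            continue
        _name, _tm5, motif = record.split(",")
        counts[motif[index]] += 1
    return counts


if __name__ == "__main__":
    # Intervening residue (x) and the residue at the C position.
    print(motif_position_counts(SAMPLE, 2).most_common())
    print(motif_position_counts(SAMPLE, 0).most_common())
```

Setting `index` to 0 gives the reduced alphabet at the C position, the second observation made in the paragraph above.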

Second, most but not all opsin classes conform to the F CWxP pattern. Apart from substitution of other aromatic amino acids (arthropod opsins), changes that would disable the rotamer switch are seen in LWS, TMT, encephalopsin, fish RGR2 and the first and third neuropsins. The changes are mostly to aliphatic leucine but LWS and NEU3 have polar cysteine and threonine, resp. SWS1 is the only exception to the indole rotamer concept (Y in place of W) but still has aromatic-to-aromatic potential (Y to F).
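The sorting of switch pairs described above can be sketched as a simple rule on the TM5 residue and the residue at the tryptophan position of the TM6 motif. The three labels and the function name `classify_lock` are illustrative assumptions, not terminology from the ghrelin study, and the rule looks only at aromaticity (F, W, Y). It says nothing about whether a non-aromatic pair is broken or merely weakened.

```python
AROMATIC = set("FWY")


def classify_lock(tm5: str, motif: str) -> str:
    """Label a TM5 residue / TM6 motif pair by its aromatic stacking potential.

    motif is the four-letter CWxP-type string; its second letter is the
    residue sitting at the tryptophan position.
    """
    tm6 = motif[1]
    if tm5 in AROMATIC and tm6 in AROMATIC:
        if tm5 == "F" and tm6 == "W":
            return "canonical F/W lock"
        return "aromatic pair, non-canonical"
    return "no aromatic pair"


if __name__ == "__main__":
    # Rows copied from the dataset on this page.
    rows = [
        ("RHO1_bosTau", "F", "CWLP"),
        ("SWS1_homSap", "F", "CYVP"),
        ("UV7_ixoSca", "W", "AWTP"),
        ("ENC_homSap", "L", "CWMP"),
        ("LWS_homSap", "C", "CWGP"),
    ]
    for name, tm5, motif in rows:
        print(name, classify_lock(tm5, motif))
```

On these rows bovine RHO1 comes out canonical, SWS1 and the tick UV7 opsin come out as non-canonical aromatic pairs, and encephalopsin and LWS have no aromatic pair.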

None of the switches depart from the rotamer paradigm on both sides. It's not clear whether leucine-tryptophan switches are broken or merely weakened --- while the delocalized pi electron bond concept is inapplicable to single-bonded leucine, perhaps a hydrophobic association can contribute to an altered switch. On the hydrophilic side, the effect on deeply conserved structural water molecules in GPCR should be considered in conformational shifting. It may not be possible to understand conservation of internal polar residues in opsins without consideration of hydrogen-bonding networks and their switching states.

However LWS and NEU3 seem to have broken switches as these concepts are inapplicable. Yet if the switch were the only constraint, what has conserved the cysteine (resp. threonine) and tryptophan over billions of years of branch length? Note the kink-inducing proline remains present and is likely still absorbed by the tryptophan, suggesting the helix pivoting mechanism may still be operative. LWS and NEU3 maintain conventional DRY and NPxxYxxFR switch motifs (note LWS has somewhat unconventional ERW).

In ghrelin, a threonine in TM3 lies opposite the WF switch. This residue corresponds to a highly conserved glycine in all opsins (with the exception of RGR) but is not noticeably conserved in close-in GPCR. Glycine has no side chain capable of participating in the switch. Possibly a size constraint accounts for its conservation. This glycine is almost as effective as K296 in defining an opsin.

Finally, the GPCR closest to opsins almost all have rotameric lock potential but with similar exceptions. Thus the work with ghrelin, despite its belonging to a very different component of the GRAFS classification, appears quite applicable here as well.

GPCR           TM5  TM6       Opsin proxy    TM5  TM6   Match  Exceptions

CCR4_homSap    L    FWTP      RHO1_bosTau    F    CWxP  21/21
ADORA2A_homSap V    CWLP      RHO2_galGal    F    AWxP  21/22  RHO2_geoAus  F CWxP
P2RY8_homSap   F    CFAP      SWS2_ornAna    F    CWxP  12/12
UROPS1_triAdh  F    CFLP      SWS1_homSap    F    CYxP  15/15
GPR17_homSap   F    CFVP      LWS_homSap     C    CWxP  19/19  LWS1_calMil  A CWxP
CYSLTR1_homSap F    SFMP      PIN_galGal     F    CWxP  8/8
HCRTR1_homSap  Y    CYLP      VAOP_galGal    F    CWxP  5/15   also A and G for C
ADRA1D_homSap  Y    CWFP      PPIN_anoCar    F    CWxP  9/19   also T and S for C
NPY1R_homSap   Y    CWLP      PARI_anoCar    F    CWxP  7/7
PRLHR_homSap   Y    CWLP      ENC_homSap     L    CWxP  11/20  F early spp and FSGA for C
TACR2_homSap   Y    CWLP      TMT_monDom     L    CWxP  48/51  A for C in invertebrates
GPR21_homSap   Y    LWLP      MEL1_homSap    F    SWxP  62/71
GPR52_homSap   Y    LWLP      UV7_ixoSca     W    AWxP  10/10  A or S used at C
PPYR1_homSap   Y    LWLP      UV5_braKug     Y    SWxP  28/30  A or S used at C
TRHR_homSap    Y    LWMP      LMS_ixoSca     Y    AWxP  25/25
GALR1_homSap   Y    SWLP      BCR_limPol     Y    SWxP  10/12  SMAIC used at C
GPR161_homSap  Y    TWGP      RGR1_galGal    F    CWxP  9/10   GS used at C; LY for W Ciona
NMUR2_homSap   Y    CWAP      RGR2_danRer    L    CWxP  5/5
MTNR1A_homSap  F    CWAP      NEU1_homSap    L    AWxP  13/13  Y for F
QRFPR_homSap   F    CWAP      NEU3_galGal    T    AWxP  6/7
HRH2_homSap    F    CWFP      PER1_homSap    F    AWxP  14/21  Y for F; ASVCG for C
ADRB1_melGal   F    CWLP      NEU2_galGal    F    AWxP  4/4
ADRB2_homSap   F    CWLP      NEU4_ornAna    F    AWxP  9/9
BDKRB2_homSap  F    CWLP      TMT_triCys     F    AWxP  2/2
UROPS2_triAdh  F    CWLP     
SSTR1_homSap   F    CWMP     
OPRM1_homSap   F    CWTP     
GPR19_homSap   F    SWLP    

The dataset used to compile these summaries is provided below. Replace commas with tabs and semicolons with carriage returns for importation into a spreadsheet.

RHO1_bosTau,F,CWLP;RHO1_homSap,F,CWVP;RHO1_monDom,F,CWLP;RHO1_ornAna,F,CWVP;RHO1_galGal,F,CWVP;RHO1_anoCar,F,CWVP;RHO1_xenTro,F,CWVP;RHO1_neoFor,F,CWLP;RHO1_latCha,F,CWVP;RHO1_angAng,F,CWVP;RHO1_conMyr,F,CWVP;RHO1_danRer,F,CWVP;RHO1_tetNig,F,CWVP;RHO1_takRub,F,CWVP;RHO1_gasAcu,F,CWVP;RHO1_oryLat,F,CWLP;RHO1_leuEri,F,CWVP;RHO1_calMil,F,CWVP;RHO1_petMar,F,CWVP;RHO1_geoAus,F,CWVP;RHO1_letJap,F,CWVP;RHO2_galGal,F,AWTP;RHO2_taeGut,F,AWTP;RHO2_anoCar,F,AWTP;RHO2_gekGek,F,AWTP;RHO2_podSic,F,AWTP;RHO2_pheMad,F,AWTP;RHO2_neoFor,F,AWTP;RHO2_latCha,F,AWVP;RHO2_danRer,G,AWTP;RHO2a_danRer,F,AWVP;RHO2c_danRer,F,AWTP;RHO2d_danRer,F,AWTP;RHO2_tetNig,F,AWTP;RHO2_takRub,F,AWTP;RHO2_gasAcu,F,AWVP;RHO2_oryLat,F,AWVP;RHO2_oreNil,F,AWTP;RHO2_hipHip,F,AWTP;RHO2_mulSur,F,AWVP;RHO2_pomMin,F,AWVP;RHO2_calMil,F,AWLP;RHO2_geoAus,F,CWVP;SWS2_ornAna,F,CWLP;SWS2_galGal,F,CWAP;SWS2_taeGut,F,CWLP;SWS2_anoCar,F,CWLP;SWS2_utaSta,F,CWLP;SWS2_xenTro,F,CWLP;SWS2_neoFor,F,CWLP;SWS2_tetNig,F,CWLP;SWS2_takRub,F,CWLP;SWS2_gasAcu,F,CWMP;SWS2_oryLat,F,CWMP;SWS2_geoAus,F,CWLP;SWS1_homSap,F,CYVP;SWS1_monDom,F,CYVP;SWS1_smiCra,F,CYVP;SWS1_tarRos,F,CYVP;SWS1_galGal,F,CYVP;SWS1_taeGut,F,CYVP;SWS1_anoCar,F,CYVP;SWS1_utaSta,F,CYVP;SWS1_xenLae,F,CYVP;SWS1_neoFor,F,CYVP;SWS1_danRer,F,CYAP;SWS1_gasAcu,F,CYAP;SWS1_oryLat,F,CYGP;SWS1_petMar,F,CYV-;SWS1_geoAus,F,CYVP;LWS_homSap,C,CWGP;LWS_monDom,C,CWGP;LWS_macEug,C,CWGP;LWS_smiCra,C,CWGP;LWS_ornAna,C,CWGP;LWS_galGal,C,CWGP;LWS_anoCar,C,CWGP;LWS_xenTro,C,CWGP;LWS_neoFor,C,CWGP;LWS_danRer,C,CWGP;LWS_tetNig,C,CWGP;LWS_takRub,C,CWGP;LWS_gasAcu,C,CWGP;LWS_oryLat,C,CWGP;LWS1_calMil,A,CWGP;LWS2_calMil,A,CWGP;LWS_petMar,C,CWGP;LWS_letJap,C,CWGP;LWS_geoAus,C,CWGP;PIN_galGal,F,CWLP;PIN_taeGut,F,CWLP;PIN_colLiv,F,CWLP;PIN_utaSta,F,CWLP;PIN_pheMad,F,CWLP;PIN_podSic,F,CWLP;PIN_xenTro,F,CWLP;PIN_bufJap,F,CWLP;VAOP_galGal,F,CWMP;VAOP_taeGut,F,CWMP;VAOP_anoCar,F,CWSP;VAOP_xenTro,F,CWTP;VAOP_danRer,F,AWTP;VAOP_tetNig,F,AWTP;VAOP_takRub,F,AWTP;VAOP_gasAcu,F,GWTP;VAOP_oryLat,F,GWTP;VAOP_sal
Sal,F,GWTP;VAOP_pleAlt,F,GWTP;VAOP_rutRut,F,AWTP;VAOP_cypCar,F,AWTP;VAOP_ictPun,F,AWTP;VAOP_petMar,F,CWMP;PPIN_anoCar,F,CWLP;PPIN_xenTro,F,CWLP;PPIN_danRer,F,TWLP;PPINa_tetNig,F,SWLP;PPINa_takRub,F,SWLP;PPINa_gasAcu,F,SWLP;PPIN_ictPun,F,TWLP;PPIN_oncMyk,F,SWLP;PPINb_takRub,F,TWLP;PPINb_tetNig,F,TWLP;PPINb_gasAcu,F,TWLP;PPINb_mayZeb,F,TWLP;PPINa_petMar,F,CWLP;PPINb_petMar,F,CWLP;PPIN_letJap,F,CWLP;PPINa_cioInt,F,CWLP;PPINb_cioInt,F,CWTP;PPINa_cioSav,F,CWLP;PPINb_cioSav,F,CWLP;PARIE_anoCar,F,CWLP;PARIE_utaSta,F,CWLP;PARIE_xenTro,F,CWLP;PARIE_danRer,F,CWLP;PARIE_tetNig,F,CWLP;PARIE_takRub,F,CWLP;PARIE_gasAcu,F,CWLP;ENC_homSap,L,CWMP;ENC_otoGar,L,CWMP;ENC_musMus,L,CWMP;ENC_canFam,L,FWMP;ENC_pteVam,L,SWMP;ENC_loxAfr,L,CWMP;ENC_monDom,L,CWMP;ENC_galGal,L,CWMP;ENC_anoCar,L,CWMP;ENC_xenTro,I,GWMP;ENC_danRer,F,CWTP;ENC_tetNig,F,CWTP;ENC_takRub,F,CWTP;ENC_gasAcu,F,CWTP;ENC_oryLat,F,CWTP;ENC_torCal,L,CWLP;ENC_petMar,L,CWSP;ENC4_braFlo,F,CWTP;ENC4_braBel,F,CWTP;ENC_strPur,F,AWSP;TMT5_braFlo,L,CWLP;TMT5_braBel,L,CWLP;TMT_monDom,L,CWVP;TMT_macEug,L,CWVP;TMT_ornAna,L,CWMP;TMT_galGal,L,CWIP;TMT_taeGut,L,CWIP;TMT_anoCar,L,CWMP;TMT_xenTro,L,CWLP;TMTa_danRer,L,CWMP;TMTb_danRer,L,CWMP;TMT_tetNig,L,CWMP;TMT_takRub,L,CWMP;TMT_gasAcu,L,CWMP;TMT_oryLat,L,CWMP;TMT_ictPun,L,CWTP;TMTc_xenTro,L,CWMP;TMTc_danRer,L,CWMP;TMTc_tetNig,L,CWMP;TMTc_takRub,L,CWMP;TMTc_oryLat,L,CWMP;TMTa_oncMyk,L,CWMP;TMTa_anoCar,L,CWLP;TMTa_xenTro,L,CWLP;TMTa1_danRer,L,CWLP;TMTa_takRub,L,CWLP;TMTa_tetNig,L,CWLP;TMTa_gasAcu,L,CWLP;TMTa_oryLat,L,CWLP;TMTa_pimPro,L,CWLP;TMTb_takRub,L,CWLP;TMTb_tetNig,L,CWLP;TMTb_gasAcu,L,CWLP;TMTb_oryLat,L,CWLP;TMTa1_calMil,L,CWLP;TMTx_braFlo,L,CWTP;TMTy_braFlo,L,AWLP;TMT1_strPur,L,AWTP;TMT1_plaDum,Y,AWTP;TMT2_plaDum,F,AWSP;TMT1_anoGam,L,AWTP;TMT2_anoGam,L,AWTP;TMT_aedAeg,L,AWTP;TMT_culPip,L,AWTP;TMT_triCas,L,AWSP;TMT_apiMel,L,AWSP;TMT_rhoPro,F,AWTP;TMT_acyPis,L,AWMP;TMT_bomMor,Q,AWTP;TMTa_dapPul,L,AWTP;TMTb_dapPul,L,AWTP;MEL1_homSap,F,SWAP;MEL1_panTro,F,SWAP;MEL1_gorGor,F,SWAP;MEL1_ponA
be,F,SWAP;MEL1_rheMac,F,SWAP;MEL1_calJac,F,SWAP;MEL1_micMur,F,SWAP;MEL1_otoGar,F,SWAP;MEL1_musMus,F,SWAP;MEL1_ratNor,F,SWAP;MEL1_nanEhr,F,SWAP;MEL1_phoSun,F,SWAP;MEL1_bosTau,F,SWAP;MEL1_susScr,F,SWAP;MEL1_equCab,F,SWAP;MEL1_felCat,F,SWAP;MEL1_canFam,F,SWAP;MEL1_myoLuc,F,SWAP;MEL1_pteVam,F,SWAP;MEL1_eriEur,F,SWAP;MEL1_proCap,F,SWAP;MEL1_echTel,F,SWAP;MEL1_monDom,F,SWAP;MEL1_smiCra,F,SWAP;MEL1_ornAna,F,SWCP;MEL1_anoCar,F,SWSP;MEL1_galGal,F,SWSP;MEL1_taeGut,F,SWSP;MEL1_xenTro,F,SWSP;MEL1a_danRer,F,SWSP;MEL1b_danRer,F,SWSP;MEL1_tetNig,F,SWSP;MEL1_takRub,F,SWSP;MEL1_gasAcu,F,SWSP;MEL1_oryLat,F,SWSP;MEL1a_calMil,F,SWSP;MEL1_petMar,F,SWSP;MEL1_cioInt,F,SWMP;MEL1_cioSav,F,SWMP;MELx_braFlo,F,CWCP;MEL_braFlo,Y,AWTP;MEL_braBel,Y,AWTP;MEL6_braFlo,F,CWCP;MEL6_braBel,F,CWCP;MEL2_galGal,F,SWSP;MEL2_anoCar,F,SWSP;MEL2_xenLae,F,SWSP;MEL2_danRer,F,SWAP;MEL2_tetNig,F,SWSP;MEL2_takRub,F,SWSP;MEL2_gasAcu,F,SWSP;MEL2_oryLat,F,SWSP;MEL1_strPur,F,SWMP;MEL2_strPur,F,AWFP;MEL1_plaDum,F,SWTP;MEL1_capCap,F,SWTP;MEL1_helRob,F,CWVP;MEL2_helRob,F,CWTP;MEL1_schMed,F,SWTP;MEL1_schMan,F,SWSP;MEL2_schMan,F,CWTP;MEL3_schMan,F,SWTP;MEL1_lotGig,F,SWTP;MEL1_sepOff,F,SWSP;MEL1_todPac,F,SWSP;MEL1_entDof,F,SWSP;MEL1_aplCal,F,SWSP;MEL2_aplCal,F,AWTP;MEL2_lotGig,F,SWVP;MEL1_patYes,F,SWSP;MEL1_dapPul,F,AWTP;UV7_ixoSca,W,AWTP;UV7_tetUrt,W,SWTP;UV7a_acyPis,W,SWTP;UV7b_acyPis,W,SWTP;UV7_rhoPro,W,SWTP;UV7_pedHum,W,SWTP;UV7_anoGam,W,AWTP;UV7_aedAeg,W,AWTP;UV7_culQui,W,AWTP;UV7_bomMor,W,SWTP;UV7_droMel,Y,AWSP;UV7_droYak,Y,AWSP;UV7_droAna,Y,AWSP;UV7_droPse,Y,AWSP;UV7_droWil,Y,AWSP;UV7_droMoj,Y,AWSP;UV5_plePay,W,SWAP;UV5_hasAda,W,SWVP;UV5_braKug,Y,SWTP;UV5_triLon,Y,AWTP;UV5_triGra,Y,AWTP;UV5a_dapPul,Y,AWTP;UV5b_dapPul,W,SWTP;UV5_triCas,Y,SWTP;UV5_lucCru,Y,SWTP;UV5_anoGam,Y,SWTP;UVB_anoGam,Y,AWTP;UV5B_droMel,Y,AWTP;UV4_droMel,F,SWTP;UV3_droMel,F,SWTP;UV5_acyPis,Y,SWTP;UV5_rhoPro,Y,SWTP;UV5_apiMel,Y,SWTP;UV5_nasVit,Y,AWTP;UV5_papXut,Y,SWTP;UV5_bomMor,Y,SWTP;UV5_manSex,Y,SWTP;UV5_pedHum,Y,SWTP;UV5_diaNig,Y,SWTP;UVB_acyPi
s,Y,AWTP;UVB_megVic,Y,AWTP;UVB_apiMel,Y,AWTP;UVB_nasVit,Y,SWTP;UVB_bomMor,Y,AWTP;UVB_manSex,Y,AWTP;UVB_diaNig,Y,SWTP;LMS_ixoSca,Y,AWTP;LMS_tetUrt,Y,AWTP;LMS1_plePay,Y,AWTP;LMS2_plePay,Y,AWTP;LMS1_hasAda,Y,AWTP;LMS2_hasAda,Y,AWTP;LMS_limPol,Y,AWTP;LMS_NeoOer,Y,AWTP;LMS_lucCru,Y,AWTP;LMS_triCas,Y,AWTP;LMS1_droMel,Y,AWTP;LMS6_droMel,Y,AWTP;LMS_anoGam,Y,AWTP;LMS2_droMel,Y,AWTP;LMS_rhoPro,Y,AWTP;LMS_acyPis,Y,AWTP;LMS_homCoa,Y,AWTP;LMSa_nasVit,Y,AWTP;LMSb_nasVit,Y,AWTP;LMSa_apiMel,Y,AWTP;LMSb_apiMel,Y,AWTP;LMS_manSex,Y,AWTP;LMS_bomMor,Y,AWTP;LMS_papXut,Y,AWTP;LMS_schGre,Y,GWTP;BCR_limPol,Y,SWTP;BCR_triGra,Y,MWTP;BCR2_triLon,Y,MWTP;BCR1_triGra,W,AWTP;BCR2_triGra,Y,AWSP;BCR3_triGra,Y,AWTP;BCR1_triLon,Y,AWSP;BCR2_braKug,Y,IWTP;BCR3_braKug,Y,IWTP;BCRa_dapPul,Y,AWTP;BCRa_hemSan,F,CWTP;BCRb_hemSan,Y,CWTP;BCR_porPel,Y,CWTP;RGR1_homSap,F,GWGP;RGR1_ornAna,F,CWGP;RGR1_galGal,F,CWGP;RGR1_xenTro,F,CWGP;RGR1_gasAcu,M,CWGP;RGR1_calMil,F,CWGP;RGRa_cioInt,F,GLLP;RGRa_cioSav,F,SLLP;RGRb1_cioInt,F,CYLP;RGRb2_cioInt,F,GYLP;RGRb2_cioSav,F,GYLP;RGR2_danRer,L,CWGP;RGR2_tetNig,L,CWGP;RGR2_gasAcu,L,CWGP;RGR2_oryLat,L,CWGP;RGR2_pimPro,L,CWGP;PER1_homSap,F,AWSP;PER1_monDom,F,AWSP;PER1_ornAna,F,AWSP;PER1_xenTro,F,AWSP;PER1_gasAcu,F,AWSP;PER1_braFlo,Y,AWTP;PER1_braBel,Y,AWTP;PER2_braFlo,Y,SWTP;PER2_braBel,Y,SWTP;PER3_braFlo,F,AWTP;PER3_braBel,F,AWTP;PER2a_strPur,Y,VWAP;PER2b_strPur,F,VWTP;PER1a_sacKol,Y,CWSL;PER1b_sacKol,F,SWFP;PER1_lotGig,F,GWGP;PER1_aplCal,F,GWGP;PER1_todPac,F,SWSG;PER2_patYes,Y,AWTP;PER_hasAda,F,AWSP;PER_ixoSca,F,AWTP;NEUR1_homSap,L,AWIP;NEUR1_calJac,L,AWIP;NEUR1_musMus,L,AWIP;NEUR1_ochPri,L,AWIP;NEUR1_canFam,L,AWIP;NEUR1_bosTau,L,AWIP;NEUR1_loxAfr,L,AWIP;NEUR1_dasNov,L,AWIP;NEUR1_monDom,L,AWIP;NEUR1_ornAna,L,AWIP;NEUR1_galGal,L,AWIP;NEUR1_xenTro,L,AWFP;NEUR1_danRer,L,AWIP;NEUR_strPur,F,SWTP;NEUR2_galGal,F,AWSP;NEUR2_anoCar,Y,AWSP;NEUR2_xenTro,Y,AWTP;NEUR2_danRer,Y,AWSP;NEUR3_galGal,T,AWTP;NEUR3_taeGut,T,SWTP;NEUR3_anoCar,T,AWTP;NEUR3_xenTro,T,SWTP;NEUR3a_danRer,T,SWAP;NEUR3b_dan
Rer,L,CWAP;NEUR3a_tetNig,T,SWAP;NEUR4_ornAna,F,AWSP;NEUR4_galGal,F,AWSP;NEUR4_taeGut,F,AWSP;NEUR4_anocar,F,AWSP;NEUR4_xenTro,F,AWSP;NEUR4_danRer,F,AWSP;NEUR4_tetNig,F,AWSP;NEUR4_gasAcu,F,AWSP;NEUR4_calMil,F,AWSP;TMT_triCys,F,AWLP;CUBOP_carRas,Y,AWTP;P2RY8_homSap,F,CFAP;UROPS1_triAdh,F,CFLP;GPR17_homSap,F,CFVP;MTNR1A_homSap,F,CWAP;QRFPR_homSap,F,CWAP;HRH2_homSap,F,CWFP;ADRB1_melGal,F,CWLP;ADRB2_homSap,F,CWLP;BDKRB2_homSap,F,CWLP;UROPS2_triAdh,F,CWLP;SSTR1_homSap,F,CWMP;OPRM1_homSap,F,CWTP;CYSLTR1_homSap,F,SFMP;GPR19_homSap,F,SWLP;CCR4_homSap,L,FWTP;ADORA2A_homSap,V,CWLP;NMUR2_homSap,Y,CWAP;ADRA1D_homSap,Y,CWFP;NPY1R_homSap,Y,CWLP;PRLHR_homSap,Y,CWLP;TACR2_homSap,Y,CWLP;HCRTR1_homSap,Y,CYLP;GPR21_homSap,Y,LWLP;GPR52_homSap,Y,LWLP;PPYR1_homSap,Y,LWLP;TRHR_homSap,Y,LWMP;GALR1_homSap,Y,SWLP;GPR161_homSap,Y,TWGP
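The comma-to-tab and semicolon-to-line-break conversion described above can be scripted rather than done by hand. This is a minimal Python sketch; the function name `to_tsv` is hypothetical. It assumes that any line breaks inside the pasted text are wrapping artifacts rather than data, so all whitespace is removed before the records are split.

```python
def to_tsv(blob: str) -> str:
    """Convert 'name,TM5,motif;name,TM5,motif;...' text to tab-separated rows."""
    # Remove all whitespace so a name wrapped across two lines is rejoined.
    compact = "".join(blob.split())
    rows = [record.split(",") for record in compact.split(";") if record]
    return "\n".join("\t".join(fields) for fields in rows)


if __name__ == "__main__":
    # Paste the full dataset between the triple quotes; two rows shown here.
    blob = """RHO1_bosTau,F,CWLP;RHO1_homSap,F,CWVP"""
    print(to_tsv(blob))
```

The output can be saved with a `.tsv` extension and opened directly in a spreadsheet, with one opsin per row and columns for name, TM5 residue and TM6 motif.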

Structural constraints on opsin evolution

It's proven extremely difficult to determine the 3D structure of additional GPCRs despite an immense research effort but finally in August 2007 that of a construct based on human beta2 adrenergic receptor (intronless gene ADRB2) was obtained. Beyond other intronless adrenergic receptors, it's most closely related within the human genome to dopamine and serotonin receptors (DRD1 and HTR4 resp.) The latter has 8 coding introns which we'll consider later as a control on specificity. Ominously, ADRB2 has best blastp to our putative new Nematostella melanopsin nemVec1 and one from annelid, MEL2_capCap and otherwise resembles melanopsins at the 30% identity level!

In Nov 2008 the structure of ADORA2A, a human A2a adenosine receptor GPCR, was determined to 2.6 angstroms (PDB 3EML), with transmembrane segments positionable by OPM. Note the third cytoplasmic loop had to be replaced with bacteriophage T4 lysozyme and the long carboxy-terminal tail (A317–S412) deleted to obtain good crystals.

ADORA2A is a 2-exon gene on chr 22q11 with nearest paralogs adenosine A2b, A1 and A3 receptors, all of which appear descended from a retro-processed gene that later acquired a new intron prior to family expansion. These proteins use Gs and Go transducin and raise cAMP upon activation, but can be blocked by methylxanthines such as caffeine. The A2A adenosine receptor does not contain a palmitoylation site as do the majority of GPCRs but has four disulfide bridges in the extracellular domain.

The 7 transmembrane helices are arranged rather similarly to recently determined human β2AR (2rh1), turkey β1AR (2vt4), modeled A3 (1R7N) and squid and bovine rhodopsins. However the binding pocket is located somewhat differently, allowing a more extended configuration of bound adenosine. This raises the question of whether K-rhodopsins (eg cnidarian rhodopsin-class non-rhodopsins but with conserved retinal lysine) could bind both retinal and a second modulatory agonist.

AdenosineA2R.jpg

There's far less known about rhabdomeric opsin structure but the rule of thumb in crystallography of soluble proteins is that an unknown sequence can be reliably fitted to a known 3D structure if homology exceeds the 30% identity level, with the big picture retained even at much lower levels. This rule may also apply to membrane opsins because the 7-transmembrane topology (deduced from hydrophobic periodicity plots) is a very deeply conserved feature. Indeed melanopsin models satisfactorily. However subtleties of ectopic interactions may not emerge from structural fitting despite the many constraints provided by invariant residues, though residue covariance can sometimes be inferred from direct statistical study of the sequences themselves. The authors of the beta adrenergic study are skeptical of such fitting.

Bovine RHO1 can be expressed in place of the main Drosophila rhabdopsin. This works fairly well despite the 22% nominal identity but exposes numerous differences in the various structural requirements. Lophotrochozoan opsin structure has been studied in squid. The opsin is arranged in an ordered lattice in the photoreceptor membranes with a consistent optimal orientation of the retinal that allows sensing the plane of polarized light. Docking into bovine rhodopsin indicates the usual helical packing and extracellular plug structure in EXC2. But the intermolecular contacts are made by a novel cytoplasmic transverse C-terminal helix. A substantial insertion relative to vertebrate opsins occurs in CYT3. As this also occurs in most GPCRs, this extra loop is probably ancestral, ie a deletion in stem vertebrates.

Heuristic ancestral sequences

We'll take a heuristic approach here to ancestral sequence reconstruction because not all possible evolutionary nuances have tangible sequelae for our central focus of disentangling very ancient gene duplications and divergences for the purpose of photoreceptor functional homologization. It is very likely that a pragmatic hand-curational approach informed by expert opinion and tailored to the particular circumstances of opsin structure/function produces a better product than blind application of statistical web software whose appealing 'objectivity' only masks massive internal subjectivity in parameter choices and mutational processes. However with opsins the outcomes may scarcely differ in the early rounds of reconstruction at the determinable positions, for example the opsin portfolio at the lamprey node.

We can expect as ancillary benefits (1) a tenfold reduction in the number of sequences under management, (2) a small set of proxy sequences that retains all of the information (including intron and indel rare genomic events) but none of the idiosyncrasies, (3) a blast query that significantly outperforms any of its constituent sequences on outside opsins because it has taken off 500 million years of divergence time, and (4) a sequence less likely to be fooled by non-opsin rhodopsin superfamily members or generic GPCR.

It's best to proceed in stages with the actual work of ancestral sequence reconstruction, as determined by phylogenetically dispersed sampling density. That is, lophotrochozoan ciliary opsins are known in too low numbers in too few species, whereas an excessive number of insect rhabdomeric and teleost cone opsins are available. After a bioinformatic push on new genomes, the resultant data set allows ancestral sequence reconstructions at the common ancestor with lamprey for all classes of deuterostome opsins and at the ancestral arthropod for rhabdomeric imaging opsins. For ciliary opsins in lophotrochozoa, cnidaria, and early diverging deuterostomes, the sparse set of individual sequences must initially be retained. This will unavoidably mix filtered ancestral sequences with noisy contemporary species-level opsin sequences at the interpretative stage. That's the usual state of affairs in bioinformatics and hardly a show-stopper.

In actual ancestral opsin reconstruction, we won't use a consensus sequence except as a heuristic because that doesn't exploit the known gene tree and species tree. There's a potential benefit to single-species consistency -- species such as Xenopus have nearly a full set -- because an actual sequence preserves subtle co-evolving residue pairs. Profile sequences (which retain the dispersion over the 20 possible amino acids at each reconstructed position) are powerful but unwieldy -- the most useful output is a logo graphic, which however requires trimming sequences to fixed length and loses text character. Most of the benefit can be skimmed by use of a reduced alphabet at positions where this is necessary. That is, at some residues the ancestral value is truly undeterminable, being at any given time a polymorphic mix of more or less equally acceptable alternatives (eg asparagine/glutamine waffling). Special symbols used to indicate reduced alphabets can become unrecognizable to blast-type tools, which expect the standard 20.

Outgroups have an important role in arbitrating ancestral residue choice when two sister clades disagree. Here simple parsimony drives the decision. If say threonine is used in one clade of cone opsins and serine in another, while pinopsin and the others use threonine, then the ancestral residue is threonine and not the reduced alphabet threonine/serine, ie the serine is taken as a clade-specific change on that stem. This extends to an 8 row parsimony decision table that covers all the combinatorial possibilities.
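The threonine/serine case can be sketched in a few lines. This is not the 8 row table itself, which is not spelled out here; it covers only the simple rule described above, and the function name arbitrate and its one-letter inputs are hypothetical.

```python
def arbitrate(clade_a, clade_b, outgroup):
    """Pick an ancestral residue for two sister clades using an outgroup.

    Each argument is the one-letter residue predominating in that group
    at a single alignment column (an assumed simplification).
    """
    if clade_a == clade_b:
        return clade_a                  # sisters agree; outgroup not needed
    if outgroup in (clade_a, clade_b):
        return outgroup                 # one change on one stem is most parsimonious
    # No parsimony winner: fall back to a reduced alphabet of both residues.
    return "/".join(sorted((clade_a, clade_b)))

print(arbitrate("T", "S", "T"))   # T
print(arbitrate("T", "S", "A"))   # S/T
```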

Understanding stratified invariance in opsin alignments

Before going there, let's take a quick overview using stratified invariance in post-lamprey post-encephalopsin ciliary opsins. That's done by aligning the opsins and taking the consensus line at incrementally declining percent identity requirements, that is 100% invariant, 95%, 90%, etc. We need to know which residues are conserved at what depth and why.
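The procedure just described can be sketched as follows, assuming the sequences are already aligned to equal length. The function names and the toy fragments are illustrative only, and the sketch does not reproduce the upper/lower case or reduced-alphabet symbols used in the real consensus lines.

```python
from collections import Counter

def consensus_at(seqs, threshold):
    """One consensus line: the commonest residue in each column where its
    frequency meets the threshold, '.' elsewhere. Gaps never win."""
    line = []
    for column in zip(*seqs):                    # walk column by column
        residue, count = Counter(column).most_common(1)[0]
        if residue != "-" and count / len(seqs) >= threshold:
            line.append(residue)
        else:
            line.append(".")
    return "".join(line)

def stratified(seqs, thresholds=(1.0, 0.75, 0.5)):
    """Consensus lines at incrementally declining identity requirements."""
    return {t: consensus_at(seqs, t) for t in thresholds}

toy = ["NPVIY", "NPIIY", "NPVIF", "NAVIY"]       # toy aligned fragments
for t, line in stratified(toy).items():
    print(f"{int(t * 100):3d}% {line}")
```

On the toy input the 100% line is N..I. and the 75% line is NPVIY, showing how positions fill in as the requirement is relaxed.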

Opsins are hair-trigger, fail-safe by structural design: human RHO1 can detect as few as 5 photons, each one of which can consistently activate hundreds of G-proteins via Gt transducin. Very rarely does an unactivated opsin cause G-protein signaling. An activated opsin cannot cross-signal through a non-cognate G-protein even though these alpha proteins are themselves ultimately descended from a single gene. Tolerated mutational change must work around these constraints. (Recall though the special situation of parietopsin co-expressed in the same parietal cell as pinopsin, with both transducin-like Gd gustducin hyperpolarizing and Go depolarizing responses ongoing.)

Now upon binding of agonist, the trigger to signal via conformational change may not be that different between opsins and say serotonin receptor. However the retinal binding pocket is so well described that conservation at those residues alone can discriminate opsins from generic GPCR. It is implausible, though, that these same residues distinguish Go, Gq, and Gt opsin classes. Yet somewhere in each opsin class conservation pattern must lie a diagnostic signature that prevents cross-activation despite the inevitable fold similarity of paralogous heterotrimeric G-protein alpha subunits.

Opsins class tree.png

That specificity signature presumably consists of cytoplasmic regions where physical binding occurs. It can't lie in N- or distal C-termini because the former is extracellular and the latter lacks common ground beyond residue 320 across even ciliary (much less rhabdomeric or Gt) opsins, even as the G-proteins remain fixed within a given species.

Thus the opsin class tree must have nested conservation paths, for example generic GPCR < rhodopsin-superfamily < opsin < ciliary opsin < imaging opsin < cone opsin < ultraviolet sensing. Our bioinformatic objective is to place an adequate discriminatory function on each node of the class tree based on comparative genomics alone. Rather than opaque analysis (eg support vector machine) that perhaps does the job optimally but does not explicitly finger key residues or structural motifs, we seek a transparent method that provides actual insight. The Opsin Classifier already does this by clustering full length sequences in homology space, but we're far better off integrating that with the four independent methods (indels, introns, synteny, and discriminatory internal residues).

(The tree is for illustrative purposes only. It can be redrawn to suit by simply adjusting its Newick string: (misc-GPCR:7,(dopamine-receptor:6,((retinal-isomerases:4,(melanopsins:3,pteropsins:3)Gq:1):1, (Go-opsins:4,(encephalopsins:3,(pinopsins:2,(cone-opsins:1, rod-opsins:1):1):1):1)Gt:1)opsins:1):1):1;)
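Before handing an edited Newick string to a tree drawing tool, it helps to check that parentheses still balance and that the leaf and internal labels are the intended ones. A minimal sketch using only regular expressions (not a full Newick parser; the pattern names are hypothetical and labels are assumed to contain no spaces, colons, commas or parentheses):

```python
import re

NEWICK = ("(misc-GPCR:7,(dopamine-receptor:6,((retinal-isomerases:4,"
          "(melanopsins:3,pteropsins:3)Gq:1):1, (Go-opsins:4,(encephalopsins:3,"
          "(pinopsins:2,(cone-opsins:1, rod-opsins:1):1):1):1)Gt:1)opsins:1):1):1;")

LEAF = re.compile(r"[(,]\s*([^(),:;\s]+):")   # label right after '(' or ','
NODE = re.compile(r"\)([^(),:;\s]+):")        # label right after ')'

balanced = NEWICK.count("(") == NEWICK.count(")")
leaves = LEAF.findall(NEWICK)
internal = NODE.findall(NEWICK)

print(balanced)    # True
print(leaves)      # ten class names, misc-GPCR through rod-opsins
print(internal)    # ['Gq', 'Gt', 'opsins']
```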

Stratified invariance is an intuitive approach to diagnostics that allows conservation 'subtraction'. That is, we can successively peel off the conservation attributable to generic GPCR, then rhodopsin-superfamily, then opsin. From that we remove conservation specific to pteropsins, melanopsins, retinal isomerases, encephalopsins, pinopsins, cone opsins, etc that might specify arrestin tuning and so forth. That leaves conservation specific to ciliary opsins. That conservation invites experimental validation through targeted mutagenesis and helps find and assess remotely homologous opsins in early diverging species.
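Conservation 'subtraction' amounts to masking, in the consensus line of a narrow class, every position already conserved in a broader class. A minimal sketch, assuming both consensus lines use '.' for unconserved positions and cover the same alignment columns; the function name and the toy lines are hypothetical.

```python
def subtract(specific, general):
    """Keep residues conserved in the narrower class only.

    Positions conserved in the broader class (any non-'.' character in
    `general`) are masked with '.', whatever the narrower class shows.
    """
    if len(specific) != len(general):
        raise ValueError("consensus lines must cover the same columns")
    return "".join(
        s if s != "." and g == "." else "."
        for s, g in zip(specific, general)
    )

narrow = "VYNPVIYI"     # toy line for a narrow class
broad = "..NP..Y."      # toy line for a broader class
print(subtract(narrow, broad))   # VY..VI.I
```

Applying the function repeatedly, broadest class first, gives the successive peeling described above.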


Stratified Invariance in Ciliary Opsins
column height = conservation depth; rho1 = human rhodopsin RHO1 
opns line = conserved all opsins 90% caps/50% lower
special symbols = reduced alphabets

100% ..............y.................N..................................................G...C..#.......G............eR..V!..P..................W..........................
 95% ..............f.................N............L....N....n.......................%%..G...C..#.%.....G............eR..V!C.P..................W.........P...W..%........C
 90% ..............y......M..........N............LR...N....N.......................%F..G...C..#G%.....G............ER..V!C.P..................W.........P..GW..%........C
 85% ..............f......M..........N......T.....LR.P$N....Nv...#.................GYF..G...C..EG%.....G...$.S....A.ER..V!C.P.g........A.......W........PP..GW..Y...G...SC
 80% ..............y......M..........N......T.....LR.PLN.!..NL...#.................GYF..G...C..EGF.....G...L.SL...A.ER..V!C.P.G........A.......W........PP..GW..Y...G...SC
 75% ..............f......M..........N......T...K.LR.PLN%!LVNL..A#......g..........GYF..G...C..EGF.....G...LWSL.!.A.ERy.V!CKP.G...F....A..G....W........PP..GWS.Y.PEG...SC
 70% ..............y......M..........N......T...K.LR.PLNYILVNLA!A#L.....G..........GYF..G...C..EGF.....G...LWSL.!.A.ERy.V!CKP.G#..F...HA..G....W........PP..GWS.Y.PEG...SC
 65% P...Pq........f......M..........N.vV..!T...KKLR.PLNYILVNLA!ADL.....G..........GYF..G...C..EGF.....G.!.LWSL.!$A.ERy.V!CKP.G#..F...HA..G!.%.W........PPL.GWS.Y.PEG...SC
 60% P...Pq...A....y......M..........N.LV..VT.k.KKLR.PLNYILVNLA!ADL.....G..........GYF..G...C..EGF.....G.!.LWSLa!LA.ERY.V!CKP.GN..F...HA..G!.F.W!.......PPLFGWS.Y.PEG$..SC
 55% Pf..Pq...A..w.f...A..M..........N.LV..VT.KfKKLR.PLNYILVNLA!ADL.....G.......#..GYF.$G...C..EGF.v...G!V.LWSLAVLA.ERY.V!CKP.GN..F...HA..G!.F.W!....W..PPLFGWSRY.PEG$.TSC
 50% PF..PQ...A.PW.Y..LA..M..v.......N.LV!.VT.KfKKLR.PLNYILVNLAVADL.....G.T!....#..GYF.LG...C..EGF.V...G!V.LWSLAVLA%ERY.VVCKP$GNF.F...HA..G!.FTW!....W..PPLFGWSRY.PEG$.TSC
opsn:................................N..v.......k.lr.p.n....nla..d...................f..g...C...gf.....g..s...l..la..Ry.vi..p..........a.......W.....w...pl.gw..y.peg..tsC
rho1:PFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMVLGGFTSTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSRYIPEGLQCSC

100: ...........................................................#......!..M!..%.....PY......................P....K....%NP.IY...N..F...................................
 95: ...............%...........P...I...Y.......................#.....MV..M!..%.....PY......................P....K....YNP.IY..$N.#F...................................
 90: ...............%...........P...I...Y.......................E..V..MV..M!..%.....PY......................P....K....YNP.IY..$N.QFR..................................
 85: ..#.%..........%........F..P...I...Y.......................E.#V..MV!.M!..F..CW.PY......................P.%F.K...!YNP!IY..$N.QFR.C.......G........................
 80: ..#.%..........%........F..P...I...Y.......................E.#V.RMV!.M!..F..CW.PY...A..................P.%F.K...!YNP!IY!.$N.QFR.C.......G........................
 75: .P#.%.........S%....F..CF..P..!I...Y.............#.....t..AE.#V.RMV!.M!..F..CW.PY...A..................P.%F.K...!YNP!IY!.$N.QFR.C.......G.......#................
 70: .P#.%.........S%....F..CF..P..!I...Y..L.......A.#..#...T..AE.#V.RMV!!M!..F..CW.PYA..A..................P.%F.K...!YNP!IY!.MNKQFR.C......cG.......#................
 65: .P#WY.........SY!...F..CF..P..!I...Y..L...$...A.QQ.#...T.KAE.EV.RMV!!MV..F$.CW.PYA..A..................P.%F.K...!YNP!IY!%MNKQFR.C......CG.......#...t.S.V....
 60: GP#WY.......#.SY!...F..CF..PL.!I.%.Y..L...$..!A.QQ.Es..TQKAE.EV.RMV!!MV.AFL!CW.PYA.fA..!..#......P....!P.%F.K...!YNPIIY!FMNKQFR.C......CG.......#...T#S.VS...!.P.
 55: GPDWY....#..#.SY!!.$F..CF.!PL.!I.F.Y..L$..LR.VA.QQ.ES..TQKAE.EV.RMV!VMV.AFL!CW.PYA.FA..!..N.....#P..A.!PA%F.K.S.VYNPIIY!FMNKQFR#C......CG.....#.#...T#SSVS...V.P.
 50: GPDWY....#..#.SY!!.$F..CF.!PL.!I.FSY..LL..LR.VA.QQ.ES..TQKAE.EVTRMVVVMV.AFL!CW.PYA.FA$.!..N.....#P..AT!PA%F.KSStVYNPIIY!FMNKQFR#C.$....CGK.p..#.#.S.T#SSVS.s.V.P.
opsn:..d...........sy........f..Pl..i...Y.......................e.....m...m...f...w.PYa...............p.....p..f.k.s..ynpiiy...n..fr..................................
rho1:GIDYYTLKPEVNNESFVIYMFVVHFTIPMIIIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSAAIYNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA

There's no conservation depth outside the opsin core, which begins at a classical waffle residue (Y or F) not much sooner than a very deeply conserved asparagine, Asn55 (in the FSMLAAYMFLLIVLGFPIN region, human RHO1 terminology). That Asn55 is known to be conserved within GPCR far outside of opsins, making it diagnostically useless. The reason for the prodigious conservation of this particular amino acid -- which likely exceeds many trillion years of branch length considering all the family members and all the species -- is apparently structural. Its side chain makes two interhelical hydrogen bonds, to Asp83 in TMH2 and to the peptide carbonyl of Ala299 in TMH7. Asp83 is in turn connected via a water molecule to the peptide carbonyl of Gly120 in TMH3 (ie is not side-chain specific). Nearby Asn78, also in TMH2, constrains three helices via hydrogen bonds to hydroxyl groups of Ser127 of TMH3 and Thr160 and Trp161 of TMH4. Of course glutamine could furnish the same bond donors, but the extra CH2 group would push the bonds forward out of that geometry; no coevolutionary change in acceptor can accommodate this given the palette of 20 amino acids, so it is never seen.

Opsins asn55 dry134.png

The ERY motif (which can accommodate D in first position and W in third) is another huge source of confusion in the opsin literature. It too is not at all specific to opsins within GPCR nor to G-coupling type; consequently it cannot be used to argue say that distant Nematostella or Hydra blast matches are indeed opsins rather than some other class of GPCR. The ERY motif is also structural. Glu134 forms a salt bridge with the guanidinium moiety of adjacent Arg135, which is in turn hydrogen-bonded to Glu247 and Thr251 in TMH6, a relation possibly critical to keeping rhodopsin in the inactive conformation -- mutations lead to constitutive activity. Movement in TMH3 during the photoreception cycle changes the environment of the ERY motif, causing its reorientation.

The NPxxY motif is a third conserved patch lying in TMH7 specific to rhodopsin-superfamily but not to opsins. However the stratified alignment shows that a slightly larger patch might be diagnostic for ciliary opsins and distinguish them from rhabdomeric, say VYNPVIYI (with specifically reduced alphabets for the hydrophobic residues). This type of exercise requires a massive opsin reference collection to be sure the full range of natural variation is seen. Here the structural and functional significance of this motif is murky. The two polar residues Asn302 and Tyr306 are internal, with the former possibly hydrogen bonding via a water bridge to Asp83 and the latter's hydroxyl close to Asn73 (highly conserved among generic GPCRs).
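Both motifs are easy to locate with regular expressions, which is also why a hit says so little: any GPCR carrying them will match. A minimal sketch, with the two test fragments copied from the rho1 line of the alignment above; the pattern and function names are hypothetical and offsets are 0-based within the fragment.

```python
import re

MOTIFS = {
    "ERY": re.compile(r"[ED]R[YW]"),   # D tolerated in first position, W in third
    "NPxxY": re.compile(r"NP..Y"),     # any two residues between P and Y
}

def find_motifs(seq):
    """Return {motif name: [(offset, matched text), ...]} for one sequence."""
    return {
        name: [(m.start(), m.group()) for m in pattern.finditer(seq)]
        for name, pattern in MOTIFS.items()
    }

tmh3 = "LVVLAIERYVVVCKPMSNF"    # rho1 fragment around the ERY motif
tmh7 = "AAIYNPVIYIMMNKQFR"      # rho1 fragment around the NPxxY motif
print(find_motifs(tmh3)["ERY"])     # [(6, 'ERY')]
print(find_motifs(tmh7)["NPxxY"])   # [(4, 'NPVIY')]
```

Testing the larger candidate patch would mean widening the pattern with reduced-alphabet character classes at the hydrophobic positions, which needs the reference collection to choose them.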

Disulfides, glycosylation, palmitylation, phosphorylation

The single disulfide observed in all opsins has a familiar explanation, inflexibly linking a portion of EXC2 to TMH3 throughout the photoreception cycle. It raises the question of what the other 6 semi-conserved cysteines are doing (only one is in the extracellular oxidizing environment). We know the CC pair Cys322 Cys323 are too close to form a disulfide; indeed they are the known double palmitylation site in RHO1. This would profoundly anchor that region to the outer leaf of the (specialized here) cytoplasmic membrane. Although invented already at the lamprey common ancestor, palmitylation is not even an option across all post-LWS opsins for lack of attachment sites. Thus it is a feature of rod opsins. Palmitylation occurs in over a hundred gene families, commonly localizing proteins to lipid rafts and affecting their trafficking.

The RHO1-RHO2 opsin class is also distinct in having two glycosylation sites (ie NxS or NxT, x not P), both extracellular as required and in fact utilized. Asn2 is adjacent to the initial methionine and Asn15 follows a small beta sheet prior to entry into TMH1. These may inhibit the seemingly random walk of amino termini seen in other opsin classes (and many other gene families). SWS2 does not share either site but has a non-homologous glycosylation motif in this same EXC1 region perfectly conserved back to lamprey in an otherwise fast evolving region. SWS1 and LWS are similar, with homology murky due to fading biflanking fixed anchors. SWS1 has a puzzling site invariant in 11/11 sequences that follows the normally well-conserved NVAVADL motif within TMH2. It scarcely seems possible this loops out to extracellular to be glycosylated. What's happened here is the N is conserved for other reasons, the V is neutral for glycosylation and the S or T completing the motif are valid reduced alphabet values (A T S V) -- it's a coincidental match. Finally, RHO1 through SWS2 have a site following the disulfide cysteine but seemingly just inside TMH5.
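The sequon rule (N, then anything but P, then S or T) translates directly into a regular expression. A minimal sketch with hypothetical names; the lookahead lets overlapping candidates be reported, and the test fragment is copied from the rho1 line of the alignment, just after the disulfide cysteine region. As the SWS1 example shows, a match is only a candidate: topology decides whether it can be used.

```python
import re

# Lookahead so that overlapping sequons (eg NNSS) are all reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")

def sequons(seq):
    """Return (0-based offset, tripeptide) for every N-x-S/T, x not P."""
    return [(m.start(), m.group(1)) for m in SEQUON.finditer(seq)]

fragment = "GIDYYTLKPEVNNESFVIYMF"   # from the rho1 alignment line
print(sequons(fragment))             # [(12, 'NES')]
```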

Opsin phospho sites.png

Bovine rhodopsin has 7 phosphorylation sites, all C-terminal cytoplasmic and activated by phosphokinase, of which 2-3 suffice for full arrestin recognition and subsequent quenching of signaling. Ser334 and Ser343 are said to be especially important, Thr340 and Thr342 less so.

However comparative genomics of the post-palmitylation carboxy terminus using the Opsin Classifier collection paints a very different picture. It's clear that Therian mammal phosphorylation sites (ie bovine RHO1) are anomalous for a 5 residue deletion of 3 otherwise well-conserved and likely phosphorylated threonine and serine sites. Thus arrestin studies on RHO1_bosTau won't be applicable even to platypus RHO1, much less to lamprey. Given that 20 years of arrestin experiments haven't fully resolved what's important even in the much-studied bovine RHO1, the overall situation presents an experimental nightmare.

The only truly invariant site in visual opsins is Thr340 (after re-gapping chicken pinopsin). SWS1 opsins may contain an additional 3 potential phosphorylation sites upstream. Conservation in this region extends through LWS but barely into pinopsin or beyond. That might make sense if cone and rod opsins had different physiological needs for arrestin function. That need (rapid regeneration of photoreceptor in bright light) barely extends to pinopsin phosphorylation.
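Candidate phosphorylation sites in a tail are simply its serines and threonines, so listing them with residue numbers is a one-liner. A minimal sketch with a hypothetical function name; the tail is copied from the rho1 alignment line after the CC palmitylation pair, and the numbering assumes that line ends at residue 348, as in the bovine landmark table.

```python
def ser_thr_positions(tail, first_residue_number):
    """Return (residue number, residue) for every S or T in the tail."""
    return [
        (first_residue_number + i, aa)
        for i, aa in enumerate(tail)
        if aa in "ST"
    ]

tail = "KNPLGDDEASATVSKTETSQVAPA"    # rho1 line after ...TTICCG
start = 348 - len(tail) + 1          # assumption: last residue is number 348
for number, aa in ser_thr_positions(tail, start):
    print(f"{aa}{number}")
```

Under that numbering the list is S334, T336, S338, T340, T342 and S343, which includes the four positions named above. Running the same function across the tails of a whole opsin class shows at a glance which positions are invariant and which are clade-specific.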


Interaction with arrestin

From comparative genomics alone, we expect to find a number of arrestin paralogs, each subfunctionalized to particular opsin classes with a switch in cognate arrestin from pinopsin to encephalopsin. It's important to examine this issue afresh because pre-genomic era experimentalists may have known only about a misleading subset. Arrestins are not necessarily a digression because they help understand conservation (notably of serines and threonines) in the opsin carboxy terminal region.

The complete human genome contains four similarly intronated beta arrestins on 4 different chromosomes, namely SAG, ARR3, ARRB1, and ARRB2, implying two or more rounds of segmental duplication. Despite their four members, beta arrestins did not arise from the supposed whole genome duplications in vertebrate evolution but in fact contradict that, with a single arrestin present in pre-chordates, lamprey, and shark, and the full complement first appearing in the last common ancestor with frog.

Nine additional genes ('alpha arrestins') contain the signature PFAM arrestin domain (an ancient internal tandem duplication of beta sheet) at barely detectable alignment levels at yet other chromosomal locations: ARRDC1-ARRDC5, TXNIP (VDUP1), VPS26A, VPS26B, and DSCR3. These have no known relevance to opsin signaling modulation. They conserve 8 coding exons, half the number of beta arrestins.

Recall humans lack 5 of 14 amniote opsins so their arrestin repertoire may also be unrepresentative. However their arrestin complement is that of any terrestrial vertebrate. Thus arrestins were never in 1:1 correspondence with imaging opsins during bilaterian evolution; indeed a limited arrestin set has long serviced a much larger set of opsins and non-opsin GPCR.

SAG (S-arrestin) is expressed in retina (notably retinal rod outer segments) and pineal. Mutations cause Oguchi type-1 night blindness. ARR3 (cone arrestin) expression occurs in inner and outer retinal segments, inner plexiform, and a subset of pinealocytes. It has not been studied in conjunction with non-visual opsin paralogs, leaving cognate arrestin use there (if any) obscure. Co-expression could easily be studied at the Allen Mouse Brain Atlas, noting that mouse too has been gutted for six important opsin classes (two cone opsins and only encephalopsin retained out of the pinopsin, parapinopsin, VAOP, parietopsin group).

It's unclear that neuropsin, peropsin, and RGR interact with an arrestin at all (perhaps no need if mere photoisomerases). Even peropsin, closest to conventional ciliary opsins, has only marginally conserved serine and threonine in its cytoplasmic tail. While all three have potentially phosphorylated distal residues, that is also expected by chance because serine and threonine are two of the most common amino acids.

The ubiquitously expressed beta-arrestins ARRB1 and ARRB2 are usually assigned to beta-adrenergic receptors, but here too we have to consider cross-over functionality -- notably to melanopsins -- because the 4 arrestin paralogs have to serve nearly the whole portfolio of hundreds of distinct phosphorylated GPCR. There is no conserved sequence motif in the tail, merely serines and threonines; recognition by arrestins must involve earlier occurring elements exposed in cytoplasm.

This has been considered previously, most interestingly in the case of mouse 480 nm melanopsin. Here both mouse ARRB1a and ARRB2 (and indeed drosophila arrestin ARR2, accession M32141) improved melanopsin cycling. Melanopsin has intrinsic arrestin-dependent photoisomerase activity unlike ciliary opsins which need externally replenished cis-retinal. Curiously melanopsin is also expressed in a rare ciliary cone type in the peripheral human and mouse retina where it constitutes ~0.3% of the entire cone population. This has potentially profound implications for the evolution of the deuterostome imaging eye. Melanopsin carboxy termini are greatly extended in some species; the consequences of experimental deletions have not been studied.

Arrestins have been studied at some comparative genomics depth. Overall, the gene family emerged recognizably in fungi, much earlier than opsins, as did GPCR. Only a single arrestin occurs in 'complete' genomes of trichoplax, sponge, and cnidarians. Moderate gene expansion observed in arthropods may have occurred independently from that in the vertebrate lineage because hemichordate, echinoderm, amphioxus and tunicate all have but a single arrestin. Beta arrestins are all very similarly intronated, establishing that the intron pattern was already fixed prior to the emergence of metazoans. These introns cannot be reliably compared to those of alpha arrestins because of poor alignability.

Possibly early opsins already interacted with the single early arrestin and that interaction was carried forward in the earliest bilaterian, to become independently specialized in arthropods and terrestrial vertebrates after separate gene duplications in these lineages. Optimization of imaging vision may have driven the retention and specialization of these duplications.

>ARRB1_homSap Homo sapiens (human)
0 MGDKGTR 2
1 VFKKASPNGK 0
0 LTVYLGKRDFVDHIDLVDPV 1
2 DGVVLVDPEYLKERR 1
2 VYVTLTCAFRYGREDLDVLGLTFRKDLFVANVQSFPPAPEDKKPLTRLQERLIKKLGEHAYPFTFE 0
0 IPPNLPCSVTLQPGPEDTGK 0
0 ACGVDYEVKAFCAENLEEKIHKR 2
1 NSVRLVIRKVQYAPERPGPQPTAETTRQFLMSDKPLHLEASLDKE 0
0 IYYHGEPISVNVHVTNNTNKTVKKIKIS 1
2 VRQYADICLFNTAQYKCPVAMEEAD 2
1 DTVAPSSTFCKVYTLTPFLANNREKRGLALDGKLKHEDTNLASSTL 2
1 LREGANREILGIIVSYKVKVKLVVSRGG 2
1 LLGDLASS 21 DVAVELPFTLMHPKPKEEPPHRE 1
2 VPENETPVDTNLIELDTN 2
1 DDDIVFEDFARQRLKGMKDDKEEEEDGTGSPQLNNR* 0

>ARRB1_acrMil Acropora millepora (coral) EZ042154 454 transcriptome assembly PUBMED 19435504 planula larvae
MDNADSKKPGTR
VFKKTSPNGK
ITTYLGKRDFVDHIDHIDPV
DGVVLVDPEYVQEGKK
VFAQVLAAFR
YGREDLDVLGLTFRKDLFL
ACMQVYPPKPEDEVPLTRLQERLRKKLGENAYPFKFE
LPKGSPSSVTLQPAPGDTGK
PCGVDYELKTYVMEEKKDKEDKLEEKPHKR
DTVRLAIRKITYAPELPLAQPRAETDK
EFMLSVHKLHIEASLDKGMYYHGEE
IGVNVHIANSSSKTCKKIKIT
VRQFADICLFSTAQYKCPVASLESE
DGFPVGQSGTLSKVYRLTPLLANNR
DKRGLALDGKLKHEDTNLASSTI
RDENTPKENLGIIVQYKVKVRVMVAY
GSDVVLELPFKLSHPKPPEETPPPTPSTQP
ASGGQVAAGAQLADAPAVDHNLIDFDTD
GPDKHEDDDLIFEDFARLRLKESEHLGSAEA*

>ARRB1_anemVec Nematostella vectensis (anemone) ABAV01007513 XM_001635548 (truncated, wrong iMet)
0 MENNENAAEEATKRTGTR 2
1 VFKKTSPSAK 0
0 ITTYLGKRDFIDHVKHIDPI 1
2 DGVVLVDPEYVKDGKK 1
2 VFAHVLAAFR 2
1 YGREDLDVLGLTFRKDLFL 0
0 ATVQVYPPKTDDQKALTRLQERLLKKLGSNAYPFKFE 0
0 LPPGSPSSVTLQPAPGDTGK 0
0 PCGVDYELKTYVAESLEEKPHKR 2
1 DTVRLAIRKLTYATEQPQPQ 0
0 PFSEGEKDFMMSQHPLHVEASLDKG 0
0 LYYHGETIAVNVIISNRSSRNCKKIKIT 1
2 VRQFADICLFSTAQYKCPVASLESE 2
1 DGFPVHPSGTLTKVYCLTPLLGENR 0
0 DKRGLALDGKLKHEDTNLASSTM 2
1 PDIPDIPKENLGIIVQYKVKVRIIVAYGG 2
1 DLTLELPFMLSHPKPCEDPTPPPTPAKQP 1
2 AIALPILYFAGNNEQAAVDHNLIDFDTE 2
1 GPEQDNNDDLIFEDFARLRLKGSDHTGSADA* 0

>ARRB_ampQue Amphimedon queenslandica (sponge) ACUQ01000747 GW180307 GW176911 GW176912 transcripts one day larva
0 MAESTEPKETDKLVDHEEPPAIKTVKRDGTR 2
1 VFKKTSPNTK 0
0 TTVYVGKRDFVDHVTEVDPL 1
2 DGVILIDPEYFKKEAKKDRK 1
2 VFAQILVGFRYGRDDLDVLGLNYRRDLLDV 12 IQVYPPPDPSKPQILTLLQVRLLKKLGRNAYPFTF 0
0 LKPGLPSSVSLQPSPNASSQEGEK 0
0 PCGVDFILRCYVAKNKEDKIEKR 2
1 NSVRLSVKKITHASDEHTQR 00 PSIELTKQYLLSSHPLTVEANLDKG 0
0 TYYHvEPIRVNVSITNRSSKTIRKIRVS 1
2 VRQFAAICLFANSEYKCTVAELESS 2
1 EGLPIGTGGSLQKSYEITPLLKDNR 00 NKKGLALDGQIKHEDTCLASSTM 2
1 LPSGVEDSReL 12 RESFGIVVHYSVKVRCIDNLGS 2
1 DLTLELPFTLTHPKPKERVISQVITLPPRSSLSSITDPKDDKSKAPPEAKPE 1
2 GIPADDTVSVHDVIDHNLITFDT 2
1 DDATNDQDDFVFEEFVRLRVTGMDDNNETEA* 0

>ARRB_triAdh Trichoplax adhaerens (trichoplax) ABGP01000766 XM_002116152 
0 MADAANKPTTENNNEDASKKAGTR 2
1 VFKKSSPNGK 0
0 LTTYLGKRDFVDHIDHIDPV 1
2 DGVVLVDPEYIADKK 1
2 VYVHVLAAFR 21 YGREDLDVLGLTFRKDLFL 00 STLQIYPPLPENERPLTKLQERLIKKLGENAYPFYFE 0
0 LSVGSPSSVTLQPAPGDTGK 0
0 PCGVDYELKTYVGDSPDDKAHKR 2
1 DTIRLAIRKITYAPDENIPQ 00 PTAEITKEFMMSSYPLHLECTLDKG 0
0 MYYHGEPIKVNVSIANRSSKTVKKIRLS 1
2 VRQFADICLFSTAQYKCPVAVVESE 2
1 DGFPLNPGGTLNKIFTLVPLLEDNR 00 DKRGLALDGKLKHEDTNLASSTM 2
1 YDPGVSKENLGIVVQYKVKVRLLVALGR 2
1 DVALELPFTLTHPKPIEPEEPIVTQPEVNPPQ 1
2 AVADTKPETKNTEAPIDNNLIMFDTR 2
1 GTGALLADQDDDFIIEEFVRMRLKDHSKDSSEA* 0
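The exon-by-exon listings above can be turned back into plain protein sequences by dropping the numeric tokens. A minimal sketch with hypothetical function names; it assumes every all-digit token is an annotation (read here as flanking intron phases, which is an interpretation, with forms like 21 or 00 taken as two phases run together where exons share a line) and that '*' marks the stop codon.

```python
def exon_peptides(lines):
    """Return the peptide of each exon, skipping all-digit annotation tokens."""
    peptides = []
    for line in lines:
        for token in line.split():
            if token.isdigit():
                continue                      # phase-like annotation, not sequence
            peptides.append(token.rstrip("*"))  # drop the stop marker
    return peptides

def join_exons(lines):
    """Concatenate exon peptides into one protein string."""
    return "".join(exon_peptides(lines))

record = [
    "0 MGDKGTR 2",
    "1 VFKKASPNGK 0",
    "1 LLGDLASS 21 DVAVELPFTLMHPKPKEEPPHRE 1",   # two exons on one line
]
print(len(exon_peptides(record)))   # 4
print(join_exons(record))
```

The three lines are taken from the ARRB1_homSap record and are not contiguous in the gene, so the joined string here only demonstrates the parsing. Counting peptides per record gives a quick check on how similarly the genes are intronated across species.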


Landmarks along the opsin protein

Broad conservation ends well before the stop codon at FRNCMLTTICCG (position locatable by web browser search in the sequence collection). That's not to say there's not good information earlier about evolution strictly within cone opsins (such as the 1 residue deletion after PFEYPQY uniting RHO1 through LWS, and the 2 residue insert uniting RHO1 through PIN) but we're looking at a very much deeper time scale for now for all of Metazoa. Note from the 'opsn' line that opsins very broadly considered (cnidarian, protostome, deuterostome; Go, Gt, Gq) share considerable conservation at 70 positions of 288. That's to say 25% identity is the approximate floor (lower bound) for a blast search. These residues may be so fundamental to the GPCR and rhodopsin superfamily that blast hits to slow-evolving non-opsins won't be that different. Therefore sequence alignment alone cannot be used to show remote sequences in sponges and cnidarians are truly opsins.

Opsin bovRHO1.png

Landmarks in Bovine Rhodopsin RHO1 sequence explain residue conservation:

194 residues in seven transmembrane helices
 35 to  64 for TMH1 Asn55 hydrogen bonded to Asp83 TMH2 and Ala299 TMH7 | Phe45 Met49 Phe52 dimer interface
 71 to 100 for TMH2 Gly90 night blindness | Tyr96 His100 dimer interface
107 to 139 for TMH3 Cys110 half-disulfide | Glu113 salt bridge counterion | ERY motif hydrogen bonds Arg135 and Glu247 Thr251 in TMH6
151 to 173 for TMH4
200 to 225 for TMH5
247 to 277 for TMH6
286 to 306 for TMH7 Lys296 11-cis-retinal NPviY motif 302-306

74 residues extracellular in 3 loops and tail
  1 to  34 for nTER Asn2 oligosaccharide | Gly3-Pro12 beta sheet parallel membrane | Asn15 oligosaccharide | Pro23 Gln28 retinitis pigmentosa, maintain orientation between EXC1 and nTER
101 to 106 for EXC1
174 to 199 for EXC2 Cys187 half disulfide | Glu181 alternate counterion
278 to 285 for EXC3

80 residues cytoplasmic in 3 loops and tail
 65 to  70 for CYT1
140 to 150 for CYT2
226 to 246 for CYT3
307 to 348 for cTER Cys322 Cys323 covalent palmitate tails | last 15 amino acids unstructured
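A landmark table like this is easy to sanity-check: the ranges should tile residues 1 to 348 without gaps or overlaps, and the per-compartment totals should match the headings. A minimal sketch with hypothetical names, using the ranges exactly as listed:

```python
REGIONS = [
    ("nTER", 1, 34, "extracellular"), ("TMH1", 35, 64, "membrane"),
    ("CYT1", 65, 70, "cytoplasmic"), ("TMH2", 71, 100, "membrane"),
    ("EXC1", 101, 106, "extracellular"), ("TMH3", 107, 139, "membrane"),
    ("CYT2", 140, 150, "cytoplasmic"), ("TMH4", 151, 173, "membrane"),
    ("EXC2", 174, 199, "extracellular"), ("TMH5", 200, 225, "membrane"),
    ("CYT3", 226, 246, "cytoplasmic"), ("TMH6", 247, 277, "membrane"),
    ("EXC3", 278, 285, "extracellular"), ("TMH7", 286, 306, "membrane"),
    ("cTER", 307, 348, "cytoplasmic"),
]

def totals(regions):
    """Sum residue counts per compartment (ranges are inclusive)."""
    out = {}
    for _, start, end, side in regions:
        out[side] = out.get(side, 0) + (end - start + 1)
    return out

def is_contiguous(regions):
    """True if sorted ranges start at 1 and each begins right after the last."""
    ordered = sorted(regions, key=lambda r: r[1])
    expected = 1
    for _, start, end, _ in ordered:
        if start != expected:
            return False
        expected = end + 1
    return True

print(totals(REGIONS))         # membrane 194, extracellular 74, cytoplasmic 80
print(is_contiguous(REGIONS))  # True
```

The three totals sum to 348, the full length of the protein.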

Provisional ancestral proxy sequences

provisional trimmed ancestral proxy sequences for vertebrate ciliary opsins
>ANC_RHO1_14
MFfLIlvgFPvNFLTLfVTvqHKKLRtPLNYILLNLAvAnLFMVlfGFtvTmYTsmnGYFvfGptgCniEGFFATLGGEiaLWsLVVLAiERYvViCKPMsNFRFGntHAImGVaFTWiMALaCAaPPLvGWSRYIPEGmQCSCGvDYYTlkPeiNNESFVIYMFvVHFtIPfivIF
FCYGrLlcTVKeAAAqQQESasTQkAEkEVTRMVvlMVIaFLvCWVPYASVAfYIFthQGsdFGptFMTvPAFFAKSsalYNPvIYIlmNKQFRNCMITTlCCG

>ANC_RHO1_06
mFfLIitGlPiNiLTLlVTFkHKKLRQPLNYILVNLAvAdLfmvcfGFTVTFytawngYFvfGPiGCAiEGFfATlGGqVALWSLVVLAIERYIVvCKPMGNFRFsatHaimGIaFTWfmAlsCAaPPLfGWSRyiPEGlQCSCGPDYYTlNPDfHNESyViYmFvVHFliPvviIF
fsYGRLiCKVrEAAAQQQESAsTQKAEkEVTRMVILMVlGFllAWtPYAsvAfWIFtNkGAeFsaTlMtvPAFFSKSSslyNPIIYVL$NKQFRNCMiTTiCCG

>ANC_SWS2_09
MfflvilGfpiNvLTifCTikyKKLRSHLNYILVNLAvaNLlVvcvGStTAFySFsqmYFalGplaCKiEGFaATLGGMvSLWSLAVvAFERfLVICKPlGNFtFrgtHAvlgCvaTWvfglaaSaPPLfGWSRYIPEGLQCSCGPDWYTTnNKwNNESYVlFLFgFC
FgvPlaiIlFsYgrLLltLravAkqQeqAsTQKAEREVTrMVVvMVlGFLVCWlPYaSFALWvVtnRGepFDLrlAsIPsVFSKaStVYNPvIYvfmNKQFRSCMmKmffcG

>ANC_SWS1_11
MGfVFfaGTPLNaiVLvvTikYKKLRQPLNYILVNIsaaGFvfcvFSvftVFvaSsqGYfffGktvCalEafvGslaGLVTGWSLAfLAFERYiVICKPFGnFrFsSkHAlaaVvaTWiiGvgvsiPPFFGWSRYIPEGLqCSCGPDWYTvgtkYkSEyY
TwFLfifCFivPlsiIiFSYsQLLgALRAVAAQQqESAtTQKAEREVSRMvivMVgSFclCYvPYAalAmYmvnnrdhglDlRLVTIPAFFSKSscVYNPiIYcFMNKQFraCIMEtVcG

>ANC_LWS_13
MifVVaaSvFTNGLVLVATaKFKKLRHPLNWILVNliAiADLGETvfASTiSVcNQvfGYFILGHP$CVfEGytVSyCGItaLWSLtIIsWERWvVVCKPFGNiKFDgKwAtaGI!FSWVWsavWcaPPiFGWSRyWPHGLKTSCGPDVFSGssd
pGvqSyMivLMiTCCfiPLaiIilCYlqVwlaIraVAkQQKESEsTQKAEkEVSRMVVVMilAycfCWGPYtfFACFaAaNPGYAFHPLaAalPAYFAKSATIYNPIIYVFMNRQFRNCImQLFG

>ANC_PIN_06
MGmVVisAffVNGLVIVVSlkyKKLRSPLNYILVNLAiADLLVTfFGStiSFvNNivGFFvfGktmCEfEGFMVSLTGIVGLWSLAILAFERYlVICKPvGDFrFQqRHAVlGCaFTWgWsliWTsPPLfGWsSYVPEGLrTSCGPNWY
tGGsnNNSYImaLFvTCFamPLstIlFSYaNLLltLRAVAAQQKEsETTQRAErEVTRMVIaMVlAFLbCWLPYAsFAmVVAthKdlvIqPqLASLPSYFSKTATVYNPIIYVFMNKQFRsCLltlmcCG

>ANC_VAOP_07
MfvvTaLSLaENFaVilVTfkFkQLRQPLNYiiVNLsvADfLVSliGGsiSFlTNykGYFfLGkwACVLEGFAVTfFGiVALWSLAlLAFERyfVICRPLGNmRLrgKHAaLGlafVWtFSfiwTvPPvlGWSSYtv
SkIGTTCEPNWYSGnfhDHTfIitFFsTCFIfPLgVIfvsYGKLirKLrKvSnTqgrLgntRkpErQVTRMVVVMIlAFmvcWtPYAaFSIlvTAhPtIhLDPrLAAiPAFFSKTAtVYNPiIYvFMNKQFRkClvQlfsc

>ANC_PPIN_07
mavfsvsgvLNstVIiVTlryrQLRqPlNysLVNLAvADLGcavfGGlltveTNAvGYFnLGRVGCVlEGFAVAFFGIAaLCtiAVIAvDRyvVVCkPlGtvmFttrhAlaG!awSWlWSfvWNTPPLFGWGselLEGVrTSCAPnWYsrD
PaNvSYIvcYFafCFAiPFsvIvvSYgrLlwTLhQVaKLgvlesGSTakaEaQVsRMVvVMvmAFLlcWLPYAaFAltVildPnlyInPvIATvPMYLtKsSTVyNPIIYIFMNrQFRDcavPfLLCG

(to be continued)
