Opsin evolution: annotation tricks

From genomewiki
Revision as of 19:52, 28 November 2007 by Tomemerald (talk | contribs)

Below is a step-by-step example showing how the Opsin Classifier is used within an overall annotation strategy for recovering intronated opsin homologs in an unstudied species, here the hemichordate Saccoglossus kowalevskii. It is best to follow along by actively replicating the steps with the live web tools.

Here we expect to find, at a minimum, an opsin that underlies epidermal photoreception. The best initial queries (for detecting diverged sequences) are likely opsins from fellow Ambulacraria (i.e. sea urchin). The best initial target is the Trace Archives Saccoglossus 'other' division, because it holds transcripts, which give longer blastn matches than dispersed exons. However, this draws a blank, possibly because cell types with opsin transcripts are exceedingly rare. Hence we move to the "wgs" division.

A flawed but long sea urchin DNA query, XM_778236, a GNOMON pipeline product labelled "similar to Go-coupled rhodopsin", elicits a significant hit provided the "Somewhat similar sequences (blastn)" trace query option has been selected. This option is critical for queries across diverged species or diverged opsin classes.

The key feature of this match is length. The rule of thumb is that anything exceeding 40 bp will prove informative; here we have 71 bp. The percent identity is quite respectable at 76% given the time elapsed since sea urchin and acornworm divergence -- and this may even be a match of paralogs. Note however the two gapped regions. The second, a pair of offsetting single-base gaps four columns apart, can be written off as a slight gap misplacement by blastn. The first cannot. It represents an apparent inactivating frameshift (change in reading frame). However, traces contain many artefactual indels and most likely this is sequencing error. We cannot rule out a recent processed pseudogene at this point -- indeed, processed pseudogenes give better initial hits than multi-exonic genes.

The trace blast graphic shows five stacked hits. I selected one of these because its length was good at 917 bp and sufficient material was left over on both flanks, that is 754 bp upstream and 94 bp downstream.

>gnl|ti|1695985935 Length=917 Score = 41.0 bits Expect = 9.0 Identities = 54/71 (76%), Gaps = 3/71 (4%) Strand=Plus/Minus

Query  474  CCTT-TTCTGGACTATCACACCGTTCTTTGGATGGAGCAGCTACAC-CTACGAACCATTTGGCACGTCGTG  542
            |||| |||||| | || |  ||| | |||||||||||||||||  | ||| ||||||   || || |||||
Sbjct  163  CCTTATTCTGGTCGATGATGCCGCTGTTTGGATGGAGCAGCTATGCGCTA-GAACCAGAAGGTACATCGTG  94
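The summary figures in the blastn header can be re-derived from the two aligned strings, which is a useful sanity check when an alignment has been copied by hand. The sketch below is only an illustration: the function name `alignment_stats` is made up for this page, and it counts every column containing a `-` as a gap, which is assumed to match how the report counts them.

```python
# Aligned strings copied from the blastn report above ('-' marks a gap).
query = "CCTT-TTCTGGACTATCACACCGTTCTTTGGATGGAGCAGCTACAC-CTACGAACCATTTGGCACGTCGTG"
sbjct = "CCTTATTCTGGTCGATGATGCCGCTGTTTGGATGGAGCAGCTATGCGCTA-GAACCAGAAGGTACATCGTG"


def alignment_stats(a, b):
    """Return (columns, identities, gap_columns) for two aligned strings."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    # A column is an identity when both bases agree and neither is a gap.
    identities = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    # A column is a gap column when either string has '-' there.
    gaps = sum(1 for x, y in zip(a, b) if x == "-" or y == "-")
    return len(a), identities, gaps


columns, identities, gaps = alignment_stats(query, sbjct)
print(f"Identities = {identities}/{columns}, Gaps = {gaps}/{columns}")
```

For the alignment above this reproduces the 54/71 identities and 3/71 gaps of the header line.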

Next, the retrieved full-length trace is back-blastxed against the Opsin Classifier collection. This gives three immediate benefits: it translates the trace properly irrespective of its frameshifts (which surface in the numbering as exon breaks too short to hold an intron), it finds the best available gene model (the initial query choice may have been sub-optimal), and it extends match length beyond what Trace Archive blastn could do.

Here we see that the top 15 matches for trace ti|1695985935 are consistently opsins previously classified as peropsins and neuropsins. There appear to be two disjoint segments to the blastx match to PERa_braFlo Branchiostoma floridae but the numbering shows they really reflect a single extended match encompassing a frameshift. Here the trace was on the minus strand which means larger numbers are earlier in the coding sequence. The two fragments can then be joined into a single polypeptide, likely with a one amino acid glitch at the frameshift join. Blast often extends its matches too far into bogus territory and this must be trimmed.

PERa_braFlo     Branchiostoma floridae (amphioxus) ?? ... -1   182  3.6e-28   2
PERa_braBel     Branchiostoma belcheri (amphioxus) ?? ... -1   174  5.7e-27   2
PER_xenTro      Xenopus tropicalis (frog) ?? 0.2.0.2.1... -1   195  4.8e-26   2
NEUR_calMil     Callorhinchus milii (elephantfish) ?? ... -2   151  1.9e-24   2
PER_gasAcu      Gasterosteus aculeatus (stickleback) ?... -1   174  4.2e-23   2
...

Score = 167 (58.8 bits), Expect = 3.6e-28, Sum P(2) = 3.6e-28 Identities = 28/60 (46%), Positives = 41/60 (68%), Frame = -2

Query:   769 GLTIFGMSLSCVSSFAGRWLFGKFGCYFHGFAGMLFGLGSIGNLTVISIDRYIITCKRNL 590
             G+ IFG   S  SS    WLFG  GC ++GF GM FG+ +IG LT +++DRY++ C+++L
Sbjct:    83 GICIFGYPFSGASSLRSHWLFGGVGCQWYGFNGMFFGMANIGLLTCVAVDRYLVICRQDL 142


Score = 182 (64.1 bits), Expect = 3.6e-28, Sum P(2) = 3.6e-28 Identities = 35/82 (42%), Positives = 50/82 (60%), Frame = -1

Query:   266 YVYSCNQNFNYKLHLFTEWSYRHYYALLAVAWSNALFWSMMPLFGWSSYALEPEGTSCTIDWMNNDNQYISYVSCVTVTCFI 21
             Y+  C Q+   K++      Y  Y  + A+ W  A FW+ +PL GW+ Y+LEP GT+CTI+W  ND+ YISYV+    +CFI
Sbjct:   134 YLVICRQDLVDKVN------YNTYGVMAALGWLFAAFWAALPLVGWAEYSLEPSGTACTINWQKNDSLYISYVT----SCFI 205
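The coordinate bookkeeping for a pair of blastx segments is easy to get wrong on the minus strand, so here is a minimal sketch of it. The `Hsp` class and `junction` function are hypothetical helpers written for this page, not part of any BLAST tool. The numbers are the query and subject coordinates of the two segments printed above.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    q_from: int  # trace coordinate at the N-terminal end of the match
    q_to: int    # trace coordinate at the C-terminal end of the match
    s_from: int  # first aligned residue of the subject protein
    s_to: int    # last aligned residue of the subject protein
    frame: int   # blastx frame; negative means minus strand


def junction(upstream, downstream):
    """Return (trace bases lying between the two matches,
    subject residues covered by both matches)."""
    if upstream.frame < 0:
        # Minus strand: trace coordinates fall as the protein proceeds.
        between = (min(upstream.q_from, upstream.q_to)
                   - max(downstream.q_from, downstream.q_to) - 1)
    else:
        between = (min(downstream.q_from, downstream.q_to)
                   - max(upstream.q_from, upstream.q_to) - 1)
    shared = upstream.s_to - downstream.s_from + 1
    return between, shared


# The two segments of the PERa_braFlo match, in the order blastx listed them.
hsps = [Hsp(266, 21, 134, 205, -1), Hsp(769, 590, 83, 142, -2)]
first, second = sorted(hsps, key=lambda h: h.s_from)  # order along the protein
between, shared = junction(first, second)
print(f"{between} trace bases between matches, {shared} subject residues shared")
```

Sorting by subject position puts the 769-590 segment first, confirming that larger trace numbers come earlier in the coding sequence. With these coordinates the sketch reports 323 trace bases between the two matches and 9 subject residues covered twice; how to interpret that spacing is left to the annotator.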

The joined 133-residue peptide is then blastped against the Opsin Classifier. Here the Expect = 5.4e-33 is highly encouraging, but the two small insertions raise questions for an opsin. Perhaps extraneous residues have been incorporated into our gene model, currently a fragment corresponding to residues 83-205 of the 361-residue peropsin in amphioxus.
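As a check on the joining step, the sketch below concatenates the two translated query fragments from the blastx output. The first segment ends at subject residue 142 and the second begins at 134, so nine subject residues are covered twice. Dropping the matching nine residues from the start of the downstream fragment is an assumption about how to resolve that overlap (trimming the upstream fragment instead would be equally defensible), and `join_fragments` is a name invented for this illustration.

```python
# Query lines of the two blastx segments, upstream (Frame -2) then downstream (Frame -1).
frag_up = "GLTIFGMSLSCVSSFAGRWLFGKFGCYFHGFAGMLFGLGSIGNLTVISIDRYIITCKRNL"
frag_down = ("YVYSCNQNFNYKLHLFTEWSYRHYYALLAVAWSNALFWSMMPLFGWSSYALEPE"
             "GTSCTIDWMNNDNQYISYVSCVTVTCFI")


def join_fragments(upstream, downstream, shared):
    """Concatenate two fragments, dropping `shared` residues
    from the start of the downstream one."""
    if not 0 <= shared <= len(downstream):
        raise ValueError("overlap must fit inside the downstream fragment")
    return upstream + downstream[shared:]


# Subject residues 134..142 are covered by both segments.
shared = 142 - 134 + 1
peptide = join_fragments(frag_up, frag_down, shared)
print(len(peptide))
print(peptide)
```

Trimmed this way, the result is 133 residues long and reads ...TCKRNL followed directly by NYKLHL... at the join.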

Switching to a six-frame translation view in a second browser tab, we see that GLTI is preceded by a stop codon, without good opportunities for a GT-AG splice site that does not sacrifice regions of apparent alignment. This could reflect additional frameshifts, but without any guidance from the blastp output we have no idea which frame might contain further good upstream sequence. Next we look for exon boundary guidance from the best gene model, amphioxus PERa_braFlo. Here we see GLTI is towards the end of an exon. If we look to GRWL as the valid start of the alignment, it lacks the supporting phase 0 intron of PERa_braFlo.
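For readers who would rather script the six-frame view than use a web tool, a minimal sketch follows. It assumes the standard genetic code and writes `X` for any codon containing an ambiguous base. The frame labels (+1..+3, -1..-3) are offsets from the ends of whatever string is supplied, so they only coincide with blastx frame numbers when the whole trace is passed in. All function names are made up for this page.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; a trailing partial codon is ignored."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return {frame label: translation} for the three forward and three reverse frames."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


# First 70 bases of trace ti|1695985935 as listed at the foot of this page.
trace_start = ("CGCATGGTCGTAGTTCATCAGATGAAACACGTGACAGTAACACAACTTAC"
               "GTAAGAAATGTACTGATTAT")
for label, protein in sorted(six_frames(trace_start).items()):
    print(f"{label:+d} {protein}")
```

One of the reverse frames of this stretch reads NQYISYVSCVTVTCFI, the C-terminal end of the second blastx segment.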

Downstream there is a GT splice option for a phase 2 intron at YISYV (which would eliminate the insert) or an option for adding IRSKTDTTFVDT followed by a phase 0 splice start (and a stop codon). However that extension has no support in any known opsin. Further, there is no support from PERa_braFlo exon boundaries.

The information in this trace has not yet been exhausted because we can re-blastn against the trace archives, perhaps finding a trace that stayed on task. Indeed this picks up two high-identity traces with clear multiple exons, ti|1723199539 and ti|1705099698, supporting the notion that the original trace reflects a processed pseudogene and that several complete coding exons can be recovered. To be continued when NCBI completes repairs to its down server!

The fragmentary gene is best studied further after release of the acornworm contig assembly.

>PER_sacKol Saccoglossus kowalevskii
GLTIFGMSLSCVSSFAGRWLFGKFGCYFHGFAGMLFGLGSIGNLTVISIDRYIITCKRNLNYKLHLFTEWSYRHYYALLAVAWSNALFWSMMPLFGWSSYALEPEGTSCTIDWMNNDNQYISYVSCVTVTCFI

>PERa_braFlo Branchiostoma floridae (amphioxus) Length = 361 Spaces in alignment show 4 exon boundaries in amphioxus.

 Score = 329 (115.8 bits), Expect = 5.4e-33, P = 5.4e-33  Identities = 60/133 (45%), Positives = 86/133 (64%)

Query:     1 GLTIFGMSLSCVSSFA GRWLFGKFGCYFHGFAGMLFGLGSIGNLTVISIDRYIITCKRNL 60
             G+ IFG   S  SS     WLFG  GC ++GF GM FG+ +IG LT +++DRY++ C+++L
Sbjct:    83 GICIFGYPFSGASSLR SHWLFGGVGCQWYGFNGMFFGMANIGLLTCVAVDRYLVICRQDL 142

Query:    61 NYKLHLFTEWSYRHYYALLAVAWSNALFWSMMPLFGWSSYALEPE GTSCTIDWMNNDNQYISYVSCVTVTCFI irsktdttfvdtvta* 
               K++      Y  Y  + A+ W  A FW+ +PL GW+ Y+LEP  GT+CTI+W  ND+ YISYV+    +CFI
Sbjct:   143 VDKVN------YNTYGVMAALGWLFAAFWAALPLVGWAEYSLEPS GTACTINWQKNDSLYISYVT----SCFI LGFALPLAVMMFCYWQ 221

>PER_xenTro Xenopus tropicalis (frog) ?? 0.2.0.2.1.0.1 indel -CFI +NOLA1 +EGF -ELOVL6 347 aa 000 nm no_ref genome peropsin RRH                                               
0 METLAEVSTLLPAGTGTVNISDASSEVHSVFSQSEHNIVAAYLITA 1
2 GVISILSNIIVLGIFVKYKELRTATNAIIINLAFTDIGVSGIGYPMSAASDLHGSWKFGYVGCQ 0
0 IYAGLNIFFGMASIGLLTVVAIDRYLTICRPDIG 1
2 GRRISGRHYTAMILAAWINAVFWSVMPVVGWSSYAPDPTGATCTINWRKNDV 2
1 SFVSYTMSVVAVNFVVPLMVMFYCYYNVSRTMKGYGSRSSLGGINADWSDQTDVTK 0
0 MSMVMIVMFLVAWSPYSIVCLWSSFGDPRKIPPAMAIIAPLFAKSSTFYNPCIYVIANKK 2
1 FRRAILSMVQCKSRQEVTLDNHFPMNVSQSTLTT* 0

gggtcgcgtattgcaaatatttcattattttacgttttctatgtttactcatgcaatcag
 G  S  R  I  A  N  I  S  L  F  Y  V  F  Y  V  Y  S  C  N  Q 
aattttaattataaacttcatctattcacagaatggtcatatcgccattactacgctcta
 N  F  N  Y  K  L  H  L  F  T  E  W  S  Y  R  H  Y  Y  A  L 
ctcgcagtagcctggtcaaatgccttattctggtcgatgatgccgctgtttggatggagc
 L  A  V  A  W  S  N  A  L  F  W  S  M  M  P  L  F  G  W  S 
agctatgcgctagaaccagaaggtacatcgtgtaccatagattggatgaacaacgataat
 S  Y  A  L  E  P  E  G  T  S  C  T  I  D  W  M  N  N  D  N 
cagtacatttcttacgtaagttgtgttactgtcacgtgtttcat
 Q  Y  I  S  Y  V  S  C  V  T  V  T  C  F    

>PERa_braFlo Branchiostoma floridae (amphioxus) AB050610 peropsin Amphiop3 frag                                               
0 MDIPTETPYGAGDDPAGTGWRWAETDQNGFHKYDHLIVGLYLFVI 1
2 GIIGTVENGITLATFTKFRSLRSPTTMLLVHLAIADLGICIFGYPFSGASSLR 0
0 SHWLFGGVGCQWYGFNGMFFGMANIGLLTCVAVDRYLVICRQDL 1 
2 VDKVNYNTYGVMAALGWLFAAFWAALPLVGWAEYSLEPS 1
2 GTACTINWQKNDSLYISYVTSCFILGFALPLAVMMFCYWQ 0
0 ASCFVNKVLKGDISGDLTFPVAVNVDWEYQNHFSK 0
0 MCLAMVAAFVVAWTPYSVLFLFAAFGNPADIPAWITLLPPLIAKSSALYNPIIYIIANRRFRSAIFSMVKGQNPDVE 0
0 TLFARDFRISPIEDTGKEMSSMGNANA* 0
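To relate blastx subject coordinates to the exon layout of a gene model such as the one above, the exon lines can be turned into residue ranges. This is a sketch under two assumptions: the digits flanking each exon are carried along but not interpreted, and every residue is counted in the exon line on which it is printed (so a codon split by an intron belongs to whichever exon the model assigns it). The function names are hypothetical.

```python
# First five exons of PERa_braFlo as listed above:
# (leading digit, residues, trailing digit).
EXONS = [
    (0, "MDIPTETPYGAGDDPAGTGWRWAETDQNGFHKYDHLIVGLYLFVI", 1),
    (2, "GIIGTVENGITLATFTKFRSLRSPTTMLLVHLAIADLGICIFGYPFSGASSLR", 0),
    (0, "SHWLFGGVGCQWYGFNGMFFGMANIGLLTCVAVDRYLVICRQDL", 1),
    (2, "VDKVNYNTYGVMAALGWLFAAFWAALPLVGWAEYSLEPS", 1),
    (2, "GTACTINWQKNDSLYISYVTSCFILGFALPLAVMMFCYWQ", 0),
]


def exon_ranges(exons):
    """Return 1-based inclusive (start, end) residue ranges, one per exon."""
    ranges, start = [], 1
    for _lead, residues, _trail in exons:
        end = start + len(residues) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def exon_of(residue, ranges):
    """Return the 1-based number of the exon containing `residue`."""
    for number, (start, end) in enumerate(ranges, start=1):
        if start <= residue <= end:
            return number
    raise ValueError(f"residue {residue} lies outside the listed exons")


ranges = exon_ranges(EXONS)
for residue in (83, 142, 205):
    number = exon_of(residue, ranges)
    print(f"residue {residue}: exon {number}, range {ranges[number - 1]}")
```

With these strings, subject residue 83 (where the blastx match begins) falls in the second exon, 16 residues before that exon ends at residue 98, and residue 142 is the last residue of the third exon.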

>gnl|ti|1695985935 name:229621612 mate:1695980389
CGCATGGTCGTAGTTCATCAGATGAAACACGTGACAGTAACACAACTTACGTAAGAAATGTACTGATTAT
CGTTGTTCATCCAATCTATGGTACACGATGTACCTTCTGGTTCTAGCGCATAGCTGCTCCATCCAAACAG
CGGCATCATCGACCAGAATAAGGCATTTGACCAGGCTACTGCGAGTAGAGCGTAGTAATGGCGATATGAC
CATTCTGTGAATAGATGAAGTTTATAATTAAAATTCTGATTACATGAGTAAACATAGAAAACGTCAAATA
ATGAAATATTTTCAATACGCGACCACAATTTGTTTCAAAATCTGGTCAAGAATGGCCAATTGACAGCCAT
CTGACTTTTTGGCCTTACTACTATATTTCCACAATTTATGTACCATTTTCATTCGTATTTGGCACATACA
TCATACATATCAGTGTGTTTCATAATCAGATCAAAGACAGACACCACGTTTGACGTAACAGTAACCGCTT
AAAGGGCGGTGATCGTCGCGCTGCGCATGTGCGACATTTGTGTTGATTAACATTGAAGAGCGCATTGGAT
GGTGGGAAAATACTTAAGTAATACTTACGTAGATTCCTCTTACATGTAATGATATATCTATCAATACTGA
TAACAGTCAGATTGCCAATACTTCCAAGTCCAAACAGCATTCCGGCGAATCCATGGAAATAACAACCAAA
CTTTCCAAATAGCCATCGCCCTGCAAAGCTAGAGACGCACGAAAGCGACATTCCAAAGATTGTCAATCCT
GTTGAAGGAGAAAAAATAGTAAATAATGTCGGGCTCATATATATACTTATGAAAGTGTGTTACATTAAGT
TACTATAGACATGATA