Selenoprotein evolution: GPX
Introduction to glutathione peroxidases
The GPX family of proteins can be traced back to great phylogenetic depth. It has experienced various expansions, some narrowly lineage-specific but others retained very broadly. Gene family contractions are more difficult to establish since absence of evidence in low-coverage (2x) genomic assemblies rarely constitutes evidence of absence. In vertebrates, the most notable recent expansion is of GPX3 in placental mammals into the divergently transcribed tandem pair GPX5 and GPX6. Marsupials and earlier diverging species lack these latter genes while retaining GPX3 in syntenically established orthologous position (flanking genes TNIP1 and ANXA6).
GPX gene tree topology
The topology of the GPX gene tree can be unambiguously established as ((((GPX5,GPX6),GPX3),(GPX1,GPX2)),(GPX4,(GPX7,GPX8))) using blastp clustering, chromosomal locations, rare genomic events such as indels, signature residues, and intron positions and phases. The main technical issue is the placement of GPX4, which can be resolved using the large indels it shares with GPX7 and GPX8 in addition to its sequence clustering. GPX4 is by far the most conserved member of this gene family, suggesting its function has not strayed far from its role in ancestral metazoa, even as its gene duplicates have subfunctionalized and sublocalized.
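The grouping claims implied by the topology string above can be checked mechanically. The following is a minimal sketch, assuming the string is treated as label-only Newick (no branch lengths, no trailing semicolon); the helper names parse_newick, leaves and is_clade are illustrative and not part of any tool used here.

```python
def parse_newick(text):
    """Parse a label-only Newick string into nested tuples of leaf names."""
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume "("
            children = [node()]
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(node())
            pos += 1  # consume ")"
            return tuple(children)
        # Otherwise read a leaf label up to the next delimiter.
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]

    return node()


def leaves(tree):
    """Return the set of leaf names under a (sub)tree."""
    if isinstance(tree, str):
        return {tree}
    out = set()
    for child in tree:
        out |= leaves(child)
    return out


def is_clade(tree, names):
    """True if some subtree contains exactly the given leaf names."""
    names = set(names)
    if leaves(tree) == names:
        return True
    if isinstance(tree, str):
        return False
    return any(is_clade(child, names) for child in tree)


TOPOLOGY = "((((GPX5,GPX6),GPX3),(GPX1,GPX2)),(GPX4,(GPX7,GPX8)))"
tree = parse_newick(TOPOLOGY)
print(is_clade(tree, {"GPX4", "GPX7", "GPX8"}))  # GPX4 grouped with GPX7/GPX8
print(is_clade(tree, {"GPX3", "GPX5", "GPX6"}))  # GPX3 grouped with GPX5/GPX6
print(is_clade(tree, {"GPX4", "GPX1", "GPX2"}))  # alternative GPX4 placement
```

Under this topology the first two groupings are clades and the third is not, which is the placement of GPX4 argued for in the text.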
The SECIS elements of GPX3 have already lost significant nucleotide alignability at the human-marsupial divergence, even though the selenocysteine insertion elements here are likely still orthologous as DNA and structurally alignable using the parsing provided by the SECIS detection web tool (note the difficulty of quantifying the quality of secondary or tertiary structural alignments).
The SECIS elements of GPX6 offer the most favorable scenario for detecting paralogous SECIS elements since this gene arose rather recently from GPX3 in the placental mammal stem whereas other duplications appear much older. However no blastn match is observed between any GPX3 SECIS element and any GPX6 SECIS element. This suggests blastn, even collection to collection, is not a sufficiently sensitive approach for comparative genomics of SECIS elements, in part because of their fairly short length and frequent loop indels. MultAlin however can successfully align the sequences, attaining agreement with key residues identified by RNAfold.
Other approaches need to be considered, such as conserved signature motifs as in Logos or, less feasibly, root-mean-square structural deviations. The latter could be confounded by convergent evolution since all SECIS elements are under selective pressure to conform to the same (ie slowly evolving) SECISBP2/KIAA0256 binding site.
Alignment of GPX sequences
The alignment shows 103 of the 240 available deuterostome GPX sequences, distributed more or less evenly over both the eight members of the gene tree and the chordate species tree. The reddish color shows residues conserved at the 90% level; the bluish shows less-conserved residues at the 70% level. The selenocysteine is represented by Z because U is not retained by the alignment tool used here.
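The two shading thresholds can be illustrated with a small per-column calculation. This is only a sketch under stated assumptions: the alignment tool's exact definition of conservation is not given here, so the code simply takes the fraction of rows carrying the most common residue, counting gaps in the denominator. The toy rows are hypothetical and are not taken from the actual alignment.

```python
from collections import Counter


def column_conservation(rows):
    """Per column, return (most common residue, fraction of all rows carrying it).

    Gaps ('-') count toward the denominator but are never the consensus.
    """
    if len({len(r) for r in rows}) != 1:
        raise ValueError("aligned rows must have equal length")
    result = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != "-")
        if not counts:
            result.append(("-", 0.0))
            continue
        residue, n = counts.most_common(1)[0]
        result.append((residue, n / len(column)))
    return result


def shade(rows, high=0.9, low=0.7):
    """Return one character per column: 'H' (>= high), 'l' (>= low), '.' otherwise."""
    out = []
    for _, frac in column_conservation(rows):
        out.append("H" if frac >= high else "l" if frac >= low else ".")
    return "".join(out)


# Hypothetical toy alignment of ten rows; Z stands for selenocysteine as in the text.
toy_rows = ["ZGKT"] * 5 + ["ZGTT"] * 3 + ["ZATT", "ZAT-"]
print(shade(toy_rows))
```

Here 'H' corresponds to the reddish (90%) class and 'l' to the bluish (70%) class; the thresholds are parameters so other cutoffs can be tried.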
Note the three large deletions in GPX4, GPX7, and GPX8 relative to other GPX. The latter have ancestral length judging by the GPX4-classifying sequence in the metazoan outgroup species Monosiga brevicollis. Parsimoniously, the deletions occurred once in the common ancestral sequence to GPX4 and GPX7/8. They are already present in the tubeworm metazoan Ridgeia. Thus the pattern of indels suggests grouping GPX4 with GPX7 and GPX8 to the exclusion of GPX1 and GPX2.
Comparative genomics of GPX selenocysteine residues
It can be seen that most cysteines occur sporadically across the protein or align only within a particular paralog group. The exceptions are the universal selenocysteine site (occupied by cysteine in some paralog families and, anomalously, by serine in teleost fish) and a following cysteine some 30 residues distal (eg UGKTEVNYTQLVDLHARYAECGLRILAFPC in GPX4_homSap). These two residues very likely form a mixed selenenylsulfide (resp. disulfide) with an essential role in the redox reactions carried out by glutathione peroxidases. However some exceptions occur in GPX3 and GPX2.
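The spacing between the selenocysteine and downstream cysteines can be read off directly from a sequence. The sketch below uses the GPX4_homSap segment quoted above; the function name and the set of symbols accepted for selenocysteine (U, u, and the Z used by the alignment tool) are assumptions made for illustration.

```python
def sec_to_cys_offsets(seq, sec_symbols="UuZ"):
    """Offsets (in residues) from the first selenocysteine to each downstream cysteine.

    Returns an empty list if no selenocysteine symbol is present.
    """
    sec_index = next((i for i, c in enumerate(seq) if c in sec_symbols), None)
    if sec_index is None:
        return []
    return [i - sec_index for i in range(sec_index + 1, len(seq)) if seq[i] == "C"]


segment = "UGKTEVNYTQLVDLHARYAECGLRILAFPC"  # GPX4_homSap segment quoted in the text
print(len(segment), sec_to_cys_offsets(segment))
```

The quoted segment is 30 residues long with the conserved cysteine as its last residue, so the offset reported for that cysteine is one less than the segment length. The segment also contains an intermediate cysteine, which is why the function returns all downstream offsets rather than a single value.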
Other cysteines might be in structural disulfides yet in some GPX an odd number occur, meaning they could not all pair off. This, in conjunction with non-homologous positions, argues against structural disulfides in most cases.
Note GPX7 and GPX8 have classical KDEL endoplasmic reticulum retention signals. This implies an interaction with the protein systems responsible for retrograde translocation and retention. This subcellular localization apparently arose in GPX7 after the divergence of amphioxus since the motif is missing there, in sea urchin, and in earlier-diverging eukaryotes. GPX8 arose as a subsequent duplication of this ancestral GPX7 and so inherited the motif.
Invertebrate GPX2 and GPX4 sequences
>GPX2_litVan Litopenaeus vannamei (shrimp) Metazoa; Arthropoda; Crustacea
ASSAIKSFYDLSAKALSGEMVSFKKFQGKVVLVQNTASLuGTTTRDFHQMNQLKEEFGDK LEVLAFPC NQFGHQENTTEGELLSSLRHVRPGNNFEPKMVMFGKVDVNGSTADPVFKYLKERLPLPADDSVSFMSDPKCIIWTPVCRSDIAWNFEKFLIGKDGQPFKRFSKKYETILLKDEIANLLKA*
>GPX2_eupScp Euprymna scolopes (bobtail squid) cDNA Metazoa; Mollusca 62%
KSFFDFSAKTXAGENIDFSRFKGKVVLVENVASLuGTTTRDFTQMNELVAMFADKLVVLG FPC NQFGHQENADGTEIIQSLCYVRPGNGFRPNFSIMEKVSVNGEKTHPIFDFLKDHLPAPSDDPISLMGNPQFITWKPVKRSDVSWNFEKFLVAPDGKPYMRYSRNFLTINLKADIQKLV
>GPX2_capSpp Capitella spp (polychaete) cDNA Metazoa; Annelida 64%
MQAAKMAKNFYQLSAELLNGKKVQMSAYKGKVVLVENVASLuGTTVRDYHQMNQLMEQFGDRLQILAFPC NQFGHQENTTNDEILKSLKYVRPGNNYTPKFDMFKKVDVNGETAHPVFQFLREQLPTPSDDTVSLMSNPKFLIWSPVCRNDVSWNFEKFLIGPDGEPVKRYSRHFETINIASDIKKLM
>GPX2_helRob Helobdella robusta (leech) cDNA Metazoa; Annelida 66%
KNFYQLSAELLNGKKVQMSAYKGKVVLVENVASLuGTTVRDYHQMNQLMEQFGDRLQI LAFPC NQFGHQENTTNDEILKSLKYVRPGNNYTPKFDMFKKVDVNGETAHPVFQFLREQLPTPSDDTVSLMSNPKFLIWSPVCRNDVSWNFEKFLIGPDGEPVKRYSRHFETINIASDIKKLM
>GPX2_mesGib Mesobuthus gibbosus (scorpion) cDNA Metazoa; Arthropoda
MAKSFYDLSAKLLLTGEKINFSQFKGKVVLIENVASLuGTTVRDYTQMNELLNKFGEELEILGFPC NQFGHQENGNEEEIINSLKYVRPGNGFETKITLFEKIDVNGAGAHQVFQFLRNELPYPIDDPNSLMTNPQCIIWSPVSRNDVGWNFEKFLITRDGTPFRRYSRNYLTSDIARDIQLLI
>GPX4_booMic Boophilus microplus (tick) Metazoa; Arthropoda; Chelicerata
TMATADDSWKDASSIYDFSAVDIDGNEVSLDKYKGHVALIVNVASKuGKTNKNYTQLVELHEKYAESKGLRILAFPC NQFGGQEPGTETDIKKFVEKYNVKFDMFSKVNVNGDKAHPLWKYLKQKQSGFLTDAIKWNFTKFVVDKEGQPVHRYAPTTDPLDIEPD
>GPX4_nasVit Nasonia vitripennis (wasp) Metazoa; Arthropoda; Hexapoda; Insecta
AEVKFNQDTDWSKAKSIYEFHAKDIRGNDVSLDKYRGHVAIIVNVASQCGLTDTNYKQLQSLFEKYGKSKGLRILAFPS NEFAGQEPGTSEEILNFVKKYNVSFDMFEKIQVNGDEAHPLYKWLKSQEEGAGTITDGIKWNFTKFLIDKNGKVVSRFAPTTEPFSMEDTITKYL*
>GPX4_triCas Tribolium castaneum (red flour beetle) Metazoa; Arthropoda; Hexapoda; Insecta
EKPQEAASIYEFTANDIKGEPVSLEKYKGHVCIIVNVASQCGYTKNNYAELVDLFNEYGESKGLRILAFPC NQFAGQEPGTNEEICQFVSSKNVKFDVFEKINVNGNDAHPLWKYLKHKQGGTLGDFIKWNFTKFIIDKNGQPVERHGPSTNPKDLVKSLEKYW*
>GPX4_apiMel Apis mellifera (bee) Metazoa; Arthropoda; Hexapoda; Insecta
NWKSASTIYDFHAKDIHGNDVSLNKYRGHVCIIVNVASNCGLTDTNYRELVQLYEKYNEKEGLRILAFPS NEFGGQEPGTSVEILEFVKKYNVTFDLFEKINVNGDNAHPLWKWLKTQANGFITDDIKWNFSKFIINKEGKVVSRFAPTVDPLQMESELK
>GPX4_plaDum Platynereis dumerilii (ragworm) Metazoa; Annelida; Polychaeta
CNMATSTDKNAYKKAGSIYEFSAKDIDGNDEVSLEKYKGEVCLIVNVASKuGLTDKNYRQLQALHEELAGKGLRILAFPC NQFGSQEPGSDEEIKKFATEKYNVQFDMFSKIDVNGSDAHPLWKYLKHKKGGTLGDFIKWNFAKFLVDRQGQPFKRYGNSTAPFDFKKDIE
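Whether a given sequence carries selenocysteine (written u above) or cysteine at the active site can be tabulated with a short script. This is a minimal sketch: the records embedded below are 25-residue excerpts around the active site copied from four of the sequences above, and the pattern AS.[uUC]G is an assumption inferred from those sequences rather than an established GPX signature.

```python
import re


def parse_fasta(text):
    """Parse FASTA text into {record name: sequence}.

    The name is the first word after '>'. Spaces inside sequence lines and a
    trailing '*' (stop marker) are dropped.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.replace(" ", "").rstrip("*")
    return records


# Assumed active-site pattern: A, S, any residue, then Sec (u/U) or Cys, then G.
ACTIVE_SITE = re.compile(r"AS.([uUC])G")


def active_site_residue(seq):
    """Classify the active-site residue as 'Sec', 'Cys', or 'unknown'."""
    match = ACTIVE_SITE.search(seq)
    if match is None:
        return "unknown"
    return "Sec" if match.group(1) in "uU" else "Cys"


EXCERPTS = """
>GPX2_litVan excerpt
GKVVLVQNTASLuGTTTRDFHQMNQ
>GPX4_booMic excerpt
GHVALIVNVASKuGKTNKNYTQLVE
>GPX4_nasVit excerpt
GHVAIIVNVASQCGLTDTNYKQLQS
>GPX4_apiMel excerpt
GHVCIIVNVASNCGLTDTNYRELVQ
"""

calls = {name: active_site_residue(seq) for name, seq in parse_fasta(EXCERPTS).items()}
for name, call in calls.items():
    print(name, call)
```

For these excerpts the shrimp and tick sequences are called as selenocysteine and the two hymenopteran GPX4 sequences as cysteine. Full-length records can be passed through the same two functions.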
GPX3, GPX5, and GPX6: mixed retention of selenocysteine
The top image shows that the gene duplication leading to the divergently transcribed tandem pair GPX5 and GPX6 is moderately old. Its stability to subsequent rearrangement over a billion years of placental branch length suggests shared upstream promoter and regulatory elements that are difficult to separate with retention of functionality.
Note the UCSC 28way alignment is misleading already at platypus because the 'tandem' shown really is just two instances of paralogous GPX3. Neither platypus, echidna, nor any of 3 marsupials contains GPX5 or GPX6, though the new gene pair is present in all available (basal) afrothere and xenarthran genomes. Hence the gene duplication event likely occurred during the placental stem. No extant species exhibits the intermediate two-gene state GPX3 + GPX5/6.
The lower image suggests the GPX5 selenocysteine was replaced by cysteine after the duplication event but prior to divergence of any extant species. GPX6 exhibits a bit of phylogenetic incoherence in its loss of selenocysteine in various placental subclades, whereas GPX3 retains selenocysteine in all known vertebrate species. Thus no species is expected to have a SECIS element downstream from GPX5, all species should have the element downstream of GPX3, and some species will have it downstream of GPX6 while others may retain only degenerate elements.
These three paralogs have identical and distinctive intron locations and phases, so represent a distinct subcluster of the overall set of eight GPX genes. GPX3 is the parent gene: full-length genes are available for lamprey, amphioxus, tunicate, and sea urchin but not yet in protostomes (which have clear GPX4 and GPX2 orthologs however).
It is critical in annotating a complex gene family to build up bona fide sets of orthologs for each member, validated by blast clustering, indels, intron pattern, and chromosomal synteny. These initial reliable sets can then be used as a classifying target database at MultAlin to assign additional sequences to their correct class.
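The idea of classifying a new sequence against validated reference sets can be sketched in a few lines. This is not the blast or MultAlin procedure described above: it swaps in plain ungapped percent identity over equal-length windows as a crude stand-in. The reference and query strings are 25-residue excerpts around the active site taken from the invertebrate sequences listed earlier; all function names are hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length, ungapped windows."""
    if len(a) != len(b):
        raise ValueError("windows must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def classify(query, reference_sets):
    """Return (best class, best identity) using the top-scoring reference member."""
    best_class, best_score = None, -1.0
    for gene_class, members in reference_sets.items():
        for member in members:
            score = identity(query, member)
            if score > best_score:
                best_class, best_score = gene_class, score
    return best_class, best_score


# Reference windows excerpted from the sequences listed earlier (u = selenocysteine).
REFERENCE_SETS = {
    "GPX2": [
        "GKVVLVQNTASLuGTTTRDFHQMNQ",  # GPX2_litVan
        "GKVVLVENVASLuGTTVRDYHQMNQ",  # GPX2_capSpp
    ],
    "GPX4": [
        "GHVALIVNVASKuGKTNKNYTQLVE",  # GPX4_booMic
        "GHVAIIVNVASQCGLTDTNYKQLQS",  # GPX4_nasVit
    ],
}

queries = {
    "GPX4_triCas": "GHVCIIVNVASQCGYTKNNYAELVD",
    "GPX2_eupScp": "GKVVLVENVASLuGTTTRDFTQMNE",
}
for name, window in queries.items():
    print(name, classify(window, REFERENCE_SETS))
```

Both queries fall into the class their names indicate. A real classification would use full-length sequences and a gapped alignment, and should be confirmed with indels, intron pattern, and synteny as stated above.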