Sulfatase evolution: ARSK

From genomewiki

Introduction to sulfatases

Sulfatases are an old and deeply diverged family of hydrolases that remove sulfate moieties from a variety of small and large molecules. Despite the apparent simplicity of this reaction, the sulfatase domain fold is perhaps the largest known for any enzyme, and an unprecedented formylglycine post-translational modification of an encoded cysteine, serine or threonine is critical to activity. The fold is closely related to that of alkaline phosphatases, though primary sequence alignability has almost completely dissipated.

The 17 human paralogs reside either in lysosomes or in the endoplasmic reticulum. Mutations in these genes result in diseases that provide important clues as to natural substrates (which accumulate in lysosomal storage diseases). However, only 8 of the 17 genes had an associated disease at OMIM as of Sept 2010. Functions of the remaining sulfatases have yet to be discovered, perhaps because the accumulating metabolite is not toxic or has an alternative catabolic pathway. Such diseases could be recessive and hence rare in the case of unassigned autosomal sulfatases.

ARSK is such a gene. It was first described in 2003 as SULFX, yet its substrate and function remain unknown -- it has not yet been the focus of a single experimental paper. ARSK is a fairly typical sulfatase of 536 amino acids encoded by eight exons on human chr 5, with a conventional CPSRA formylglycine motif. It lacks overt membrane insertional regions and a GPI terminal motif, so it is presumably soluble.
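The formylglycine motif can be located with a simple pattern scan. The sketch below assumes the commonly cited minimal sulfatase signature [C/S]-x-P-x-R, in which the first residue is the one converted to formylglycine; the pattern, the function name find_fgly_motifs and the offset parameter are illustrative choices, not part of the original analysis. The test sequence is the second coding exon of human ARSK as listed in the reference sequences on this page. Under this pattern the hit begins one residue upstream of the CPSRA quoted above, at the first cysteine of CCPSRA.

```python
import re

# Assumed minimal sulfatase signature: [C/S]-x-P-x-R.
# A lookahead is used so that overlapping candidates are all reported.
FGLY_PATTERN = re.compile(r"(?=([CS].P.R))")


def find_fgly_motifs(protein, offset=0):
    """Return (1-based position, matched pentapeptide) for every hit.

    `offset` is the number of residues that precede `protein` in the
    full-length sequence, so positions can be reported in protein
    coordinates when only a fragment is scanned.
    """
    return [(m.start() + 1 + offset, m.group(1))
            for m in FGLY_PATTERN.finditer(protein)]


# Second coding exon of human ARSK, copied from the reference record.
ARSK_EXON2 = "DGRLTFHPGSQVVKLPFINFMKTRGTSFLNAYTNSPICCPSRA"

if __name__ == "__main__":
    for position, motif in find_fgly_motifs(ARSK_EXON2):
        print(position, motif)
```

Passing the length of the first exon as offset converts the fragment position into a position in the full protein. The same scan can be run over the other ARSK orthologs to check that the motif is conserved.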

ARSK is, however, peculiar in several bioinformatic respects. Although it is clearly a full-length duplicate of an ancestral sulfatase, its opaque evolutionary relationship to non-orthologous sulfatases makes it difficult to place in the sulfatase gene tree. Its closest affinity (percent identity in the low 20s) is perhaps with IDS, which removes the sulfate from iduronate, though the ARSK substrate may have drifted off to something else entirely during the 600+ million years since gene duplication.

A second unusual feature is that the 7 introns within the coding region of ARSK bear no relationship in position or phase to those of other human sulfatases. This suggests that the gene duplication event leading to ARSK and other sulfatases preceded the main era of gene intronation, i.e. sulfatases initially had no introns (as in bacterial genes) and were subsequently intronated independently in early eukaryotes. Once established, the introns of ARSK have been stable over billions of years of gene tree branch length.
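Intron positions and phases can be read directly from the phase-annotated exon records in the reference sequences section of this page. The sketch below is a minimal parser for that layout under a stated assumption: the digit following each exon is taken to be the phase of the intron after it, and a two-digit token such as "00" or "12" inside a line marks an intron between two exons written on the same line. The function names are hypothetical, and positions are counted in residues exactly as they are written on each line.

```python
def parse_exon_lines(lines):
    """Parse phase-annotated exon lines into (sequence, end_phase) tuples.

    Assumed convention: digits flank each exon, and the first digit of the
    token that follows an exon is the phase of the intron after it.
    Sequence chunks separated only by whitespace are joined into one exon.
    """
    exons = []
    current = ""
    for line in lines:
        for token in line.split():
            if token.isdigit():
                if current:  # a digit token closes the exon being read
                    exons.append((current, int(token[0])))
                    current = ""
            else:
                current += token
    if current:  # trailing sequence with no closing phase digit
        exons.append((current, None))
    return exons


def intron_positions(exons):
    """Return (residues preceding the intron, phase) for each intron."""
    positions = []
    total = 0
    for sequence, phase in exons:
        if sequence.endswith("*") or phase is None:
            break  # stop codon or unterminated exon: no intron follows
        total += len(sequence)
        positions.append((total, phase))
    return positions


# First three coding exons of human ARSK, copied from the reference record.
HUMAN_ARSK_START = [
    "0 MLLLWVSVVAALALAVLAPGAGEQRRRAAKAPNVVLVVSDSF 0",
    "0 DGRLTFHPGSQVVKLPFINFMKTRGTSFLNAYTNSPICCPSRA 1",
    "2 AMWSGLFTHLTESWNNFKGLDPNYTTWMDVMERHGYRTQKFGKLDYTSGHHSIS 2",
]

if __name__ == "__main__":
    for position, phase in intron_positions(parse_exon_lines(HUMAN_ARSK_START)):
        print(position, phase)
```

Comparing introns between paralogs additionally requires a protein alignment, so that residue counts in one gene can be mapped onto columns shared with the other; that step is not attempted here.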

The phylogenetic distribution of ARSK also raises many questions. Within deuterostomes, orthologs are readily located in representatives of all major subclades with the exception of echinoderms and tunicates. ARSK has evolved quite conservatively here, with the human protein still having 54% and 52% identity over 500 residues to Branchiostoma (amphioxus) and Saccoglossus (acornworm) respectively, despite divergences that preceded the Cambrian. Intron positions and phases are precisely preserved apart from two minor fission events, leaving no doubt of orthology within deuterostomes. However, ARSK is otherwise completely missing from other eumetazoans (ecdysozoa, lophotrochozoa, and cnidaria).
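Percent identity figures such as those above depend on how the alignment is made and on which columns are counted. The following sketch shows one common definition: identical residues divided by the number of columns in which both sequences have a residue. The function name and the gap character are assumptions for illustration. The two fragments are the first 20 residues of the fourth coding exon of the human and amphioxus records on this page, assumed here to align without gaps; a short conserved stretch like this does not reproduce the full-length figures.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where both sequences have a residue.

    Both inputs must come from the same pairwise alignment and therefore
    have equal length; columns containing a gap in either row are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Start of the fourth coding exon in the human and amphioxus records,
# assumed to be homologous position by position (no gaps).
HUMAN_FRAGMENT = "NRVEAWTRDVAFLLRQEGRP"
AMPHIOXUS_FRAGMENT = "NRVEAWTRNVNFTLAQEGRP"

if __name__ == "__main__":
    print(round(percent_identity(HUMAN_FRAGMENT, AMPHIOXUS_FRAGMENT), 1))
```

Other definitions divide by the length of the shorter sequence or by the full alignment length including gaps, which gives lower values for the same alignment.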

A final oddity of ARSK, observed early on, is its close proximity to an apparently unrelated gene, TTC37 (twenty tetratricopeptide repeats 37): only 144 bp separate the two genes. These are transcribed divergently and could well share a bidirectional promoter or overlap in their 5' UTRs. This relationship is by no means restricted to the human genes -- it is readily traced back throughout vertebrates. The putative chaperone function of TTC37 remains unspecified, though in June 2010 a disease was assigned to it: trichohepatoenteric syndrome (THES) -- an "autosomal-recessive disorder characterized by life-threatening diarrhea in infancy, immunodeficiency, liver disease, trichorrhexis nodosa, facial dysmorphism, hypopigmentation, and cardiac defects". This does not immediately suggest why ARSK and TTC37 should be so closely linked.

ARSK reference sequences

>ARSK_homSap Homo sapiens (human) 536 aa 8 exons
0 MLLLWVSVVAALALAVLAPGAGEQRRRAAKAPNVVLVVSDSF 0
0 DGRLTFHPGSQVVKLPFINFMKTRGTSFLNAYTNSPICCPSRA 1
2 AMWSGLFTHLTESWNNFKGLDPNYTTWMDVMERHGYRTQKFGKLDYTSGHHSIS 2
1 NRVEAWTRDVAFLLRQEGRPMVNLIRNRTKVRVMERDWQNTDKAVNWLRKEAINYTEPFVIYLGLNLPHPYPSPSSGENFGSSTFHTSLYWLEK 00 VSHDAIKIPKWSPLSEMHPVDYYSSYTKNCTGRFTKKEIKNIRAFYYAMCAETDAML 1
2 GEIILALHQLDLLQKTIVIYSSDHGELAMEHRQFYKMSMYEASAHVPLLMMGPGIKAGLQVSNVVSLVDIYPTML 1
2 DIAGIPLPQNLSGYSLLPLSSETFKNEHKVKNLHPPWILSEFHGCNVNASTYMLRTNHWKYIAYSDGASILPQLF 1
2 DLSSDPDELTNVAVKFPEITYSLDQKLHSIINYPKVSASVHQYNKEQFIKWKQSIGQNYSNVIANLRWHQDWQKEPRKYENAIDQWLKTHMNPRAV* 0

>ARSK_takRub Takifugu rubripes (fugu) 8 exons 504 aa single copy retained after whole genome duplication
0 MSVKLSALILLFLAFHQVLARNRTRPNFLVVMSDAF 0
0 DGRLTFDPGSKVVKLPFINYLRELGVTFINAYTNSPICCPSRA 1
2 AMWSGQFVHLTQSWNNYKCLDANATTWMDLLEVNGYLTKMMGKLDYTSGSHSvs 0
1 NRVEAWTRDVQFLLRQEGRPVTQLVGNMSTVRIMGKDWENIDKATQWIQQRAESSQQPFALYLGLNLPHPYKTESLGPTAGGSTFRTSPHWLEK 00 VSSEHVTVPKWLPGAAMHPVDFYSTFTKNCSGFFTEEEIMNIRAFYYAMCAEADAML 1
2 GQLISALRETHLLNNTVVIFTADHGELAMEHRQFYKMSMFEGSSHVPLLFMGPGLMSGVEADQLVSLVDIYPTVL 1
2 DLADVPPVGSLSGYSLLPLLSTCSSCPGRPHPDWVLSEYHGCNANASTYMLRSGRWKYIAYADGLRVPPQLF 1
2 DMILDKEELHNVVFKFSEVSAQLDKLLRSIVHYPEVSAAVHRYNKESFVAWRHTLGRNYSQVISSLRWHVDWQRNPLANERAIDEWLYGSF* 0

>ARSK_braFlo Branchiostoma floridae (amphioxus) XM_002594507
0 MRMKLDCSAGFLLFWWFTSAVGGTRDDRKNIVFVICDSM 0
0 DGRLIGRGQDSVVDLPNLNYMVQNGVNFRSTYTNSPICCPSRS 1
2 ALWSGLHTHVTQSWNNYKGLPKNYPTWQVRLEQQGYHTQVYGKTDYVSGDHSES 2
1 NRVEAWTRNVNFTLAQEGRPTPVLV 12 GSSSTDRIQLKDWASTDLASHWLLHEAPKQQKPWLLYLGLNLPHPYPTPSMGKNFGGSTFMTSPYWLKK VNSSKVTIPKWLPFSRMHPVDYYSSATKNCTSDFTRDEIMKIREYYYGMCAETDAML 1
2 GQVLDALKASGQADSTYVFFTSDHGELAMEHRQFYKMSMYEASAHIPMVLTGPEVPAGKAVDDLTSLVDVFPTFM 1
2 DIANASQPPGLNGTSLLPLLRNSSDRVDRPDWVLSQYHGCNVNMSTYMLRTGSLKYVAFGDGPNQVSSQLF 1
2 DLDKDPDELHNLAEERQDLASQLDDKLRKLVDYPTVTREVQKYNRDSFMAWKAKLGSRYKDEIANLRWWKDWQKDPQGNQEKVEEWLNNVVS* 0

>ARSK_sacKow Saccoglossus kowalevskii (acornworm) XM_002732823
0 MFSMMQSSILITVLLFTCTCIPRGNEGKPNNVLFIICDAM 0
0 DGRLVGNNLTAVNMANINNRLVSHGVTFTNAYTNSPICCPSRS 1
2 ALWSGLYTHITHAWNNHEGLPADYPTWKIKLEKAGYDSKILGKTDYVSGRHTLS 2
1 NRVEAWTRNVNFTLAQEGRPTPVLVGNKTTIRVKDVDWDNIDKAKDWLENRKSSKATKPFLLYIGINLPHPYSTPGEGEHPGGSTFMTSPYW LQYVDMSKVTIPKWTPLDKMHPVDYYESATKNCTSHFTKDEIRKIRAYYYGMCAEVDGMV 1
2 GEILDQLDSLGLTNTTQVIFTSDHGEMAMEHRQFYKMTMYEASSHVPLIITNPTVPSRQGVAVNDPVSLVDIFPTLM 1
2 DMAAIHHPVGLNGTSLMPYLEGKSHVKKPDWVLSQYHGCNVNMSAYMLRRQEWKYITYGNGKQVAPHLF 1
2 NLDEDPDELHDYANERHDIIAEMDNKLRSIIDYADITNEVSRYNKESFSSWKTSIGDKYSDTIANLRWWKDWQKDPNGNEQRIEEWLKSVE* 0

>ARSK?_monBre Monosiga brevicollis ABFJ01000822 XM_001747506 erroneous gene model
0 MGNPIRGGSLLIVAASLLVCATLGTAKQPNILFVIDESTDAKAYFAKNPEKAPMPLPNLR 0
0 VPAHVMNSYHYHVRRQSAFFLPDASPCMLSQPNFDL 0
0 IHHIPHQQTNPSTGLFVTGAWNNYEGLPE 0
0 NYDLKYSDVLHKGGYNVGIFGKTDFTAGGHTVDARVTAWTNKVNFPFTLQNGSAGWYDETGPLVRTVNVSK 0
0 VVHVSDWNHANQTAKFIADAATHDEPWLAYVGFDIVHP 1
2 NYVSSPYWLDQVDMDKVTVPEWIPLDQLHPEDFQATMKKNMANLTHDPAFIKSVRQHYYGMIAE 2
1 YDAILGVVLDAVEASGEADNTY 0
0 IFVTSDHGDMNMEHQQYYK 0
0 MTYYDPSARVPLIVTGPTVQANVTYENLTSHLDFFPTFLELANV 1
2 RKDDYKYVTFGSGKEVAPRLFNMREDPLEMNDLAPSNPSLVAELDAELRSYWDYPSIASTAESYNK 1
2 DSFALLRASFNDEDKFKAYLATLRWSTSWSYDPEGSYAAIEAWLKTPNSTFEWAFP* 0