Utilities for comparative genomics: Difference between revisions
Tomemerald (talk | contribs) No edit summary |
Tomemerald (talk | contribs) No edit summary |
Revision as of 12:57, 22 April 2008
State of the art: Comparative Genomics
Populating a feature stack
Populating a feature stack (e.g. of a coding exon) is the central task in comparative genomics. A feature stack collects all information available from all available species on a suitable topic. Ideally the topic, for example the evolutionary history of a coding exon, has a width in base pairs that is a fraction of the characteristic width of the available experimental data. That is, a coding gene will have single exons wholly contained within individual trace reads and multiple exons within gapless GenBank wgs contigs.
On the other hand, a segmental duplication of 10 genes, say the one on chr20q/chr8, is poorly suited to a feature stack today because only a couple of mammals have an adequate assembly over the intrinsic span of the feature.
Thus it is timely to construct feature stacks for coding exons and the like but premature to consider the comparative genomics of longer features. Consequently all research today that can fully exploit the power of comparative genomics is restricted to computable feature stacks. In other words, if someone can precompute all possible contemporary feature stacks, in effect they have written all possible research papers. This makes them the masters of the comparative genomics universe.
Brian Raney has recently computed this for a genome-based multiz alignment of 28 species, the UCSC 28way conservation track, for both nucleotides and proteins.
How many feature stacks are there, how difficult is it to compute, store, and query them, and what associated precomputed products go with them? Suppose there are 190,000 coding exons and 3 non-coding features for each of 20,000 genes, giving about 250,000 features, each of width 500 bp and needed depth of 50 species. That would fit handily within an Excel spreadsheet. So the number, storage, and query are non-issues.
Note though that the 28way alignment is far from perfect: exons are mis-populated with pseudogenes, there are misalignments, and infill is incomplete, not to mention that 50 vertebrate genomes are actually available in some form. Thus considerable manual curation and infill is needed from the trace archives (long and short) and GenBank database divisions such as nr, est_others, and wgs. In May 2008, typically 43-44 orthologs for a given exon can be located.
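The size estimate can be worked through explicitly. The following is only a back-of-envelope sketch that takes the figures quoted in the paragraph at face value; the function name and its return layout are illustrative and not part of any existing tool.

```python
def stack_dimensions(coding_exons, genes, noncoding_per_gene, width_bp, depth_species):
    """Return (features, cells, residues) for a genome-wide set of feature stacks.

    features: one spreadsheet row per feature
    cells:    one cell per feature per species
    residues: total characters stored if every cell holds width_bp characters
    """
    features = coding_exons + noncoding_per_gene * genes
    cells = features * depth_species
    residues = cells * width_bp
    return features, cells, residues


if __name__ == "__main__":
    features, cells, residues = stack_dimensions(
        coding_exons=190_000,
        genes=20_000,
        noncoding_per_gene=3,
        width_bp=500,
        depth_species=50,
    )
    print(f"{features:,} features, {cells:,} cells, {residues:,} residues")
```

Treating each feature as a row and each species as a column keeps the row count and the column count as the quantities that matter for a spreadsheet, separate from the raw character total.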
Here's an example of a feature stack and the implicit paper that practically writes itself:
Comparative Genomics of DRY motifs in exon 3 of RGR Opsins:
1 RWPYGSDGCQAHGFQGFVTALASICSSAAIAWGRYHHYCT 1 human
1 RWPYGSDGCQAHGFQGFVTALASICSSAAIAWGRYHHYCT 1 macaque
1 RWPYGSGGCQAHGFQGFTTALASICGSAAIAWGRYHHYCT 1 lemur
1 RWPHGSEGCQVHGFQGFATALASICGSAAVAWGRYHHYCT 1 mouse
1 RWPYGSDGCQAHGFQGFATALASICGSAAIAWGRYHHYCT 1 rabbit
1 RWPYGSDGCQAHGFQGFVTALASICSSAAIAWGRYHHYCT 1 horse
1 RWPYGPDGCQAHGFQGFATALASICSSAALAWGRYHHYCT 1 dog
1 RWPYGSGGCQAHGFQGFAAALASICGSAAVAWGRYHHYCT 1 bat
1 RWPYGSDGCQAHGFQGFVTALASICSCAAIAWERYHHYCT 1 elephant
1 HWPYGSGGCQAHGFQGFTVALASICSCAAIAWERYHHYCT 1 tenrec
1 RWPHGSDSCQAHSFQGFATALASISSSAAIAWERYRHHCT 1 sloth
1 RWPYGSGGCQAHGFQGFVTALASISSSAAIAWERCHRHCI 1 armadillo
1 HWPYGAEGCRLHGFQGFATALASISLSAAIGWDRYLRHCS 1 platypus
1 YWPYGSDGCQIHGFHGFLTALTSISSAAAVAWDRHHQYCT 1 lizard
1 YWPYGSEGCQIHGFQGFLTALASISSSAAVAWDRYHHYCT 1 chicken
1 YWPYGSEGCQIHGFQGFVAALSSIGSCAAIAWDRYHQYCT 1 frog
1 YWPYGSDGCQTHGFQGFVTALASIHFIAAIAWDRYHQYCT 1 stickleback
1 YWPYGSDGCQTHGFQGFVTALASIHFVAAIAWDRYHQYCT 1 fugu
1 YWPYGSEGCQTHGFHGFLTALASIHFIAAIAWDRYHQYCT 1 medaka
1 YWPYGSEGCQTHGFHGFLMALASINACAAIAWDRYHQNCS 1 elephantshark
1 EWPFGSIGCQLDAFIGMAPTFISIAGAALVAKDKYYRICK 1 tunicate
Now half the data needed for a full feature stack is not available at any genome browser but rather at GenBank etc. Whether that data is retrieved by automated pipeline or by one-off queries, there's a universal roadblock in that blast output is terribly formatted for comparative genomics purposes:
- the blast algorithm is unaware of splice donors and acceptors and ignores user parsing
- blocks are provided in score order instead of natural query exon order
- blocks have faux match extensions at the ends (exon dribble) complicating clean ortholog extraction
- blocks can contain multiple exons when introns are short leading to nonsense in tblastn
- line width of 60 is way too short, causing string searches to fail because of carriage returns
- gaps necessitate dashes causing string searches to fail
- the match genus and species are inconveniently provided
Fixing Blast output in order to populate a comparative genomics feature stack
No utility web tool exists anywhere on the internet that can repair blast output. Consequently one would make a great addition. The needed algorithm would largely carry over to any pipeline program to populate feature stacks genomewide. The example below shows what needs to happen, as limited to one species. In this case, a single tblastn query to GenBank wgs adds 19 mammals to the feature stack.
Here is the original query, 3 exons of bovine rhodopsin marked up for reading frame:
>RHO1_bosTau Bos taurus (cow)
2 GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSR 2
1 YIPEGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKE 0
0 AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAVYNPVIYIMMNKQ 0
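A pipeline first has to read this marked-up query. The sketch below assumes one exon per line in the form "acceptor-phase residues donor-phase" under a ">" header; that layout, and the names Exon, parse_exonic_fasta and exon_query_ranges, are illustrative assumptions rather than an established format or library.

```python
from typing import NamedTuple


class Exon(NamedTuple):
    acceptor_phase: int
    residues: str
    donor_phase: int


def parse_exonic_fasta(text):
    """Return {header: [Exon, ...]} from phase-annotated exon records."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            records[current] = []
        elif current is not None:
            left, residues, right = line.split()
            records[current].append(Exon(int(left), residues, int(right)))
    return records


def exon_query_ranges(exons):
    """1-based inclusive residue range of each exon in the concatenated query."""
    ranges = []
    start = 1
    for exon in exons:
        end = start + len(exon.residues) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges
```

For the bovine query above, the three exons are 57, 55 and 80 residues long, so the ranges come out as 1-57, 58-112 and 113-192, which agrees with the final query coordinate of 192 reported by blast.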
Here is the raw output match to marsupial:
>gb|AAFR03021222.1| Monodelphis domestica cont3.021221, whole genome shotgun sequence
Length=94985

Score = 149 bits (377), Expect = 9e-34, Method: Compositional matrix adjust.
Identities = 71/74 (95%), Positives = 74/74 (100%), Gaps = 0/74 (0%)
Frame = +1

Query  119    AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTS  178
              AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGS+FGPIFMTIPAFFAK+S
Sbjct  21139  AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSNFGPIFMTIPAFFAKSS  21318

Query  179    AVYNPVIYIMMNKQ  192
              +VYNPVIYIMMNKQ
Sbjct  21319  SVYNPVIYIMMNKQ  21360

Score = 117 bits (294), Expect = 5e-24, Method: Compositional matrix adjust.
Identities = 54/59 (91%), Positives = 57/59 (96%), Gaps = 0/59 (0%)
Frame = +3

Query  1      GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYI  59
              GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAI+GVAFTWVMALACA PPL+GWSR +
Sbjct  18648  GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIIGVAFTWVMALACAFPPLIGWSRLV  18824

Score = 107 bits (268), Expect = 4e-21, Method: Compositional matrix adjust.
Identities = 51/59 (86%), Positives = 52/59 (88%), Gaps = 0/59 (0%)
Frame = +2

Query  55     WSRYIPEGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEA  113
              + RYIPEGMQCSCGIDYYT   E NNESFVIYMFVVHF IPLIVIFFCYGQLVFTVKE
Sbjct  20546  FCRYIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPLIVIFFCYGQLVFTVKEV  20722
Stage 1: order the output blocks according to input exon order and fix line width:

Query  1      GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYI  59
              GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAI+GVAFTWVMALACA PPL+GWSR +
Sbjct  18648  GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIIGVAFTWVMALACAFPPLIGWSRLV  18824

Query  55     WSRYIPEGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEA  113
              + RYIPEGMQCSCGIDYYT   E NNESFVIYMFVVHF IPLIVIFFCYGQLVFTVKE
Sbjct  20546  FCRYIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPLIVIFFCYGQLVFTVKEV  20722

Query  119    AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAVYNPVIYIMMNKQ  192
              AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGS+FGPIFMTIPAFFAK+S+VYNPVIYIMMNKQ
Sbjct  21139  AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSNFGPIFMTIPAFFAKSSSVYNPVIYIMMNKQ  21360
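Stage 1 can be sketched in a few lines. This is a minimal illustration that assumes the standard pairwise text layout shown in the raw output (a "Score =" line opening each block, then Query and Sbjct rows of the form label, start, residues, end); the function names are hypothetical, and the midline is ignored because the later steps only need the Query and Sbjct rows.

```python
import re

# One alignment row: label, start coordinate, residues, end coordinate.
SEGMENT = re.compile(r"^(Query|Sbjct)\s+(\d+)\s+(\S+)\s+(\d+)\s*$")


def parse_hsps(blast_text):
    """Collect blocks from pairwise blast text, joining rows wrapped at 60 columns."""
    hsps = []
    for line in blast_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Score ="):
            hsps.append({"Query": None, "Sbjct": None})  # a new block begins
            continue
        match = SEGMENT.match(stripped)
        if match and hsps:
            label = match.group(1)
            start, end = int(match.group(2)), int(match.group(4))
            row = hsps[-1][label]
            if row is None:
                hsps[-1][label] = {"start": start, "seq": match.group(3), "end": end}
            else:
                row["seq"] += match.group(3)  # undo the line wrap
                row["end"] = end
    return [h for h in hsps if h["Query"] and h["Sbjct"]]


def in_query_order(hsps):
    """Re-sort blocks from score order into query (exon) order."""
    return sorted(hsps, key=lambda h: h["Query"]["start"])
```

Sorting on the query start coordinate rather than on score is what restores natural exon order, and concatenating wrapped rows removes the line breaks that defeat string searches.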
Stage 2: Find the correct input exons

Query  1      GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYI  59
              GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAI+GVAFTWVMALACA PPL+GWSR +
Sbjct  18648  GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIIGVAFTWVMALACAFPPLIGWSRLV  18824

Query  55     WSRYIPEGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEA  113
              + RYIPEGMQCSCGIDYYT   E NNESFVIYMFVVHF IPLIVIFFCYGQLVFTVKE
Sbjct  20546  FCRYIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPLIVIFFCYGQLVFTVKEV  20722

Query  119    AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAVYNPVIYIMMNKQ  192
              AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGS+FGPIFMTIPAFFAK+S+VYNPVIYIMMNKQ
Sbjct  21139  AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSNFGPIFMTIPAFFAKSSSVYNPVIYIMMNKQ  21360
Stage 3: Transfer the match to marsupial

Query  1      GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYI  59
              GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAI+GVAFTWVMALACA PPL+GWSR +
Sbjct  18648  GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIIGVAFTWVMALACAFPPLIGWSRLV  18824

Query  55     WSRYIPEGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEA  113
              + RYIPEGMQCSCGIDYYT   E NNESFVIYMFVVHF IPLIVIFFCYGQLVFTVKE
Sbjct  20546  FCRYIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPLIVIFFCYGQLVFTVKEV  20722

Query  119    AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAVYNPVIYIMMNKQ  192
              AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGS+FGPIFMTIPAFFAK+S+VYNPVIYIMMNKQ
Sbjct  21139  AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSNFGPIFMTIPAFFAKSSSVYNPVIYIMMNKQ  21360
Stage 4: Export the desired output with acquired header
>RHO1_monDom Monodelphis domestica (opossum)
2 GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIIGVAFTWVMALACAFPPLIGWSR 2
1 YIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPLIVIFFCYGQLVFTVKE 0
0 AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSNFGPIFMTIPAFFAKSSSVYNPVIYIMMNKQ 0
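Stages 2 to 4 amount to clipping each match to the residue range of the query exon it covers, which removes the exon dribble at either end. The sketch below is one possible way to do that from the query coordinates alone; trim_to_exon is a hypothetical helper, and it assumes the query start coordinate printed by blast agrees with the first residue shown in the block.

```python
def trim_to_exon(query_start, query_aln, sbjct_aln, exon_start, exon_end):
    """Return subject residues aligned to query positions exon_start..exon_end.

    query_start is the 1-based query coordinate of the first aligned column;
    query_aln and sbjct_aln are the aligned rows and may contain '-' gaps.
    """
    kept = []
    pos = query_start - 1  # query position of the last residue consumed
    for q, s in zip(query_aln, sbjct_aln):
        if q == "-":
            # Insertion in the subject: keep it only when strictly inside the exon.
            inside = exon_start <= pos < exon_end
        else:
            pos += 1
            inside = exon_start <= pos <= exon_end
        if inside and s != "-":
            kept.append(s)  # gaps in the subject are dropped
    return "".join(kept)
```

Applied to the first opossum block (query start 1, exon range 1-57) this drops the two trailing residues LV and leaves the exon ending in GWSR; applied to the second block (query start 55, exon range 58-112) it drops the leading FCR and the trailing V. Both results match the exported record in Stage 4.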
Stage 5: Fill in the respective feature stacks with marsupial exons (only first is shown):
>RHO1_homSap GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSR
>RHO1_bosTau GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSR
>RHO1_monDom GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIIGVAFTWVMALACAFPPLIGWSR
>RHO1_ornAna GEIALWSLVVLAIERYIVVCKPMSNFRFGENHAIMGVAFTWIMALACALPPLVGWSR
>RHO1_galGal GEIALWSLVVLAVERYVVVCKPMSNFRFGENHAIMGVAFSWIMAMACAAPPLFGWSR
>RHO1_anoCar GEMGLWSLVVLAVERYVVICKPMSNFRFGETHALIGVSCTWIMALACAGPPLLGWSR
>RHO1_xenTro GEMALWSLVVLAIERYVVVCKPMANFRFGENHAIMGVVFTWIMALSCAAPPLFGWSR
>RHO1_neoFor GIIALWCLVVLAIERYIVVCKPISNFRFGENHAIMGVVFTWIMALACAGPPLFGWSR
>RHO1_latCha GQVALWALVVLAIERYVVVCKPMSNFRFGENHAIMGVIFTWIMALSCAVPPLFGWSR
>RHO1_takRub GEIALWSLVVLAVERYIVVCKPMTNFRFGEKHAIAGLVFTWIMALTCATPPLLGWSR
>RHO1_leuEri GEVGLWCLVVLAIERYMVVCKPMANFRFGSQHAIIGVVFTWIMALSCAGPPLVGWSR
>RHO1_calMil GEIGLWSLVVLAIERYVVVCKPMSNFRFGTNHAIMGVAFTWVMALACAVPPLMGWSR
>RHO1_petMar DEMSLWSLVVLAIERYIVICKPMGNFRFGSTHAYMGVAFTWFMALSCAAPPLVGWSR
An alternative approach -- simply collecting the start and stop numbering in the Sbjct (match) line -- might work better. Another flaw in blast is that no flanking material is provided in the event of the first residue or two in an exon not matching. In the above example, blast actually dropped a perfect match to the initial hexapeptide AAAQQQ even though the simple sequence filter was off.
Thus the processing algorithm could gather these spanning numbers. Entrez retrieval allows limitation to these in conjunction with the accession numbers. That would allow for orderly recovery of adequately padded exonic regions, which could then be concatenated for translation in a consistent frame with validation of intron position and phase. Alignment could then be performed in an altered version of blast more sympathetic to the goals here. This would better address the special problem of split codon reconstruction in the 12 and 21 overhang situations (where the completed codon does not appear in extended translation of either exon).
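Working from the Sbjct coordinates might look like the sketch below. The helper names and the choice of pad are illustrative assumptions; the frame rule applies to plus-strand tblastn matches and is consistent with the three blocks in the example (18648 gives +3, 20546 gives +2, 21139 gives +1).

```python
def padded_span(sbjct_start, sbjct_end, pad, contig_length):
    """Widen a subject span by `pad` bases on each side, clamped to the contig."""
    low, high = sorted((sbjct_start, sbjct_end))
    return max(1, low - pad), min(contig_length, high + pad)


def plus_strand_frame(sbjct_start):
    """Reading frame (1, 2 or 3) implied by a plus-strand subject start coordinate."""
    return (sbjct_start - 1) % 3 + 1


if __name__ == "__main__":
    # Subject spans copied from the opossum example; contig length from its header.
    for start, end in [(18648, 18824), (20546, 20722), (21139, 21360)]:
        span = padded_span(start, end, pad=30, contig_length=94985)
        print(span, "frame +%d" % plus_strand_frame(start))
```

Padding supplies the flanking material that blast omits, so the missing first residue or two of an exon and the neighbouring splice sites can be recovered from the retrieved nucleotide sequence.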
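The split codon problem can be illustrated with a small sketch. It assumes each exon is supplied as nucleotides that include its overhang bases, together with the number of overhang bases on its acceptor and donor sides (the same numbers used to mark up the query above), and it uses the standard genetic code. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons only; unknown codons become X."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def join_exons(exons):
    """Concatenate (acceptor_overhang, dna, donor_overhang) exons and translate.

    Adjacent exons must complete each other's split codon, so the donor
    overhang of one plus the acceptor overhang of the next must be 0 or 3.
    """
    for (_, _, donor), (acceptor, _, _) in zip(exons, exons[1:]):
        if (donor + acceptor) % 3 != 0:
            raise ValueError("intron phase mismatch between adjacent exons")
    cds = "".join(dna for _, dna, _ in exons)
    # Skip any partial codon at the start of the first exon to land in frame.
    return translate(cds[exons[0][0]:])
```

With exons (0, "ATGGC", 2) and (1, "TTGGTAA", 0), translating each exon separately gives M and LV, whereas join_exons gives MAW*: the alanine encoded by the split codon GC|T only appears once the exons are joined.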
Species commonly available
To populate exons from a given gene, place the list below in one column of a spreadsheet (after search-replace has put in the name of the gene being specifically investigated in place of 'gene'). The next 3 columns contain the splice acceptor phase, the exon sequences, and the donor phase, respectively. After all the data is collected, it is converted to standard exonic fasta format by replacing tabs with returns. The numbers to the left allow the spreadsheet to be sorted either alphabetically or phylogenetically.
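The spreadsheet step is easy to script once the sheet is exported as tab-separated text. The sketch below assumes one row per species with four tab-separated fields (header, acceptor phase, exon sequence, donor phase) and headers carrying the numeric prefix shown in the list; the function names are illustrative.

```python
import re


def phylo_key(row):
    """Sort key from the numeric prefix, e.g. '>10.gene_homSap ...' gives 10."""
    match = re.match(r">(\d+)\.", row)
    return int(match.group(1)) if match else float("inf")


def name_key(row):
    """Sort key that ignores the numeric prefix, giving alphabetical order."""
    return re.sub(r"^>\d+\.", "", row).lower()


def rows_to_exonic_fasta(rows, key=phylo_key):
    """Sort the rows, then replace tabs with returns as described above."""
    return "\n".join(row.replace("\t", "\n") for row in sorted(rows, key=key)) + "\n"
```

Calling rows_to_exonic_fasta(rows) keeps phylogenetic order, while rows_to_exonic_fasta(rows, key=name_key) gives the alphabetical alternative.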
>10.gene_homSap Homo sapiens (human)
>11.gene_panTro Pan troglodytes (chimp)
>12.gene_gorGor Gorilla gorilla (gorilla)
>13.gene_ponPyg Pongo pygmaeus (orang_sumatran)
>14.gene_nomLeu Nomascus leucogenys (gibbon)
>15.gene_macMul Macaca mulatta (rhesus)
>16.gene_papAnu Papio anubis (baboon)
>17.gene_papHam Papio hamadryas (baboon)
>18.gene_calJac Callithrix jacchus (marmoset)
>19.gene_tarSyr Tarsius syrichta (tarsier)
>20.gene_otoGar Otolemur garnettii (bushbaby)
>21.gene_micMur Microcebus murinus (mouse_lemur)
>22.gene_cynVol Cynocephalus volans (flying_lemur)
>23.gene_tupBel Tupaia belangeri (tree_shrew)
>24.gene_musMus Mus musculus (mouse)
>25.gene_ratNor Rattus norvegicus (rat)
>26.gene_cavPor Cavia porcellus (guinea_pig)
>27.gene_speTri Spermophilus tridecemlineatus (squirrel)
>28.gene_dipOrd Dipodomys ordii (kangaroo_rat)
>29.gene_oryCun Oryctolagus cuniculus (rabbit)
>30.gene_ochPri Ochotona princeps (pika)
>31.gene_canFam Canis familiaris (dog)
>32.gene_felCat Felis catus (cat)
>33.gene_bosTau Bos taurus (cow)
>34.gene_oviAri Ovis aries (sheep)
>35.gene_susScr Sus scrofa (pig)
>36.gene_equCab Equus caballus (horse)
>37.gene_myoLuc Myotis lucifugus (microbat)
>38.gene_pteVam Pteropus vampyrus (macrobat)
>39.gene_turTru Tursiops truncatus (dolphin)
>40.gene_susScr Sus scrofa (pig)
>41.gene_eriEur Erinaceus europaeus (hedgehog)
>42.gene_sorAra Sorex araneus (shrew)
>43.gene_borAnc Boreoeuthere ancestralis (ancestral)
>44.gene_dasNov Dasypus novemcinctus (armadillo)
>45.gene_choHof Choloepus hoffmanni (sloth)
>46.gene_loxAfr Loxodonta africana (elephant)
>47.gene_proCap Procavia capensis (hyrax)
>48.gene_echTel Echinops telfairi (tenrec)
>49.gene_monDom Monodelphis domestica (opossum)
>50.gene_macEug Macropus eugenii (wallaby)
>51.gene_triVul Trichosurus vulpecula (possum)
>52.gene_ornAna Ornithorhynchus anatinus (platypus)
>53.gene_tacAcu Tachyglossus aculeatus (echidna)
>54.gene_galGal Gallus gallus (chicken)
>55.gene_taeGut Taeniopygia guttata (finch)
>56.gene_anoCar Anolis carolinensis (lizard)
>57.gene_xenTro Xenopus tropicalis (frog)
>58.gene_xenLae Xenopus laevis (frog)
>59.gene_danRer Danio rerio (zebrafish)
>60.gene_tetNig Tetraodon nigroviridis (pufferfish)
>61.gene_takRub Takifugu rubripes (fugu)
>62.gene_gasAcu Gasterosteus aculeatus (stickleback)
>63.gene_oryLat Oryzias latipes (medaka)
>64.gene_ictPun Ictalurus punctatus (fish)
>65.gene_oncMyk Oncorhynchus mykiss (trout)
>66.gene_calMil Callorhinchus milii (elephantfish)
>67.gene_squAca Squalus acanthias (spiny dogfish)
>68.gene_petMar Petromyzon marinus (lamprey)
Obtaining Sequences from 454 Transcript Runs
Sanger genomic sequencing basically shut down in April 2008. The major centers sequencing vertebrates have shifted over to new technologies that are faster and cheaper, predominantly 454. These reads are deposited in a user-unfriendly form at the NCBI Short Read Archive (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi).
For the average user, there are 3 issues that impede effective access to all this great new data:
- file size is enormous, in part because bulky read quality data is mixed in with fasta.
- to extract the fasta sequence, proprietary software called sffinfo must be obtained (or a script written).