Utilities for comparative genomics: Difference between revisions
Tomemerald (talk | contribs) |
Tomemerald (talk | contribs) mNo edit summary |
||
(2 intermediate revisions by the same user not shown) | |||
Line 263: | Line 263: | ||
>33.62.gene_tetNig Tetraodon nigroviridis (pufferfish) | >33.62.gene_tetNig Tetraodon nigroviridis (pufferfish) | ||
>34.63.gene_takRub Takifugu rubripes (fugu) | >34.63.gene_takRub Takifugu rubripes (fugu) | ||
>35.64.gene_gasAcu Gasterosteus | >35.64.gene_gasAcu Gasterosteus aculeatus (stickleback) | ||
>36.65.gene_oryLap Oryzias latipes (medaka) | >36.65.gene_oryLap Oryzias latipes (medaka) | ||
>99.66.gene_ictPun Ictalurus punctatus (fish) | >99.66.gene_ictPun Ictalurus punctatus (fish) | ||
Line 298: | Line 298: | ||
>33.62.gene_tetNig Tetraodon nigroviridis (pufferfish) | >33.62.gene_tetNig Tetraodon nigroviridis (pufferfish) | ||
>34.63.gene_takRub Takifugu rubripes (fugu) | >34.63.gene_takRub Takifugu rubripes (fugu) | ||
>35.64.gene_gasAcu Gasterosteus | >35.64.gene_gasAcu Gasterosteus aculeatus (stickleback) | ||
>36.65.gene_oryLap Oryzias latipes (medaka) | >36.65.gene_oryLap Oryzias latipes (medaka) | ||
>99.12.gene_gorGor Gorilla gorilla (gorilla) | >99.12.gene_gorGor Gorilla gorilla (gorilla) | ||
Line 356: | Line 356: | ||
>99.68.gene_funHet Fundulus heteroclitis (flounder) | >99.68.gene_funHet Fundulus heteroclitis (flounder) | ||
>30.55.gene_galGal Gallus gallus (chicken) | >30.55.gene_galGal Gallus gallus (chicken) | ||
>35.64.gene_gasAcu Gasterosteus | >35.64.gene_gasAcu Gasterosteus aculeatus (stickleback) | ||
>99.12.gene_gorGor Gorilla gorilla (gorilla) | >99.12.gene_gorGor Gorilla gorilla (gorilla) | ||
>10.10.gene_homSap Homo sapiens (human) | >10.10.gene_homSap Homo sapiens (human) | ||
Line 400: | Line 400: | ||
>99.59.gene_xenTro Xenopus laevis (frog) | >99.59.gene_xenTro Xenopus laevis (frog) | ||
>31.58.gene_xenTro Xenopus tropicalis (frog) | >31.58.gene_xenTro Xenopus tropicalis (frog) | ||
</pre> | |||
=== Species sorting === | |||
It is convenient to sort species in various ways. The tabbed table below allows sorting in either alphabetic or phyogenetic order for either the UCSC 44-way or an expanded set of species with available transcripts, traces, or newly determined genomes. This allows rapid population of exon stacks from the 44-way, yet replacement of its legacy headers such as hg181 and mm41 by standard genSpp format, convenient addition of additional data to fill gaps, and restoration of phylogenetic order which is lost in output of some alignment tools (such as [http://bioinfo.genotoul.fr/multalin/multalin.html Multalin]). | |||
<pre> | |||
46 10 52 > 10 gene anoCar Anolis carolinensis (lizard) anoCar | |||
29 11 30 > 11 gene bosTau Bos taurus (cow) bosTau | |||
15 12 15 > 12 gene calJac Callithrix jacchus (marmoset) calJac | |||
60 54 59 > 13 gene calMil Callorhinchus milii (elephantfish) | |||
32 13 33 > 14 gene canFam Canis familiaris (dog) canFam | |||
23 14 23 > 15 gene cavPor Cavia porcellus (guinea_pig) cavPor | |||
41 15 42 > 16 gene choHof Choloepus hoffmanni (sloth) choHof | |||
52 16 58 > 17 gene danRer Danio rerio (zebrafish) danRer | |||
40 17 41 > 18 gene dasNov Dasypus novemcinctus (armadillo) dasNov | |||
22 18 22 > 19 gene dipOrd Dipodomys ordii (kangaroo_rat) dipOrd | |||
39 19 40 > 20 gene echTel Echinops telfairi (tenrec) echTel | |||
30 20 31 > 21 gene equCab Equus caballus (horse) equCab | |||
35 21 36 > 22 gene eriEur Erinaceus europaeus (hedgehog) eriEur | |||
31 22 32 > 23 gene felCat Felis catus (cat) felCat | |||
44 23 50 > 24 gene galGal Gallus gallus (chicken) galGal | |||
50 24 56 > 25 gene gasAcu Gasterosteus aculeatus (stickleback) gasAcu | |||
12 25 12 > 26 gene gorGor Gorilla gorilla (gorilla) gorGor | |||
10 26 10 > 27 gene homSap Homo sapiens (human) hg181 | |||
37 27 38 > 28 gene loxAfr Loxodonta africana (elephant) loxAfr | |||
55 55 44 > 29 gene macEug Macropus eugenii (wallaby) | |||
14 28 14 > 30 gene macMul Macaca mulatta (rhesus) rheMac | |||
17 29 17 > 31 gene micMur Microcebus murinus (mouse_lemur) micMur | |||
42 30 43 > 32 gene monDom Monodelphis domestica (opossum) monDom | |||
20 31 20 > 33 gene musMus Mus musculus (mouse) mm91 | |||
33 32 34 > 34 gene myoLuc Myotis lucifugus (microbat) myoLuc | |||
26 33 26 > 35 gene ochPri Ochotona princeps (pika) ochPri | |||
43 34 48 > 36 gene ornAna Ornithorhynchus anatinus (platypus) ornAna | |||
25 35 25 > 37 gene oryCun Oryctolagus cuniculus (rabbit) oryCun | |||
51 36 57 > 38 gene oryLap Oryzias latipes (medaka) oryLat | |||
18 37 18 > 39 gene otoGar Otolemur garnettii (bushbaby) otoGar | |||
11 38 11 > 40 gene panTro Pan troglodytes (chimp) panTro | |||
53 39 60 > 41 gene petMar Petromyzon marinus (lamprey) petMar | |||
13 40 13 > 42 gene ponPyg Pongo pygmaeus (orang) ponAbe | |||
38 41 39 > 43 gene proCap Procavia capensis (hyrax) proCap | |||
34 42 35 > 44 gene pteVam Pteropus vampyrus (macrobat) pteVam | |||
21 43 21 > 45 gene ratNor Rattus norvegicus (rat) rn41 | |||
56 56 45 > 46 gene sarHar Sarcophilus harrisii (tasmanian_devil) | |||
36 44 37 > 47 gene sorAra Sorex araneus (shrew) sorAra | |||
24 45 24 > 48 gene speTri Spermophilus tridecemlineatus (squirrel) speTri | |||
54 57 28 > 49 gene susScr Sus scrofa (pig) | |||
59 58 49 > 50 gene tacAcu Tachyglossus aculeatus (echidna) | |||
45 46 51 > 51 gene taeGut Taeniopygia guttata (finch) taeGut | |||
49 47 55 > 52 gene takRub Takifugu rubripes (fugu) fr21 | |||
16 48 16 > 53 gene tarSyr Tarsius syrichta (tarsier) tarSyr | |||
48 49 54 > 54 gene tetNig Tetraodon nigroviridis (pufferfish) tetNig | |||
58 59 47 > 55 gene thyCyn Thylacinus cynocephalus (tasmanian_tiger) | |||
57 60 46 > 56 gene triVul Trichosurus vulpecula (bushytail_possum) | |||
19 50 19 > 57 gene tupBel Tupaia belangeri (tree_shrew) tupBel | |||
28 51 29 > 58 gene turTru Tursiops truncatus (dolphin) turTru | |||
27 52 27 > 59 gene vicPac Vicugna pacos (lama) vicPac | |||
47 53 53 > 60 gene xenTro Xenopus tropicalis (frog) xenTro | |||
44 44 51 f 51 gene fasta genus species common ucsc | |||
phy alp phy alp | |||
</pre> | </pre> | ||
Latest revision as of 16:28, 6 October 2010
State of the art: Comparative Genomics
Populating a feature stack
Populating a feature stack (eg of a coding exon) is the central task in comparative genomics. Feature stacks collect all information available from all available species on a suitable topic. Ideally the topic, for example evolutionary history of a coding exon, has width in base pairs a fraction of characteristic width of available experimental data. That is, a coding gene will have single exons wholly contained within individual trace reads and multiple exons within gapless GenBank wgs contigs.
On the other hand, a segmental duplication of 10 genes, say the one on chr20q/chr8, is very unsuited to a feature stack today because only a couple mammals have an adequate assembly over the intrinsic span of the feature.
Thus it is timely to construct feature stacks for coding exons and the like but premature to consider the comparative genomics of longer features. Consequently all research today that can fully exploit the power of comparative genomics is restricted to computable feature stacks. In other words, if someone can precompute all possible contemporary feature stacks, in effect they have written all possible research papers. This makes them the masters of the comparative genomics universe.
Brian Raney recently has computed this for a genome-based multiz alignment of 28 species, the UCSC 28way conservation track for both nucleotides and proteins.
How many feature stacks are there, how difficult is to compute, store, and query them, and what associated precomputed products go with them? Suppose there are 190,000 coding exon and 3 non-coding features per 20,000 gene for 240,000 features of width 500 bp and needed depth of 50 species. That would fit handily within an excel spreadsheet. So the number, storage, and query are non-issues.
Note thought that the 28way alignment is far from perfect with respect to mis-populating exons with pseudogenes and misalignments and infill completeness, not to mention that 50 vertebrate genomes are actually available in some form. Thus considerable manual curation and infill from the trace archives (long and short) and GenBank database divisions such as nr, est_others, and wgs. In May 2008, typically 43-44 orthologs for a given exon can be located
Here's an example of a feature stack and the implicit paper that practically writes itself:
Comparative Genomics of DRY motifs in exon 3 of RGR Opsins: 1 RWPYGSDGCQAHGFQGFVTALASICSSAAIAWGRYHHYCT 1 human 1 RWPYGSDGCQAHGFQGFVTALASICSSAAIAWGRYHHYCT 1 macaque 1 RWPYGSGGCQAHGFQGFTTALASICGSAAIAWGRYHHYCT 1 lemur 1 RWPHGSEGCQVHGFQGFATALASICGSAAVAWGRYHHYCT 1 mouse 1 RWPYGSDGCQAHGFQGFATALASICGSAAIAWGRYHHYCT 1 rabbit 1 RWPYGSDGCQAHGFQGFVTALASICSSAAIAWGRYHHYCT 1 horse 1 RWPYGPDGCQAHGFQGFATALASICSSAALAWGRYHHYCT 1 dog 1 RWPYGSGGCQAHGFQGFAAALASICGSAAVAWGRYHHYCT 1 bat 1 RWPYGSDGCQAHGFQGFVTALASICSCAAIAWERYHHYCT 1 elephant 1 HWPYGSGGCQAHGFQGFTVALASICSCAAIAWERYHHYCT 1 tenrec 1 RWPHGSDSCQAHSFQGFATALASISSSAAIAWERYRHHCT 1 sloth 1 RWPYGSGGCQAHGFQGFVTALASISSSAAIAWERCHRHCI 1 armadillo 1 HWPYGAEGCRLHGFQGFATALASISLSAAIGWDRYLRHCS 1 platypus 1 YWPYGSDGCQIHGFHGFLTALTSISSAAAVAWDRHHQYCT 1 lizard 1 YWPYGSEGCQIHGFQGFLTALASISSSAAVAWDRYHHYCT 1 chicken 1 YWPYGSEGCQIHGFQGFVAALSSIGSCAAIAWDRYHQYCT 1 frog 1 YWPYGSDGCQTHGFQGFVTALASIHFIAAIAWDRYHQYCT 1 stickleback 1 YWPYGSDGCQTHGFQGFVTALASIHFVAAIAWDRYHQYCT 1 fugu 1 YWPYGSEGCQTHGFHGFLTALASIHFIAAIAWDRYHQYCT 1 medaka 1 YWPYGSEGCQTHGFHGFLMALASINACAAIAWDRYHQNCS 1 elephantshark 1 EWPFGSIGCQLDAFIGMAPTFISIAGAALVAKDKYYRICK 1 tunicate
Now half the data needed for a full feature stack is not available at any genome browser but rather at GenBank etc. Whether that data is retrieved by automated pipeline or by one-off queries, there's a universal roadblock in that blast output is terribly formatted for comparative genomics purposes:
- the blast algorithm is unaware of splice donors and acceptors and ignores user parsing
- blocks are provided in score order instead of natural query exon order
- blocks have faux match extensions at the ends (exon dribble) complicating clean ortholog extraction
- blocks can contain multiple exons when introns are short leading to nonsense in tblastn
- line width of 60 is way too short, causing string searches to fail because of carriage returns
- gaps necessitate dashes causing string searches to fail
- the match genus and species are inconveniently provided
Fixing Blast output in order to populate a comparative genomics feature stack: No utility web tool exists anywhere on the internet that can repair blast output. Consequently one would make a great addition. The needed algorithm would largely carry over to any pipeline program to populate feature stacks genomewide. The example below shows what needs to happen, as limited to one species. In this case, a single tblastn query to GenBank wgs adds 19 mammals to the feature stack.
Here is the original query, 3 exons of bovine rhodpsin marked up for reading frame >RHO1_bosTau Bos taurus (cow) 2 GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSR 2 1 YIPEGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKE 0 0 AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAVYNPVIYIMMNKQ 0
Here is the raw output match to marsupial >gb|AAFR03021222.1| Monodelphis domestica cont3.021221, whole genome shotgun sequence Length=94985 Score = 149 bits (377), Expect = 9e-34, Method: Compositional matrix adjust. Identities = 71/74 (95%), Positives = 74/74 (100%), Gaps = 0/74 (0%) Frame = +1 Query 119 AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTS 178 AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGS+FGPIFMTIPAFFAK+S Sbjct 21139 AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSNFGPIFMTIPAFFAKSS 21318 Query 179 AVYNPVIYIMMNKQ 192 +VYNPVIYIMMNKQ Sbjct 21319 SVYNPVIYIMMNKQ 21360 Score = 117 bits (294), Expect = 5e-24, Method: Compositional matrix adjust. Identities = 54/59 (91%), Positives = 57/59 (96%), Gaps = 0/59 (0%) Frame = +3 Query 1 GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYI 59 GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAI+GVAFTWVMALACA PPL+GWSR + Sbjct 18648 GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIIGVAFTWVMALACAFPPLIGWSRLV 18824 Score = 107 bits (268), Expect = 4e-21, Method: Compositional matrix adjust. Identities = 51/59 (86%), Positives = 52/59 (88%), Gaps = 0/59 (0%) Frame = +2 Query 55 WSRYIPEGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEA 113 + RYIPEGMQCSCGIDYYT E NNESFVIYMFVVHF IPLIVIFFCYGQLVFTVKE Sbjct 20546 FCRYIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPLIVIFFCYGQLVFTVKEV 20722
Stage 1: order the output blocks according to input exon order and fix line width: Query 1 GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYI 59 GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAI+GVAFTWVMALACA PPL+GWSR + Sbjct 18648 GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIIGVAFTWVMALACAFPPLIGWSRLV 18824 Query 55 WSRYIPEGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEA 113 + RYIPEGMQCSCGIDYYT E NNESFVIYMFVVHF IPLIVIFFCYGQLVFTVKE Sbjct 20546 FCRYIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPLIVIFFCYGQLVFTVKEV 20722 Query 119 AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAVYNPVIYIMMNKQ 192 AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGS+FGPIFMTIPAFFAK+S+VYNPVIYIMMNKQ Sbjct 21139 AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSNFGPIFMTIPAFFAKSSSVYNPVIYIMMNKQ 21360
Stage 2: Find the correct input exons Query 1 GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYI 59 GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAI+GVAFTWVMALACA PPL+GWSR + Sbjct 18648 GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIIGVAFTWVMALACAFPPLIGWSRLV 18824 Query 55 WSRYIPEGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEA 113 + RYIPEGMQCSCGIDYYT E NNESFVIYMFVVHF IPLIVIFFCYGQLVFTVKE Sbjct 20546 FCRYIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPLIVIFFCYGQLVFTVKEV 20722 Query 119 AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAVYNPVIYIMMNKQ 192 AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGS+FGPIFMTIPAFFAK+S+VYNPVIYIMMNKQ Sbjct 21139 AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSNFGPIFMTIPAFFAKSSSVYNPVIYIMMNKQ 21360
Stage 3: Transfer the match to marsupial Query 1 GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYI 59 GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAI+GVAFTWVMALACA PPL+GWSR + Sbjct 18648 GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIIGVAFTWVMALACAFPPLIGWSRLV 18824 Query 55 WSRYIPEGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEA 113 + RYIPEGMQCSCGIDYYT E NNESFVIYMFVVHF IPLIVIFFCYGQLVFTVKE Sbjct 20546 FCRYIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPLIVIFFCYGQLVFTVKEV 20722 Query 119 AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAVYNPVIYIMMNKQ 192 AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGS+FGPIFMTIPAFFAK+S+VYNPVIYIMMNKQ Sbjct 21139 AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSNFGPIFMTIPAFFAKSSSVYNPVIYIMMNKQ 21360
Stage 4: Export the desired output with acquired header >RHO1_monDom Monodelphis domesticus (opossum) 2 GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIIGVAFTWVMALACAFPPLIGWSR 2 1 YIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPLIVIFFCYGQLVFTVKE 0 0 AAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSNFGPIFMTIPAFFAKSSSVYNPVIYIMMNKQ 0
Stage 5: Fill in the respective feature stacks with marsupial exons (only first is shown): >RHO1_homSap GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSR >RHO1_bosTau GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSR >RHO1_monDom GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIIGVAFTWVMALACAFPPLIGWSR >RHO1_ornAna GEIALWSLVVLAIERYIVVCKPMSNFRFGENHAIMGVAFTWIMALACALPPLVGWSR >RHO1_galGal GEIALWSLVVLAVERYVVVCKPMSNFRFGENHAIMGVAFSWIMAMACAAPPLFGWSR >RHO1_anoCar GEMGLWSLVVLAVERYVVICKPMSNFRFGETHALIGVSCTWIMALACAGPPLLGWSR >RHO1_xenTro GEMALWSLVVLAIERYVVVCKPMANFRFGENHAIMGVVFTWIMALSCAAPPLFGWSR >RHO1_neoFor GIIALWCLVVLAIERYIVVCKPISNFRFGENHAIMGVVFTWIMALACAGPPLFGWSR >RHO1_latCha GQVALWALVVLAIERYVVVCKPMSNFRFGENHAIMGVIFTWIMALSCAVPPLFGWSR >RHO1_takRub GEIALWSLVVLAVERYIVVCKPMTNFRFGEKHAIAGLVFTWIMALTCATPPLLGWSR >RHO1_leuEri GEVGLWCLVVLAIERYMVVCKPMANFRFGSQHAIIGVVFTWIMALSCAGPPLVGWSR >RHO1_calMil GEIGLWSLVVLAIERYVVVCKPMSNFRFGTNHAIMGVAFTWVMALACAVPPLMGWSR >RHO1_petMar DEMSLWSLVVLAIERYIVICKPMGNFRFGSTHAYMGVAFTWFMALSCAAPPLVGWSR
An alternative approach -- simply collecting the start and stop numbering in the Sbjct (match) line -- might work better. Another flaw in blast is that no flanking material is provided in the event of the first residue or two in an exon not matching. In the above example, blast actually dropped a perfect matches to the initial hexapeptide AAAQQQ even though the simple sequence filter was off.
Thus the processing algorithm could gather these spanning numbers. Entrez retrieval allows limitation to these in conjunction with the accesssion numbers. That would allow for orderly recovery of adequately padded exonic regions which could then be concatenated for translation in a consistent frame with validation of intron position and phase. Alignment could then be performed in an altered version of blast more sympathetic to the goals here. This would better address the special problem of split codon reconstruction in the 12 and 21 overhang situations (where the completed codon does not appear in extended translation of either exon.
Species commonly available
To conveniently collect exons from a given gene, place the list below in one column of a spreadsheet (after search-replace has put in the name of the gene being specifically investigated). The next 3 columns can contain the splice acceptor phase, the exon sequences themselves, and the donor phase. After all the data is collected, it is converted to standard exonic fasta format by replacing tabs with returns.
UCSC now offers a 44-species alignment in this order (at 'Protein FASTA' sequence output page), illustrated here with a specific gene and common fasta header:
1 >RPE65_homSap Homo sapiens (human) 2 >RPE65_panTro Pan troglodytes (chimp) 3 >RPE65_gorGor Gorilla gorilla (gorilla) 4 >RPE65_ponPyg Pongo pygmaeus (orang_sumatran) 5 >RPE65_macMul Macaca mulatta (rhesus) 6 >RPE65_calJac Callithrix jacchus (marmoset) 7 >RPE65_tarSyr Tarsius syrichta (tarsier) 8 >RPE65_micMur Microcebus murinus (mouse_lemur) 9 >RPE65_otoGar Otolemur garnettii (bushbaby) 10 >RPE65_tupBel Tupaia belangeri (tree_shrew) 11 >RPE65_musMus Mus musculus (mouse) 12 >RPE65_ratNor Rattus norvegicus (rat) 13 >RPE65_dipOrd Dipodomys ordii (kangaroo_rat) 14 >RPE65_cavPor Cavia porcellus (guinea_pig) 15 >RPE65_speTri Spermophilus tridecemlineatus (squirrel) 16 >RPE65_oryCun Oryctolagus cuniculus (rabbit) 17 >RPE65_ochPri Ochotona princeps (pika) 18 >RPE65_vicPac Vicugna pacos (lama) 19 >RPE65_turTru Tursiops truncatus (dolphin) 20 >RPE65_bosTau Bos taurus (cow) 21 >RPE65_equCab Equus caballus (horse) 22 >RPE65_felCat Felis catus (cat) 23 >RPE65_canFam Canis familiaris (dog) 24 >RPE65_myoLuc Myotis lucifugus (microbat) 25 >RPE65_pteVam Pteropus vampyrus (macrobat) 26 >RPE65_eriEur Erinaceus europaeus (hedgehog) 27 >RPE65_sorAra Sorex araneus (shrew) 28 >RPE65_loxAfr Loxodonta africana (elephant) 29 >RPE65_proCap Procavia capensis (hyrax) 30 >RPE65_echTel Echinops telfairi (tenrec) 31 >RPE65_dasNov Dasypus novemcinctus (armadillo) 32 >RPE65_choHof Choloepus hoffmanni (sloth) 33 >RPE65_monDom Monodelphis domestica (opossum) 34 >RPE65_ornAna Ornithorhynchus anatinus (platypus) 35 >RPE65_galGal Gallus gallus (chicken) 36 >RPE65_taeGut Taeniopygia guttata (finch) 37 >RPE65_anoCar Anolis carolinensis (lizard) 38 >RPE65_xenTro Xenopus tropicalis (frog) 39 >RPE65_tetNig Tetraodon nigroviridis (pufferfish) 40 >RPE65_takRub Takifugu rubripes (fugu) 41 >RPE65_gasAcu Gasterosteus aculeatus (stickleback) 42 >RPE65_oryLap Oryzias latipes (medaka) 43 >RPE65_danRer Danio rerio (zebrafish) 44 >RPE65_petMar Petromyzon marinus (lamprey)
The numbers to the left allow the spreadsheet to be sorted by output order in the UCSC 28way alignment, phylogenetic quasi-ordering with respect to the vertebrate tree (human arbitrarily first), or alphabetically by genus-species. Those three orders are provided as pre-sorts below.
>10.10.gene_homSap Homo sapiens (human) >11.11.gene_panTro Pan troglodytes (chimp) >99.12.gene_gorGor Gorilla gorilla (gorilla) >99.13.gene_ponPyg Pongo pygmaeus (orang_sumatran) >99.14.gene_nomLeu Nomascus leucogenys (gibbon) >12.15.gene_macMul Macaca mulatta (rhesus) >99.16.gene_papAnu Papio anubis (baboon) >99.17.gene_papHam Papio hamadryas (baboon) >99.18.gene_calJac Callithrix jacchus (marmoset) >99.19.gene_tarSyr Tarsius syrichta (tarsier) >13.20.gene_otoGar Otolemur garnettii (bushbaby) >99.21.gene_micMur Microcebus murinus (mouse_lemur) >99.22.gene_cynVol Cynocephalus volans (flying_lemur) >14.23.gene_tupBel Tupaia belangeri (tree_shrew) >15.24.gene_musMus Mus musculus (mouse) >16.25.gene_ratNor Rattus norvegicus (rat) >17.26.gene_cavPor Cavia porcellus (guinea_pig) >99.27.gene_speTri Spermophilus tridecemlineatus (squirrel) >99.28.gene_dipOrd Dipodomys ordii (kangaroo_rat) >18.29.gene_oryCun Oryctolagus cuniculus (rabbit) >99.30.gene_ochPri Ochotona princeps (pika) >21.31.gene_canFam Canis familiaris (dog) >22.32.gene_felCat Felis catus (cat) >24.33.gene_bosTau Bos taurus (cow) >99.34.gene_oviAri Ovis aries (sheep) >99.35.gene_susScr Sus scrofa (pig) >23.36.gene_equCab Equus caballus (horse) >99.37.gene_myoLuc Myotis lucifugus (microbat) >99.38.gene_pteVam Pteropus vampyrus (macrobat) >99.39.gene_turTru Tursiops truncatus (dolphin) >99.40.gene_susScr Sus scrofa (pig) >99.41.gene_vicVic Vicugna vicugna (vicugna) >19.42.gene_eriEur Erinaceus europaeus (hedgehog) >20.43.gene_sorAra Sorex araneus (shrew) >99.44.gene_borAnc Boreoeuthere ancestralis (ancestral) >25.45.gene_dasNov Dasypus novemcinctus (armadillo) >99.46.gene_choHof Choloepus hoffmanni (sloth) >26.47.gene_loxAfr Loxodonta africana (elephant) >99.48.gene_proCap Procavia capensis (hyrax) >99.49.gene_echTel Echinops telfairi (tenrec) >27.50.gene_monDom Monodelphis domestica (opossum) >99.51.gene_macEug Macropus eugenii (wallaby) >99.52.gene_triVul Trichosurus vulpecula (possum) >28.53.gene_ornAna Ornithorhynchus anatinus (platypus) >99.54.gene_tacAcu Tachyglossus aculeatus (echidna) >30.55.gene_galGal Gallus gallus (chicken) >99.56.gene_taeGut Taeniopygia guttata (finch) >29.57.gene_anoCar Anolis carolinensis (lizard) >31.58.gene_xenTro Xenopus tropicalis (frog) >99.59.gene_xenTro Xenopus laevis (frog) >99.60.gene_neoFor Neoceratodus forsteri (lungfish) >32.61.gene_danRer Danio rerio (zebrafish) >33.62.gene_tetNig Tetraodon nigroviridis (pufferfish) >34.63.gene_takRub Takifugu rubripes (fugu) >35.64.gene_gasAcu Gasterosteus aculeatus (stickleback) >36.65.gene_oryLap Oryzias latipes (medaka) >99.66.gene_ictPun Ictalurus punctatus (fish) >99.67.gene_oncMyk Oncorhynchus mykiss (trout) >99.68.gene_funHet Fundulus heteroclitis (flounder) >99.69.gene_calMil Callorhinchus milii (elephantfish) >99.70.gene_squAca Squalus acanthias (spiny dogfish) >99.71.gene_petMar Petromyzon marinus (lamprey) >99.72.gene_braFlo Branchiostoma floridae (amphioxus) >10.10.gene_homSap Homo sapiens (human) >11.11.gene_panTro Pan troglodytes (chimp) >12.15.gene_macMul Macaca mulatta (rhesus) >13.20.gene_otoGar Otolemur garnettii (bushbaby) >14.23.gene_tupBel Tupaia belangeri (tree_shrew) >15.24.gene_musMus Mus musculus (mouse) >16.25.gene_ratNor Rattus norvegicus (rat) >17.26.gene_cavPor Cavia porcellus (guinea_pig) >18.29.gene_oryCun Oryctolagus cuniculus (rabbit) >19.42.gene_eriEur Erinaceus europaeus (hedgehog) >20.43.gene_sorAra Sorex araneus (shrew) >21.31.gene_canFam Canis familiaris (dog) >22.32.gene_felCat Felis catus (cat) >23.36.gene_equCab Equus caballus (horse) >24.33.gene_bosTau Bos taurus (cow) >25.45.gene_dasNov Dasypus novemcinctus (armadillo) >26.47.gene_loxAfr Loxodonta africana (elephant) >27.50.gene_monDom Monodelphis domestica (opossum) >28.53.gene_ornAna Ornithorhynchus anatinus (platypus) >29.57.gene_anoCar Anolis carolinensis (lizard) >30.55.gene_galGal Gallus gallus (chicken) >31.58.gene_xenTro Xenopus tropicalis (frog) >32.61.gene_danRer Danio rerio (zebrafish) >33.62.gene_tetNig Tetraodon nigroviridis (pufferfish) >34.63.gene_takRub Takifugu rubripes (fugu) >35.64.gene_gasAcu Gasterosteus aculeatus (stickleback) >36.65.gene_oryLap Oryzias latipes (medaka) >99.12.gene_gorGor Gorilla gorilla (gorilla) >99.13.gene_ponPyg Pongo pygmaeus (orang_sumatran) >99.14.gene_nomLeu Nomascus leucogenys (gibbon) >99.16.gene_papAnu Papio anubis (baboon) >99.17.gene_papHam Papio hamadryas (baboon) >99.18.gene_calJac Callithrix jacchus (marmoset) >99.19.gene_tarSyr Tarsius syrichta (tarsier) >99.21.gene_micMur Microcebus murinus (mouse_lemur) >99.22.gene_cynVol Cynocephalus volans (flying_lemur) >99.27.gene_speTri Spermophilus tridecemlineatus (squirrel) >99.28.gene_dipOrd Dipodomys ordii (kangaroo_rat) >99.30.gene_ochPri Ochotona princeps (pika) >99.34.gene_oviAri Ovis aries (sheep) >99.35.gene_susScr Sus scrofa (pig) >99.37.gene_myoLuc Myotis lucifugus (microbat) >99.38.gene_pteVam Pteropus vampyrus (macrobat) >99.39.gene_turTru Tursiops truncatus (dolphin) >99.40.gene_susScr Sus scrofa (pig) >99.41.gene_vicVic Vicugna vicugna (vicugna) >99.44.gene_borAnc Boreoeuthere ancestralis (ancestral) >99.46.gene_choHof Choloepus hoffmanni (sloth) >99.48.gene_proCap Procavia capensis (hyrax) >99.49.gene_echTel Echinops telfairi (tenrec) >99.51.gene_macEug Macropus eugenii (wallaby) >99.52.gene_triVul Trichosurus vulpecula (possum) >99.54.gene_tacAcu Tachyglossus aculeatus (echidna) >99.56.gene_taeGut Taeniopygia guttata (finch) >99.59.gene_xenTro Xenopus laevis (frog) >99.60.gene_neoFor Neoceratodus forsteri (lungfish) >99.66.gene_ictPun Ictalurus punctatus (fish) >99.67.gene_oncMyk Oncorhynchus mykiss (trout) >99.68.gene_funHet Fundulus heteroclitis (flounder) >99.69.gene_calMil Callorhinchus milii (elephantfish) >99.70.gene_squAca Squalus acanthias (spiny dogfish) >99.71.gene_petMar Petromyzon marinus (lamprey) >99.72.gene_braFlo Branchiostoma floridae (amphioxus) >29.57.gene_anoCar Anolis carolinensis (lizard) >99.44.gene_borAnc Boreoeuthere ancestralis (ancestral) >24.33.gene_bosTau Bos taurus (cow) >99.72.gene_braFlo Branchiostoma floridae (amphioxus) >99.18.gene_calJac Callithrix jacchus (marmoset) >99.69.gene_calMil Callorhinchus milii (elephantfish) >21.31.gene_canFam Canis familiaris (dog) >17.26.gene_cavPor Cavia porcellus (guinea_pig) >99.46.gene_choHof Choloepus hoffmanni (sloth) >99.22.gene_cynVol Cynocephalus volans (flying_lemur) >32.61.gene_danRer Danio rerio (zebrafish) >25.45.gene_dasNov Dasypus novemcinctus (armadillo) >99.28.gene_dipOrd Dipodomys ordii (kangaroo_rat) >99.49.gene_echTel Echinops telfairi (tenrec) >23.36.gene_equCab Equus caballus (horse) >19.42.gene_eriEur Erinaceus europaeus (hedgehog) >22.32.gene_felCat Felis catus (cat) >99.68.gene_funHet Fundulus heteroclitis (flounder) >30.55.gene_galGal Gallus gallus (chicken) >35.64.gene_gasAcu Gasterosteus aculeatus (stickleback) >99.12.gene_gorGor Gorilla gorilla (gorilla) >10.10.gene_homSap Homo sapiens (human) >99.66.gene_ictPun Ictalurus punctatus (fish) >26.47.gene_loxAfr Loxodonta africana (elephant) >99.51.gene_macEug Macropus eugenii (wallaby) >12.15.gene_macMul Macaca mulatta (rhesus) >99.21.gene_micMur Microcebus murinus (mouse_lemur) >27.50.gene_monDom Monodelphis domestica (opossum) >15.24.gene_musMus Mus musculus (mouse) >99.37.gene_myoLuc Myotis lucifugus (microbat) >99.60.gene_neoFor Neoceratodus forsteri (lungfish) >99.14.gene_nomLeu Nomascus leucogenys (gibbon) >99.30.gene_ochPri Ochotona princeps (pika) >99.67.gene_oncMyk Oncorhynchus mykiss (trout) >28.53.gene_ornAna Ornithorhynchus anatinus (platypus) >18.29.gene_oryCun Oryctolagus cuniculus (rabbit) >36.65.gene_oryLap Oryzias latipes (medaka) >13.20.gene_otoGar Otolemur garnettii (bushbaby) >99.34.gene_oviAri Ovis aries (sheep) >11.11.gene_panTro Pan troglodytes (chimp) >99.16.gene_papAnu Papio anubis (baboon) >99.17.gene_papHam Papio hamadryas (baboon) >99.71.gene_petMar Petromyzon marinus (lamprey) >99.13.gene_ponPyg Pongo pygmaeus (orang_sumatran) >99.48.gene_proCap Procavia capensis (hyrax) >99.38.gene_pteVam Pteropus vampyrus (macrobat) >16.25.gene_ratNor Rattus norvegicus (rat) >20.43.gene_sorAra Sorex araneus (shrew) >99.27.gene_speTri Spermophilus tridecemlineatus (squirrel) >99.70.gene_squAca Squalus acanthias (spiny dogfish) >99.35.gene_susScr Sus scrofa (pig) >99.40.gene_susScr Sus scrofa (pig) >99.54.gene_tacAcu Tachyglossus aculeatus (echidna) >99.56.gene_taeGut Taeniopygia guttata (finch) >34.63.gene_takRub Takifugu rubripes (fugu) >99.19.gene_tarSyr Tarsius syrichta (tarsier) >33.62.gene_tetNig Tetraodon nigroviridis (pufferfish) >99.52.gene_triVul Trichosurus vulpecula (possum) >14.23.gene_tupBel Tupaia belangeri (tree_shrew) >99.39.gene_turTru Tursiops truncatus (dolphin) >99.41.gene_vicVic Vicugna vicugna (vicugna) >99.59.gene_xenTro Xenopus laevis (frog) >31.58.gene_xenTro Xenopus tropicalis (frog)
Species sorting
It is convenient to sort species in various ways. The tabbed table below allows sorting in either alphabetic or phyogenetic order for either the UCSC 44-way or an expanded set of species with available transcripts, traces, or newly determined genomes. This allows rapid population of exon stacks from the 44-way, yet replacement of its legacy headers such as hg181 and mm41 by standard genSpp format, convenient addition of additional data to fill gaps, and restoration of phylogenetic order which is lost in output of some alignment tools (such as Multalin).
46 10 52 > 10 gene anoCar Anolis carolinensis (lizard) anoCar 29 11 30 > 11 gene bosTau Bos taurus (cow) bosTau 15 12 15 > 12 gene calJac Callithrix jacchus (marmoset) calJac 60 54 59 > 13 gene calMil Callorhinchus milii (elephantfish) 32 13 33 > 14 gene canFam Canis familiaris (dog) canFam 23 14 23 > 15 gene cavPor Cavia porcellus (guinea_pig) cavPor 41 15 42 > 16 gene choHof Choloepus hoffmanni (sloth) choHof 52 16 58 > 17 gene danRer Danio rerio (zebrafish) danRer 40 17 41 > 18 gene dasNov Dasypus novemcinctus (armadillo) dasNov 22 18 22 > 19 gene dipOrd Dipodomys ordii (kangaroo_rat) dipOrd 39 19 40 > 20 gene echTel Echinops telfairi (tenrec) echTel 30 20 31 > 21 gene equCab Equus caballus (horse) equCab 35 21 36 > 22 gene eriEur Erinaceus europaeus (hedgehog) eriEur 31 22 32 > 23 gene felCat Felis catus (cat) felCat 44 23 50 > 24 gene galGal Gallus gallus (chicken) galGal 50 24 56 > 25 gene gasAcu Gasterosteus aculeatus (stickleback) gasAcu 12 25 12 > 26 gene gorGor Gorilla gorilla (gorilla) gorGor 10 26 10 > 27 gene homSap Homo sapiens (human) hg181 37 27 38 > 28 gene loxAfr Loxodonta africana (elephant) loxAfr 55 55 44 > 29 gene macEug Macropus eugenii (wallaby) 14 28 14 > 30 gene macMul Macaca mulatta (rhesus) rheMac 17 29 17 > 31 gene micMur Microcebus murinus (mouse_lemur) micMur 42 30 43 > 32 gene monDom Monodelphis domestica (opossum) monDom 20 31 20 > 33 gene musMus Mus musculus (mouse) mm91 33 32 34 > 34 gene myoLuc Myotis lucifugus (microbat) myoLuc 26 33 26 > 35 gene ochPri Ochotona princeps (pika) ochPri 43 34 48 > 36 gene ornAna Ornithorhynchus anatinus (platypus) ornAna 25 35 25 > 37 gene oryCun Oryctolagus cuniculus (rabbit) oryCun 51 36 57 > 38 gene oryLap Oryzias latipes (medaka) oryLat 18 37 18 > 39 gene otoGar Otolemur garnettii (bushbaby) otoGar 11 38 11 > 40 gene panTro Pan troglodytes (chimp) panTro 53 39 60 > 41 gene petMar Petromyzon marinus (lamprey) petMar 13 40 13 > 42 gene ponPyg Pongo pygmaeus (orang) ponAbe 38 41 39 > 43 gene proCap Procavia capensis (hyrax) proCap 34 42 35 > 44 gene pteVam Pteropus vampyrus (macrobat) pteVam 21 43 21 > 45 gene ratNor Rattus norvegicus (rat) rn41 56 56 45 > 46 gene sarHar Sarcophilus harrisii (tasmanian_devil) 36 44 37 > 47 gene sorAra Sorex araneus (shrew) sorAra 24 45 24 > 48 gene speTri Spermophilus tridecemlineatus (squirrel) speTri 54 57 28 > 49 gene susScr Sus scrofa (pig) 59 58 49 > 50 gene tacAcu Tachyglossus aculeatus (echidna) 45 46 51 > 51 gene taeGut Taeniopygia guttata (finch) taeGut 49 47 55 > 52 gene takRub Takifugu rubripes (fugu) fr21 16 48 16 > 53 gene tarSyr Tarsius syrichta (tarsier) tarSyr 48 49 54 > 54 gene tetNig Tetraodon nigroviridis (pufferfish) tetNig 58 59 47 > 55 gene thyCyn Thylacinus cynocephalus (tasmanian_tiger) 57 60 46 > 56 gene triVul Trichosurus vulpecula (bushytail_possum) 19 50 19 > 57 gene tupBel Tupaia belangeri (tree_shrew) tupBel 28 51 29 > 58 gene turTru Tursiops truncatus (dolphin) turTru 27 52 27 > 59 gene vicPac Vicugna pacos (lama) vicPac 47 53 53 > 60 gene xenTro Xenopus tropicalis (frog) xenTro 44 44 51 f 51 gene fasta genus species common ucsc phy alp phy alp
Obtaining and Analyzing Sequences from 454 Transcript Runs
Sanger genomic sequencing basically shut down at the major sequencing centers in early April 2008. Sequencing of vertebrates has shifted over to faster and cheaper new technologies, predominantly 454. These reads are deposited in a user-impractical form at the NCBI [tp://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi Short Read Archives.]
Notice that gene models are highly mediocre outside species such as human and mouse, yet many mammalian species have essentially no available transcripts. This very much detracts from the values of individual species browsers. Other species such as Tachyglossus have no genome project underway but several hundred thousand transcript reads are available. Thus it would be very advantageous to bring the voluminous 454 data into play.
On 22 April 08, the following large scale 454 transcript and genome projects were available:
homSap SRA000297 Homo sapiens Transcriptome Study 454 GS FLX WUGSC calJac SRA000144 Callithrix jacchus Transcriptome Study WUGSC tupBel SRA000295 Tupaia belangeri Transcriptome Study 454 GS FLX WUGSC susScr SRA000267 Sus scrofa Transcriptome study 454 GS 20 AU-DJ ornAna SRA000294 Ornithorhynchus anatinus Transcriptome Study 454 GS FLX WUGSC galGal SRA000238 Gallus gallus Transcriptome Study 454 GS FLX WUGSC tacAcu SRA000241 Tachyglossus aculeatus Transcriptome Study 454 GS FLX WUGSC turTru SRA000063 Tursiops truncatus Whole Genome Sequencing Project BCM mamPri ......... Mammuthus primigenius Whole Genome Sequencing Project PSU monDom SRA000162 Monodelphis domestica Whole Genome Sequencing WUGSC macEug SRA000041 Macropus eugenii Whole Genome Sequencing Project BCM xenTro SRA000296 Xenopus tropicalis Genomic Sequencing 454 GS FLX WUGSC
For the average user, there are 4 issues that impede effective access to all this great new data:
- file sizes are enormous, in part because bulky read quality data is mixed in with fasta.
- to extract the fasta sequence, proprietary software called sffinfo must be obtained (or a script written).
- the reads are fairly short (235 bp or 1-2 exons) and overlapping reads are not assembled.
- no blast server is available to target the reads.
The process works like this (for a biomedical researcher not familiar with linux on a Macintosh):
- download the software sffinfo, place in a folder called SFF
- open a linux command line interface by opening the Terminal program in the Finder's Utilities Folder
- direct the command line to the folder path using: cd /Users/joeblow/Desktop/SFF
- make ssfinfo recognizable as an executable program: chmod +x sffinfo
- download some data files of interest from NCBI's SRA archive into the SFF foldere
- apply sffinfo to a data file name: ./sffinfo -s EUEMSW405.sff > anyOutputFileName
- split the output file into smaller pieces if needed: split -l 30000 anyOutputFileName
These steps discard the read quality data while retaining the fasta sequences. However that usually results in too large a file for practical purposes such as visualizing entries in ordinary desktop software or submitting to a web tool such as uBlast. Consequently the big fasta file needs to be split up into smaller files, here 30,000 lines each.
On a typical 454 project, it takes an hour to download one transcript run on a Qwest DSL line. It comes compressed (.gz) but this is automatically decompressed by Stuffit. The resultant .sff file was 282 megabytes. After applying ssfinfo, the file size was reduced to 57 megabytes. After splitting, 35 files result, each 1.6 mbytes and each containing 5800 individual fasta formatted reads. These appear in the SFF folder named as xaa, xab, ... These smaller files can be viewed in any text editor software. However most web tools have a 'browse' option that avoids copy/paste in the Clipboard, simply loading the file invisibly into the web tool for submission.
The SRA is now providing a download button on their run browser. Here the trick is to know the run number.
In the case ofTachyglossus (echidna transcripts), the accession number SRA000241 is a bundle of 12 runs. These must be downloaded individually from the run browser. After unpacking with ssfinfo, they can be concatenated into a single large file since no purpose is served keeping runs separate.
- SRR000647 181241 reads
- SRR000648 378 reads
- SRR000649 204062 reads
- SRR000650 1263 reads
- SRR000651 4498 reads
- SRR000652 1309 reads
- SRR000653 3170 reads
- SRR000654 1663 reads
- SRR000655 2414 reads
- SRR000656 242673 reads
- SRR000657 598 reads
- SRR000658 242164 reads
- 885433 total reads
Next, it is currently necessary to install a personal Blast server on your computer so that the database of 454 fasta sequences can be searched with favorite queries without using command lines. That's explained on the NCBI site. Basically the fasta set has to be reconfigured to a human-unreadable format that the Blast tool reads (formatdb) while still allowing full length sequences involved in Blast matches to be retrieved (fastacmd).
As a practical matter, the vast majority of the 454 reads (98%) do not cover any coding exon. The file can be partitioned into those that do and those that don't. At the level of an individual gene family, procede by concatenating protein sequences for representative members of the family (eg selenoproteins, or opsins). Use that as tBlastn query on the home brew Blast server. Extract all the match lines -- these begin with the word Sbjct in blast output lines. Clean out spaces, gap dashes, and numbers using Jim Kent's Protein Duster obtaining a concatenate of all translated matches to the gene family.
Next, to assemble reads for a given protein, use the concatenate as Blastp query against an individual exon of an individual gene of the given gene family at uBlast. This allows ready extraction of exons (though Blast output is by score and so out of sequence order). In comparative genomics, a known already-intronated gene is generally available as paste template. This has the practical effect of allowing visual tiling of reads in the situation where some of the reads have frameshifting indels and no one read covers the exon satisfactorily.