Genome completion status

From genomewiki
Live blog of an actual exon recovery session:
I updated the available chordate genomes today and their practical access. Pig contigs were released a couple days ago ... they are stored at the HTG division of GenBank and various other inconvenient places. The existing 28-way can be boosted to a 32-way using the net tracks for lamprey, lancelet, marmoset, and orangutan. That can be boosted to a 37-way bringing in Microcebus, Spermophilus, Ochotona, Myotis, and Callorhinchus using a single tBlastn of the wgs division of GenBank using the new capability of boolean species restriction. The 51-way then requires individual queries at the trace archives.
Mark D suggested an interesting trick (which I've amended): set the browser to its 5000 pixel maximum as needed, open in pdf view, scrape off the amino acid lines as text, remove spaces and equal signs, watch out for nucleotides (also in caps) and extraneous symbols, then paste into a new column of the spreadsheet below in default sort order. If an exon is fairly well-behaved in the comparative genomics sense, that provides it for the species in the 28-way that have it.
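The cleanup part of that scrape can be sketched in Python. This is only a rough illustration, not the procedure actually used in the session: the function name `clean_scraped_line`, the rule for spotting nucleotide lines (a line made up solely of A, C, G, T or N) and the set of allowed amino acid letters are all assumptions.

```python
import re

# Assumption: a scraped line made only of these letters is a nucleotide line.
NUCLEOTIDE_ONLY = re.compile(r"^[ACGTN]+$")
# The 20 standard amino acid one-letter codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_scraped_line(line):
    """Return the amino acid string from one scraped line, or None to skip it."""
    # Remove spaces (any whitespace) and equal signs, as described above.
    residues = re.sub(r"[\s=]", "", line)
    if not residues:
        return None
    # Skip lines that look like nucleotide sequence printed in caps.
    if NUCLEOTIDE_ONLY.match(residues):
        return None
    # Drop extraneous symbols, keeping residues and alignment gaps.
    cleaned = "".join(ch for ch in residues
                      if ch.upper() in AMINO_ACIDS or ch == "-")
    return cleaned or None


print(clean_scraped_line("A I E K Y C R V == N G G"))
print(clean_scraped_line("ACGT TTGA"))
```

A very short peptide that happens to use only A, C, G, T and N would be discarded by this rule, so the output is still worth checking by eye.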
Net tracks have to be opened individually and translated (or uBlastx'ed against fiducial). As always, the trace archives have to be consulted at the end to fill in any residual gaps because millions of singletons get left out of assemblies.
To illustrate what happens in real life (ie the difficulty of fully automating the process proteomewide), I looked at an exon of MAN1A2. The 28-way gave 25 sequences, with a proximal frameshift in eriEur and nothing for tupBel and dasNov. eriEur had two covering traces and the browser contig had taken the wrong one; tupBel had single-trace coverage with 6 frameshifts, which was unusable. Armadillo has gone to 6x lately and trace coverage was excellent.
The net tracks yielded compelling ponPyg and petMar translations but nothing for braFlo, and an implausible calJac translation with two indels. However calJac at the trace archives had indel-free coverage with the expected conservation; the browser again was hosting a contig assembly error or bizarre polymorphism.
Wgs searching restricted to 'Microcebus OR Spermophilus OR Ochotona OR Myotis OR Callorhinchus'  avoids a results page swamped out with sequences already on hand. It still takes forever because so many genomes have been placed in this db. The search here provides a good outcome for all but speTri where only a paralogous exon match occurs. speTri had the needed coverage in the trace archives however. Sometimes the paralog story and other anomalies only emerge from an alignment of all the collected sequences at the end.
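The restriction string quoted above is easy to build from a list of genus names, which helps when the set of wgs-only species changes. A minimal sketch; the helper name `species_restriction` is hypothetical, and the plain `OR` syntax is taken from the text rather than from any documented query grammar:

```python
def species_restriction(genera):
    """Join genus names into an OR-separated restriction string."""
    names = [g.strip() for g in genera if g.strip()]
    return " OR ".join(names)


wgs_only = ["Microcebus", "Spermophilus", "Ochotona", "Myotis", "Callorhinchus"]
print(species_restriction(wgs_only))
```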
It's also worth checking est_others for ad hoc species. Here it's best to look at the taxonomy sort-page. I picked up susScr and papHam, stubbing in Papio anubis for the latter. Another 6 species were available should it later prove mission-critical to flesh out certain regions of the tree:
<pre>
Echis ocellatus ............. 77  1 hit  [snakes]
Bothrops jararaca ........... 70  1 hit  [snakes]
Gekko japonicus ............. 70  1 hit  [lizards]
Ambystoma mexicanum ......... 77  1 hit  [salamanders]
Squalus acanthias ........... 74  1 hit  [sharks and rays]
Leucoraja erinacea .......... 70  3 hits [sharks and rays] 
</pre>
By using this method of descending quality and convenience, only 12 individual searches at the trace archives are needed. This protein is evolving slowly enough that a single fixed nucleotide blastn query will suffice, either human, boreoeuthere, or reconstructed ancestor. Because the pulldown menu contains hundreds of species, the ones sought should be queried in alphabetic order to simplify sequential setting of the target species. That process took 14 minutes here to get 9 additional species, ending up with a 46-way on this exon or a 52-way using the 6 transcript species above.
In summary, enough data is out there to double up on what the browser 28-way offers. The downside is that even after optimizing the process it takes an hour or more per exon. In this project, I'm looking at non-synonymous changes in 100 exons in the first fully sequenced extinct species. There will have to be some prioritization early on, but this can be based on analysis of the initial pdf screenscrape.
<pre>
col1  species order if scraping protein off the 'Conservation'  28-way, by net availability, otherwise by declining trace reads
col2  species phylogenetic order, sister leafs subordered by assembly quality
col3  assembly, wgs contig, htg contigs, trace archives (last two digits show millions of reads)
col4  6-letter genSpp code
col5  genus
col6  species
col7  common and subspp
1,1,>,Mar06,homSap,Homo,sapiens,(human)
2,2,>,Mar06,panTro,Pan,troglodytes,(chimp)
3,6,>,Jan06,macMul,Macaca,mulatta,(rhesus)
4,10,>,Dec06,otoGar,Otolemur,garnettii,(bushbaby)
5,13,>,Dec06,tupBel,Tupaia,belangeri,(treeshrew)
6,14,>,Jul07,musMus,Mus,musculus,(mouse)
7,15,>,Nov04,ratNor,Rattus,norvegicus,(rat)
8,18,>,Wgs08,cavPor,Cavia,porcellus,(guinea_pig)
9,19,>,May05,oryCun,Oryctolagus,cuniculus,(rabbit)
10,30,>,Wgs08,sorAra,Sorex,araneus,(shrew)
11,31,>,Wgs08,eriEur,Erinaceus,europaeus,(hedgehog)
12,21,>,May05,canFam,Canis,familiaris,(dog)
13,22,>,Mar06,felCat,Felis,catus,(cat)
14,27,>,Jan07,equCab,Equus,caballus,(horse)
15,23,>,Aug06,bosTau,Bos,taurus,(cow)
16,35,>,May05,dasNov,Dasypus,novemcinctus,(armadillo)
17,32,>,May05,loxAfr,Loxodonta,africana,(elephant)
18,34,>,Jul05,echTel,Echinops,telfairi,(tenrec)
19,38,>,Jan06,monDom,Monodelphis,domestica,(opossum)
20,39,>,Mar07,ornAna,Ornithorhynchus,anatinus,(platypus)
21,42,>,Feb07,anoCar,Anolis,carolinensis,(lizard)
22,40,>,May06,galGal,Gallus,gallus,(chicken)
23,43,>,Aug05,xenTro,Xenopus,tropicalis,(frog)
24,44,>,Jul07,danRer,Danio,rerio,(zebrafish)
25,46,>,Feb04,tetNig,Tetraodon,nigroviridis,(pufferfish)
26,45,>,Oct04,takRub,Takifugu,rubripes,(fugu)
27,47,>,Feb06,gasAcu,Gasterosteus,aculeatus,(stickleback)
28,48,>,Apr06,oryLat,Oryzias,latipes,(medaka)
29,50,>,Mar07,petMar,Petromyzon,marinus,(lamprey)
30,51,>,Mar06,braFlo,Branchiostoma,floridae,(lancelet)
31,9,>,Jun07,calJac,Callithrix,jacchus,(marmoset)
32,4,>,Jul07,ponPyg,Pongo,pygmaeus,(orang_abelii)
33,11,>,Wgs08,micMur,Microcebus,murinus,(mouse_lemur)
34,16,>,Wgs08,speTri,Spermophilus,tridecemlineatus,(ground_squirrel)
35,20,>,Wgs08,ochPri,Ochotona,princeps,(pika)
36,28,>,Wgs08,myoLuc,Myotis,lucifugus,(microbat)
37,49,>,Wgs08,calMil,Callorhinchus,milii,(elephantfish)
38,25,>,Htg08,susScr,Sus,scrofa,(pig)
39,5,>,Trc19,nomLeu,Nomascus,leucogenys,(gibbon)
40,8,>,Trc17,tarSyr,Tarsius,syrichta,(tarsier)
41,41,>,Trc15,taeGut,Taeniopygia,guttata,(finch)
42,7,>,Trc12,papHam,Papio,hamadryas,(baboon)
43,26,>,Trc11,vicVic,Vicugna,vicugna,(vicugna)
44,37,>,Trc10,macEug,Macropus,eugenii,(wallaby)
45,24,>,Trc10,turTru,Tursiops,truncatus,(dolphin)
46,36,>,Trc09,choHof,Choloepus,hoffmanni,(sloth)
47,33,>,Trc09,proCap,Procavia,capensis,(hyrax)
48,29,>,Trc08,pteVam,Pteropus,vampyrus,(macrobat)
49,17,>,Trc07,dipOrd,Dipodomys,ordii,(kangaroo_rat)
50,3,>,Trc04,gorGor,Gorilla,gorilla,(gorilla)
51,12,>,Trc00,cynVol,Cynocephalus,volans,(flying_lemur)
</pre>
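The comma-separated rows above can be re-sorted without a spreadsheet. The following is a minimal Python sketch, assuming each data row has exactly eight fields and that the third field (`>`) is a fasta header marker not listed in the column key; the function names and dictionary keys are hypothetical. It sorts by the phylogenetic order column and emits headers in the same style as the final fasta output.

```python
import csv
from io import StringIO

# A few rows copied from the table above, deliberately out of phylogenetic order.
SAMPLE = """\
1,1,>,Mar06,homSap,Homo,sapiens,(human)
32,4,>,Jul07,ponPyg,Pongo,pygmaeus,(orang_abelii)
50,3,>,Trc04,gorGor,Gorilla,gorilla,(gorilla)
2,2,>,Mar06,panTro,Pan,troglodytes,(chimp)
"""


def read_rows(text):
    """Parse data rows; lines without exactly eight fields (e.g. the column key) are skipped."""
    rows = []
    for fields in csv.reader(StringIO(text)):
        if len(fields) != 8 or not fields[0].strip().isdigit():
            continue
        scrape, phylo, _marker, status, code, genus, species, common = fields
        rows.append({
            "scrape_order": int(scrape),
            "phylo_order": int(phylo),
            "status": status,
            "code": code,
            "genus": genus,
            "species": species,
            "common": common,
        })
    return rows


def fasta_headers(rows):
    """Return fasta header lines sorted by phylogenetic order."""
    ordered = sorted(rows, key=lambda r: r["phylo_order"])
    return [f">{r['code']} {r['status']} {r['genus']} {r['species']} {r['common']}"
            for r in ordered]


for header in fasta_headers(read_rows(SAMPLE)):
    print(header)
```

Sorting on `scrape_order` instead gives the order in which sequences come off the 28-way scrape, so the same rows serve both steps.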
<pre>Final fasta output
>homSap Mar06 Homo sapiens (human) MAN1A2 exon
AIEKYCRVNGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>panTro Mar06 Pan troglodytes (chimp)
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>gorGor Trc04 Gorilla gorilla (gorilla)
AIEKYRRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>ponPyg Jul07 Pongo pygmaeus (orang_abelii)
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETL
>nomLeu Trc19 Nomascus leucogenys (gibbon)
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>macMul Jan06 Macaca mulatta (rhesus)
AIEKYCRVTGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>papHam Trc12 Papio hamadryas (baboon)
AIEKYCRVNGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>tarSyr Trc17 Tarsius syrichta (tarsier)
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>calJac Jun07 Callithrix jacchus (marmoset)
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>otoGar Dec06 Otolemur garnettii (bushbaby)
AIEKYCRVTGGFSGIKDVYSSVPTHDDVQQSFFLAETLK
>micMur Wgs08 Microcebus murinus (mouse_lemur)
AIEKYCRVSGGFSGIKDVYSSTPTHDDVQQSFFLAETLK
>cynVol Trc00 Cynocephalus volans (flying_lemur)
-
>tupBel Dec06 Tupaia belangeri (treeshrew)
--
>musMus Jul07 Mus musculus (mouse)
AIEKSCRVSGGFSGVKDVYAPTPVHDDVQQSFFLAETLK
>ratNor Nov04 Rattus norvegicus (rat)
AIEKSCRVSGGFSGVKDVYSPTPAHDDVQQSFFLAETLK
>speTri Wgs08 Spermophilus tridecemlineatus (ground_squirrel)
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>dipOrd Trc07 Dipodomys ordii (kangaroo_rat)
-
>cavPor Wgs08 Cavia porcellus (guinea_pig)
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>oryCun May05 Oryctolagus cuniculus (rabbit)
AIEKHCRVRGGFSGIKDVYSSTPTHDDVQQSFFLAETLK
>ochPri Wgs08 Ochotona princeps (pika)
AIEKHCRVRGGFSGIKDVYSSTPTHDDVQQSFFLAETLK
>canFam May05 Canis familiaris (dog)
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>felCat Mar06 Felis catus (cat)
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>bosTau Aug06 Bos taurus (cow)
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>turTru Trc10 Tursiops truncatus (dolphin)
AIEKYCRVTGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>susScr Htg08 Sus scrofa (pig)
ALEKHCRVNGGYSGLRDVYVSAQTYDDVQQSFFLAETLK
>vicVic Trc11 Vicugna vicugna (vicugna)
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>equCab Jan07 Equus caballus (horse)
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>myoLuc Wgs08 Myotis lucifugus (microbat)
AIEKYCRVSGGFSGVKDVYSSTPAHDDVQQSFFLAETLK
>pteVam Trc08 Pteropus vampyrus (macrobat)
AIEKYCRVSGGFSGVKDVYSSTPAHDDVQQSFFLAETLK
>sorAra Wgs08 Sorex araneus (shrew)
AIEKYCRVSSGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>eriEur Wgs08 Erinaceus europaeus (hedgehog)
AIEKYCRVSSGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>loxAfr May05 Loxodonta africana (elephant)
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>proCap Trc09 Procavia capensis (hyrax)
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>echTel Jul05 Echinops telfairi (tenrec)
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLk
>dasNov May05 Dasypus novemcinctus (armadillo)
AIEKNCRVSSGFSGVKDVYSANPTHDDVQQSFFLAETLK
>choHof Trc09 Choloepus hoffmanni (sloth)
AIEKYCRVSSGFSGVKDVYSSNPTHDDVQQSFFLAETLK
>macEug Trc10 Macropus eugenii (wallaby)
-
>monDom Jan06 Monodelphis domestica (opossum)
AIEKYCRVRGGFSGIKDVYSSAPAYDDVQQSFFLAETLK
>ornAna Mar07 Ornithorhynchus anatinus (platypus)
AIEKSCRVSGGFSGVKDVYSSAPAYDDVQQSFFLAETLK
>galGal May06 Gallus gallus (chicken)
AIDKYCRVSGGFSGVKDVYSSSPTYDDVQQSFFLAETLK
>taeGut Trc15 Taeniopygia guttata (finch)
AIDKYCRVSGGFSGVKDVYSSSPTYDDVQQSFFLAETLK
>anoCar Feb07 Anolis carolinensis (lizard)
AIDKYCRVSGGFSGVKDVYSSAPTFDDVQQSFFLAETLK
>xenTro Aug05 Xenopus tropicalis (frog)
AIDKYCRVSGGFSGIKDVYSSSPTYDDVQQSFFLAETLK
>danRer Jul07 Danio rerio (zebrafish)
ALEKHCRVEGGYSGVRDVYSNNPNHDDVQQSFYLAETLK
>takRub Oct04 Takifugu rubripes (fugu)
AIDKYCRVSGGFSGVKDVYSSNPTYDDVQQSFFLAETLK
>tetNig Feb04 Tetraodon nigroviridis (pufferfish)
AIDKYCRVSGGFSGVKDVYSSSPTYDDVQQSFFLAETLK
>gasAcu Feb06 Gasterosteus aculeatus (stickleback)
AIDKYCRVSGGFSGVKDVYSSNPTYDDVQQSFFLAETLK
>oryLat Apr06 Oryzias latipes (medaka)
AIDKYCRVSGGFSGVKDVYSSNPTYDDVQQSFFLAETLk
>calMil Wgs08 Callorhinchus milii (elephantfish)
AIDKYCRVIGGFSGVKDVYSSTPAYDDVQQSFFLAETLK
>petMar Mar07 Petromyzon marinus (lamprey)
ALEKYCRVEGGFSGIRDVYSSSPAHDDVQQSFFLAETLK
>braFlo Mar06 Branchiostoma floridae (lancelet)
-
</pre>
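Once the sequences are collected in this form, a quick pass can list which species are still missing and which alignment columns vary. This is a minimal sketch under stated assumptions: a record consisting only of dashes is treated as missing, only sequences of the maximum length are compared (no realignment is attempted), and the function names are hypothetical.

```python
# Three records and one placeholder copied from the fasta output above.
SAMPLE = """\
>homSap Mar06 Homo sapiens (human) MAN1A2 exon
AIEKYCRVNGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>panTro Mar06 Pan troglodytes (chimp)
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>gorGor Trc04 Gorilla gorilla (gorilla)
AIEKYRRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>cynVol Trc00 Cynocephalus volans (flying_lemur)
-
"""


def parse_fasta(text):
    """Map each 6-letter species code to its sequence string."""
    records = {}
    code = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            code = line[1:].split()[0]
            records[code] = ""
        elif code is not None:
            records[code] += line
    return records


def is_missing(seq):
    """A record holding only dashes (or nothing) counts as not yet recovered."""
    return seq.strip("-") == ""


def variable_columns(records):
    """Return 1-based positions that differ among the full-length sequences."""
    seqs = [s.upper() for s in records.values() if not is_missing(s)]
    if not seqs:
        return []
    length = max(len(s) for s in seqs)
    full = [s for s in seqs if len(s) == length]
    return [i + 1 for i in range(length) if len({s[i] for s in full}) > 1]


records = parse_fasta(SAMPLE)
print("missing:", [c for c, s in records.items() if is_missing(s)])
print("variable positions:", variable_columns(records))
```

Upper-casing before comparison means a lowercase residue is not counted as a difference.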
=== Tracking 454 reads ===
These six transcript projects are the pick of the 454 litter as of May 2008 in terms of vertebrate data. Because transcripts are mostly coding, they carry far more payload than random genomic snippets and are easier to align and work with. The [ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead file sizes] are very reasonable, considering how much data is in them. These are fasta-formatted sequence data only, compressed but not otherwise encoded, and in the public domain. Read quality is separately available but probably of less interest to most biomedical researchers.
homSap SRA000297 Homo sapiens Transcriptome Study 454 GS FLX WUGSC don't need
calJac SRA000144 Callithrix jacchus Transcriptome Study WUGSC
tupBel SRA000295 Tupaia belangeri Transcriptome Study 454 GS FLX WUGSC
ornAna SRA000294 Ornithorhynchus anatinus Transcriptome Study 454 GS FLX WUGSC
tacAcu SRA000241 Tachyglossus aculeatus Transcriptome Study 454 GS FLX WUGSC
galGal SRA000238 Gallus gallus Transcriptome Study 454 GS FLX WUGSC
[[Category:Comparative Genomics]]

Latest revision as of 03:30, 6 May 2008

Tracking genome projects is difficult

Which metazoan species currently have genomic data available? That's hard to say -- it's a difficult process to track. There are no announcements, maintained lists, or publications; sequencing centers rarely update their websites or indicate specific future plans. Consequently few researchers are adequately aware of which species have genomic data, and so they typically undersample when doing comparative genomics projects. Sampling species more densely often overturns working hypotheses of feature evolution.

Tracking Sanger reads

Sequencing centers post raw trace reads on a day-by-day basis at NCBI's trace archives. NCBI performs some quality control and adds them to the accruing database that is blastn accessible. Later the center may assemble them into contigs and post them to the "wgs" division of GenBank (more rarely at "gss" or "htgs"). Depending on the coverage and finishing effort, these contigs can be hosted as a genome by a browser center such as UCSC. It may take 2-3 years for data to complete its migration from trace sequencing to contigs to genome. More rarely, traces are withheld and a genome assembly appears abruptly, as with elephantfish.

Further complications include multiple subspecies for gorilla, orang and gibbon, personal human genomes, diploid genomes, areas of confused taxonomy (alpaca vs vicuna) and so on. NISC and the sequencing centers do not always work from the same individual animal or even the same subspecies, so each trace compilation has to be checked separately. Transcript programs can use yet other individual animals or subspecies.

To annotate at kbp scales (adequate for exons and small genes), one can reliably use the traces or contigs and not wait (years) for a genome browser to appear.

If the exon or feature is 1000 bp (or in such pieces), the trace archives work quite well, especially for establishing presence. Absence is not as informative because data might simply be missing due to low coverage. No vertebrate species is truly complete yet, including human. It requires a few million traces before a given incoming genome is worth checking for a given feature.
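Why absence is weak evidence at low coverage can be made concrete with the standard Poisson (Lander-Waterman) approximation, under which the expected fraction of bases with no read at coverage c is exp(-c). The sketch below is an illustration rather than anything computed for this page: the read count, read length and genome size are hypothetical round numbers, and the model assumes reads land uniformly at random.

```python
import math


def coverage_from_reads(n_reads, read_length_bp, genome_size_bp):
    """Average fold coverage implied by a number of reads."""
    return n_reads * read_length_bp / genome_size_bp


def fraction_uncovered(coverage):
    """Expected fraction of bases hit by no read (Poisson approximation)."""
    return math.exp(-coverage)


# Hypothetical project: 3 million traces of 800 bp on a 3 Gbp genome.
c = coverage_from_reads(3_000_000, 800, 3_000_000_000)
print(f"coverage {c:.1f}x, about {fraction_uncovered(c):.0%} of bases unsampled")
```

At a few tenths of fold coverage most of the genome is still unsampled, so a negative search result says little; by 6x the expected unsampled fraction under this model falls well below one percent.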

Not every trace makes it into contigs or an assembly -- singletons are often omitted, often millions of them. Sequencing often continues after a release, as is happening now with elephant and guinea pig. Consequently the trace archive at NCBI is always the resource of last resort, i.e. if a feature is missing from assembled traces, it is best to go back to the trace archive because all the original data is there. However, NCBI posting of traces to its blast database can lag trace inputs from the sequencing centers by a week or more, which can amount to 1.5 million traces in an active project.

There are also "cdna" species like the marsupial Trichosurus vulpecula, which have rather complete coverage of coding genes but no genome project underway. Such sequences can furnish critical close-in query material to improve the sensitivity of trace blast (which is not sensitive at any evolutionary distance if the feature is evolving rapidly).

In a concrete comparative genomics research project, it is important to document which species were considered. For that it is convenient to enter annotation data in a column next to its species, in a spreadsheet containing all species with genomic data (provided below and illustrated below that with a concrete coding indel example). This allows last-minute updating prior to paper submission.

Finally, PCR can be used on species currently lacking genomic or cdna projects when it is critical to augment sampling density. Flying lemur would be a good choice in primate-oriented projects because it appears to be the immediate outgroup (hence a great improvement over distant mouse).

Two recent papers illustrate these concepts and explain methods of contemporary comparative genomics in greater detail:

Janecka JE, Miller W, Pringle TH, Wiens F, Zitzmann A, Helgen KM, Springer MS, Murphy WJ.
Molecular and genomic data identify the closest living relative of primates.
Science. 2007 Nov 2;318(5851):792-4.
PMID: 17975064

Murphy WJ, Pringle TH, Crider TA, Springer MS, Miller W.
Using genomic data to unravel the root of the placental mammal phylogeny.
Genome Res. 2007 Apr;17(4):413-21. 
PMID: 17322288

Available genome assemblies as of May 2008

The table is correct as of 01 May 08. 
  Traces indicated in millions, eg Trc12 means 12 million traces but no wgs contigs or assembly available
  Wgs08 means wgs division of GenBank contains short assembled contigs searchable with tBlastn
  Mar06 etc means the March 2006 assembly is the most recent available at UCSC

Mar06  homSap  Homo  sapiens  (human)
Mar06  panTro  Pan  troglodytes  (chimp)
Trc04  gorGor  Gorilla  gorilla  (gorilla)
Jul07  ponPyg  Pongo  pygmaeus  (orang_abelii)
Trc19  nomLeu  Nomascus  leucogenys  (gibbon)
Jan06  macMul  Macaca  mulatta  (rhesus)
Trc12  papHam  Papio  hamadryas  (baboon)
Trc17  tarSyr  Tarsius  syrichta  (tarsier)
Jun07  calJac  Callithrix  jacchus  (marmoset)
Dec06  otoGar  Otolemur  garnettii  (bushbaby)
Wgs08  micMur  Microcebus  murinus  (mouse_lemur)
Trc00  cynVol  Cynocephalus  volans  (flying_lemur)
Dec06  tupBel  Tupaia  belangeri  (treeshrew)
Jul07  musMus  Mus  musculus  (mouse)
Nov04  ratNor  Rattus  norvegicus  (rat)
Wgs08  speTri  Spermophilus  tridecemlineatus  (ground_squirrel)
Trc07  dipOrd  Dipodomys  ordii  (kangaroo_rat)
Wgs08  cavPor  Cavia  porcellus  (guinea_pig)
May05  oryCun  Oryctolagus  cuniculus  (rabbit)
Wgs08  ochPri  Ochotona  princeps  (pika)
May05  canFam  Canis  familiaris  (dog)
Mar06  felCat  Felis  catus  (cat)
Aug06  bosTau  Bos  taurus  (cow)
Trc10  turTru  Tursiops  truncatus  (dolphin)
Trc06  susScr  Sus  scrofa  (pig)
Trc11  vicVic  Vicugna  vicugna  (vicugna)
Jan07  equCab  Equus  caballus  (horse)
Wgs08  myoLuc  Myotis  lucifugus  (microbat)
Trc08  pteVam  Pteropus  vampyrus  (macrobat)
Wgs08  sorAra  Sorex  araneus  (shrew)
Wgs08  eriEur  Erinaceus  europaeus  (hedgehog)
May05  loxAfr  Loxodonta  africana  (elephant)
Trc09  proCap  Procavia  capensis  (hyrax)
Jul05  echTel  Echinops  telfairi  (tenrec)
May05  dasNov  Dasypus  novemcinctus  (armadillo)
Trc09  choHof  Choloepus  hoffmanni  (sloth)
Trc10  macEug  Macropus  eugenii  (wallaby)
Jan06  monDom  Monodelphis  domestica  (opossum)
Mar07  ornAna  Ornithorhynchus  anatinus  (platypus)
May06  galGal  Gallus  gallus  (chicken)
Trc15  taeGut  Taeniopygia  guttata  (finch)
Feb07  anoCar  Anolis  carolinensis  (lizard)
Aug05  xenTro  Xenopus  tropicalis  (frog)
Jul07  danRer  Danio  rerio  (zebrafish)
Oct04  takRub  Takifugu  rubripes  (fugu)
Feb04  tetNig  Tetraodon  nigroviridis  (pufferfish)
Feb06  gasAcu  Gasterosteus  aculeatus  (stickleback)
Apr06  oryLat  Oryzias  latipes  (medaka)
Wgs08  calMil  Callorhinchus  milii  (elephantfish)
Mar07  petMar  Petromyzon  marinus  (lamprey)

A coding indel example (a coding exon from gene SPC25 on human chr2) illustrates the usefulness of multiple genomes in timing and understanding evolution of insertions and deletions.

homSap	MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
panTro	MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
ponPyg	MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
macMul	MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
calJac	MVEDELALFDKSLNEFWNKFKST--DTTFQMAGLRDTYKDSLKAFA
tarSyr	MVEDELTLFDKSINEFWNKFKST--DTANQMMGLRDTYKDSVKAFA
otoGar	MVEDQLALLDKNINEFWNKFKST--DTAGQMAGLRDTYKDSIKTFA
micMur	MVEDELVLFDKTVNEFWNKFKST--DTSCHMVGLRDTYKDSLKAFA
cynVol	.................NKFTST--DTSCQMMGLRGTNK.......
tupBel	MVEDELALFDKGINEFWNKFRSTVSDTSCQMVGLRDAYKDSIKAFA
musMus	MGEDELALLNQSINEFGDKFRNRLDDNHSQVLGLRDAFKDSMKAFS
ratNor	MGEDELAAFEKSINEFGDKFRYRLSDNRSQVLGLKDAFKDSIRALS
cavPor	MVEDELALFDKSINEFGNKFRNTLSDTPCQMLGLRDACKDSIKTLA
speTri	MMEDELARFDKSINEFGNKFRNTFSDTRCQMVGLRDVFKDSIEALA
dipOrd	MVEDELAHFDKSISEFGSKFRNTLSDTPSQTVGLRDAYKDSIKALS
oryCun	MVEDELALFDKSINEFGSKFRSTLSDAPCQMVGLRDAYKDSVKSLT
ochPri	MVEDELALFDKSINEFGSKFRSTLSDTPCQMVGLREACKDSVRLLT
canFam	MIDDELAQFDKSISEFWSKFKGTVSDTSSQMVGLRETYKDSIKACA
felCat	MIEDELALFDKSINEFWNKFKSTLSDTSCQMMGLRDTYKDSIKALT
equCab	MVEDELALFDKSINEFWNKFKNTVSDTSCQMVGLRDAYKDSIKAFA
myoLuc	MVEDELALLDKNINEFWNKFKSNVNDTSCQMVGLRDNYKDISKAFT
pteVam	MVEDELALLDKSINEFWNKFKSSVSDTSCQMMALRDSYKDINKAFT
bosTau	MVEDELALFDKSINEFWNKFKSTVSDTSCQMVGLRETYKDSIKAFA
turTru	MVEDELALFDKSINEFWNKFRSTVSDTSCQMVGLRDTYKDSIKAFA
susScr	MVEDELALFDKSINEFWNRFKSTVSDTSCQMVGLRENYKDSLKAFA
oviAri	MVEDELALFDKSLNEFWNKFKSTVNDTSCQMVGLREAYKDSIKAFA
eriEur	MVEDELALFDKSINEFWNKFKGTVSDTSFQMVGLRDTYKDSIKIFT
sorAra	MVEDELVLFEKSINEFVNEFESTASDTTCQVVGPRDADKDSIKALA
dasNov	MIEDELALFDKSINEFWNKFKGTVSDNSCQMVGLRDTYKDSIKAFA
choHof	MIEDELALFDKSINEFWNKFKSAVSDTSCQMVGLRDTYKDSIKAFA
loxAfr	MIEDELVQFDKSINEFWNKFINTASDTSCQMVGLRDAYKDSMKAFA
proCap	MIEDELRQFDKSINEFWNKFINTTSDTSCQMAGLRDAYKDSMKAFA
echTel	MIEDELLQFDKSMNEFRNKHFNTLNDTSGQMMGLRDTYRDSMKAFA
monDom	MSHIKTEEELDLFNKSINDFWNKFRNTTLNEHCSQMVGLRDTYKDSIEALT
macEug	MSHIKTEEELDIFEKSISDFWNRFRNTAFNEPYSQVVGVRDTYKYSIETLT
triVul	MSHIKTEEELDIFNKSINDFWNRFRNTTFNEHYSQVVGLRDTYKNSIEALT
ornAna	MSHIKTEEELALFDKSIDEFWTKFKNTWISEYSCQTVTLRDAHKEAIKALT
galGal	MSAVKTEDEITVVEREMKEFWTELKSVYGTEQINQTLALRDSCKESINVLS
taeGut	MGNAQAEDEVALFEKDMKEFWIQFKISYGTEQNNQTMKEFWIQFKISYGTE
anoCar	MAKAKEEDELTMLEKGIEELCTQIETTYCRQSLEKTSGPRNKCYKSGPRNK
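
The alignment above can be partitioned mechanically by whether a species carries the two-residue gap. The following is only a minimal sketch: it assumes rows are copied as a species code followed by whitespace and the sequence, that `--` marks the gap, and it uses a subset of the rows. The function name `split_by_gap` is made up for illustration.

```python
# Sketch: partition species by presence of the two-residue gap ("--")
# in the SPC25 exon alignment. Rows are a subset copied from the text above.
ROWS = """\
homSap  MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
macMul  MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
micMur  MVEDELVLFDKTVNEFWNKFKST--DTSCHMVGLRDTYKDSLKAFA
cynVol  .................NKFTST--DTSCQMMGLRGTNK.......
tupBel  MVEDELALFDKGINEFWNKFRSTVSDTSCQMVGLRDAYKDSIKAFA
musMus  MGEDELALLNQSINEFGDKFRNRLDDNHSQVLGLRDAFKDSMKAFS
canFam  MIDDELAQFDKSISEFWSKFKGTVSDTSSQMVGLRETYKDSIKACA
"""

def split_by_gap(rows, gap="--"):
    """Return (codes with the gap, codes without it), in input order."""
    with_gap, without_gap = [], []
    for line in rows.splitlines():
        if not line.strip():
            continue
        code, seq = line.split()
        (with_gap if gap in seq else without_gap).append(code)
    return with_gap, without_gap

deleted, intact = split_by_gap(ROWS)
print("gap present:", deleted)
print("gap absent: ", intact)
```

In these rows the gap shows up in the primates and cynVol but not in tupBel or the other species shown, which is the kind of pattern that lets an indel be placed on the tree.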


Live blog of an actual exon recovery session:

I updated the available chordate genomes today and their practical access. Pig contigs were released a couple days ago ... they are stored at the HTG division of GenBank and various other inconvenient places. The existing 28-way can be boosted to a 32-way using the net tracks for lamprey, lancelet, marmoset, and orangutan. That can be boosted to a 37-way bringing in Microcebus, Spermophilus, Ochotona, Myotis, and Callorhinchus using a single tBlastn of the wgs division of GenBank using the new capability of boolean species restriction. The 51-way then requires individual queries at the trace archives.

Mark D suggested an interesting trick (which I've amended): set the browser to its 5000 pixel maximum as needed, open in pdf view, scrape off the amino acid lines as text, remove spaces and equal signs, watch out for nucleotides that are also in caps and for extraneous symbols, then paste into a new column of the spreadsheet below in default sort order. If an exon is fairly well-behaved in the comparative genomics sense, that provides it for the species in the 28-way that have it.
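
The cleanup step of the screenscrape could be scripted roughly as follows. This is a sketch under assumptions, not the procedure actually used: the sample lines are invented stand-ins for pdf text, the function names are hypothetical, and the 0.9 threshold for calling a line "nucleotide" is an arbitrary choice.

```python
import re

def clean_scraped_line(line):
    """Remove whitespace and '=' signs from one scraped line."""
    return re.sub(r"[\s=]", "", line)

def looks_like_nucleotides(seq, threshold=0.9):
    """Heuristic: flag a line whose letters are almost all A/C/G/T/N."""
    letters = [c for c in seq.upper() if c.isalpha()]
    if not letters:
        return False
    return sum(c in "ACGTN" for c in letters) / len(letters) >= threshold

# Invented example lines, not real browser output.
scraped = [
    "A I E K Y C R V S G G F S G V K",   # amino acids spaced out by the pdf
    "GCTATTGAAAAATACTGCAGA",             # nucleotide line, also in caps
    "= = D V Y S S T P T H D D = =",     # amino acids padded with equal signs
]

cleaned = [clean_scraped_line(s) for s in scraped]
protein = [s for s in cleaned if not looks_like_nucleotides(s)]
print(protein)
```

A short peptide made up only of A, C, G and T residues would be misflagged by this heuristic, so the result still needs a glance before pasting.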

Net tracks have to be opened individually and translated (or uBlastx'ed against fiducial). As always, the trace archives have to be consulted at the end to fill in any residual gaps, because millions of singletons get left out of assemblies.

To illustrate what happens in real life (i.e. the difficulty of fully automating the process proteome-wide), I looked at an exon of MAN1A2. The 28-way gave 25 sequences, with a proximal frameshift in eriEur and nothing for tupBel and dasNov. eriEur had two covering traces and the browser contig had taken the wrong one; tupBel had a single covering trace with 6 frameshifts, which is unusable. Armadillo has gone to 6x lately and trace coverage was excellent.

The net tracks yielded compelling ponPyg and petMar translations but nothing for braFlo, and a calJac translation that implausibly carried two indels. However, calJac at the trace archives had indel-free coverage with the expected conservation; the browser was again hosting either a contig assembly error or a bizarre polymorphism.

Wgs searching restricted to 'Microcebus OR Spermophilus OR Ochotona OR Myotis OR Callorhinchus' avoids a results page swamped out with sequences already on hand. It still takes forever because so many genomes have been placed in this db. The search here provides a good outcome for all but speTri where only a paralogous exon match occurs. speTri had the needed coverage in the trace archives however. Sometimes the paralog story and other anomalies only emerge from an alignment of all the collected sequences at the end.
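
The species restriction string can be assembled from a list rather than typed by hand. A minimal sketch: with no field tag it reproduces the bare `OR` string quoted above; the optional `[Organism]` tag follows the Entrez field-tag convention as I understand it and is an assumption about what the search form accepts. The function name is hypothetical.

```python
def species_restriction(genera, field=None):
    """Join genus names with OR, optionally tagging each with an Entrez-style field."""
    terms = [f"{g}[{field}]" if field else g for g in genera]
    return " OR ".join(terms)

genera = ["Microcebus", "Spermophilus", "Ochotona", "Myotis", "Callorhinchus"]

plain = species_restriction(genera)
tagged = species_restriction(genera, field="Organism")
print(plain)
print(tagged)
```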

It's also worth checking est_others for ad hoc species. Here it's best to look at the taxonomy sort-page. I picked up susScr and papHam, stubbing in Papio anubis for the latter. Another 6 species were available should it later prove mission-critical to flesh out certain regions of the tree:

Echis ocellatus ............. 77  1 hit  [snakes] 
Bothrops jararaca ........... 70  1 hit  [snakes] 
Gekko japonicus ............. 70  1 hit  [lizards] 
Ambystoma mexicanum ......... 77  1 hit  [salamanders] 
Squalus acanthias ........... 74  1 hit  [sharks and rays] 
Leucoraja erinacea .......... 70  3 hits [sharks and rays]  

By using this method of descending quality and convenience, only 12 individual searches at the trace archives are needed. This protein is evolving slowly enough that a single fixed nucleotide blastn query will suffice, either human, boreoeuthere, or reconstructed ancestor. Because the pulldown menu contains hundreds of species, the ones sought should be queried in alphabetic order to simplify sequential setting of the target species. That process took 14 minutes here to get 9 additional species, ending up with a 46-way on this exon or a 52-way using the 6 transcript species above.

In summary, enough data is out there to double up on what the browser 28-way offers. The downside is that even after optimizing the process it takes an hour or more per exon. In this project, I'm looking at non-synonymous changes in 100 exons in the first fully sequenced extinct species. There will have to be some prioritization early on, but this can be based on analysis of the initial pdf screenscrape.

col1  species order if scraping protein off the 'Conservation'  28-way, by net availability, otherwise by declining trace reads
col2  species phylogenetic order, sister leafs subordered by assembly quality
col3  assembly, wgs contig, htg contigs, trace archives (last two digits show millions of reads)
col4  6-letter genSpp code
col5  genus
col6  species
col7  common and subspp

1,1,>,Mar06,homSap,Homo,sapiens,(human)
2,2,>,Mar06,panTro,Pan,troglodytes,(chimp)
3,6,>,Jan06,macMul,Macaca,mulatta,(rhesus)
4,10,>,Dec06,otoGar,Otolemur,garnettii,(bushbaby)
5,13,>,Dec06,tupBel,Tupaia,belangeri,(treeshrew)
6,14,>,Jul07,musMus,Mus,musculus,(mouse)
7,15,>,Nov04,ratNor,Rattus,norvegicus,(rat)
8,18,>,Wgs08,cavPor,Cavia,porcellus,(guinea_pig)
9,19,>,May05,oryCun,Oryctolagus,cuniculus,(rabbit)
10,30,>,Wgs08,sorAra,Sorex,araneus,(shrew)
11,31,>,Wgs08,eriEur,Erinaceus,europaeus,(hedgehog)
12,21,>,May05,canFam,Canis,familiaris,(dog)
13,22,>,Mar06,felCat,Felis,catus,(cat)
14,27,>,Jan07,equCab,Equus,caballus,(horse)
15,23,>,Aug06,bosTau,Bos,taurus,(cow)
16,35,>,May05,dasNov,Dasypus,novemcinctus,(armadillo)
17,32,>,May05,loxAfr,Loxodonta,africana,(elephant)
18,34,>,Jul05,echTel,Echinops,telfairi,(tenrec)
19,38,>,Jan06,monDom,Monodelphis,domestica,(opossum)
20,39,>,Mar07,ornAna,Ornithorhynchus,anatinus,(platypus)
21,42,>,Feb07,anoCar,Anolis,carolinensis,(lizard)
22,40,>,May06,galGal,Gallus,gallus,(chicken)
23,43,>,Aug05,xenTro,Xenopus,tropicalis,(frog)
24,44,>,Jul07,danRer,Danio,rerio,(zebrafish)
25,46,>,Feb04,tetNig,Tetraodon,nigroviridis,(pufferfish)
26,45,>,Oct04,takRub,Takifugu,rubripes,(fugu)
27,47,>,Feb06,gasAcu,Gasterosteus,aculeatus,(stickleback)
28,48,>,Apr06,oryLat,Oryzias,latipes,(medaka)
29,50,>,Mar07,petMar,Petromyzon,marinus,(lamprey)
30,51,>,Mar06,braFlo,Branchiostoma,floridae,(lancelet)
31,9,>,Jun07,calJac,Callithrix,jacchus,(marmoset)
32,4,>,Jul07,ponPyg,Pongo,pygmaeus,(orang_abelii)
33,11,>,Wgs08,micMur,Microcebus,murinus,(mouse_lemur)
34,16,>,Wgs08,speTri,Spermophilus,tridecemlineatus,(ground_squirrel)
35,20,>,Wgs08,ochPri,Ochotona,princeps,(pika)
36,28,>,Wgs08,myoLuc,Myotis,lucifugus,(microbat)
37,49,>,Wgs08,calMil,Callorhinchus,milii,(elephantfish)
38,25,>,Htg08,susScr,Sus,scrofa,(pig)
39,5,>,Trc19,nomLeu,Nomascus,leucogenys,(gibbon)
40,8,>,Trc17,tarSyr,Tarsius,syrichta,(tarsier)
41,41,>,Trc15,taeGut,Taeniopygia,guttata,(finch)
42,7,>,Trc12,papHam,Papio,hamadryas,(baboon)
43,26,>,Trc11,vicVic,Vicugna,vicugna,(vicugna)
44,37,>,Trc10,macEug,Macropus,eugenii,(wallaby)
45,24,>,Trc10,turTru,Tursiops,truncatus,(dolphin)
46,36,>,Trc09,choHof,Choloepus,hoffmanni,(sloth)
47,33,>,Trc09,proCap,Procavia,capensis,(hyrax)
48,29,>,Trc08,pteVam,Pteropus,vampyrus,(macrobat)
49,17,>,Trc07,dipOrd,Dipodomys,ordii,(kangaroo_rat)
50,3,>,Trc04,gorGor,Gorilla,gorilla,(gorilla)
51,12,>,Trc00,cynVol,Cynocephalus,volans,(flying_lemur)
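
Because the table is comma-separated, both sort orders and the list of trace-archive-only species can be pulled out with a few lines of code. This is a sketch on a subset of rows; the field names are my own labels for the columns described above, and the literal `>` in the third field (not listed in the column legend) is assumed to be a fasta header marker.

```python
import csv
import io

# Subset of the rows above, copied verbatim.
TABLE = """\
1,1,>,Mar06,homSap,Homo,sapiens,(human)
33,11,>,Wgs08,micMur,Microcebus,murinus,(mouse_lemur)
38,25,>,Htg08,susScr,Sus,scrofa,(pig)
39,5,>,Trc19,nomLeu,Nomascus,leucogenys,(gibbon)
43,26,>,Trc11,vicVic,Vicugna,vicugna,(vicugna)
45,24,>,Trc10,turTru,Tursiops,truncatus,(dolphin)
50,3,>,Trc04,gorGor,Gorilla,gorilla,(gorilla)
"""

# Hypothetical field names for illustration.
FIELDS = ["scrape_order", "phylo_order", "marker", "source",
          "code", "genus", "species", "common"]

def load_rows(text):
    """Parse the comma-separated table into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), fieldnames=FIELDS))

def trace_only(rows):
    """Rows sourced from the trace archives, alphabetical by genus then species."""
    hits = [r for r in rows if r["source"].startswith("Trc")]
    return sorted(hits, key=lambda r: (r["genus"], r["species"]))

rows = load_rows(TABLE)
todo = [f'{r["genus"]} {r["species"]}' for r in trace_only(rows)]
phylo = [r["code"] for r in sorted(rows, key=lambda r: int(r["phylo_order"]))]
print(todo)
print(phylo)
```

The alphabetical `todo` list matches the order in which species would be picked off the trace archive pulldown menu, while `phylo` gives the order for the final alignment.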

Final fasta output

>homSap Mar06 Homo sapiens (human) MAN1A2 exon
AIEKYCRVNGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>panTro Mar06 Pan troglodytes (chimp) 
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>gorGor Trc04 Gorilla gorilla (gorilla) 
AIEKYRRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>ponPyg Jul07 Pongo pygmaeus (orang_abelii) 
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETL
>nomLeu Trc19 Nomascus leucogenys (gibbon) 
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>macMul Jan06 Macaca mulatta (rhesus) 
AIEKYCRVTGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>papHam Trc12 Papio hamadryas (baboon) 
AIEKYCRVNGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>tarSyr Trc17 Tarsius syrichta (tarsier) 
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>calJac Jun07 Callithrix jacchus (marmoset) 
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>otoGar Dec06 Otolemur garnettii (bushbaby) 
AIEKYCRVTGGFSGIKDVYSSVPTHDDVQQSFFLAETLK
>micMur Wgs08 Microcebus murinus (mouse_lemur) 
AIEKYCRVSGGFSGIKDVYSSTPTHDDVQQSFFLAETLK
>cynVol Trc00 Cynocephalus volans (flying_lemur) 
-
>tupBel Dec06 Tupaia belangeri (treeshrew) 
--
>musMus Jul07 Mus musculus (mouse) 
AIEKSCRVSGGFSGVKDVYAPTPVHDDVQQSFFLAETLK
>ratNor Nov04 Rattus norvegicus (rat) 
AIEKSCRVSGGFSGVKDVYSPTPAHDDVQQSFFLAETLK
>speTri Wgs08 Spermophilus tridecemlineatus (ground_squirrel) 
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>dipOrd Trc07 Dipodomys ordii (kangaroo_rat) 
-
>cavPor Wgs08 Cavia porcellus (guinea_pig) 
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>oryCun May05 Oryctolagus cuniculus (rabbit) 
AIEKHCRVRGGFSGIKDVYSSTPTHDDVQQSFFLAETLK
>ochPri Wgs08 Ochotona princeps (pika) 
AIEKHCRVRGGFSGIKDVYSSTPTHDDVQQSFFLAETLK
>canFam May05 Canis familiaris (dog) 
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>felCat Mar06 Felis catus (cat) 
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>bosTau Aug06 Bos taurus (cow) 
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>turTru Trc10 Tursiops truncatus (dolphin) 
AIEKYCRVTGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>susScr Htg08 Sus scrofa (pig) 
ALEKHCRVNGGYSGLRDVYVSAQTYDDVQQSFFLAETLK
>vicVic Trc11 Vicugna vicugna (vicugna) 
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>equCab Jan07 Equus caballus (horse) 
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>myoLuc Wgs08 Myotis lucifugus (microbat) 
AIEKYCRVSGGFSGVKDVYSSTPAHDDVQQSFFLAETLK
>pteVam Trc08 Pteropus vampyrus (macrobat) 
AIEKYCRVSGGFSGVKDVYSSTPAHDDVQQSFFLAETLK
>sorAra Wgs08 Sorex araneus (shrew) 
AIEKYCRVSSGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>eriEur Wgs08 Erinaceus europaeus (hedgehog) 
AIEKYCRVSSGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>loxAfr May05 Loxodonta africana (elephant) 
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>proCap Trc09 Procavia capensis (hyrax) 
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>echTel Jul05 Echinops telfairi (tenrec) 
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLk
>dasNov May05 Dasypus novemcinctus (armadillo) 
AIEKNCRVSSGFSGVKDVYSANPTHDDVQQSFFLAETLK
>choHof Trc09 Choloepus hoffmanni (sloth) 
AIEKYCRVSSGFSGVKDVYSSNPTHDDVQQSFFLAETLK
>macEug Trc10 Macropus eugenii (wallaby) 
-
>monDom Jan06 Monodelphis domestica (opossum) 
AIEKYCRVRGGFSGIKDVYSSAPAYDDVQQSFFLAETLK
>ornAna Mar07 Ornithorhynchus anatinus (platypus) 
AIEKSCRVSGGFSGVKDVYSSAPAYDDVQQSFFLAETLK
>galGal May06 Gallus gallus (chicken) 
AIDKYCRVSGGFSGVKDVYSSSPTYDDVQQSFFLAETLK
>taeGut Trc15 Taeniopygia guttata (finch) 
AIDKYCRVSGGFSGVKDVYSSSPTYDDVQQSFFLAETLK
>anoCar Feb07 Anolis carolinensis (lizard) 
AIDKYCRVSGGFSGVKDVYSSAPTFDDVQQSFFLAETLK
>xenTro Aug05 Xenopus tropicalis (frog) 
AIDKYCRVSGGFSGIKDVYSSSPTYDDVQQSFFLAETLK
>danRer Jul07 Danio rerio (zebrafish) 
ALEKHCRVEGGYSGVRDVYSNNPNHDDVQQSFYLAETLK
>takRub Oct04 Takifugu rubripes (fugu) 
AIDKYCRVSGGFSGVKDVYSSNPTYDDVQQSFFLAETLK
>tetNig Feb04 Tetraodon nigroviridis (pufferfish) 
AIDKYCRVSGGFSGVKDVYSSSPTYDDVQQSFFLAETLK
>gasAcu Feb06 Gasterosteus aculeatus (stickleback) 
AIDKYCRVSGGFSGVKDVYSSNPTYDDVQQSFFLAETLK
>oryLat Apr06 Oryzias latipes (medaka) 
AIDKYCRVSGGFSGVKDVYSSNPTYDDVQQSFFLAETLk
>calMil Wgs08 Callorhinchus milii (elephantfish) 
AIDKYCRVIGGFSGVKDVYSSTPAYDDVQQSFFLAETLK
>petMar Mar07 Petromyzon marinus (lamprey) 
ALEKYCRVEGGFSGIRDVYSSSPAHDDVQQSFFLAETLK
>braFlo Mar06 Branchiostoma floridae (lancelet) 
-
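
A fasta file in this layout is easy to summarize: which species were recovered, which are still placeholders, and where a given species differs from human. The sketch below uses a handful of records copied from above and assumes that a sequence consisting only of `-` means "not recovered"; the function names are hypothetical.

```python
FASTA = """\
>homSap Mar06 Homo sapiens (human) MAN1A2 exon
AIEKYCRVNGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>panTro Mar06 Pan troglodytes (chimp)
AIEKYCRVSGGFSGVKDVYSSTPTHDDVQQSFFLAETLK
>cynVol Trc00 Cynocephalus volans (flying_lemur)
-
>tupBel Dec06 Tupaia belangeri (treeshrew)
--
>musMus Jul07 Mus musculus (mouse)
AIEKSCRVSGGFSGVKDVYAPTPVHDDVQQSFFLAETLK
"""

def parse_fasta(text):
    """Map the first header token (the 6-letter code) to its sequence."""
    records = {}
    code = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            code = line[1:].split()[0]
            records[code] = ""
        elif code is not None:
            records[code] += line
    return records

def differences(ref, seq):
    """1-based (position, ref residue, other residue) for mismatches; compares up to the shorter length."""
    return [(i + 1, a, b) for i, (a, b) in enumerate(zip(ref, seq)) if a != b]

records = parse_fasta(FASTA)
missing = [c for c, s in records.items() if set(s) <= {"-"}]
recovered = {c: s for c, s in records.items() if c not in missing}
mouse_diffs = differences(recovered["homSap"], recovered["musMus"])
print(missing)
print(mouse_diffs)
```

Since this exon has no indels among the recovered sequences, a position-by-position comparison is adequate here; an exon with indels would need a real alignment first.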

Tracking 454 reads

These six transcript projects are the pick of the 454 litter as of May 2008 in terms of vertebrate data. Because transcripts are coding, they carry far more payload than random genomic snippets: they align more readily and are easier to work with. The file sizes are very reasonable considering how much data is in them. These are fasta-formatted sequence data only, compressed but not otherwise encoded, and in the public domain. Read quality is separately available but probably of less interest to most biomedical researchers.

homSap SRA000297 Homo sapiens Transcriptome Study 454 GS FLX WUGSC don't need
calJac SRA000144 Callithrix jacchus Transcriptome Study WUGSC
tupBel SRA000295 Tupaia belangeri Transcriptome Study 454 GS FLX WUGSC
ornAna SRA000294 Ornithorhynchus anatinus Transcriptome Study 454 GS FLX WUGSC
tacAcu SRA000241 Tachyglossus aculeatus Transcriptome Study 454 GS FLX WUGSC
galGal SRA000238 Gallus gallus Transcriptome Study 454 GS FLX WUGSC