Genome completion status

From genomewiki
Revision as of 13:32, 30 January 2008 by Tomemerald (talk | contribs)

Which metazoan species currently have genomic data available? That is hard to say, because the process is difficult to track. Consequently, few researchers know which species have genomic data, and they often undersample needlessly in comparative genomics projects. Sampling species more densely often overturns working hypotheses of feature evolution.

Sequencing centers post raw trace reads on a day-by-day basis at NCBI's trace archives. NCBI performs some quality control and adds them to the accruing database, which is searchable with blastn. Later the center may assemble the traces into contigs and post them to the "wgs" division of GenBank (more rarely to "gss" or "htgs"). Depending on the coverage and finishing effort, these contigs can then be hosted as a genome by a browser center such as UCSC.

It may take 2-3 years for data to complete its migration from trace sequencing to contigs to genome. More rarely, traces are withheld and a genome assembly appears abruptly, as with elephantfish. There are no announcements, maintained lists, or publications; sequencing centers rarely update their websites or indicate specific future plans.

Further complications include multiple gorilla, orang and gibbon subspecies, personal individual human genomes, areas of confused taxonomy (alpaca vs vicuna) and so on. NISC and the sequencing centers do not always work from the same individual or even the same subspecies, so each trace compilation has to be checked separately.

To annotate at kbp scales (adequate for exons and small genes), one can reliably use the traces or contigs and not wait (years) for a genome browser to appear.

If the exon or feature is about 1000 bp (or comes in pieces of that size), the trace archives work quite well, especially for establishing presence. Absence is not as informative because data might simply be missing due to low coverage. No vertebrate species is truly complete yet, including human. It takes a couple of million traces before a given incoming genome is worth checking for a given feature.
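The point about absence can be made roughly quantitative. The sketch below is not from the original page: it uses a simple Poisson model of random shotgun reads, and the default read length (800 bp) and genome size (3 Gb) are assumed round numbers, not measured values for any species in the table.

```python
import math

def fraction_covered(n_traces, read_len_bp=800, genome_size_bp=3.0e9):
    """Expected fraction of bases hit by at least one read.

    Poisson approximation for randomly placed reads.
    read_len_bp and genome_size_bp are assumed values, not from the source.
    """
    coverage = n_traces * read_len_bp / genome_size_bp  # average depth
    return 1.0 - math.exp(-coverage)

if __name__ == "__main__":
    for millions in (1, 2, 5, 10, 20):
        frac = fraction_covered(millions * 1_000_000)
        print(f"{millions:>2} million traces: about {frac:.0%} of bases covered")
```

Under these assumptions, 2 million traces give about 0.5x depth and touch only around 40% of bases, so a miss at that stage says little about whether the feature is truly absent. The model is per base; requiring a whole exon inside the reads would lower the odds further.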

Not every trace makes it into contigs or an assembly: singletons are often omitted, millions of them. Sequencing often continues after a release, as is happening now with elephant and guinea pig. Consequently the trace archive at NCBI is always the resource of last resort, i.e. if a feature is missing from assembled traces, it is best to go back to the trace archive because all the original data is there. However, NCBI posting of traces to its blast database can lag trace inputs from the sequencing centers by a week or more, which can amount to a million traces in an active project.

There are also "cDNA" species such as the marsupial Trichosurus vulpecula, which have rather complete coverage of coding genes but no genome project underway. Such sequences can furnish critical close-in query material to improve the sensitivity of trace blast (which is not sensitive at any evolutionary distance if the feature is evolving rapidly).

In a specific comparative genomics research project, it is important to document which species were considered. For that it is convenient to enter annotation data in a column next to its species, in a spreadsheet containing all species with genomic data (provided below and illustrated below that with a concrete coding indel example). This allows last-minute updating prior to paper submission.
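A minimal sketch of such a spreadsheet, written as CSV with the Python standard library. The five species codes are a subset of the list on this page; the function name `build_sheet`, the column name, and the "not checked" placeholder are illustrative choices, not part of the original page.

```python
import csv
import io

# Subset of the species codes used on this page, in table order.
SPECIES = ["homSap", "panTro", "tupBel", "musMus", "monDom"]

# Annotation for one feature; species not yet examined are simply left out.
ANNOTATION = {"homSap": "gap", "panTro": "gap", "tupBel": "no gap", "musMus": "no gap"}

def build_sheet(species, annotation, column_name="feature"):
    """Return CSV text with one row per species and one annotation column."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["species", column_name])
    for code in species:
        # Every species gets a row, so unexamined ones stay visible.
        writer.writerow([code, annotation.get(code, "not checked")])
    return buf.getvalue()

if __name__ == "__main__":
    print(build_sheet(SPECIES, ANNOTATION), end="")
```

Keeping a row for every species, including those not yet examined, is what documents which species were considered and makes a last-minute update a matter of filling in blanks.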

Finally, PCR can be used on species currently lacking genomic or cDNA projects when it is critical to augment sampling density. Flying lemur would be a good choice in primate-oriented projects because it appears to be the immediate outgroup (hence a great improvement over distant mouse).

Two recent papers illustrate these concepts and explain methods of contemporary comparative genomics in greater detail:

Janecka JE, Miller W, Pringle TH, Wiens F, Zitzmann A, Helgen KM, Springer MS, Murphy WJ.
Molecular and genomic data identify the closest living relative of primates.
Science. 2007 Nov 2;318(5851):792-4.
PMID: 17975064

Murphy WJ, Pringle TH, Crider TA, Springer MS, Miller W.
Using genomic data to unravel the root of the placental mammal phylogeny.
Genome Res. 2007 Apr;17(4):413-21. 
PMID: 17322288


Revised 01 Feb 08. Trace counts are given in millions, e.g. Trc12 means 12 million traces are available but no wgs contigs or assembly.
Mar06  homSap  Homo  sapiens  (human)
Mar06  panTro  Pan  troglodytes  (chimp)
Trc04  gorGor  Gorilla  gorilla  (gorilla)
Jul07  ponPyg  Pongo  pygmaeus  (orang_abelii)
Trc19  nomLeu  Nomascus  leucogenys  (gibbon)
Jan06  macMul  Macaca  mulatta  (rhesus)
Trc12  papHam  Papio  hamadryas  (baboon)
Trc17  tarSyr  Tarsius  syrichta  (tarsier)
Jun07  calJac  Callithrix  jacchus  (marmoset)
Dec06  otoGar  Otolemur  garnettii  (bushbaby)
Wgs08  micMur  Microcebus  murinus  (mouse_lemur)
Trc00  cynVol  Cynocephalus  volans  (flying_lemur)
Dec06  tupBel  Tupaia  belangeri  (treeshrew)
Jul07  musMus  Mus  musculus  (mouse)
Nov04  ratNor  Rattus  norvegicus  (rat)
Wgs08  speTri  Spermophilus  tridecemlineatus  (ground_squirrel)
Trc07  dipOrd  Dipodomys  ordii  (kangaroo_rat)
Wgs08  cavPor  Cavia  porcellus  (guinea_pig)
May05  oryCun  Oryctolagus  cuniculus  (rabbit)
Wgs08  ochPri  Ochotona  princeps  (pika)
May05  canFam  Canis  familiaris  (dog)
Mar06  felCat  Felis  catus  (cat)
Aug06  bosTau  Bos  taurus  (cow)
Trc10  turTru  Tursiops  truncatus  (dolphin)
Trc06  susScr  Sus  scrofa  (pig)
Trc11  vicVic  Vicugna  vicugna  (vicugna)
Jan07  equCab  Equus  caballus  (horse)
Wgs08  myoLuc  Myotis  lucifugus  (microbat)
Trc08  pteVam  Pteropus  vampyrus  (macrobat)
Wgs08  sorAra  Sorex  araneus  (shrew)
Wgs08  eriEur  Erinaceus  europaeus  (hedgehog)
May05  loxAfr  Loxodonta  africana  (elephant)
Trc09  proCap  Procavia  capensis  (hyrax)
Jul05  echTel  Echinops  telfairi  (tenrec)
May05  dasNov  Dasypus  novemcinctus  (armadillo)
Trc09  choHof  Choloepus  hoffmanni  (sloth)
Trc10  macEug  Macropus  eugenii  (wallaby)
Jan06  monDom  Monodelphis  domestica  (opossum)
Mar07  ornAna  Ornithorhynchus  anatinus  (platypus)
May06  galGal  Gallus  gallus  (chicken)
Trc15  taeGut  Taeniopygia  guttata  (finch)
Feb07  anoCar  Anolis  carolinensis  (lizard)
Aug05  xenTro  Xenopus  tropicalis  (frog)
Jul07  danRer  Danio  rerio  (zebrafish)
Oct04  takRub  Takifugu  rubripes  (fugu)
Feb04  tetNig  Tetraodon  nigroviridis  (pufferfish)
Feb06  gasAcu  Gasterosteus  aculeatus  (stickleback)
Apr06  oryLat  Oryzias  latipes  (medaka)
Wgs08  calMil  Callorhinchus  milii  (elephantfish)
Mar07  petMar  Petromyzon  marinus  (lamprey)
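The rows above are regular enough to load into a script. The following is a sketch only: the legend defines just the Trc codes, so treating "Wgs" entries as wgs-stage contigs and the remaining month-year codes as assembly dates is an interpretation, and the field names are hypothetical.

```python
import re

# status, six-letter code, genus, species, (common_name)
ROW = re.compile(r"^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+\((\S+)\)$")

def parse_row(line):
    """Split one table row into a dict; raise ValueError if it does not fit."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized row: {line!r}")
    status, code, genus, species, common = match.groups()
    million_traces = None
    if status.startswith("Trc"):
        stage = "traces"                  # per the legend: traces only
        million_traces = int(status[3:])
    elif status.startswith("Wgs"):
        stage = "wgs"                     # assumed: wgs contigs posted
    else:
        stage = "assembly"                # assumed: assembly date such as Mar06
    return {"status": status, "code": code, "name": f"{genus} {species}",
            "common": common, "stage": stage, "million_traces": million_traces}

if __name__ == "__main__":
    for row in ("Trc04  gorGor  Gorilla  gorilla  (gorilla)",
                "Wgs08  micMur  Microcebus  murinus  (mouse_lemur)",
                "Mar06  homSap  Homo  sapiens  (human)"):
        print(parse_row(row))
```

Filtering the parsed rows by `stage` gives a quick list of species for which only the trace archive can be searched.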

A coding indel example (a coding exon from gene SPC25 on human chr2) illustrates the usefulness of multiple genomes in timing and understanding evolution of insertions and deletions.

homSap	MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
panTro	MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
ponPyg	MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
macMul	MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
calJac	MVEDELALFDKSLNEFWNKFKST--DTTFQMAGLRDTYKDSLKAFA
tarSyr	MVEDELTLFDKSINEFWNKFKST--DTANQMMGLRDTYKDSVKAFA
otoGar	MVEDQLALLDKNINEFWNKFKST--DTAGQMAGLRDTYKDSIKTFA
micMur	MVEDELVLFDKTVNEFWNKFKST--DTSCHMVGLRDTYKDSLKAFA
cynVol	.................NKFTST--DTSCQMMGLRGTNK.......
tupBel	MVEDELALFDKGINEFWNKFRSTVSDTSCQMVGLRDAYKDSIKAFA
musMus	MGEDELALLNQSINEFGDKFRNRLDDNHSQVLGLRDAFKDSMKAFS
ratNor	MGEDELAAFEKSINEFGDKFRYRLSDNRSQVLGLKDAFKDSIRALS
cavPor	MVEDELALFDKSINEFGNKFRNTLSDTPCQMLGLRDACKDSIKTLA
speTri	MMEDELARFDKSINEFGNKFRNTFSDTRCQMVGLRDVFKDSIEALA
dipOrd	MVEDELAHFDKSISEFGSKFRNTLSDTPSQTVGLRDAYKDSIKALS
oryCun	MVEDELALFDKSINEFGSKFRSTLSDAPCQMVGLRDAYKDSVKSLT
ochPri	MVEDELALFDKSINEFGSKFRSTLSDTPCQMVGLREACKDSVRLLT
canFam	MIDDELAQFDKSISEFWSKFKGTVSDTSSQMVGLRETYKDSIKACA
felCat	MIEDELALFDKSINEFWNKFKSTLSDTSCQMMGLRDTYKDSIKALT
equCab	MVEDELALFDKSINEFWNKFKNTVSDTSCQMVGLRDAYKDSIKAFA
myoLuc	MVEDELALLDKNINEFWNKFKSNVNDTSCQMVGLRDNYKDISKAFT
pteVam	MVEDELALLDKSINEFWNKFKSSVSDTSCQMMALRDSYKDINKAFT
bosTau	MVEDELALFDKSINEFWNKFKSTVSDTSCQMVGLRETYKDSIKAFA
turTru	MVEDELALFDKSINEFWNKFRSTVSDTSCQMVGLRDTYKDSIKAFA
susScr	MVEDELALFDKSINEFWNRFKSTVSDTSCQMVGLRENYKDSLKAFA
oviAri	MVEDELALFDKSLNEFWNKFKSTVNDTSCQMVGLREAYKDSIKAFA
eriEur	MVEDELALFDKSINEFWNKFKGTVSDTSFQMVGLRDTYKDSIKIFT
sorAra	MVEDELVLFEKSINEFVNEFESTASDTTCQVVGPRDADKDSIKALA
dasNov	MIEDELALFDKSINEFWNKFKGTVSDNSCQMVGLRDTYKDSIKAFA
choHof	MIEDELALFDKSINEFWNKFKSAVSDTSCQMVGLRDTYKDSIKAFA
loxAfr	MIEDELVQFDKSINEFWNKFINTASDTSCQMVGLRDAYKDSMKAFA
proCap	MIEDELRQFDKSINEFWNKFINTTSDTSCQMAGLRDAYKDSMKAFA
echTel	MIEDELLQFDKSMNEFRNKHFNTLNDTSGQMMGLRDTYRDSMKAFA
monDom	MSHIKTEEELDLFNKSINDFWNKFRNTTLNEHCSQMVGLRDTYKDSIEALT
macEug	MSHIKTEEELDIFEKSISDFWNRFRNTAFNEPYSQVVGVRDTYKYSIETLT
triVul	MSHIKTEEELDIFNKSINDFWNRFRNTTFNEHYSQVVGLRDTYKNSIEALT
ornAna	MSHIKTEEELALFDKSIDEFWTKFKNTWISEYSCQTVTLRDAHKEAIKALT
galGal	MSAVKTEDEITVVEREMKEFWTELKSVYGTEQINQTLALRDSCKESINVLS
taeGut	MGNAQAEDEVALFEKDMKEFWIQFKISYGTEQNNQTMKEFWIQFKISYGTE
anoCar	MAKAKEEDELTMLEKGIEELCTQIETTYCRQSLEKTSGPRNKCYKSGPRNK
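A first step in timing the indel is simply listing which species carry the gap. The sketch below does that for six rows copied from the alignment above; the function names are illustrative. It only looks for "-" characters and does not attempt to realign anything (the marsupial rows, for instance, are longer than the placental ones).

```python
# Six rows copied from the alignment above (code, then sequence).
ALIGNMENT = """
homSap MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
tarSyr MVEDELTLFDKSINEFWNKFKST--DTANQMMGLRDTYKDSVKAFA
cynVol .................NKFTST--DTSCQMMGLRGTNK.......
tupBel MVEDELALFDKSINEFWNKFRSTVSDTSCQMVGLRDAYKDSIKAFA
musMus MGEDELALLNQSINEFGDKFRNRLDDNHSQVLGLRDAFKDSMKAFS
monDom MSHIKTEEELDLFNKSINDFWNKFRNTTLNEHCSQMVGLRDTYKDSIEALT
"""

def parse_alignment(text):
    """Map species code -> aligned sequence, one row per non-empty line."""
    rows = {}
    for line in text.splitlines():
        if line.strip():
            code, seq = line.split()   # works for tab- or space-separated rows
            rows[code] = seq
    return rows

def gap_carriers(rows):
    """Species whose aligned sequence contains a gap character '-'."""
    return [code for code, seq in rows.items() if "-" in seq]

if __name__ == "__main__":
    rows = parse_alignment(ALIGNMENT)
    print("gap present:", gap_carriers(rows))
    print("gap absent: ", [c for c in rows if c not in gap_carriers(rows)])
```

In this subset the gap appears in human, tarsier and flying lemur but not in treeshrew, mouse or opossum; the dots in the cynVol row mark residues not recovered, so they are not counted as gaps.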