Genome completion status: Difference between revisions

Line 1: Line 1:
Which metazoan species currently have genomic data available? Hard to say ... it is a difficult process to track:
Which metazoan species currently have genomic data available? That's hard to say -- it is a difficult process to track. Consequently few researchers are adequately aware of what species have  genomic data, and so typically undersample needlessly when doing comparative genomics projects. Sampling species more densely often overturns working hypotheses of feature evolution.


Sequencing centers post raw trace reads on a day-by-day basis at NCBI's trace archives. NCBI performs some quality control and adds them to the accruing database that is blastn accessible. Later the center may assemble them into contigs and post them to the "wgs" division of GenBank (more rarely at at "gss" or "htgs"). Depending on the coverage and finishing effort, these contigs can be hosted as a genome by a browser center.
Sequencing centers post raw trace reads on a day-by-day basis at NCBI's trace archives. NCBI performs some quality control and adds them to the accruing database that is blastn accessible. Later the center may assemble them into contigs and post them to the "wgs" division of GenBank (more rarely at "gss" or "htgs"). Depending on the coverage and finishing effort, these contigs can be hosted as a genome by a browser center such as [http://genome.cse.ucsc.edu/index.html UCSC].
   
   
It may take 2-3 years for data to complete its migration from trace sequencing to contigs to genome. More rarely, traces are withheld and a genome assembly appears abruptly, as with elephantfish. There are no announcements, maintained lists, or publications; centers rarely update their websites or indicate specific future plans.  
It may take 2-3 years for data to complete its migration from trace sequencing to contigs to genome. More rarely, traces are withheld and a genome assembly appears abruptly, as with elephantfish. There are no announcements, maintained lists, or publications; sequencing centers rarely update their websites or indicate specific future plans.  


Further complications include two orang, two gibbons, subspecies confusion with gorilla and gibbon, multiple individual human genomes, and so forth. NISC and seq centers did not work from same individual or even subspecies so each trace compilation has to be checked separately.  
Further complications include multiple gorilla, orang and gibbon subspecies, personal individual human genomes, areas of confused taxonomy (alpaca vs vicuna) and so on. NISC and sequencing centers do not always work from same individual or even same subspecies so each trace compilation has to be checked separately.  


Consequently few other researchers are aware of what species have available genomic data, and so undersample taxonomically when doing comparative genomics projects. Often sampling more densely overturns working hypotheses of feature evolution.
To annotate at kbp scales (adequate for exons and small genes), one can reliably use the traces or contigs and not wait (years) for a genome browser to appear. 
 
If the exon or feature is 1000 bp (or in such pieces), the trace archives work quite well, especially for establishing presence. Absence is not as informative because data might simply be missing due to low coverage. No vertebrae species is truly complete yet including human. it requires a couple million traces before a given incoming genome is worth checking for a given feature.


To annotate at kbp scales (adequate for exons and small genes), one can reliably use the traces or contigs and not wait (years) for a genome browser to appear.  
Not every trace makes it into a contigs or assembly -- singletons are often omitted, millions of them. Sequencing often continues after a release, as is happening now with elephant and guinea pig. Consequently the trace archive at NCBI is always the resource of last resort, ie if a feature is missing from assembled traces, it is best to go back to the trace archive because all original data is there. However NCBI posting of traces to its blast database can lag trace inputs from the sequencing centers by a week or more, which can amount to a million traces in an active project.


if the exon or feature is 1000 bp (or in such sized pieces), the trace archives work quite well especially for establishing presence. Absence is not asinformative because data might simply be missing due to low coverage. No vertebrae species is truly complete yet including human. it requires a couple million traces before a given incoming genome is worth checking for a given feature.
There are also "cdna" species like the marsupial, Trichosurus vulpecular, which have rather complete coverage of coding genes but no genome project underway. Such sequences can  furnish critical close-in query material to improve the sensitivity of trace blast (which is not sensitive at any evolutionary distance if the feature is evolving rapidly).


It is important to be aware that not every trace makes it into a contigs or assembly -- singletons are often omitted, millions of them. Sequencing often continues after a release, as is happening now with elephant and guinea pig.  
In a specific comparative genomics research project, it is important to document which species were considered. For that it is convenient to enter annotation data in a column next to its species, in a spreadsheet containing all species with genomic data (provided below and illustrated below that with a concrete coding indel example). This allows last-minute updating prior to paper submission.


Thus if a feature is missing from assembled traces, it is best to go back to the trace archive because all original data is there. Be aware that the trace blast database can lag trace inputs from the centers by a week, which can amount to a million traces in an active project.
Finally, PCR can be used on species currently lacking genomic or cdna projects when it is critical to augment sampling density. Flying lemur would be a good choice in primate-oriented projects because it appears to be the immediate outgroup (hence a great improvement over distant mouse).


There are also "cdna" species like the third marsupial, Trichosurus vulpecular, which have rather complete coverage of coding genes. These can also furnish critical close-in queries to improve the sensitivity of trace blast.
Two recent papers illustrate these concepts and explain methods of contemporary comparative genomics in greater detail:


In a specific comparative genomics research project, it is important to document which species were considered. For that it is convenient to enter annotation data in a column next to its species, in a spreadsheet containing all species with genomic data (provided below and illustrated below that with a concrete coding indel example). This allows last-minute updating prior to paper submission.
<pre>
Janecka JE, Miller W, Pringle TH, Wiens F, Zitzmann A, Helgen KM, Springer MS, Murphy WJ.
Molecular and genomic data identify the closest living relative of primates.
Science. 2007 Nov 2;318(5851):792-4.
PMID: 17975064


Finally, PCR can be used on species currently lacking genomic or cdna projects when it is critical to augment sampling density. Flying lemur would be a good choice in primate oriented projects because it appears to be the immediate outgroup (hence a great improvement over distant mouse).
Murphy WJ, Pringle TH, Crider TA, Springer MS, Miller W.
Using genomic data to unravel the root of the placental mammal phylogeny.
Genome Res. 2007 Apr;17(4):413-21.  
PMID: 17322288
</pre>




Line 31: Line 41:
nomLeu  Trc07  Nomascus  leucogenys  gibbon
nomLeu  Trc07  Nomascus  leucogenys  gibbon
macMul  Jan06  Macaca  mulatta  rhesus
macMul  Jan06  Macaca  mulatta  rhesus
calJac  Wgs06  Callithrix  jacchus  marmoset_nwm
papHam  ???08  Papas hamadryas  baboon
calJac  Wgs06  Callithrix  jacchus  marmoset
tarSyr  ???07  Tarsius  syrichta  tarsier
tarSyr  ???07  Tarsius  syrichta  tarsier
otoGar  Dec06  Otolemur  garnettii  bushbaby
otoGar  Dec06  Otolemur  garnettii  bushbaby
micMur  Trc07  Microcebus  murinus  mouse_lemur
micMur  Trc07  Microcebus  murinus  mouse_lemur
cynVol  ???07 Cynocephalus  volans  flying_lemur
cynVol  ????? Cynocephalus  volans  flying_lemur
tupBel  Dec06  Tupaia  belangeri  tree_shrew
tupBel  Dec06  Tupaia  belangeri  tree_shrew
musMus  Feb06  Mus  musculus  mouse
musMus  Feb06  Mus  musculus  mouse
ratNor  Nov04  Rattus  norvegicus  rat
ratNor  Nov04  Rattus  norvegicus  rat
speTri  Wgs06  Spermophilus  tridecemlineatus  ground_squirrel
speTri  Wgs06  Spermophilus  tridecemlineatus  ground_squirrel
dipOrd  ???07  Dipodomys ordii  kangaroo_rat
cavPor  Wgs06  Cavia  porcellis  guinea_pig
cavPor  Wgs06  Cavia  porcellis  guinea_pig
oryCun  May05  Oryctolagus  cuniculus  rabbit
oryCun  May05  Oryctolagus  cuniculus  rabbit
ochPri  ???07  Ochotona  princeps  pika
canFam  May05  Canis  familiarus  dog
canFam  May05  Canis  familiarus  dog
felCat  Wgs06  Felis  catus  cat
felCat  Wgs06  Felis  catus  cat
Line 49: Line 62:
bosTau  Mar05  Bos  taurus  cow
bosTau  Mar05  Bos  taurus  cow
susScr  Trc06  Sus  scrofa  pig
susScr  Trc06  Sus  scrofa  pig
turTru  ???07  Tursiops truncatus dolphin
vicVic  ???07  Vicugna vicugna vicugna
sorAra  Wgs06  Sorex  araneus  shrew
sorAra  Wgs06  Sorex  araneus  shrew
eriEur  Wgs06  Erinaceus  europaeus  hedgehog
eriEur  Wgs06  Erinaceus  europaeus  hedgehog
Line 70: Line 85:
calMil  Wgs07  Callorhinchus  milii  elephantfish
calMil  Wgs07  Callorhinchus  milii  elephantfish
petMar  Trc06  Petromyzon  marinus  lamprey
petMar  Trc06  Petromyzon  marinus  lamprey
</pre>
 
A coding indel example (a coding exon from gene SPC25 on human chr2) illustrates the usefulness of multiple genomes in timing and understanding evolution of insertions and deletions.
coding indel example: SPC25 chr2 exon
</pre>
 
<pre>
homSap MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
homSap MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
panTro MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
panTro MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
Line 116: Line 131:
</pre>
</pre>
[[Category:Comparative Genomics]]
[[Category:Comparative Genomics]]
--tom

Revision as of 17:01, 16 November 2007

Which metazoan species currently have genomic data available? That is hard to say, because the process is difficult to track. Consequently, few researchers are fully aware of which species have genomic data, and so they typically undersample needlessly in comparative genomics projects. Sampling species more densely often overturns working hypotheses of feature evolution.

Sequencing centers post raw trace reads on a day-by-day basis at NCBI's trace archives. NCBI performs some quality control and adds them to an accruing database that is searchable with blastn. Later the center may assemble them into contigs and post them to the "wgs" division of GenBank (more rarely to "gss" or "htgs"). Depending on the coverage and finishing effort, these contigs can be hosted as a genome by a browser center such as UCSC.

It may take 2-3 years for data to complete its migration from trace sequencing to contigs to genome. More rarely, traces are withheld and a genome assembly appears abruptly, as with elephantfish. There are no announcements, maintained lists, or publications; sequencing centers rarely update their websites or indicate specific future plans.

Further complications include multiple gorilla, orang and gibbon subspecies, individual (personal) human genomes, areas of confused taxonomy (alpaca vs vicuna) and so on. NISC and the sequencing centers do not always work from the same individual or even the same subspecies, so each trace compilation has to be checked separately.

To annotate at kbp scales (adequate for exons and small genes), one can reliably use the traces or contigs and not wait (years) for a genome browser to appear.

If the exon or feature is 1000 bp (or comes in pieces of that size), the trace archives work quite well, especially for establishing presence. Absence is not as informative because data might simply be missing due to low coverage. No vertebrate species is truly complete yet, including human. It takes a couple of million traces before an incoming genome is worth checking for a given feature.
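
The coverage argument above can be made concrete with a rough calculation. The sketch below assumes a simple Poisson model of random shotgun reads, in which the expected fraction of bases hit by at least one read is 1 - exp(-c) for average depth c = N * L / G. The read length (750 bp) and genome size (3 Gbp) are illustrative assumptions, not figures from this page, and the function names are hypothetical.

```python
import math


def expected_coverage(n_traces, read_len_bp, genome_size_bp):
    """Average sequencing depth c = N * L / G."""
    return n_traces * read_len_bp / genome_size_bp


def fraction_covered(n_traces, read_len_bp, genome_size_bp):
    """Expected fraction of bases hit by at least one read under a
    Poisson model of random read placement: 1 - exp(-c)."""
    c = expected_coverage(n_traces, read_len_bp, genome_size_bp)
    return 1.0 - math.exp(-c)


if __name__ == "__main__":
    READ_LEN = 750            # assumed usable trace length in bp
    GENOME = 3_000_000_000    # assumed mammal-sized genome in bp
    for n in (500_000, 2_000_000, 8_000_000, 24_000_000):
        c = expected_coverage(n, READ_LEN, GENOME)
        f = fraction_covered(n, READ_LEN, GENOME)
        print(f"{n:>12,d} traces  depth {c:5.2f}x  covered {f:6.1%}")
```

Under these assumptions a missing hit at low depth says little, since a large share of the genome has not been sampled at all, whereas a hit at any depth is informative.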

Not every trace makes it into a contig or assembly: singletons are often omitted, millions of them. Sequencing often continues after a release, as is happening now with elephant and guinea pig. Consequently the trace archive at NCBI is always the resource of last resort; that is, if a feature is missing from assembled traces, it is best to go back to the trace archive because all the original data is there. However, NCBI's posting of traces to its blast database can lag trace inputs from the sequencing centers by a week or more, which can amount to a million traces in an active project.

There are also "cdna" species like the marsupial Trichosurus vulpecula, which have rather complete coverage of coding genes but no genome project underway. Such sequences can furnish critical close-in query material to improve the sensitivity of trace blast (which is not sensitive at any evolutionary distance if the feature is evolving rapidly).

In a specific comparative genomics research project, it is important to document which species were considered. For that it is convenient to enter annotation data in a column next to its species, in a spreadsheet containing all species with genomic data (provided below and illustrated below that with a concrete coding indel example). This allows last-minute updating prior to paper submission.

Finally, PCR can be used on species currently lacking genomic or cdna projects when it is critical to augment sampling density. Flying lemur would be a good choice in primate-oriented projects because it appears to be the immediate outgroup (hence a great improvement over distant mouse).

Two recent papers illustrate these concepts and explain methods of contemporary comparative genomics in greater detail:

Janecka JE, Miller W, Pringle TH, Wiens F, Zitzmann A, Helgen KM, Springer MS, Murphy WJ.
Molecular and genomic data identify the closest living relative of primates.
Science. 2007 Nov 2;318(5851):792-4.
PMID: 17975064

Murphy WJ, Pringle TH, Crider TA, Springer MS, Miller W.
Using genomic data to unravel the root of the placental mammal phylogeny.
Genome Res. 2007 Apr;17(4):413-21. 
PMID: 17322288


homSap  Mar06  Homo  sapiens  human
panTro  Mar06  Pan  troglodytes  chimp
gorGor  Dec07  Gorilla  gorilla  gorilla
ponPyg  Htg06  Pongo  pygmaeus  orang_sumatran
nomLeu  Trc07  Nomascus  leucogenys  gibbon
macMul  Jan06  Macaca  mulatta  rhesus
papHam  ???08  Papio  hamadryas  baboon
calJac  Wgs06  Callithrix  jacchus  marmoset
tarSyr  ???07  Tarsius  syrichta  tarsier
otoGar  Dec06  Otolemur  garnettii  bushbaby
micMur  Trc07  Microcebus  murinus  mouse_lemur
cynVol  ?????  Cynocephalus  volans  flying_lemur
tupBel  Dec06  Tupaia  belangeri  tree_shrew
musMus  Feb06  Mus  musculus  mouse
ratNor  Nov04  Rattus  norvegicus  rat
speTri  Wgs06  Spermophilus  tridecemlineatus  ground_squirrel
dipOrd  ???07  Dipodomys  ordii  kangaroo_rat
cavPor  Wgs06  Cavia  porcellus  guinea_pig
oryCun  May05  Oryctolagus  cuniculus  rabbit
ochPri  ???07  Ochotona  princeps  pika
canFam  May05  Canis  familiaris  dog
felCat  Wgs06  Felis  catus  cat
equCab  Jan07  Equus  caballus  horse
myoLuc  Wgs06  Myotis  lucifugus  microbat
pteVam  Trc06  Pteropus  vampyrus  macrobat
bosTau  Mar05  Bos  taurus  cow
susScr  Trc06  Sus  scrofa  pig
turTru  ???07  Tursiops  truncatus  dolphin
vicVic  ???07  Vicugna  vicugna  vicugna
sorAra  Wgs06  Sorex  araneus  shrew
eriEur  Wgs06  Erinaceus  europaeus  hedgehog
dasNov  May05  Dasypus  novemcinctus  armadillo
choHof  Trc06  Choloepus  hoffmanni  sloth
loxAfr  May05  Loxodonta  africana  elephant
proCap  Wgs06  Procavia  capensis  hyrax
echTel  Jul05  Echinops  telfairi  tenrec
monDom  Jan06  Monodelphis  domestica  opossum
macEug  Trc06  Macropus  eugenii  wallaby
ornAna  Mar07  Ornithorhynchus  anatinus  platypus
galGal  May06  Gallus  gallus  chicken
taeGut  Trc06  Taeniopygia  guttata  finch
anoCar  Feb07  Anolis  carolinensis  lizard
xenTro  Jun06  Xenopus  tropicalis  clawed_frog
danRer  Mar06  Danio  rerio  zebrafish
gasAcu  Feb06  Gasterosteus  aculeatus  stickleback
oryLat  Apr06  Oryzias  latipes  rice_fish
takRub  Oct04  Takifugu  rubripes  fugu
tetNig  Feb04  Tetraodon  nigroviridis  puffer
calMil  Wgs07  Callorhinchus  milii  elephantfish
petMar  Trc06  Petromyzon  marinus  lamprey
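
A list in this layout can be loaded into a spreadsheet with an empty annotation column, as suggested earlier. The following is a minimal sketch, not a definitive tool. It assumes the five whitespace-separated columns are code, status, genus, species and common name, and it assumes the status prefixes mean trace archive only (Trc), wgs contigs (Wgs), htgs (Htg), unknown (question marks), or otherwise a dated assembly. That reading of the status codes is inferred from the text and is not stated on this page. All function and field names are hypothetical.

```python
import csv
import io
from typing import NamedTuple


class Species(NamedTuple):
    code: str
    status: str
    genus: str
    species: str
    common: str
    stage: str


def classify(status):
    """Map a status code to a data stage (assumed interpretation)."""
    prefix = status[:3]
    if prefix == "Trc":
        return "traces"
    if prefix == "Wgs":
        return "wgs_contigs"
    if prefix == "Htg":
        return "htgs"
    if "?" in status:
        return "unknown"
    return "assembly"  # e.g. Mar06, read here as an assembly date


def parse_species_table(text):
    """Parse whitespace-separated rows; skip lines with too few fields."""
    records = []
    for line in text.splitlines():
        fields = line.split(None, 4)
        if len(fields) < 5:
            continue
        code, status, genus, species, common = fields
        # Tolerate a common name that contains a space.
        common = common.strip().replace(" ", "_")
        records.append(
            Species(code, status, genus, species, common, classify(status))
        )
    return records


def to_csv(records, annotation_header="annotation"):
    """Return CSV text with an empty annotation column per species."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(Species._fields + (annotation_header,))
    for rec in records:
        writer.writerow(rec + ("",))
    return buf.getvalue()


SAMPLE = """\
homSap  Mar06  Homo  sapiens  human
ponPyg  Htg06  Pongo  pygmaeus  orang_sumatran
nomLeu  Trc07  Nomascus  leucogenys  gibbon
calJac  Wgs06  Callithrix  jacchus  marmoset
cynVol  ?????  Cynocephalus  volans  flying_lemur
oryLat  Apr06  Oryzias  latipes  rice fish
"""

if __name__ == "__main__":
    print(to_csv(parse_species_table(SAMPLE)), end="")
```

Keeping the stage as its own column makes it easy to re-sort the sheet and see which species need a fresh look at the trace archive before submission.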

A coding indel example (a coding exon from gene SPC25 on human chr2) illustrates the usefulness of multiple genomes in timing and understanding evolution of insertions and deletions.

homSap	MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
panTro	MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
ponPyg	MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
macMul	MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
calJac	MVEDELALFDKSLNEFWNKFKST--DTTFQMAGLRDTYKDSLKAFA
tarSyr	MVEDELTLFDKSINEFWNKFKST--DTANQMMGLRDTYKDSVKAFA
otoGar	MVEDQLALLDKNINEFWNKFKST--DTAGQMAGLRDTYKDSIKTFA
micMur	MVEDELVLFDKTVNEFWNKFKST--DTSCHMVGLRDTYKDSLKAFA
cynVol	.................NKFTST--DTSCQMMGLRGTNK.......
tupBel	MVEDELALFDKGINEFWNKFRSTVSDTSCQMVGLRDAYKDSIKAFA
musMus	MGEDELALLNQSINEFGDKFRNRLDDNHSQVLGLRDAFKDSMKAFS
ratNor	MGEDELAAFEKSINEFGDKFRYRLSDNRSQVLGLKDAFKDSIRALS
cavPor	MVEDELALFDKSINEFGNKFRNTLSDTPCQMLGLRDACKDSIKTLA
speTri	MMEDELARFDKSINEFGNKFRNTFSDTRCQMVGLRDVFKDSIEALA
dipOrd	MVEDELAHFDKSISEFGSKFRNTLSDTPSQTVGLRDAYKDSIKALS
oryCun	MVEDELALFDKSINEFGSKFRSTLSDAPCQMVGLRDAYKDSVKSLT
ochPri	MVEDELALFDKSINEFGSKFRSTLSDTPCQMVGLREACKDSVRLLT
canFam	MIDDELAQFDKSISEFWSKFKGTVSDTSSQMVGLRETYKDSIKACA
felCat	MIEDELALFDKSINEFWNKFKSTLSDTSCQMMGLRDTYKDSIKALT
equCab	MVEDELALFDKSINEFWNKFKNTVSDTSCQMVGLRDAYKDSIKAFA
myoLuc	MVEDELALLDKNINEFWNKFKSNVNDTSCQMVGLRDNYKDISKAFT
pteVam	MVEDELALLDKSINEFWNKFKSSVSDTSCQMMALRDSYKDINKAFT
bosTau	MVEDELALFDKSINEFWNKFKSTVSDTSCQMVGLRETYKDSIKAFA
turTru	MVEDELALFDKSINEFWNKFRSTVSDTSCQMVGLRDTYKDSIKAFA
susScr	MVEDELALFDKSINEFWNRFKSTVSDTSCQMVGLRENYKDSLKAFA
oviAri	MVEDELALFDKSLNEFWNKFKSTVNDTSCQMVGLREAYKDSIKAFA
eriEur	MVEDELALFDKSINEFWNKFKGTVSDTSFQMVGLRDTYKDSIKIFT
sorAra	MVEDELVLFEKSINEFVNEFESTASDTTCQVVGPRDADKDSIKALA
dasNov	MIEDELALFDKSINEFWNKFKGTVSDNSCQMVGLRDTYKDSIKAFA
choHof	MIEDELALFDKSINEFWNKFKSAVSDTSCQMVGLRDTYKDSIKAFA
loxAfr	MIEDELVQFDKSINEFWNKFINTASDTSCQMVGLRDAYKDSMKAFA
proCap	MIEDELRQFDKSINEFWNKFINTTSDTSCQMAGLRDAYKDSMKAFA
echTel	MIEDELLQFDKSMNEFRNKHFNTLNDTSGQMMGLRDTYRDSMKAFA
monDom	MSHIKTEEELDLFNKSINDFWNKFRNTTLNEHCSQMVGLRDTYKDSIEALT
macEug	MSHIKTEEELDIFEKSISDFWNRFRNTAFNEPYSQVVGVRDTYKYSIETLT
triVul	MSHIKTEEELDIFNKSINDFWNRFRNTTFNEHYSQVVGLRDTYKNSIEALT
ornAna	MSHIKTEEELALFDKSIDEFWTKFKNTWISEYSCQTVTLRDAHKEAIKALT
galGal	MSAVKTEDEITVVEREMKEFWTELKSVYGTEQINQTLALRDSCKESINVLS
taeGut	MGNAQAEDEVALFEKDMKEFWIQFKISYGTEQNNQTMKEFWIQFKISYGTE
anoCar	MAKAKEEDELTMLEKGIEELCTQIETTYCRQSLEKTSGPRNKCYKSGPRNK
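
Reading the indel state of each species off an alignment like the one above can be automated. The sketch below is an illustration under stated assumptions: it takes the gap columns from a reference row (homSap here), treats "." as missing data, and sets aside any row whose length differs from the reference, since such rows are not in register with it. The function names and the state labels are hypothetical.

```python
def parse_alignment(text):
    """Return {species_code: sequence} from 'code<whitespace>sequence' lines."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            rows[parts[0]] = parts[1]
    return rows


def indel_states(alignment, ref="homSap"):
    """Classify each row at the columns where the reference has a gap."""
    ref_row = alignment[ref]
    cols = [i for i, ch in enumerate(ref_row) if ch == "-"]
    if not cols:
        raise ValueError("reference row has no gap columns")
    states = {}
    for name, row in alignment.items():
        if len(row) != len(ref_row):
            states[name] = "unaligned"   # not in register with the reference
            continue
        segment = "".join(row[i] for i in cols)
        if "." in segment:
            states[name] = "missing"     # no data at the indel site
        elif set(segment) == {"-"}:
            states[name] = "gap"
        else:
            states[name] = "residues:" + segment
    return states


SAMPLE_ALIGNMENT = """\
homSap	MVEDELALFDKSINEFWNKFKST--DTSCQMAGLRDTYKDSIKAFA
cynVol	.................NKFTST--DTSCQMMGLRGTNK.......
tupBel	MVEDELALFDKGINEFWNKFRSTVSDTSCQMVGLRDAYKDSIKAFA
musMus	MGEDELALLNQSINEFGDKFRNRLDDNHSQVLGLRDAFKDSMKAFS
monDom	MSHIKTEEELDLFNKSINDFWNKFRNTTLNEHCSQMVGLRDTYKDSIEALT
"""

if __name__ == "__main__":
    for name, state in indel_states(parse_alignment(SAMPLE_ALIGNMENT)).items():
        print(name, state)
```

Mapping the resulting states onto a species tree is what allows the indel to be timed; the code only tabulates the states and does not decide between insertion and deletion.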