The Ensembl Browser: Difference between revisions

Revision as of 09:26, 22 January 2010

I am trying to learn how Ensembl is structured. As Ensembl itself does not have a wiki, nor a forum, nor a public mailing list for user discussions, I'll document it here.

Some ideas:

  • Everything is in MySQL databases; there are no flat text files. See the database schema documentation.
  • The data can be accessed via the Perl API (slow), via biomart.org (roughly a table browser; fast and convenient), or via direct SQL queries (the schema is very complex).
  • An update of everything is done every 6 months. The old code, the old API and all databases are archived. Different MySQL servers running on different ports are used to separate older archived versions from the current one.
    • Genes are not re-predicted each time, but only when new data is added to the gene build. The month in which the last update of a gene build started is stored in genome_db.genebuild (not the month in which the genebuild ended, so I don't see how you can tell whether genes changed).
  • The current version (Oct 09) is 56.
  • Usually, each species has its own database, like in the UCSC browser. The current human one is 'homo_sapiens_core_56_37a'
  • The web interface is called "webcode". It is written in Perl and makes extensive use of inheritance (uh-oh), so tool support for reading the code might be helpful.
  • The database structure is highly normalized. While this is nice from a software engineering perspective, it makes large-scale requests impractical. E.g. downloading all homologs between two genomes involves queries on self-referencing tables, which take ages to resolve and will time out if run on their server. Use biomart for these types of requests.
  • There are still a lot of older functions lingering in the source code. If a function returns null although it shouldn't, have a look at the source code: often it has been replaced by another function. The ensembl-dev mailing list is usually the only way to get more information.


The databases:

All versions of the genomes are on the same server. Some ideas to help you find your way:

  • species_name_version_obscureNumber is the naming format of the individual species databases (see below)
  • ensembl_compara includes homologies between proteins and genomes
  • ensembl_go_version: Not used anymore? Was used to store gene ontology links.
  • ensembl_website_version: Ensembl includes some sort of content management system. This database contains help articles, bugs, news, the list of species on the front page, etc. (It looks somewhat similar to hgcentral.)
  • ensembl_ancestral_version ??
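
To make the naming convention concrete, here is a minimal sketch that splits a species database name such as 'homo_sapiens_core_56_37a' into its parts. The function name, the field names and the assumption that the name always ends in type, version and one trailing suffix are mine, not something taken from Ensembl's documentation; the trailing part is kept as an opaque string.

```python
from typing import NamedTuple


class SpeciesDbName(NamedTuple):
    """Parts of a species database name (field names are illustrative)."""
    species: str   # e.g. "homo_sapiens"
    db_type: str   # e.g. "core"
    version: int   # Ensembl version, e.g. 56
    suffix: str    # trailing part, e.g. "37a", kept as an opaque string


def parse_species_db_name(name: str) -> SpeciesDbName:
    """Split '<species>_<type>_<version>_<suffix>' from the right.

    Assumes the last three underscore-separated fields are type, version
    and suffix, so the species part may itself contain underscores.
    """
    parts = name.rsplit("_", 3)
    if len(parts) != 4 or not parts[2].isdigit():
        raise ValueError(f"not a species database name: {name!r}")
    species, db_type, version, suffix = parts
    return SpeciesDbName(species, db_type, int(version), suffix)


parsed = parse_species_db_name("homo_sapiens_core_56_37a")
print(parsed.species, parsed.db_type, parsed.version, parsed.suffix)
```

Names that do not follow this pattern, such as ensembl_compara, raise a ValueError, which makes it easy to filter a list of databases down to the per-species ones.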

The species database:

  • Sequences can be accessed using different "coordinate systems", e.g. you can type in a chromosome location or alternatively a contig location. Both will be mapped to chromosome sequences. The coordinate systems are set up in the table 'coord_system'.
  • The sequences themselves are stored in the table 'dna' and information about them in 'seq_region'. There is a table 'dnac' for compressed sequences, but it's empty.
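
The idea of mapping a contig location onto a chromosome can be sketched as follows. This is a toy model only: the AssemblySegment class, its fields and the contig names are hypothetical and are not read from the real Ensembl schema. It assumes 1-based inclusive coordinates and that each contig is placed on the chromosome either forward (+1) or reversed (-1).

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class AssemblySegment:
    """One contig piece placed on a chromosome (hypothetical structure)."""
    contig: str
    contig_start: int   # 1-based, inclusive
    contig_end: int     # 1-based, inclusive
    chromosome: str
    chrom_start: int    # chromosome position of the segment's leftmost base
    orientation: int    # +1 = forward, -1 = reverse-complemented


def contig_to_chromosome(
    segments: Iterable[AssemblySegment], contig: str, pos: int
) -> Tuple[str, int]:
    """Map a contig position to (chromosome, position)."""
    for seg in segments:
        if seg.contig == contig and seg.contig_start <= pos <= seg.contig_end:
            offset = pos - seg.contig_start
            if seg.orientation == 1:
                return seg.chromosome, seg.chrom_start + offset
            # Reversed contig: its first base sits at the segment's right end.
            chrom_end = seg.chrom_start + (seg.contig_end - seg.contig_start)
            return seg.chromosome, chrom_end - offset
    raise KeyError(f"{contig}:{pos} is not placed on any chromosome")


# Made-up example data: two contigs placed next to each other on chromosome "1".
segments = [
    AssemblySegment("contigA", 1, 1000, "1", 5001, 1),
    AssemblySegment("contigB", 1, 500, "1", 6001, -1),
]
print(contig_to_chromosome(segments, "contigA", 10))
print(contig_to_chromosome(segments, "contigB", 1))
```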

The pipeline:

  • Their pipeline system also inserts its jobs into a MySQL database
  • The genebuild step predicts genes
  • The xref step connects predicted genes to external identifiers
  • The compara step aligns all genomes and predicted genes and then builds phylogenetic trees for all proteins
  • The biomart step de-normalizes all databases for faster access (all older biomart versions are accessible via the archived old Ensembl versions)
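
What "de-normalizing" means in the last step can be illustrated with a small sketch. It uses an in-memory SQLite database as a stand-in for MySQL, and the three tables, their columns and all identifiers are invented for the example; they are not the real Ensembl or biomart schema. The point is only that the joins are done once, up front, so that later queries read a single wide table.

```python
import sqlite3


def build_flat_table(conn: sqlite3.Connection) -> None:
    """Create toy normalized tables and flatten them into one wide table."""
    conn.executescript(
        """
        -- Hypothetical normalized tables (not the real schema)
        CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, stable_id TEXT);
        CREATE TABLE transcript (
            transcript_id INTEGER PRIMARY KEY, gene_id INTEGER, stable_id TEXT);
        CREATE TABLE xref (
            xref_id INTEGER PRIMARY KEY, gene_id INTEGER, external_name TEXT);

        INSERT INTO gene VALUES (1, 'GENE_A'), (2, 'GENE_B');
        INSERT INTO transcript VALUES
            (1, 1, 'TRANSCRIPT_A1'), (2, 1, 'TRANSCRIPT_A2'),
            (3, 2, 'TRANSCRIPT_B1');
        INSERT INTO xref VALUES (1, 1, 'nameA');

        -- De-normalized copy: one row per transcript, joins resolved once
        CREATE TABLE gene_flat AS
        SELECT g.stable_id AS gene,
               t.stable_id AS transcript,
               x.external_name AS external_name
        FROM gene g
        JOIN transcript t ON t.gene_id = g.gene_id
        LEFT JOIN xref x ON x.gene_id = g.gene_id;
        """
    )


conn = sqlite3.connect(":memory:")
build_flat_table(conn)
rows = conn.execute(
    "SELECT gene, transcript, external_name FROM gene_flat "
    "ORDER BY gene, transcript"
).fetchall()
for row in rows:
    print(row)
```

The price of the flat table is redundancy (the gene and its external name are repeated for every transcript), which is presumably why it is rebuilt from the normalized databases rather than edited directly.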

Documentation:

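  • Most documentation is not accessible from the Ensembl homepage. You have to dig into the CVS repositories to find "pipeline_docs": http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-doc/pipeline_docs/?root=ensembl The file overview.txt (http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-doc/pipeline_docs/overview.txt?revision=1.6&root=ensembl&view=markup) gives a very good introduction.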