The Ensembl Browser: Difference between revisions

From genomewiki
Revision as of 11:25, 27 October 2009

I am trying to learn how Ensembl is structured. As Ensembl itself has no wiki, forum or public mailing list for user discussions, I'll document it here.

Some ideas:

  • Everything is stored in MySQL databases. There are no flat text files. Database schema documentation: http://www.ensembl.org/info/docs/api/core/core_schema.html
  • The data can be accessed via the Perl API (slow), via biomart.org (similar to a table browser; fast and convenient) or via direct SQL queries (the schema is very complex).
  • A complete update of everything is done every 6 months. The old code, the old API and all databases are archived. Different MySQL servers running on different ports are used to separate older archived versions from the current one.
  • The current version (Oct 09) is 56.
  • Usually, each species has its own database, as in the UCSC browser. The current human one is 'homo_sapiens_core_56_37a'.
  • The web interface is called "webcode". It is written in Perl and makes extensive use of inheritance (uh-oh), so tool support for reading the code might be helpful.
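
To make the database naming concrete, here is a minimal Python sketch that splits a name like 'homo_sapiens_core_56_37a' into its parts. It assumes the pattern species_group_release_assembly, which I am inferring from the single example above. The field names and the function `parse_db_name` are my own labels, not anything defined by Ensembl.

```python
from typing import NamedTuple


class EnsemblDbName(NamedTuple):
    """Parts of a database name (field names are my own labels)."""
    species: str
    group: str
    release: int
    assembly: str


def parse_db_name(name: str) -> EnsemblDbName:
    """Split a name such as 'homo_sapiens_core_56_37a' into its parts.

    Assumes the layout species_group_release_assembly.
    """
    # Split from the right so that underscores inside the species
    # name ('homo_sapiens') are left alone.
    parts = name.rsplit("_", 3)
    if len(parts) != 4 or not parts[2].isdigit():
        raise ValueError(f"unexpected database name: {name!r}")
    species, group, release, assembly = parts
    return EnsemblDbName(species, group, int(release), assembly)


if __name__ == "__main__":
    print(parse_db_name("homo_sapiens_core_56_37a"))
```

Splitting from the right means the sketch does not need to know how many words a species name has. Databases that do not follow the assumed pattern raise a ValueError instead of being parsed incorrectly.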

The tables:

  • Sequences can be accessed using different "coordinate systems", e.g. you can type in a chromosome location or, alternatively, a contig location. Both will be mapped to chromosome sequences. The coordinate systems are set up in the table 'coord_system'.
  • The sequences themselves are stored in the table 'dna' and information about them in 'seq_region'. There is a table 'dnac' for compressed sequences, but it is empty.
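
How the three tables fit together can be sketched with a toy stand-in. The Python example below builds a tiny in-memory SQLite database (not the real MySQL server) with simplified versions of 'coord_system', 'seq_region' and 'dna', then fetches a subsequence by joining them. The column subset, the region name 'toy_contig_1', the sequence and the helper functions are all assumptions for illustration. The real schema has more columns and more tables.

```python
import sqlite3


def build_toy_core_db() -> sqlite3.Connection:
    """Create an in-memory stand-in with a simplified, assumed column set."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE coord_system (
            coord_system_id INTEGER PRIMARY KEY,
            name TEXT
        );
        CREATE TABLE seq_region (
            seq_region_id INTEGER PRIMARY KEY,
            name TEXT,
            coord_system_id INTEGER,
            length INTEGER
        );
        CREATE TABLE dna (
            seq_region_id INTEGER PRIMARY KEY,
            sequence TEXT
        );
        -- Made-up toy data, not real Ensembl content.
        INSERT INTO coord_system VALUES (1, 'contig');
        INSERT INTO seq_region VALUES (10, 'toy_contig_1', 1, 8);
        INSERT INTO dna VALUES (10, 'ACGTACGT');
        """
    )
    return conn


def fetch_subsequence(conn, coord_system, region, start, end):
    """Return bases start..end (1-based, inclusive) of a named region."""
    row = conn.execute(
        """
        SELECT SUBSTR(d.sequence, ?, ?)
        FROM dna d
        JOIN seq_region sr ON sr.seq_region_id = d.seq_region_id
        JOIN coord_system cs ON cs.coord_system_id = sr.coord_system_id
        WHERE cs.name = ? AND sr.name = ?
        """,
        (start, end - start + 1, coord_system, region),
    ).fetchone()
    if row is None:
        raise KeyError(f"no sequence for {coord_system}:{region}")
    return row[0]


if __name__ == "__main__":
    conn = build_toy_core_db()
    print(fetch_subsequence(conn, "contig", "toy_contig_1", 3, 6))
```

The sketch only covers a region whose sequence sits directly in 'dna'. Mapping between coordinate systems (for example chromosome to contig) is not shown here.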

The pipeline:

  • Their pipeline system also inserts its jobs into a MySQL database.
  • The genebuild step predicts genes.
  • The xref step connects predicted genes to external identifiers.
  • The compara step aligns all genomes and predicted genes and then builds phylogenetic trees for all proteins.
  • The biomart step de-normalizes all databases for faster access. (It seems that biomart is not archived. If this is true, one cannot rely on it for whole-genome work, as one might end up with inconsistent data.)
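
To illustrate what "de-normalizing" means in the last step, here is a small Python sketch using an in-memory SQLite database. It joins two normalized toy tables into one flat table that can be read without any joins. This is only the general idea, not how the biomart step is actually implemented. The table names, column names and data are all hypothetical.

```python
import sqlite3


def build_flat_table() -> list:
    """Join two toy tables into one flat table and return its rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Normalized toy tables (hypothetical columns and data).
        CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE transcript (
            transcript_id INTEGER PRIMARY KEY,
            gene_id INTEGER,
            name TEXT
        );
        INSERT INTO gene VALUES (1, 'toy_gene_A'), (2, 'toy_gene_B');
        INSERT INTO transcript VALUES
            (11, 1, 'toy_tx_A1'),
            (12, 1, 'toy_tx_A2'),
            (21, 2, 'toy_tx_B1');

        -- De-normalized copy: one row per transcript, gene name repeated.
        CREATE TABLE gene_transcript_flat AS
            SELECT g.name AS gene_name, t.name AS transcript_name
            FROM gene g
            JOIN transcript t ON t.gene_id = g.gene_id;
        """
    )
    return conn.execute(
        "SELECT gene_name, transcript_name "
        "FROM gene_transcript_flat ORDER BY transcript_name"
    ).fetchall()


if __name__ == "__main__":
    for row in build_flat_table():
        print(row)
```

The flat table trades storage (the gene name is repeated per transcript) for simpler and faster reads. It is a copy of the normalized data, so it has to be rebuilt whenever the source tables change.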