The Ensembl Browser: Difference between revisions

Revision as of 11:25, 27 October 2009

I am trying to learn how Ensembl is structured. As Ensembl itself does not have a wiki nor a forum nor a public mailing list for user discussions, I'll document it here.

Some ideas:

Everything is stored in MySQL databases; there are no flat text files. Database schema documentation: http://www.ensembl.org/info/docs/api/core/core_schema.html
The data can be accessed via the Perl API (slow), via biomart.org (roughly a table browser; fast and convenient) or via direct SQL queries (the schema is very complex).
A complete update of everything is done every 6 months. The old code, the old API and all databases are archived. Different MySQL servers running on different ports are used to separate older, archived versions from the current one.
The current version (Oct 2009) is 56.
Usually, each species has its own database, as in the UCSC browser. The current human one is 'homo_sapiens_core_56_37a'.
The web interface is called "webcode". It is written in Perl and makes extensive use of inheritance (uh-oh), so tool support for reading the code might be helpful.
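
The database name quoted above seems to pack several pieces of information into one string. Here is a minimal sketch, assuming the layout is species, database type, release number and assembly suffix, in that order. The function name `parse_db_name` and the field names are made up for illustration and are not part of any Ensembl API.

```python
import re
from typing import NamedTuple


class EnsemblDbName(NamedTuple):
    species: str   # e.g. "homo_sapiens"
    db_type: str   # e.g. "core"
    release: int   # e.g. 56
    assembly: str  # e.g. "37a"


# Assumed layout: <species>_<type>_<release>_<assembly>, where the species
# consists of at least two lowercase words joined by underscores.
_DB_NAME = re.compile(
    r"^(?P<species>[a-z]+(?:_[a-z]+)+)_"
    r"(?P<db_type>[a-z]+)_"
    r"(?P<release>\d+)_"
    r"(?P<assembly>\w+)$"
)


def parse_db_name(name: str) -> EnsemblDbName:
    """Split a database name into its (assumed) components."""
    m = _DB_NAME.match(name)
    if m is None:
        raise ValueError(f"unexpected database name: {name!r}")
    return EnsemblDbName(
        species=m["species"],
        db_type=m["db_type"],
        release=int(m["release"]),
        assembly=m["assembly"],
    )
```

Calling `parse_db_name("homo_sapiens_core_56_37a")` gives species `homo_sapiens`, type `core`, release `56` and assembly suffix `37a`. Names that do not follow the assumed layout raise a `ValueError` instead of being guessed at.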

The tables:

Sequences can be accessed using different "coordinate systems", e.g. you can type in a chromosome location or alternatively a contig location. Both will be mapped to chromosome sequences. The coordinate systems are set up in the table 'coord_system'.
The sequences themselves are stored in the table 'dna' and information about them in 'seq_region'. There is a table 'dnac' for compressed sequences, but it is empty.
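
To make the relationship between these three tables concrete, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for MySQL. The column sets are cut down to a handful that I believe matter here, and all rows are invented toy data, so treat it as an illustration of the join and not as the real schema. The helper names `build_toy_db` and `fetch_subsequence` are hypothetical.

```python
import sqlite3
from typing import Optional


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory database with simplified, toy versions of the tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE coord_system (
            coord_system_id INTEGER PRIMARY KEY,
            name            TEXT NOT NULL      -- e.g. 'chromosome' or 'contig'
        );
        CREATE TABLE seq_region (
            seq_region_id   INTEGER PRIMARY KEY,
            name            TEXT NOT NULL,
            coord_system_id INTEGER NOT NULL REFERENCES coord_system,
            length          INTEGER NOT NULL
        );
        CREATE TABLE dna (
            seq_region_id   INTEGER PRIMARY KEY REFERENCES seq_region,
            sequence        TEXT NOT NULL
        );
        INSERT INTO coord_system VALUES (1, 'chromosome'), (2, 'contig');
        INSERT INTO seq_region VALUES (10, 'toy_chr', 1, 12), (20, 'toy_contig', 2, 12);
        -- Toy data: only the contig has a row in dna.
        INSERT INTO dna VALUES (20, 'ACGTACGTTTGA');
        """
    )
    return conn


def fetch_subsequence(
    conn: sqlite3.Connection, coord_system: str, region: str, start: int, end: int
) -> Optional[str]:
    """Return bases start..end (1-based, inclusive) or None if no sequence is stored."""
    row = conn.execute(
        """
        SELECT SUBSTR(d.sequence, ?, ?)
        FROM dna d
        JOIN seq_region sr   ON sr.seq_region_id = d.seq_region_id
        JOIN coord_system cs ON cs.coord_system_id = sr.coord_system_id
        WHERE cs.name = ? AND sr.name = ?
        """,
        (start, end - start + 1, coord_system, region),
    ).fetchone()
    return row[0] if row else None
```

With the toy data, `fetch_subsequence(build_toy_db(), "contig", "toy_contig", 5, 8)` returns `"ACGT"`. The sketch does not attempt the mapping between contig and chromosome coordinates described above; it only shows how a region name is resolved through 'coord_system' and 'seq_region' to a row in 'dna'.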

The pipeline:

Their pipeline system inserts jobs into a MySQL database as well.
The genebuild step predicts genes.
The xref step connects predicted genes to external identifiers.
The compara step aligns all genomes and predicted genes and then builds phylogenetic trees for all proteins.
The biomart step de-normalizes all databases for faster access. (It seems that biomart is not archived. If this is true, one cannot rely on it for whole-genome work, as one might end up with inconsistent data.)
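
The idea of keeping pipeline jobs in a database can be sketched as follows. This is not the Ensembl pipeline schema: the `job` table, its columns and the helper names are hypothetical, and `sqlite3` stands in for MySQL. The step names are the four listed above, and I am assuming they run in the order listed.

```python
import sqlite3
from typing import Optional

# Step names taken from the notes above; the ordering is an assumption.
STEPS = ["genebuild", "xref", "compara", "biomart"]


def create_queue() -> sqlite3.Connection:
    """Create a hypothetical job table and insert one pending job per step."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job ("
        " job_id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " step   TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.executemany("INSERT INTO job (step) VALUES (?)", [(s,) for s in STEPS])
    conn.commit()
    return conn


def run_next(conn: sqlite3.Connection) -> Optional[str]:
    """Mark the oldest pending job as done and return its step name."""
    row = conn.execute(
        "SELECT job_id, step FROM job"
        " WHERE status = 'pending' ORDER BY job_id LIMIT 1"
    ).fetchone()
    if row is None:
        return None  # nothing left to do
    job_id, step = row
    # A real worker would perform the step's work here before updating the row.
    conn.execute("UPDATE job SET status = 'done' WHERE job_id = ?", (job_id,))
    conn.commit()
    return step
```

Calling `run_next` repeatedly on a fresh queue yields the four step names in order and then `None`. Keeping the job state in a table means any worker that can reach the database can see what is still pending.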

@@ Line 3: / Line 3: @@
 Some ideas:
 * Everything is in mysql databases. No flat text files. [http://www.ensembl.org/info/docs/api/core/core_schema.html Database schema documentation]
-* Sequences are stored
+* The data can be accessed via the Perl API (slow), via biomart.org (roughly a table browser; fast and convenient) or via direct SQL queries (the schema is very complex).
+* A complete update of everything is done every 6 months. The old code, the old API and all databases are archived. Different MySQL servers running on different ports are used to separate older, archived versions from the current one.
+* The current version (Oct 2009) is 56.
+* Usually, each species has its own database, as in the UCSC browser. The current human one is 'homo_sapiens_core_56_37a'.
+* The web interface is called "webcode". It is written in Perl and makes extensive use of inheritance (uh-oh), so tool support for reading the code might be helpful.
+The tables:
+* Sequences can be accessed using different "coordinate systems", e.g. you can type in a chromosome location or alternatively a contig location. Both will be mapped to chromosome sequences. The coordinate systems are set up in the table 'coord_system'.
+* The sequences themselves are stored in the table 'dna' and information about them in 'seq_region'. There is a table 'dnac' for compressed sequences, but it is empty.
+The pipeline:
 * Their pipeline system inserts jobs into a MySQL database as well.
+* The genebuild step predicts genes.
+* The xref step connects predicted genes to external identifiers.
+* The compara step aligns all genomes and predicted genes and then builds phylogenetic trees for all proteins.
+* The biomart step de-normalizes all databases for faster access. (It seems that biomart is not archived. If this is true, one cannot rely on it for whole-genome work, as one might end up with inconsistent data.)
