The Ensembl Browser
I am trying to learn how Ensembl is structured. These notes are partially based on a workshop in 2010 at the EBI.
- Everything is in MySQL databases. No flat text files.
- Everything is programmed in Perl (except the UCSC programs?)
- Main documentation start page
- Bert Overduin's homepage has a list of all slides and exercises - very handy!
- Parts of source code:
- "core": genome database and related tools
- "pipeline": the job scheduling system + config files
- "analysis": all genome annotation tools and wrappers
- Subdirectories of source parts:
- "scripts": command-line tools (mostly Perl)
- "modules": Perl modules for command-line tools
- "sql": database schemas for scripts (remember that everything reads/writes to MySQL)
Pipeline / scheduling system
- Job descriptions, input data descriptions and commands are written to MySQL; the cluster writes the results back to MySQL
- Each node will only extract part of the
- the paper has a rough general outline
- Basic schema is in ensembl-pipeline/sql/table.sql with some documentation
- The most useful documentation is in CVS pipeline-docs
- The genebuild step is predicting genes
- The xref step is connecting predicted genes to external identifiers
- The compara step is aligning all genomes and predicted genes and then building phylogenetic trees for all proteins
- The biomart step is de-normalizing all databases for faster access (All older biomart versions are accessible via the archived old ensembl versions)
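The first bullet of this section describes a database-backed job queue: jobs go into the database, workers pick them up and write results back. Here is a toy sketch of that pattern in Python, with an in-memory SQLite database standing in for MySQL. The table `toy_job`, its columns, the `count_gc` command and the input sequences are all hypothetical and are not the tables defined in ensembl-pipeline/sql/table.sql.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_job ("
    " job_id INTEGER PRIMARY KEY,"
    " input_id TEXT,"                   # which piece of data to work on
    " command TEXT,"                    # what to run on it
    " status TEXT DEFAULT 'CREATED',"
    " result TEXT)"
)
conn.executemany(
    "INSERT INTO toy_job (input_id, command) VALUES (?, ?)",
    [("contig_1", "count_gc"), ("contig_2", "count_gc")],
)

# Toy input data and a toy command table (both hypothetical).
SEQUENCES = {"contig_1": "GGCCAT", "contig_2": "ATATAT"}
COMMANDS = {"count_gc": lambda seq: str(seq.count("G") + seq.count("C"))}


def run_next_job(connection):
    """Pick one waiting job, run it, write the result back; return its id or None."""
    row = connection.execute(
        "SELECT job_id, input_id, command FROM toy_job "
        "WHERE status = 'CREATED' ORDER BY job_id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    job_id, input_id, command = row
    result = COMMANDS[command](SEQUENCES[input_id])
    connection.execute(
        "UPDATE toy_job SET status = 'DONE', result = ? WHERE job_id = ?",
        (result, job_id),
    )
    return job_id


# A single "node" working through the queue until nothing is left.
while run_next_job(conn) is not None:
    pass

results = conn.execute(
    "SELECT input_id, status, result FROM toy_job ORDER BY job_id"
).fetchall()
print(results)
```

A real cluster would need locking so that two nodes cannot pick the same job; the sketch leaves that out.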
Genome data storage
- Basic schema is in ensembl/sql/table.sql, with quite a bit of documentation of the tables
- Can be accessed via the Perl API (slow), via biomart.org (~table browser, fast and convenient) or via direct SQL queries
- Database schema documentation
- The database schema is very complex. Because of self-referencing tables, whole-genome queries are not possible at reasonable speed without biomart
- An update of everything is done every 6 months. The old code, the old API and all databases are archived. Different MySQL servers running on different ports are used to separate older archived versions from current ones.
- Genes are not re-predicted each time but only when new data is added to the gene build. The starting month of the last update of a gene build is stored in genome_db.genebuild (not the month when the genebuild ended, so I don't see how you know if genes changed)
- The current version can be found with:
select * from meta where meta_key in ("schema_version", "patch");
- Usually, each species has its own database, like in the UCSC browser. The current human one is 'homo_sapiens_core_56_37a'
- The version number system is the opposite of the usual order. The last part of the version number is the MOST important part. If the last part is identical, then the data is the same. The first part is only the artificial "release" number. E.g. Human_56_37a and Human_57_37a are actually the same database, which has been assigned to two different "releases".
- The web interface is called "webcode". It is written in Perl and makes extensive use of inheritance (uh-oh), so tool support for reading the code might be helpful
- The database structure is very normalized. While this is nice from a software engineering perspective, you cannot do large-scale requests. E.g. downloading all homologs between two genomes involves queries on self-referencing tables which take ages to resolve and will time out if run on their server. Use biomart for these types of requests.
- There are still a lot of older functions lingering in the source code. If a function returns null although it shouldn't, have a look at the source code. Often these functions have been replaced by others. The ensembl-dev mailing list is a good way to get more information.
- Ensembl minimum install
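The version query on the meta table shown above can also be run from a script. This is a minimal sketch in Python that uses an in-memory SQLite database as a stand-in for a species core database, so that it runs without a server. The rows inserted here are made up for illustration, and only the meta_key and meta_value columns are modelled.

```python
import sqlite3

# SQLite stand-in for a MySQL core database (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meta (meta_key TEXT, meta_value TEXT)")
conn.executemany(
    "INSERT INTO meta (meta_key, meta_value) VALUES (?, ?)",
    [
        ("schema_version", "56"),          # made-up example row
        ("patch", "example_patch_a.sql"),  # made-up example row
        ("patch", "example_patch_b.sql"),  # made-up example row
        ("other_key", "ignored"),          # made-up row the query should skip
    ],
)


def schema_info(connection):
    """Return {meta_key: [meta_value, ...]} for the schema_version and patch rows."""
    rows = connection.execute(
        "SELECT meta_key, meta_value FROM meta "
        "WHERE meta_key IN (?, ?) ORDER BY meta_key, meta_value",
        ("schema_version", "patch"),
    ).fetchall()
    info = {}
    for key, value in rows:
        info.setdefault(key, []).append(value)
    return info


info = schema_info(conn)
print(info)
```

Against a real server the same SELECT would be sent through a MySQL client library instead of sqlite3; the connection details are not covered here.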
All versions of the genomes are on the same server. Some ideas to help you find your way:
- Database names follow the schema <species>_<databaseType>_<releaseNumber>_<assemblyNumber>_<ChangesSinceLastAnnotation>
- assembly number is the assembly version number from NCBI in the case of human
- release is updated every 2 months
- e.g. Homo_sapiens_core_59_37d
- e.g. Homo_sapiens_core_58_37d
- e.g. Homo_sapiens_core_57_37d
- As you can see, the last letter is the most important one - you can see that there are no changes at all to the human annotations from 57 to 59, as the final "d" has not changed!
- ensembl_compara includes homologies between proteins and genomes
- ensembl_go_version: Not used anymore? Was used to store gene ontology links.
- ensembl_website_version: Ensembl includes some sort of content management system. This database includes help articles, bugs, news, the list of species on the frontpage etc. (This database looks somewhat similar to hgcentral)
- ensembl_ancestral_version ??
- Sequences can be accessed using different "coordinate systems", e.g. you can type in a chromosome location or alternatively a contig location. Both will be mapped to chromosome sequences. They are set up in the table 'coord_system'
- The sequences themselves are stored in the table 'dna' and information about them in 'seq_region'. There is a table 'dnac' for compressed sequences but it's empty.
- Genes are linked to synonyms/names via xref-tables.
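The database naming convention described above can be checked mechanically. This is a minimal Python sketch that splits a name into its parts and tests whether two names point at the same data, i.e. differ only in the release number. The names `EnsemblDbName`, `parse_db_name` and `same_data` are made up for this example. The pattern assumes the change letter is appended directly to the assembly number, as in the example names (e.g. 37d).

```python
import re
from typing import NamedTuple


class EnsemblDbName(NamedTuple):
    species: str
    db_type: str
    release: int
    assembly: str
    changes: str  # letter suffix: changes since the last annotation


# <species>_<databaseType>_<releaseNumber>_<assemblyNumber><changes>
# The species part may itself contain underscores (homo_sapiens).
_DB_NAME = re.compile(
    r"^(?P<species>[A-Za-z]+(?:_[A-Za-z]+)*)_"
    r"(?P<db_type>[a-z]+)_"
    r"(?P<release>\d+)_"
    r"(?P<assembly>\d+)(?P<changes>[a-z]*)$"
)


def parse_db_name(name: str) -> EnsemblDbName:
    """Split a database name into its parts; raise ValueError if it does not fit."""
    match = _DB_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised database name: {name!r}")
    return EnsemblDbName(
        species=match["species"].lower(),  # notes use both Homo_ and homo_
        db_type=match["db_type"],
        release=int(match["release"]),
        assembly=match["assembly"],
        changes=match["changes"],
    )


def same_data(a: str, b: str) -> bool:
    """True if the two names differ at most in the release number."""
    x, y = parse_db_name(a), parse_db_name(b)
    return (x.species, x.db_type, x.assembly, x.changes) == (
        y.species, y.db_type, y.assembly, y.changes
    )


print(parse_db_name("homo_sapiens_core_56_37a"))
print(same_data("Homo_sapiens_core_57_37d", "Homo_sapiens_core_59_37d"))
```

Databases without an assembly part in their name, such as ensembl_compara, do not fit this pattern and will raise a ValueError.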