Ensembl minimum install

From genomewiki
Revision as of 13:26, 13 September 2010 by Max

Load coordinate systems into tables seq_region and coord_system

The files needed for this example are in File:EnsemblWorkshopFiles.tar.gz. The examples are copied from the workshop documentation prepared by Bronwen Aken and Jan Vogel, with some notes added by me. The original documentation can be found on Jan Vogel's homepage and in the doc/ directory of the tar archive referenced above.

You need the FASTA and AGP files for an assembly. Ensembl supports multiple coordinate systems: any piece of DNA can be referenced by its chromosomal location (1:1000), its super_contig location (NT_039500:1-1000), or coordinates in another system.

Coordinate systems have a "rank" of importance (rank 1 is the highest, which is why chromosome gets rank 1 below) and a "version", so the database can hold several assemblies of the same contigs, and annotations based on different assembly versions can be loaded side by side.
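The location notation used above (name:start-end) is easy to take apart in a script. The following is only a minimal sketch in plain POSIX shell; the function name parse_location is made up for this page and is not part of Ensembl. It assumes that a location without a range, such as 1:1000, means a single position, so start and end come out equal.

```shell
# Hypothetical helper: split "name:start-end" into name, start and end
# using only POSIX parameter expansion.
parse_location() {
    loc=$1
    name=${loc%%:*}        # text before the first ':'
    range=${loc#*:}        # text after the first ':'
    start=${range%%-*}     # text before the '-' (whole range if no '-')
    end=${range#*-}        # text after the '-'  (whole range if no '-')
    printf '%s\t%s\t%s\n' "$name" "$start" "$end"
}

# Prints the three parts separated by tabs.
parse_location "NT_039500:1-1000"
parse_location "1:1000"
```

The same name can exist in more than one coordinate system, so a real lookup also needs the coordinate system name (and possibly its version) next to the location string.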

  • set a little shortcut:
 export DBSPEC="-dbhost 127.0.0.1 -dbuser ens-training -dbport 3306 -dbname mouse37_mini_ref -dbpass workshop"
  • Create an empty database named mouse37_mini_ref and populate it with the CORE schema:
 mysql -uens-training -pworkshop -h127.0.0.1 -P3306 -D mouse37_mini_ref < $HOME/cvs_checkout/ensembl/sql/table.sql
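  • Load coordinates and actual sequences into the empty core database ($PS is not defined on this page; it presumably points to the directory holding load_seq_region.pl, e.g. the ensembl-pipeline scripts directory used for load_agp.pl below):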
  • chromosome -> super_contig mappings:

perl $PS/load_seq_region.pl $DBSPEC -coord_system_name chromosome -coord_system_version NCBIM37 -rank 1 -default_version -agp_file $HOME/workshop/genebuild/assembly/mini_chr_contig.agp

  • super_contig -> contig mappings:

perl $PS/load_seq_region.pl $DBSPEC -coord_system_name supercontig -default_version -rank 2 -coord_system_version NCBIM37 -agp_file $HOME/workshop/genebuild/assembly/mini_supercontig_contig.agp -verbose

  • See what's going on with:
 select * from seq_region;
 select * from coord_system;
 select * from dna;
  • contigs (only this command loads sequences into the "dna" table, because it is the only one given -sequence_level and a FASTA file):
 perl $PS/load_seq_region.pl $DBSPEC -coord_system_name contig -default_version -rank 3 -sequence_level -coord_system_version NCBIM37 -fasta_file /home/ensembl/workshop/genebuild/assembly/clones_finished_mini.fa
  • clones:

perl $PS/load_seq_region.pl $DBSPEC -coord_system_name clone -default_version -coord_system_version NCBIM37 -rank 4 -agp_file /home/ensembl/workshop/genebuild/assembly/mini_clone_contig.agp

  • See what's going on with:
 select * from seq_region;
 select * from coord_system;
 select * from dna;
  • Delete version numbers from coord_system table for contig and clone:
 select * from coord_system ;
 update coord_system set version=NULL where name ='clone' ;
 update coord_system set version=NULL where name ='contig' ;
 select * from coord_system ;

+-----------------+-------------+---------+------+--------------------------------+
| coord_system_id | name        | version | rank | attrib                         |
+-----------------+-------------+---------+------+--------------------------------+
|               1 | chromosome  | NCBIM37 |    1 | default_version                |
|               2 | supercontig | NCBIM37 |    2 | default_version                |
|               3 | contig      | NULL    |    3 | default_version,sequence_level |
|               4 | clone       | NULL    |    4 | default_version                |
+-----------------+-------------+---------+------+--------------------------------+
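The schema-loading command near the top of this section passes -D mouse37_mini_ref to mysql, which only works if that database already exists. The workshop text does not show how the empty database was created, so the following is a sketch under that assumption, using only standard mysql client options and the connection settings from the DBSPEC shortcut. The variable names are chosen for this example. The block only prints the command so it can be reviewed first.

```shell
# Connection settings copied from the DBSPEC shortcut in this section.
DBHOST=127.0.0.1
DBPORT=3306
DBUSER=ens-training
DBPASS=workshop
DBNAME=mouse37_mini_ref

# Standard mysql client flags. There is deliberately no -D here,
# because the database does not exist yet.
MYSQL_ARGS="-u$DBUSER -p$DBPASS -h$DBHOST -P$DBPORT"
CREATE_SQL="CREATE DATABASE $DBNAME"

# Print the command for review. To execute it, run:
#   mysql $MYSQL_ARGS -e "$CREATE_SQL"
echo "mysql $MYSQL_ARGS -e \"$CREATE_SQL\""
```

Once the database exists, the table.sql import can be run against it as shown in the list above.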

Load assembly information into MySQL tables assembly and meta

  • chromosome to contig mapping:
 perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name chromosome -component_name contig -agp_file  /home/ensembl/workshop/genebuild/assembly/mini_chr_contig.agp 
  • supercontig to contig (ignore all warnings):
 perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name supercontig -component_name contig -agp_file  /home/ensembl/workshop/genebuild/assembly/mini_supercontig_contig.agp 
  • check what's going on:
 select * from assembly;

+-------------------+-------------------+-----------+-----------+-----------+---------+-----+
| asm_seq_region_id | cmp_seq_region_id | asm_start | asm_end   | cmp_start | cmp_end | ori |
+-------------------+-------------------+-----------+-----------+-----------+---------+-----+
|                 1 |                14 | 129260521 | 129429106 |     41261 |  209846 |   1 |
|                 2 |                17 |  69703858 |  69950060 |      2001 |  248203 |   1 |
|                 3 |                13 |  94665450 |  94866948 |     22953 |  224451 |   1 |
|                 4 |                16 |  21549325 |  21662718 |      2001 |  115394 |   1 |
|                 4 |                16 |  21662719 |  21672889 |    115471 |  125641 |   1 |
|                 5 |                15 |  81038621 |  81119472 |         1 |   80852 |  -1 |
|                 6 |                18 |   3208471 |   3436586 |      2001 |  230116 |   1 |
|                 7 |                15 |  23370008 |  23450859 |         1 |   80852 |  -1 |
|                 8 |                14 |  81573483 |  81742068 |     41261 |  209846 |   1 |
|                 9 |                13 |  44047803 |  44249301 |     22953 |  224451 |   1 |
|                10 |                16 |  10375984 |  10489377 |      2001 |  115394 |   1 |
|                10 |                16 |  10489378 |  10499548 |    115471 |  125641 |   1 |
|                11 |                17 |  35208736 |  35454938 |      2001 |  248203 |   1 |
|                12 |                18 |    208471 |    436586 |      2001 |  230116 |   1 |
+-------------------+-------------------+-----------+-----------+-----------+---------+-----+

The column asm_seq_region_id links to the seq_region table and refers to the assembled sequence (longer sequence). The column cmp_seq_region_id links to the seq_region table and refers to the component sequence (shorter sequence).

  • clone to contig (ignore all warnings)
 perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name clone -component_name contig -agp_file  /home/ensembl/workshop/genebuild/assembly/mini_clone_contig.agp
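To see how an AGP row relates to the columns of the assembly table shown above, here is a rough sketch in shell and awk. It is not the implementation of load_agp.pl. The function name agp_to_assembly and the sequence names chrA, chrB, contigA and contigB are invented for illustration; only the coordinates are borrowed from the table. It assumes the standard tab-separated AGP column order (object, object_beg, object_end, part_number, component_type, component_id, component_beg, component_end, orientation) and prints sequence names where the real table stores seq_region ids.

```shell
# Hypothetical helper: turn AGP component rows into
#   asm_name asm_start asm_end cmp_name cmp_start cmp_end ori
agp_to_assembly() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }                       # skip comment lines
        $5 == "N" || $5 == "U" { next }     # skip gap rows
        {
            ori = ($9 == "-") ? -1 : 1      # AGP "+"/"-" becomes 1/-1
            print $1, $2, $3, $6, $7, $8, ori
        }'
}

# Two made-up AGP rows: one forward and one reverse component.
printf 'chrA\t129260521\t129429106\t1\tF\tcontigA\t41261\t209846\t+\nchrB\t81038621\t81119472\t1\tF\tcontigB\t1\t80852\t-\n' |
    agp_to_assembly
```

A loader working against the real schema would additionally look up each name in seq_region to obtain asm_seq_region_id and cmp_seq_region_id.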

Some nice additions

  • As assemblies can contain references to external databases, we load a list of default references:
 perl $HOME/cvs_checkout/ensembl/misc-scripts/external_db/update_external_dbs.pl $DBSPEC -nonreleasemode -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/external_db/external_dbs.txt
  • Contigs that are not located on chromosomes but exist by themselves need the toplevel attribute set. For this, we first need to define the attribute "toplevel", then link it to the unmapped contigs. The command below does the first step: it uploads the attribute type definitions.
 perl $HOME/cvs_checkout/ensembl/misc-scripts/attribute_types/upload_attributes.pl  $DBSPEC -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/attribute_types/attrib_type.txt
  • Ensembl has a system to track sequences that are not mapped at all (ESTs, cDNAs, contigs, etc.), so we also populate the "unmapped_reason" table, though you can skip this.
 perl $HOME/cvs_checkout/ensembl/misc-scripts/unmapped_reason/update_unmapped_reasons.pl $DBSPEC -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/unmapped_reason/unmapped_reason.txt