Ensembl minimum install

From genomewiki

Load coordinate systems into tables seq_region and coord_system

The files needed for this example are in File:EnsemblWorkshopFiles.tar.gz. The examples are copied and pasted from the workshop documentation.

You need the FASTA and AGP files for an assembly. Ensembl supports multiple coordinate systems: any piece of DNA can be referenced by its chromosomal location (1:1000), its supercontig location (NT_039500:1-1000), or other coordinates.

Coordinate systems have a "rank" of importance (rank 1 is the highest level; in the commands below chromosome gets rank 1 and clone rank 4) and a "version". The version lets one database hold several possible assemblies of the same contigs, so annotations based on different assembly versions can be loaded side by side.
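The AGP files used below describe how pieces in one coordinate system are built from pieces in another. As a minimal sketch, assuming the standard nine-column tab-separated AGP layout (object, start, end, part number, component type, component ID, component start, component end, orientation), one line can be read like this. The line itself (chr1, contig_A and the coordinates) is a made-up illustration, not taken from the workshop files, and describe_agp_line is a hypothetical helper.

```shell
# Hypothetical AGP line (tab-separated); the values are illustrative only.
agp_line=$(printf 'chr1\t1\t1000\t1\tF\tcontig_A\t501\t1500\t+')

# Print which stretch of the assembled object is covered by which
# stretch of the component, with orientation as 1 or -1.
describe_agp_line() {
  printf '%s\n' "$1" | awk -F'\t' '{
    ori = ($9 == "-") ? -1 : 1
    printf "%s:%d-%d <- %s:%d-%d (ori %d)\n", $1, $2, $3, $6, $7, $8, ori
  }'
}

describe_agp_line "$agp_line"
```

Each such line says that a range on the assembled sequence (for example a chromosome) is made of a range on a component (for example a contig).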

  • set a little shortcut:
 export DBSPEC="-dbhost 127.0.0.1 -dbuser ens-training -dbport 3306 -dbname mouse37_mini_ref -dbpass workshop"
  • Create an empty database named mouse37_mini_ref and populate it with the CORE schema:
 mysql -uens-training -pworkshop -h127.0.0.1 -P3306 -D mouse37_mini_ref < $HOME/cvs_checkout/ensembl/sql/table.sql
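  • Load coordinates and actual sequences into the empty core database ($PS is not defined on this page; it presumably points to the directory containing load_seq_region.pl, e.g. $HOME/cvs_checkout/ensembl-pipeline/scripts):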
  • chromosome -> super_contig mappings:

perl $PS/load_seq_region.pl $DBSPEC -coord_system_name chromosome -coord_system_version NCBIM37 -rank 1 -default_version -agp_file $HOME/workshop/genebuild/assembly/mini_chr_contig.agp

  • super_contig -> contig mappings:

perl $PS/load_seq_region.pl $DBSPEC -coord_system_name supercontig -default_version -rank 2 -coord_system_version NCBIM37 -agp_file $HOME/workshop/genebuild/assembly/mini_supercontig_contig.agp -verbose

  • See what's going on with:
 select * from seq_region;
 select * from coord_system;
 select * from dna;
  • contigs (only this command, with -sequence_level and a FASTA file, loads sequences into the "dna" table):
 perl $PS/load_seq_region.pl $DBSPEC -coord_system_name contig -default_version -rank 3 -sequence_level -coord_system_version NCBIM37 -fasta_file /home/ensembl/workshop/genebuild/assembly/clones_finished_mini.fa
  • clones:

perl $PS/load_seq_region.pl $DBSPEC -coord_system_name clone -default_version -coord_system_version NCBIM37 -rank 4 -agp_file /home/ensembl/workshop/genebuild/assembly/mini_clone_contig.agp

  • See what's going on with:
 select * from seq_region;
 select * from coord_system;
 select * from dna;
  • Delete version numbers from coord_system table for contig and clone:
 select * from coord_system ;
 update coord_system set version=NULL where name ='clone' ;
 update coord_system set version=NULL where name ='contig' ;
 select * from coord_system ;

+-----------------+-------------+---------+------+--------------------------------+
| coord_system_id | name        | version | rank | attrib                         |
+-----------------+-------------+---------+------+--------------------------------+
|               1 | chromosome  | NCBIM37 |    1 | default_version                |
|               2 | supercontig | NCBIM37 |    2 | default_version                |
|               3 | contig      | NULL    |    3 | default_version,sequence_level |
|               4 | clone       | NULL    |    4 | default_version                |
+-----------------+-------------+---------+------+--------------------------------+
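The SQL checks in this section are typed into a mysql client, whose flags (-h, -u, -P, -D, -p) differ from the -dbhost/-dbuser style options stored in DBSPEC. The following is a minimal sketch, assuming a bash or POSIX sh shell, of deriving the mysql flags from the same option string. dbspec_to_mysql is a hypothetical helper, not part of Ensembl.

```shell
# Same value as the DBSPEC shortcut set at the start of this section.
DBSPEC="-dbhost 127.0.0.1 -dbuser ens-training -dbport 3306 -dbname mouse37_mini_ref -dbpass workshop"

# Hypothetical helper: translate Ensembl-style options into mysql client flags.
dbspec_to_mysql() {
  # Intentionally unquoted: relies on word splitting of the option string
  # (bash/sh behaviour; zsh does not split unquoted variables by default).
  set -- $1
  out=""
  while [ "$#" -ge 2 ]; do
    case "$1" in
      -dbhost) out="$out -h$2" ;;
      -dbuser) out="$out -u$2" ;;
      -dbport) out="$out -P$2" ;;
      -dbname) out="$out -D$2" ;;
      -dbpass) out="$out -p$2" ;;
    esac
    shift 2
  done
  # Strip the leading space before printing.
  printf '%s\n' "${out# }"
}

dbspec_to_mysql "$DBSPEC"
```

With this in place, a check such as mysql $(dbspec_to_mysql "$DBSPEC") -e 'select * from coord_system;' can be run without retyping the credentials.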

Load assembly information into MySQL tables assembly and meta

  • chromosome to contig mapping:
 perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name chromosome -component_name contig -agp_file  /home/ensembl/workshop/genebuild/assembly/mini_chr_contig.agp 
  • supercontig to contig (ignore all warnings):
 perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name supercontig -component_name contig -agp_file  /home/ensembl/workshop/genebuild/assembly/mini_supercontig_contig.agp 
  • check what's going on:
 select * from assembly;

+-------------------+-------------------+-----------+-----------+-----------+---------+-----+
| asm_seq_region_id | cmp_seq_region_id | asm_start | asm_end   | cmp_start | cmp_end | ori |
+-------------------+-------------------+-----------+-----------+-----------+---------+-----+
|                 1 |                14 | 129260521 | 129429106 |     41261 |  209846 |   1 |
|                 2 |                17 |  69703858 |  69950060 |      2001 |  248203 |   1 |
|                 3 |                13 |  94665450 |  94866948 |     22953 |  224451 |   1 |
|                 4 |                16 |  21549325 |  21662718 |      2001 |  115394 |   1 |
|                 4 |                16 |  21662719 |  21672889 |    115471 |  125641 |   1 |
|                 5 |                15 |  81038621 |  81119472 |         1 |   80852 |  -1 |
|                 6 |                18 |   3208471 |   3436586 |      2001 |  230116 |   1 |
|                 7 |                15 |  23370008 |  23450859 |         1 |   80852 |  -1 |
|                 8 |                14 |  81573483 |  81742068 |     41261 |  209846 |   1 |
|                 9 |                13 |  44047803 |  44249301 |     22953 |  224451 |   1 |
|                10 |                16 |  10375984 |  10489377 |      2001 |  115394 |   1 |
|                10 |                16 |  10489378 |  10499548 |    115471 |  125641 |   1 |
|                11 |                17 |  35208736 |  35454938 |      2001 |  248203 |   1 |
|                12 |                18 |    208471 |    436586 |      2001 |  230116 |   1 |
+-------------------+-------------------+-----------+-----------+-----------+---------+-----+

The column asm_seq_region_id links to the seq_region table and refers to the assembled sequence (the longer sequence). The column cmp_seq_region_id links to the seq_region table and refers to the component sequence (the shorter sequence).

  • clone to contig (ignore all warnings):
 perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name clone -component_name contig -agp_file  /home/ensembl/workshop/genebuild/assembly/mini_clone_contig.agp
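In the assembly rows shown above, each row maps a range on the assembled sequence onto a range of equal length on the component. As a minimal sanity-check sketch, the script below recomputes both lengths for four rows copied from that table and flags any row where they differ. check_assembly_rows is a hypothetical helper, and the equal-length rule is an observation about these rows rather than something stated in the workshop text.

```shell
# Four rows copied from the assembly table above, whitespace-separated:
# asm_seq_region_id cmp_seq_region_id asm_start asm_end cmp_start cmp_end ori
rows='1 14 129260521 129429106 41261 209846 1
4 16 21549325 21662718 2001 115394 1
4 16 21662719 21672889 115471 125641 1
5 15 81038621 81119472 1 80852 -1'

# Compare the length of the assembled range with the component range.
check_assembly_rows() {
  printf '%s\n' "$1" | awk '{
    asm_len = $4 - $3 + 1
    cmp_len = $6 - $5 + 1
    status = (asm_len == cmp_len) ? "ok" : "MISMATCH"
    printf "asm %s cmp %s length %d %s\n", $1, $2, asm_len, status
  }'
}

check_assembly_rows "$rows"
```

The same comparison could be applied to the output of select * from assembly; after each load_agp.pl run to spot rows that were loaded with inconsistent coordinates.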