Ensembl minimum install
Load coordinate systems into tables seq_region and coord_system
The files needed for this example are in File:EnsemblWorkshopFiles.tar.gz. The examples are copied from the workshop documentation.
You need the FASTA and AGP files for an assembly. Ensembl supports multiple coordinate systems: any piece of DNA can be referenced by its chromosomal location (1:1000), its super_contig location (NT_039500:1-1000), or other coordinates.
Coordinate systems have a "rank" of importance (rank 1 is the highest) and a "version". The version lets one database hold several possible assemblies of the same contigs, so annotations based on different assembly versions can be loaded side by side.
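An AGP file is what ties two coordinate systems together: each non-gap row says which slice of a component (for example a contig) sits at which position of an object (for example a chromosome). As a minimal sketch, the following prints the mapping expressed by a single AGP row. The row is made up for illustration and is not taken from the workshop files, the helper name describe_agp_row is hypothetical, and the standard nine-column, tab-separated AGP layout is assumed.

```shell
# Hypothetical AGP row (tab-separated); names and numbers are illustrative only.
# Columns: object, obj_start, obj_end, part_no, type, component, cmp_start, cmp_end, orientation
agp_row=$(printf '1\t1000\t1999\t1\tF\tcontig_A\t501\t1500\t+')

# Print the mapping that a single non-gap AGP row describes.
describe_agp_row() {
    printf '%s\n' "$1" | awk -F '\t' '
        $5 != "N" && $5 != "U" {   # N and U rows describe gaps, not components
            printf "%s:%d-%d <- %s:%d-%d (%s)\n", $1, $2, $3, $6, $7, $8, $9
        }'
}

describe_agp_row "$agp_row"
```

Read the output as "positions 1000-1999 of object 1 are made of positions 501-1500 of contig_A, in forward orientation".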
- Set a little shortcut (note: no "$" in front of the variable name when assigning):
export DBSPEC="-dbhost 127.0.0.1 -dbuser ens-training -dbport 3306 -dbname mouse37_mini_ref -dbpass workshop"
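The shortcut only works because the commands below expand it unquoted, so the shell splits it into separate options. A small sketch to check this, assuming a POSIX-style shell such as sh or bash (zsh does not split unquoted variables by default); count_args is a hypothetical helper used only for illustration:

```shell
# Same value as the shortcut above.
DBSPEC="-dbhost 127.0.0.1 -dbuser ens-training -dbport 3306 -dbname mouse37_mini_ref -dbpass workshop"

# Hypothetical helper: report how many arguments it received.
count_args() { echo "$#"; }

unquoted=$(count_args $DBSPEC)     # split on whitespace: one argument per word
quoted=$(count_args "$DBSPEC")     # kept together: a single argument

echo "unquoted: $unquoted, quoted: $quoted"
```

If the scripts complain about an unknown option that looks like the whole connection string, the variable was probably passed in quotes.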
- Create an empty database named mouse37_mini_ref and populate it with the CORE schema:
mysql -uens-training -pworkshop -h127.0.0.1 -P3306 -D mouse37_mini_ref < $HOME/cvs_checkout/ensembl/sql/table.sql
- Load coordinates and actual sequences into the empty core database ($PS must point to the directory that contains load_seq_region.pl):
- chromosome -> super_contig mappings:
perl $PS/load_seq_region.pl $DBSPEC -coord_system_name chromosome -coord_system_version NCBIM37 -rank 1 -default_version -agp_file $HOME/workshop/genebuild/assembly/mini_chr_contig.agp
- super_contig -> contig mappings:
perl $PS/load_seq_region.pl $DBSPEC -coord_system_name supercontig -default_version -rank 2 -coord_system_version NCBIM37 -agp_file $HOME/workshop/genebuild/assembly/mini_supercontig_contig.agp -verbose
- See what's going on with:
select * from seq_region; select * from coord_system; select * from dna;
- contigs (only this command loads sequences into the "dna" table, since it is the only one run with -sequence_level and a FASTA file):
perl $PS/load_seq_region.pl $DBSPEC -coord_system_name contig -default_version -rank 3 -sequence_level -coord_system_version NCBIM37 -fasta_file /home/ensembl/workshop/genebuild/assembly/clones_finished_mini.fa
- clones:
perl $PS/load_seq_region.pl $DBSPEC -coord_system_name clone -default_version -coord_system_version NCBIM37 -rank 4 -agp_file /home/ensembl/workshop/genebuild/assembly/mini_clone_contig.agp
- See what's going on with:
select * from seq_region; select * from coord_system; select * from dna;
- Delete version numbers from coord_system table for contig and clone:
select * from coord_system ; update coord_system set version=NULL where name ='clone' ; update coord_system set version=NULL where name ='contig' ; select * from coord_system ;
+-----------------+-------------+---------+------+--------------------------------+
| coord_system_id | name        | version | rank | attrib                         |
+-----------------+-------------+---------+------+--------------------------------+
|               1 | chromosome  | NCBIM37 |    1 | default_version                |
|               2 | supercontig | NCBIM37 |    2 | default_version                |
|               3 | contig      | NULL    |    3 | default_version,sequence_level |
|               4 | clone       | NULL    |    4 | default_version                |
+-----------------+-------------+---------+------+--------------------------------+
Load assembly information into MySQL tables assembly and meta
- chromosome to contig mapping:
perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name chromosome -component_name contig -agp_file /home/ensembl/workshop/genebuild/assembly/mini_chr_contig.agp
- supercontig to contig (ignore all warnings):
perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name supercontig -component_name contig -agp_file /home/ensembl/workshop/genebuild/assembly/mini_supercontig_contig.agp
- check what's going on:
select * from assembly;
+-------------------+-------------------+-----------+-----------+-----------+---------+-----+
| asm_seq_region_id | cmp_seq_region_id | asm_start | asm_end   | cmp_start | cmp_end | ori |
+-------------------+-------------------+-----------+-----------+-----------+---------+-----+
|                 1 |                14 | 129260521 | 129429106 |     41261 |  209846 |   1 |
|                 2 |                17 |  69703858 |  69950060 |      2001 |  248203 |   1 |
|                 3 |                13 |  94665450 |  94866948 |     22953 |  224451 |   1 |
|                 4 |                16 |  21549325 |  21662718 |      2001 |  115394 |   1 |
|                 4 |                16 |  21662719 |  21672889 |    115471 |  125641 |   1 |
|                 5 |                15 |  81038621 |  81119472 |         1 |   80852 |  -1 |
|                 6 |                18 |   3208471 |   3436586 |      2001 |  230116 |   1 |
|                 7 |                15 |  23370008 |  23450859 |         1 |   80852 |  -1 |
|                 8 |                14 |  81573483 |  81742068 |     41261 |  209846 |   1 |
|                 9 |                13 |  44047803 |  44249301 |     22953 |  224451 |   1 |
|                10 |                16 |  10375984 |  10489377 |      2001 |  115394 |   1 |
|                10 |                16 |  10489378 |  10499548 |    115471 |  125641 |   1 |
|                11 |                17 |  35208736 |  35454938 |      2001 |  248203 |   1 |
|                12 |                18 |    208471 |    436586 |      2001 |  230116 |   1 |
+-------------------+-------------------+-----------+-----------+-----------+---------+-----+
The column asm_seq_region_id links to the seq_region table and refers to the assembled sequence (the longer sequence). The column cmp_seq_region_id links to the seq_region table and refers to the component sequence (the shorter sequence).
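To see how one row of this table is used, here is a minimal sketch that converts a position on the component sequence into the matching position on the assembled sequence. It assumes that start and end values are inclusive and that ori = -1 means the component is placed in reverse orientation; the function name cmp_to_asm is hypothetical and not part of the Ensembl scripts.

```shell
# Map a position on a component (e.g. a contig) to the assembled sequence,
# given the values of one row of the assembly table.
# Arguments: asm_start asm_end cmp_start cmp_end ori cmp_pos
cmp_to_asm() {
    asm_start=$1; asm_end=$2; cmp_start=$3; cmp_end=$4; ori=$5; pos=$6
    if [ "$pos" -lt "$cmp_start" ] || [ "$pos" -gt "$cmp_end" ]; then
        echo "position $pos is outside $cmp_start-$cmp_end" >&2
        return 1
    fi
    if [ "$ori" -eq 1 ]; then
        # Forward: offsets run in the same direction on both sequences.
        echo $(( asm_start + pos - cmp_start ))
    else
        # Reverse: the first component base lands on the last assembled base.
        echo $(( asm_end - (pos - cmp_start) ))
    fi
}

# First row of the table above (forward orientation)
cmp_to_asm 129260521 129429106 41261 209846 1 41261
# Sixth row of the table above (reverse orientation)
cmp_to_asm 81038621 81119472 1 80852 -1 1
```

With the forward row, the first mapped contig base (41261) lands on asm_start; with the reverse row, contig base 1 lands on asm_end.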
- clone to contig (ignore all warnings):
perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name clone -component_name contig -agp_file /home/ensembl/workshop/genebuild/assembly/mini_clone_contig.agp
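Since the loader warnings are ignored, a quick sanity check on the loaded mappings may be useful: in every assembly row, the assembled span and the component span should cover the same number of bases. The sketch below runs that check on three rows copied from the assembly table shown earlier; the function name count_length_mismatches is hypothetical.

```shell
# Rows copied from the assembly table shown earlier:
# asm_seq_region_id cmp_seq_region_id asm_start asm_end cmp_start cmp_end ori
rows='1 14 129260521 129429106 41261 209846 1
4 16 21662719 21672889 115471 125641 1
5 15 81038621 81119472 1 80852 -1'

# Count rows whose assembled and component spans differ in length.
count_length_mismatches() {
    printf '%s\n' "$1" | awk '($4 - $3) != ($6 - $5) { bad++ } END { print bad + 0 }'
}

count_length_mismatches "$rows"
```

A result of 0 means every row is consistent. To check the whole table, the rows could be taken from the mysql client in batch mode without column names (the -B and -N options) instead of being pasted in by hand.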