Ensembl data load

From genomewiki
Revision as of 08:59, 14 September 2010

To load data into an Ensembl database, one has to add analysis steps. The first step always selects the sequences themselves; each following step processes these sequences. "Rules" are added at the end to state which step depends on which other step.

Load RepeatMasker file

  • To make things easier, let's set a little shortcut for the database connection options:
export DBSPEC="-dbhost 127.0.0.1 -dbuser ens-training -dbport 3306 -dbname mouse37_mini_ref -dbpass workshop"
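The commands on this page pass $DBSPEC without quotes, so the shell splits it into separate options for the Perl scripts. A minimal sketch of that behaviour, assuming a POSIX-style shell such as bash (zsh does not split unquoted variables by default); count_args is a hypothetical helper used only for illustration:

```shell
export DBSPEC="-dbhost 127.0.0.1 -dbuser ens-training -dbport 3306 -dbname mouse37_mini_ref -dbpass workshop"

# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() {
  echo "$#"
}

# Unquoted: the shell splits the value on whitespace into 10 separate words
# (5 option names and 5 values), which is what the scripts expect.
count_args $DBSPEC

# Quoted: the whole string arrives as a single argument, which the scripts
# would not be able to parse as options.
count_args "$DBSPEC"
```

If a script complains about missing database options, checking that $DBSPEC is set and unquoted is a reasonable first step.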
  • Run RepeatMasker on a FASTA file:
RepeatMasker -species mouse -qq -dir <full_path_to_output_directory> $HOME/workshop/genebuild/test_seqs/test_sequence_to_repeatmask.fa
  • Analysis Step 1: Create a "dummy analysis file" which will simply select the sequences to analyse (here: contigs), e.g. create a file submit_ana.conf:
[SubmitContig]
module=Dummy
input_id_type=CONTIG
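One way to create this file from the command line is a here-document. The sketch below also lists the section names in the file, since the name in square brackets is what ends up as the logic_name in the analysis table; list_logic_names is a hypothetical helper, not part of ensembl-pipeline:

```shell
# Write the dummy analysis definition exactly as shown above.
cat > submit_ana.conf <<'EOF'
[SubmitContig]
module=Dummy
input_id_type=CONTIG
EOF

# list_logic_names is a hypothetical helper: it prints the name inside each
# [section] header of an analysis config file.
list_logic_names() {
  sed -n 's/^\[\(.*\)\]$/\1/p' "$1"
}

list_logic_names submit_ana.conf
```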
  • Load the "dummy analysis" into the database:
$HOME/cvs_checkout/ensembl-pipeline/scripts/analysis_setup.pl $DBSPEC -read -file submit_ana.conf
  • Analysis Step 2: Define the real analysis, e.g. repeatmask_ana.conf
[RepeatMask]
db=repbase
db_version=0129
db_file=repbase
program=RepeatMask
program_version=3.1.8
program_file=/path/to/repmasker/RepeatMask
parameters=-nolow -species mouse -s
module=RepeatMask
gff_source=RepeatMask
gff_feature=repeat
input_id_type=CONTIG
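A misspelled key in this file is easy to miss. The following sketch flags keys that are not in a list of expected names. The list is an assumption: it is taken from the columns of the analysis table plus input_id_type as used on this page, and analysis_setup.pl may accept other keys. check_analysis_keys is a hypothetical helper:

```shell
# check_analysis_keys is a hypothetical helper: it prints every key of a
# key=value line in the given file that is not in the expected list.
check_analysis_keys() {
  # Assumed list of keys: analysis table columns plus input_id_type.
  known="db db_version db_file program program_version program_file parameters module module_version gff_source gff_feature input_id_type"
  # Extract the key part of key=value lines; section headers and blank lines are skipped.
  sed -n 's/^\([A-Za-z_][A-Za-z_]*\)=.*/\1/p' "$1" | while read -r key; do
    case " $known " in
      *" $key "*) ;;                      # key is in the expected list
      *) echo "unknown key: $key" ;;      # possible typo
    esac
  done
}

# Run the check only if the file from this step exists in the current directory.
if [ -f repeatmask_ana.conf ]; then
  check_analysis_keys repeatmask_ana.conf
fi
```

No output means every key was recognised.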
  • Load the analysis into the MySQL database:
$HOME/cvs_checkout/ensembl-pipeline/scripts/analysis_setup.pl $DBSPEC -read -file repeatmask_ana.conf
  • See what happened by listing the analysis table:
SELECT * FROM analysis\G
*************************** 1. row ***************************
   analysis_id: 1
       created: 2010-09-13 16:50:16
    logic_name: SubmitContig
            db: NULL
    db_version: NULL
       db_file: NULL
       program: NULL
program_version: NULL
  program_file: NULL
    parameters: NULL
        module: Dummy
module_version: NULL
    gff_source: NULL
   gff_feature: NULL
*************************** 2. row ***************************
   analysis_id: 2
       created: 2010-09-13 16:14:11
    logic_name: RepeatMask
            db: repbase
    db_version: 0129
       db_file: repbase
       program: RepeatMask
program_version: 3.1.8
  program_file: /path/to/repmasker/RepeatMask
    parameters: -nolow -species mouse -s
        module: RepeatMask
module_version: NULL
    gff_source: RepeatMask
   gff_feature: repeat
  • Add a rule which says that RepeatMask requires the contig sequences:
perl $HOME/cvs_checkout/ensembl-pipeline/scripts/RuleHandler.pl $DBSPEC \
-insert -goal RepeatMask \
-condition SubmitContig
  • Create the input IDs for the SubmitContig analysis:
perl $HOME/cvs_checkout/ensembl-pipeline/scripts/make_input_ids $DBSPEC -logic_name SubmitContig -coord_system contig -slice 150k
  • Check what has changed:
select ia.input_id,a.logic_name from input_id_analysis ia, analysis a where ia.analysis_id = a.analysis_id ;
+---------------------------------------+--------------+
| input_id                              | logic_name   |
+---------------------------------------+--------------+
| contig:NCBIM37:AC087062.25:1:224451:1 | SubmitContig | 
| contig:NCBIM37:AC138620.4:1:209846:1  | SubmitContig | 
| contig:NCBIM37:AC153919.8:1:264561:1  | SubmitContig | 
| contig:NCBIM37:AL589742.21:1:125641:1 | SubmitContig |
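Each input_id above is an Ensembl slice name of the form coord_system:version:seq_region_name:start:end:strand. A small sketch that splits one into its fields, assuming a POSIX shell; parse_input_id is a hypothetical helper written for this page:

```shell
# parse_input_id is a hypothetical helper: it splits an input_id on colons
# and prints the named fields plus the length of the region.
# The function body runs in a subshell, so the IFS change stays local.
parse_input_id() (
  IFS=:
  # Disable globbing while the unquoted string is split into fields.
  set -f
  # After this, the six fields are available as $1..$6.
  set -- $1
  printf 'coord_system=%s version=%s name=%s start=%s end=%s strand=%s length=%s\n' \
    "$1" "$2" "$3" "$4" "$5" "$6" "$(( $5 - $4 + 1 ))"
)

parse_input_id "contig:NCBIM37:AC087062.25:1:224451:1"
# prints: coord_system=contig version=NCBIM37 name=AC087062.25 start=1 end=224451 strand=1 length=224451
```

Every input_id in the table starts at 1 and ends at a different position, so each one covers a whole contig.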