DoBlastzChainNet.pl: Difference between revisions

From genomewiki

Revision as of 03:06, 6 April 2018

Prerequisites

This discussion assumes you are familiar with Unix shell command-line programming and scripting. You will encounter and interact with the csh/tcsh, bash, perl, and python scripting languages. You will need at least one computer with several CPU cores, preferably a multi-node compute cluster or an equivalent cloud computing environment.

Parasol Job Control System

The scripts and programs used here expect to find the Parasol_job_control_system in place and operational.

Install scripts and kent command line utilities

This is a bit of a kludge at this time (April 2018); we are working on a cleaner distribution of these scripts. As mentioned in the Parasol_job_control_system setup, the kent command line binaries and these scripts reside in /data/bin/ and /data/scripts/, respectively. Keeping scripts separate from binaries is merely a style convention; it is not strictly necessary.


 mkdir -p /data/scripts /data/bin
 chmod 755 /data/scripts /data/bin

 rsync -a rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ /data/bin/
 git archive --remote=git://genome-source.soe.ucsc.edu/kent.git \
     --prefix=kent/ HEAD src/hg/utils/automation \
     | tar vxf - -C /data/scripts --strip-components=5 \
         --exclude='kent/src/hg/utils/automation/incidentDb' \
         --exclude='kent/src/hg/utils/automation/configFiles' \
         --exclude='kent/src/hg/utils/automation/ensGene' \
         --exclude='kent/src/hg/utils/automation/genbank' \
         --exclude='kent/src/hg/utils/automation/lastz_D' \
         --exclude='kent/src/hg/utils/automation/openStack'
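
After the copy finishes, it may be worth confirming that the expected files actually landed and are executable. The following is a minimal sketch, not part of the kent distribution: `check_tools` is a hypothetical helper, and the two file names in the usage comment (`twoBitInfo` and `doBlastzChainNet.pl`) are only examples assumed to be present after the steps above.

```shell
# check_tools DIR NAME...
# Hypothetical helper: report whether each NAME exists and is executable in DIR.
# Returns 0 when every name is found, 1 when at least one is missing.
check_tools() {
    dir="$1"; shift
    missing=0
    for tool in "$@"; do
        if [ -x "$dir/$tool" ]; then
            echo "found: $dir/$tool"
        else
            echo "missing: $dir/$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (file names are assumptions, adjust to the tools you need):
#   check_tools /data/bin twoBitInfo
#   check_tools /data/scripts doBlastzChainNet.pl
```

A non-zero return status makes the helper easy to use as a guard at the top of a longer setup script.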

PATH setup

Verify that the two directories /data/bin and /data/scripts are on the shell PATH, or add them. One simple way is to append an export line to the .bashrc file in your home directory:

echo 'export PATH=/data/bin:/data/scripts:$PATH' >> $HOME/.bashrc

Then source that file to apply the change to the current shell:

. $HOME/.bashrc

Verify that those pathnames appear in the PATH variable:

echo $PATH
/data/bin:/data/scripts:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/centos/.local/bin:/home/centos/bin
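
Reading a long PATH by eye is error-prone, so the same check can be scripted. This is a small sketch only: `path_has_dir` is a hypothetical helper name, and it simply tests whether a directory appears as a complete colon-separated entry in a PATH-style string.

```shell
# path_has_dir DIR [PATH_STRING]
# Hypothetical helper: succeed if DIR is a complete entry in PATH_STRING
# (defaults to the current $PATH). Wrapping both sides in colons avoids
# false matches on partial names such as /data/bin2.
path_has_dir() {
    case ":${2:-$PATH}:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report on the two directories used throughout this page.
for d in /data/bin /data/scripts; do
    if path_has_dir "$d"; then
        echo "ok: $d is on PATH"
    else
        echo "missing: $d is not on PATH"
    fi
done
```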

This entire discussion assumes that bash is the user's Unix shell.

Working directory hierarchy

It is best to organize your work in a directory hierarchy. For example, maintain all your genome sequences in:

 /data/genomes/
 /data/genomes/hg38/
 /data/genomes/mm10/
 /data/genomes/dm6/
 /data/genomes/ce11/
 ... etc ...

Each of those database directories can hold the 2bit file, the chrom.sizes file, and track construction directories, for example:

 /data/genomes/dm6/dm6.2bit
 /data/genomes/dm6/dm6.chrom.sizes
 /data/genomes/dm6/trackData/

Such organization is a matter of personal preference. However you do this, keep it consistent to make it easier to use scripts on multiple sequences.
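
One way to keep the layout consistent is to create it with a script rather than by hand. The sketch below assumes the layout described above (one directory per database, each with a trackData subdirectory); `make_genome_dirs` is a hypothetical helper name, not an existing kent script.

```shell
# make_genome_dirs ROOT DB...
# Hypothetical helper: create ROOT/DB/trackData for each database name given.
# mkdir -p makes the call safe to repeat on an existing hierarchy.
make_genome_dirs() {
    root="$1"; shift
    for db in "$@"; do
        mkdir -p "$root/$db/trackData" || return 1
    done
}

# Example usage, matching the directories listed above:
#   make_genome_dirs /data/genomes hg38 mm10 dm6 ce11
```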

Obtain genome sequences

Genome sequences from the U.C. Santa Cruz Genomics Institute can be obtained directly from the hgdownload server via rsync. For example:

mkdir -p /data/genomes/dm6
cd /data/genomes/dm6
rsync -avzP \
   rsync://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.2bit .
rsync -avzP \
   rsync://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.chrom.sizes .
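
When several assemblies are needed, the same two rsync commands can be wrapped in a function. This is a sketch under the assumption that other assemblies follow the same goldenPath/<db>/bigZips/ naming shown for dm6; `genome_urls` and `fetch_genome` are hypothetical helper names.

```shell
# genome_urls DB
# Hypothetical helper: print the two hgdownload rsync URLs for one assembly,
# assuming the same path pattern as the dm6 example above.
genome_urls() {
    base="rsync://hgdownload.soe.ucsc.edu/goldenPath/$1/bigZips"
    echo "$base/$1.2bit"
    echo "$base/$1.chrom.sizes"
}

# fetch_genome ROOT DB
# Hypothetical helper: download the 2bit and chrom.sizes files into ROOT/DB,
# stopping at the first rsync failure.
fetch_genome() {
    root="$1"; db="$2"
    mkdir -p "$root/$db" || return 1
    for url in $(genome_urls "$db"); do
        rsync -avzP "$url" "$root/$db/" || return 1
    done
}

# Example usage:
#   for db in dm6 ce11; do fetch_genome /data/genomes "$db"; done
```

Printing the URLs in a separate function makes it easy to inspect what would be fetched before any network transfer starts.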