DoSameSpeciesLiftOver.pl: Difference between revisions

Revision as of 21:15, 25 April 2018

Licensing

For commercial use of these toolsets, please note the license considerations for the kent source tools at the Genome Store.

Prerequisites

This discussion assumes you are familiar with Unix shell command-line programming and scripting. You will encounter and interact with the csh/tcsh, bash, perl, and python scripting languages. You will need at least one computer with several CPU cores, preferably a multi-node compute cluster or an equivalent cloud computing environment.

This entire discussion assumes bash is the user's Unix shell.

Parasol Job Control System

The scripts and programs used here expect to find the Parasol_job_control_system in place and operational.

Install scripts and kent command line utilities

This is a bit of a kludge at this time (April 2018); we are working on a cleaner distribution of these scripts. As mentioned in the Parasol_job_control_system setup, the kent command line binaries and these scripts reside in /data/bin/ and /data/scripts/. Keeping scripts separate from binaries is merely a style convention, not a requirement.


 mkdir -p /data/scripts /data/bin
 chmod 755 /data/scripts /data/bin

 rsync -a rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ /data/bin/
 git archive --remote=git://genome-source.soe.ucsc.edu/kent.git \
   --prefix=kent/ HEAD src/hg/utils/automation \
   | tar vxf - -C /data/scripts --strip-components=5 \
       --exclude='kent/src/hg/utils/automation/incidentDb' \
       --exclude='kent/src/hg/utils/automation/configFiles' \
       --exclude='kent/src/hg/utils/automation/ensGene' \
       --exclude='kent/src/hg/utils/automation/genbank' \
       --exclude='kent/src/hg/utils/automation/lastz_D' \
       --exclude='kent/src/hg/utils/automation/openStack'
 wget -O /data/bin/bedSingleCover.pl 'http://genome-source.soe.ucsc.edu/gitweb/?p=kent.git;a=blob_plain;f=src/utils/bedSingleCover.pl'

NOTE: A copy of the lastz binary is included in the rsync download of binaries from hgdownload. It is named lastz-1.04.00 to identify the version. Source for lastz can be obtained from the lastz GitHub repository.
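
After the downloads finish, it can be worth confirming that the expected files landed where the later steps will look for them. The following is a minimal sketch, not part of the original procedure: the function name check_installed is hypothetical, and the file names in the example comments are assumptions based on the tools mentioned on this page.

```shell
# check_installed DIR NAME...
# Print every NAME that is not an executable file in DIR.
# Returns non-zero if at least one is missing (hypothetical helper).
check_installed() {
  local dir="$1" name status=0
  shift
  for name in "$@"; do
    if [ ! -x "$dir/$name" ]; then
      echo "missing or not executable: $dir/$name"
      status=1
    fi
  done
  return "$status"
}

# Example calls (assumed file names):
#   check_installed /data/bin faToTwoBit twoBitInfo lastz-1.04.00 bedSingleCover.pl
#   check_installed /data/scripts doSameSpeciesLiftOver.pl
```

A file fetched with wget is normally saved without the execute bit, so bedSingleCover.pl may be reported here until chmod +x is applied to it.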

PATH setup

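Verify that the two directories /data/bin and /data/scripts are on the shell PATH, and add them if they are not. The simplest way is to append an export line to the .bashrc file in your home directory (the line below also adds /data/bin/blat, a subdirectory of the binaries download):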

echo 'export PATH=/data/bin:/data/bin/blat:/data/scripts:$PATH' >> $HOME/.bashrc

Then source that file to apply the change to the current shell:

. $HOME/.bashrc

Verify that those pathnames appear in the PATH variable:

echo $PATH
/data/bin:/data/scripts:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/centos/.local/bin:/home/centos/bin
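
Reading a long PATH by eye is error prone, so the check can also be scripted. This is a small sketch under the assumption that the three directories from the export line above are the ones of interest; path_has_dir is a hypothetical helper name.

```shell
# path_has_dir DIR [PATHSTRING]
# Succeeds if DIR is an exact entry of PATHSTRING (default: $PATH).
path_has_dir() {
  case ":${2:-$PATH}:" in
    *":$1:"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Report each directory that the export line above is meant to add.
for dir in /data/bin /data/bin/blat /data/scripts; do
  if path_has_dir "$dir"; then
    echo "on PATH: $dir"
  else
    echo "NOT on PATH: $dir"
  fi
done
```

Wrapping the PATH string in colons makes the match exact, so /data does not count as present just because /data/bin is.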

Obtain genome sequences

This example uses two soybean assemblies from NCBI, one from the GenBank collection and one from the RefSeq collection.

mkdir -p /data/genomes/genbank /data/genomes/refseq
cd /data/genomes/genbank
rsync -L -a -P \
  rsync://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Glycine_max/all_assembly_versions/GCA_000004515.3_Glycine_max_v2.0/GCA_000004515.3_Glycine_max_v2.0_genomic.fna.gz ./
cd /data/genomes/refseq
rsync -L -a -P \
rsync://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Glycine_max/all_assembly_versions/GCF_000004515.3_V1.1/GCF_000004515.3_V1.1_genomic.fna.gz ./
cd /data/genomes
ls -og genbank/*.gz refseq/*.gz
-r--r--r--. 1 296231629 Jun 13  2016 genbank/GCA_000004515.3_Glycine_max_v2.0_genomic.fna.gz
-r--r--r--. 1 296228780 Jan  6  2015 refseq/GCF_000004515.3_V1.1_genomic.fna.gz
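
The file sizes alone do not show whether a download is intact. One inexpensive check, sketched here as a suggestion rather than as part of the original procedure, is to let gzip test each compressed file. The function name check_gz is hypothetical.

```shell
# check_gz FILE...
# Run "gzip -t" on each file and report the result.
# Returns non-zero if any file fails the integrity test.
check_gz() {
  local f status=0
  for f in "$@"; do
    if gzip -t "$f" 2>/dev/null; then
      echo "ok: $f"
    else
      echo "FAILED: $f"
      status=1
    fi
  done
  return "$status"
}

# Example call for the two downloads above:
#   check_gz /data/genomes/genbank/*.gz /data/genomes/refseq/*.gz
```

A file that fails can simply be fetched again with the same rsync command.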

Working directories

Organize your work in a directory hierarchy for convenience of bookkeeping and shell script automation for numerous sequences.

The target sequence name is GCA_000004515.3_Glycine_max_v2.0 and the query sequence name is GCF_000004515.3_V1.1.

Convert each fasta sequence to a .2bit file and calculate its chrom.sizes file.

mkdir /data/genomes/GCA_000004515.3_Glycine_max_v2.0
mkdir /data/genomes/GCF_000004515.3_V1.1
cd /data/genomes/GCA_000004515.3_Glycine_max_v2.0 
faToTwoBit /data/genomes/genbank/GCA_000004515.3_Glycine_max_v2.0_genomic.fna.gz GCA_000004515.3_Glycine_max_v2.0.2bit
twoBitInfo GCA_000004515.3_Glycine_max_v2.0.2bit stdout | sort -k2,2nr > GCA_000004515.3_Glycine_max_v2.0.chrom.sizes
ls -og
-rw-rw-r--. 1 287569208 Apr 13 17:30 GCA_000004515.3_Glycine_max_v2.0.2bit
-rw-rw-r--. 1     19439 Apr 13 17:31 GCA_000004515.3_Glycine_max_v2.0.chrom.sizes
cd /data/genomes/GCF_000004515.3_V1.1
faToTwoBit /data/genomes/refseq/GCF_000004515.3_V1.1_genomic.fna.gz GCF_000004515.3_V1.1.2bit
twoBitInfo GCF_000004515.3_V1.1.2bit stdout | sort -k2,2nr > GCF_000004515.3_V1.1.chrom.sizes
ls -og
-rw-rw-r--. 1 286264183 Apr 13 17:30 GCF_000004515.3_V1.1.2bit
-rw-rw-r--. 1     23246 Apr 13 17:31 GCF_000004515.3_V1.1.chrom.sizes
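
Because the same conversion is repeated for every assembly, it lends itself to a small shell function. The sketch below is one possible way to wrap the steps shown above, assuming faToTwoBit and twoBitInfo are on the PATH. The function names prep_assembly and sizes_summary and the optional base directory argument are hypothetical and not part of the kent tools.

```shell
# prep_assembly FASTA_GZ NAME [BASE_DIR]
# Create BASE_DIR/NAME (default base: /data/genomes), convert the fasta
# to NAME.2bit and write NAME.chrom.sizes sorted by size, largest first.
prep_assembly() {
  local fasta="$1" name="$2" base="${3:-/data/genomes}"
  mkdir -p "$base/$name" || return 1
  faToTwoBit "$fasta" "$base/$name/$name.2bit" || return 1
  twoBitInfo "$base/$name/$name.2bit" stdout \
    | sort -k2,2nr > "$base/$name/$name.chrom.sizes"
}

# sizes_summary CHROM_SIZES
# Print the number of sequences and the total base count of a chrom.sizes file.
sizes_summary() {
  awk '{ n++; total += $2 }
       END { printf "%d sequences, %.0f bases\n", n, total }' "$1"
}

# Example calls for the two assemblies on this page:
#   prep_assembly /data/genomes/genbank/GCA_000004515.3_Glycine_max_v2.0_genomic.fna.gz GCA_000004515.3_Glycine_max_v2.0
#   prep_assembly /data/genomes/refseq/GCF_000004515.3_V1.1_genomic.fna.gz GCF_000004515.3_V1.1
#   sizes_summary /data/genomes/GCF_000004515.3_V1.1/GCF_000004515.3_V1.1.chrom.sizes
```

Comparing the two summaries gives a quick impression of how closely the assemblies match in sequence count and total size before any alignment work starts.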