DoSameSpeciesLiftOver.pl

From genomewiki
Revision as of 22:28, 25 April 2018 by Hiram (talk | contribs) (no need to mention lastz here)
Jump to navigationJump to search

Licensing

For commercial use of these toolsets, please note the license considerations for the kent source tools at the: Genome Store

This process also uses the blat command. For commercial license please see: Kent Informatics

Prerequisites

This discussion assumes you are familiar with Unix shell command line programming and scripting. You will be encountering and interacting with csh/tcsh, bash, perl, and python scripting languages. You will need at least one computer with several CPU cores, preferably a multiple compute cluster system or equivalent in a cloud computing environment.

This entire discussion assumes the bash shell is the user's unix shell.

Parasol Job Control System

The scripts and programs used here expect to find the Parasol_job_control_system in place and operational.

Install scripts and kent command line utilities

This is a bit of a kludge at this time (April 2018), we are working on a cleaner distribution of these scripts. As was mentioned in the Parasol_job_control_system setup, the kent command line binaries and these scripts are going to reside in /data/bin/ and /data/scripts/. This is merely a style custom to keep scripts separate from binaries, this is not strictly necessary to keep them separate.


 mkdir -p /data/scripts /data/bin
 chmod 755 /data/scripts /data/bin

 rsync -a rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ /data/bin/
 git archive --remote=git://genome-source.soe.ucsc.edu/kent.git \
  --prefix=kent/ HEAD src/hg/utils/automation \
     | tar vxf - -C /data/scripts --strip-components=5 \
        --exclude='kent/src/hg/utils/automation/incidentDb' \
      --exclude='kent/src/hg/utils/automation/configFiles' \
      --exclude='kent/src/hg/utils/automation/ensGene' \
      --exclude='kent/src/hg/utils/automation/genbank' \
      --exclude='kent/src/hg/utils/automation/lastz_D' \
      --exclude='kent/src/hg/utils/automation/openStack'
  wget -O /data/bin/bedSingleCover.pl 'http://genome-source.soe.ucsc.edu/gitweb/?p=kent.git;a=blob_plain;f=src/utils/bedSingleCover.pl'

PATH setup

Add or verify the two directories /data/bin and /data/scripts are added to the shell PATH environment. This can be added simply to the .bashrc file in your home directory:

echo 'export PATH=/data/bin:/data/bin/blat:/data/scripts:$PATH' >> $HOME/.bashrc

Then, source that file to add that to this current shell:

. $HOME/.bashrc

Verify you see those pathnames on the PATH variable:

echo $PATH
/data/bin:/data/scripts:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/centos/.local/bin:/home/centos/bin

Obtain genome sequences

This example is going to use two Soy bean assemblies, one from genbank and one from refseq assemblies at NCBI.

mkdir -p /data/genomes/genbank /data/genomes/refseq
cd /data/genomes/genbank
rsync -L -a -P \
  rsync://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Glycine_max/all_assembly_versions/GCA_000004515.3_Glycine_max_v2.0/GCA_000004515.3_Glycine_max_v2.0_genomic.fna.gz ./
cd /data/genomes/refseq
rsync -L -a -P \
rsync://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Glycine_max/all_assembly_versions/GCF_000004515.3_V1.1/GCF_000004515.3_V1.1_genomic.fna.gz ./
cd /data/genomes
ls -og genbank/*.gz refseq/*.gz
-r--r--r--. 1 296231629 Jun 13  2016 genbank/GCA_000004515.3_Glycine_max_v2.0_genomic.fna.gz
-r--r--r--. 1 296228780 Jan  6  2015 refseq/GCF_000004515.3_V1.1_genomic.fna.gz

Working directories

Organize your work in a directory hierarchy for convenience of bookeeping and shell script automation for numerous sequences.

The target sequence name is GCA_000004515.3_Glycine_max_v2.0 and the query sequence name is GCF_000004515.3_V1.1.

Convert the fasta sequence to .2bit files and calculate chrom.sizes files.

mkdir /data/genomes/GCA_000004515.3_Glycine_max_v2.0
mkdir /data/genomes/GCF_000004515.3_V1.1
cd /data/genomes/GCA_000004515.3_Glycine_max_v2.0 
faToTwoBit /data/genomes/genbank/GCA_000004515.3_Glycine_max_v2.0_genomic.fna.gz GCA_000004515.3_Glycine_max_v2.0.2bit
twoBitInfo GCA_000004515.3_Glycine_max_v2.0.2bit stdout | sort -k2,2nr > GCA_000004515.3_Glycine_max_v2.0.chrom.sizes
ls -og
-rw-rw-r--. 1 287569208 Apr 13 17:30 GCA_000004515.3_Glycine_max_v2.0.2bit
-rw-rw-r--. 1     19439 Apr 13 17:31 GCA_000004515.3_Glycine_max_v2.0.chrom.sizes
cd mkdir /data/genomes/GCF_000004515.3_V1.1
faToTwoBit /data/genomes/GCF_000004515.3_V1.1/GCF_000004515.3_V1.1_genomic.fna.gz GCF_000004515.3_V1.1.2bit
twoBitInfo GCF_000004515.3_V1.1.2bit stdout | sort -k2,2nr > GCF_000004515.3_V1.1.chrom.sizes
ls -og
-rw-rw-r--. 1 286264183 Apr 13 17:30 GCF_000004515.3_V1.1.2bit
-rw-rw-r--. 1     23246 Apr 13 17:31 GCF_000004515.3_V1.1.chrom.sizes

Construct ooc file

The blat operation in this procedure works much more efficiently when a pre-computed ooc file is constructed to use for all blat comparisons. This is a file that counts up over used 11-mer tiles for blat to eliminate them from the initial consideration for alignment, thereby limiting the amount of alignment that has to take place. We base the repMatch parameter on the size of the genome compared to UCSC hg19 sequence. A genome of that size used -repMatch=1024. We want to adjust that parameter in proportion to that size. Measure the size of the target genome:

cd /data/genomes/GCA_000004515.3_Glycine_max_v2.0
twoBitToFa GCA_000004515.3_Glycine_max_v2.0.2bit stdout | faSize stdin
978416860 bases (23046524 N's 955370336 real 521703227 upper 433667109 lower) in 1189 sequences in 1 files

Note the number of real bases 955370336 to use in this proportion calculation:

calc \( 955370336 / 2861349177  \) \* 1024
( 955370336 / 2861349177 ) * 1024 = 341.901377

Round down the answer to the nearest 50 for the -repMatch=300 use in blat:

blat GCA_000004515.3_Glycine_max_v2.0.2bit /dev/null /dev/null -tileSize=11 \
  -makeOoc=GCA_000004515.3_Glycine_max_v2.0.ooc -repMatch=300
Loading GCA_000004515.3_Glycine_max_v2.0.2bit
Counting GCA_000004515.3_Glycine_max_v2.0.2bit
Writing GCA_000004515.3_Glycine_max_v2.0.ooc
Wrote 64902 overused 11-mers to GCA_000004515.3_Glycine_max_v2.0.ooc