DoSameSpeciesLiftOver.pl

From genomewiki
Jump to: navigation, search

Licensing

For commercial use of these toolsets, please note the license considerations for the kent source tools at the: Genome Store

This process also uses the blat command. For commercial license please see: Kent Informatics

Prerequisites

This discussion assumes you are familiar with Unix shell command line programming and scripting. You will be encountering and interacting with csh/tcsh, bash, perl, and python scripting languages. You will need at least one computer with several CPU cores, preferably a multiple compute cluster system or equivalent in a cloud computing environment.

This entire discussion assumes the bash shell is the user's unix shell.

Parasol Job Control System

The scripts and programs used here expect to find the Parasol_job_control_system in place and operational.

Install scripts and kent command line utilities

This is a bit of a kludge at this time (April 2018), we are working on a cleaner distribution of these scripts. As was mentioned in the Parasol_job_control_system setup, the kent command line binaries and these scripts are going to reside in /data/bin/ and /data/scripts/. This is merely a style custom to keep scripts separate from binaries, this is not strictly necessary to keep them separate.


 sudo mkdir -p /data/scripts /data/bin
 sudo chmod 755 /data/scripts /data/bin

 rsync -a rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ /data/bin/
 git archive --remote=git://genome-source.soe.ucsc.edu/kent.git \
  --prefix=kent/ HEAD src/hg/utils/automation \
     | tar vxf - -C /data/scripts --strip-components=5 \
        --exclude='kent/src/hg/utils/automation/incidentDb' \
      --exclude='kent/src/hg/utils/automation/configFiles' \
      --exclude='kent/src/hg/utils/automation/ensGene' \
      --exclude='kent/src/hg/utils/automation/genbank' \
      --exclude='kent/src/hg/utils/automation/lastz_D' \
      --exclude='kent/src/hg/utils/automation/openStack'
  wget -O /data/bin/bedSingleCover.pl 'http://genome-source.soe.ucsc.edu/gitlist/kent.git/raw/master/src/utils/bedSingleCover.pl'

PATH setup

Add or verify the two directories /data/bin and /data/scripts are added to the shell PATH environment. This can be added simply to the .bashrc file in your home directory:

echo 'export PATH=/data/bin:/data/bin/blat:/data/scripts:$PATH' >> $HOME/.bashrc

Note: /data/bin/blat has been added to the PATH for access to the blat command. Then, source that file to add that to this current shell:

. $HOME/.bashrc

Verify you see those pathnames on the PATH variable:

echo $PATH
/data/bin:/data/bin/blat:/data/scripts:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/centos/.local/bin:/home/centos/bin

Obtain genome sequences

This example is going to use two Soy bean assemblies, one from genbank and one from refseq assemblies at NCBI.

mkdir -p /data/genomes/genbank /data/genomes/refseq
cd /data/genomes/genbank
rsync -L -a -P \
  rsync://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Glycine_max/all_assembly_versions/GCA_000004515.3_Glycine_max_v2.0/GCA_000004515.3_Glycine_max_v2.0_genomic.fna.gz ./
cd /data/genomes/refseq
rsync -L -a -P \
rsync://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Glycine_max/all_assembly_versions/GCF_000004515.3_V1.1/GCF_000004515.3_V1.1_genomic.fna.gz ./
cd /data/genomes
ls -og genbank/*.gz refseq/*.gz
-r--r--r--. 1 296231629 Jun 13  2016 genbank/GCA_000004515.3_Glycine_max_v2.0_genomic.fna.gz
-r--r--r--. 1 296228780 Jan  6  2015 refseq/GCF_000004515.3_V1.1_genomic.fna.gz

Working directories

Organize your work in a directory hierarchy for convenience of bookeeping and shell script automation for numerous sequences.

The target sequence name is GCA_000004515.3_Glycine_max_v2.0 and the query sequence name is GCF_000004515.3_V1.1.

Convert the fasta sequence to .2bit files and calculate chrom.sizes files.

mkdir /data/genomes/GCA_000004515.3_Glycine_max_v2.0
mkdir /data/genomes/GCF_000004515.3_V1.1
cd /data/genomes/GCA_000004515.3_Glycine_max_v2.0 
faToTwoBit /data/genomes/genbank/GCA_000004515.3_Glycine_max_v2.0_genomic.fna.gz GCA_000004515.3_Glycine_max_v2.0.2bit
twoBitInfo GCA_000004515.3_Glycine_max_v2.0.2bit stdout | sort -k2,2nr > GCA_000004515.3_Glycine_max_v2.0.chrom.sizes
ls -og
-rw-rw-r--. 1 287569208 Apr 13 17:30 GCA_000004515.3_Glycine_max_v2.0.2bit
-rw-rw-r--. 1     19439 Apr 13 17:31 GCA_000004515.3_Glycine_max_v2.0.chrom.sizes
cd /data/genomes/GCF_000004515.3_V1.1
faToTwoBit /data/genomes/GCF_000004515.3_V1.1/GCF_000004515.3_V1.1_genomic.fna.gz GCF_000004515.3_V1.1.2bit
twoBitInfo GCF_000004515.3_V1.1.2bit stdout | sort -k2,2nr > GCF_000004515.3_V1.1.chrom.sizes
ls -og
-rw-rw-r--. 1 286264183 Apr 13 17:30 GCF_000004515.3_V1.1.2bit
-rw-rw-r--. 1     23246 Apr 13 17:31 GCF_000004515.3_V1.1.chrom.sizes

Construct ooc file

The blat operation in this procedure works much more efficiently when a pre-computed ooc file is constructed to use for all blat comparisons. This is a file that counts up over used 11-mer tiles for blat to eliminate them from the initial consideration for alignment, thereby limiting the amount of alignment that has to take place. We base the repMatch parameter on the size of the genome compared to UCSC hg19 sequence. A genome of that size used -repMatch=1024. We want to adjust that parameter in proportion to that size. Measure the size of the target genome:

cd /data/genomes/GCA_000004515.3_Glycine_max_v2.0
twoBitToFa GCA_000004515.3_Glycine_max_v2.0.2bit stdout | faSize stdin
978416860 bases (23046524 N's 955370336 real 521703227 upper 433667109 lower) in 1189 sequences in 1 files

Note the number of real bases 955370336 to use in this proportion calculation:

calc \( 955370336 / 2861349177  \) \* 1024
( 955370336 / 2861349177 ) * 1024 = 341.901377

Round down the answer to the nearest 50 for the -repMatch=300 use in blat:

blat GCA_000004515.3_Glycine_max_v2.0.2bit /dev/null /dev/null -tileSize=11 \
  -makeOoc=GCA_000004515.3_Glycine_max_v2.0.ooc -repMatch=300
Loading GCA_000004515.3_Glycine_max_v2.0.2bit
Counting GCA_000004515.3_Glycine_max_v2.0.2bit
Writing GCA_000004515.3_Glycine_max_v2.0.ooc
Wrote 64902 overused 11-mers to GCA_000004515.3_Glycine_max_v2.0.ooc

doSameSpeciesLiftOver.pl

This script is going to run the entire process. With the .2bit, chrom.sizes and ooc files in place, the process is ready to run with this single command:

export target="GCA_000004515.3_Glycine_max_v2.0"
export query="GCF_000004515.3_V1.1"
cd /data/genomes/${target}
time (doSameSpeciesLiftOver.pl -verbose=2 -buildDir=`pwd` \
  -ooc=`pwd`/${target}.ooc -fileServer=localhost -localTmp="/dev/shm" \
    -bigClusterHub=localhost -dbHost=localhost -workhorse=localhost \
      -target2Bit=`pwd`/${target}.2bit -targetSizes=`pwd`/${target}.chrom.sizes \
        -query2Bit=/data/genomes/${query}/${query}.2bit \
          -querySizes=/data/genomes/${query}/${query}.chrom.sizes ${target} ${query}) > do.log 2>&1

Result files

The liftOver file result is in this working directory:

cd /data/genomes/GCA_000004515.3_Glycine_max_v2.0
ls -og *.over.chain.gz
-rw-rw-r--. 1 1325174 Apr 14 17:27 GCA_000004515.3_Glycine_max_v2.0ToGCF_000004515.3_V1.1.over.chain.gz

This has also been converted to bigChain files for display in an assembly hub:

ls -og *.bb
-rw-rw-r--. 1 1622857 Apr 14 17:27 chainGCF_000004515.3_V1.1.bb
-rw-rw-r--. 1 1900014 Apr 14 17:27 chainGCF_000004515.3_V1.1Link.bb

How does this process work

The doSameSpeciesLiftOver.pl script performs the processing in distinct steps. Each step is almost always performed with a C-shell or bash shell script. Therefore, if there is a problem in any step, the commands performing the step can be dissected from the script in operation, the problem identified and fixed, and the step completed manually by running the rest of the commands in that script. Once a step has been completed, the process can continue with the next step using the argument -continue=nextStepName. Check the usage message from the doSameSpeciesLiftOver.pl script to see a listing of the steps and their sequence. Specifically:

align, chain, net, load, cleanup

In this example, the various scripts are:

-rwxrwxr-x. 1 1485 Apr 14 00:06 run.blat/doAlign.csh
-rwxrwxr-x. 1 2337 Apr 14 00:08 run.blat/job.csh
-rwxrwxr-x. 1  499 Apr 14 04:43 run.chain/job.csh
-rwxrwxr-x. 1  641 Apr 14 04:43 run.chain/doChain.csh
-rwxrwxr-x. 1 1870 Apr 14 17:26 run.chain/doNet.csh
-rwxrwxr-x. 1 2317 Apr 14 17:27 doLoad.csh
-rwxrwxr-x. 1  540 Apr 14 17:27 doCleanup.csh

The cluster runs performed were in run.blat and run.chain with the following typical processing times for this genome sequence:

cat run.blat/run.time
Completed: 2289 of 2289 jobs
CPU time in finished jobs:     658550s   10975.83m   182.93h    7.62d  0.021 y
IO & Wait Time:                  9734s     162.24m     2.70h    0.11d  0.000 y
Average job time:                 292s       4.87m     0.08h    0.00d
Longest finished job:            3935s      65.58m     1.09h    0.05d
Submission to last job:         16493s     274.88m     4.58h    0.19d
cat run.chain/run.time
Completed: 63 of 63 jobs
CPU time in finished jobs:         49s       0.82m     0.01h    0.00d  0.000 y
IO & Wait Time:                   167s       2.78m     0.05h    0.00d  0.000 y
Average job time:                   3s       0.06m     0.00h    0.00d
Longest finished job:              17s       0.28m     0.00h    0.00d
Submission to last job:            21s       0.35m     0.01h    0.00d


The script can be used in a -debug mode where it does nothing but construct the required shell scripts.