DoSameSpeciesLiftOver.pl
Licensing
For commercial use of these toolsets, please note the license considerations for the kent source tools at the Genome Store.
This process also uses the blat command. For a commercial license, please see Kent Informatics.
Prerequisites
This discussion assumes you are familiar with Unix shell command-line programming and scripting. You will encounter and interact with scripts written in csh/tcsh, bash, perl, and python. You will need at least one computer with several CPU cores, preferably a multi-node compute cluster or the equivalent in a cloud computing environment.
This entire discussion assumes that bash is the user's Unix shell.
Parasol Job Control System
The scripts and programs used here expect the Parasol job control system to be installed and operational; see Parasol_job_control_system for the setup.
Install scripts and kent command line utilities
The installation is a bit of a kludge at this time (April 2018); we are working on a cleaner distribution of these scripts. As mentioned in the Parasol_job_control_system setup, the kent command line binaries reside in /data/bin/ and these scripts reside in /data/scripts/. Keeping scripts separate from binaries is merely a style custom; it is not strictly necessary.
mkdir -p /data/scripts /data/bin
chmod 755 /data/scripts /data/bin
rsync -a rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ /data/bin/
git archive --remote=git://genome-source.soe.ucsc.edu/kent.git \
  --prefix=kent/ HEAD src/hg/utils/automation \
  | tar vxf - -C /data/scripts --strip-components=5 \
  --exclude='kent/src/hg/utils/automation/incidentDb' \
  --exclude='kent/src/hg/utils/automation/configFiles' \
  --exclude='kent/src/hg/utils/automation/ensGene' \
  --exclude='kent/src/hg/utils/automation/genbank' \
  --exclude='kent/src/hg/utils/automation/lastz_D' \
  --exclude='kent/src/hg/utils/automation/openStack'
wget -O /data/bin/bedSingleCover.pl 'http://genome-source.soe.ucsc.edu/gitlist/kent.git/raw/master/src/utils/bedSingleCover.pl'
PATH setup
Add the two directories /data/bin and /data/scripts to the shell PATH environment variable, or verify that they are already there. The simplest way is to append a line to the .bashrc file in your home directory:
echo 'export PATH=/data/bin:/data/bin/blat:/data/scripts:$PATH' >> $HOME/.bashrc
Note: /data/bin/blat is also added to the PATH for access to the blat command. Then source that file to apply the change to the current shell:
. $HOME/.bashrc
Verify you see those pathnames on the PATH variable:
echo $PATH
/data/bin:/data/bin/blat:/data/scripts:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/centos/.local/bin:/home/centos/bin
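Checking the PATH by eye is easy to get wrong on a long value. The following is a minimal sketch of an automated check; check_path_dirs is a hypothetical helper name introduced here for illustration and is not part of the kent utilities.

```shell
# Report whether each directory given as an argument is an entry on PATH.
# check_path_dirs is a hypothetical helper name, not a kent utility.
check_path_dirs() {
  local dir missing=0
  for dir in "$@"; do
    # Wrap PATH and the directory in colons so only whole entries match.
    case ":${PATH}:" in
      *":${dir}:"*) echo "ok: ${dir}" ;;
      *) echo "missing: ${dir}"; missing=1 ;;
    esac
  done
  return "${missing}"
}

# The three directories this walkthrough adds to PATH.
check_path_dirs /data/bin /data/bin/blat /data/scripts \
  || echo "update PATH in \$HOME/.bashrc and source it again"
```

The function returns a non-zero status when any directory is missing, so it can also gate later steps in an automation script.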
Obtain genome sequences
This example uses two soybean (Glycine max) assemblies from NCBI, one from the GenBank assemblies and one from the RefSeq assemblies.
mkdir -p /data/genomes/genbank /data/genomes/refseq
cd /data/genomes/genbank
rsync -L -a -P \
  rsync://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Glycine_max/all_assembly_versions/GCA_000004515.3_Glycine_max_v2.0/GCA_000004515.3_Glycine_max_v2.0_genomic.fna.gz ./
cd /data/genomes/refseq
rsync -L -a -P \
  rsync://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Glycine_max/all_assembly_versions/GCF_000004515.3_V1.1/GCF_000004515.3_V1.1_genomic.fna.gz ./
cd /data/genomes
ls -og genbank/*.gz refseq/*.gz
-r--r--r--. 1 296231629 Jun 13 2016 genbank/GCA_000004515.3_Glycine_max_v2.0_genomic.fna.gz
-r--r--r--. 1 296228780 Jan 6 2015 refseq/GCF_000004515.3_V1.1_genomic.fna.gz
Working directories
Organize your work in a directory hierarchy for convenient bookkeeping and for shell script automation across numerous sequences.
The target sequence name is GCA_000004515.3_Glycine_max_v2.0 and the query sequence name is GCF_000004515.3_V1.1.
Convert each fasta sequence to a .2bit file and compute its chrom.sizes file:
mkdir /data/genomes/GCA_000004515.3_Glycine_max_v2.0
mkdir /data/genomes/GCF_000004515.3_V1.1
cd /data/genomes/GCA_000004515.3_Glycine_max_v2.0
faToTwoBit /data/genomes/genbank/GCA_000004515.3_Glycine_max_v2.0_genomic.fna.gz GCA_000004515.3_Glycine_max_v2.0.2bit
twoBitInfo GCA_000004515.3_Glycine_max_v2.0.2bit stdout | sort -k2,2nr > GCA_000004515.3_Glycine_max_v2.0.chrom.sizes
ls -og
-rw-rw-r--. 1 287569208 Apr 13 17:30 GCA_000004515.3_Glycine_max_v2.0.2bit
-rw-rw-r--. 1 19439 Apr 13 17:31 GCA_000004515.3_Glycine_max_v2.0.chrom.sizes
cd /data/genomes/GCF_000004515.3_V1.1
faToTwoBit /data/genomes/refseq/GCF_000004515.3_V1.1_genomic.fna.gz GCF_000004515.3_V1.1.2bit
twoBitInfo GCF_000004515.3_V1.1.2bit stdout | sort -k2,2nr > GCF_000004515.3_V1.1.chrom.sizes
ls -og
-rw-rw-r--. 1 286264183 Apr 13 17:30 GCF_000004515.3_V1.1.2bit
-rw-rw-r--. 1 23246 Apr 13 17:31 GCF_000004515.3_V1.1.chrom.sizes
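A quick sanity check on a freshly written chrom.sizes file is to count its sequences and sum their lengths. The sketch below assumes the two-column name and size layout that twoBitInfo writes; chrom_sizes_summary is a hypothetical helper name used only for this illustration.

```shell
# Summarize a chrom.sizes file: number of sequences and total bases.
# chrom_sizes_summary is a hypothetical helper name, not a kent utility.
chrom_sizes_summary() {
  awk '{ total += $2 } END { printf "%d sequences, %.0f bases\n", NR, total }' "$1"
}

# Run against the target assembly if the file is in the current directory.
sizes="GCA_000004515.3_Glycine_max_v2.0.chrom.sizes"
if [ -r "${sizes}" ]; then
  chrom_sizes_summary "${sizes}"
fi
```

The two numbers should agree with the sequence count and total base count that faSize reports for the same assembly.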
Construct ooc file
The blat operation in this procedure works much more efficiently when a pre-computed ooc file is supplied for all blat comparisons. This file lists the overused 11-mer tiles so that blat can exclude them from initial consideration for alignment, which limits the amount of alignment that has to take place. We scale the -repMatch parameter by the size of the genome relative to the UCSC hg19 sequence, for which -repMatch=1024 was used. To adjust the parameter in proportion, first measure the size of the target genome:
cd /data/genomes/GCA_000004515.3_Glycine_max_v2.0
twoBitToFa GCA_000004515.3_Glycine_max_v2.0.2bit stdout | faSize stdin
978416860 bases (23046524 N's 955370336 real 521703227 upper 433667109 lower) in 1189 sequences in 1 files
Note the number of real (non-N) bases, 955370336, to use in the proportion calculation against the hg19 figure of 2861349177:
calc \( 955370336 / 2861349177 \) \* 1024
( 955370336 / 2861349177 ) * 1024 = 341.901377
Round the answer down to a multiple of 50, which gives -repMatch=300 for use in blat:
blat GCA_000004515.3_Glycine_max_v2.0.2bit /dev/null /dev/null -tileSize=11 \
  -makeOoc=GCA_000004515.3_Glycine_max_v2.0.ooc -repMatch=300
Loading GCA_000004515.3_Glycine_max_v2.0.2bit
Counting GCA_000004515.3_Glycine_max_v2.0.2bit
Writing GCA_000004515.3_Glycine_max_v2.0.ooc
Wrote 64902 overused 11-mers to GCA_000004515.3_Glycine_max_v2.0.ooc
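When many assemblies are processed, the proportion and rounding described above can be scripted. The following is a minimal sketch using awk. The helper names real_bases_from_fasize and rep_match_for are hypothetical, and the floor of 50 is an assumption added here so that a very small genome does not produce a value of 0; it is not taken from the procedure above.

```shell
# Pull the real (non-N) base count out of a faSize summary line on stdin.
# real_bases_from_fasize is a hypothetical helper name.
real_bases_from_fasize() {
  awk '{ for (i = 2; i <= NF; i++) if ($i == "real") { print $(i - 1); exit } }'
}

# Scale -repMatch from the hg19 figures used in the text (2861349177 bases,
# -repMatch=1024) and round down to a multiple of 50.
# rep_match_for is a hypothetical helper name.
rep_match_for() {
  awk -v real="$1" 'BEGIN {
    scaled = real / 2861349177 * 1024
    rounded = int(scaled / 50) * 50
    if (rounded < 50) rounded = 50   # assumption: floor for very small genomes
    print rounded
  }'
}

rep_match_for 955370336   # prints: 300
```

With the kent tools installed, the two helpers could be chained, for example by piping the faSize output into real_bases_from_fasize and passing the result to rep_match_for.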
doSameSpeciesLiftOver.pl
This script runs the entire process. With the .2bit, chrom.sizes, and ooc files in place, the process is ready to run with this single command:
export target="GCA_000004515.3_Glycine_max_v2.0"
export query="GCF_000004515.3_V1.1"
cd /data/genomes/${target}
time (doSameSpeciesLiftOver.pl -verbose=2 -buildDir=`pwd` \
  -ooc=`pwd`/${target}.ooc -fileServer=localhost -localTmp="/dev/shm" \
  -bigClusterHub=localhost -dbHost=localhost -workhorse=localhost \
  -target2Bit=`pwd`/${target}.2bit -targetSizes=`pwd`/${target}.chrom.sizes \
  -query2Bit=/data/genomes/${query}/${query}.2bit \
  -querySizes=/data/genomes/${query}/${query}.chrom.sizes ${target} ${query}) > do.log 2>&1
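Because the command above names five input files, it can be worth confirming that they all exist before launching it. The following is a minimal sketch of such a preflight check, assuming the /data/genomes layout used in this walkthrough; check_liftover_inputs is a hypothetical helper name, not part of the kent scripts.

```shell
# Confirm that the .2bit, chrom.sizes and ooc inputs exist and are non-empty.
# check_liftover_inputs is a hypothetical helper name; arguments are the
# base directory, the target name and the query name.
check_liftover_inputs() {
  local base="$1" target="$2" query="$3" f status=0
  for f in \
    "${base}/${target}/${target}.2bit" \
    "${base}/${target}/${target}.chrom.sizes" \
    "${base}/${target}/${target}.ooc" \
    "${base}/${query}/${query}.2bit" \
    "${base}/${query}/${query}.chrom.sizes"
  do
    if [ ! -s "${f}" ]; then
      echo "missing or empty: ${f}" >&2
      status=1
    fi
  done
  return "${status}"
}

target="GCA_000004515.3_Glycine_max_v2.0"
query="GCF_000004515.3_V1.1"
if check_liftover_inputs /data/genomes "${target}" "${query}"; then
  echo "inputs ready"
else
  echo "inputs not ready"
fi
```

The check only looks at file presence and size; it does not validate the file contents.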
Result files
The liftOver file result is in this working directory:
cd /data/genomes/GCA_000004515.3_Glycine_max_v2.0
ls -og *.over.chain.gz
-rw-rw-r--. 1 1325174 Apr 14 17:27 GCA_000004515.3_Glycine_max_v2.0ToGCF_000004515.3_V1.1.over.chain.gz
This has also been converted to bigChain files for display in an assembly hub:
ls -og *.bb
-rw-rw-r--. 1 1622857 Apr 14 17:27 chainGCF_000004515.3_V1.1.bb
-rw-rw-r--. 1 1900014 Apr 14 17:27 chainGCF_000004515.3_V1.1Link.bb
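One simple way to confirm that the .over.chain.gz result is a readable, non-empty chain file is to count the chains it contains. In the chain format, each chain begins with a header line whose first word is chain. The sketch below relies on that; count_chains is a hypothetical helper name.

```shell
# Count the chain header lines in a gzip-compressed chain file.
# count_chains is a hypothetical helper name, not a kent utility.
count_chains() {
  gzip -dc "$1" | awk '$1 == "chain" { n++ } END { print n + 0 }'
}

# Run against the result file if it is in the current directory.
chain_file="GCA_000004515.3_Glycine_max_v2.0ToGCF_000004515.3_V1.1.over.chain.gz"
if [ -r "${chain_file}" ]; then
  count_chains "${chain_file}"
fi
```

A count of 0 would suggest that the process did not finish correctly and that do.log should be inspected.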
How does this process work
The doSameSpeciesLiftOver.pl script performs the processing in distinct steps. Each step is almost always performed with a C-shell or bash shell script. Therefore, if there is a problem in any step, the commands performing the step can be dissected from the script in operation, the problem identified and fixed, and the step completed manually by running the rest of the commands in that script. Once a step has been completed, the process can continue with the next step using the argument -continue=nextStepName. Check the usage message from the doSameSpeciesLiftOver.pl script to see a listing of the steps and their sequence. Specifically:
align, chain, net, load, cleanup
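To illustrate how the step order maps onto the -continue argument, here is a minimal sketch that looks up the step following one that has been completed. next_step is a hypothetical helper name; the step order is the one listed above, and the usage message of the script remains the authoritative source.

```shell
# Print the step that follows a completed step, for use with -continue.
# next_step is a hypothetical helper name; it returns non-zero for an
# unknown step or for cleanup, which has no successor.
next_step() {
  local done_step="$1" prev="" s
  [ -n "${done_step}" ] || return 1
  for s in align chain net load cleanup; do
    if [ "${prev}" = "${done_step}" ]; then
      echo "${s}"
      return 0
    fi
    prev="${s}"
  done
  return 1
}

next_step chain   # prints: net
```

After finishing the chain step by hand, for example, the original command could be rerun with -continue=net added to its options.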
In this example, the various scripts are:
-rwxrwxr-x. 1 1485 Apr 14 00:06 run.blat/doAlign.csh
-rwxrwxr-x. 1 2337 Apr 14 00:08 run.blat/job.csh
-rwxrwxr-x. 1 499 Apr 14 04:43 run.chain/job.csh
-rwxrwxr-x. 1 641 Apr 14 04:43 run.chain/doChain.csh
-rwxrwxr-x. 1 1870 Apr 14 17:26 run.chain/doNet.csh
-rwxrwxr-x. 1 2317 Apr 14 17:27 doLoad.csh
-rwxrwxr-x. 1 540 Apr 14 17:27 doCleanup.csh
The cluster runs performed were in run.blat and run.chain with the following typical processing times for this genome sequence:
cat run.blat/run.time
Completed: 2289 of 2289 jobs
CPU time in finished jobs: 658550s 10975.83m 182.93h 7.62d 0.021 y
IO & Wait Time: 9734s 162.24m 2.70h 0.11d 0.000 y
Average job time: 292s 4.87m 0.08h 0.00d
Longest finished job: 3935s 65.58m 1.09h 0.05d
Submission to last job: 16493s 274.88m 4.58h 0.19d
cat run.chain/run.time
Completed: 63 of 63 jobs
CPU time in finished jobs: 49s 0.82m 0.01h 0.00d 0.000 y
IO & Wait Time: 167s 2.78m 0.05h 0.00d 0.000 y
Average job time: 3s 0.06m 0.00h 0.00d
Longest finished job: 17s 0.28m 0.00h 0.00d
Submission to last job: 21s 0.35m 0.01h 0.00d
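The Completed line in each run.time file shows whether every cluster job finished. A minimal sketch of an automated check follows, assuming the "Completed: N of M jobs" line format shown above; all_jobs_completed is a hypothetical helper name.

```shell
# Succeed only if the run.time summary reports that all jobs completed.
# all_jobs_completed is a hypothetical helper name, not a parasol command.
all_jobs_completed() {
  awk '$1 == "Completed:" { found = 1; ok = ($2 == $4) }
       END { exit !(found && ok) }' "$1"
}

# Check both cluster runs if their summaries are present.
for f in run.blat/run.time run.chain/run.time; do
  [ -r "${f}" ] || continue
  if all_jobs_completed "${f}"; then
    echo "${f}: all jobs completed"
  else
    echo "${f}: incomplete"
  fi
done
```

A file with no Completed line is treated as incomplete, which errs on the side of prompting a closer look.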