DoBlastzChainNet.pl

From genomewiki
Revision as of 08:14, 1 September 2018 by Galt (talk | contribs) (genome-source fixes)
Jump to navigationJump to search

Licensing

For commercial use of these toolsets, please note the license considerations for the kent source tools at the: Genome Store

Prerequisites

This discussion assumes you are familiar with Unix shell command line programming and scripting. You will be encountering and interacting with csh/tcsh, bash, perl, and python scripting languages. You will need at least one computer with several CPU cores, preferably a multiple compute cluster system or equivalent in a cloud computing environment.

This entire discussion assumes the bash shell is the user's unix shell.

Compute resources

For any reasonable sized genome assemblies, this procedure will require cluster compute resources. Typical compute times can range from 1 to 2 days with 100 CPUs(cores). Much longer compute times will be seen for high contig count genome assemblies (hundreds of thousands of contigs) or for assemblies that are not well repeat masked. Please note this scatter plot and histogram showing compute time vs. genome size for alignments performed at UCSC

SizeVsTime.png LastzProcessingTimeHistogram.png

Parasol Job Control System

For cluster compute resources UCSC uses the parasol job control system. The scripts and programs used here expect to find the Parasol_job_control_system in place and operational.

Install scripts and kent command line utilities

This is a bit of a kludge at this time (April 2018), we are working on a cleaner distribution of these scripts. As was mentioned in the Parasol_job_control_system setup, the kent command line binaries and these scripts are going to reside in /data/bin/ and /data/scripts/. This is merely a style custom to keep scripts separate from binaries, this is not strictly necessary to keep them separate.


 mkdir -p /data/scripts /data/bin
 chmod 755 /data/scripts /data/bin

 rsync -a rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ /data/bin/
 git archive --remote=git://genome-source.soe.ucsc.edu/kent.git \
  --prefix=kent/ HEAD src/hg/utils/automation \
     | tar vxf - -C /data/scripts --strip-components=5 \
        --exclude='kent/src/hg/utils/automation/incidentDb' \
      --exclude='kent/src/hg/utils/automation/configFiles' \
      --exclude='kent/src/hg/utils/automation/ensGene' \
      --exclude='kent/src/hg/utils/automation/genbank' \
      --exclude='kent/src/hg/utils/automation/lastz_D' \
      --exclude='kent/src/hg/utils/automation/openStack'
  wget -O /data/bin/bedSingleCover.pl 'http://genome-source.soe.ucsc.edu/gitlist/kent.git/raw/master/src/utils/bedSingleCover.pl'

NOTE: A copy of the lastz binary is included in the rsync download of binaries from hgdownload. It is named lastz-1.04.00 to identify the version. Source for lastz can be obtained from lastz github.

PATH setup

Add or verify the two directories /data/bin and /data/scripts are added to the shell PATH environment. This can be added simply to the .bashrc file in your home directory:

echo 'export PATH=/data/bin:/data/scripts:$PATH' >> $HOME/.bashrc

Then, source that file to add that to this current shell:

. $HOME/.bashrc

Verify you see those pathnames on the PATH variable:

echo $PATH
/data/bin:/data/scripts:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/centos/.local/bin:/home/centos/bin

Working directory hierarchy

It is best to organize your work in a directory hierarchy. For example maintain all your genome sequences in:

 /data/genomes/
 /data/genomes/hg38/
 /data/genomes/mm10/
 /data/genomes/dm6/
 /data/genomes/ce11/
 ... etc ...

Where those database directories can have the 2bit files, chrom.sizes, and track construction directories, for example:

 /data/genomes/dm6/dm6.2bit
 /data/genomes/dm6/dm6.chrom.sizes
 /data/genomes/dm6/trackData/

Such organizations are a personal preference custom. However you do this, keep it consistent to make it easier to use scripts on multiple sequences.

Obtain genome sequences

Genome sequences from the U.C. Santa Cruz Genomics Institute can be obtained directly from the hgdownload server via rsync. For example

mkdir /data/genomes/dm6
cd /data/genomes/dm6
rsync -avzP \
   rsync://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.2bit .
rsync -avzP \
   rsync://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.chrom.sizes .
ls -og
-rw-rw-r--. 1 36969050 Aug 28  2014 dm6.2bit
-rw-rw-r--. 1    45055 Aug 28  2014 dm6.chrom.sizes

Genome sequences from the NCBI/Entrez/Genbank system can be found via the assembly_summary.txt text listing information files, for example invertebrate genomes:


 wget -O /tmp/invertebrate.assembly_summary.txt 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/assembly_summary.txt'

Looking for the Anopheles genome:


 grep -w Anopheles /tmp/invertebrate.assembly_summary.txt 
 GCF_000005575.2 PRJNA163        SAMN02952903    AAAB00000000.1  representative genome   180454  7165    Anopheles gambiae str. PEST     strain=PEST            latest   Chromosome      Major   Full    2006/10/16      AgamP3  The International Consortium for the Sequencing of Anopheles Genome     GCA_000005575.1 different       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/575/GCF_000005575.2_AgamP3

Note the Assembly identification from the ftp path GCF_000005575.2_AgamP3, working with that sequence, using the rsync service in place of the FTP:


 mkdir /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3
 cd /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3
 rsync -L -a -P rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/575/GCF_000005575.2_AgamP3/GCF_000005575.2_AgamP3_genomic.fna.gz ./
 rsync -L -a -P \
     rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/575/GCF_000005575.2_AgamP3/GCF_000005575.2_AgamP3_assembly_report.txt ./
 ls -og
 total 79908
 -rw-rw-r--. 1   811394 Mar  7  2017 GCF_000005575.2_AgamP3_assembly_report.txt
 -rw-rw-r--. 1 81010275 Jun 15  2016 GCF_000005575.2_AgamP3_genomic.fna.gz

The assembly_report.txt file is useful to have for the meta-data information it has about the assembly. The fna.gz file needs to be in 2bit format for the processing system, and the chrom.sizes made from the 2bit:

faToTwoBit GCF_000005575.2_AgamP3_genomic.fna.gz GCF_000005575.2_AgamP3.2bit
twoBitInfo GCF_000005575.2_AgamP3.2bit stdout | sort -k2,2nr > GCF_000005575.2_AgamP3.chrom.sizes
ls -og
total 156132
-rw-rw-r--. 1 77912208 Apr  6 03:48 GCF_000005575.2_AgamP3.2bit
-rw-rw-r--. 1   138303 Apr  6 03:48 GCF_000005575.2_AgamP3.chrom.sizes
-rw-rw-r--. 1   811394 Mar  7  2017 GCF_000005575.2_AgamP3_assembly_report.txt
-rw-rw-r--. 1 81010275 Jun 15  2016 GCF_000005575.2_AgamP3_genomic.fna.gz

lastz parameter file

SEE ALSO: lastz DEF file parameters

The DEF file is used with the script to specify alignment parameters to lastz and the axtChain operations. The example for dm6 target vs. A. gambiae query sequence, loose parameters are used for this distant alignment:

cat DEF
# dm6 vs GCF_000005575.2_AgamP3
PATH=/data/bin:/data/scripts
BLASTZ=/data/bin/lastz-1.04.00
BLASTZ_H=2000
BLASTZ_Y=3400
BLASTZ_L=4000
BLASTZ_K=2200
BLASTZ_Q=/data/lastz/HoxD55.q

# TARGET: D. melanogaster dm6
SEQ1_DIR=/data/genomes/dm6/dm6.2bit
SEQ1_LEN=/data/genomes/dm6/dm6.chrom.sizes
SEQ1_CHUNK=32100000
SEQ1_LAP=10000
SEQ1_LIMIT=18

# QUERY: GCF_000005575.2_AgamP3
SEQ2_DIR=/data/genomes/dm6/trackData/GCF_000005575.2_AgamP3/GCF_000005575.2_AgamP3.2bit
SEQ2_LEN=/data/genomes/dm6/trackData/GCF_000005575.2_AgamP3/GCF_000005575.2_AgamP3.chrom.sizes
SEQ2_CHUNK=1000000
SEQ2_LIMIT=2000
SEQ2_LAP=0

BASE=/data/genomes/dm6/trackData/GCF_000005575.2_AgamP3
TMPDIR=/dev/shm

NOTE: UCSC tends to keep options for running alignments to approximately four category sets:

  • human to other primates
  • human to other mammals
  • human to more distant vertebrates
  • fly and worm alignments and other such distant organisms

Many examples of DEF files and chaining arguments can be found in the record of alignments at UCSC in the source tree make doc files. For example, alignments to human/hg38: hg38 lastz and the 100-way alignment: Hg38_100-way_conservation_lastz_parameters

Many experiments have been tried over time. To keep it simple:

  • human to other primates
BLASTZ_M=254
BLASTZ_O=600
BLASTZ_E=150
BLASTZ_K=4500
BLASTZ_Y=15000
BLASTZ_T=2
BLASTZ_Q=/scratch/data/blastz/human_chimp.v2.q
# Where human_chimp.v2.q is:
#  A    C    G    T
#   90 -330 -236 -356
# -330  100 -318 -236
# -236 -318  100 -330
# -356 -236 -330   90
-chainMinScore=5000 -chainLinearGap=medium
  • human to other mammals
BLASTZ_O=400
BLASTZ_E=30
BLASTZ_M=254
# default BLASTZ_Q score matrix:
#       A     C     G     T
# A    91  -114   -31  -123
# C  -114   100  -125   -31
# G   -31  -125   100  -114
# T  -123   -31  -114    91
-chainMinScore=3000 -chainLinearGap=medium
  • human to more distant vertebrates
BLASTZ_M=50
BLASTZ_Y=3400
BLASTZ_L=6000
BLASTZ_K=2200
BLASTZ_Q=/scratch/data/blastz/HoxD55.q
# HoxD55.q matrix is:
#     A    C    G    T
#    91  -90  -25 -100
#   -90  100 -100  -25
#   -25 -100  100  -90
#  -100  -25  -90  91
-chainMinScore=5000 -chainLinearGap=loose
  • fly and worm alignments and other such distant organisms
BLASTZ_H=2000
BLASTZ_Y=3400
BLASTZ_L=4000
BLASTZ_K=2200
BLASTZ_Q=/data/lastz/HoxD55.q
# HoxD55.q matrix is:
#     A    C    G    T
#    91  -90  -25 -100
#   -90  100 -100  -25
#   -25 -100  100  -90
#  -100  -25  -90  91
-chainMinScore=1000 -chainLinearGap=loose

perform alignment

After the DEF file is established, verify the files specified in it are actually present at the locations specified:

egrep "_DIR|_LEN" DEF | sed -e 's/.*=//;' | xargs ls -og
-rw-rw-r--. 1 36969050 Aug 28  2014 /data/genomes/dm6/dm6.2bit
-rw-rw-r--. 1    45055 Aug 28  2014 /data/genomes/dm6/dm6.chrom.sizes
-rw-rw-r--. 1 77912208 Apr  6 03:48 /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3/GCF_000005575.2_AgamP3.2bit
-rw-rw-r--. 1   138303 Apr  6 03:48 /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3/GCF_000005575.2_AgamP3.chrom.sizes

Use a screen to keep this command attached to a terminal that you can detach from and reattach to at a later time. For large genomes, or with fewer CPU cores available, this command can run for many days. Time the operation of the command and record all output from it for later analysis if any problems arise from the operation:

screen -S dm6.GCF_000005575
time (/data/scripts/doBlastzChainNet.pl `pwd`/DEF -verbose=2 -noDbNameCheck \
 -workhorse=localhost -bigClusterHub=localhost -skipDownload \
   -dbHost=localhost -smallClusterHub=localhost \
     -trackHub -fileServer=localhost -syntenicNet) > do.log 2>&1 &

The screen -S dm6.GCF_000005575 gives a name to the terminal so you can find it later in a listing of a number of screens. To detach from the running terminal, use two key presses:

"Ctrl-a Ctrl-d"

to reattach to this screen: screen -r -d dm6.GCF_000005575

BEWARE the drawback of the screen is that you can accidentally exit the shell while in the screen and you thus lose it. The processes that were attached to that shell can continue if they do not respond to the SIGHUP signal. To avoid this side-effect, develop a habit of always exiting a shell with the two key presses "Ctrl-a Ctrl-d", in a shell that is not in a screen it will merely echo those keystrokes and do nothing.

Monitor progress

To determine which step the process is working on, in this working directory, look for the word step in the do.log file:

cd /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3
grep -w step do.log
HgStepManager: executing from step 'partition' through step 'syntenicNet'.
HgStepManager: executing step 'partition' Fri Apr  6 04:50:34 2018.
HgStepManager: executing step 'blastz' Fri Apr  6 04:50:46 2018.

To view the parasol status of your batch:

parasol list batches
#user     run   wait   done crash pri max cpu  ram  plan min batch
centos    323  18446  10283     0  10  -1   1  2.0g  323   0 /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3/run.blastz/

To view the status of that particular batch:

cd /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3/run.blastz/
para time
29052 jobs in batch
16647 jobs (including everybody's) in Parasol queue or running.
Checking finished jobs
..........................
Completed: 12405 of 29052 jobs
Jobs currently running: 323
In queue waiting: 16324 jobs
CPU time in finished jobs:      81145s    1352.42m    22.54h    0.94d  0.003 y
IO & Wait Time:                142817s    2380.28m    39.67h    1.65d  0.005 y
Time in running jobs:            4076s      67.93m     1.13h    0.05d  0.000 y
Average job time:                  18s       0.30m     0.01h    0.00d
Longest running job:               43s       0.72m     0.01h    0.00d
Longest finished job:              72s       1.20m     0.02h    0.00d
Submission to last job:           718s      11.97m     0.20h    0.01d
Estimated complete:               930s      15.51m     0.26h    0.01d

This example happens to be running on an Open Stack cluster with 323 allocated CPU cores:

parasol status
CPUs total: 323
CPUs free: 0
CPUs busy: 323
Nodes total: 20
Nodes dead: 0
Nodes sick?: 0
Jobs running:  323
Jobs waiting:  14977
Jobs finished: 13752
Jobs crashed:  0
Spokes free: 30
Spokes busy: 0
Spokes dead: 0
Active batches: 1
Total batches: 1
Active users: 1
Total users: 1
Days up: 0.012685
Version: 12.18

When a parasol batch is completed, this scripting process leaves a run.time file in the batch directory where you can see what type of cluster time you have used:

cd /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3/run.blastz
cat run.time
Completed: 29052 of 29052 jobs
CPU time in finished jobs:     149061s    2484.35m    41.41h    1.73d  0.005 y
IO & Wait Time:                268812s    4480.20m    74.67h    3.11d  0.009 y
Average job time:                  14s       0.24m     0.00h    0.00d
Longest finished job:              72s       1.20m     0.02h    0.00d
Submission to last job:          1312s      21.87m     0.36h    0.02d
Estimated complete:                 0s       0.00m     0.00h    0.00d

For small genomes such as the two in this example, the steps after the lastz alignment can proceed rapidly:

grep -w step do.log
HgStepManager: executing from step 'partition' through step 'syntenicNet'.
HgStepManager: executing step 'partition' Fri Apr  6 04:50:34 2018.
HgStepManager: executing step 'blastz' Fri Apr  6 04:50:46 2018.
HgStepManager: executing step 'cat' Fri Apr  6 05:13:52 2018.
HgStepManager: executing step 'chainRun' Fri Apr  6 05:14:12 2018.
HgStepManager: executing step 'chainMerge' Fri Apr  6 05:15:49 2018.
HgStepManager: executing step 'net' Fri Apr  6 05:15:58 2018.
HgStepManager: executing step 'load' Fri Apr  6 05:17:06 2018.
HgStepManager: executing step 'download' Fri Apr  6 05:17:35 2018.
HgStepManager: executing step 'cleanup' Fri Apr  6 05:17:36 2018.
HgStepManager: executing step 'syntenicNet' Fri Apr  6 05:17:42 2018.

The syntenicNet is the last step in this process, the do.log timing will indicate the full time for this alignment:

tail -3 do.log
real    27m20.456s
user    0m0.720s
sys     0m0.394s

And featureBits measurements have taken place to indicate the amount of coverage of the target genome by the query genome, for both the fundamental alignment, and the syntenic filtered alignment:

ls fb.*
fb.dm6.chain.GCF_000005575.2_AgamP3Link.txt
fb.dm6.chainSyn.GCF_000005575.2_AgamP3Link.txt
cat fb.*
19155294 bases of 143726002 (13.328%) in intersection
1617815 bases of 143726002 (1.126%) in intersection

Reciprocal Best

After that alignment is completed, the reciprocal best alignment can be computed:

cd /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3
export tDb=`grep "SEQ1_DIR=" DEF | sed -e 's#.*/##; s#.2bit##;'`
export qDb=`grep "SEQ2_DIR=" DEF | sed -e 's#.*/##; s#.2bit##;'`
export target2Bit=`grep "SEQ1_DIR=" DEF | sed -e 's/.*=//;'`
export targetSizes=`grep "SEQ1_LEN=" DEF | sed -e 's/.*=//;'`
export query2Bit=`grep "SEQ2_DIR=" DEF | sed -e 's/.*=//;'`
export querySizes=`grep "SEQ2_LEN=" DEF | sed -e 's/.*=//;'`
time (/data/scripts/doRecipBest.pl -buildDir=`pwd` -load \
  -workhorse=localhost -dbHost=localhost -skipDownload \
   -target2Bit=${target2Bit} -query2Bit=${query2Bit} \
    -targetSizes=${targetSizes} -querySizes=${querySizes} \
      -trackHub ${tDb} ${qDb}) > rbest.log 2>&1 &

This process does not have any parasol batch procedures. The procedure only does transformations on some of the result files computed during the first alignment procedure. Since this is not a parallel procedure, it takes a bit of time:

grep -w step rbest.log
HgStepManager: executing from step 'recipBest' through step 'cleanup'.
HgStepManager: executing step 'recipBest' Fri Apr  6 05:24:27 2018.
HgStepManager: executing step 'download' Fri Apr  6 05:55:38 2018.
HgStepManager: executing step 'load' Fri Apr  6 05:55:38 2018.
HgStepManager: executing step 'cleanup' Fri Apr  6 05:55:47 2018.

When completed, this has a featureBits measurement also:

cat fb.dm6.chainRBest.GCF_000005575.2_AgamP3.txt 
15316412 bases of 143726002 (10.657%) in intersection

Swap

Now that one chain is finished you can swap the reverse direction, note the "swap" and "swapDir" arguments to doBlastzChainNet.pl:

mkdir -p /data/genomes/oviAri4/trackData/openstack.lastzHg38.2018-04-26/
cd /data/genomes/oviAri4/trackData/openstack.lastzHg38.2018-04-26/

time (/data/scripts/doBlastzChainNet.pl \
   /data/genomes/hg38/trackData/openstack.lastzOviAri.2018-04-26/DEF \
   -swap -swapDir=`pwd` -verbose=2 -noDbNameCheck -workhorse=localhost \
   -bigClusterHub=localhost -skipDownload -dbHost=localhost \
   -smallClusterHub=localhost -trackHub -    fileServer=localhost \
   -syntenicNet) > swap.log 2>&1 &

Track Hub files

This procedure has constructed big* files that can be used to display these tracks in a track hub on the UCSC genome browser

ls axtChain/*.bb bigMaf/*.bb
axtChain/chainGCF_000005575.2_AgamP3.bb
axtChain/chainGCF_000005575.2_AgamP3Link.bb
axtChain/chainRBestGCF_000005575.2_AgamP3.bb
axtChain/chainRBestGCF_000005575.2_AgamP3Link.bb
axtChain/chainSynGCF_000005575.2_AgamP3.bb
axtChain/chainSynGCF_000005575.2_AgamP3Link.bb
bigMaf/dm6.GCF_000005575.2_AgamP3.net.bb
bigMaf/dm6.GCF_000005575.2_AgamP3.net.summary.bb
bigMaf/dm6.GCF_000005575.2_AgamP3.rbestNet.bb
bigMaf/dm6.GCF_000005575.2_AgamP3.rbestNet.summary.bb
bigMaf/dm6.GCF_000005575.2_AgamP3.synNet.bb
bigMaf/dm6.GCF_000005575.2_AgamP3.synNet.summary.bb

(TBD: show structure of trackDb.txt track hub specifications)

How does this process work

The doBlastzChainNet.pl script performs the processing in distinct steps. Each step is almost always performed with a C-shell or bash shell script. Therefore, if there is a problem in any step, the commands performing the step can be dissected from the script in operation, the problem identified and fixed, and the step completed manually by running the rest of the commands in that script. Once a step has been completed, the process can continue with the next step using the argument -continue=nextStepName. Check the usage message from the doBlastzChainNet.pl script to see a listing of the steps and their sequence. Specifically:

partition, blastz, cat, chainRun, chainMerge, net, load, download, cleanup, syntenicNet

In this example, the various scripts are:

-rwxrwxr-x. 1 1914 Apr  6 04:50 run.blastz/doPartition.bash
-rw-rw-r--. 1 2713 Apr  6 04:50 run.blastz/xdir.sh
-rwxrwxr-x. 1  606 Apr  6 04:50 run.blastz/doClusterRun.csh
-rwxrwxr-x. 1  802 Apr  6 05:13 run.cat/doCatRun.csh
-rwxrwxr-x. 1   72 Apr  6 05:13 run.cat/cat.csh
-rwxrwxr-x. 1  412 Apr  6 05:14 axtChain/run/chain.csh
-rwxrwxr-x. 1  700 Apr  6 05:14 axtChain/run/doChainRun.csh
-rwxrwxr-x. 1 3112 Apr  6 05:15 axtChain/netChains.csh
-rwxrwxr-x. 1 2162 Apr  6 05:17 axtChain/loadUp.csh
-rwxrwxr-x. 1 1479 Apr  6 05:17 cleanUp.csh
-rwxrwxr-x. 1 4507 Apr  6 05:17 axtChain/netSynteny.csh
-rwxrwxr-x. 1 6321 Apr  6 05:24 axtChain/doRecipBest.csh
-rwxrwxr-x. 1 1974 Apr  6 05:55 axtChain/loadRBest.csh
-rwxrwxr-x. 1  583 Apr  6 05:55 rBestCleanUp.bash