DoBlastzChainNet.pl: Difference between revisions
(Reciprocal Best) |
Matt Speir (talk | contribs) (→The new streamline pairLastz script: tweaking instructions in the "push it to the RR" section to remove confusing dev path an simplify pointer to GenArk pushing instructions.) |
||
(35 intermediate revisions by 6 users not shown) | |||
Line 1: | Line 1: | ||
==Licensing== | |||
For commercial use of these toolsets, please note the license considerations for the | |||
kent source tools at the: [https://genome-store.ucsc.edu/ Genome Store] | |||
==Prerequisites== | ==Prerequisites== | ||
Line 9: | Line 14: | ||
preferably a multiple [https://en.wikipedia.org/wiki/Computer_cluster compute cluster] system or equivalent | preferably a multiple [https://en.wikipedia.org/wiki/Computer_cluster compute cluster] system or equivalent | ||
in a [https://en.wikipedia.org/wiki/Cloud_computing cloud computing] environment. | in a [https://en.wikipedia.org/wiki/Cloud_computing cloud computing] environment. | ||
This entire discussion assumes the [https://en.wikipedia.org/wiki/Bash_(Unix_shell) bash shell] | |||
is the user's unix shell. | |||
==Compute resources== | |||
For any reasonable sized genome assemblies, this procedure will require cluster compute resources. | |||
Typical compute times can range from 1 to 2 days with 100 CPUs(cores). Much longer compute times | |||
will be seen for high contig count genome assemblies (hundreds of thousands of contigs) or for | |||
assemblies that are not well repeat masked. Please note this scatter plot and histogram | |||
showing compute time vs. genome size for alignments performed at [https://ucscgenomics.soe.ucsc.edu/ UCSC] | |||
[[Image:SizeVsTime.png|border|100px]] [[File:LastzProcessingTimeHistogram.png|border|100px]] | |||
==Parasol Job Control System== | ==Parasol Job Control System== | ||
The scripts and programs used here expect to find the [[Parasol_job_control_system]] in place | For cluster compute resources [https://ucscgenomics.soe.ucsc.edu/ UCSC] uses the parasol | ||
job control system. The scripts and programs used here expect to find the [[Parasol_job_control_system]] in place | |||
and operational. | and operational. | ||
Line 38: | Line 57: | ||
--exclude='kent/src/hg/utils/automation/lastz_D' \ | --exclude='kent/src/hg/utils/automation/lastz_D' \ | ||
--exclude='kent/src/hg/utils/automation/openStack' | --exclude='kent/src/hg/utils/automation/openStack' | ||
wget -O /data/bin/bedSingleCover.pl 'http://genome-source.soe.ucsc.edu/gitlist/kent.git/raw/master/src/utils/bedSingleCover.pl' | |||
</nowiki> | </nowiki> | ||
</pre> | </pre> | ||
'''NOTE:''' A copy of the '''lastz''' binary is included in the rsync download | |||
of binaries from hgdownload. It is named '''lastz-1.04.00''' to identify the version. | |||
Source for lastz can be obtained from [https://github.com/lastz/lastz lastz github.] | |||
==PATH setup== | ==PATH setup== | ||
Line 45: | Line 69: | ||
Add or verify the two directories '''/data/bin''' and '''/data/scripts''' are added | Add or verify the two directories '''/data/bin''' and '''/data/scripts''' are added | ||
to the shell '''PATH''' environment. This can be added simply to the '''.bashrc''' file in | to the shell '''PATH''' environment. This can be added simply to the '''.bashrc''' file in | ||
your home directory: | |||
echo 'export PATH=/data/bin:/data/scripts:$PATH' >> $HOME/.bashrc | echo 'export PATH=/data/bin:/data/scripts:$PATH' >> $HOME/.bashrc | ||
Line 57: | Line 81: | ||
echo $PATH | echo $PATH | ||
/data/bin:/data/scripts:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/centos/.local/bin:/home/centos/bin | /data/bin:/data/scripts:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/centos/.local/bin:/home/centos/bin | ||
==Working directory hierarchy== | ==Working directory hierarchy== | ||
Line 100: | Line 121: | ||
Genome sequences from the NCBI/Entrez/Genbank system can be found via the | Genome sequences from the NCBI/Entrez/Genbank system can be found via the | ||
assembly_summary.txt text listing information files, for example '''invertebrate''' genomes: | assembly_summary.txt text listing information files, for example '''invertebrate''' genomes: | ||
<pre> | |||
<nowiki> | |||
wget -O /tmp/invertebrate.assembly_summary.txt 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/assembly_summary.txt' | |||
</nowiki> | |||
</pre> | |||
Looking for the Anopheles genome: | Looking for the Anopheles genome: | ||
<pre> | <pre> | ||
Line 112: | Line 136: | ||
Note the Assembly identification from the ftp path '''GCF_000005575.2_AgamP3''', | Note the Assembly identification from the ftp path '''GCF_000005575.2_AgamP3''', | ||
working with that sequence: | working with that sequence, using the '''rsync''' service in place of the '''FTP''': | ||
<pre> | <pre> | ||
<nowiki> | <nowiki> | ||
mkdir /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3 | mkdir /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3 | ||
cd /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3 | cd /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3 | ||
rsync -L -a -P rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/575/GCF_000005575.2_AgamP3/GCF_000005575.2_AgamP3_genomic.fna.gz ./ | |||
rsync -L -a -P \ | |||
rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/575/GCF_000005575.2_AgamP3/GCF_000005575.2_AgamP3_assembly_report.txt ./ | |||
ls -og | ls -og | ||
total 79908 | total 79908 | ||
Line 140: | Line 164: | ||
==lastz parameter file== | ==lastz parameter file== | ||
'''SEE ALSO:''' [[lastz DEF file parameters]] | |||
The '''DEF''' file is used with the script to specify alignment parameters to '''lastz''' and | The '''DEF''' file is used with the script to specify alignment parameters to '''lastz''' and | ||
the '''axtChain''' operations | the '''axtChain''' operations. The example | ||
for '''dm6''' target vs. '''A. gambiae''' query sequence, loose parameters are used | for '''dm6''' target vs. '''A. gambiae''' query sequence, loose parameters are used | ||
for this '''distant''' alignment: | for this '''distant''' alignment: | ||
Line 149: | Line 175: | ||
# dm6 vs GCF_000005575.2_AgamP3 | # dm6 vs GCF_000005575.2_AgamP3 | ||
PATH=/data/bin:/data/scripts | PATH=/data/bin:/data/scripts | ||
BLASTZ=/data/ | BLASTZ=/data/bin/lastz-1.04.00 | ||
BLASTZ_H=2000 | BLASTZ_H=2000 | ||
BLASTZ_Y=3400 | BLASTZ_Y=3400 | ||
Line 172: | Line 198: | ||
BASE=/data/genomes/dm6/trackData/GCF_000005575.2_AgamP3 | BASE=/data/genomes/dm6/trackData/GCF_000005575.2_AgamP3 | ||
TMPDIR=/dev/shm | TMPDIR=/dev/shm | ||
'''NOTE:''' UCSC tends to keep options for running alignments to approximately four | |||
category sets: | |||
* human to other primates | |||
* human to other mammals | |||
* human to more distant vertebrates | |||
* fly and worm alignments and other such distant organisms | |||
Many examples of DEF files and chaining arguments can be found in the record of alignments | |||
at '''UCSC''' in the source tree '''make doc''' files. For example, alignments to human/hg38: | |||
[http://genome-source.cse.ucsc.edu/gitlist/kent.git/blob/master/src/hg/makeDb/doc/hg38/lastzRuns.txt hg38 lastz] | |||
and the 100-way alignment: [[Hg38_100-way_conservation_lastz_parameters]] | |||
Many experiments have been tried over time. To keep it simple: | |||
* '''human to other primates''' | |||
BLASTZ_M=254 | |||
BLASTZ_O=600 | |||
BLASTZ_E=150 | |||
BLASTZ_K=4500 | |||
BLASTZ_Y=15000 | |||
BLASTZ_T=2 | |||
BLASTZ_Q=/scratch/data/blastz/human_chimp.v2.q | |||
# Where human_chimp.v2.q is: | |||
# A C G T | |||
# 90 -330 -236 -356 | |||
# -330 100 -318 -236 | |||
# -236 -318 100 -330 | |||
# -356 -236 -330 90 | |||
-chainMinScore=5000 -chainLinearGap=medium | |||
* '''human to other mammals''' | |||
BLASTZ_O=400 | |||
BLASTZ_E=30 | |||
BLASTZ_M=254 | |||
# default BLASTZ_Q score matrix: | |||
# A C G T | |||
# A 91 -114 -31 -123 | |||
# C -114 100 -125 -31 | |||
# G -31 -125 100 -114 | |||
# T -123 -31 -114 91 | |||
-chainMinScore=3000 -chainLinearGap=medium | |||
* '''human to more distant vertebrates''' | |||
BLASTZ_M=50 | |||
BLASTZ_Y=3400 | |||
BLASTZ_L=6000 | |||
BLASTZ_K=2200 | |||
BLASTZ_Q=/scratch/data/blastz/HoxD55.q | |||
# HoxD55.q matrix is: | |||
# A C G T | |||
# 91 -90 -25 -100 | |||
# -90 100 -100 -25 | |||
# -25 -100 100 -90 | |||
# -100 -25 -90 91 | |||
-chainMinScore=5000 -chainLinearGap=loose | |||
* '''fly and worm alignments and other such distant organisms''' | |||
BLASTZ_H=2000 | |||
BLASTZ_Y=3400 | |||
BLASTZ_L=4000 | |||
BLASTZ_K=2200 | |||
BLASTZ_Q=/data/lastz/HoxD55.q | |||
# HoxD55.q matrix is: | |||
# A C G T | |||
# 91 -90 -25 -100 | |||
# -90 100 -100 -25 | |||
# -25 -100 100 -90 | |||
# -100 -25 -90 91 | |||
-chainMinScore=1000 -chainLinearGap=loose | |||
==perform alignment== | ==perform alignment== | ||
Line 191: | Line 288: | ||
screen -S dm6.GCF_000005575 | screen -S dm6.GCF_000005575 | ||
time (/data/scripts/ | time (/data/scripts/doBlastzChainNet.pl `pwd`/DEF -verbose=2 -noDbNameCheck \ | ||
-workhorse=localhost -bigClusterHub=localhost -skipDownload \ | -workhorse=localhost -bigClusterHub=localhost -skipDownload \ | ||
-dbHost=localhost -smallClusterHub=localhost \ | -dbHost=localhost -smallClusterHub=localhost \ | ||
Line 338: | Line 435: | ||
This process does not have any '''parasol batch''' procedures. The procedure only | This process does not have any '''parasol batch''' procedures. The procedure only | ||
does transformations on some of the result files computed during the first alignment | does transformations on some of the result files computed during the first alignment | ||
procedure. | procedure. Since this is not a parallel procedure, it takes a bit of time: | ||
grep -w step rbest.log | |||
HgStepManager: executing from step 'recipBest' through step 'cleanup'. | |||
HgStepManager: executing step 'recipBest' Fri Apr 6 05:24:27 2018. | |||
HgStepManager: executing step 'download' Fri Apr 6 05:55:38 2018. | |||
HgStepManager: executing step 'load' Fri Apr 6 05:55:38 2018. | |||
HgStepManager: executing step 'cleanup' Fri Apr 6 05:55:47 2018. | |||
When completed, this has a '''featureBits''' measurement also: | |||
cat fb.dm6.chainRBest.GCF_000005575.2_AgamP3.txt | |||
15316412 bases of 143726002 (10.657%) in intersection | |||
==Swap== | |||
Now that one chain is finished you can swap the reverse direction, note the "swap" and "swapDir" arguments to doBlastzChainNet.pl: | |||
mkdir -p /data/genomes/oviAri4/trackData/openstack.lastzHg38.2018-04-26/ | |||
cd /data/genomes/oviAri4/trackData/openstack.lastzHg38.2018-04-26/ | |||
time (/data/scripts/doBlastzChainNet.pl \ | |||
/data/genomes/hg38/trackData/openstack.lastzOviAri.2018-04-26/DEF \ | |||
-swap -swapDir=`pwd` -verbose=2 -noDbNameCheck -workhorse=localhost \ | |||
-bigClusterHub=localhost -skipDownload -dbHost=localhost \ | |||
-smallClusterHub=localhost -trackHub - fileServer=localhost \ | |||
-syntenicNet) > swap.log 2>&1 & | |||
==Track Hub files== | |||
This procedure has constructed '''big*''' files that can be used to display | |||
these tracks in a [http://genome.ucsc.edu/goldenPath/help/hgTrackHubHelp.html track hub] | |||
on the '''UCSC genome browser''' | |||
ls axtChain/*.bb bigMaf/*.bb | |||
axtChain/chainGCF_000005575.2_AgamP3.bb | |||
axtChain/chainGCF_000005575.2_AgamP3Link.bb | |||
axtChain/chainRBestGCF_000005575.2_AgamP3.bb | |||
axtChain/chainRBestGCF_000005575.2_AgamP3Link.bb | |||
axtChain/chainSynGCF_000005575.2_AgamP3.bb | |||
axtChain/chainSynGCF_000005575.2_AgamP3Link.bb | |||
bigMaf/dm6.GCF_000005575.2_AgamP3.net.bb | |||
bigMaf/dm6.GCF_000005575.2_AgamP3.net.summary.bb | |||
bigMaf/dm6.GCF_000005575.2_AgamP3.rbestNet.bb | |||
bigMaf/dm6.GCF_000005575.2_AgamP3.rbestNet.summary.bb | |||
bigMaf/dm6.GCF_000005575.2_AgamP3.synNet.bb | |||
bigMaf/dm6.GCF_000005575.2_AgamP3.synNet.summary.bb | |||
'''(TBD: show structure of trackDb.txt track hub specifications)''' | |||
==How does this process work== | |||
The '''doBlastzChainNet.pl''' script performs the processing in distinct '''steps'''. Each '''step''' is | |||
almost always performed with a C-shell or bash shell script. Therefore, if there is a problem in | |||
any '''step''', the commands performing the '''step''' can be dissected from the script in operation, | |||
the problem identified and fixed, and the '''step''' completed manually by running the rest of the | |||
commands in that script. Once a step has been completed, the process can continue with the next | |||
step using the argument '''-continue=nextStepName'''. Check the usage message from the '''doBlastzChainNet.pl''' | |||
script to see a listing of the '''steps''' and their sequence. Specifically: | |||
partition, blastz, cat, chainRun, chainMerge, net, load, download, cleanup, syntenicNet | |||
In this example, the various scripts are: | |||
-rwxrwxr-x. 1 1914 Apr 6 04:50 run.blastz/doPartition.bash | |||
-rw-rw-r--. 1 2713 Apr 6 04:50 run.blastz/xdir.sh | |||
-rwxrwxr-x. 1 606 Apr 6 04:50 run.blastz/doClusterRun.csh | |||
-rwxrwxr-x. 1 802 Apr 6 05:13 run.cat/doCatRun.csh | |||
-rwxrwxr-x. 1 72 Apr 6 05:13 run.cat/cat.csh | |||
-rwxrwxr-x. 1 412 Apr 6 05:14 axtChain/run/chain.csh | |||
-rwxrwxr-x. 1 700 Apr 6 05:14 axtChain/run/doChainRun.csh | |||
-rwxrwxr-x. 1 3112 Apr 6 05:15 axtChain/netChains.csh | |||
-rwxrwxr-x. 1 2162 Apr 6 05:17 axtChain/loadUp.csh | |||
-rwxrwxr-x. 1 1479 Apr 6 05:17 cleanUp.csh | |||
-rwxrwxr-x. 1 4507 Apr 6 05:17 axtChain/netSynteny.csh | |||
-rwxrwxr-x. 1 6321 Apr 6 05:24 axtChain/doRecipBest.csh | |||
-rwxrwxr-x. 1 1974 Apr 6 05:55 axtChain/loadRBest.csh | |||
-rwxrwxr-x. 1 583 Apr 6 05:55 rBestCleanUp.bash | |||
==Notes from Hiram's Talks (by Dan) 4/21 and 4/28/21== | |||
*LiftOver file creation, LASTZ Run, SameSpeciesLiftOver* | |||
SameSpeciesLiftOver, made for… same species with near identical assemblies, often not great for many assemblies, somewhat antiquated. Hardly used, done with BLAT because expects high similarity, very quick to run. Must be the same biological sample, ie. hg19->hg38, mm10->mm39. It’s one command, fairly simple. | |||
http://genomewiki.ucsc.edu/index.php/DoBlastzChainNet.pl | |||
LastZ Makedoc lives here: | |||
~/kent/src/hg/makeDb/doc/hg38/lastzRuns.txt | |||
Tuning was a thing tried that didn’t work, should be ignored. | |||
First you go to the makeDoc, change the query name, run a “screen” similar to nohup (old school), copy the commands onto the ku, | |||
Screen basics (they stay in the background) | |||
$ screen -ls | |||
$screen -S demoScreen #starts screen | |||
Ctr A, ctr D to detach screen | |||
$ screen -r -d demoScreen #go back to a detached screen | |||
If you exit with the exit command, it’ll terminate your screen prematurely | |||
Going through hg38 -> seaturtle | |||
KU is used by browser engineers, mostly hiram | |||
set -beEu -o pipefail | |||
If anything fails, hard exit | |||
First steps are making dir | |||
mkdir /hive/data/genomes/hg38/bed/lastzCheMyd1.2015-02-06 | |||
Set version of lastz with dir BLASTZ=(v1.04.03) | |||
The score matrix for the alignment is listed in a comment here | |||
# default BLASTZ_Q score matrix | |||
Documented in a printf now, eg human vs X. tropicalis, can’t use single quote within | |||
previously done with a “hereIs” doc | |||
setting params | |||
* 3 params (Go to one recently done and copy that, or go to shell script) | |||
** Primates to prim | |||
** Prim to mammal | |||
** Any to any, very distant | |||
*** These look like BLASTZ_A | |||
* Target init stays the same, query is what changes. | |||
* Chunk is how to split each seq: LASTZ handles 40Mbp target vs 20Mbp query. | |||
* Limit is how many seqs (contigs) you fit into each limit, eg. 3 hg38 and 800 turtle | |||
** This choice is important for how many independent jobs go into the cluster, should be between 10k- 100,000 | |||
** Num pieces target x number of query pieces | |||
* Run script first step to calculate how many jobs, see (http://genomewiki.ucsc.edu/index.php/Cluster_Jobs) | |||
** CrisprAll runs 5M jobs and 1 week to run | |||
* seq_LAP | |||
** how much overlap between sequences, eg. 10k for target and 0 for query | |||
** Prevent artificial breaks | |||
* BASE is where work is done | |||
* tempDir is ?? fastest file transfer dir | |||
Run the doBlastChainNet.pl command (14hr) | |||
* Produces regular and syntenic chain/net, only Regular shown | |||
* Syntenic going in order on chromosome, in synteny, filter of those so only those that go in same order remain, | |||
Run doRecipBest.pl (6hr) | |||
* If there’s a best align, this gets made in recipBest | |||
Do the reverse “swap” doBlastChainNet (2hr) | |||
* produces reg and syntenic chain/net | |||
For assembly hubs: | |||
* Different dir initialization, (BASE) | |||
* Different doBlastChainNet command. | |||
* Bunch of exports, setting variables also stored in BASE dir | |||
* Sed line annotates a comment in the makeDoc with bases in intersection | |||
* Mkdir in the assem hub dir, do the swap command, when you’re in swap land you set -target2Bit with query files 2bit and chrom.sizes | |||
This human->sheep took nearly 600 minutes, (100hrs) | |||
“You can count on a day to run” | |||
For genome without assembly Hub: | |||
* All it takes is 2bit and chom.sizes | |||
* You can’t view it without an assembly | |||
How to move around assembly hubs | |||
$ goto GCF_002742125, goes to proper dir | |||
Multiple alignments Downstream gets created from pairwise chain/net, though not always from “regular” one, which confuses users. | |||
'''4/28/21 Demo''' | |||
Demo rn7 to bosTau9 | |||
Smaller contig is target, larger is query. Figure this using wc -l on *.sizes file | |||
Copy similar makeDoc from similar org (See vim tutorial wiki): | |||
* Mark line with (m + a) | |||
* “ a ‘ a ------- yank lines and paste them into “lastzRuns.txt” | |||
* Search and replace all db names, dates, etc | |||
* 10s --- 10 letter substitute | |||
* . ---- repeat last command | |||
After you mod the file, run the command to put things into DEF folder | |||
Run some command (stop=partition), then check the number of job. N times M should be <100k, hiram uses “bc” to do the math. | |||
Make a screen, called rn7, run the timed command. | |||
* It logs into KU and runs the partitioned jobs with priority “nice” | |||
* Crt A, crt D to end screen | |||
Troubleshooting | |||
* Parasol batch fails | |||
** Go to dir where it was working, look at it, debug error manually | |||
See Recorded Demo here: | |||
https://hgwdev.gi.ucsc.edu/~dschmelt/HiramChainMakingDemo.mp4 | |||
==The new streamline pairLastz script== | |||
The pairLastz script streamlines making the DEF file, running doBlastzChainNet.pl, | |||
running doRecipBest.pl, and the swap into one script. | |||
~/kent/src/hg/utils/automation/pairLastz.sh | |||
usage: pairLastz.sh <target> <query> <tClade> <qClade> | |||
Before starting, add a file to your hgwdev home directory: | |||
/cluster/home/$user/.forward | |||
which is a single line of an email address where you would like to receive email from | |||
processes on hgwdev. You may already be receiving email from hgwdev, but with this file, | |||
you can control where it goes exactly. The pairLastz.sh script will send an email to your | |||
user name on hgwdev when it completes. | |||
'''pairLastzWrapper.py (Helpful)''' | |||
The pairLastzWrapper.py script determines the target, query, clades, and full GC assembly hub name to run the pairLastz script. You can then skip the "Determining target and query" and the "Determining clade" steps. | |||
python3 ~/kent/src/utils/qa/pairLastzWrapper.py -a1 hg38 -a2 GCF_001704415.1 | |||
screen -S 20220323 | |||
time (~/kent/src/hg/utils/automation/pairLastz.sh hg38 GCF_001704415.1_ARS1 primate mammal) > hg38.GCF_001704415.1_20220323.log 2>&1 & | |||
'''Determining target and query''' | |||
Use the following script to determine target and query (which are inputs to the alignment script). Here is an example | |||
python3 /hive/users/gperez2/liftOver/compareAssemblies.py -a1 calJac4 -a2 rheMac10 | |||
rheMac10 is target | |||
calJac4 is query | |||
<-- Extra Details --> You can use the scoring scheme to determine which should be the target and which would be the query: | |||
#1 point for largest genome | |||
#1 point for the lowest number of contigs/scaffolds/chromosomes | |||
#1 point for lowest N50 number | |||
#1 point for largest N50 sequence | |||
All of the measurements can be obtained with the n50.pl script | |||
on the chrom.sizes for the genomes. The higher scoring genome should | |||
be the target. In the case of a tie, then the largest genome is the target. | |||
If one of the assemblies is an assembly hub and available on GenArk, you can go to the directory | |||
using the following command: | |||
$ goto GCF_002742125.1 | |||
If you don't have the goto command, you can then add the following to your .bashrc file: | |||
<pre> | |||
<nowiki> | |||
function goto() { | |||
export asmId=$1 | |||
export gcX=`echo "$asmId" | cut -c1-3` | |||
export d0=`echo "$asmId" | cut -c5-7` | |||
export d1=`echo "$asmId" | cut -c8-10` | |||
export d2=`echo "$asmId" | cut -c11-13` | |||
export destDir="" | |||
if [ "${gcX}" = "GCF" ]; then | |||
destDir="/hive/data/genomes/asmHubs/refseqBuild/${gcX}/${d0}/${d1}/${d2}" | |||
else | |||
destDir="/hive/data/genomes/asmHubs/genbankBuild/${gcX}/${d0}/${d1}/${d2}" | |||
fi | |||
# printf "$destDir\n" 1>&2 | |||
export cdDir=`ls -d ${destDir}/${asmId}*` | |||
if [ -d "${cdDir}" ]; then | |||
printf "cd $cdDir\n" 1>&2 | |||
cd "$cdDir" | |||
else | |||
printf "# can not find ${cdDir}\n" 1>&2 | |||
fi | |||
} | |||
</nowiki> | |||
</pre> | |||
'''Determining clade''' | |||
You need to specify clade of the target and query organisms that control the alignment scoring scheme (thanks Hiram!). There are 3 categories of clade options. You want to use the most specific. | |||
primate | |||
mammal | |||
other | |||
'''Check machines on ku''' | |||
Check if machines are down or backed up with job queries: | |||
ssh ku | |||
parasol status | |||
parasol list machines | |||
parasol list batches | |||
Exit from ku and move to a screen | |||
'''Moving to a screen''' | |||
You can check your screens with the following command: | |||
screen -list | |||
You can make a new screen with the following command: | |||
screen -S screenName | |||
You can exit the screen by doing Ctrl-a then Ctrl-d | |||
You can reattach to the screen by the following command: | |||
screen -r -d screenName | |||
'''Running pairLastz.sh''' | |||
The script can be ran from anywhere on '''dev'''. Hiram and Gerardo have been running the pairLastz.sh script from the automation directory (~/kent/src/hg/utils/automation/). The following command will write the status of the script | |||
to a file and be used to update the makeDoc: | |||
time (~/kent/src/hg/utils/automation/pairLastz.sh \ | |||
target query tClade qClade) \ | |||
> fileName.log 2>&1 & | |||
You can check the status of the alignment by doing a tail on the fileName.log file. | |||
You can see the following fileName.log as an example of what to expect: | |||
/cluster/home/gperez2/kent/src/hg/utils/automation/sheepLiftOver_20211027.log | |||
If the pairLastz.sh run gets killed before completing the alignment, then contact Hiram to see if some files can be salvaged. | |||
'''Updating makeDoc''' | |||
After the script finishes running, you should update the makeDoc. | |||
The makeDoc’s can be found in: | |||
~/kent/src/hg/makeDb/doc/ | |||
If the liftOver files are for two GenArk assemblies, you can update the generic makedoc: | |||
~/kent/src/hg/makeDb/doc/asmHubs/lastzRuns.txt | |||
The following is an example on what goes on a makeDoc: | |||
git show 1054d4e3c1d6bdb3d2799860d0eca70dfd35447c | |||
<pre> | |||
<nowiki> | |||
############################################################################## | |||
# LASTZ human hg19 (DONE - 2021-09-30 - Hiram) | |||
# should be able to run this from anywhere, this time it was run from: | |||
cd /hive/data/genomes/mm39/bed | |||
time (~/kent/src/hg/utils/automation/pairLastz.sh mm39 hg19 mammal primate) \ | |||
> mm39.hg19.log 2>&1 | |||
grep -w real mm39.hg19.log | tail -1 | sed -e 's/^/ # /;' | |||
# real 1200m57.924s | |||
# this command outputs this makeDoc text: | |||
cat lastz.hg19/makeDoc.txt | |||
############################################################################## | |||
# LASTZ Mouse Mm39 vs. Human Hg19 (DONE - 2021-09-30 - Hiram) | |||
mkdir /hive/data/genomes/mm39/bed/lastzHg19.2021-09-30 | |||
cd /hive/data/genomes/mm39/bed/lastzHg19.2021-09-30 | |||
printf '# Human Hg19 vs. Mouse Mm39 | |||
BLASTZ=/cluster/bin/penn/lastz-distrib-1.04.03/bin/lastz | |||
# TARGET: Mouse Mm39 | |||
SEQ1_DIR=/hive/data/genomes/mm39/mm39.2bit | |||
SEQ1_LEN=/hive/data/genomes/mm39/chrom.sizes | |||
SEQ1_CHUNK=20000000 | |||
SEQ1_LAP=10000 | |||
SEQ1_LIMIT=40 | |||
# QUERY: Human Hg19 | |||
SEQ2_DIR=/hive/data/genomes/hg19/hg19.2bit | |||
SEQ2_LEN=/hive/data/genomes/hg19/chrom.sizes | |||
SEQ2_CHUNK=20000000 | |||
SEQ2_LAP=0 | |||
SEQ2_LIMIT=100 | |||
BASE=/hive/data/genomes/mm39/bed/lastzHg19.2021-09-30 | |||
TMPDIR=/dev/shm | |||
' > DEF | |||
time (~/kent/src/hg/utils/automation/doBlastzChainNet.pl -verbose=2 `pwd`/DEF -syntenicNet \ | |||
-workhorse=hgwdev -smallClusterHub=hgwdev -bigClusterHub=ku \ | |||
-chainMinScore=3000 -chainLinearGap=medium) > do.log 2>&1 | |||
grep -w real do.log | sed -e 's/^/ # /;' | |||
# real 474m8.118s | |||
# # real 474m8.118s | |||
# # real 370m29.622s | |||
# real 844m37.788s | |||
sed -e 's/^/ # /;' fb.mm39.chainHg19Link.txt | |||
# 938444606 bases of 2654624157 (35.351%) in intersection | |||
### and for the swap | |||
cd /hive/data/genomes/hg19/bed/blastz.mm39.swap | |||
time (~/kent/src/hg/utils/automation/doBlastzChainNet.pl -swap -verbose=2 \ | |||
/hive/data/genomes/mm39/bed/lastzHg19.2021-09-30/DEF -swapDir=`pwd` \ | |||
-syntenicNet -workhorse=hgwdev -smallClusterHub=hgwdev -bigClusterHub=ku \ | |||
-chainMinScore=3000 -chainLinearGap=medium) > swap.log 2>&1 | |||
grep -w real swap.log | sed -e 's/^/ # /;' | |||
# real 57m17.328s | |||
sed -e 's/^/ # /;' fb.hg19.chainMm39Link.txt | |||
# 969322683 bases of 2991710746 (32.400%) in intersection | |||
sed -e 's/^/ # /;' fb.hg19.chainSynMm39Link.txt | |||
# 921405754 bases of 2991710746 (30.799%) in intersection | |||
time (~/kent/src/hg/utils/automation/doRecipBest.pl -load -workhorse=hgwdev -buildDir=`pwd` \ | |||
hg19 mm39) > rbest.log 2>&1 | |||
grep -w real rbest.log | sed -e 's/^/ # /;' | |||
# real 299m2.449s | |||
sed -e 's/^/ # /;' fb.hg19.chainRBest.Mm39.txt | |||
# 892863094 bases of 2991710746 (29.845%) in intersection | |||
############################################################################## | |||
</nowiki> | |||
</pre> | |||
'''Push it to the RR''' | |||
Native databases to native databases chain files can be pushed to the RR by a push request: | |||
/usr/local/apache/htdocs-hgdownload/goldenPath/db/liftOver/dbToDb.over.chain.gz | |||
Native databases to genArk genomes chain files can be pushed to the RR by a push request: | |||
/usr/local/apache/htdocs-hgdownload/goldenPath/db/liftOver/dbToGC*.over.chain.gz | |||
For chain files from GenArk genomes to both native databases or other GenArk genomes, follow the commands found at [https://genomewiki.ucsc.edu/genecats/index.php?title=Building_GenArk_genomes#Pushing_to_the_RR Pushing_to_the_RR] | |||
[[Category:Cluster FAQ]] | [[Category:Cluster FAQ]] | ||
[[Category:Technical FAQ]] | [[Category:Technical FAQ]] |
Latest revision as of 16:29, 16 July 2024
Licensing
For commercial use of these toolsets, please note the license considerations for the kent source tools at the: Genome Store
Prerequisites
This discussion assumes you are familiar with Unix shell command line programming and scripting. You will be encountering and interacting with csh/tcsh, bash, perl, and python scripting languages. You will need at least one computer with several CPU cores, preferably a multiple compute cluster system or equivalent in a cloud computing environment.
This entire discussion assumes the bash shell is the user's unix shell.
Compute resources
For any reasonable sized genome assemblies, this procedure will require cluster compute resources. Typical compute times can range from 1 to 2 days with 100 CPUs(cores). Much longer compute times will be seen for high contig count genome assemblies (hundreds of thousands of contigs) or for assemblies that are not well repeat masked. Please note this scatter plot and histogram showing compute time vs. genome size for alignments performed at UCSC
Parasol Job Control System
For cluster compute resources UCSC uses the parasol job control system. The scripts and programs used here expect to find the Parasol_job_control_system in place and operational.
Install scripts and kent command line utilities
This is a bit of a kludge at this time (April 2018), we are working on a cleaner distribution of these scripts. As was mentioned in the Parasol_job_control_system setup, the kent command line binaries and these scripts are going to reside in /data/bin/ and /data/scripts/. This is merely a style custom to keep scripts separate from binaries, this is not strictly necessary to keep them separate.
mkdir -p /data/scripts /data/bin chmod 755 /data/scripts /data/bin rsync -a rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ /data/bin/ git archive --remote=git://genome-source.soe.ucsc.edu/kent.git \ --prefix=kent/ HEAD src/hg/utils/automation \ | tar vxf - -C /data/scripts --strip-components=5 \ --exclude='kent/src/hg/utils/automation/incidentDb' \ --exclude='kent/src/hg/utils/automation/configFiles' \ --exclude='kent/src/hg/utils/automation/ensGene' \ --exclude='kent/src/hg/utils/automation/genbank' \ --exclude='kent/src/hg/utils/automation/lastz_D' \ --exclude='kent/src/hg/utils/automation/openStack' wget -O /data/bin/bedSingleCover.pl 'http://genome-source.soe.ucsc.edu/gitlist/kent.git/raw/master/src/utils/bedSingleCover.pl'
NOTE: A copy of the lastz binary is included in the rsync download of binaries from hgdownload. It is named lastz-1.04.00 to identify the version. Source for lastz can be obtained from lastz github.
PATH setup
Add or verify the two directories /data/bin and /data/scripts are added to the shell PATH environment. This can be added simply to the .bashrc file in your home directory:
echo 'export PATH=/data/bin:/data/scripts:$PATH' >> $HOME/.bashrc
Then, source that file to add that to this current shell:
. $HOME/.bashrc
Verify you see those pathnames on the PATH variable:
echo $PATH /data/bin:/data/scripts:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/centos/.local/bin:/home/centos/bin
Working directory hierarchy
It is best to organize your work in a directory hierarchy. For example maintain all your genome sequences in:
/data/genomes/ /data/genomes/hg38/ /data/genomes/mm10/ /data/genomes/dm6/ /data/genomes/ce11/ ... etc ...
Where those database directories can have the 2bit files, chrom.sizes, and track construction directories, for example:
/data/genomes/dm6/dm6.2bit /data/genomes/dm6/dm6.chrom.sizes /data/genomes/dm6/trackData/
Such organizations are a personal preference custom. However you do this, keep it consistent to make it easier to use scripts on multiple sequences.
Obtain genome sequences
Genome sequences from the U.C. Santa Cruz Genomics Institute can be obtained directly from the hgdownload server via rsync. For example
mkdir /data/genomes/dm6 cd /data/genomes/dm6 rsync -avzP \ rsync://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.2bit . rsync -avzP \ rsync://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.chrom.sizes . ls -og -rw-rw-r--. 1 36969050 Aug 28 2014 dm6.2bit -rw-rw-r--. 1 45055 Aug 28 2014 dm6.chrom.sizes
Genome sequences from the NCBI/Entrez/Genbank system can be found via the assembly_summary.txt text listing information files, for example invertebrate genomes:
wget -O /tmp/invertebrate.assembly_summary.txt 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/assembly_summary.txt'
Looking for the Anopheles genome:
grep -w Anopheles /tmp/invertebrate.assembly_summary.txt GCF_000005575.2 PRJNA163 SAMN02952903 AAAB00000000.1 representative genome 180454 7165 Anopheles gambiae str. PEST strain=PEST latest Chromosome Major Full 2006/10/16 AgamP3 The International Consortium for the Sequencing of Anopheles Genome GCA_000005575.1 different ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/575/GCF_000005575.2_AgamP3
Note the Assembly identification from the ftp path GCF_000005575.2_AgamP3, working with that sequence, using the rsync service in place of the FTP:
mkdir /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3 cd /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3 rsync -L -a -P rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/575/GCF_000005575.2_AgamP3/GCF_000005575.2_AgamP3_genomic.fna.gz ./ rsync -L -a -P \ rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/575/GCF_000005575.2_AgamP3/GCF_000005575.2_AgamP3_assembly_report.txt ./ ls -og total 79908 -rw-rw-r--. 1 811394 Mar 7 2017 GCF_000005575.2_AgamP3_assembly_report.txt -rw-rw-r--. 1 81010275 Jun 15 2016 GCF_000005575.2_AgamP3_genomic.fna.gz
The assembly_report.txt file is useful to have for the meta-data information it has about the assembly. The fna.gz file needs to be in 2bit format for the processing system, and the chrom.sizes made from the 2bit:
faToTwoBit GCF_000005575.2_AgamP3_genomic.fna.gz GCF_000005575.2_AgamP3.2bit twoBitInfo GCF_000005575.2_AgamP3.2bit stdout | sort -k2,2nr > GCF_000005575.2_AgamP3.chrom.sizes ls -og total 156132 -rw-rw-r--. 1 77912208 Apr 6 03:48 GCF_000005575.2_AgamP3.2bit -rw-rw-r--. 1 138303 Apr 6 03:48 GCF_000005575.2_AgamP3.chrom.sizes -rw-rw-r--. 1 811394 Mar 7 2017 GCF_000005575.2_AgamP3_assembly_report.txt -rw-rw-r--. 1 81010275 Jun 15 2016 GCF_000005575.2_AgamP3_genomic.fna.gz
lastz parameter file
SEE ALSO: lastz DEF file parameters
The DEF file is used with the script to specify alignment parameters to lastz and the axtChain operations. The example for dm6 target vs. A. gambiae query sequence, loose parameters are used for this distant alignment:
cat DEF # dm6 vs GCF_000005575.2_AgamP3 PATH=/data/bin:/data/scripts BLASTZ=/data/bin/lastz-1.04.00 BLASTZ_H=2000 BLASTZ_Y=3400 BLASTZ_L=4000 BLASTZ_K=2200 BLASTZ_Q=/data/lastz/HoxD55.q # TARGET: D. melanogaster dm6 SEQ1_DIR=/data/genomes/dm6/dm6.2bit SEQ1_LEN=/data/genomes/dm6/dm6.chrom.sizes SEQ1_CHUNK=32100000 SEQ1_LAP=10000 SEQ1_LIMIT=18 # QUERY: GCF_000005575.2_AgamP3 SEQ2_DIR=/data/genomes/dm6/trackData/GCF_000005575.2_AgamP3/GCF_000005575.2_AgamP3.2bit SEQ2_LEN=/data/genomes/dm6/trackData/GCF_000005575.2_AgamP3/GCF_000005575.2_AgamP3.chrom.sizes SEQ2_CHUNK=1000000 SEQ2_LIMIT=2000 SEQ2_LAP=0 BASE=/data/genomes/dm6/trackData/GCF_000005575.2_AgamP3 TMPDIR=/dev/shm
NOTE: UCSC tends to keep options for running alignments to approximately four category sets:
- human to other primates
- human to other mammals
- human to more distant vertebrates
- fly and worm alignments and other such distant organisms
Many examples of DEF files and chaining arguments can be found in the record of alignments at UCSC in the source tree make doc files. For example, alignments to human/hg38: hg38 lastz and the 100-way alignment: Hg38_100-way_conservation_lastz_parameters
Many experiments have been tried over time. To keep it simple:
- human to other primates
BLASTZ_M=254 BLASTZ_O=600 BLASTZ_E=150 BLASTZ_K=4500 BLASTZ_Y=15000 BLASTZ_T=2 BLASTZ_Q=/scratch/data/blastz/human_chimp.v2.q # Where human_chimp.v2.q is: # A C G T # 90 -330 -236 -356 # -330 100 -318 -236 # -236 -318 100 -330 # -356 -236 -330 90 -chainMinScore=5000 -chainLinearGap=medium
- human to other mammals
BLASTZ_O=400 BLASTZ_E=30 BLASTZ_M=254 # default BLASTZ_Q score matrix: # A C G T # A 91 -114 -31 -123 # C -114 100 -125 -31 # G -31 -125 100 -114 # T -123 -31 -114 91 -chainMinScore=3000 -chainLinearGap=medium
- human to more distant vertebrates
BLASTZ_M=50 BLASTZ_Y=3400 BLASTZ_L=6000 BLASTZ_K=2200 BLASTZ_Q=/scratch/data/blastz/HoxD55.q # HoxD55.q matrix is: # A C G T # 91 -90 -25 -100 # -90 100 -100 -25 # -25 -100 100 -90 # -100 -25 -90 91 -chainMinScore=5000 -chainLinearGap=loose
- fly and worm alignments and other such distant organisms
BLASTZ_H=2000 BLASTZ_Y=3400 BLASTZ_L=4000 BLASTZ_K=2200 BLASTZ_Q=/data/lastz/HoxD55.q # HoxD55.q matrix is: # A C G T # 91 -90 -25 -100 # -90 100 -100 -25 # -25 -100 100 -90 # -100 -25 -90 91 -chainMinScore=1000 -chainLinearGap=loose
perform alignment
After the DEF file is established, verify the files specified in it are actually present at the locations specified:
egrep "_DIR|_LEN" DEF | sed -e 's/.*=//;' | xargs ls -og -rw-rw-r--. 1 36969050 Aug 28 2014 /data/genomes/dm6/dm6.2bit -rw-rw-r--. 1 45055 Aug 28 2014 /data/genomes/dm6/dm6.chrom.sizes -rw-rw-r--. 1 77912208 Apr 6 03:48 /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3/GCF_000005575.2_AgamP3.2bit -rw-rw-r--. 1 138303 Apr 6 03:48 /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3/GCF_000005575.2_AgamP3.chrom.sizes
Use a screen to keep this command attached to a terminal that you can detach from and reattach to at a later time. For large genomes, or with fewer CPU cores available, this command can run for many days. Time the operation of the command and record all output from it for later analysis if any problems arise from the operation:
screen -S dm6.GCF_000005575 time (/data/scripts/doBlastzChainNet.pl `pwd`/DEF -verbose=2 -noDbNameCheck \ -workhorse=localhost -bigClusterHub=localhost -skipDownload \ -dbHost=localhost -smallClusterHub=localhost \ -trackHub -fileServer=localhost -syntenicNet) > do.log 2>&1 &
The screen -S dm6.GCF_000005575 gives a name to the terminal so you can find it later in a listing of a number of screens. To detach from the running terminal, use two key presses:
"Ctrl-a Ctrl-d"
to reattach to this screen: screen -r -d dm6.GCF_000005575
BEWARE the drawback of the screen is that you can accidentally exit the shell while in the screen and you thus lose it. The processes that were attached to that shell can continue if they do not respond to the SIGHUP signal. To avoid this side-effect, develop a habit of always exiting a shell with the two key presses "Ctrl-a Ctrl-d", in a shell that is not in a screen it will merely echo those keystrokes and do nothing.
Monitor progress
To determine which step the process is working on, in this working directory, look for the word step in the do.log file:
cd /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3 grep -w step do.log HgStepManager: executing from step 'partition' through step 'syntenicNet'. HgStepManager: executing step 'partition' Fri Apr 6 04:50:34 2018. HgStepManager: executing step 'blastz' Fri Apr 6 04:50:46 2018.
To view the parasol status of your batch:
parasol list batches #user run wait done crash pri max cpu ram plan min batch centos 323 18446 10283 0 10 -1 1 2.0g 323 0 /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3/run.blastz/
To view the status of that particular batch:
cd /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3/run.blastz/ para time 29052 jobs in batch 16647 jobs (including everybody's) in Parasol queue or running. Checking finished jobs .......................... Completed: 12405 of 29052 jobs Jobs currently running: 323 In queue waiting: 16324 jobs CPU time in finished jobs: 81145s 1352.42m 22.54h 0.94d 0.003 y IO & Wait Time: 142817s 2380.28m 39.67h 1.65d 0.005 y Time in running jobs: 4076s 67.93m 1.13h 0.05d 0.000 y Average job time: 18s 0.30m 0.01h 0.00d Longest running job: 43s 0.72m 0.01h 0.00d Longest finished job: 72s 1.20m 0.02h 0.00d Submission to last job: 718s 11.97m 0.20h 0.01d Estimated complete: 930s 15.51m 0.26h 0.01d
This example happens to be running on an Open Stack cluster with 323 allocated CPU cores:
parasol status CPUs total: 323 CPUs free: 0 CPUs busy: 323 Nodes total: 20 Nodes dead: 0 Nodes sick?: 0 Jobs running: 323 Jobs waiting: 14977 Jobs finished: 13752 Jobs crashed: 0 Spokes free: 30 Spokes busy: 0 Spokes dead: 0 Active batches: 1 Total batches: 1 Active users: 1 Total users: 1 Days up: 0.012685 Version: 12.18
When a parasol batch is completed, this scripting process leaves a run.time file in the batch directory where you can see what type of cluster time you have used:
cd /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3/run.blastz cat run.time Completed: 29052 of 29052 jobs CPU time in finished jobs: 149061s 2484.35m 41.41h 1.73d 0.005 y IO & Wait Time: 268812s 4480.20m 74.67h 3.11d 0.009 y Average job time: 14s 0.24m 0.00h 0.00d Longest finished job: 72s 1.20m 0.02h 0.00d Submission to last job: 1312s 21.87m 0.36h 0.02d Estimated complete: 0s 0.00m 0.00h 0.00d
For small genomes such as the two in this example, the steps after the lastz alignment can proceed rapidly:
grep -w step do.log HgStepManager: executing from step 'partition' through step 'syntenicNet'. HgStepManager: executing step 'partition' Fri Apr 6 04:50:34 2018. HgStepManager: executing step 'blastz' Fri Apr 6 04:50:46 2018. HgStepManager: executing step 'cat' Fri Apr 6 05:13:52 2018. HgStepManager: executing step 'chainRun' Fri Apr 6 05:14:12 2018. HgStepManager: executing step 'chainMerge' Fri Apr 6 05:15:49 2018. HgStepManager: executing step 'net' Fri Apr 6 05:15:58 2018. HgStepManager: executing step 'load' Fri Apr 6 05:17:06 2018. HgStepManager: executing step 'download' Fri Apr 6 05:17:35 2018. HgStepManager: executing step 'cleanup' Fri Apr 6 05:17:36 2018. HgStepManager: executing step 'syntenicNet' Fri Apr 6 05:17:42 2018.
The syntenicNet is the last step in this process, the do.log timing will indicate the full time for this alignment:
tail -3 do.log real 27m20.456s user 0m0.720s sys 0m0.394s
And featureBits measurements have taken place to indicate the amount of coverage of the target genome by the query genome, for both the fundamental alignment, and the syntenic filtered alignment:
ls fb.* fb.dm6.chain.GCF_000005575.2_AgamP3Link.txt fb.dm6.chainSyn.GCF_000005575.2_AgamP3Link.txt
cat fb.* 19155294 bases of 143726002 (13.328%) in intersection 1617815 bases of 143726002 (1.126%) in intersection
Reciprocal Best
After that alignment is completed, the reciprocal best alignment can be computed:
cd /data/genomes/dm6/trackData/GCF_000005575.2_AgamP3 export tDb=`grep "SEQ1_DIR=" DEF | sed -e 's#.*/##; s#.2bit##;'` export qDb=`grep "SEQ2_DIR=" DEF | sed -e 's#.*/##; s#.2bit##;'` export target2Bit=`grep "SEQ1_DIR=" DEF | sed -e 's/.*=//;'` export targetSizes=`grep "SEQ1_LEN=" DEF | sed -e 's/.*=//;'` export query2Bit=`grep "SEQ2_DIR=" DEF | sed -e 's/.*=//;'` export querySizes=`grep "SEQ2_LEN=" DEF | sed -e 's/.*=//;'` time (/data/scripts/doRecipBest.pl -buildDir=`pwd` -load \ -workhorse=localhost -dbHost=localhost -skipDownload \ -target2Bit=${target2Bit} -query2Bit=${query2Bit} \ -targetSizes=${targetSizes} -querySizes=${querySizes} \ -trackHub ${tDb} ${qDb}) > rbest.log 2>&1 &
This process does not have any parasol batch procedures. The procedure only does transformations on some of the result files computed during the first alignment procedure. Since this is not a parallel procedure, it takes a bit of time:
grep -w step rbest.log HgStepManager: executing from step 'recipBest' through step 'cleanup'. HgStepManager: executing step 'recipBest' Fri Apr 6 05:24:27 2018. HgStepManager: executing step 'download' Fri Apr 6 05:55:38 2018. HgStepManager: executing step 'load' Fri Apr 6 05:55:38 2018. HgStepManager: executing step 'cleanup' Fri Apr 6 05:55:47 2018.
When completed, this has a featureBits measurement also:
cat fb.dm6.chainRBest.GCF_000005575.2_AgamP3.txt 15316412 bases of 143726002 (10.657%) in intersection
Swap
Now that one chain is finished you can swap the reverse direction, note the "swap" and "swapDir" arguments to doBlastzChainNet.pl:
mkdir -p /data/genomes/oviAri4/trackData/openstack.lastzHg38.2018-04-26/ cd /data/genomes/oviAri4/trackData/openstack.lastzHg38.2018-04-26/ time (/data/scripts/doBlastzChainNet.pl \ /data/genomes/hg38/trackData/openstack.lastzOviAri.2018-04-26/DEF \ -swap -swapDir=`pwd` -verbose=2 -noDbNameCheck -workhorse=localhost \ -bigClusterHub=localhost -skipDownload -dbHost=localhost \ -smallClusterHub=localhost -trackHub - fileServer=localhost \ -syntenicNet) > swap.log 2>&1 &
Track Hub files
This procedure has constructed big* files that can be used to display these tracks in a track hub on the UCSC genome browser
ls axtChain/*.bb bigMaf/*.bb axtChain/chainGCF_000005575.2_AgamP3.bb axtChain/chainGCF_000005575.2_AgamP3Link.bb axtChain/chainRBestGCF_000005575.2_AgamP3.bb axtChain/chainRBestGCF_000005575.2_AgamP3Link.bb axtChain/chainSynGCF_000005575.2_AgamP3.bb axtChain/chainSynGCF_000005575.2_AgamP3Link.bb bigMaf/dm6.GCF_000005575.2_AgamP3.net.bb bigMaf/dm6.GCF_000005575.2_AgamP3.net.summary.bb bigMaf/dm6.GCF_000005575.2_AgamP3.rbestNet.bb bigMaf/dm6.GCF_000005575.2_AgamP3.rbestNet.summary.bb bigMaf/dm6.GCF_000005575.2_AgamP3.synNet.bb bigMaf/dm6.GCF_000005575.2_AgamP3.synNet.summary.bb
(TBD: show structure of trackDb.txt track hub specifications)
How does this process work
The doBlastzChainNet.pl script performs the processing in distinct steps. Each step is almost always performed with a C-shell or bash shell script. Therefore, if there is a problem in any step, the commands performing the step can be dissected from the script in operation, the problem identified and fixed, and the step completed manually by running the rest of the commands in that script. Once a step has been completed, the process can continue with the next step using the argument -continue=nextStepName. Check the usage message from the doBlastzChainNet.pl script to see a listing of the steps and their sequence. Specifically:
partition, blastz, cat, chainRun, chainMerge, net, load, download, cleanup, syntenicNet
In this example, the various scripts are:
-rwxrwxr-x. 1 1914 Apr 6 04:50 run.blastz/doPartition.bash -rw-rw-r--. 1 2713 Apr 6 04:50 run.blastz/xdir.sh -rwxrwxr-x. 1 606 Apr 6 04:50 run.blastz/doClusterRun.csh -rwxrwxr-x. 1 802 Apr 6 05:13 run.cat/doCatRun.csh -rwxrwxr-x. 1 72 Apr 6 05:13 run.cat/cat.csh -rwxrwxr-x. 1 412 Apr 6 05:14 axtChain/run/chain.csh -rwxrwxr-x. 1 700 Apr 6 05:14 axtChain/run/doChainRun.csh -rwxrwxr-x. 1 3112 Apr 6 05:15 axtChain/netChains.csh -rwxrwxr-x. 1 2162 Apr 6 05:17 axtChain/loadUp.csh -rwxrwxr-x. 1 1479 Apr 6 05:17 cleanUp.csh -rwxrwxr-x. 1 4507 Apr 6 05:17 axtChain/netSynteny.csh -rwxrwxr-x. 1 6321 Apr 6 05:24 axtChain/doRecipBest.csh -rwxrwxr-x. 1 1974 Apr 6 05:55 axtChain/loadRBest.csh -rwxrwxr-x. 1 583 Apr 6 05:55 rBestCleanUp.bash
Notes from Hiram's Talks (by Dan) 4/21 and 4/28/21
- LiftOver file creation, LASTZ Run, SameSpeciesLiftOver*
SameSpeciesLiftOver, made for… same species with near identical assemblies, often not great for many assemblies, somewhat antiquated. Hardly used, done with BLAT because expects high similarity, very quick to run. Must be the same biological sample, ie. hg19->hg38, mm10->mm39. It’s one command, fairly simple.
http://genomewiki.ucsc.edu/index.php/DoBlastzChainNet.pl
LastZ Makedoc lives here:
~/kent/src/hg/makeDb/doc/hg38/lastzRuns.txt
Tuning was a thing tried that didn’t work, should be ignored.
First you go to the makeDoc, change the query name, run a “screen” similar to nohup (old school), copy the commands onto the ku, Screen basics (they stay in the background)
$ screen -ls $screen -S demoScreen #starts screen
Ctr A, ctr D to detach screen
$ screen -r -d demoScreen #go back to a detached screen
If you exit with the exit command, it’ll terminate your screen prematurely
Going through hg38 -> seaturtle KU is used by browser engineers, mostly hiram
set -beEu -o pipefail
If anything fails, hard exit First steps are making dir
mkdir /hive/data/genomes/hg38/bed/lastzCheMyd1.2015-02-06
Set version of lastz with dir BLASTZ=(v1.04.03) The score matrix for the alignment is listed in a comment here
# default BLASTZ_Q score matrix
Documented in a printf now, eg human vs X. tropicalis, can’t use single quote within previously done with a “hereIs” doc setting params
- 3 params (Go to one recently done and copy that, or go to shell script)
- Primates to prim
- Prim to mammal
- Any to any, very distant
- These look like BLASTZ_A
- Target init stays the same, query is what changes.
- Chunk is how to split each seq: LASTZ handles 40Mbp target vs 20Mbp query.
- Limit is how many seqs (contigs) you fit into each limit, eg. 3 hg38 and 800 turtle
- This choice is important for how many independent jobs go into the cluster, should be between 10k- 100,000
- Num pieces target x number of query pieces
- Run script first step to calculate how many jobs, see (http://genomewiki.ucsc.edu/index.php/Cluster_Jobs)
- CrisprAll runs 5M jobs and 1 week to run
- seq_LAP
- how much overlap between sequences, eg. 10k for target and 0 for query
- Prevent artificial breaks
- BASE is where work is done
- tempDir is ?? fastest file transfer dir
Run the doBlastChainNet.pl command (14hr)
- Produces regular and syntenic chain/net, only Regular shown
- Syntenic going in order on chromosome, in synteny, filter of those so only those that go in same order remain,
Run doRecipBest.pl (6hr)
- If there’s a best align, this gets made in recipBest
Do the reverse “swap” doBlastChainNet (2hr)
- produces reg and syntenic chain/net
For assembly hubs:
- Different dir initialization, (BASE)
- Different doBlastChainNet command.
- Bunch of exports, setting variables also stored in BASE dir
- Sed line annotates a comment in the makeDoc with bases in intersection
- Mkdir in the assem hub dir, do the swap command, when you’re in swap land you set -target2Bit with query files 2bit and chrom.sizes
This human->sheep took nearly 600 minutes, (100hrs) “You can count on a day to run”
For genome without assembly Hub:
- All it takes is 2bit and chom.sizes
- You can’t view it without an assembly
How to move around assembly hubs
$ goto GCF_002742125, goes to proper dir
Multiple alignments Downstream gets created from pairwise chain/net, though not always from “regular” one, which confuses users.
4/28/21 Demo
Demo rn7 to bosTau9
Smaller contig is target, larger is query. Figure this using wc -l on *.sizes file
Copy similar makeDoc from similar org (See vim tutorial wiki):
- Mark line with (m + a)
- “ a ‘ a ------- yank lines and paste them into “lastzRuns.txt”
- Search and replace all db names, dates, etc
- 10s --- 10 letter substitute
- . ---- repeat last command
After you mod the file, run the command to put things into DEF folder Run some command (stop=partition), then check the number of job. N times M should be <100k, hiram uses “bc” to do the math.
Make a screen, called rn7, run the timed command.
- It logs into KU and runs the partitioned jobs with priority “nice”
- Crt A, crt D to end screen
Troubleshooting
- Parasol batch fails
- Go to dir where it was working, look at it, debug error manually
See Recorded Demo here: https://hgwdev.gi.ucsc.edu/~dschmelt/HiramChainMakingDemo.mp4
The new streamline pairLastz script
The pairLastz script streamlines making the DEF file, running doBlastzChainNet.pl, running doRecipBest.pl, and the swap into one script.
~/kent/src/hg/utils/automation/pairLastz.sh usage: pairLastz.sh <target> <query> <tClade> <qClade>
Before starting, add a file to your hgwdev home directory:
/cluster/home/$user/.forward
which is a single line of an email address where you would like to receive email from processes on hgwdev. You may already be receiving email from hgwdev, but with this file, you can control where it goes exactly. The pairLastz.sh script will send an email to your user name on hgwdev when it completes.
pairLastzWrapper.py (Helpful)
The pairLastzWrapper.py script determines the target, query, clades, and full GC assembly hub name to run the pairLastz script. You can then skip the "Determining target and query" and the "Determining clade" steps.
python3 ~/kent/src/utils/qa/pairLastzWrapper.py -a1 hg38 -a2 GCF_001704415.1 screen -S 20220323
time (~/kent/src/hg/utils/automation/pairLastz.sh hg38 GCF_001704415.1_ARS1 primate mammal) > hg38.GCF_001704415.1_20220323.log 2>&1 &
Determining target and query
Use the following script to determine target and query (which are inputs to the alignment script). Here is an example
python3 /hive/users/gperez2/liftOver/compareAssemblies.py -a1 calJac4 -a2 rheMac10 rheMac10 is target calJac4 is query
<-- Extra Details --> You can use the scoring scheme to determine which should be the target and which would be the query:
- 1 point for largest genome
- 1 point for the lowest number of contigs/scaffolds/chromosomes
- 1 point for lowest N50 number
- 1 point for largest N50 sequence
All of the measurements can be obtained with the n50.pl script on the chrom.sizes for the genomes. The higher scoring genome should be the target. In the case of a tie, then the largest genome is the target.
If one of the assemblies is an assembly hub and available on GenArk, you can go to the directory using the following command:
$ goto GCF_002742125.1
If you don't have the goto command, you can then add the following to your .bashrc file:
function goto() { export asmId=$1 export gcX=`echo "$asmId" | cut -c1-3` export d0=`echo "$asmId" | cut -c5-7` export d1=`echo "$asmId" | cut -c8-10` export d2=`echo "$asmId" | cut -c11-13` export destDir="" if [ "${gcX}" = "GCF" ]; then destDir="/hive/data/genomes/asmHubs/refseqBuild/${gcX}/${d0}/${d1}/${d2}" else destDir="/hive/data/genomes/asmHubs/genbankBuild/${gcX}/${d0}/${d1}/${d2}" fi # printf "$destDir\n" 1>&2 export cdDir=`ls -d ${destDir}/${asmId}*` if [ -d "${cdDir}" ]; then printf "cd $cdDir\n" 1>&2 cd "$cdDir" else printf "# can not find ${cdDir}\n" 1>&2 fi }
Determining clade
You need to specify clade of the target and query organisms that control the alignment scoring scheme (thanks Hiram!). There are 3 categories of clade options. You want to use the most specific.
primate mammal other
Check machines on ku
Check if machines are down or backed up with job queries:
ssh ku parasol status parasol list machines parasol list batches
Exit from ku and move to a screen
Moving to a screen
You can check your screens with the following command:
screen -list
You can make a new screen with the following command:
screen -S screenName
You can exit the screen by doing Ctrl-a then Ctrl-d
You can reattach to the screen by the following command:
screen -r -d screenName
Running pairLastz.sh
The script can be ran from anywhere on dev. Hiram and Gerardo have been running the pairLastz.sh script from the automation directory (~/kent/src/hg/utils/automation/). The following command will write the status of the script to a file and be used to update the makeDoc:
time (~/kent/src/hg/utils/automation/pairLastz.sh \ target query tClade qClade) \ > fileName.log 2>&1 &
You can check the status of the alignment by doing a tail on the fileName.log file.
You can see the following fileName.log as an example of what to expect:
/cluster/home/gperez2/kent/src/hg/utils/automation/sheepLiftOver_20211027.log
If the pairLastz.sh run gets killed before completing the alignment, then contact Hiram to see if some files can be salvaged.
Updating makeDoc
After the script finishes running, you should update the makeDoc.
The makeDoc’s can be found in:
~/kent/src/hg/makeDb/doc/
If the liftOver files are for two GenArk assemblies, you can update the generic makedoc:
~/kent/src/hg/makeDb/doc/asmHubs/lastzRuns.txt
The following is an example on what goes on a makeDoc:
git show 1054d4e3c1d6bdb3d2799860d0eca70dfd35447c
############################################################################## # LASTZ human hg19 (DONE - 2021-09-30 - Hiram) # should be able to run this from anywhere, this time it was run from: cd /hive/data/genomes/mm39/bed time (~/kent/src/hg/utils/automation/pairLastz.sh mm39 hg19 mammal primate) \ > mm39.hg19.log 2>&1 grep -w real mm39.hg19.log | tail -1 | sed -e 's/^/ # /;' # real 1200m57.924s # this command outputs this makeDoc text: cat lastz.hg19/makeDoc.txt ############################################################################## # LASTZ Mouse Mm39 vs. Human Hg19 (DONE - 2021-09-30 - Hiram) mkdir /hive/data/genomes/mm39/bed/lastzHg19.2021-09-30 cd /hive/data/genomes/mm39/bed/lastzHg19.2021-09-30 printf '# Human Hg19 vs. Mouse Mm39 BLASTZ=/cluster/bin/penn/lastz-distrib-1.04.03/bin/lastz # TARGET: Mouse Mm39 SEQ1_DIR=/hive/data/genomes/mm39/mm39.2bit SEQ1_LEN=/hive/data/genomes/mm39/chrom.sizes SEQ1_CHUNK=20000000 SEQ1_LAP=10000 SEQ1_LIMIT=40 # QUERY: Human Hg19 SEQ2_DIR=/hive/data/genomes/hg19/hg19.2bit SEQ2_LEN=/hive/data/genomes/hg19/chrom.sizes SEQ2_CHUNK=20000000 SEQ2_LAP=0 SEQ2_LIMIT=100 BASE=/hive/data/genomes/mm39/bed/lastzHg19.2021-09-30 TMPDIR=/dev/shm ' > DEF time (~/kent/src/hg/utils/automation/doBlastzChainNet.pl -verbose=2 `pwd`/DEF -syntenicNet \ -workhorse=hgwdev -smallClusterHub=hgwdev -bigClusterHub=ku \ -chainMinScore=3000 -chainLinearGap=medium) > do.log 2>&1 grep -w real do.log | sed -e 's/^/ # /;' # real 474m8.118s # # real 474m8.118s # # real 370m29.622s # real 844m37.788s sed -e 's/^/ # /;' fb.mm39.chainHg19Link.txt # 938444606 bases of 2654624157 (35.351%) in intersection ### and for the swap cd /hive/data/genomes/hg19/bed/blastz.mm39.swap time (~/kent/src/hg/utils/automation/doBlastzChainNet.pl -swap -verbose=2 \ /hive/data/genomes/mm39/bed/lastzHg19.2021-09-30/DEF -swapDir=`pwd` \ -syntenicNet -workhorse=hgwdev -smallClusterHub=hgwdev -bigClusterHub=ku \ -chainMinScore=3000 -chainLinearGap=medium) > swap.log 2>&1 grep -w real swap.log | sed -e 's/^/ # /;' # real 57m17.328s sed -e 's/^/ # /;' fb.hg19.chainMm39Link.txt # 969322683 bases of 2991710746 (32.400%) in intersection sed -e 's/^/ # /;' fb.hg19.chainSynMm39Link.txt # 921405754 bases of 2991710746 (30.799%) in intersection time (~/kent/src/hg/utils/automation/doRecipBest.pl -load -workhorse=hgwdev -buildDir=`pwd` \ hg19 mm39) > rbest.log 2>&1 grep -w real rbest.log | sed -e 's/^/ # /;' # real 299m2.449s sed -e 's/^/ # /;' fb.hg19.chainRBest.Mm39.txt # 892863094 bases of 2991710746 (29.845%) in intersection ##############################################################################
Push it to the RR
Native databases to native databases chain files can be pushed to the RR by a push request:
/usr/local/apache/htdocs-hgdownload/goldenPath/db/liftOver/dbToDb.over.chain.gz
Native databases to genArk genomes chain files can be pushed to the RR by a push request:
/usr/local/apache/htdocs-hgdownload/goldenPath/db/liftOver/dbToGC*.over.chain.gz
For chain files from GenArk genomes to both native databases or other GenArk genomes, follow the commands found at Pushing_to_the_RR