Parasol job control system: Difference between revisions
(No parasol jobs in parasol jobs) |
|||
(10 intermediate revisions by 2 users not shown) | |||
Line 3: | Line 3: | ||
The parasol job control system is used to manage a multiple CPU/core compute cluster. | The parasol job control system is used to manage a multiple CPU/core compute cluster. | ||
It can also be used on a single computer that has multiple CPUs/cores. | It can also be used on a single computer that has multiple CPUs/cores. | ||
SEE ALSO: [https://genecats.gi.ucsc.edu/eng/parasol.html Parasol Parallel Batch System] | |||
It is the cluster control system expected by the | It is the cluster control system expected by the | ||
Line 9: | Line 11: | ||
computing lastz/chain/net pair-wise alignments, multiple alignments of genome assemblies, | computing lastz/chain/net pair-wise alignments, multiple alignments of genome assemblies, | ||
and many other bioinformatics processing toolsets. | and many other bioinformatics processing toolsets. | ||
==Licensing== | |||
For commercial use of these toolsets, please note the license considerations for the | |||
kent source tools at the: [https://genome-store.ucsc.edu/ Genome Store] | |||
==See also== | |||
Please note the [[Open_Stack_Parasol_Installation]] instructions that perform | |||
this installation on the Open Stack cloud platform. | |||
Please note this guideline: [[Cluster_Jobs]] for efficient organization | |||
of your parasol cluster jobs. | |||
==Computer organization== | ==Computer organization== | ||
Line 56: | Line 71: | ||
This example is going to install everything in a constructed directory hierarchy: '''/data/...''' | This example is going to install everything in a constructed directory hierarchy: '''/data/...''' | ||
You can install all of this in a directory of your choice, modify the examples to match | |||
the directory you choose. | |||
The kent command binaries will be in '''/data/bin/''' and the scripts in '''/data/scripts''' | The kent command binaries will be in '''/data/bin/''' and the scripts in '''/data/scripts''' | ||
Line 62: | Line 79: | ||
filesystem for access by the '''node''' machines in this cluster. | filesystem for access by the '''node''' machines in this cluster. | ||
This example expects to operated as the '''root''' user. '''This not necessary,''' you can install | |||
this as your own user. When it is running as your user, it will not operate for other | |||
users. | |||
[[File:ParasolInstall.sh.txt]]: | [[File:ParasolInstall.sh.txt]]: | ||
<pre> | <pre> | ||
#!/bin/bash | #!/bin/bash | ||
export self=`id -n -u` | export self=`id -n -u` | ||
Line 88: | Line 107: | ||
--exclude='kent/src/hg/utils/automation/lastz_D' \ | --exclude='kent/src/hg/utils/automation/lastz_D' \ | ||
--exclude='kent/src/hg/utils/automation/openStack' | --exclude='kent/src/hg/utils/automation/openStack' | ||
wget -O /data/bin/bedSingleCover.pl 'http://genome-source. | wget -O /data/bin/bedSingleCover.pl 'http://genome-source.soe.ucsc.edu/gitlist/kent.git/raw/master/src/utils/bedSingleCover.pl' | ||
else | else | ||
printf "ERROR: need to run this script as sudo ./parasolInstall.sh\n" 1>&2 | printf "ERROR: need to run this script as sudo ./parasolInstall.sh\n" 1>&2 | ||
fi | fi | ||
</pre> | </pre> | ||
Line 130: | Line 148: | ||
one for the operating system and one for the parasol '''hub''' processes. For example: | one for the operating system and one for the parasol '''hub''' processes. For example: | ||
/data/parasol/nodeInfo/nodeReport.sh 2 | /data/parasol/nodeInfo/nodeReport.sh 2 /data/parasol/nodeInfo | ||
On a strictly only '''node''' machine, no CPUs need to be reserved: | On a strictly only '''node''' machine, no CPUs need to be reserved: | ||
/data/parasol/nodeInfo/nodeReport.sh 0 | /data/parasol/nodeInfo/nodeReport.sh 0 /data/parasol/nodeInfo | ||
For cloud setup procedures, this '''"nodeReport.sh 0"''' command can be included in the | For cloud setup procedures, this '''"nodeReport.sh 0"''' command can be included in the | ||
Line 235: | Line 254: | ||
is completely finished, then moving the results to the result file: | is completely finished, then moving the results to the result file: | ||
$ cat runJob | $ cat runJob | ||
#!/bin/bash | #!/bin/bash | ||
Line 243: | Line 262: | ||
mkdir -p `dirname ${resultOut}` | mkdir -p `dirname ${resultOut}` | ||
stat "${fileInput}" > /dev/shm/testJob.$$ | stat "${fileInput}" > /dev/shm/testJob.$$ | ||
sleep $( | sleep $[($RANDOM % 30) + 1]s | ||
mv /dev/shm/testJob.$$ "${resultOut}" | mv /dev/shm/testJob.$$ "${resultOut}" | ||
$ chmod +x runJob | |||
The '''mkdir -p''' will ensure there is a '''result''' directory. | The '''mkdir -p''' will ensure there is a '''result''' directory. | ||
Line 322: | Line 343: | ||
Please note, the parasol job control system is not a recursive process. You can not start new | Please note, the parasol job control system is not a recursive process. You can not start new | ||
parasol batches from within a running parasol job. | parasol batches from within a running parasol job. | ||
==Recovery from errors== | |||
If your parasol batch has a number of errors due to a problem in the scripting commands | |||
you supplied in the jobList, the parasol system will disable that batch until the | |||
problems have been corrected. To recover from this situation, it is usually | |||
best to start over with a clean slate. To completely clean up your batch from | |||
the system, in your batch directory perform the following commands: | |||
$ para stop | |||
$ para flushResults | |||
$ para resetCounts | |||
$ para freeBatch | |||
When you have fixed your processing script so it runs correctly, restart the entire batch with: | |||
$ para create jobList | |||
$ para try | |||
See if those ten jobs started with 'para try' function OK. Use: | |||
$ para check | |||
or | |||
$ para time | |||
to see the progress of the 10 jobs. If they complete successfully, | |||
then continue the entire batch: | |||
$ para push | |||
Follow the progress with: | |||
$ para check | |||
or | |||
$ para time | |||
until it is done. The usual '''batch is finished''' signal we use in our processing | |||
scripts is to leave a '''run.time''' file in the batch directory: | |||
$ para time > run.time | |||
The presence of that '''run.time''' file signals that this batch is completely finished. | |||
==testing UDP network connections== | ==testing UDP network connections== |
Latest revision as of 03:28, 31 January 2023
Introduction
The parasol job control system is used to manage a multiple CPU/core compute cluster. It can also be used on a single computer that has multiple CPUs/cores.
SEE ALSO: Parasol Parallel Batch System
It is the cluster control system expected by the U.C. Santa Cruz Genomics Institute genomics toolsets and processing pipelines such as building new genome browsers, computing lastz/chain/net pair-wise alignments, multiple alignments of genome assemblies, and many other bioinformatics processing toolsets.
Licensing
For commercial use of these toolsets, please note the license considerations for the kent source tools at the: Genome Store
See also
Please note the Open_Stack_Parasol_Installation instructions that perform this installation on the Open Stack cloud platform.
Please note this guideline: Cluster_Jobs for efficient organization of your parasol cluster jobs.
Computer organization
One computer (or one CPU of one computer) is allocated to the task of managing the compute jobs. This is referred to as the parasol hub machine.
The other computers in the cluster are allocated to the task of running the compute jobs. These computers are referred to as the parasol node machines.
Compute jobs to the system are managed with parasol commands on the hub machine. The compute jobs are assigned to the node machines by the parasol processes running on the hub machine.
A single machine can be used as both the hub controller and as a node task machine as long as two CPU cores are reserved, one for the operating system and one for the hub processes, with the extra CPU cores allocated to the node task resource pool.
SSH keys
The hub machine needs to communicate with the node machines via UDP network protocol and via ssh commands for setup tasks.
Assuming this is a new machine (such as a cloud machine instance) with no previous ssh operations, the ssh keys can be generated:
echo | ssh-keygen -N "" -t rsa -q
The echo provides an empty answer to the passphrase question that would normally be asked by the ssh-keygen command. This creates two files in $HOME/.ssh/:
-rw-r--r--. 1 397 Mar 22 17:43 id_rsa.pub -rw-------. 1 1675 Mar 22 17:43 id_rsa
The existing $HOME/.ssh/authorized_keys probably already has keys added from the cloud machine management system to allow login to the machine instance. To add this newly generated key to the authorized_keys file without disturbing existing contents:
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
These id_rsa key files will be copied to other machines in this cluster to allow this hub machine access via ssh. This copy just performed will allow this hub machine to access itself via ssh in the case it is also used as a node machine.
Parasol installation
This example is going to install everything in a constructed directory hierarchy: /data/... You can install all of this in a directory of your choice, modify the examples to match the directory you choose.
The kent command binaries will be in /data/bin/ and the scripts in /data/scripts
This directory will be exported as an NFS filesystem for access by the node machines in this cluster.
This example expects to operated as the root user. This not necessary, you can install this as your own user. When it is running as your user, it will not operate for other users.
#!/bin/bash export self=`id -n -u` if [ "${self}" = "root" ]; then printf "# creating /data/ directory hierarchy and installing binaries\n" 1>&2 mkdir -p /data/bin /data/scripts /data/genomes /data/parasol/nodeInfo chmod 777 /data /data/genomes /data/parasol /data/parasol/nodeInfo chmod 755 /data/bin /data/scripts rsync -a rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ /data/bin/ yum -y install wget wget -qO /data/parasol/nodeInfo/nodeReport.sh 'http://genomewiki.ucsc.edu/images/e/e3/NodeReport.sh.txt' chmod 755 /data/parasol/nodeInfo/nodeReport.sh wget -qO /data/parasol/initParasol 'http://genomewiki.ucsc.edu/images/4/4f/InitParasol.sh.txt' chmod 755 /data/parasol/initParasol git archive --remote=git://genome-source.soe.ucsc.edu/kent.git \ --prefix=kent/ HEAD src/hg/utils/automation \ | tar vxf - -C /data/scripts --strip-components=5 \ --exclude='kent/src/hg/utils/automation/incidentDb' \ --exclude='kent/src/hg/utils/automation/configFiles' \ --exclude='kent/src/hg/utils/automation/ensGene' \ --exclude='kent/src/hg/utils/automation/genbank' \ --exclude='kent/src/hg/utils/automation/lastz_D' \ --exclude='kent/src/hg/utils/automation/openStack' wget -O /data/bin/bedSingleCover.pl 'http://genome-source.soe.ucsc.edu/gitlist/kent.git/raw/master/src/utils/bedSingleCover.pl' else printf "ERROR: need to run this script as sudo ./parasolInstall.sh\n" 1>&2 fi
PATH setup
and The user that will be using this parasol system will need to have the /data/bin and /data/scripts directories in their shell PATH environment. This can be added simply to the .bashrc file in the user's home directory:
echo 'export PATH=/data/bin:/data/scripts:$PATH' >> $HOME/.bashrc
Then, source that file to add that to this current shell:
. $HOME/.bashrc
This entire discussion assumes the bash shell is the user's unix shell.
To root or not to root
If this parasol system is going to be used by the single user that is installing it, the system does not need to be run as the root user. If multiple users are going to use this parasol system, the daemons need to be run as root user. When the system is run as the root user, it will maintain correct ownership of each user's files created during parasol cluster operations. For a single user operation, all files used by the system will be owned by that single user identity. To use the commands as the root user, add the sudo command in front of each of the operations mentioned below.
Initialize node instances
Each node to be used in the system needs to report itself to the parasol hub. This is done by running the /data/parasol/nodeInfo/nodeReport.sh script when logged into each node in the system. The single argument to the script is the number of CPUs on that node to reserve outside the parasol system. For example, when using a single machine as both hub and node you want to reserve two CPUs, one for the operating system and one for the parasol hub processes. For example:
/data/parasol/nodeInfo/nodeReport.sh 2 /data/parasol/nodeInfo
On a strictly only node machine, no CPUs need to be reserved:
/data/parasol/nodeInfo/nodeReport.sh 0 /data/parasol/nodeInfo
For cloud setup procedures, this "nodeReport.sh 0" command can be included in the
cloud startup scripts for each node brought on-line.
This nodeReport.sh script is setting the parameters for this compute node:
#!/bin/bash myIp=`/usr/sbin/ip addr show | egrep "inet.*eth0" | awk '{print $2}' | sed -e 's#/.*##;'` memSizeMb=`grep -w MemTotal /proc/meminfo | awk '{print 1024*(1+int($2/(1024*1024)))}'` cpuCount=`grep processor /proc/cpuinfo | wc -l` memTotal=`echo $memSizeMb $cpuCount | awk '{printf "%d", $1, $2}'` devShm=`df -k /dev/shm | grep dev/shm | awk '{printf "%d\n", 512*(1+int($2/(1024*1024)))}'` printf "%s\t%d\t%d\t/dev/shm\t/dev/shm\t%s\tbsw\n" "${myIp}" "${cpuCount}" "${memTotal}" "${devShm}" > /data/parasol/nodeInfo/${myIp}.ms
The columns in the .ms file specify the resources available on this machine node. The column definitions are:
name - Network name/IP address cpus - Number of CPUs to use ramSize - Megabytes of memory total for all CPUs tempDir - Location of local temp dir localDir - Location of local data dir localSize - Megabytes of local disk switchName - Name of switch this is on (not used)
Initialize and verify ssh keys
On the parasol hub machine, use the initialize function once to prepare and verify the ssh keys will function properly:
/data/parasol/initParasol initialize
Starting/Stopping the parasol system
After all the node machines have reported in via the nodeReport.sh script, the parasol system can be started:
cd /data/parasol ./initParsol start
That script will echo the command it is using to start the system:
Starting parasol:/usr/bin/ssh 10.109.0.54 /data/bin/paraNode start -cpu=6 log=/data/parasol/10.109.0.54.2018-03-28T17:33.log hub=10.109.0.54 umask=002 sysPath=. userPath=bin Done.
And it will start a log file:
-rw-rw-r--. 1 163 Mar 28 17:48 10.109.0.54.2018-03-28T17:48.log
And the status of the parasol system can be checked:
$ parasol status CPUs total: 6 CPUs free: 6 CPUs busy: 0 Nodes total: 1 Nodes dead: 0 Nodes sick?: 0 Jobs running: 0 Jobs waiting: 0 Jobs finished: 0 Jobs crashed: 0 Spokes free: 30 Spokes busy: 0 Spokes dead: 0 Active batches: 0 Total batches: 0 Active users: 0 Total users: 0 Days up: 0.000046 Version: 12.18
To stop the system:
cd /data/parasol ./initParasol stop
That command responds:
Stopping parasol:Telling 10.109.0.54 to quit rudpSend timed out
Testing parasol jobs
To demonstrate a simple test of running jobs on the system and to prove it is working, this example procedure is carried out in /data/parasol/testJobs
$ mkdir /data/parasol/testJobs $ cd /data/parasol/testJobs
A typical scenario works on a list of files, or a list of a file with different arguments for the job to be carried out. Make up a small file list to work with in this example:
$ find /etc -type f | head > file.list
Using a short script is a good way to execute the required commands on the given input, working with temporary files until the result is completely finished, then moving the results to the result file:
$ cat runJob #!/bin/bash set -beEu -o pipefail export fileInput=$1 export resultOut=$2 mkdir -p `dirname ${resultOut}` stat "${fileInput}" > /dev/shm/testJob.$$ sleep $[($RANDOM % 30) + 1]s mv /dev/shm/testJob.$$ "${resultOut}"
$ chmod +x runJob
The mkdir -p will ensure there is a result directory. The sleep on the RANDOM number will provide a simulated test delay of up to 30 seconds. Results are calculated into the /dev/shm file, and then moved into the resultOut file upon completion. This provides an atomic operation that has few failure modes. Any failure of the script will be noticed by the parasol job management system.
The template file is used with the gensub2 command to manage the given file.list into single line commands:
$ cat template #LOOP ./runJob $(path1) {check out exists+ result/$(num1).txt} #ENDLOOP
Note the usage message from the gensub2 command to see what it can do with the list (or two listings) given to it. In this case, using one list:
gensub2 file.list single template jobList
This creates the jobList:
head -1 jobList ./runJob /etc/fstab {check out exists+ result/0.txt}
Which is used with the para create command to add this batch to the system:
para create jobList 10 jobs written to /data/parasol/testJobs/batch
This batch file is used by the parasol system to record everything related to how the batch proceeds. To start this batch:
$ para push 10 jobs in batch 0 jobs (including everybody's) in Parasol queue or running. Checking finished jobs updated job database on disk Pushed Jobs: 10
To check on the progress of the jobs:
$ parasol list batches #user run wait done crash pri max cpu ram plan min batch centos 6 2 2 0 10 -1 1 2.7g 6 0 /data/parasol/testJobs/ $ para check 10 jobs in batch 7 jobs (including everybody's) in Parasol queue or running. Checking finished jobs queued and waiting: 1 running: 6 ranOk: 3 total jobs in batch: 10 $ para time 10 jobs in batch 5 jobs (including everybody's) in Parasol queue or running. Checking finished jobs Completed: 5 of 10 jobs Jobs currently running: 5 CPU time in finished jobs: 0s 0.00m 0.00h 0.00d 0.000 y IO & Wait Time: 24s 0.40m 0.01h 0.00d 0.000 y Time in running jobs: 41s 0.68m 0.01h 0.00d 0.000 y Average job time: 5s 0.08m 0.00h 0.00d Longest running job: 11s 0.18m 0.00h 0.00d Longest finished job: 10s 0.17m 0.00h 0.00d Submission to last job: 20s 0.33m 0.01h 0.00d Estimated complete: 5s 0.08m 0.00h 0.00d
See also, notes in Cluster_Jobs for additional notes on running parasol jobs successfully and how to determine and fix common error situations.
NO parasol jobs within parasol jobs
Please note, the parasol job control system is not a recursive process. You can not start new parasol batches from within a running parasol job.
Recovery from errors
If your parasol batch has a number of errors due to a problem in the scripting commands you supplied in the jobList, the parasol system will disable that batch until the problems have been corrected. To recover from this situation, it is usually best to start over with a clean slate. To completely clean up your batch from the system, in your batch directory perform the following commands:
$ para stop $ para flushResults $ para resetCounts $ para freeBatch
When you have fixed your processing script so it runs correctly, restart the entire batch with:
$ para create jobList $ para try
See if those ten jobs started with 'para try' function OK. Use:
$ para check
or
$ para time
to see the progress of the 10 jobs. If they complete successfully, then continue the entire batch:
$ para push
Follow the progress with:
$ para check
or
$ para time
until it is done. The usual batch is finished signal we use in our processing scripts is to leave a run.time file in the batch directory:
$ para time > run.time
The presence of that run.time file signals that this batch is completely finished.
testing UDP network connections
This is almost never a problem, but this is a fun test. Using the command nc from the nmap-ncat package:
sudo yum install nmap-ncat
Open a shell terminal to each machine to be tested. Find out what the IP address is for this machine:
$ ip addr show | egrep "inet.*eth0" | awk '{print $2}' | sed -e 's#/.*##;' 10.109.0.40
On this machine, start an nc listener process on some port number, this example port 6111
nc -ul 6111
On the second shell terminal, start a corresponding connection to the first machine's address and port:
nc -u 10.109.0.40 6111
With those two processes connected, typing anything on one terminal will show up at the other terminal, after return is pressed at the end of a line.
To exit each listener process, type Ctrl-D