Cluster Jobs
Cloud instance installation
Parasol_job_control_system - setting up parasol on cloud machines
Batch Location
Don't run your batches from your home directory. A runaway kluster job can quickly swamp the NFS server for the home directories and thereby lock out all users. Your batch is typically run from some /hive directory. Also, please make sure your umask is set to 002 rather than the more restrictive 022. We need to have group write permission to everyone's directory so we can fix stuff when you are not available.
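To confirm this, you can check the umask at the shell prompt and set it in your shell startup file (a minimal sketch; umask is a builtin in both csh and bash):

<pre>
$ umask        # show the current mask; it should be 002 (display format varies by shell)
$ umask 002    # set it for the current session; add this line to ~/.cshrc or ~/.bashrc to make it permanent
</pre>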
There is an older document describing the pre-hive filesystems that contains some helpful information: file system locations (http://genecats.soe.ucsc.edu/eng/KiloKluster.html). Note that it may still be helpful to use local disk to reduce I/O congestion.
Input/Output
The most critical factor in designing your cluster jobs is to completely understand where your input data is coming from, where temporary files will be made during processing, and where your output data results are going. With several hundred CPUs reading and writing data, it is trivially easy to make life very difficult for the underlying NFS fileservers. In the ideal case, your input data comes from one file server, your temporary files are written to /scratch/tmp/ local disk space, and your output data goes back to a different NFS server than the one your input data came from. For input data that will be used in a variety of cluster jobs over an extended period of time, it can be arranged to copy that data to local /scratch/ disk space on each cluster node.
Important note: Remember to clean up any temporary files you create on /scratch/tmp
Job Script
A properly constructed job is typically a small .csh shell script that begins:
<pre>
#!/bin/csh -fe
</pre>
The -fe ensures the script either runs to completion successfully or exits with an error as soon as any command fails. Parasol watches the exit status, so it knows a job has failed when the script exits with an error. You can see many script examples in the kent source tree src/hg/makeDb/doc/*.txt files, where we document all of our browser construction work.
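As a rough sketch of a complete job script that follows the Input/Output advice above (the script name, program, and paths here are hypothetical placeholders, not part of any real batch):

<pre>
#!/bin/csh -fe
# runOne.csh <inputFile> <outputFile> -- hypothetical wrapper run as one parasol job
set input = $1
set output = $2
set tmp = /scratch/tmp/$user/runOne.$$
mkdir -p $tmp
# compute on local disk; if this command fails, -fe makes the whole job fail
myProgram $input > $tmp/result.out
# a single mv of the finished result back to permanent storage, then clean up local disk
mv $tmp/result.out $output
rm -rf $tmp
</pre>

Each line of the jobList then invokes this script with a different input/output pair; see Job Recovery below for how the output file is checked.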
If a line in your job file is too long it will cause the hub to crash. Each command, along with the header information, needs to fit in 1444 bytes.
If you do want to run bash scripts instead, include this setting at the top:
<pre>
set -beEu -o pipefail
</pre>
so that the script fails as soon as any command anywhere in it, including within a pipeline, returns an error.
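A minimal bash skeleton along those lines (the comments are only illustrative):

<pre>
#!/bin/bash
# exit on any failed command, any unset variable, or a failure inside a pipeline
set -beEu -o pipefail
# ... job commands go here, staged through /scratch/tmp as described above ...
</pre>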
Long-Running Jobs and Large Batches
If you really must run jobs that will occupy a lot of CPU time, it is highly recommended that you instead redesign your processing to avoid that. If there truly is no other way, then you must use the cluster politely and leave it in a state where it can do work for other users. Genome browser work takes priority over other research on the klusters.
Use 'para try' and 'para time' to estimate your average job length and total cluster usage for your batch. Typical job times should be on the order of minutes or less, at the outside tens of minutes. Try to design your processing to stay within this guideline. If you are unable to do this, use the para option -maxJob=N to limit the number of nodes your long-running jobs are going to occupy. For example, hour-long jobs should be limited to 100 nodes. Batches of long-running jobs can easily monopolize the cluster! If other people are also running long-running batches (which you can see with the command "parasol list batches"), please restrict your batch to 50 nodes.
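As a sketch of how a batch is typically sized before the full push (the jobList name and the node limit are placeholders):

<pre>
$ para create jobList      # register the batch with parasol
$ para try                 # run a small sample of the jobs first
$ para time                # look at average and longest job times
$ parasol list batches     # see whether others are running long batches
$ para -maxJob=100 push    # push the rest, limited to 100 nodes
</pre>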
An exception to this rule is the 'edw' user. This is the ENCODE data warehouse daemon. ENCODE is paying for half of the cluster. Because we have little control over the software the ENCODE consortium chooses to run on the cluster, we can't always break the jobs into smaller pieces. Instead, the daemon is set up so that it never even submits jobs that fill up more than half the cluster. Occasionally, while debugging, other programmers, primarily 'kent', will be running the ENCODE jobs instead of 'edw'. Do not take this as license to monopolize half the cluster yourself, at least not unless you are willing to buy another half cluster for the rest of us.
Check with the group before running a batch that will take longer than two cluster-days, or if your average job time is more than 15 minutes. Also please check with the group before assigning more than 100 CPUs to a batch containing long-running jobs. If the cluster is busy, please be receptive to requests to reduce the number of CPUs your long-running jobs take, even if you've already limited them to 100.
Job Recovery
There will almost always be failed jobs, for a variety of reasons. The most important thing to do is design your jobs such that they have an atomic file-presence indicator of successful completion. The typical approach is to have a job do all of its work on the /scratch/tmp/ filesystem, creating its result file there. When it has successfully completed its work, it does a single mv of the result file back to a /hive/ filesystem, which is outside of the cluster and thus more permanent. The existence of that result file can be verified by parasol commands to determine whether the job completed successfully. Parasol keeps track of which jobs were successful and which were not. To re-run the failed jobs, you merely do a 'para push' of the batch again, and the failed jobs will be retried. A job can be retried like this until it has failed four times. A gensub2 template example to check a result file:
<pre>
{check out line+ <result.file>}
</pre>
is used to tell parasol to check that file to verify job completion, to make sure the file has at least one line, and to make sure it ends with a '\n' character.
gensub2 template syntax:
<pre>
{check 'when' 'what' <file>}
where 'when' is either "in" or "out"
and 'what' is one of: "exists" "exists+" "line" "line+"
"exists" means file exists, may be zero size
"exists+" means file exists and is non-zero size
"line" means file may have 0 or more lines of ascii data and is properly line-feed terminated
"line+" means file is 1 or more lines of data and is properly line-feed terminated
</pre>
Sick nodes
Sometimes a kluster node will become defective while your batch is running. Parasol will stop assigning jobs to that defective node. You can see this with 'para showSickNodes'. To reset the sick status on your batch, run 'para clearSickNodes'.
Sick batch
If there are too many failures in a row, the system will conclude that your batch is sick, rather than the nodes, and it will stop the batch. If you have encountered this problem and were able to fix the underlying issue, you can reset the sick status with 'para clearSickNodes'.
Tracking errors
When you see 'tracking errors' in the 'para check' output:
<pre>
$ para check
2762 jobs in batch
3546 jobs (including everybody's) in Parasol queue or running.
Checking finished jobs
tracking errors: 2279
running: 22
ranOk: 461
failed 4 times: 1319
total jobs in batch: 2762
</pre>
That means the communication and coordination between the ku hub and the node spokes was interrupted and not everything got recorded correctly in the para.results file. It doesn't mean that all those results have been lost. If the ku hub goes down for some period of time, the spokes continue with their running jobs and may complete them successfully, but that doesn't get recorded while the hub is down. After such down time on the ku hub, parasol can be confused about the correct status of the batch.
Another way to see the state of the cluster is with:
<pre>
$ parasol list machines | grep running | less
</pre>
which indicates what the spokes have running right now.
After your batch is finished, you can verify what has really been completed with 'para recover', which scans your original jobList and applies the {check out line ...} rule for each job to see if the final result file is actually present. If it is present, parasol considers that job done, despite what might have been recorded in para.results or other indications. 'para recover' is independent of parasol status; it verifies your result file. This is the reason why {check out ...} should really be checking a file that indicates the job really is done.
'para recover' can write a new jobList file that can be used to start a new batch with just the jobs that still need to be completed. A typical recover sequence is:
<pre>
$ para clearSickNodes
$ para flushResults
$ para time > run.time.first.batch 2>&1
$ para recover jobList recoverJobList
$ para freeBatch
$ para create recoverJobList
$ para -maxJob=100 push
</pre>
You are now running the recovered job list with a clean set of counters.
Another reason you may see odd numbers in the parasol status is the 'jobTree' system run by other groups here. It uses parasol in a modular way whereby a single running job can spawn other parasol jobs, and those jobs don't really have any relationship to a specific batch directory. jobTree runs without the 'batch' mechanism.
See also:
- Parasol_job_control_system - setting up parasol on cloud machines
- Where Is Everything (http://genecats.soe.ucsc.edu/eng/KiloKluster.html)
- UCSC Genome Browser Engineering documents (http://genecats.soe.ucsc.edu/eng/)
- Parasol batch system (http://genecats.soe.ucsc.edu/eng/parasol.html)
- Parasol HOWTO (http://genomewiki.ucsc.edu/genecats/index.php/Parasol_how_to)