Cluster Jobs

Cluster Job Organization

Batch Location

Don't run your batches from your home directory. A runaway kluster job can quickly swamp the NFS server for the home directories and thereby lock out all users. Your batch is typically run from some /hive directory. Also, please make sure your umask is set to 002 rather than the more restrictive 022. We need to have group write permission to everyone's directory so we can fix stuff when you are not available.
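A minimal way to set this, assuming a csh/tcsh login shell that reads ~/.cshrc, is to add one line to that startup file:

# allow group write permission on files and directories you create
umask 002

Running umask with no arguments shows the current setting.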

There is an older document describing the pre-hive filesystems that contains some helpful information: file system locations (http://genome-test.cse.ucsc.edu/eng/KiloKluster.html). Note that it may still be helpful to use local disk to reduce I/O congestion.

Input/Output

The most critical factor in designing your cluster jobs is to completely understand where your input data comes from, where temporary files will be made during processing, and where your output results are going. With several hundred CPUs reading and writing data, it is trivially easy to make life very difficult for the underlying NFS fileservers. In the ideal case, your input data comes from one file server, your temporary files are written to local /scratch/tmp/ disk space, and your output data goes back to a different NFS server than the one your input data came from. For input data that will be used in a variety of cluster jobs over an extended period of time, it can be arranged to copy that data to local /scratch/ disk space on each cluster node.

Important note: Remember to clean up any temporary files you create on /scratch/tmp.

Job Script

A properly constructed job is typically a small .csh shell script that begins:

#!/bin/csh -fe

The -fe ensures the script either runs to completion successfully or exits with an error as soon as any command fails. Parasol sees the non-zero exit status, so it knows the job has failed. You can find many script examples in the kent source tree src/hg/makeDb/doc/*.txt files, where we document all of our browser construction work.

If a line in your job file is too long, it will cause the hub to crash. Each command, along with its header information, needs to fit in 1444 bytes.
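A sketch of a complete job script following this pattern (the program name and the /hive paths are hypothetical placeholders; temporary work happens on local /scratch/tmp disk and only the finished result is copied back to /hive, as discussed under Job Recovery below):

#!/bin/csh -fe
# stage all work in a private temporary directory on local disk
set tmpDir = /scratch/tmp/$USER/myJob.$$
mkdir -p $tmpDir
cd $tmpDir
# someProgram and the /hive paths are hypothetical examples;
# any command failure aborts the script with a non-zero exit that parasol sees
someProgram /hive/data/example/input.fa result.out
# single copy of the finished result back to more permanent storage
cp -p result.out /hive/data/example/results/
# clean up the temporary files on the node
cd /
rm -rf $tmpDir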

Long-Running Jobs and Large Batches

If you really must run jobs that will occupy a lot of CPU time, it is highly recommended that you instead redesign your processing to avoid that. If you insist there is no other way, then you must use the cluster politely. You have to leave the cluster in a state where it can do work for other users. Genome browser work takes priority over other research on the klusters.

Use 'para try' and 'para time' to estimate your average job length and total cluster usage for your batch. Typical job times should be on the order of minutes or less, at the outside, tens of minutes. Try to design your processing to stay within this guideline. If you are unable to do this, use the para option -maxNode=N to limit the number of nodes your long-running jobs will occupy. For example, hour-long jobs should be limited to 50 nodes. Batches of long-running jobs can easily monopolize the cluster!
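A typical sequence might look like the sketch below (jobList is whatever file you handed to 'para create'; showing -maxNode on 'para push' is an assumption, and the option placement may differ in your para version):

para create jobList
para try                 # run a small sample of jobs first
para time                # average job time and estimated total cluster usage
para push -maxNode=50    # retry/continue the batch, limited to 50 nodes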

Check with the group before running a batch that will take longer than two cluster-days, or if your average job time is more than 15 minutes. Also please check with the group before assigning more than 50 nodes to a batch containing long-running jobs.

Job Recovery

There will almost always be failed jobs, for a variety of reasons. The most important thing to do is design your jobs so that they have an atomic file-presence indicator of successful completion. Typically a job does all of its work on the /scratch/tmp/ filesystem, creating its result file there. When it has successfully completed its work, it does a single copy of the result file back to a /hive/ filesystem, which is outside of the cluster and thus more permanent. Parasol commands can then check for the existence of that result file to determine whether the job completed successfully, and parasol keeps track of which jobs succeeded and which failed. To re-run the failed jobs, you merely do a 'para push' of the batch again, and the failed jobs will be retried. A job can be retried like this until it fails four times. A gensub2 template example to check a result file:

{check out line+ <result.file>}

is used to tell parasol to check that file to verify job completion.

gensub2 template syntax:

{check 'when' 'what' <file>}

where 'when' is either "in" or "out"
and 'what' is one of: "exists" "exists+" "line" "line+"
"exists" means file exists, may be zero size
"exists+" means file exists and is non-zero size
"line" means file may have 0 or more lines of ascii data and is properly line-feed terminated
"line+" means file is 1 or more lines of data and is properly line-feed terminated

Sick nodes

Sometimes a kluster node will become defective during the running of your batch. Parasol will stop assigning jobs to that defective node. You can see this with 'para showSickNodes'. To reset the sick status for your batch, run 'para clearSickNodes'.

Sick batch

If there are too many failures in a row, the system will decide that your batch itself must be sick, rather than the nodes it is failing on, and it will stop the batch. If you have encountered this problem and were able to fix the issue, you can reset the sick status with 'para clearSickNodes'.
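
A typical recovery sequence, run in the batch directory once the underlying problem is fixed (a sketch):

para showSickNodes    # list nodes the batch has marked as sick
para clearSickNodes   # clear the sick status for this batch
para push             # retry the failed jobs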

See also: