Checking RR status through hgTracksRandom: Difference between revisions

(removed the "training" categories from this page, since most people don't need to worry about hgTracksRandom)
(updated to reflect switch of job to qateam's crontab on hgwdev. (http://redmine.soe.ucsc.edu/issues/9708#note-33))
Revision as of 02:16, 11 July 2013

You can check the RR status by looking at the output of hgTracksRandom. Run the following command from hgwdev:

tail -100 /hive/users/qateam/perf/hgTracksRandom.log

(Results prior to July 10, 2013 are in /hive/users/qateam/perf/save.hgTracksRandom.log.)

Every 15 minutes since 2006, trusty hgTracksRandom has been randomly testing output on the RR and appending the results to the hgTracksRandom.log file. Normal output looks like this:

September 11, 2012 16:50
hg19 chr1:26067320-26167320

hgwbeta.cse.ucsc.edu 1257
hgw0.cse.ucsc.edu 1335
hgw1.cse.ucsc.edu 1251
hgw2.cse.ucsc.edu 1220
hgw3.cse.ucsc.edu 1386
hgw4.cse.ucsc.edu 1519
hgw5.cse.ucsc.edu 1679
hgw6.cse.ucsc.edu 1765
hgw7.cse.ucsc.edu 48926 <---
hgw8.cse.ucsc.edu 1650
------------------------------

The numbers indicate how many milliseconds it took to load the specified position in hgTracks on a particular machine. Numbers are sometimes high (indicated by arrows), and that's fine. When one of the hgw1-8 machines is missing, that is a reason to investigate further by going to that machine online and testing functionality. For example, abnormal output would look like this:

September 11, 2012 08:05
hg19 chr1:100683630-100783630

hgwbeta.cse.ucsc.edu 1628
hgw0.cse.ucsc.edu 1276
hgw1.cse.ucsc.edu 1141
hgw2.cse.ucsc.edu 1143
hgw3.cse.ucsc.edu 1337
hgw4.cse.ucsc.edu 1584
hgw5.cse.ucsc.edu 1621
hgw6.cse.ucsc.edu 1747
hgw7.cse.ucsc.edu 3178

Notice that hgw8 is missing from the list, and so is the nice little "-----" divider line at the end. That's because the program didn't get a response from hgw8 and stopped; it then ran again 15 minutes later. If hgw4 were down instead, there wouldn't be any output after the hgw3 line, something like:

September 11, 2012 08:20
hg19 chr1:100683630-100783630

hgwbeta.cse.ucsc.edu 1628
hgw0.cse.ucsc.edu 1276
hgw1.cse.ucsc.edu 1141
hgw2.cse.ucsc.edu 1143
hgw3.cse.ucsc.edu 1337
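
The check described above (every run should end with the divider line, and every expected machine should appear) can be sketched in Python. This is only a minimal sketch and not part of hgTracksRandom: the names `parse_blocks` and `incomplete_blocks` are hypothetical, and the regular expressions assume the log always looks like the samples shown here.

```python
import re

# Assumption: each run starts with a date line such as "September 11, 2012 08:05".
HEADER = re.compile(r"^[A-Z][a-z]+ \d{1,2}, \d{4} \d{2}:\d{2}\s*$")
# Assumption: result lines are "<host> <milliseconds>", optionally followed by an arrow.
RESULT = re.compile(r"^(\S+\.\S+)\s+(\d+)\b")
# Assumption: a run that finished ends with a line made only of dashes.
DIVIDER = re.compile(r"^-{5,}\s*$")


def parse_blocks(text):
    """Split log text into one dict per run: timestamp, {host: ms}, completeness."""
    blocks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if HEADER.match(line):
            if current is not None:
                blocks.append(current)
            current = {"when": line, "times": {}, "complete": False}
        elif current is not None:
            if DIVIDER.match(line):
                current["complete"] = True
            else:
                m = RESULT.match(line)
                if m:
                    current["times"][m.group(1)] = int(m.group(2))
    if current is not None:
        blocks.append(current)
    return blocks


def incomplete_blocks(text, expected_hosts):
    """Return (timestamp, missing_hosts) for every run that never reached the divider."""
    report = []
    for block in parse_blocks(text):
        if not block["complete"]:
            missing = [h for h in expected_hosts if h not in block["times"]]
            report.append((block["when"], missing))
    return report
```

Given the text of the log and the list of host names you expect to see, `incomplete_blocks` returns the timestamps of the runs that stopped early together with the machines that never reported. Keep in mind that the most recent run may simply still be in progress when you read the file.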

This job runs every 15 minutes from the qateam crontab on hgwdev. Here's the relevant snippet (which you can also see in the genecats source tree, if you have it, in genecats/qa/crontabs/hgwdev.crontab):

MAILTO=kuhn,rhead,pauline,katrina,brianlee,braney,luvina,gary,ann,steve,jcasper
# performance log
5,20,35,50 * * * * hgTracksRandom /hive/users/qateam/perf/machines >> /hive/users/qateam/perf/hgTracksRandom.log

The job runs the program hgTracksRandom on a file called /hive/users/qateam/perf/machines. You can try running the program yourself anytime; it's just a C program in the kent source tree. The output goes into a file called /hive/users/qateam/perf/hgTracksRandom.log on hgwdev.
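
The minute field of the crontab entry (5,20,35,50) means the job starts at :05, :20, :35 and :50 of every hour, which matches the timestamps in the sample log entries. As a rough illustration, the Python sketch below works out when the next run is due; the function name `next_run` is hypothetical and the schedule is hard-coded from the snippet above rather than read from the crontab.

```python
from datetime import datetime, timedelta

# Minute field copied from the crontab entry; the other four fields are "*".
RUN_MINUTES = (5, 20, 35, 50)


def next_run(now):
    """Return the first scheduled start time strictly after the minute of `now`."""
    candidate = now.replace(second=0, microsecond=0) + timedelta(minutes=1)
    while candidate.minute not in RUN_MINUTES:
        candidate += timedelta(minutes=1)
    return candidate
```

If the newest entry in the log is much older than the previous scheduled start, the job itself may not be running, which is a different problem from an RR machine failing to respond.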

When an error like the ones above occurs, the message you get from cron will likely be a note that doesn't tell you much. A cron error for hgTracksRandom is your prompt to check the log file and see where the program got stuck. That tells you which RR machines to check to see whether they are loading. If they are not working, alert cluster-admin and browser-qa.

Every once in a while, you will get output from this cron job that doesn't indicate a real problem with the RR. For instance, after a power outage, hgwbeta couldn't find mysqlbeta, and this cron job started complaining even though the RR was fine. Also, an unreachable machine is not necessarily alarming if it is not currently in the RR. You can always tell what is in the RR with the host command, e.g.:

 [rhead@hgwbeta ~]$ host genome.ucsc.edu 

So, this is a very imperfect warning system that there *may* be a problem with the RR. (This program is ostensibly for the purpose of monitoring response times, but it functions as a warning that one or more machines are not responding, too. Btw, if you ever want to look at response time logs in a nice graphical way, the admins have pretty cacti graphs available. Ask around if you don't know the password.)
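
The host check mentioned above can also be scripted. The Python sketch below runs `host` and collects the addresses from its "has address" lines, which is the format `host` prints for A records. The names `parse_host_output` and `rr_addresses` are hypothetical, and mapping the addresses back to individual hgw machines is not covered here.

```python
import re
import subprocess

# `host` prints one "<name> has address <IPv4>" line per A record.
ADDRESS_LINE = re.compile(r"^\S+ has address (\d{1,3}(?:\.\d{1,3}){3})$")


def parse_host_output(text):
    """Extract the IPv4 addresses from the output of the `host` command."""
    addresses = []
    for line in text.splitlines():
        m = ADDRESS_LINE.match(line.strip())
        if m:
            addresses.append(m.group(1))
    return addresses


def rr_addresses(name="genome.ucsc.edu"):
    """Run `host` on `name` and return the addresses it currently resolves to."""
    result = subprocess.run(
        ["host", name], capture_output=True, text=True, check=True
    )
    return parse_host_output(result.stdout)
```

Calling `rr_addresses()` requires the `host` command and working DNS on the machine where you run it; `parse_host_output` can be used on its own with output you have already captured.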