Checking RR status through hgTracksRandom

To check the status of the RR, look at the most recent hgTracksRandom output by running the following command on hgwdev:

tail -100 /hive/users/qateam/perf/hgTracksRandom.log

(Results prior to July 10, 2013 are in /hive/users/qateam/perf/save.hgTracksRandom.log.)

Every 15 minutes since 2006, trusty hgTracksRandom has been testing a random position on the RR machines and appending the results to the hgTracksRandom.log file, where normal output looks like this:

September 11, 2012 16:50
hg19 chr1:26067320-26167320

hgwbeta.cse.ucsc.edu 1257
hgw0.cse.ucsc.edu 1335
hgw1.cse.ucsc.edu 1251
hgw2.cse.ucsc.edu 1220
hgw3.cse.ucsc.edu 1386
hgw4.cse.ucsc.edu 1519
hgw5.cse.ucsc.edu 1679
hgw6.cse.ucsc.edu 1765
hgw7.cse.ucsc.edu 48926 <---
hgw8.cse.ucsc.edu 1650
------------------------------

The numbers indicate how many milliseconds it took to load the specified position in hgTracks on each machine. Numbers are sometimes high (flagged with an arrow, as for hgw7 above), and that's fine. When one of hgw1-8 is missing, that is reason to investigate further by going to that machine online and testing its functionality. For example, abnormal output would look like this:

September 11, 2012 08:05
hg19 chr1:100683630-100783630

hgwbeta.cse.ucsc.edu 1628
hgw0.cse.ucsc.edu 1276
hgw1.cse.ucsc.edu 1141
hgw2.cse.ucsc.edu 1143
hgw3.cse.ucsc.edu 1337
hgw4.cse.ucsc.edu 1584
hgw5.cse.ucsc.edu 1621
hgw6.cse.ucsc.edu 1747
hgw7.cse.ucsc.edu 3178

Notice that hgw8 is missing from the list, and so is the nice little "-----" divider line at the end. That's because the program didn't get a response from hgw8 and stopped; it then ran again 15 minutes later. If hgw4 were down instead, there wouldn't be any output after the hgw3 line, something like:

September 11, 2012 08:20
hg19 chr1:100683630-100783630

hgwbeta.cse.ucsc.edu 1628
hgw0.cse.ucsc.edu 1276
hgw1.cse.ucsc.edu 1141
hgw2.cse.ucsc.edu 1143
hgw3.cse.ucsc.edu 1337
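
Scanning the log by eye for a missing divider gets tedious, so the check described above can be scripted. The following is a minimal sketch, not part of hgTracksRandom: the function name incomplete_runs is made up for illustration, and it assumes that every run starts with a timestamp line like the samples above and that the dashed divider is only written when all machines answered. A run that is still in progress when you look will also show up as incomplete.

```shell
# Hypothetical helper: list runs in an hgTracksRandom-style log that never
# reached the dashed divider, with the last machine that did respond.
incomplete_runs() {
  awk '
    function report() {
      if (running)
        printf "%s\tlast response: %s\n", stamp, (last == "" ? "none" : last)
    }
    # A timestamp line such as "September 11, 2012 08:05" starts a new run.
    # If the previous run is still marked as running, it never finished.
    /^[A-Z][a-z]+ [0-9]+, [0-9]+ [0-9]+:[0-9]+/ {
      report()
      stamp = $0; last = ""; running = 1
      next
    }
    # Assumption: the divider appears only after every machine answered.
    /^-+$/ { running = 0; next }
    # Host lines look like "hgw3.cse.ucsc.edu 1337"; remember the latest one.
    running && NF >= 2 && $2 ~ /^[0-9]+$/ { last = $1 }
    END { report() }
  ' "$@"
}
```

For example, `incomplete_runs /hive/users/qateam/perf/hgTracksRandom.log | tail -5` would list the five most recent runs that stopped early. The machine to look at is the one that should have come right after the last response shown.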

This job runs every 15 minutes on the qateam crontab on hgwdev. Here's the relevant snippet (which you can also see in the genecats source tree, if you have it, in genecats/qa/crontabs/hgwdev.crontab):

MAILTO=kuhn,rhead,pauline,katrina,brianlee,braney,luvina,gary,ann,steve,jcasper
# performance log
5,20,35,50 * * * * hgTracksRandom /hive/users/qateam/perf/machines >> /hive/users/qateam/perf/hgTracksRandom.log

The job runs the program hgTracksRandom on a file called /hive/users/qateam/perf/machines. You can try running the program yourself anytime; it's just a C program in the kent source tree. The output goes into a file called /hive/users/qateam/perf/hgTracksRandom.log on hgwdev.

The message you get from cron when an error like the ones above occurs will likely be a note that doesn't tell you much. A cron error for hgTracksRandom is your prompt to check the log file and see where the program got stuck; that tells you which machines on the RR to check to see if they are loading. If they are not working, alert cluster-admin, browser-qa, and browser-dev. See more at RR_Down:_Sending_Alert_Messages_about_Genome_Browser_Being_Offline.

Every once in a while, you will get output from this cron job that doesn't indicate a real problem with the RR. For instance, after the power was out, hgwbeta couldn't find the SQL host hgwbeta, and this cron job started complaining, even though the RR was fine. Also, an unreachable machine may not be cause for alarm if it is not currently in the RR. You can always tell what is in the RR with the host command, e.g.:

 [rhead@hgwbeta ~]$ host genome.ucsc.edu 
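
The host command reports addresses rather than machine names, so deciding whether a particular machine is in the RR means comparing address lists. Here is a rough sketch of that comparison. It assumes that genome.ucsc.edu resolves directly to the addresses of the member machines; if a load balancer or proxy sits in front of them, this comparison will not tell you anything. The names host_addresses and in_rr are hypothetical helpers, not existing tools.

```shell
# Pull IPv4 addresses out of `host` output lines of the form
# "genome.ucsc.edu has address 192.0.2.10"; other lines are ignored.
host_addresses() {
  awk '$2 == "has" && $3 == "address" { print $4 }'
}

# Succeed if any address of machine $1 is also returned for the RR name $2.
# Assumption: the RR name resolves straight to the member machines.
in_rr() {
  machine=$1
  rr_name=${2:-genome.ucsc.edu}
  rr_addrs=$(host "$rr_name" | host_addresses)
  for addr in $(host "$machine" | host_addresses); do
    printf '%s\n' "$rr_addrs" | grep -qxF -- "$addr" && return 0
  done
  return 1
}
```

With those definitions, something like `in_rr hgw4.cse.ucsc.edu && echo "in the RR"` would tell you whether an unreachable machine is one that users can actually land on.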

So, this is a very imperfect warning system that there *may* be a problem with the RR. (This program is ostensibly for the purpose of monitoring response times, but it functions as a warning that one or more machines are not responding, too. Btw, if you ever want to look at response time logs in a nice graphical way, the admins have pretty cacti graphs available. Ask around if you don't know the password.)

The machines being tested

The file /hive/users/qateam/perf/machines defines which machines are tested. We rotate the machines in the RR very rarely, and only then does this file need to change. The machines are listed in the reverse of the order in which hgTracksRandom checks them:

$ cat /hive/users/qateam/perf/machines
genome-euro.ucsc.edu
genome-asia.ucsc.edu
hgw6.cse.ucsc.edu
hgw5.cse.ucsc.edu
hgw4.cse.ucsc.edu
hgw3.cse.ucsc.edu
hgw2.cse.ucsc.edu
hgw1.cse.ucsc.edu
hgw0.cse.ucsc.edu
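
Because the file is in reverse check order, it helps to flip it when working out which machine a stalled run was waiting on. The sketch below is only an illustration under that stated ordering: check_order and suspect_after are made-up names, and the second function simply prints whichever machine comes right after the last one that appears in the log.

```shell
# Print the machines file in the order hgTracksRandom is described as
# checking it, i.e. last line first (GNU tac would do the same job).
check_order() {
  awk '{ line[NR] = $0 } END { for (i = NR; i >= 1; i--) print line[i] }' "$1"
}

# Given the machines file ($1) and the last host that responded ($2),
# print the next host in check order: the likely culprit for a stalled run.
suspect_after() {
  check_order "$1" | awk -v last="$2" '
    found { print; exit }       # first line after the match
    $1 == last { found = 1 }    # remember that the last responder was seen
  '
}
```

For instance, `suspect_after /hive/users/qateam/perf/machines hgw3.cse.ucsc.edu` would print hgw4.cse.ucsc.edu for the file shown above, which is the machine to go test by hand.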

Checking the error logs

Check out the Apache error log page to learn more about looking through the error logs to investigate what a user might be doing.