RR Down: Sending Alert Messages about Genome Browser Being Offline

From Genecats

Overview

This page has reminders of what to do if the RR is down for a long period. You want to verify the problem, contact cluster-admin while cc'ing the team, and then, if it isn't fixed in a reasonable amount of time, consider additional messages.

Contact cluster-admin/cc qateam

Check logs

See Checking_RR_status_through_hgTracksRandom, where you can run tail -100 /hive/users/qateam/perf/hgTracksRandom.log to see the history of the RR over 15-minute intervals. See the Apache error log output page to learn more about figuring out what a user might be doing.

Confirm issue

Navigate to the machines to confirm there is a problem.

One approach is to have a secondary browser open all of the machines as tabs when it starts, by setting them as its home page.
For example, if Chrome is your main browser and Firefox is your secondary, go to Preferences/General in Firefox, set "When Firefox starts:" to "Show my homepage", and paste the following into "Home page:":
hgw0.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38&measureTiming=1|hgw1.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38&measureTiming=1|hgw2.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38&measureTiming=1|genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&measureTiming=1|genome-asia.ucsc.edu/cgi-bin/hgTracks?db=hg38&measureTiming=1|genome-preview.ucsc.edu/cgi-bin/hgTracks?db=hg38&measureTiming=1
These open hgw0-hgw2, genome-euro, genome-asia, and genome-preview.
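
As an alternative to browser tabs, the same check can be sketched in a few lines of Python. This is only a rough sketch, not an established team tool: the host list and URL path come from the homepage string above, while the https scheme, the 30-second timeout, and the names homepage_string and check_host are assumptions made for illustration.

```python
from urllib.error import URLError
from urllib.request import urlopen

# Hosts and path copied from the Firefox homepage string above.
HOSTS = [
    "hgw0.soe.ucsc.edu",
    "hgw1.soe.ucsc.edu",
    "hgw2.soe.ucsc.edu",
    "genome-euro.ucsc.edu",
    "genome-asia.ucsc.edu",
    "genome-preview.ucsc.edu",
]
PATH = "/cgi-bin/hgTracks?db=hg38&measureTiming=1"


def homepage_string(hosts=HOSTS):
    """Join the per-host URLs with '|', the separator Firefox uses for multiple home pages."""
    return "|".join(host + PATH for host in hosts)


def check_host(host, timeout=30, opener=urlopen):
    """Return (host, status): an HTTP status code, or an 'error: ...' string on failure.

    The https scheme and the default timeout are assumptions, not documented settings.
    """
    url = "https://" + host + PATH
    try:
        with opener(url, timeout=timeout) as resp:
            return host, resp.status
    except (URLError, OSError) as err:
        # Covers HTTP errors, DNS failures, refused connections, and timeouts.
        return host, f"error: {err}"


# Example use (makes real network requests):
#     for host in HOSTS:
#         print(*check_host(host))
```

A 200 status only shows that the web server answered; it is still worth looking at a page in a browser before deciding whether a machine is healthy.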

Send email

If things look serious, send an email to cluster-admin, qateam, and browser-dev saying that the RR (or a specific machine, say hgw5, if that is what your checking shows) is down.

Things are bad: update twitter

If cluster-admin does not come back with a fix within half an hour, it is probably a good idea to start thinking about notifying the greater community. If the error is minor, for example only one machine is out (say hgw5), then it may not be as important to notify the community. But if it is bad, for example mailing list questions start coming in, it might be time to update twitter.

Be sure to say genome-asia and genome-euro are available (if they are).

See this note about our twitter account. Here are some example twitter updates:

  • We have now resolved the problem on our main site. We apologize for any inconvenience and thank you for your understanding.

Things are really bad (over an hour offline): Ask cluster-admin to display the maintenance page

This RM has some history about this page. There is a file maintenance.html at /usr/local/apache/htdocs/ that gets turned on when an admin touches another file (maintenance.enable?). Possible example email (be sure to cc qateam and other relevant parties):

Dear cluster-admin,
With the current issue on the RR, can we update the site to display the /usr/local/apache/htdocs/maintenance.html page using the maintenance.enable mechanism, since it looks like the issue is not going to be resolved soon?
Thanks!


P.S. If you ever want to edit the maintenance.html page, ask for your edited copy to be pushed with a request like this:

Dear Pushers,

Please push:
/usr/local/apache/htdocs/maintenance.html

to 
/usr/local/apache/htdocs/maintenance.html

Reason: Update to the maintenance.html page.

Also, if we are putting up the maintenance.html page, we should send an email to genome-announce, as there is a line on that page that suggests our "forum may contain details about this outage."

Example email to genome-announce:

Browser Maintenance Today, Dec 3rd @ 4 pm

We will be performing some hardware maintenance this afternoon, the 3rd of December, from 4 to 5 pm Pacific Standard Time (UTC-8), during our scheduled Thursday maintenance window.

Due to recent power outages, we need to restart replication setups which may be experienced as a 30-minute service interruption. Thank you in advance for your understanding.

Regards,

To find out whether Pacific time is currently PST or PDT, and its UTC/GMT offset, see: https://www.timeanddate.com/time/zone/usa/santa-cruz
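
The PST/PDT label and UTC offset for an announcement can also be worked out locally. The following is a minimal sketch using Python's standard zoneinfo module (Python 3.9+, with time zone data available on the system). The function name pacific_label is hypothetical, and the year 2020 in the usage lines is an assumption, since the example email above does not state a year.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Santa Cruz follows US Pacific time, identified as America/Los_Angeles in the tz database.
PACIFIC = ZoneInfo("America/Los_Angeles")


def pacific_label(year, month, day, hour, minute=0):
    """Return a label such as 'PST (UTC-8)' for the given local Pacific time."""
    local = datetime(year, month, day, hour, minute, tzinfo=PACIFIC)
    offset_hours = local.utcoffset().total_seconds() / 3600
    # '+g' keeps the sign and drops a trailing '.0'.
    return f"{local.tzname()} (UTC{offset_hours:+g})"


print(pacific_label(2020, 12, 3, 16))  # PST (UTC-8)
print(pacific_label(2020, 7, 1, 16))   # PDT (UTC-7)
```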