RR Down: Sending Alert Messages about Genome Browser Being Offline: Difference between revisions

From Genecats
(→‎Confirm issue: Updating the example for Firefox tabs to check machines, adding genome-preview)
 
(20 intermediate revisions by 2 users not shown)
Line 4: Line 4:
==Contact cluster-admin/cc qateam==
==Contact cluster-admin/cc qateam==
===Check logs===
===Check logs===
See [http://genomewiki.cse.ucsc.edu/genecats/index.php/Checking_RR_status_through_hgTracksRandom Checking_RR_status_through_hgTracksRandom] where you can <code>tail -100 /hive/users/qateam/perf/hgTracksRandom.log</code> to see the history of the RR over 15 minute intervals.  
See [http://genomewiki.cse.ucsc.edu/genecats/index.php/Checking_RR_status_through_hgTracksRandom Checking_RR_status_through_hgTracksRandom] where you can <code>tail -100 /hive/users/qateam/perf/hgTracksRandom.log</code> to see the history of the RR over 15 minute intervals.  Check out the [http://genomewiki.ucsc.edu/genecats/index.php/Apache_error_log_output Apache error log output] page to learn more about trying to figure out what a user might be doing.  
===Confirm issue===
===Confirm issue===
Navigate to the machines to confirm there is a problem.
Navigate to the machines to confirm there is a problem.


One approach is to have a secondary browser open new windows with all of the machines open as tabs for the home page.
::One approach is to have a secondary browser open new windows with all of the machines open as tabs for the home page.
 
::For example, if Chrome is your main browser and Firefox is your secondary under Preferences/General "Home page:" and When Firefox starts: Show my homepage: paste the following for your homepage:  
For exampe, if Chrome is your main browser and Firefox is your secondary under Preferences/General "Home page:" and When Firefox starts: Show my homepage: paste the following for your homepage:  
::<code>hgw0.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38&measureTiming=1|hgw1.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38&measureTiming=1|hgw2.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38&measureTiming=1|genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&measureTiming=1|genome-asia.ucsc.edu/cgi-bin/hgTracks?db=hg38&measureTiming=1|genome-preview.ucsc.edu/cgi-bin/hgTracks?db=hg38&measureTiming=1</code>  
 
::These open hgw0-hgw2, genome-euro, genome-asia, and genome-preview.
<code>hgw0.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38|hgw1.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38|hgw2.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38|hgw3.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38|hgw4.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38|hgw5.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38|hgw6.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38|genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38|http://genome-asia.ucsc.edu/cgi-bin/hgTracks?db=hg38| http://hgwdev.cse.ucsc.edu/cgi-bin/hgTracks?db=monDom5&hubUrl=http://genome-test.cse.ucsc.edu/~hiram/hubs/rrCGIStats/hub.txt&position=chr1%3A460068880-469555993</code>  
 
These open hgw0-hgw6, genome-euro, genome-asia, and Hiram's cool monitoring hub on hgwdev.


===Send email===
===Send email===
If things look serious send an email to cluster-admin and qateam sharing that the RR (or specific machine, say hgw5 if that what you checking shows)  is down.
If things look serious send an email to cluster-admin and qateam & browser-dev sharing that the RR (or specific machine, say hgw5 if that what you checking shows)  is down.


==Things are bad: update twitter/genome-announce==
==Things are bad: update twitter==
If cluster-admin do not come back with a fix within half an hour, it is probably a good idea to start thinking about notifying the greater community. If the error is minor, for example, only one machine is out (say hgw5) then perhaps it isn't as important to notify the community.  But if it is bad, for example mailing list questions start coming in, it might be time to update twitter and send an announcement.  
If cluster-admin do not come back with a fix within half an hour, it is probably a good idea to start thinking about notifying the greater community. If the error is minor, for example, only one machine is out (say hgw5) then perhaps it isn't as important to notify the community.  But if it is bad, for example mailing list questions start coming in, it might be time to update twitter.  
===Be sure to say genome-asia and genome-euro are available (if they are).===
===Be sure to say genome-asia and genome-euro are available (if they are).===
 
See this note about our [http://genomewiki.cse.ucsc.edu/genecats/index.php/Facebook_update#Twitter twitter account].
Here are some example [http://genomewiki.cse.ucsc.edu/genecats/index.php/Facebook_update#Twitter twitter] updates:
Here are some example twitter  updates:


* The Genome Browser is unexpectedly down. Please rest assured we are working on having it back up ASAP!
* The Genome Browser is unexpectedly down. Please rest assured we are working on having it back up ASAP!
Line 30: Line 27:
* We have now resolved the problem on our main site. We apologize for any inconvenience and thank you for your understanding.
* We have now resolved the problem on our main site. We apologize for any inconvenience and thank you for your understanding.


==Things are really bad (over an hour+ offline): update Index.html==
==Things are really bad (over an hour+ offline): Ask cluster-admin to update to display the maintenance page==
 
This [http://redmine.soe.ucsc.edu/issues/9608#note-40 RM] has some history about this page.  There is a file maintenance.html at  /usr/local/apache/htdocs/ that gets turned on when admin touches another file (maintenance.enable?). Possible example email (be sure to CC the QAteam and other relevant parties):
 
::Dear cluster-admin,
::
::With the current issue on the RR, can we update the site to have the  /usr/local/apache/htdocs/maintenance.html page display with the maintenance.enable mechanism since it looks like it is not going to be resolved soon.
::Thanks!
 


P.S.  If you ever want to edit this page, when you push it, ask for it to be pushed to:

<pre>
Dear Pushers,

Please push:
/usr/local/apache/htdocs/maintenance.html

to
/usr/local/apache/htdocs/maintenance.html

Reason: Update to the maintenance.html page.
</pre>

Here is an example and the html that could be put in place.

[[File:ExampleFireDrill.png]]

<pre>
<!--temporoary note about genome-euro and genome-asia -->
      <div id="devWarningRow" class="jwRow">
        <div id="devWarningBox" class="jwWarningBox jwWarningBoxStatic">
          <b> The Genome Browser Site Is Unexpectedly Offline, Mirror Sites Available.</b>
          <p>
          While we work on returning our main site, our Asian and European mirrors are up and available:
          <li><a href="http://genome-euro.ucsc.edu" target="_blank">http://genome-euro.ucsc.edu</a></li>
          <li><a href="http://genome-asia.ucsc.edu" target="_blank">http://genome-asia.ucsc.edu</a></li>
          </p>
          <p>
          On our mirror sites, custom track and custom session data will be divergent as they use
          different machines to store the data, please read more
          <a href="goldenPath/help/genomeEuro.html" target="_blank">here</a>.
          </p>
          <p>
          Please know we are working on having our main site back up as soon as possible.
          We apologize for any inconvenience and thank you for your understanding.
          </p>
        </div>
      </div>
</pre>


This wouldn't have to be commited, it could be temporarily put in place with a make beta and pushed out, or with a direct edit to /usr/local/apache/htdocs/index.html by ssh'ing to the machines (pushing should wipe away the changes later).
Also, '''if we are putting up the maintenance.html ''' we should send an email to genome-announce as there is a line on that page that suggests our [https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome-announce "forum] may contain details about this outage."
 
Example email to genome announce:
<pre>
Browser Maintenance Today, Dec 3rd @ 4 pm


We will be performing some hardware maintenance this afternoon, the 3rd of December from 4 - 5 pm (UTC-8) Pacific Standard Time during our scheduled Thursday maintenance window.
Due to recent power outages, we need to restart replication setups which may be experienced as a 30-minute service interruption. Thank you in advance for your understanding.
Regards,
</pre>


Get the PST or PDT here and the - UTC/GMT: 
https://www.timeanddate.com/time/zone/usa/santa-cruz


[[Category:Browser QA]]  
[[Category:Browser QA]]  
[[Category:Browser Development]]
[[Category:Browser Development]]

Latest revision as of 21:00, 15 September 2020

Overview

This page has reminders of what to do if the RR is down for a long period. You want to verify the problem, contact cluster-admin (cc'ing the team), and then, if it isn't fixed in a reasonable amount of time, consider additional messages.

Contact cluster-admin/cc qateam

Check logs

See Checking_RR_status_through_hgTracksRandom where you can tail -100 /hive/users/qateam/perf/hgTracksRandom.log to see the history of the RR over 15-minute intervals. Check out the Apache error log output page to learn more about figuring out what a user might be doing.
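If you would rather pull the same lines from a script than from the shell, here is a minimal sketch. It only mimics tail; it assumes nothing about the format of the log lines, and the function name last_lines is made up for this example.

```python
from collections import deque

# Log path as given on this page.
LOG_PATH = "/hive/users/qateam/perf/hgTracksRandom.log"


def last_lines(path, count=100):
    """Return the last `count` lines of a text file, like `tail -<count>`."""
    # A bounded deque keeps only the newest `count` lines while the file
    # is streamed, so the whole log never has to sit in memory.
    with open(path, "r", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]
```

Calling last_lines(LOG_PATH) on a machine that can see /hive should give the same lines as the tail command above.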

Confirm issue

Navigate to the machines to confirm there is a problem.

One approach is to set up a secondary browser so that each new window opens all of the machines as home page tabs.
For example, if Chrome is your main browser and Firefox is your secondary, go to Preferences/General in Firefox, set "When Firefox starts:" to "Show my homepage", and paste the following into "Home page:":
hgw0.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38&measureTiming=1|hgw1.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38&measureTiming=1|hgw2.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38&measureTiming=1|genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&measureTiming=1|genome-asia.ucsc.edu/cgi-bin/hgTracks?db=hg38&measureTiming=1|genome-preview.ucsc.edu/cgi-bin/hgTracks?db=hg38&measureTiming=1
These open hgw0-hgw2, genome-euro, genome-asia, and genome-preview.
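The homepage string above is just the same hgTracks query repeated per host and joined with "|", so it can be rebuilt when the machine list changes. Below is a rough sketch of that, plus an optional probe that fetches the page and times it. The host list and query come from this page; the function names, the https scheme, and the 30-second timeout are assumptions for illustration, not anything cluster-admin has specified.

```python
import time
import urllib.error
import urllib.request

# Hosts and query string taken from the homepage string on this page.
HOSTS = [
    "hgw0.soe.ucsc.edu",
    "hgw1.soe.ucsc.edu",
    "hgw2.soe.ucsc.edu",
    "genome-euro.ucsc.edu",
    "genome-asia.ucsc.edu",
    "genome-preview.ucsc.edu",
]
QUERY = "/cgi-bin/hgTracks?db=hg38&measureTiming=1"


def homepage_string(hosts=HOSTS):
    """Build the pipe-separated value to paste into Firefox's "Home page:"."""
    return "|".join(host + QUERY for host in hosts)


def probe(host, timeout=30, scheme="https"):
    """Fetch hgTracks on one host; return (HTTP status or None, seconds).

    The scheme and timeout are assumptions, adjust as needed.
    """
    url = f"{scheme}://{host}{QUERY}"
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error code (e.g. 500 or 503).
        status = err.code
    except (urllib.error.URLError, OSError):
        # No usable answer at all: DNS failure, refused connection, timeout.
        status = None
    return status, time.monotonic() - started
```

Something like `for host in HOSTS: print(host, probe(host))` gives a quick table to paste into the email to cluster-admin. A status of None or a very long elapsed time is a hint, not proof, so still look at the pages in a browser.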

Send email

If things look serious, send an email to cluster-admin, qateam, and browser-dev reporting that the RR (or a specific machine, say hgw5, if that is what your checking shows) is down.

Things are bad: update twitter

If cluster-admin does not come back with a fix within half an hour, it is probably a good idea to start thinking about notifying the greater community. If the problem is minor, for example only one machine is out (say hgw5), then it may not be as important to notify the community. But if it is bad, for example mailing list questions start coming in, it might be time to update twitter.

Be sure to say genome-asia and genome-euro are available (if they are).

See this note about our twitter account. Here are some example twitter updates:

  • We have now resolved the problem on our main site. We apologize for any inconvenience and thank you for your understanding.

Things are really bad (over an hour+ offline): Ask cluster-admin to display the maintenance page

This RM has some history about this page. There is a file maintenance.html at /usr/local/apache/htdocs/ that gets turned on when an admin touches another file (maintenance.enable?). Possible example email (be sure to CC the QAteam and other relevant parties):

Dear cluster-admin,
With the current issue on the RR, can we have the site display the /usr/local/apache/htdocs/maintenance.html page using the maintenance.enable mechanism, since it looks like the issue is not going to be resolved soon?
Thanks!


P.S. If you ever want to edit this page, when you push it, ask for it to be pushed to:

Dear Pushers,

Please push:
/usr/local/apache/htdocs/maintenance.html

to 
/usr/local/apache/htdocs/maintenance.html

Reason: Update to the  maintenance.html page. 

Also, if we are putting up the maintenance.html page, we should send an email to genome-announce, as there is a line on that page that suggests our "forum may contain details about this outage."

Example email to genome announce:

Browser Maintenance Today, Dec 3rd @ 4 pm

We will be performing some hardware maintenance this afternoon, the 3rd of December from 4 - 5 pm (UTC-8) Pacific Standard Time during our scheduled Thursday maintenance window. 

Due to recent power outages, we need to restart replication setups which may be experienced as a 30-minute service interruption. Thank you in advance for your understanding.

Regards,

Check whether Pacific time is currently PST or PDT, and get its UTC/GMT offset, here: https://www.timeanddate.com/time/zone/usa/santa-cruz
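The same lookup can also be done locally. This is a small sketch using Python's standard zoneinfo module (Python 3.9+, and it needs the system time zone database to be installed). The function name pacific_label is made up for this example, and the year in the usage note is arbitrary since the example email does not give one.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Santa Cruz follows the America/Los_Angeles zone.
PACIFIC = ZoneInfo("America/Los_Angeles")


def pacific_label(local_naive):
    """Return (abbreviation, UTC offset label) for a Pacific wall-clock time."""
    aware = local_naive.replace(tzinfo=PACIFIC)
    offset_hours = aware.utcoffset().total_seconds() / 3600
    # "+g" keeps the sign and drops a trailing ".0", e.g. -8.0 -> "-8".
    return aware.tzname(), f"UTC{offset_hours:+g}"
```

For instance, pacific_label(datetime(2015, 12, 3, 16, 0)) covers a 4 pm announcement on a December 3rd and should match the "(UTC-8) Pacific Standard Time" wording in the example email, while a summer date would come back as PDT.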