RR Down: Sending Alert Messages about Genome Browser Being Offline

From Genecats

Overview

This page has reminders of what to do if the RR is down for a long period. You want to verify the problem, contact cluster-admin (cc'ing the team), and then, if it isn't fixed in a reasonable amount of time, consider additional messages.

Contact cluster-admin/cc qateam

Check logs

See Checking_RR_status_through_hgTracksRandom, where you can run tail -100 /hive/users/qateam/perf/hgTracksRandom.log to see the history of the RR at 15-minute intervals.

Confirm issue

Navigate to the machines to confirm there is a problem.

My approach is to set my secondary browser (Firefox, in my case) to open every machine in its own tab at startup. Under Preferences/General, set "When Firefox starts:" to "Show my homepage" and paste the following pipe-separated list into "Home page:".

hgw0.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38|hgw1.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38|hgw2.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38|hgw3.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38|hgw4.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38|hgw5.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38|hgw6.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38|genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38|http://genome-asia.ucsc.edu/cgi-bin/hgTracks?db=hg38|http://hgwdev.cse.ucsc.edu/cgi-bin/hgTracks?db=monDom5&hubUrl=http://genome-test.cse.ucsc.edu/~hiram/hubs/rrCGIStats/hub.txt&position=chr1%3A460068880-469555993

These open hgw0-hgw6, genome-euro, genome-asia, and Hiram's cool monitoring hub on hgwdev.
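If you would rather check the machines from a terminal than from browser tabs, the same URLs can be requested in a loop. The following is only a rough sketch, not an existing team script: the function names (build_url, fetch_status, check_hosts, down_hosts, format_report) and the 30-second timeout are assumptions made for illustration, and the hgwdev monitoring hub is left out. An HTTP 200 only shows that the server answered; it does not prove that hgTracks rendered correctly, so still look at a page by eye before sending any alert.

```python
import urllib.error
import urllib.request

# Hosts taken from the tab list above (hgw0-hgw6 plus the two mirrors).
HOSTS = [f"hgw{i}.soe.ucsc.edu" for i in range(7)] + [
    "genome-euro.ucsc.edu",
    "genome-asia.ucsc.edu",
]


def build_url(host):
    """Build the same hg38 hgTracks URL used in the browser tabs."""
    return f"http://{host}/cgi-bin/hgTracks?db=hg38"


def fetch_status(url, timeout=30):
    """Return the HTTP status code, or None if the host cannot be reached.

    The 30-second timeout is an arbitrary assumption.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (for example 500).
        return err.code
    except OSError:
        # Covers URLError, connection failures, and timeouts.
        return None


def check_hosts(hosts, fetch=fetch_status):
    """Map each host to its status; fetch is injectable for offline testing."""
    return {host: fetch(build_url(host)) for host in hosts}


def down_hosts(results):
    """List the hosts that did not return HTTP 200."""
    return sorted(host for host, status in results.items() if status != 200)


def format_report(results):
    """One tab-separated line per host: name, OK/PROBLEM, raw status."""
    lines = []
    for host, status in results.items():
        label = "OK" if status == 200 else "PROBLEM"
        lines.append(f"{host}\t{label}\t{status}")
    return "\n".join(lines)
```

To run a live check, call print(format_report(check_hosts(HOSTS))). Any host listed by down_hosts is a candidate for the alert email, after you have confirmed the problem in a browser.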

Send email

If things look serious, send an email to cluster-admin and qateam saying that the RR (or a specific machine, say hgw5, if that is what your checking shows) is down.
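As a rough illustration of what such a message might contain, here is a sketch that only composes the email and does not send it. The function name build_alert, the wording of the subject and body, and the example.org addresses in the usage note are all hypothetical placeholders; substitute the real cluster-admin and qateam addresses and your own wording.

```python
from email.message import EmailMessage


def build_alert(down, sender, recipients):
    """Compose (but do not send) an alert naming what appears to be down.

    down is a list of machine names, for example ["hgw5"]. If more than one
    machine is listed, the subject refers to the RR as a whole.
    """
    if not down:
        raise ValueError("nothing to report: the list of down machines is empty")
    what = down[0] if len(down) == 1 else "the RR"
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = f"{what} appears to be down"
    # Hypothetical wording; adjust to describe what you actually observed.
    msg.set_content(
        "The following machines are not responding to hgTracks requests: "
        + ", ".join(down)
        + "\n"
    )
    return msg
```

For example, build_alert(["hgw5"], "you@example.org", ["cluster-admin@example.org", "qateam@example.org"]) returns a message whose subject is "hgw5 appears to be down". You can print the result to review it before pasting the text into your mail client.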

Things are bad: update twitter/genome-announce

If cluster-admin does not come back with a fix within half an hour, it is probably a good idea to start thinking about notifying the greater community. If the problem is minor, for example only one machine is out (say hgw5), then it may not be as important to notify the community. But if it is bad, for example mailing list questions start coming in, it might be time to update twitter and send an announcement to genome-announce.

Be sure to say genome-asia and genome-euro are available (if they are).

Here are some example twitter updates:

  • We have now resolved the problem on our main site. We apologize for any inconvenience and thank you for your understanding.
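The escalation guidance on this page (wait about half an hour for cluster-admin, then consider twitter/genome-announce, then update index.html once the outage passes an hour) can be summarized as a small decision function. This is only a sketch of the rule of thumb, not a policy: the function name escalation_step, the returned strings, and the choice to treat "more than one machine down, or user reports arriving" as serious are assumptions made for illustration. Use your own judgment in a real outage.

```python
def escalation_step(minutes_down, machines_down, user_reports=False):
    """Suggest the next step from the rough thresholds described on this page.

    minutes_down:  how long the problem has lasted, in minutes
    machines_down: how many machines are affected
    user_reports:  True if mailing list questions have started coming in
    """
    if minutes_down < 30:
        # Give cluster-admin about half an hour to come back with a fix.
        return "wait for cluster-admin"
    # Assumption: a single machine out with no user reports counts as minor.
    serious = machines_down > 1 or user_reports
    if not serious:
        return "no community notice yet"
    if minutes_down > 60:
        # Over an hour offline: also put the warning box on index.html.
        return "update twitter/genome-announce and index.html"
    return "update twitter/genome-announce"
```

For instance, escalation_step(45, 1) suggests holding off on a community notice, while escalation_step(45, 7) suggests updating twitter and genome-announce.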

Things are really bad (over an hour offline): update index.html

Here is an example and the html that could be put in place.

ExampleFireDrill.png

<!-- temporary note about genome-euro and genome-asia -->
      <div id="devWarningRow" class="jwRow">
        <div id="devWarningBox" class="jwWarningBox jwWarningBoxStatic">
          <b> The Genome Browser Site Is Unexpectedly Offline, Mirror Sites Available.</b>
          <p>
          While we work on restoring our main site, our Asian and European mirrors are up and available:
          </p>
          <ul>
          <li><a href="http://genome-euro.ucsc.edu" target="_blank">http://genome-euro.ucsc.edu</a></li>
          <li><a href="http://genome-asia.ucsc.edu" target="_blank">http://genome-asia.ucsc.edu</a></li>
          </ul>
          <p>
          On our mirror sites, custom track and saved session data will differ from the main site
          because the mirrors use different machines to store the data. Please read more
          <a href="goldenPath/help/genomeEuro.html" target="_blank">here</a>.
          </p>
          <p>
          Please know we are working on having our main site back up as soon as possible.
          We apologize for any inconvenience and thank you for your understanding.
          </p>
        </div>
      </div>

This wouldn't have to be committed. It could be put in place temporarily with a make beta and pushed out, or with a direct edit to /usr/local/apache/htdocs/index.html after ssh'ing to the machines (a later push should wipe away the changes).