Monitoring Tasks Notes

This page is intended to document procedures and notes for various QA monitoring tasks.

Cronjob: Results from checkGbibMd5.sh

Hiram generally requests the vXXX gBiB Store Push (push request example), usually soon after the Tuesday release of CGIs to the RR/world. If that push has not been done before this script runs, you may see an md5 mismatch. Wait until the end of the week (possibly until the end of the Friday following the release) to give Hiram time to do the push, and check the push request group for the latest store push. If the md5sums still do not match after the GBiB has been pushed to the store, let Hiram know.

The /cluster/home/qateam/bin/scripts/checkGbibMd5.sh script checks for matching 'modify' times between beta's hgTracks and RR's hgTracks:

ssh qateam@hgwbeta stat -c %y /usr/local/apache/cgi-bin/hgTracks
2017-03-20 10:31:02.000000000 -0700
ssh qateam@hgw1 stat -c %y /usr/local/apache/cgi-bin/hgTracks
2017-03-20 10:31:02.000000000 -0700

If those times match, the script compares the md5sum of the local gbibBeta.zip with that of the version on the store. These should also match, but in the example below they do not:

md5sum /usr/local/apache/htdocs/gbib/gbibBeta.zip | awk '{print $1}'
14e5f65fd19ecf43af05a313f884d26c
curl -s https://genome-store.ucsc.edu/media/products/gbib.zip | md5sum | awk '{print $1}'
0a90536be4f4e2fdb3e2d865762db818
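
The two-stage comparison above can be sketched as a small helper. This is only an illustration and is not taken from checkGbibMd5.sh; the function name check_match is hypothetical, and the commented-out wiring simply reuses the commands shown above.

```shell
#!/bin/bash
# Hypothetical helper (not taken from checkGbibMd5.sh): compare two values,
# such as two modify times or two md5sums, and report the result.
check_match() {
    local label="$1" a="$2" b="$3"
    if [ "$a" = "$b" ]; then
        echo "$label: match"
    else
        echo "$label: MISMATCH ($a vs $b)"
        return 1
    fi
}

# Possible wiring, reusing the commands shown above (left commented out
# because it needs ssh access to the servers):
# betaTime=$(ssh qateam@hgwbeta stat -c %y /usr/local/apache/cgi-bin/hgTracks)
# rrTime=$(ssh qateam@hgw1 stat -c %y /usr/local/apache/cgi-bin/hgTracks)
# localSum=$(md5sum /usr/local/apache/htdocs/gbib/gbibBeta.zip | awk '{print $1}')
# storeSum=$(curl -s https://genome-store.ucsc.edu/media/products/gbib.zip | md5sum | awk '{print $1}')
# check_match "hgTracks mtime" "$betaTime" "$rrTime" &&
#     check_match "gbib md5sum" "$localSum" "$storeSum"
```

Chaining the two calls with && mirrors the order described above: the md5sums are only compared once the hgTracks times agree.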

Cronjob: Results from checkMetaAday.csh

This monitoring task compares metadata, such as the tables in hgcentral, across hgwdev, hgwbeta, and the RR, and emails a report so that any differences can be investigated and corrected. The output is written to genecats, where each database gets an entry: http://genecats.cse.ucsc.edu/qa/test-results/metadata/

Email Example

checkMetaAday.csh mm8

database = mm8

 0 dbDb.mm8.hgcentralbetaOnly
 0 dbDb.mm8.hgcentralOnly
 1 dbDb.mm8.common
 0 blatServers.mm8.hgcentralbetaOnly
 0 blatServers.mm8.hgcentralOnly
 2 blatServers.mm8.common
 0 defaultDb.mm8.hgcentralbetaOnly
 0 defaultDb.mm8.hgcentralOnly
 0 defaultDb.mm8.common
 0 genomeClade.mm8.hgcentralbetaOnly
 0 genomeClade.mm8.hgcentralOnly
 1 genomeClade.mm8.common
 0 liftOverChain.mm8.hgcentralbetaOnly
 1 liftOverChain.mm8.hgcentralOnly   <---This should be a zero. There is "1" row difference. Need to fix. 
52 liftOverChain.mm8.common


  details in
 http://genecats.cse.ucsc.edu/qa/test-results/metadata/details
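
In a report like the one above, the lines that need attention are the "Only" categories with a non-zero count. A minimal sketch of a filter that picks them out, assuming the report has been saved to a file or is piped in; the function name flag_diffs is hypothetical and is not part of checkMetaAday.csh:

```shell
#!/bin/bash
# Hypothetical filter: read a checkMetaAday.csh report on stdin and print
# every "...Only" category whose row count is greater than zero.
flag_diffs() {
    awk '$2 ~ /Only$/ && ($1 + 0) > 0 { print $2, $1 }'
}

# Example: flag_diffs < report.txt   (report.txt is a saved copy of the email)
```

The ".common" lines are ignored because a non-zero count there is expected.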

Process Example

Problem: The row below is in the rr liftOverChain table but not in beta

ornAna1 mm8 /gbdb/ornAna1/liftOver/ornAna1ToMm8.over.chain.gz 0.1 0 0 Y 1 N

Solution: Add the row to the beta liftOverChain table

First, double-check. On rr:

hgsql -h genome-centdb -Ne "SELECT * FROM liftOverChain WHERE fromDb = 'ornAna1' AND toDb = 'mm8'" hgcentral
+---------+-----+---------------------------------------------------+-----+---+---+---+---+---+
| ornAna1 | mm8 | /gbdb/ornAna1/liftOver/ornAna1ToMm8.over.chain.gz | 0.1 | 0 | 0 | Y | 1 | N |
+---------+-----+---------------------------------------------------+-----+---+---+---+---+---+

On beta:

hgsql -h hgwbeta -Ne "SELECT * FROM liftOverChain WHERE fromDb = 'ornAna1' AND toDb = 'mm8'" hgcentralbeta

No results; no entries match. The row that exists in the rr table needs to be copied into the beta table.

First, make a file of the row which is on rr but not on beta.

hgsql -h genome-centdb -Ne "SELECT * FROM liftOverChain WHERE fromDb = 'ornAna1' AND toDb = 'mm8'" hgcentral > chain.dev
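
Before loading, it may be worth confirming that the dump actually captured a complete row. A minimal sketch, assuming the redirected hgsql output is tab-separated and that liftOverChain has the nine columns visible in the query result above; the function name check_dump is hypothetical:

```shell
#!/bin/bash
# Hypothetical pre-load check: succeed only if the file is non-empty and
# every line has the expected number of tab-separated fields.
check_dump() {
    local file="$1" expected_cols="$2"
    awk -F'\t' -v n="$expected_cols" '
        NF != n { bad = 1 }
        END { exit (NR == 0 || bad) ? 1 : 0 }
    ' "$file"
}

# Example: check_dump chain.dev 9 && echo "chain.dev looks loadable"
```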

Next, load the file into beta:

hgsql -h hgwbeta -e "LOAD DATA LOCAL INFILE 'chain.dev' INTO TABLE liftOverChain" hgcentralbeta

Look in beta to confirm that the contents of the file made it into the table:

hgsql -h hgwbeta -Ne "SELECT * FROM liftOverChain WHERE fromDb = 'ornAna1' AND toDb = 'mm8'" hgcentralbeta
+---------+-----+---------------------------------------------------+-----+---+---+---+---+---+
| ornAna1 | mm8 | /gbdb/ornAna1/liftOver/ornAna1ToMm8.over.chain.gz | 0.1 | 0 | 0 | Y | 1 | N |
+---------+-----+---------------------------------------------------+-----+---+---+---+---+---+

Looks good. Run checkMetaAday.csh again for both ornAna1 and mm8:

checkMetaAday.csh mm8

database = mm8

 0 dbDb.mm8.hgcentralbetaOnly
 0 dbDb.mm8.hgcentralOnly
 1 dbDb.mm8.common
 0 blatServers.mm8.hgcentralbetaOnly
 0 blatServers.mm8.hgcentralOnly
 2 blatServers.mm8.common
 0 defaultDb.mm8.hgcentralbetaOnly
 0 defaultDb.mm8.hgcentralOnly
 0 defaultDb.mm8.common
 0 genomeClade.mm8.hgcentralbetaOnly
 0 genomeClade.mm8.hgcentralOnly
 1 genomeClade.mm8.common
 0 liftOverChain.mm8.hgcentralbetaOnly
 0 liftOverChain.mm8.hgcentralOnly  <---Looks good!
 53 liftOverChain.mm8.common <--up by 1, good!

SLA Monitoring & Reporting

In Jan 2017, these Service Level Agreement data were migrated to a complicated Google Spreadsheet:

  • SLA Outage Report Worksheet.
  • Current manager of this spreadsheet: BrianL
  • Please contact the current manager of the spreadsheet to enter/modify outages.
  • The spreadsheet has a "README" tab to explain the spreadsheet format and best practices.

The old (unused) way to report outages was to add them to an HTML table, SLA.html.

Cronjob: Results from realTime.csh (previously known as gbLoaded)

Previously, this job checked update times for the gbLoaded table in the database of the day. The problem is that the genome-asia machine doesn't like the pushed gbLoaded table, because something about the timestamp field type is different on that machine. No other GenBank table uses that field type, and nothing uses gbLoaded except this job, so Braney proposed switching the check from gbLoaded to xenoRefGene.

  • The job is tracked in the genecats repository under qa/crontabs.
  • Brian Lee updated this on hgwdev 3/30/17 and in the repository with commit 795b903 changing gbLoaded to xenoRefGene
  • Note that there are a few assemblies that don't have this table (xenTro3, fr1, fr2, fr3, eboVir3, dm2). When the database of the day is on those assemblies, the check will provide no output.
  • QA should check that the update times for all servers are close together, within a week. Set 1 (dev & beta) usually have the same times, Set 2 (rr/euro/asia) are usually on the same day, and the two sets generally differ by less than a week.

Example output:

[qateam@hgwdev /cluster/home/qateam] /cluster/bin/scripts/realTime.csh `/cluster/bin/scripts/databaseAday.csh today` xenoRefGene verbose

xenoRefGene
=============
dev  2017-03-28 09:26:59
beta 2017-03-28 09:26:59

rr   2017-03-22 22:49:49
euro 2017-03-23 06:49:49
asia 2017-03-23 14:49:49
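
The "within a week" rule can be checked mechanically. A minimal sketch, assuming GNU date is available for parsing the timestamps; the function name within_week is hypothetical, and the times are treated as UTC for simplicity, which ignores any timezone differences between servers:

```shell
#!/bin/bash
# Hypothetical check: succeed if two timestamps are at most 7 days apart.
# Relies on GNU date's -d option to parse "YYYY-MM-DD HH:MM:SS".
within_week() {
    local a b diff
    a=$(date -u -d "$1" +%s) || return 2
    b=$(date -u -d "$2" +%s) || return 2
    diff=$(( a > b ? a - b : b - a ))
    [ "$diff" -le $(( 7 * 24 * 3600 )) ]
}

# Example with the beta and rr times from the output above:
# within_week "2017-03-28 09:26:59" "2017-03-22 22:49:49" && echo "ok"
```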


Cronjob: Results from checkTableStatus.csh " TABLE STATUS dump" emails

This cronjob checks the health of a TABLE STATUS dump that Mark set up for checking the health of GenBank tracks. The same dump is also used by another QA tool, updateTimes.csh. The only thing that matters from this cronjob is that the RR date looks recent:

TABLE STATUS files were last dumped:
hgwdev: 2016.10.24
hgwbeta: 2016.10.24
rr: 2019.01.20

We want the RR line (rr: 2019.01.20) to show a recent date. As of 2019, the dump files can be listed with ls -lrt /hive/data/outside/genbank/var/tblstats/hgnfs1/ | tail; at one time they were under /cluster/data/genbank/var/tblstats/hgnfs1/ instead.
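
Whether the RR date is "recent" can be turned into a number of days. A minimal sketch, assuming GNU date; the function name dump_age_days and the 7-day threshold in the usage comment are hypothetical choices, not part of checkTableStatus.csh:

```shell
#!/bin/bash
# Hypothetical helper: days between a dump date written as YYYY.MM.DD and a
# reference date (default: today). Relies on GNU date's -d option.
dump_age_days() {
    local stamp="${1//./-}"
    local now="${2:-$(date -u +%Y-%m-%d)}"
    echo $(( ( $(date -u -d "$now" +%s) - $(date -u -d "$stamp" +%s) ) / 86400 ))
}

# Example: warn if the rr dump is more than 7 days old (arbitrary threshold).
# [ "$(dump_age_days 2019.01.20)" -le 7 ] || echo "rr TABLE STATUS dump looks stale"
```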

We want that kept up to date because the script /cluster/bin/scripts/updateTimes.csh calls /cluster/bin/scripts/getTableStatus.csh for dev and beta, but for the RR it calls /cluster/bin/scripts/getRRtableStatus.csh, which carries this comment:

 
#  gets the status of any table from an RR database
#  using mark's genbank dumps.

That script uses these dump files to find the last update date on the RR. I believe this is to avoid running abusive hgcentral queries against the RR.


UCSC Entrez LinkOut

A rare non-cron task: making any changes required to our Entrez LinkOut files. Changes are only requested once every few years, but they require some fiddling with our XML files and decisions about how we want to present ourselves on the various Entrez websites (e.g. which types of records are most important for us to display the UCSC link on). An example is described here: https://groups.google.com/a/soe.ucsc.edu/d/msg/browser-qa/8F7RU4xiMvc/d7laysnXNG0J

Entrez LinkOut sends requests for changes (and statistics) to the browser-qa email address. See their normal statistics update emails: https://groups.google.com/a/soe.ucsc.edu/forum/#!searchin/browser-qa/%22LinkOut$20team$20%22%7Csort:date

Check that Blat Servers Are Running OK

See the BlatServer Backup Page. There is a way to check the blat error logs (located in /scratch/$db/gfServer.log) when you get a report. For example, let's say you get this report:

Couldn't read string length
Error reading status information from blat1d:17779
error 255 on mm10 blat1d:17779
Summary:
problems:
mm10 blat1d:17779

You can note that this is on blat1d for mm10 and then go to the blat1d machine:

ssh qateam@blat1d

Once you are connected (you might have to answer yes, as it is a connection you probably have not made before), you can look for the log entries for the specific day. In this case the error came in on 2019/03/07 and it was for the mm10 gfServer:

grep "2019/03/07" /scratch/mm10/gfServer.log | less

Then hit "f" to page forward in the log until around the time of the incident. In this case it was reported around 4am, but the incident that brought down the server happened around midnight:

2019/03/07 00:06:31: info: gfServer version 36x2 on host blat1d, port 17779 connection from 132.76.220.198
2019/03/07 00:06:31: debug: 0ddf270562684f29query 7028
2019/03/07 07:02:32: info: gfServer version 36x2 on host blat1d, port 17779  (2019-03-07 07:02)

In this case, the most that can be gleaned is that the IP is 132.76.220.198. One would hope for a record of the query too, but that didn't make it into the log.

Jorge adds that a quick check is to search the log for "queries", which usually appears in a line that says "Server ready for queries!". This line is printed after a blat server restart. If you find that line, the last query before the server restart is probably the one that killed the server.
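
That search can be sketched with awk. This assumes incoming queries are logged with the text "connection from", as in the sample log lines above, and that restart lines contain "Server ready for queries"; the function name last_before_restart is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: for each "Server ready for queries" line in a
# gfServer log, print the last "connection from" line seen before it.
last_before_restart() {
    awk '
        /connection from/ { last = $0 }
        /Server ready for queries/ && last != "" { print last; last = "" }
    ' "$1"
}

# Example: last_before_restart /scratch/mm10/gfServer.log
```

Each printed line gives the time and source IP of the last connection before a restart, which is the starting point for working out which query was involved.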