Otto Tracks

How to convert a regular track into an otto track

  1. The engineer sets up an otto job in their own cron according to the instructions below
  2. The engineer runs the otto script manually, omitting the automatic pushing part
  3. A QAer reviews the output of the otto track like a regular track
  4. When it passes QA, it is pushed through the usual push-request mechanism
  5. The engineer adds the otto meta information to this wiki page
  6. The engineer runs the otto job on the regular schedule, letting it get automatically pushed to the RR
  7. When it's running correctly, the otto-meister takes over the cron

Overview

The complete otto crontab can be seen in the following location:

~/kent/src/hg/utils/otto/otto.crontab

As of Sept 21, 2021, we have read access to the root directory where the autoPush scripts live, so any script in that crontab can be investigated directly.

Current Otto Schedule (updated August 2021)

The sysadmins' dev push crontab can be seen here:

cat /etc/crontab
   check/build on hgwdev (otto-meister crontab)
clinGen - daily at 9:00
decipher - daily at 4:11
gwas - weekly (Wednesday) at 4:41
geneReviews - weekly (Tuesday) at 8:00
   <more>

   sysadmin push from hgwdev:
archive - once a week (Sundays) at 3:45
clinGen - once a week (Sundays) at 7:56
clinvar - once a month (11th day) at 1:26
decipher - once a week (Wednesdays) at 3:45 - '''Currently disabled'''
geneReviews - once a week (Fridays) at 1:25
gwas - once a week (Wednesdays) at 3:25
hgPhyloPlaceData - daily at 3:00
hubSearchText - once a week (Sundays) at 21:05
isca - once a week (Wednesdays) at 3:55
lovd - once a week (Saturdays) at 1:25
mastermind - every 3 months (Jan, Apr, Jul, Oct) on the 15th day at 4:55
ncbiHg38RefSeq - once a week (Thursdays) at 3:05
ncbiHg38RefSeqDownload - once a week (Thursdays) at 3:03
nextstrain - daily at 2:22
omim - once a week (Wednesdays) at 21:45
releaseLog - daily at 8:30
uniprot - once a week (Sundays) at 2:00
UShER_SARS-CoV-2 - daily at 3:01
wuhCor1bbi - daily at 22:00
wuhCor1uniprot - once a week (Fridays) at 2:00

   sysadmin push from hgwbeta:
tableDescriptions - daily at 8:00

Quick reminders:
Kent tree location: kent/src/hg/utils/otto/*
Runtime location: /hive/data/outside/otto/*
Current otto-meister: Lou Nassar
Otto-meister crontab, currently viewable in ~lou/crontabs (?)

Otto is an umbrella term for tracks that receive automatic (otto-matic) periodic updates without review from QA. The current structure for this system is a set of scripts that run out of the otto-meister's crontab. Part of the process of passing the otto-meister mantle on to a new victim is to move those crontab entries to the new meister's crontab.

The scripts may be run with whatever frequency is deemed appropriate for each track. Once per week or once per day is most common. The scripts are set up to determine if any actual data changes have occurred since the last build, and to quit if no changes are found.
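As a hedged illustration of that "quit if nothing changed" logic (the URL and file names here are placeholders, not a real provider), a check script might compare a checksum of the freshly downloaded source against the one recorded by the previous run:

# hypothetical change detection; in practice, record the new checksum only after a successful build
wget -q -O newRelease.txt https://example.org/data/release.txt
newSum=`md5sum newRelease.txt | cut -d' ' -f1`
oldSum=`cat lastRelease.md5 2>/dev/null || true`
if [ "$newSum" = "$oldSum" ]; then
    echo "No update"
    exit 0
fi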

All scripts are stored in the kent tree in the directory kent/src/hg/utils/otto/, in separate subdirectories for each track. Makefiles in each of those subdirectories are responsible for copying the scripts to the appropriate runtime location, which will be a subdirectory of /hive/data/outside/otto/. That /hive/data/ subdirectory is also where all data files for the track will be stored (including for previous builds of the track).

After a pipeline has run, the updated data should be loaded into a track on hgwdev. From there, responsibility passes out of the hands of the otto-meister and into those of the system administrators. The admins have their own crontab, which is responsible for migrating those track data out to the beta and live servers.

Structure of an individual otto pipeline

Some variation exists among the existing otto pipelines, as they were developed by different people at different times. Most, however, are arranged according to the following guidelines, and good practice for any future otto pipelines is to follow this structure as well.

There are four main scripts: XXXWrapper.sh, checkXXX.sh, buildXXX.sh, and validateXXX.sh. XXX in each name refers to the name of the track. For example, the OMIM pipeline consists of omimWrapper.sh, checkOmim.sh, buildOmim.sh (actually called buildOmimTracks.sh; note the earlier comment about pipeline variation), and validateOmim.sh.

XXXWrapper.sh

The wrapper script is a small shell script that sets up emailing the otto report to whoever should receive it (generally either the current and previous otto-meister, or the current and upcoming otto-meister when a handoff is imminent) and then calls the checkXXX.sh script.

#!/bin/sh -e

PATH=/cluster/bin/x86_64:/cluster/bin/scripts:$PATH
EMAIL="person1@soe.ucsc.edu,person2@ucsc.edu"
WORKDIR="/hive/data/outside/otto/omim"

# run the check script and mail its output (the otto report) to the recipients
cd $WORKDIR
./checkOmim.sh $WORKDIR 2>&1 | mail -s "OMIM Build" $EMAIL

checkXXX.sh

The check script is responsible for overseeing the otto pipeline. It starts by fetching enough information to determine whether the track is in need of an update. If the source data has not changed, it can simply exit with a "No update" line for the email to the otto-meister. If the source data has changed, then the check script has several things to do.

First, the check script must create a directory (often named for the date of this latest build) to store the new build. It then fetches any required data files. If those files need to be unpacked, the check script might perform the unpacking or might leave that to the build script (the line gets a little blurry). The check script then invokes the build script using the new build directory as a working directory. Note that many otto tracks are built on multiple assemblies (or are built on one, but might be built on others in the future). Standard practice right now is to create a parent directory named with the date for the build, and subdirectories within it for each assembly. For example, running the omim pipeline on August 16th, 2018 would result in the following directories being created:

/hive/data/outside/otto/omim/2018-08-16
/hive/data/outside/otto/omim/2018-08-16/hg18
/hive/data/outside/otto/omim/2018-08-16/hg19
/hive/data/outside/otto/omim/2018-08-16/hg38

Files common to the three builds may be placed in /hive/data/outside/otto/omim/2018-08-16/, but the run directory for each invocation of buildOmim.sh would be one of:

/hive/data/outside/otto/omim/2018-08-16/hg18
/hive/data/outside/otto/omim/2018-08-16/hg19
/hive/data/outside/otto/omim/2018-08-16/hg38
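Putting those pieces together, the middle of a check script often looks roughly like the following sketch; the fetch step is a placeholder, the assembly list varies per track, and exactly how the build script is invoked differs between pipelines:

#!/bin/sh -e
WORKDIR=/hive/data/outside/otto/omim
TODAY=`date +%F`                      # e.g. 2018-08-16
cd $WORKDIR

# ... decide whether the source data changed; "echo No update; exit 0" if not ...

mkdir -p $TODAY
cd $TODAY
# fetch the shared source files into the dated directory (placeholder)

for db in hg18 hg19 hg38
do
    mkdir -p $db
    cd $db
    ../../buildOmim.sh $db            # loads the "New" tables for this assembly
    cd ..
done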

After the build is complete, the new track data will be stored in a set of "New" tables. For example, if the track data are found in the table "omim", then the build script should place data into an "omimNew" table. The check script then runs a validation script to ensure that the new build in "omimNew" is not a dramatic departure from the previous version of the track (in "omim"). If validation fails, the check script aborts with an error message and does not proceed further. This leaves the current version of the track as the live version.

If the new build passes validation, then the last duties of the check script are to copy the current track tables to "Old" tables (e.g., the "omim" table is renamed to "omimOld"), and the new build into the canonical track tables (e.g., the "omimNew" table is renamed to "omim"). The check script then reports a successful build.

Example of installing new tables:

# swap the "New" tables into place, keeping the previous data in "Old" tables
for table in `cat ../omim.tables`
do
  new=$table"New"
  old=$table"Old"
  # renames $table -> $old and $new -> $table, and drops the previous "Old" table
  hgsqlSwapTables $db $new $table $old -dropTable3
done

This process makes use of a special auxiliary file called XXX.tables. This file lists all of the MySQL tables used by the track, one per line. The file is used both by the validation script and by the final updates in the check script to know which table names are affected.
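For a hypothetical track stored in two tables (these names are made up for illustration), the corresponding XXX.tables file would just contain the canonical table names, one per line:

myTrack
myTrackDetails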

Other auxiliary files are also sometimes involved in the otto process. For example, auxiliary files may contain a set of instructions for retrieving files from an FTP server. These auxiliary files can either be stored in the kent tree or regenerated by the check script itself for each run. Other auxiliary files may contain things like the username and password for accessing an FTP server. Those files are not stored in the kent tree. There are no backups for those files; please don't trash them. If you do trash them, we'll have to contact the data provider again for access.

buildXXX.sh

Each run of the build script is for constructing the track for a specific assembly. The script does whatever it needs to do to convert from the data provider's formats into the formats we use in our track. Sometimes this is straightforward, like copying a VCF file into place. Sometimes this means parsing fields out of a tab-separated file, joining that with position data from a different track, and creating a BED file. When the script is complete, any data that belongs in a table should have been placed in a table by that name with "New" appended. For example, data destined for the "omim" table should be loaded into "omimNew" by the build script. The check script is responsible for moving data from "omimNew" into "omim" after it passes validation.
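As a rough sketch only (the parser, file names, and paths are made up, and hgLoadBed is just one possible loader), a build script for a simple BED-based track might look like:

#!/bin/sh -e
db=$1                                  # assembly to build, e.g. hg19

# convert the provider's tab-separated file into BED (hypothetical parser and paths)
../../parseProviderFile.py ../providerData.txt > omim.bed

# load into the "New" table; the check script swaps it into "omim" after validation
hgLoadBed $db omimNew omim.bed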

validateXXX.sh

The validation script checks to make sure that a new track build looks similar to the previous build. If it does not look similar, it throws an error. This kills the build. The otto-meister is then responsible for manually reviewing the build to see whether there was a legitimate problem with the data or pipeline, or whether it was simply an overzealous warning from the validation script (hint: it's often the latter, but we don't want to relax the threshold and miss something - it doesn't cry wolf often enough to be a problem).

The validation scripts are almost identical among the otto pipelines. They simply compare the live and new builds to see how many lines they have in common and how many appear only in one or the other. If the number of newly created or newly deleted entries is greater than 10% of the number of entries in common, an error is thrown. The results of the comparison are left in a newXXX.stats file for examination by the otto-meister.
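A hedged sketch of that comparison for a table-based track (assembly passed as an argument; the table list, threshold, and stats-file name follow the description above, but exact details vary between pipelines):

#!/bin/sh -e
db=$1                                          # assembly being validated, e.g. hg19

rm -f newOmim.stats
for table in `cat ../../omim.tables`
do
    hgsql -N -e "select * from $table" $db | sort > old.$table.txt
    hgsql -N -e "select * from ${table}New" $db | sort > new.$table.txt

    common=`comm -12 old.$table.txt new.$table.txt | wc -l`
    onlyOld=`comm -23 old.$table.txt new.$table.txt | wc -l`
    onlyNew=`comm -13 old.$table.txt new.$table.txt | wc -l`
    echo "$table common $common onlyOld $onlyOld onlyNew $onlyNew" >> newOmim.stats

    # fail if added or removed rows exceed 10% of the rows in common
    awk -v c=$common -v o=$onlyOld -v n=$onlyNew \
        'BEGIN { if (o > 0.1 * c || n > 0.1 * c) exit 1 }'
done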

Structure of a bigBed otto pipeline

As we move more and more features to bigBed based tracks (and bigWig/BAM/VCF/etc), we rely less and less on MySQL tables for storing track data. Many of the otto jobs have switched all (LOVD, ClinVar, ClinGen, dbVar, mastermind, wuhCor1 public annotations) or part (Decipher, OMIM) of their track data to bigBed.

The structure of the bigBed pipelines is almost identical to that of the regular MySQL table pipelines: a wrapper script runs a checkXXX script, which runs a buildXXX script. There are two main differences:

1. validation can be done with a single `bigBedToBed | awk` command and thus happens in the checkXXX script itself (see the sketch below)

2. bigBeds are placed in /hive/data/outside/otto/<track>/<release-date>/{db1,db2,...}/

Symlinks from /gbdb/<db>/<track>/ can then point to files in the release-date directory for the most up-to-date track data.
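A hedged example of that check (the file paths are placeholders, and the 10% threshold mirrors the table-based validation):

# compare the live bigBed (linked from /gbdb) with the freshly built one; paths hypothetical
oldCount=`bigBedToBed /gbdb/hg38/myTrack/myTrack.bb stdout | wc -l`
newCount=`bigBedToBed myTrack.new.bb stdout | wc -l`
# abort if the new file shrank or grew by more than 10%
echo $oldCount $newCount | awk '{ if ($2 < 0.9*$1 || $2 > 1.1*$1) exit 1 }'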

After you've switched a track to bigBed, you must tell the admins which files to push to the RR and how often; you probably also want them to drop, or at least stop pushing, the old MySQL tables.

Updating an otto pipeline

Sometimes it is necessary to make changes to an existing pipeline. This generally crops up because the build scripts encounter something unexpected. Maybe the data provider changed their file format, maybe they changed the encryption on a file, or maybe there is an error in the data file that they released (and which they should be notified about). All of those have happened in the past.

Step 1: Disable the admin's upload cron job. This is the first and most important step. You don't want any experiments you create while trying to fix the script to wind up on the live server and clobber the existing track. Do this by sending email to the admins, asking them to temporarily deactivate the automatic push of the relevant otto track data (presumably, the other otto tracks are fine and can continue to be pushed).

Step 2: Disable the otto-meister's cron job. Depending on how long the hood is off for the otto pipeline, it can be inconvenient to have half-baked scripts trying to run in the middle of the night. Disable the otto-meister's cron entry for the pipeline to avoid unexpected problems.

Step 3: Start tinkering! The best place to tinker is in the /hive/data/outside/otto/* directory for that pipeline. Once you've got the scripts where you want them, check those changes into the kent tree (or, if things go haywire, restore the original scripts from the kent tree).

Step 4: Testing changes. When you want to test changes you've made to the pipeline, the easiest way is to add a temporary line to the otto-meister's crontab that runs the pipeline a couple of minutes from now. That way you can almost immediately start a run using the exact same environment that the regularly scheduled otto run will use.
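For example, if it is currently 14:30, a temporary entry like the following (using the omim wrapper path from earlier as the example command) would kick off a test run at 14:35:

# temporary test entry - remove once debugging is done
35 14 * * * /hive/data/outside/otto/omim/omimWrapper.sh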

Step 5: When you're ready to commit the changes you've made, check the updated scripts into the kent tree in kent/src/hg/utils/otto/*. Then run a 'make install' in that directory just to ensure the committed files match the ones you've been playing with in the live directory. There is something special about the otto makefiles: by default they will not update the script files in /hive/data! This is (for better or worse) an extra hoop to jump through: you have to specify a destination directory for the files. For example, instead of running

make install

in the kent/src/hg/utils/otto/omim/ directory to update the omim pipeline, you'll need to run

make install PREFIX=/hive/data/outside/otto/omim

Creating a new otto pipeline

Creating a new otto pipeline is a matter of creating a new directory in /hive/data/outside/otto/, a corresponding directory in kent/src/hg/utils/otto/, populating them with the required scripts, and then having the otto-meister add the job to their crontab.

This has gone a bit sideways on occasion, which is why Hiram and Max both have automatically updated tracks that run out of their own personal crontabs. The otto-meister is not responsible for those tracks, though it might be nice to bring them under the otto umbrella some day (likely when Hiram and Max are tired of managing the pipelines themselves). A certain amount of work would probably be required to fit these pipelines into the standard otto structure.

A good quick start is to:

1. Copy the omim/omimWrapper.sh and omim/validateOmim.sh scripts into your new directories, appropriately renamed. Edit those scripts to replace "omim" with your track name. Note that validateOmim.sh depends on ../../omim.tables (because the run directory will be something like "/hive/data/outside/otto/omim/2018-08-16/hg19", but the omim.tables file will live in "/hive/data/outside/otto/omim"). That file should be renamed and populated with the list of tables that you'll be building for the track (almost guaranteed to change as you figure out how you want to organize the track :-) ).

2. Figure out how you're going to download the latest data from the provider and check if there was an update. That goes into your new checkXXX.sh script. If checkXXX.sh determines there is an update, then it should create a DATE/ASSEMBLY subdirectory for the update (e.g., 2018-08-16/hg19), copy the data files into it, and run the buildXXX.sh script.

3. Figure out how you're going to build the track using the provider's data. This goes into your new buildXXX.sh script. The data should be loaded into tables and files with "New" appended to the name, so that this new putative build doesn't overwrite the existing track before it goes through validation.

4. Make sure you've updated the XXX.tables list of tables for your track.

5. Have checkXXX.sh call validateXXX.sh on each assembly/track. If any validate runs fail, checkXXX.sh should also fail. If all validation runs succeed, then checkXXX.sh should call hgsqlSwapTables to update the track (see the sketch after this list).

6. If the whole pipeline succeeds, then checkXXX.sh should report a successful build.
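A hedged sketch of what steps 5 and 6 often look like inside checkXXX.sh, assuming it has already cd'd into the dated build directory (the assembly list and "XXX" names are placeholders):

for db in hg19 hg38
do
    (cd $db && ../../validateXXX.sh $db) || { echo "validation failed for $db"; exit 1; }
done

for db in hg19 hg38
do
    for table in `cat ../XXX.tables`
    do
        hgsqlSwapTables $db ${table}New $table ${table}Old -dropTable3
    done
done
echo "XXX build: success"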

Reminder: The stdout of any processes run by your build will be included in the email sent to the otto-meister. Make liberal use of redirecting output to files or to /dev/null, or use quiet flags, to keep the volume of text down; the noisier the email, the harder it is for the otto-meister to detect errors. Of course, a small amount of information is very helpful for knowing why a build failed.