Otto Tracks


Quick reminders:
* Kent tree location: kent/src/hg/utils/otto/*
* Runtime location: /hive/data/outside/otto/*
* Current otto-meister: Jonathan (Chris Lee is on deck)

Overview

Otto is an umbrella term for tracks that receive automatic (otto-matic) periodic updates without review from QA. The current structure for this system is a set of scripts that run out of the otto-meister's crontab. Part of the process of passing the otto-meister mantle on to a new victim is to move those crontab entries to the new meister's crontab.

The scripts may be run with whatever frequency is deemed appropriate for each track. Once per week or once per day is most common. The scripts are set up to determine if any actual data changes have occurred since the last build, and to quit if no changes are found.

All scripts are stored in the kent tree in the directory kent/src/hg/utils/otto/, in separate subdirectories for each track. Makefiles in each of those subdirectories are responsible for copying the scripts to the appropriate runtime location, which will be a subdirectory of /hive/data/outside/otto/. That /hive/data/ subdirectory is also where all data files for the track will be stored (including for previous builds of the track).

After a pipeline has run, the updated data should be loaded into a track on hgwdev. From there, responsibility passes out of the hands of the otto-meister and into those of the system administrators. The admins have their own crontab, which is responsible for migrating those track data out to the beta and live servers.


Structure of an individual otto pipeline

Some variation exists among the existing otto pipelines, as they were developed by different people at different times. Most, however, are arranged according to the following guidelines, and any future otto pipelines should follow this structure as well.

There are four main scripts: XXXWrapper.sh, checkXXX.sh, buildXXX.sh, and validateXXX.sh. XXX in each name refers to the name of the track. For example, the OMIM pipeline consists of omimWrapper.sh, checkOmim.sh, buildOmim.sh (actually called buildOmimTracks.sh; note the previous comment about pipeline variation), and validateOmim.sh.

XXXWrapper.sh:

The wrapper script is a small shell script that just sets up emailing the otto report to whoever should receive it (generally either the current and previous otto-meister, or the current and upcoming otto-meister when a handoff is imminent) and then calls the checkXXX.sh script.

 #!/bin/sh -e
 
 PATH=/cluster/bin/x86_64:/cluster/bin/scripts:$PATH
 EMAIL="person1@soe.ucsc.edu,person2@ucsc.edu"
 WORKDIR="/hive/data/outside/otto/omim"
 
 cd $WORKDIR
 ./checkOmim.sh $WORKDIR 2>&1 | mail -s "OMIM Build" $EMAIL


checkXXX.sh:

The check script is responsible for overseeing the otto pipeline. It starts by fetching enough information to determine whether the track is in need of an update. If the source data has not changed, it can simply exit with a "No update" line for the email to the otto-meister. If the source data has changed, then the check script has several things to do.

First, the check script must create a directory (often named for the date of this latest build) to store the new build. It then fetches any required data files. If those files need to be unpacked, the check script might perform the unpacking or might leave that to the build script (the line gets a little blurry). The check script then invokes the build script using the new build directory as a working directory.
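As a rough sketch, the front half of a check script might look like the following. The URL, file names, and md5-based change detection here are hypothetical; each pipeline implements these steps in its own way:

 #!/bin/sh -e
 WORKDIR=$1
 cd $WORKDIR
 
 # Hypothetical source; real pipelines fetch whatever their data provider offers.
 wget -q -O latest.md5 https://example.com/data/source.txt.md5
 
 # Quit early if the source checksum matches the last successful build.
 if cmp -s latest.md5 prev.md5; then
     echo "No update"
     exit 0
 fi
 
 # Create a dated directory for this build and fetch the data into it.
 today=`date +%Y-%m-%d`
 mkdir -p $today
 wget -q -O $today/source.txt https://example.com/data/source.txt
 
 # Hand off to the build script with the new directory as working directory.
 ./buildExample.sh $today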

After the build is complete, the new track data will be stored in a set of "New" tables. For example, if the track data are found in the table "omim", then the build script should place data into an "omimNew" table. The check script then runs a validation script to ensure that the new build (in "omimNew") is not a dramatic departure from the previous version of the track (in "omim"). If validation fails, the check script aborts with an error message and does not proceed further. This leaves the current version of the track as the live version.

If the new build passes validation, then the last duties of the check script are to rename the current track tables to "Old" tables (e.g., "omim" becomes "omimOld") and to rename the new build tables into the canonical track tables (e.g., "omimNew" becomes "omim"). The check script then reports a successful build.

Example of installing new tables:

 # omim.tables lists the track's MySQL tables, one per line (see below)
 for table in `cat ../omim.tables`
 do
   new=$table"New"
   old=$table"Old"
   # rotate: current table -> Old, New -> current; drop the previous Old copy
   hgsqlSwapTables $db $new $table $old -dropTable3
 done

This process makes use of a special auxiliary file called XXX.tables. This file lists all of the MySQL tables used by the track, one per line. The file is used both by the validation script and by the final updates in the check script to know which table names are affected.
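For illustration, an XXX.tables file is nothing more than a list of names; a hypothetical omim.tables might read:

 omim
 omimGene2
 omimAvSnp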

Other auxiliary files are also sometimes involved in the otto process. For example, auxiliary files may contain a set of instructions for retrieving files from an FTP server. These auxiliary files can either be stored in the kent tree or regenerated by the check script itself for each run. Other auxiliary files may contain things like the username and password for accessing an FTP server. Those files are not stored in the kent tree. There are no backups for those files; please don't trash them. If you do trash them, we'll have to contact the data provider again for access.


buildXXX.sh:

The build script does whatever is needed to convert the data provider's formats into the formats we use in our track. Sometimes this is straightforward, like copying a VCF file into place. Sometimes this means parsing fields out of a tab-separated file, joining that with position data from a different track, and creating a BED file. When the script is complete, any data that belongs in a table should have been placed in a table by that name with "New" appended. For example, data destined for the "omim" table should be loaded into "omimNew" by the build script. The check script is responsible for moving data from "omimNew" into "omim" after it passes validation.
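As a sketch only, a simple BED-style build script might look like this. The input layout, database, and table name are made up for illustration:

 #!/bin/sh -e
 builddir=$1
 cd $builddir
 
 # Hypothetical conversion: pull chrom, start, end, and an id out of the
 # provider's tab-separated file and emit 4-column BED.
 awk -F'\t' '{print $2 "\t" $3 "\t" $4 "\t" $1}' source.txt > example.bed
 
 # Load into the "New" table; the check script swaps it live after validation.
 hgLoadBed hg38 exampleNew example.bed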


validateXXX.sh:

The validation script checks to make sure that a new track build looks similar to the previous build. If it does not look similar, it throws an error. This kills the build. The otto-meister is then responsible for manually reviewing the build to see whether there was a legitimate problem with the data or pipeline, or whether the validation script was simply being overzealous (hint: it's often the latter, but we don't want to relax the threshold and miss something; it doesn't cry wolf often enough to be a problem).

The validation scripts are almost identical among the otto pipelines. They simply compare the live and new builds to see how many lines they have in common and how many appear in only one or the other. If the number of newly created entries or newly deleted entries is greater than 10% of the number of entries in common, an error is thrown. The results of the comparison are left in a newXXX.stats file for examination by the otto-meister.
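In shell terms, the comparison boils down to something like the following sketch. The database, column list, and file names are illustrative, not lifted from a real validate script:

 #!/bin/sh -e
 db=hg38
 
 # Dump the live and new tables in a comparable, sorted form.
 hgsql $db -N -e 'SELECT chrom,chromStart,chromEnd,name FROM omim' | sort > old.txt
 hgsql $db -N -e 'SELECT chrom,chromStart,chromEnd,name FROM omimNew' | sort > new.txt
 
 common=`comm -12 old.txt new.txt | wc -l`
 added=`comm -13 old.txt new.txt | wc -l`
 deleted=`comm -23 old.txt new.txt | wc -l`
 echo "common: $common added: $added deleted: $deleted" > newOmim.stats
 
 # Throw an error if additions or deletions exceed 10% of the common entries.
 if [ $added -gt $((common / 10)) ] || [ $deleted -gt $((common / 10)) ]; then
     echo "omimNew differs too much from omim; manual review needed"
     exit 1
 fi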


Updating an otto pipeline

Sometimes it is necessary to make changes to an existing pipeline. This generally crops up because the build scripts encounter something unexpected. Maybe the data provider changed their file format, maybe they changed the encryption on a file, or maybe there is an error in the data file that they released (and which they should be notified about). All of those have happened in the past.

Step 1: Disable the admin's upload cron job. This is the first and most important step. You don't want any experiments you create while trying to fix the script to wind up on the live server and clobber the existing track. Do this by sending email to the admins, asking them to temporarily deactivate the automatic push of the relevant otto track data (presumably, the other otto tracks are fine and can continue to be pushed).

Step 2: Disable the otto-meister's cron job. Depending on how long the hood is off for the otto pipeline, it might be inconvenient for some half-baked scripts to try running in the middle of the night. Disable the otto-meister's cron entry for the pipeline to avoid unexpected problems.

Step 3: Start tinkering! The best place to tinker is in the /hive/data/outside/otto/* directory for that pipeline. After you've got it where you want it, then you can check those changes into the kent tree (or if things go haywire, then you can restore from that directory).

Step 4: Testing changes. When you want to test changes you've made to the pipeline, the easiest way is to add a temporary line to the otto-meister's crontab that runs the pipeline a couple of minutes from now, as in the example below. That way you almost immediately get a run started using the exact same environment that the regularly scheduled otto run will use.
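For example, if it is currently 14:32, a temporary entry like this (using the OMIM wrapper path from earlier) fires a test run at 14:35:

 # temporary test entry -- remove once debugging is done
 35 14 * * * /hive/data/outside/otto/omim/omimWrapper.sh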

Step 5: When you're ready to commit the changes you've made, check the updated scripts into the kent tree in kent/src/hg/utils/otto/*.

Finally, once the fixed scripts are committed and pushed, run make to copy them to the runtime location. Make idiosyncrasy: you have to specify a destination or it won't go.


Creating a new otto pipeline