Automation: Difference between revisions

From genomewiki
m (added genscan to wishlist)
(Started fleshing out a bit, still lots more to add and need to make a page per lib module and script.)
Revision as of 23:55, 22 August 2006

Why Automate?

You've seen one genome assembly, you've seen 'em all -- hardly! But there are some very predictable, repetitive things that developers need to do every time we build a genome annotation database on a new genome assembly. It is in our best interest to automate these steps when possible for these reasons:

  • it saves time
  • it reduces copy-paste and didn't-see-that-error-message errors
  • it helps to enforce naming conventions, which helps us use each other's data
  • it can produce detailed and accurate documentation of the data
  • it keeps our eyes from glazing over

Of course, nothing is for free. When something goes wrong in an automated process, we must work our way back from a usually cryptic error message through an additional level of code to the source of the problem. (Or if it's GenBank automation, bug MarkD. ;) But the hope is that developers will spend their time on more tasks that require critical thinking and fewer boring repetitive tasks.

The 5/30/06 genecats meeting was devoted to discussion and planning of build automation; Hiram transcribed the whiteboard notes from the meeting in High Throughput Genome Builds.

GenBank Update Automation Is a Whole 'Nother Story

This page focuses solely on automation of genome annotation database and download file builds, excluding MarkD's amazing GenBank update system. That system is much more complicated and better engineered than the simple scripts described here, which have borrowed or otherwise duplicated a few of its concepts. The source for the GenBank update system is in kent/src/hg/makeDb/genbank/ and MarkD has documented it at http://www.soe.ucsc.edu/~markd/genbank-update/ .

Automation Scripting Infrastructure

Perl (http://www.perl.org/) has been used for the libraries and scripts here because (let's be honest) Angie likes it. Some of the good reasons for that include Perl's built-in support for regexes, hashes, lists/arrays and (to a certain extent) objects. Also, since Perl is interpreted, development and testing/debugging can happen in an extremely tight loop.

Several Perl libraries (or "modules") contain shared code and variables and/or define objects that support automation of complex build processes in our compute environment. The infrastructure absolutely depends on smooth functioning of ssh -- you should set up a passkey to use with ssh-agent and ssh-add.

  • HgAutomate.pm
  • HgRemoteScript.pm
  • HgStepManager.pm

doTemplate.pl
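
Because everything here depends on ssh working smoothly, a quick pre-flight check before starting a long build may be worthwhile. The following is only a minimal sketch and is not part of the kent source tree; it is written in Python rather than Perl purely for illustration, and the function name ssh_agent_status and its return values are made up for this example. It assumes standard OpenSSH behavior: ssh-add finds the agent through the SSH_AUTH_SOCK environment variable, and "ssh-add -l" exits with 0 when keys are loaded, 1 when the agent holds no keys, and 2 when the agent cannot be contacted.

```python
import os
import subprocess


def ssh_agent_status(env=None, run=subprocess.run):
    """Report whether an ssh-agent is reachable and holds at least one key.

    Returns "ok", "no-keys", or "no-agent" (hypothetical labels for this
    sketch). "no-agent" also covers the case where ssh-add cannot be run.
    """
    env = os.environ if env is None else env
    # ssh-add locates the agent through the SSH_AUTH_SOCK environment variable.
    if not env.get("SSH_AUTH_SOCK"):
        return "no-agent"
    try:
        result = run(
            ["ssh-add", "-l"],
            env=dict(env),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh-add is missing or did not answer in time.
        return "no-agent"
    # Exit status: 0 = keys listed, 1 = agent has no keys, 2 = agent unreachable.
    if result.returncode == 0:
        return "ok"
    if result.returncode == 1:
        return "no-keys"
    return "no-agent"


if __name__ == "__main__":
    print(ssh_agent_status())
```

A "no-keys" result suggests running ssh-add; "no-agent" suggests that ssh-agent was never started in the current shell.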

Existing Automation Scripts

Existing automation scripts reside in kent/src/utils/. The names start with "make" or "do", loosely following this pattern: "make" scripts generate files or database tables that are not proper tracks (although makeGenomeDb does build a few simple tracks while setting up the database), while "do" scripts perform specific track build processes. "do" scripts are more likely to include cluster runs, but they don't have to.

  • makeGenomeDb.pl
  • doRepeatMasker.pl
  • makeDownloads.pl
  • doSameSpeciesLiftOver.pl
  • doBlastzChainNet.pl
  • doHgNearBlastp.pl
  • makePushQSql.pl
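
The make/do naming pattern described above can be written down as a small lookup. This is an illustrative sketch only, in Python rather than Perl; the function script_role and the table PREFIX_ROLES are hypothetical names, and the rule that an uppercase letter must follow the prefix is an assumption inferred from the script names listed here, not a documented convention.

```python
import re

# Role descriptions paraphrase the convention stated on this page.
PREFIX_ROLES = {
    "make": "generates files or database tables that are not proper tracks",
    "do": "performs a specific track build process",
}


def script_role(filename):
    """Return (prefix, role) for a script name, or None if it fits neither."""
    # Assumption: the prefix is followed by an uppercase letter, so that a
    # name such as "download.pl" is not misread as a "do" script.
    match = re.match(r"(make|do)[A-Z]\w*\.pl$", filename)
    if match is None:
        return None
    prefix = match.group(1)
    return prefix, PREFIX_ROLES[prefix]


if __name__ == "__main__":
    for name in ["makeGenomeDb.pl", "doRepeatMasker.pl", "makePushQSql.pl"]:
        print(name, "->", script_role(name))
```

Since the page itself calls the pattern loose, a check like this is better suited to flagging names for review than to enforcing anything.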

Automation Wish List

  • Repeat library generation (window masker?)
  • simpleRepeat (TRF)
  • masking of 2bit sequence with both RepeatMasker and TRF output, distribution to cluster-local storage
  • Brian's chained protein alignments
  • CpG islands
  • genscan
  • multiz
  • phastCons
  • meta-automation of all blastz's, multiz, phastCons?
  • meta-automation of all scripts that we always run?

Automation Troubleshooting

  • fileserver/machines out of sync
  • cluster job dies
  • cluster job hangs
  • ssh hangs