UCSC Genes Staging Process

From Genecats
Revision as of 00:36, 1 November 2011 by Rhead (talk | contribs) (→‎Tables in the Assembly Database: took out reference to table lists in source tree)

The UCSC Gene set is created at UCSC for three vertebrate organisms: human, mouse, and rat. It is built once during the initial release of a new assembly, then updated sporadically after that. The process for QAing, staging, and releasing an update to the UCSC Gene track is complicated enough that it deserves to be documented.

Data Involved

  • Databases
    • Assembly Database (e.g. hg18) -- usually about 70 tables (more details on this below)
    • UniProt Database (e.g. sp080707)
    • Proteome Database (e.g. proteins080707)
    • One table (e.g. hgBlastTab) in each of 6 other assemblies (e.g. mm, rn, dr, dm, ce, sc)
  • Files
    • Index files to speed searching (e.g. /gbdb/hg18/knownGene.ix and /gbdb/hg18/knownGene.ixx)
    • hgdownload READMEs for protein and uniProt database dump dirs (e.g. /goldenPath/proteinDB/proteins080707/database/README.txt)
    • hgdownload files for hgPal (e.g. in /usr/local/apache/htdocs-hgdownload/goldenPath/hg19/multiz46way/alignments/, knownCanonical.exonAA.fa.gz, knownCanonical.exonNuc.fa.gz, knownGene.exonAA.fa.gz, knownGene.exonNuc.fa.gz)
      • (we DO NOT make/push anymore: Known Gene list for Google to index (e.g. /usr/local/apache/htdocs/knownGeneList/hg18/*) )
  • Tracks
    • In addition to the new (or updated) UCSC Genes track, there will also be a new (or updated) "Previous Version of UCSC Genes" track. This track is supported by 3 tables, named along the lines of knownGeneOld5, kgXrefOld5, and kg5ToKg6.
    • The Alt Events companion track (supported by the knownAlt table) also needs to be QAd and pushed in tandem with the UCSC Genes release.

Tables in the Assembly Database

There are many tables involved in the UCSC Gene set. If there are tables supporting the UCSC Gene set in the assembly you are working with that are not on this list, please add them to the list. You might also consider checking the list of tables in the pushQ entry against the list of tables for the previous UCSC Gene set (sometimes developers forget to build all of the necessary tables).
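
The comparison against the previous table list can be done mechanically. The following is a minimal sketch, assuming both lists are already available as plain lists of table names; the function name and the sample lists are hypothetical, not part of any UCSC tool.

```python
def compare_table_lists(previous, current):
    """Return (missing, added): tables only in the previous list, and only in the current one."""
    prev, cur = set(previous), set(current)
    return sorted(prev - cur), sorted(cur - prev)


# Hypothetical, shortened lists for illustration only.
previous_build = ["knownGene", "kgXref", "knownCanonical", "knownIsoforms"]
pushq_entry = ["knownGene", "kgXref", "knownCanonical"]

missing, added = compare_table_lists(previous_build, pushq_entry)
print("in previous set but not in pushQ entry:", missing)
print("in pushQ entry but not in previous set:", added)
```

Version-numbered tables (for example knownGeneOld3 and kg3ToKg4) will always show up as differences, because their names change with each build; those need to be checked by eye.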

For the Summer 2008 update to the UCSC Gene set on hg18, the tables are:

affyHumanExonGs, affyHumanExonGsMedian, affyHumanExonGsRatio, affyHumanExonGsRatioMedian, bioCycMapDesc, bioCycPathway, ccdsKgMap, ceBlastTab, cgapAlias, cgapBiocDesc, cgapBiocPathway, chromInfo, dmBlastTab, drBlastTab, foldUtr3, foldUtr5, gnfAtlas2Distance, gnfU95Distance, humanHprdP2P, humanVidalP2P, humanWankerP2P, keggMapDesc, keggPathway, kg3ToKg4, kgAlias, kgColor, kgProtAlias, kgProtMap2, kgSpAlias, kgTxInfo, kgXref, knownAlt, knownBlastTab, knownCanonical, knownGene, knownGeneMrna, knownGeneOld3, knownGenePep, knownIsoforms, knownToAllenBrain, knownToCdsSnp, knownToEnsembl, knownToGnf1h, knownToGnfAtlas2, knownToHInv, knownToHprd, knownToLocusLink, knownToPfam, knownToRefSeq, knownToSuper, knownToU133, knownToU133Plus2, knownToU95, knownToVisiGene, mmBlastTab, pbAnomLimit, pbResAvgStd, pbStamp, pepCCntDist, pepExonCntDist, pepHydroDist, pepIPCntDist, pepMolWtDist, pepMwAa, pepPi, pepPiDist, pepResDist, pfamDesc, rnBlastTab, scBlastTab, scopDesc, spMrna.


The QAing UCSC Genes page has additional information about:

  • the tables in UCSC Genes
  • the tables related to UCSC Genes
  • which tables populate which sections of the hgGene page
  • which tables populate which Gene Sorter columns

Details About UniProt and Proteome Databases

Each UCSC Gene set is related to one UniProt database and one Proteome Database. Each of these databases can support more than one UCSC Gene set (e.g. a single UniProt database might support the UCSC Genes on both hg18 and mm9).

These databases are given a name based on the date they were created. All UniProt databases are named using the following convention: spYYMMDD (e.g. sp080707). All Proteome databases are named using the following convention: proteinsYYMMDD (e.g. proteins080707).
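
The naming convention is simple enough to check in a script. Below is a rough sketch, assuming the six digits are always a two-digit year, month, and day; the function names are made up for this example.

```python
import re
from datetime import datetime

# spYYMMDD or proteinsYYMMDD, and nothing else.
DATED_DB = re.compile(r"(sp|proteins)(\d{6})")


def parse_dated_db(name):
    """Return (prefix, date) for a dated database name, or None if it does not fit the convention."""
    m = DATED_DB.fullmatch(name)
    if m is None:
        return None
    try:
        built = datetime.strptime(m.group(2), "%y%m%d").date()
    except ValueError:
        return None  # six digits, but not a real calendar date
    return m.group(1), built


def latest_db(names, prefix):
    """Return the most recently dated database name with the given prefix, or None."""
    best = None
    for name in names:
        parsed = parse_dated_db(name)
        if parsed is None or parsed[0] != prefix:
            continue
        if best is None or parsed[1] > best[0]:
            best = (parsed[1], name)
    return None if best is None else best[1]


print(parse_dated_db("sp080707"))
print(latest_db(["sp080707", "sp101005", "proteins101005"], "sp"))
```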

Some parts of the code look for a generic "uniProt" database, while other parts look for a dated "spYYMMDD" database (see Fan's explanation below). The code that uses "uniProt" expects it to be the latest "spYYMMDD" database, which is accomplished with a symbolic link called "uniProt" that points to the latest dated database. The same setup is used for "proteome" and the "proteinsYYMMDD" database. The "uniProt" database is alternately called "swissProt", and "proteome" is alternately called "proteins". There should be four symbolic links in total, like this:

[rhead@hgwdev mysql]$ pwd
/var/lib/mysql
[rhead@hgwdev mysql]$ ls -l *rot*
lrwxrwxrwx 1 root  root    14 Sep 23 15:50 proteins -> proteins101005
lrwxrwxrwx 1 root  root    29 Sep 26 20:54 proteome -> /var/lib/mysql/proteins101005
lrwxrwxrwx 1 root  root     8 Sep 23 15:50 swissProt -> sp101005
lrwxrwxrwx 1 root  root    23 Sep 26 20:54 uniProt -> /var/lib/mysql/sp101005

Once you push the two dated databases to hgwbeta, ask cluster-admin to update the symbolic links in the /var/lib/mysql directory on mysqlbeta for uniProt and proteome to point to the newly-pushed databases. Likewise for the push from hgwbeta to the RR.

The symlinks also make the dated databases transparent to users; in the table browser you will see "uniProt" (but are actually using spYYMMDD). In MySQL, typing "use uniProt" or starting MySQL with "hgsql proteome" will get you to whichever dated database the symlink is pointing to.
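
After the admins update the links, it can be worth confirming that all four point at the same dated pair. This is only a sketch: the link names come from the directory listing above, while the function itself and the idea of running it against a copy of the directory path are assumptions of this example.

```python
import os

# Link name -> prefix of the dated database it should point to.
EXPECTED_PREFIX = {
    "proteins": "proteins",
    "proteome": "proteins",
    "swissProt": "sp",
    "uniProt": "sp",
}


def check_symlinks(mysql_dir, date_digits):
    """Return {link name: problem} for links that do not point at the expected dated database."""
    problems = {}
    for link, prefix in EXPECTED_PREFIX.items():
        path = os.path.join(mysql_dir, link)
        wanted = prefix + date_digits
        if not os.path.islink(path):
            problems[link] = "not a symlink"
            continue
        # Links may be relative (sp101005) or absolute (/var/lib/mysql/sp101005).
        target = os.path.basename(os.readlink(path).rstrip("/"))
        if target != wanted:
            problems[link] = f"points to {target}, expected {wanted}"
    return problems
```

Calling check_symlinks("/var/lib/mysql", "101005") on a machine set up like the listing above should return an empty dict; any entry in the result names a link that still needs attention.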

Additionally, as you set up the new databases on hgwbeta (then on the public website) you will need to edit hgcentralbeta.gdbPdb (then hgcentral.gdbPdb) to point to the correct databases:

mysql> select * from hgcentraltest.gdbPdb where genomeDb = 'hg18'\G

genomeDb: hg18
proteomeDb: proteins080707

An email from Fan on 9/27/11 explains more:

This "unwise" set up was due to historical reasons.  When I developed UCSC
Known Genes and Proteome Browser, I designed them that for each genome, they
consist a set of tables built from a snap shot of genomic and proteomic DBs
at the time of build.  It is a static image of the world at that moment.
The advantage is that it is logically consistent.  The gdbPdb table in
hgcentral DB keeps the pairing info for genomic and proteomic DB for each KG
build. 

When Jim later developed hgGene (details page for KG), his designed it in a
way that the hgGene page will bring the latest data (from latest protein
DBs, UniProt and our own proteinsXXXXXX) to present data to users.  It does
not guarantee all the links and data items always available (since the
underlying DBs may change), but it has the advantage of the most recent data
get presented.

The old KG display and Proteome Browser code goes after the data they need
with spXXXXXX and proteinsXXXXXX DB names.  The hgGene code uses uniProt and
proteome as DB names for the latest protein data.

Fan.

Fan also mentioned that the sp* and proteins* databases are always created in pairs, with matching date digits, which is why only one of them needs to be specified in the hgcentral.gdbPdb table. There is another set of dated databases for go*, along with a symlink to a single go* database, but these have no correspondence to the dated proteins* and sp* databases.
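
Because the two databases share their date digits, the UniProt database name can be derived from the proteomeDb value stored in gdbPdb. A small sketch of that derivation follows; the function name is hypothetical.

```python
import re


def paired_sp_db(proteome_db):
    """Given a gdbPdb proteomeDb value such as 'proteins080707', return the paired 'sp080707'."""
    m = re.fullmatch(r"proteins(\d{6})", proteome_db)
    if m is None:
        raise ValueError(f"not a dated proteins database: {proteome_db!r}")
    return "sp" + m.group(1)


print(paired_sp_db("proteins080707"))
```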

Staging on hgwbeta

  • Databases and Tables

As usual, the new databases and tables will be built on hgwdev. After QAing on hgwdev, the whole set should be staged on hgwbeta.

Create new databases on hgwbeta for the new uniProt and proteome databases. Push all tables from dev to beta into these two new databases. Update the gdbPdb table and ask the cluster-admin to update the symlinks (see above for details).

Push all of the necessary supporting tables (usually about 70 tables) from the assembly database from dev to beta.

Also push the xxBlastTab table from the other assembly databases. For example, if this is a human UCSC Gene set, the table will be named hgBlastTab and will exist in the most recent assembly of the following organisms: Mouse, Rat, Zebrafish, D. melanogaster, C. elegans, S. cerevisiae.
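
To keep the cross-assembly pushes straight, the table name and the list of (database, table) pairs can be generated rather than typed. This sketch assumes the xx prefix is the assembly database name with its trailing digits removed, which holds for hg18 (hgBlastTab), mm9, and the other examples on this page but is not guaranteed in general. The target assembly names are supplied by the caller, since the most recent assembly for each organism changes over time.

```python
import re


def blast_tab_table(assembly_db):
    """Derive the xxBlastTab table name from an assembly database name (hg18 -> hgBlastTab)."""
    prefix = re.sub(r"\d+$", "", assembly_db)
    if not prefix or prefix == assembly_db:
        raise ValueError(f"expected a name like hg18, got {assembly_db!r}")
    return prefix + "BlastTab"


def blast_tab_pushes(gene_set_db, other_assembly_dbs):
    """Return (database, table) pairs to push for the given UCSC Gene set."""
    table = blast_tab_table(gene_set_db)
    return [(db, table) for db in other_assembly_dbs]


# Only mm9 is named here; the remaining target databases are left to the caller.
print(blast_tab_pushes("hg18", ["mm9"]))
```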

  • Searching

Searching for UCSC IDs is supported by these files:

/gbdb/hg18/knownGene.ix
/gbdb/hg18/knownGene.ixx

However, if this is a UCSC Gene update and you push those files from hgwdev to hgnfs1 at this point, searching for the current UCSC Genes on the public website will break (because it will be looking for the new IDs). So you will have to put up with broken searching on hgwbeta until you are ready to make the final push to the public website.

Releasing to the public website

  • uniProt and Proteome databases
    • push them to the RR machines
    • ask for dump/autodump to download server
    • ask for them to be made available to the public mysql server
    • when UCSC Gene tables are in place, ask for the symlinks to be updated here: /var/lib/mysql/
  • pushing the tables from the main assembly database

Here's a trick that causes a minimum of interruption to users of the public website. Copy the tables from hgwbeta into a temporary database on the RR servers. When it's time for the switch, just do a unix mv into the real database. When we did this for the hg18 UCSC Gene update in September 2008, there was only a 70-second interruption. Don't forget these tables: trackDb, hgFindSpec, tableDescriptions.
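
The idea behind the trick can be illustrated with ordinary directories. The sketch below assumes each table is stored as a set of files inside a per-database directory (as MyISAM does under /var/lib/mysql) and that the temporary and real database directories sit on the same filesystem, so each move is a fast rename. It is a demonstration of the staging pattern only, not the procedure the admins run on the RR servers; the function name is hypothetical.

```python
import os


def swap_in_tables(staging_dir, live_dir):
    """Move every file from staging_dir into live_dir, replacing same-named files."""
    moved = []
    for name in sorted(os.listdir(staging_dir)):
        src = os.path.join(staging_dir, name)
        if os.path.isfile(src):
            # os.replace is a rename: it overwrites the old file and
            # fails if the two directories are on different filesystems.
            os.replace(src, os.path.join(live_dir, name))
            moved.append(name)
    return moved
```

The slow part (copying from hgwbeta) happens ahead of time in the temporary database, so the switch itself only has to rename files that are already local.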

  • Update the hgcentral database:
    • Add a line to hgcentral.gdbPdb to point to the correct proteins database.
    • Update hgcentral.dbDb.hgNearOk (to 1) to enable the Gene Sorter.
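
The two hgcentral changes can be sketched as follows. To keep the example runnable anywhere, it uses Python's sqlite3 module with an in-memory stand-in for hgcentral rather than the real MySQL server. The gdbPdb columns are the ones shown earlier on this page; the assumption that dbDb identifies the assembly in a column called name, and the function name, belong to this example.

```python
import sqlite3


def apply_release_updates(conn, genome_db, proteome_db):
    """Point gdbPdb at the new proteins database and enable the Gene Sorter."""
    cur = conn.execute(
        "UPDATE gdbPdb SET proteomeDb = ? WHERE genomeDb = ?",
        (proteome_db, genome_db),
    )
    if cur.rowcount == 0:
        # No row yet for this assembly: add one.
        conn.execute(
            "INSERT INTO gdbPdb (genomeDb, proteomeDb) VALUES (?, ?)",
            (genome_db, proteome_db),
        )
    conn.execute("UPDATE dbDb SET hgNearOk = 1 WHERE name = ?", (genome_db,))
    conn.commit()


# Stand-in schema holding only the columns used above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gdbPdb (genomeDb TEXT, proteomeDb TEXT)")
conn.execute("CREATE TABLE dbDb (name TEXT, hgNearOk INTEGER)")
conn.execute("INSERT INTO dbDb VALUES ('hg18', 0)")

apply_release_updates(conn, "hg18", "proteins080707")
print(conn.execute("SELECT genomeDb, proteomeDb FROM gdbPdb").fetchall())
print(conn.execute("SELECT name, hgNearOk FROM dbDb").fetchall())
```
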
  • Searching

Push these files from hgwdev to hgnfs1:

/gbdb/hg18/knownGene.ix
/gbdb/hg18/knownGene.ixx

  • Blast Tab tables

Push the xxBlastTab table for your organism (e.g. hgBlastTab) in each of the other 6 assemblies.

  • Download page

Edit the downloads.html page to add a link to the newly-pushed proteins database and push the file to hgdownload.

Post-Release

We typically like to announce the release of a new UCSC Gene set. Ask Donna to prepare an announcement for the website and genome-announce. Also see this page: Post-Release-Checklist.