UCSC Genes Staging Process

From genomewiki
Revision as of 22:49, 16 July 2010 by Marygoldman (talk | contribs) (adding Category:Browser QA tracks)

The UCSC Gene set is created at UCSC for three vertebrate organisms: human, mouse, and rat. It is built once during the initial release of a new assembly, then updated sporadically after that. The process for QAing, staging, and releasing an update to the UCSC Gene track is complicated enough that it deserves to be documented.

Data Involved

  • Databases
    • Assembly Database (e.g. hg18) -- usually about 70 tables (more details on this below)
    • UniProt Database (e.g. sp080707)
    • Proteome Database (e.g. proteins080707)
    • One table (e.g. hgBlastTab) in each of 6 other assemblies (e.g. mm, dm, dr, rn, sc, ce)
  • Files
    • Index files to speed searching (e.g. /gbdb/hg18/knownGene.ix and /gbdb/hg18/knownGene.ixx)
    • Known Gene list for Google to index (e.g. /usr/local/apache/htdocs/knownGeneList/hg18/*)
    • hgdownload files (e.g. /goldenPath/proteinDB/proteins080707/database/README.txt)
  • Tracks
    • In addition to the new (or updated) UCSC Genes track, there will also be a new (or updated) "Previous Version of UCSC Genes" track. This track is supported by one table; something along the lines of: knownGeneOld2.
    • The Alt Events companion track (supported by the knownAlt table) also needs to be QAd and pushed in tandem with the UCSC Genes release.

Tables in the Assembly Database

There are many tables involved in the UCSC Gene set. For a complete list of all possible tables ever used to support any UCSC Genes set in any of the three organisms, see: /cluster/bin/scripts/kgTables. If there are tables supporting the UCSC Gene set in the assembly you are working with that are not on this list, please add them to the list. You might also consider checking the list of tables in the pushQ entry against the list of tables for the previous UCSC Gene set (sometimes developers forget to build all of the necessary tables).
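The comparison suggested above amounts to a set difference between two table lists. The following is a minimal sketch in Python, assuming both lists have already been collected as plain lists of table names. The function name compare_table_lists and the short sample lists are hypothetical illustrations, not part of any UCSC script.

```python
def compare_table_lists(previous, current):
    """Return (missing, new).

    missing: tables in the previous gene set but absent from the current list
    new:     tables in the current list that the previous gene set lacked
    """
    prev_set, cur_set = set(previous), set(current)
    missing = sorted(prev_set - cur_set)
    new = sorted(cur_set - prev_set)
    return missing, new


# Illustrative sample lists only (hypothetical, heavily shortened).
previous_tables = ["knownGene", "kgXref", "knownCanonical", "knownGeneOld2"]
pushq_tables = ["knownGene", "kgXref", "knownGeneOld3"]

missing, new = compare_table_lists(previous_tables, pushq_tables)
print("In previous set but not in pushQ entry:", missing)
print("In pushQ entry but not in previous set:", new)
```

A table reported as missing is not necessarily an error (for example, the "Old" table is expected to change names between releases), but each one is worth a question to the developer.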

For the Summer 2008 update to the UCSC Gene set on hg18, the tables are:

affyHumanExonGs, affyHumanExonGsMedian, affyHumanExonGsRatio, affyHumanExonGsRatioMedian, bioCycMapDesc, bioCycPathway, ccdsKgMap, ceBlastTab, cgapAlias, cgapBiocDesc, cgapBiocPathway, chromInfo, dmBlastTab, drBlastTab, foldUtr3, foldUtr5, gnfAtlas2Distance, gnfU95Distance, humanHprdP2P, humanVidalP2P, humanWankerP2P, keggMapDesc, keggPathway, kg3ToKg4, kgAlias, kgColor, kgProtAlias, kgProtMap2, kgSpAlias, kgTxInfo, kgXref, knownAlt, knownBlastTab, knownCanonical, knownGene, knownGeneMrna, knownGeneOld3, knownGenePep, knownIsoforms, knownToAllenBrain, knownToCdsSnp, knownToEnsembl, knownToGnf1h, knownToGnfAtlas2, knownToHInv, knownToHprd, knownToLocusLink, knownToPfam, knownToRefSeq, knownToSuper, knownToU133, knownToU133Plus2, knownToU95, knownToVisiGene, mmBlastTab, pbAnomLimit, pbResAvgStd, pbStamp, pepCCntDist, pepExonCntDist, pepHydroDist, pepIPCntDist, pepMolWtDist, pepMwAa, pepPi, pepPiDist, pepResDist, pfamDesc, rnBlastTab, scBlastTab, scopDesc, spMrna.


The QAing UCSC Genes page has additional information about:

  • the tables in UCSC Genes
  • the tables related to UCSC Genes
  • which tables populate which sections of the hgGene page
  • which tables populate which Gene Sorter columns

Details About UniProt and Proteome Databases

Each UCSC Gene set is related to one UniProt database and one Proteome Database. Each of these databases can support more than one UCSC Gene set (e.g. a single UniProt database might support the UCSC Genes on both hg18 and mm9).

These databases are given a name based on the date they were created. All UniProt databases are named using the following convention: spYYMMDD (e.g. sp080707). All Proteome databases are named using the following convention: proteinsYYMMDD (e.g. proteins080707).
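As an illustration of the naming convention, the short Python sketch below derives both database names from a creation date. The function name protein_db_names is hypothetical; only the spYYMMDD and proteinsYYMMDD patterns come from this page.

```python
from datetime import date


def protein_db_names(build_date):
    """Return (uniProt db name, proteome db name) for a given creation date."""
    stamp = build_date.strftime("%y%m%d")  # two-digit year, month, day
    return "sp" + stamp, "proteins" + stamp


uniprot_db, proteome_db = protein_db_names(date(2008, 7, 7))
print(uniprot_db, proteome_db)  # sp080707 proteins080707
```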

To make this transparent to users, symbolic links are used: users see "uniProt" (but are actually using spYYMMDD), and likewise for proteome. Once you push these two databases to hgwbeta, ask the cluster-admin to update the uniProt and proteome symbolic links in the /var/lib/mysql directory on mysqlbeta to point to the newly pushed databases. Do the same for the push from hgwbeta to the public website (request that they edit the same directory on mysqlrr).

Additionally, as you set up the new databases on hgwbeta (then on the public website) you will need to edit hgcentralbeta.gdbPdb (then hgcentral.gdbPdb) to point to the correct databases:

mysql> select * from hgcentraltest.gdbPdb where genomeDb = 'hg18'\G

genomeDb: hg18
proteomeDb: proteins080707

Staging on hgwbeta

  • Databases and Tables

As usual, the new databases and tables will be built on hgwdev. After QAing on hgwdev, the whole set should be staged on hgwbeta.

Create new databases on hgwbeta for the new uniProt and proteome databases. Push all tables from dev to beta into these two new databases. Update the gdbPdb table and ask the cluster-admin to update the symlinks (see above for details).

Push all of the necessary supporting tables (usually about 70 tables) from the assembly database from dev to beta.

Also push the xxBlastTab table from the other assembly databases. For example, if this is a human UCSC Gene set, the table will be named hgBlastTab and will exist in the most recent assembly of the following organisms: Mouse, Rat, Zebrafish, D. melanogaster, C. elegans, S. cerevisiae.
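The xxBlastTab pattern can be sketched as follows. This is a small Python illustration, assuming the two-letter organism prefixes used elsewhere on this page; the function name blast_tab_pushes is hypothetical, and the prefixes stand in for whichever assembly database is the most recent for each organism (which this sketch does not look up).

```python
# Organism prefixes mentioned on this page: human, mouse, rat,
# zebrafish, D. melanogaster, C. elegans, S. cerevisiae.
ORGANISM_PREFIXES = ["hg", "mm", "rn", "dr", "dm", "ce", "sc"]


def blast_tab_pushes(gene_set_prefix):
    """List (other organism prefix, table name) pairs to push.

    For a human gene set ("hg") the table is hgBlastTab, and it lives in
    the most recent assembly of each of the other organisms.
    """
    table = gene_set_prefix + "BlastTab"
    return [(p, table) for p in ORGANISM_PREFIXES if p != gene_set_prefix]


pushes = blast_tab_pushes("hg")
for organism, table in pushes:
    print(organism, table)
```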

  • Searching

Searching for UCSC IDs is supported by these files:

/gbdb/hg18/knownGene.ix
/gbdb/hg18/knownGene.ixx

However, if this is a UCSC Gene update, and you push those files from hgwdev to hgnfs1 at this point, the searching on the public website for the current UCSC Genes will break (because it will be looking for the new IDs). So, you will have to put up with broken searching on hgwbeta until you are ready to make the final push to the public website.

Releasing to the public website

  • uniProt and Proteome databases
    • push them to the RR machines
    • ask for dump/autodump to download server
    • ask for them to be made available to the public mysql server
    • when UCSC Gene tables are in place, ask for the symlinks to be updated here: /var/lib/mysql/
  • pushing the tables from the main assembly database

Here's a trick that causes a minimum of interruption to the users of the public website. Copy the tables from hgwbeta into a temporary database on the RR servers. When it's time for the switch, just do a unix mv into the real database. When we did this for hg18 UCSC Gene update in September of 2008, there was only a 70-second interruption. (Don't forget these tables: trackDb, hgFindSpec, tableDescriptions).
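The idea behind the trick can be sketched in Python. This is only an illustration, assuming each database is a directory holding one or more files per table (as under /var/lib/mysql) and that both directories are on the same filesystem, so each move is a quick rename. The function name swap_in_tables, the directory names, and the file names are hypothetical; the actual procedure used on the RR servers is not described here.

```python
import os
import tempfile


def swap_in_tables(temp_db_dir, live_db_dir):
    """Move every file from temp_db_dir into live_db_dir.

    os.replace overwrites an existing file of the same name, so old
    table files are replaced by the staged copies one rename at a time.
    """
    moved = []
    for name in sorted(os.listdir(temp_db_dir)):
        os.replace(os.path.join(temp_db_dir, name),
                   os.path.join(live_db_dir, name))
        moved.append(name)
    return moved


# Demonstration in a scratch directory (no real database is touched).
with tempfile.TemporaryDirectory() as root:
    temp_db = os.path.join(root, "hg18Temp")  # hypothetical staging database
    live_db = os.path.join(root, "hg18")
    os.mkdir(temp_db)
    os.mkdir(live_db)
    for fname in ["knownGene.frm", "knownGene.MYD", "knownGene.MYI"]:
        with open(os.path.join(temp_db, fname), "w") as handle:
            handle.write("staged")
    moved = swap_in_tables(temp_db, live_db)
    live_files = sorted(os.listdir(live_db))
    leftover = os.listdir(temp_db)

print("moved:", moved)
```

Staging everything first means the slow copy from hgwbeta happens while the old tables are still serving users; only the renames happen during the switch.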

  • Update the hgcentral database:
    • Add a line to hgcentral.gdbPdb to point to the correct proteins database.
    • Update hgcentral.dbDb.hgNearOk (to 1) to enable the Gene Sorter.
  • Searching

Push these files from hgwdev to hgnfs1:

/gbdb/hg18/knownGene.ix
/gbdb/hg18/knownGene.ixx
  • Gene Lists for Google

We attempt to get Google to index our list of UCSC Genes. To that end, this is the set of files that needs to be made available:

/usr/local/apache/htdocs/knownGeneList/<db>/*

Push those directories and files from hgwdev to the RR machines. The lists show up in the browser here: http://genome.ucsc.edu/knownGeneList/hg18/

And is linked to from here: http://genome.ucsc.edu/knownGeneLists.html

The above page will need to be edited if this is a new UCSC Gene set (not an update).

  • Blast Tab tables

Push xxBlastTab tables to your organism from the other 6 assemblies.

  • Download page

Edit the downloads.html page to add a link to the newly-pushed proteins database and push the file to hgdownload.

  • Exon Primer Links

Send an email to Tim Strom (http://ihg2.helmholtz-muenchen.de/) so that he knows we are releasing a new UCSC Gene set. He will need to prepare his website for the new UCSC IDs.

  • HGNC Links to us

Contact Michael Lush: hgnc at genenames dot org to let him know about the update. He will need to rebuild his ucsc2hgnc mappings table.

  • Galaxy (may want to update and/or pre-load these data)

Contact Anton at Galaxy to let him know about the update.

  • UniProtKB

Create a special file for the folks at UniProt. See this file for a sample of what they are looking for: hgwdev:/usr/local/apache/htdocs/goldenPath/hg18/UCSCGenes/uniProtToUcscGenes.txt

You can use this script to create the file: makeUniProtFile.csh

After creating this file, push it to the hgdownload machine. Then send an email to Elisabeth Gasteiger (Elisabeth -dot- Gasteiger -at- isb -dash- sib -dot- ch) letting her know we are releasing a new UCSC Gene set. She will download the file and change their links to our web site.

  • Exoniphy

Exoniphy is a companion track (in the Genes group). It is created by Adam Siepel's group at Cornell. Contact Adam to ask him to create this track for this UCSC Genes release.

Post-Release

We typically like to announce the release of a new UCSC Gene set. Ask Donna to prepare an announcement for the website and genome-announce. Also see this page: Post-Release-Checklist.