UCSC Genes Staging Process



Data Involved

Databases

  • Assembly Database (e.g. hg18) -- usually about 70 tables
  • UniProt Database (e.g. sp080707)
  • Proteome Database (e.g. proteins080707)
  • One table (e.g. hgBlastTab) in each of 6 other assemblies (e.g. mm, rn, dr, dm, ce, sc)

Files

  • Index files to speed searching (e.g. /gbdb/hg18/knownGene.ix and /gbdb/hg18/knownGene.ixx)
  • hgdownload READMEs for protein and uniProt database dump dirs (e.g. /goldenPath/proteinDB/proteins080707/database/README.txt)
  • hgdownload files for hgPal (e.g. in /usr/local/apache/htdocs-hgdownload/goldenPath/hg19/multiz46way/alignments/, knownCanonical.exonAA.fa.gz, knownCanonical.exonNuc.fa.gz, knownGene.exonAA.fa.gz, knownGene.exonNuc.fa.gz)
    • (No longer made or pushed: the Known Gene list for Google to index, e.g. /usr/local/apache/htdocs/knownGeneList/hg18/*)

Tracks

  • In addition to the new (or updated) UCSC Genes track, there will also be a new (or updated) Old UCSC Genes track. This track is supported by three tables, named along the lines of knownGeneOld5, kgXrefOld5, and kg5ToKg6.
  • The Alt Events companion track (supported by the knownAlt table) also needs to be QAed and pushed in tandem with the UCSC Genes release. (This is only on human and mouse.)

Details About UniProt and Proteome Databases

Each UCSC Gene set is related to one UniProt database and one Proteome Database. Each of these databases can support more than one UCSC Gene set (e.g. a single UniProt database might support the UCSC Genes on both hg18 and mm9).

These databases are given a name based on the date they were created. All UniProt databases are named using the following convention: spYYMMDD (e.g. sp080707). All Proteome databases are named using the following convention: proteinsYYMMDD (e.g. proteins080707).
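
As a minimal sketch of this naming convention, the shell function below derives both database names from a single date stamp. The function name protein_db_names is hypothetical and is not part of the kent tools; it only illustrates that the two names share the same YYMMDD digits.

```shell
# Hypothetical helper: print the paired database names for a YYMMDD stamp.
protein_db_names() {
    stamp=$1
    # Accept exactly six digits (YYMMDD); reject anything else.
    case $stamp in
        [0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *)
            echo "expected a YYMMDD date stamp, got: $stamp" >&2
            return 1
            ;;
    esac
    echo "sp${stamp} proteins${stamp}"
}

protein_db_names 080707   # prints: sp080707 proteins080707
```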

Some parts of the code look for a generic "uniProt" database, while other parts look for a dated "spYYMMDD" database (see Fan's explanation below). The code that uses "uniProt" expects whatever the latest "spYYMMDD" is, which is accomplished by a symbolic link called "uniProt" that points to the latest dated database. The same setup is used for "proteome" and the "proteinsYYMMDD" database. The "uniProt" database is also called "swissProt", and "proteome" is also called "proteins". There should be four symbolic links in total, like this:

[rhead@hgwdev mysql]$ pwd
/var/lib/mysql
[rhead@hgwdev mysql]$ ls -l *rot*
lrwxrwxrwx 1 root  root    14 Sep 23 15:50 proteins -> proteins101005
lrwxrwxrwx 1 root  root    29 Sep 26 20:54 proteome -> /var/lib/mysql/proteins101005
lrwxrwxrwx 1 root  root     8 Sep 23 15:50 swissProt -> sp101005
lrwxrwxrwx 1 root  root    23 Sep 26 20:54 uniProt -> /var/lib/mysql/sp101005

Once you push the two dated databases to hgwbeta, ask cluster-admin to update the symbolic links in the /var/lib/mysql directory on hgwbeta for uniProt and proteome to point to the newly-pushed databases. Likewise for the push from hgwbeta to the RR.

The symlinks also make the dated databases transparent to users: in the Table Browser you will see "uniProt", but you are actually using spYYMMDD. In MySQL, typing "use uniProt" or starting MySQL with "hgsql proteome" will get you to whichever dated database the symlink points to.
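
For illustration only, the four links could be repointed with something like the sketch below. On the real servers this step is done by cluster-admin, so the function name update_protein_links and its arguments are hypothetical. The sketch uses relative link targets for all four names, whereas the listing above shows a mix of relative and absolute targets.

```shell
# Hypothetical helper: point the four generic names at one dated pair.
# $1 = MySQL data directory, $2 = YYMMDD stamp of the new databases.
update_protein_links() {
    datadir=$1
    stamp=$2
    sp="sp${stamp}"
    prot="proteins${stamp}"
    # Refuse to create dangling links if either database is missing.
    if [ ! -d "$datadir/$sp" ] || [ ! -d "$datadir/$prot" ]; then
        echo "missing $sp or $prot in $datadir" >&2
        return 1
    fi
    # -n replaces an existing link instead of descending into its target.
    ln -sfn "$sp"   "$datadir/uniProt"
    ln -sfn "$sp"   "$datadir/swissProt"
    ln -sfn "$prot" "$datadir/proteome"
    ln -sfn "$prot" "$datadir/proteins"
}
```

It is safest to try this on a scratch directory first, for example one that contains empty sp101005 and proteins101005 subdirectories, and then to inspect the result with ls -l.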


An email from Fan on 9/27/11 explains more:

This "unwise" set up was due to historical reasons.  When I developed UCSC
Known Genes and Proteome Browser, I designed them that for each genome, they
consist a set of tables built from a snap shot of genomic and proteomic DBs
at the time of build.  It is a static image of the world at that moment.
The advantage is that it is logically consistent.  The gdbPdb table in
hgcentral DB keeps the pairing info for genomic and proteomic DB for each KG
build. 

When Jim later developed hgGene (details page for KG), his designed it in a
way that the hgGene page will bring the latest data (from latest protein
DBs, UniProt and our own proteinsXXXXXX) to present data to users.  It does
not guarantee all the links and data items always available (since the
underlying DBs may change), but it has the advantage of the most recent data
get presented.

The old KG display and Proteome Browser code goes after the data they need
with spXXXXXX and proteinsXXXXXX DB names.  The hgGene code uses uniProt and
proteome as DB names for the latest protein data.

Fan.

Fan also mentioned that the sp* and proteins* databases are always created in pairs, with matching date digits, which is why only one of them needs to be specified in the hgcentral.gdbPdb table. There is another set of dated tables for go*, along with a symlink to a single go* table, but these have no correspondence to the dated proteins* and sp* tables.

Staging on hgwbeta

Databases and Tables

As usual, the new databases and tables will be built on hgwdev. After QAing on hgwdev, the whole set should be staged on hgwbeta.

Create new databases on hgwbeta for the new uniProt and proteome databases. Push all tables from dev to beta into these two new databases. Ask cluster-admin to update the symlinks (see above for details).

Push all of the necessary supporting tables (usually about 70 tables) from the assembly database from dev to beta.

Also push the xxBlastTab table from the other assembly databases. For example, if this is a human UCSC Gene set, the table will be named hgBlastTab and will exist in the most recent assembly of the following organisms: Mouse, Rat, D. melanogaster, C. elegans, S. cerevisiae.

If this is a new gene track for this assembly, set hgNearOk=1 in hgcentralbeta.dbDb.

Files for Searching

Searching for UCSC IDs is supported by these files:

/gbdb/<db>/knownGene.ix
/gbdb/<db>/knownGene.ixx

However, if this is a UCSC Gene update, and you push those files from hgwdev to hgnfs1 at this point, the searching on the public website for the current UCSC Genes will break (because it will be looking for the new IDs). So, you will have to put up with broken searching on hgwbeta until you are ready to make the final push to the public website.

PCR Target

  • Have some test primers handy that only work on the new version of UCSC Genes, so you can test that it is working.
  • Push /gbdb/<db>/targetDb/kgTargetSeq.2bit from hgwdev to hgnfs1. (NOTE: There is a redmine ticket for versioning the kgTargetSeq.2bit files: http://redmine.soe.ucsc.edu/issues/6829).
  • Update the hgcentralbeta targetDb and blatServers tables. For example:
hgsql -Ne "select * from targetDb where name='hg19Kg'" hgcentraltest > targetDb.dev

then, from within SQL:

delete from targetDb where name="hg19Kg";
load data local infile "targetDb.dev" into table targetDb;

update blatServers set host='blat5', port=17783 where db="hg19Kg";
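
The SQL steps above can be gathered into a small dry-run helper that only prints the statements, so they can be reviewed before being pasted into an hgsql session on hgcentralbeta. This is a sketch: the function name pcr_target_sql and its argument order are hypothetical, and the values in the example call are the ones from the example above.

```shell
# Hypothetical dry-run helper: print the SQL that refreshes one PCR target.
# $1 = targetDb name (e.g. hg19Kg), $2 = dump file,
# $3 = blat server host, $4 = blat server port.
pcr_target_sql() {
    name=$1
    dump=$2
    host=$3
    port=$4
    # Nothing is executed here; the statements are only written to stdout.
    cat <<EOF
delete from targetDb where name='$name';
load data local infile '$dump' into table targetDb;
update blatServers set host='$host', port=$port where db='$name';
EOF
}

pcr_target_sql hg19Kg targetDb.dev blat5 17783
```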

Releasing to the public website

uniProt and Proteome databases

  • Push them to the RR machines, including genome-euro.
  • Ask for a dump/autodump to the download server, and ask for the README files to be pushed to hgdownload.
  • Ask for them to be made available on the public MySQL server.
  • Ask for the symlinks in /var/lib/mysql/ to be updated. (It is okay to change the symlinks before the gene set goes out.)

Pushing the tables from the main assembly database

Here's a trick that keeps the interruption to users of the public website to a minimum. Copy the tables from hgwbeta into a temporary database on the RR servers. When it's time for the switch, just do a Unix mv of the table files into the real database. When we did this for the hg18 UCSC Genes update in September 2008, there was only a 70-second interruption. (Don't forget these tables: trackDb and friends, and tableDescriptions.)
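
A rough sketch of the move step follows. It is not the procedure that was actually used. It assumes file-per-table storage (MyISAM-style .frm/.MYD/.MYI files) and that the temporary and real database directories are on the same filesystem, so that each mv is a quick rename. The function name swap_in_tables is hypothetical.

```shell
# Hypothetical helper: move every table file from a staging database
# directory into the live database directory.
# $1 = staging directory, $2 = live directory.
swap_in_tables() {
    staging=$1
    live=$2
    for f in "$staging"/*; do
        # Skip the unexpanded pattern when the staging directory is empty.
        [ -e "$f" ] || continue
        # -f overwrites the old table file of the same name without prompting.
        mv -f "$f" "$live"/
    done
}
```

Tables in the live directory that have no counterpart in the staging directory are left untouched. MySQL may keep the old files open until its tables are flushed, so the switch should be coordinated with cluster-admin.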

Alternatively, try to push the tables during a low-traffic time. The push took less than 5 minutes for the hg19 UCSC Genes in January 2011, and there were no major problems. The track just looked slightly odd (e.g., the gene colors were not normal) during the push.

Searching

Push these files from hgwdev to hgnfs1 (for human and mouse UCSC Genes):

/gbdb/<db>/knownGene.ix
/gbdb/<db>/knownGene.ixx
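
Before requesting the push, it may be worth confirming that both index files exist and are non-empty. A minimal sketch, with a hypothetical function name and the gbdb root passed in as an argument so that it can be tried on a scratch directory:

```shell
# Hypothetical pre-push check: both search index files exist and are non-empty.
# $1 = gbdb root (e.g. /gbdb), $2 = assembly (e.g. hg18).
check_search_index() {
    root=$1
    db=$2
    rc=0
    for ext in ix ixx; do
        f="$root/$db/knownGene.$ext"
        # -s is true only for an existing file with size greater than zero.
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            rc=1
        fi
    done
    return $rc
}
```

For example, check_search_index /gbdb hg18 returns a non-zero status and names the offending file if either index is missing.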

Update the hgcentral database

  • If this is the first gene set on an assembly, set hgcentral.dbDb.hgNearOk (to 1) to enable the Gene Sorter.
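
As a sketch, the statement for this update could be generated as shown below and then run in an hgsql session on hgcentral. The function name gene_sorter_sql is hypothetical, and the sketch assumes that dbDb identifies each assembly by a column called name.

```shell
# Hypothetical dry-run helper: print the statement that enables the
# Gene Sorter for one assembly. Nothing is executed.
# Assumes dbDb identifies assemblies by a column called "name".
gene_sorter_sql() {
    echo "update dbDb set hgNearOk=1 where name='$1';"
}

gene_sorter_sql hg19   # prints: update dbDb set hgNearOk=1 where name='hg19';
```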

Blast Tab tables

Push xxBlastTab tables to your organism from the other 5 assemblies (all but zebrafish). Zebrafish does not have an hgGene gene set available.

PCR Target

  • Do the same steps as in hgcentralbeta for hgcentral (using hgsql -h genome-centdb).
  • Remember to retire the old blat server once the new UCSC Genes set is on the RR: remove the pointers to the old Kg blat server from the blatServers tables, and ask the admins to stop the blat server on whichever host/port it was using.

Downloads

  • Edit the downloads.html page to add a link to the newly-pushed proteins database (look for the words "Protein database for <assembly>") and push the file to hgdownload.
  • Push the CDS FASTA alignment files, if any.

Drop any unused tables

If there are knownGene-related tables that did not get rebuilt with this release, they should most likely be dropped from hgwbeta/mysqlrr. It is fine to leave older sets of "Old UCSC Genes" tables on the RR. When sending drop requests, remember to include genome-euro.