Public Hub QA

From Genecats

Overview

Process Overview

Public hubs are wrangled and QA'd by the UCSC Genome Browser QA team. The QAer works directly with the data submitter from the start to get the track hub into the correct format, then QAs the hub (a light QA compared with native tracks) and releases it. QA communicates directly with the data contributor throughout the process. If the QAer needs technical help, they should contact the Project Manager or an engineer.

Philosophy of QAing Data in a Public Hub

Based on the data, explore whether a data set should overlap or avoid certain regions, such as coding exons. Depending on the type of data, the relevant regions might instead be promoters, areas of open chromatin, common SNPs, highly conserved regions, 5’ UTRs, 3’ UTRs, mitochondria, or the sex chromosomes. If the expectations for the data aren't clear, ask the hub provider for input. With an idea of what the data should show, try to identify at least one native track in the Genome Browser that you would expect to correlate or anti-correlate with the tracks in the hub data set, and visually spot check a few.
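The visual spot check can be complemented by a rough numeric one. The following is a minimal sketch, not part of the established QA process: it assumes you have already extracted hub items and a native-track region set as (chrom, start, end) tuples, and the function name and coordinates are made up for illustration.

```python
# Sketch: what fraction of hub items overlap a reference region set?
# Intervals are (chrom, start, end), 0-based half-open as in BED files.

def fraction_overlapping(items, reference):
    """Return the fraction of `items` overlapping at least one `reference` interval."""
    by_chrom = {}
    for chrom, start, end in reference:
        by_chrom.setdefault(chrom, []).append((start, end))
    if not items:
        return 0.0
    hits = 0
    for chrom, start, end in items:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < r_end and r_start < end
               for r_start, r_end in by_chrom.get(chrom, [])):
            hits += 1
    return hits / len(items)

# Made-up coordinates, for illustration only.
hub_items = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80), ("chrM", 10, 20)]
coding_exons = [("chr1", 150, 300), ("chr2", 0, 60)]
print(fraction_overlapping(hub_items, coding_exons))  # 0.5
```

A fraction far from what the data type suggests (for example, very low exon overlap for an expression track) is a prompt to look more closely or to ask the hub provider, not a verdict on its own.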

Public Hub Overview

Public hubs are made visible by a row in the hgcentral*.hubPublic table, e.g.:

+--------------------------------------+-----------------------------+------------------------------------------------------------------------------------------------------------------+---------------------+---------+--------+
| hubUrl                               | shortLabel                  | longLabel                                                                                                        | registrationTime    | dbCount | dbList |
+--------------------------------------+-----------------------------+------------------------------------------------------------------------------------------------------------------+---------------------+---------+--------+
| http://johnlab.org/xpad/Hub/UCSC.txt | DRS PolyA site & Expression | Genomewide expression & Polyadenylation landscape of cancer using Direct RNA sequencing (Tissues and Cell lines) | 2012-05-15 09:50:09 |       1 | hg19,  | 
+--------------------------------------+-----------------------------+------------------------------------------------------------------------------------------------------------------+---------------------+---------+--------+

UCSC-Hosted Public Hubs

Public hubs created by browser engineers are an alternative to native tracks for specialty data. This kind of hub is QAed and released slightly differently from externally hosted hubs. The main difference is that there is an additional step to push the hub files from the development system to the public download server.

The first step is to review the hub, as described below, with the files hosted on genome-test so the developer can easily change them during QA. You can do this from the browser on genome.ucsc.edu, adding the hub URL on genome-test (e.g. http://genome-test.cse.ucsc.edu/gbdb/hubs/<your-hub>/hub.test.txt) via the 'My Hubs' panel of the Track Data Hubs page.

After you have completed review and the developer has made any changes needed, you will request a push of the hub files listed by the developer (e.g. in the file /hive/data/gbdb/hubs/<your-hub>/filelist.txt) to the download server (directory /mirrordata/hubs/<your-hub>).

After sanity checking the hub on hgdownload (e.g. at http://hgdownload.soe.ucsc.edu/hubs/<your-hub>/hub.txt), you can proceed with the instructions below for adding it to the hubPublic table. (NOTE: The additional testing on hgwbeta mentioned below is probably unnecessary.)

Wrangling Public Hub

The QAer adds the entry to the hubPublic table on hgwdev (i.e. in hgcentraltest) and works with the authors to get the hub into basic acceptable shape before QAing it. If the QAer has technical questions about wrangling the track, they should ask the Project Manager or an engineer for advice or assistance.

To add the hub to the hubPublic table on dev, follow the steps below for adding the hub to beta, but log onto the dev MySQL server instead of the beta server:

hubPublicCheck -addHub=http://johnlab.org/xpad/Hub/UCSC.txt hubPublic

[Commandline utilities that use udcCache default to /tmp/udcCache, which is nearly useless: some other user has almost always created it first, so you can't write there. Specify the -udcDir= option to point the cache somewhere else. For example, when running hubCheck, specify "-udcDir=.", "-udcDir=$HOME", or "-udcDir=$HOME/udcCache".]
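As a minimal sketch of the note above, the snippet below assembles the hubPublicCheck command line with an explicit cache directory. The helper name is hypothetical, and it is an assumption that hubPublicCheck accepts -udcDir= in this position; check the utility's usage message before relying on it.

```python
import os
import shlex

def hub_public_check_command(hub_url, udc_dir, table="hubPublic"):
    """Assemble (but do not run) a hubPublicCheck command using a writable udcCache.

    Hypothetical helper: option placement is an assumption, not confirmed usage.
    """
    return ["hubPublicCheck", f"-udcDir={udc_dir}", f"-addHub={hub_url}", table]

# Keep the cache under the home directory rather than the shared /tmp/udcCache.
udc_dir = os.path.join(os.path.expanduser("~"), "udcCache")
cmd = hub_public_check_command("http://johnlab.org/xpad/Hub/UCSC.txt", udc_dir)
print(shlex.join(cmd))
```

The printed line can be pasted at the shell prompt, or the list can be passed to subprocess.run once the cache directory exists.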

This outputs the insert statement you will need to run in hgcentraltest to update the hubPublic table. The output should look something like:

mysql> insert into hubPublic (hubUrl,shortLabel,longLabel,registrationTime,dbCount,dbList) values ("http://zlab.umassmed.edu/zlab/publications/UMassMedZHub/hub.txt","UMassMed ZHub", "UMassMed H3K4me3 ChIP-seq data for Autistic brains",now(),1, "hg19,");

Once you have this text, go into hgcentraltest (hgsql hgcentraltest) and paste in the output of hubPublicCheck at the mysql prompt.

You may wish to review the Public Hub Guidelines, compiled from engineer feedback, for a reminder of the areas to examine closely.

Note: if the hub submitter wants to restrict access by IP address to only the Genome Browser public site, don't forget to include genome-euro. You can find the IP address for a server with the host <server name> command, e.g. host genome-euro.ucsc.edu.

Stage the hub on Beta

Begin by running hubPublicCheck to generate the insert statement needed to add the row to hgcentralbeta.hubPublic. You will need the hub URL, either from the redmine ticket or from the hub's entry on hgwdev, e.g.:


hubPublicCheck -addHub=http://johnlab.org/xpad/Hub/UCSC.txt hubPublic

[As with any commandline utility that uses udcCache, specify the -udcDir= option (e.g. "-udcDir=$HOME/udcCache") so that the cache goes somewhere writable rather than the default /tmp/udcCache.]

This outputs the insert statement you will need to run in hgcentralbeta to update the hubPublic table. The output should look something like:

mysql> insert into hubPublic (hubUrl,shortLabel,longLabel,registrationTime,dbCount,dbList) values ("http://zlab.umassmed.edu/zlab/publications/UMassMedZHub/hub.txt","UMassMed ZHub", "UMassMed H3K4me3 ChIP-seq data for Autistic brains",now(),1, "hg19,");

Once you have this text, go into hgcentralbeta (hgsql -h mysqlbeta hgcentralbeta) and paste in the output of hubPublicCheck at the mysql prompt. The hub you are testing should appear on beta immediately (i.e. you don't need to do a make). To confirm the addition, you can run "select * from hubPublic \G" at the mysql prompt before and after the insert.

Note, make sure that the order of the assemblies in the dbList field matches the order of assemblies in the hub's genomes.txt file. The first assembly in the dbList field determines the default assembly that will show up when connecting the assembly hub.
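The ordering check described above can be sketched as follows. This is an illustrative helper rather than an existing tool: it assumes the text of genomes.txt has already been fetched, and the function names and sample stanzas are made up.

```python
def genomes_in_order(genomes_txt):
    """Return assembly names from the `genome` lines of genomes.txt, in file order."""
    names = []
    for line in genomes_txt.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "genome":
            names.append(parts[1].strip())
    return names

def db_list_entries(db_list):
    """Split a hubPublic dbList value such as "hg19,mm10," into assembly names."""
    return [db.strip() for db in db_list.split(",") if db.strip()]

def db_list_matches(genomes_txt, db_list):
    """True when dbList names the same assemblies in the same order as genomes.txt."""
    return genomes_in_order(genomes_txt) == db_list_entries(db_list)

# Made-up genomes.txt content, for illustration only.
sample = """\
genome hg19
trackDb hg19/trackDb.txt

genome mm10
trackDb mm10/trackDb.txt
"""
print(db_list_matches(sample, "hg19,mm10,"))  # True
print(db_list_matches(sample, "mm10,hg19,"))  # False
```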

Cursory QA on beta

Once the hub is staged on beta, do a minimal round of QA, including:

  • Make sure the tracks open.
  • Make sure there aren't too many tracks on by default. The hub should load quickly; if it doesn't, you might need to ask the contributor to reduce the number of tracks on by default.
  • Check that tracks have description pages.
  • Display all tracks in dense and review the shortLabels to see if any need to be shortened. The shortLabel text should be under 17 characters, or meaningful information may be cut off from display.
  • Check the longLabels. The length for a longLabel should be about 75 characters.
  • Make sure that the authors' email address is prominently listed in the description pages (so our users can contact them with questions).
  • Take a moment to review the Public Hub Guidelines for a reminder of the areas to examine closely.
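The label-length items in the checklist above lend themselves to a quick scripted pass before the visual review. The sketch below is an assumption-laden illustration: it scans trackDb.txt text that you have already downloaded, the function name and sample tracks are made up, and short_max=16 reads "under 17 characters" literally.

```python
def overlong_labels(track_db_txt, short_max=16, long_max=75):
    """Return (setting, label, length) for labels longer than the given limits."""
    limits = {"shortLabel": short_max, "longLabel": long_max}
    problems = []
    for line in track_db_txt.splitlines():
        # strip() also handles the indentation used for subtracks.
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0] in limits:
            setting, label = parts
            if len(label) > limits[setting]:
                problems.append((setting, label, len(label)))
    return problems

# Made-up trackDb.txt content, for illustration only.
sample = """\
track polyaSites
shortLabel DRS PolyA Sites
longLabel Polyadenylation sites from Direct RNA sequencing
type bigBed

track expr
shortLabel DRS Expression Tissues
longLabel Expression from Direct RNA sequencing
type bigWig
"""
for setting, label, length in overlong_labels(sample):
    print(f"{setting} is {length} characters: {label}")
```

Since the longLabel guideline is only approximate ("about 75 characters"), treat longLabel hits as candidates for a second look rather than failures.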

Push to the RR

Once you've verified the hub is functioning and looks reasonable you can "push" it to the RR by performing the analogous insert into the hubPublic table on the RR (i.e. in hgcentral).

To go into hgcentral on hgwdev type at the prompt: hgsql -h genome-centdb

Then at the mysql prompt: use hgcentral

Paste in the same insert statement you used on beta (you don't need to rerun hubPublicCheck for the RR). Updating note: if you later update a hub, perhaps because its genomes.txt has changed since you first added it, you may want to rerun hubPublicCheck to get the updated statement, and be sure to update dev, beta, and the RR (we have a hubPublicCheck cronjob that also checks for changes to remote hubs).

Once this is done, update the redmine ticket and notify the hub contributor that their hub is live.

With new Public Hubs (especially with descriptionUrls), once they are on the RR, be sure to build and push an update of the index files.
1. To build these files navigate to hive and run the doPublicCrawl script.

 
cd /hive/groups/browser/hubCrawl
./doPublicCrawl     

2. The result will be an updated udcCache directory in /hive/groups/browser/hubCrawl/udcCache and an updated hubSearchText table in hgcentraltest. Ask the admins to push these two things to beta:

Please push the following directory on hgwdev:

/data/apache/userdata/hubCrawl/udcCache/

to the following location on hgnfs1:

/export/userdata/hubCrawl/rr/
/export/userdata/hubCrawl/beta/

and the following location on asia/euro:

/data/userdata/hubCrawl/

Please also push the following table 

hubSearchText

in the hgcentraltest database

from hgwdev ---> hgwbeta

3. You should now be able to search for text from the new hub's descriptionUrl, its short or long label, its assembly, or its track labels. For example, a search for methpipe matches a line on the DNA Methylation descriptionUrl (http://smithlabresearch.org/software/methbase/), and a search for hg38 pulls up all the hg38 hubs.

Releasing UCSC-Hosted big data Public Hubs

This step is rare and only relevant for big data hubs that UCSC is hosting. The main reason for moving away from /gbdb/hubs is to avoid confusion with other /gbdb/ pushes (which the admins normally push to hgnfs1, as those files are often used by internal tracks in the RR). These public hubs don't go that route and instead go to hgdownload. They are made available with a push request from /usr/local/apache/htdocs-hgdownload/hubs/ on hgwdev to /mirrordata/hubs on hgdownload.

Here's an example push request:

Please push the following file:

/usr/local/apache/htdocs-hgdownload/hubs/newHub/*

from hgwdev --> hgdownload/hgdownload-sd
    (in path, "/usr/local/apache/htdocs-hgdownload/" should become "/mirrordata/" on hgdownload)

Note that items that are symlinked on hgwdev should become real files on hgdownload. 

Reason:  Releasing new UCSC hosted hub newHub to hgdownload.

Thanks!
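The path substitution named in the push request is mechanical, so it can be sketched in a few lines. The constant and function names here are hypothetical; only the two path prefixes come from the request above.

```python
DEV_PREFIX = "/usr/local/apache/htdocs-hgdownload/"
DOWNLOAD_PREFIX = "/mirrordata/"

def hgdownload_path(dev_path):
    """Map a hub file path on hgwdev to the expected path on hgdownload."""
    if not dev_path.startswith(DEV_PREFIX):
        raise ValueError(f"not under {DEV_PREFIX}: {dev_path}")
    return DOWNLOAD_PREFIX + dev_path[len(DEV_PREFIX):]

print(hgdownload_path("/usr/local/apache/htdocs-hgdownload/hubs/newHub/hub.txt"))
# /mirrordata/hubs/newHub/hub.txt
```

Running the file list through such a mapping before writing the request makes it easy to tell the admins exactly where each file should land.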

It may be useful to note that the UCSC GTEx data has access restrictions. Apache on hgdownload only allows the RR to access the hub data files, so the hub displays on the RR only: the hub won't load on other sites, and files cannot be downloaded directly. Other external Public Hubs have taken similar steps to control their data.

Send genome-announce email

Here are some previous examples of announcement emails for public hubs. It is an opportunity to share a sentence or two about the lab and data (and maybe thank them for creating the public hub). The news could also be tweeted and added to Facebook.

Notes for Public Assembly Hub QA

Assembly hubs are a feature released in early 2013. Refer to the Assembly Hubs page on the public wiki for more info on how to create your own.

The QA for a public Assembly Hub is very similar to that of a public Track Hub, although there are a few things unique to an Assembly Hub.

  • Check that the 2bit files exist and that the genomes.txt file points to the correct 2bit
  • Make sure the correct labels show up in drop down menus on the gateway page
  • Make sure contact information is clearly displayed on all description pages (gateway description, and the track description pages)
  • Strongly suggest that the hub provider add these settings to each genome's entry in genomes.txt (you can explain that the last 3 settings will make it easier to find each assembly's species through the hgGateway search):
    • defaultPos, scientificName, organism, description
  • Assembly Hub QA Warning: Be sure to check all your assemblies on the RR before you release. There are many unreleased assemblies on hgwdev. A hub developer once built a hub on preview assemblies that worked fine on hgwdev but failed on the RR because the data wasn't there.

Then run through the basic public Track Hub QA described in the sections above.
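The genomes.txt items in the Assembly Hub checklist can be pre-screened with a small script. This is a sketch under stated assumptions: it works on genomes.txt text you have already fetched, the function names and sample assemblies are made up, and it only checks that a twoBitPath setting is present, not that the 2bit file actually exists.

```python
RECOMMENDED = ("defaultPos", "scientificName", "organism", "description")

def parse_stanzas(genomes_txt):
    """Split genomes.txt into one {setting: value} dict per blank-line-separated stanza."""
    stanzas, current = [], {}
    for raw in genomes_txt.splitlines():
        line = raw.strip()
        if not line:
            if current:
                stanzas.append(current)
                current = {}
            continue
        if line.startswith("#"):
            continue
        parts = line.split(None, 1)
        current[parts[0]] = parts[1].strip() if len(parts) == 2 else ""
    if current:
        stanzas.append(current)
    return stanzas

def missing_settings(genomes_txt):
    """Map each genome name to the checked settings that are absent or empty."""
    report = {}
    for stanza in parse_stanzas(genomes_txt):
        missing = [s for s in ("twoBitPath",) + RECOMMENDED if not stanza.get(s)]
        if missing:
            report[stanza.get("genome", "(unnamed)")] = missing
    return report

# Made-up genomes.txt content, for illustration only.
sample = """\
genome newAsm1
trackDb newAsm1/trackDb.txt
twoBitPath newAsm1/newAsm1.2bit
defaultPos chr1:1000-2000
scientificName Examplus primus
organism Example organism
description Jan. 2013 example assembly

genome newAsm2
trackDb newAsm2/trackDb.txt
twoBitPath newAsm2/newAsm2.2bit
organism Example organism
"""
print(missing_settings(sample))
```

An empty result means every stanza carries the checked settings; the labels on the gateway page and the behavior on the RR still need to be verified by hand.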