Public Hub QA

From Genecats

Revision as of 21:15, 24 July 2017

Overview

Public hubs are track or assembly hubs contributed by the worldwide research community. They are wrangled and QA'd by the UCSC GB QA team. The QAer works directly with the data contributor from the start to get the hub into the correct format, then QAs the hub (a light QA compared with native tracks) and releases it. If the QAer needs technical help, they should contact the Project Manager or an engineer.

Public hubs are made visible by a row in the hubPublic table, which the QAer adds to the various hgcentral* databases. For example:

+----------------------------------------------+-------------------+---------------------------------------+---------------------+---------+--------+--------------------------------+
| hubUrl                                       | shortLabel        | longLabel                             | registrationTime    | dbCount | dbList | descriptionUrl                 |
+----------------------------------------------+-------------------+---------------------------------------+---------------------+---------+--------+--------------------------------+
| http://lisanwanglab.org/DASHR/tracks/hub.txt | DASHR small ncRNA | DASHR Human non-coding RNA annotation | 2015-12-20 11:30:47 |       1 | hg19,  | http://lisanwanglab.org/DASHR/ |
+----------------------------------------------+-------------------+---------------------------------------+---------------------+---------+--------+--------------------------------+
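
As the example row suggests, dbList is a comma-separated list with a trailing comma, and dbCount is the number of assemblies in it. Below is a minimal Python sketch of a consistency check on those two fields, assuming that relationship holds for every row. The function names are hypothetical and not part of any UCSC tool.

```python
def parse_db_list(db_list):
    """Split a hubPublic dbList value such as "hg19," into assembly names."""
    return [db.strip() for db in db_list.split(",") if db.strip()]


def db_count_matches(db_count, db_list):
    """Return True when dbCount equals the number of assemblies in dbList."""
    return db_count == len(parse_db_list(db_list))


# Values taken from the example row above.
print(parse_db_list("hg19,"))
print(db_count_matches(1, "hg19,"))
```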

Public Hub QA Process

The QA process for public hubs is simple: (1) Check that the hub meets our required guidelines, (2) if so, add it to the public list; if not, work with the contributor to ensure it meets those guidelines. Each section below contains more details as to what steps of the QA process you carry out on each of our sites/machines.

On hgwdev

This is where most of the Public Hub QA process will happen.

1) Add Hub to the hubPublic table on hgwdev

First, run hubPublicCheck on your hub:

 hubPublicCheck -addHub=http://lisanwanglab.org/DASHR/tracks/hub.txt hubPublic 

This command generates the MySQL insert statement needed to add the hub to the hubPublic table. Here is an example of a statement output by hubPublicCheck (for a different hub):

insert into hubPublic (hubUrl,shortLabel,longLabel,registrationTime,dbCount,dbList) values ("http://zlab.umassmed.edu/zlab/publications/UMassMedZHub/hub.txt","UMassMed ZHub", "UMassMed H3K4me3 ChIP-seq data for Autistic brains",now(),1, "hg19,");

Then, use hgsql to insert the line for this hub into the hubPublic table on dev:

hgsql -e '<insert statement output by hubPublicCheck>' hgcentraltest
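
The generated insert statement contains double quotes, so shell quoting matters when it is passed to hgsql -e. The following is a small Python sketch of one way to build the command line safely. The helper name build_hgsql_command and the shortened statement are made up for illustration. Only the hgsql -e form and the hgcentraltest database name come from the steps above.

```python
import shlex


def build_hgsql_command(statement, database):
    """Return a shell command line that passes `statement` to hgsql -e."""
    # shlex.quote wraps the statement in single quotes, so the double
    # quotes inside the SQL reach hgsql unchanged.
    return "hgsql -e {} {}".format(shlex.quote(statement), shlex.quote(database))


# Shortened, made-up statement for illustration only.
statement = (
    'insert into hubPublic (hubUrl,shortLabel) '
    'values ("http://example.org/hub.txt","Example hub");'
)
print(build_hgsql_command(statement, "hgcentraltest"))
```

Printing the command rather than running it lets you review the statement before it touches the table.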

2) Display the hub in the Genome Browser and do some minimal QA

After adding the new hub to the hubPublic table, connect the hub and view in the Genome Browser. The primary QA you will do is ensuring that the hub meets our required guidelines, such as checking for track descriptions with contact information.
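
As a rough aid for the contact-information check, a description page can be scanned for email-like strings. This is only a heuristic sketch in Python: the pattern and the function name are assumptions, and a page can carry valid contact information (for example a lab web form) that the scan will not find, so an empty result means "look by eye", not "fails the guideline".

```python
import re

# Deliberately simple pattern; it will miss obfuscated addresses such as
# "name at example dot org".
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_contact_emails(description_html):
    """Return the distinct email-like strings found in a description page."""
    return sorted(set(EMAIL_PATTERN.findall(description_html)))


# Made-up description page fragment.
page = '<h2>Contact</h2><p><a href="mailto:jane@example.org">jane@example.org</a></p>'
print(find_contact_emails(page))
```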

Based on the data, explore whether the data set should overlap or avoid certain regions. Depending on the type of data, these might be coding exons, promoters, areas of open chromatin, common SNPs, highly conserved regions, 5’ UTRs, 3’ UTRs, mitochondria, or the sex chromosomes. If the expectations for the data aren't clear, ask the hub provider for input. Using an idea of what the data describes, try to identify at least one native track in the Genome Browser that you would expect to correlate or anti-correlate with the tracks in the hub, and visually spot check a few regions.

You should also look for recommended guidelines that the hub violates. Among these, note the ones that, if fixed, would greatly improve the usability of the hub for our users. For example, if a hub contains 300 tracks that aren’t organized into composites or superTracks, you should recommend that the contributor group the tracks in a reasonable manner.
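
To get a quick count of ungrouped tracks, the hub's trackDb.txt can be scanned for stanzas that carry none of the grouping settings. This is a minimal Python sketch, assuming blank-line-separated stanzas of "setting value" lines. It does not follow include lines or backslash line continuations, and the function names are hypothetical.

```python
GROUPING_SETTINGS = {"parent", "superTrack", "compositeTrack", "container"}


def parse_stanzas(trackdb_text):
    """Split trackDb-style text into one {setting: value} dict per stanza."""
    stanzas, current = [], {}
    for raw_line in trackdb_text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line ends the current stanza.
            if current:
                stanzas.append(current)
                current = {}
            continue
        if line.startswith("#"):
            continue  # skip comment lines
        parts = line.split(None, 1)
        current[parts[0]] = parts[1] if len(parts) > 1 else ""
    if current:
        stanzas.append(current)
    return stanzas


def loose_tracks(trackdb_text):
    """Return names of tracks that neither are nor belong to a grouping."""
    return [
        stanza["track"]
        for stanza in parse_stanzas(trackdb_text)
        if "track" in stanza and GROUPING_SETTINGS.isdisjoint(stanza)
    ]


# Made-up trackDb content: one loose track, one composite with one child.
EXAMPLE = """track sample1
type bigWig
bigDataUrl s1.bw

track myComposite
compositeTrack on

    track sample2
    parent myComposite
    type bigWig
"""
print(loose_tracks(EXAMPLE))
```

A long list of loose tracks is a good prompt for the grouping recommendation described above.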

Notes for Assembly Hubs

The QA for a public Assembly Hub is very similar to that of a public Track Hub. In addition to the normal public track hub QA, there are a few things unique to an Assembly Hub that you should pay attention to:

  • Check that the 2bit files exist and that the genomes.txt file points to the correct 2bit
  • Make sure the correct labels show up in drop down menus on the gateway page
  • Make sure contact information is clearly displayed on all description pages (gateway description and track description pages)
  • Warning: Be sure to check all your assemblies on the RR before you release. We have many unreleased assemblies on hgwdev. A hub developer once built a hub on preview assemblies, which worked fine on hgwdev but failed on the RR because the data wasn't there.
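
For the 2bit check, it can help to list the URL that each genome's twoBitPath points to. This is a Python sketch under two assumptions: the genome line comes first in each genomes.txt stanza, and relative paths resolve against the genomes.txt URL. The function name and the example URL are made up.

```python
from urllib.parse import urljoin


def two_bit_urls(genomes_url, genomes_text):
    """Map each genome name to the absolute URL of its twoBitPath (or None)."""
    result, genome = {}, None
    for raw_line in genomes_text.splitlines():
        parts = raw_line.strip().split(None, 1)
        if len(parts) < 2:
            continue  # blank line or setting without a value
        key, value = parts
        if key == "genome":
            genome = value
            result[genome] = None
        elif key == "twoBitPath" and genome is not None:
            # Assumption: relative paths resolve against the genomes.txt URL.
            result[genome] = urljoin(genomes_url, value)
    return result


# Made-up genomes.txt content.
GENOMES = """genome newAsm1
trackDb newAsm1/trackDb.txt
twoBitPath newAsm1/newAsm1.2bit

genome newAsm2
trackDb newAsm2/trackDb.txt
"""
for name, url in two_bit_urls("http://example.org/myHub/genomes.txt", GENOMES).items():
    print(name, url)
```

A genome that maps to None has no twoBitPath. That is expected for a hub genome that is a native UCSC assembly, and worth a closer look otherwise. The sketch only resolves URLs; confirming that each file actually exists still has to be done separately.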

3) Pass feedback to the hub contributor

If you encountered issues with the hub during the previous step, or noticed that it violates some of our required guidelines, pass these on to the hub contributor in an email. When contacting the hub contributor, lay out your feedback in a clear and concise manner, such as a numbered list. Often it’s helpful to not just point out the issues but to provide a solution as well, especially when the issue is the misuse of a trackDb tag.

For our “recommended” guidelines, you should only pass on feedback for those items that would greatly increase the usability of the hub, for example organizing hundreds of loose tracks into a superTrack or composite. When passing these along, note that the contributor isn’t required to make the changes, but that doing so would greatly increase the usefulness of their hub.

On hgwbeta

1) Add the hub to the hubPublic table on hgwbeta

Same as the step for adding this hub to hubPublic on hgwdev.

2) Cursory QA on beta

Be sure that the hub and all its tracks load properly on beta. The best way to check this is by clicking the “hide all” button on hgTracks and then navigating to the “Configure” page. On the Configure page, click the “show all” button on the track group for your track hub and then click “submit”. Check that all of the tracks load and that you don’t see any yellow error messages indicating that there were issues loading certain tracks.

Release to the RR

Use the same insert statement that you used to add this hub to hubPublic on hgwbeta to add this hub to the hubPublic table on the RR.

Post-RR release

1) Rebuild and push hub search files and tables

With new Public Hubs (especially those with descriptionUrls), once they are on beta/RR, be sure to build and push an update of the index files.

A. To build these files, navigate to hive and run the doPublicCrawl script.

cd /hive/groups/browser/hubCrawl
./doPublicCrawl     

B. The result will be an updated udcCache directory in /hive/groups/browser/hubCrawl/udcCache and an updated hubSearchText table in hgcentraltest. The udcCache directory needs to be in a place where Apache can access it, so ask the admins to move the files like so:

Can you please rsync --delete the contents of the following directory:

/hive/groups/browser/hubCrawl/udcCache/

to the following location on hgwdev:

/data/apache/userdata/hubCrawl/udcCache/

Once the files are in this location, verify by running ls -l /data/apache/userdata/hubCrawl/udcCache/path/to/your/new/hub. Then ask the admins to push both the files and the table to beta:

Please rsync --delete the contents of the following directory on hgwdev:

/data/apache/userdata/hubCrawl/udcCache/

to the following location on hgnfs1:

/export/userdata/hubCrawl/beta/

Please also push the following table:

hubSearchText

in the hgcentraltest database to hgcentralbeta
After the table has been pushed, please 'flush tables' on mysqlbeta. 

C. You should now be able to search for text from the new hub's descriptionUrl page, the hub's short or long label, its assemblies, or its track labels. For example, a search for methpipe matches a line on the DNA Methylation descriptionUrl (http://smithlabresearch.org/software/methbase/), and a search for hg38 pulls up all the hg38 hubs.

D. Once you've verified that searching is working on hgwbeta, push the files/table to rr/euro/asia:

Please rsync --delete the contents of the following directory on hgnfs1:

/export/userdata/hubCrawl/beta/

to the following location on hgnfs1:

/export/userdata/hubCrawl/public/

and to the following location on euro/asia:

/data/userdata/hubCrawl/public/

Please also push the following table:

hubSearchText

in the hgcentralbeta database to hgcentral on genome-centdb/euro/asia.
After the table has been pushed, please 'flush tables' on genome-centdb/euro/asia.

2) Notify hub contributor

Contact the hub contributor and let them know that they can contact our internal mailing list (genome-www) with any questions or concerns.

3) Send genome-announce email

Use previous announcement emails for public hubs as a model. The announcement is an opportunity to share a sentence or two about the lab and the data (and maybe thank them for creating the public hub). The news can also be tweeted and posted to Facebook.

QA for UCSC-hosted public hubs

Public hubs created by browser engineers are an alternative to native tracks for specialty data. These hubs are QAed and released in much the same way as externally hosted hubs. The main difference is an additional step to push the hub files from the development system to the public download server. UCSC-hosted public hubs were previously hosted on hgwdev in /gbdb/hubs and then pushed to hgdownload for display on the public site. Now we host these hubs on hgwdev in /usr/local/apache/htdocs-hgdownload/hubs/, although they are still pushed to hgdownload for display on the public site. This change was made to reduce confusion with other /gbdb pushes, which normally go to hgnfs1.

You should still review the hub on hgwdev as described above. Since we’re the ones hosting and providing these hubs, it’s alright to be a little more strict with regard to our hub guidelines. Once the hub is looking good on hgwdev, you can release it to the RR using the steps described in the next section.

Releasing UCSC-Hosted big data Public Hubs

This step is rare and only relevant for big data hubs that UCSC is hosting.

These hubs are made available with a push request from /usr/local/apache/htdocs-hgdownload/hubs/ on hgwdev to /mirrordata/hubs on hgdownload.

Here's an example push request:

Please push the following files:

/usr/local/apache/htdocs-hgdownload/hubs/newHub/*

from hgwdev --> hgdownload/hgdownload-sd
    (in path, "/usr/local/apache/htdocs-hgdownload/" should become "/mirrordata/" on hgdownload)

Note that items that are symlinked on hgwdev should become real files on hgdownload. 

Reason:  Releasing new UCSC hosted hub newHub to hgdownload.

Thanks!
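
The path rewrite described in the push request can be sketched in Python to double check where each file will land on hgdownload. The two prefixes come from the request above; the function name and the example file are hypothetical.

```python
DEV_PREFIX = "/usr/local/apache/htdocs-hgdownload/"
DOWNLOAD_PREFIX = "/mirrordata/"


def hgdownload_path(dev_path):
    """Translate a hub file path on hgwdev to its path on hgdownload."""
    if not dev_path.startswith(DEV_PREFIX):
        raise ValueError("not under {}: {}".format(DEV_PREFIX, dev_path))
    # Swap only the leading prefix; the rest of the path is unchanged.
    return DOWNLOAD_PREFIX + dev_path[len(DEV_PREFIX):]


# Hypothetical file inside the newHub directory from the example request.
print(hgdownload_path("/usr/local/apache/htdocs-hgdownload/hubs/newHub/hub.txt"))
```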

It may be useful to note that the UCSC GTEx data have access restrictions. Apache on hgdownload only allows the RR to access the hub data files, so the hub displays on the RR only: it won't load on other sites, and the files cannot be downloaded directly. Other external Public Hubs have taken similar steps to control access to their data.

What to do if a Public Hub is down?

If you notice that a hub is consistently down for an extended period of time (3-5 days), then you should contact the hub contributor to let them know that their hub is having issues. We keep a page of the contact information for all of our public hubs here: http://genecats.cse.ucsc.edu/qa/test-results/publicHubContactInfo/publicHubContact.html. We also have a cronjob that checks the status of all of our public hubs, so be sure to check in with the person receiving those emails before sending your message.
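
The "down for 3-5 days" rule of thumb can be expressed as a small Python sketch. It assumes you have, for each hub URL, a list of daily up/down results ordered oldest to newest. That input format is hypothetical and is not the actual output of the cronjob, and the URLs below are made up.

```python
def consecutive_days_down(daily_up):
    """Count how many of the most recent daily checks in a row failed."""
    days = 0
    for is_up in reversed(daily_up):
        if is_up:
            break
        days += 1
    return days


def hubs_to_follow_up(history, min_days=3):
    """Return hub URLs whose latest outage has lasted at least `min_days`."""
    return sorted(
        url
        for url, daily_up in history.items()
        if consecutive_days_down(daily_up) >= min_days
    )


# Made-up history: True means the hub loaded on that day's check.
history = {
    "http://example.org/hubA/hub.txt": [True, False, False, False],
    "http://example.org/hubB/hub.txt": [False, False, True, False],
    "http://example.org/hubC/hub.txt": [True, True, True, True],
}
print(hubs_to_follow_up(history))
```

Only the current outage is counted, so a hub that recovered and failed again for a single day is not flagged.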