Public Hub QA: Difference between revisions

(→‎Push to the RR: updating trash directories for hub searching)
(→‎What to do if a Public Hub is down?: Updating section format)
 
(77 intermediate revisions by 11 users not shown)
Line 1: Line 1:
==Overview==
== Overview ==
===Process Overview===
Public hubs are wrangled and QA'd by the UCSC GB QA team. The QAer should work directly with the data submitter from the start to get the track hub into the correct format. The QAer then QAs the hub (a light QA compared with native tracks) and releases it, communicating directly with the data contributor throughout the process. If the QAer needs technical help, they should contact the Project Manager or an engineer.
====Philosophy of QAing Data in a Public Hub====
Based on the data, explore whether a data set should overlap or avoid certain regions, such as coding exons (or, depending on the type of data, promoters, areas of open chromatin, common SNPs, highly conserved regions, 5’ UTRs, 3’ UTRs, mitochondria, or the sex chromosomes). If the expectations for the data aren't clear, ask the hub provider for input. Using an idea of what the data should be about, try to identify at least one native track in the Genome Browser that one might expect to correlate or anti-correlate with the tracks in the hub data set, and visually spot check a few.


===Public Hub Overview===
[https://genome.ucsc.edu/cgi-bin/hgHubConnect Public hubs] are track or assembly hubs contributed by the worldwide research community. Public Hubs are wrangled and QA'd by the UCSC GB QA team. The QAer should work directly with the data submitter from the start to get the track hub into the correct format. The QAer then QAs the hub (a light QA compared with native tracks) and releases it, communicating directly with the data contributor throughout the process. If the QAer needs technical help, they should contact the Project Manager or an engineer.


Public hubs are made visible by a line in the hgcentral*.hubPublic table, e.g.:
Public hubs are made visible by a line in the hubPublic table that the QA-er will add to the various hgcentral* databases. For example:


<pre>
<pre>
+--------------------------------------+-----------------------------+------------------------------------------------------------------------------------------------------------------+---------------------+---------+--------+
+----------------------------------------------+-------------------+---------------------------------------+---------------------+---------+--------+--------------------------------+
| hubUrl                               | shortLabel                 | longLabel                                                                                                       | registrationTime    | dbCount | dbList |
| hubUrl                                       | shortLabel        | longLabel                             | registrationTime    | dbCount | dbList | descriptionUrl                 |
+--------------------------------------+-----------------------------+------------------------------------------------------------------------------------------------------------------+---------------------+---------+--------+
+----------------------------------------------+-------------------+---------------------------------------+---------------------+---------+--------+--------------------------------+
| http://johnlab.org/xpad/Hub/UCSC.txt | DRS PolyA site & Expression | Genomewide expression & Polyadenylation landscape of cancer using Direct RNA sequencing (Tissues and Cell lines) | 2012-05-15 09:50:09 |      1 | hg19,  |  
| http://lisanwanglab.org/DASHR/tracks/hub.txt | DASHR small ncRNA | DASHR Human non-coding RNA annotation | 2015-12-20 11:30:47 |      1 | hg19,  | http://lisanwanglab.org/DASHR/ |
+--------------------------------------+-----------------------------+------------------------------------------------------------------------------------------------------------------+---------------------+---------+--------+
+----------------------------------------------+-------------------+---------------------------------------+---------------------+---------+--------+--------------------------------+
</pre>
</pre>


=== UCSC-Hosted Public Hubs ===
== Automation Script ==
After you have manually QA'd the hub tracks and description pages, and the hub [http://genomewiki.ucsc.edu/index.php/Public_Hub_Guidelines#Required_Guidelines passes the requirements], you can run all the table insertion steps below with the following script. Answer "N" to the first question and read the prompts carefully. Note that this script can insert a new hub OR update an old one.
~/genecats/qa/testTools/hubPublicScripts/updateHubPublic
Contact Daniel Schmelter or edit the script yourself for bugs/improvements.


Public hubs created by browser engineers are an alternative to native tracks for specialty data.  This kind of hub is QAed and released slightly differently from externally hosted hubs.  The main difference is that there is an additional step to push the hub files from the development system to the public download server.
== Public Hub QA Process ==


The first step is to review the hub, as described below, with the files hosted on genome-test so the developer can easily change them during QA.  You can do this from the browser on genome.ucsc.edu, adding the hub URL on genome-test (e.g. http://genome-test.cse.ucsc.edu/gbdb/hubs/<your-hub>/hub.test.txt) via the 'My Hubs' panel of the Track Data Hubs page.
The QA process for public hubs is simple: (1) Check that the hub meets our [http://genomewiki.ucsc.edu/index.php/Public_Hub_Guidelines#Required_Guidelines required guidelines], (2) if so, add it to the public list; if not, work with the contributor to ensure it meets those guidelines. Each section below contains more details as to what steps of the QA process you carry out on each of our sites/machines. If applicable, recommend that the hub creator review the [https://genome.ucsc.edu/goldenPath/help/metadata.html metadata guide] to add any extra information for their experiments.


After you have completed review and the developer has made any changes needed, you will request a push of the hub files listed by the developer (e.g. in the file /hive/data/gbdb/hubs/<your-hub>/filelist.txt) to the download server (directory /mirrordata/hubs/<your-hub>). 
=== On hgwdev ===


After sanity checking the hub on hgdownload (e.g. at http://hgdownload.soe.ucsc.edu/hubs/<your-hub>/hub.txt), you can proceed with the instructions below for adding to the hubPublic table. (NOTE: The additional testing on hgwbeta mentioned below is probably unnecessary.)
This is where most of the Public Hub QA process will happen.


==Wrangling Public Hub==
==== 1) Add Hub to the hubPublic table on hgwdev ====
The entry in the hubPublic table on hgwdev (i.e. in hgcentraltest) will be added by the QAer and that QAer will work with the authors to get it into basic acceptable shape before QAing it. If the QAer has technical questions about wrangling the track, s/he should ask the Project Manager or an engineer for advice/assistance.


To add the hub to the hubPublic table on dev, follow the below steps about adding the hub to beta, except log onto the dev MySQL server instead of the beta server:
First, run hubPublicCheck on your hub:
<pre>
hubPublicCheck hubPublic -udcDir=. -addHub=http://lisanwanglab.org/DASHR/tracks/hub.txt
</pre>
The option '''-udcDir=''' is needed to keep people from using the default udcCache directory. Everyone needs to use a separate -udcDir to avoid stepping on each other's toes.
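The -udcDir advice can be sketched as a small shell helper. This is only an illustration: <code>make_udc_dir</code> is a hypothetical name (not part of the kent tools), and the default of <code>$HOME/udcCache</code> is just one of the private locations you could choose.

```shell
#!/usr/bin/env bash
# Sketch: create a private udc cache directory and print its path.
# make_udc_dir is a hypothetical helper, not an existing UCSC script.
make_udc_dir() {
    local dir="${1:-$HOME/udcCache}"   # assumed default; any private directory works
    mkdir -p "$dir" && printf '%s\n' "$dir"
}

# Usage (run by hand on hgwdev):
#   hubPublicCheck hubPublic -udcDir="$(make_udc_dir)" -addHub=http://lisanwanglab.org/DASHR/tracks/hub.txt
```

Because the directory lives under your own home directory, no other user's cache files can block your writes.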
 
This command will generate the MySQL insert statement needed to add this hub to the hubPublic table. Here is an example command output by hubPublicCheck:
<pre>
insert into hubPublic (hubUrl,shortLabel,longLabel,registrationTime,dbCount,dbList) values ("http://zlab.umassmed.edu/zlab/publications/UMassMedZHub/hub.txt","UMassMed ZHub", "UMassMed H3K4me3 ChIP-seq data for Autistic brains",now(),1, "hg19,");
</pre>
 
Then, use hgsql to insert the line for this hub into the hubPublic table on dev (or use Daniel's above script to update dev/beta/RR):
<pre>
hgsql -e '<your hubPublicCheck command here>' hgcentraltest
</pre>
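Before pasting the statement into <code>hgsql -e '...'</code>, it can help to sanity check it. The sketch below is an assumption-laden illustration, not part of hubPublicCheck: <code>is_safe_hubpublic_insert</code> is a hypothetical helper that only checks the overall shape of the statement and rejects any single quote, since a single quote would end the single-quoted <code>-e</code> argument early.

```shell
#!/usr/bin/env bash
# Sketch: check that a statement looks like hubPublicCheck output and can be
# wrapped in single quotes. is_safe_hubpublic_insert is a hypothetical helper.
is_safe_hubpublic_insert() {
    local stmt="$1"
    case "$stmt" in
        *"'"*) return 1 ;;                          # a single quote would break -e '...'
        "insert into hubPublic "*";") return 0 ;;   # expected shape of the statement
        *) return 1 ;;
    esac
}

# Usage (run by hand on hgwdev):
#   is_safe_hubpublic_insert "$stmt" && hgsql -e "$stmt" hgcentraltest
```

If a label legitimately contains an apostrophe, the statement needs to be handled by hand rather than pasted between single quotes.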
 
==== 2) Display the hub in the Genome Browser and do some minimal QA ====
 
After adding the new hub to the hubPublic table, connect the hub and view in the Genome Browser. The primary QA you will do is ensuring that the hub meets our [http://genomewiki.ucsc.edu/index.php/Public_Hub_Guidelines#Required_Guidelines required guidelines], such as checking for track descriptions with contact information.


hubPublicCheck -addHub=http://johnlab.org/xpad/Hub/UCSC.txt hubPublic
Based on the data, you should explore whether a dataset should overlap or avoid certain regions, such as coding exons (or, depending on the type of data, promoters, areas of open chromatin, common SNPs, highly conserved regions, 5’ UTRs, 3’ UTRs, mitochondria, or the sex chromosomes). If the expectations for the data aren't clear, ask the hub provider for input. Using an idea of what the data is describing, try to identify at least one native track in the Genome Browser that one might expect to correlate or anti-correlate with the tracks in the hub dataset, and visually spot check a few.


[With commandline utilities that use udcCache, you have to specify the option -udcDir= so that it
You should also look for [http://genomewiki.ucsc.edu/index.php/Public_Hub_Guidelines#Recommended_Guidelines recommended guidelines] that the hub violates. Among these, note the ones that, if fixed, would greatly improve the usability of the hub for our users. For example, if a hub contains 300 tracks that aren’t organized into composites or superTracks, you should recommend that the contributor group the tracks in a reasonable manner.
uses some place other than the nearly useless default value /tmp/udcCache, since some other user
almost always beats  you to it and you can't write there anymore. So, when running hubCheck, specify either "-udcDir=."  or  "-udcDir=$HOME" or "-udcDir=$HOME/udcCache"]


This outputs the insert statements you will need to insert into hgcentraltest to change the hubPublic table. The output should look something like:
===== Notes for Assembly Hubs =====


In addition to the normal public track hub QA, there are a few things you should pay attention to:
The QA for a public Assembly Hub is very similar to that of a public Track Hub, although there are a few things unique to an Assembly Hub.
* Check that the 2bit files exist and that the genomes.txt file points to the correct 2bit
* Make sure the correct labels show up in drop down menus on the gateway page
* Make sure contact information is clearly displayed on all description pages (gateway description and track description pages)
* '''Warning:'''  Be sure to check ''all your assemblies on the RR before you release''.  We have a lot of unreleased assemblies on hgwdev. A hub developer once developed a hub with preview assemblies, which worked fine on hgwdev, but on the RR such assemblies failed because the data wasn't there.
** Also note that some hub developers might try to sneak in an assembly hub without understanding assembly hubs correctly; see #20761, where <code>genome hub_10649_araTha1 </code> was used to try to reference another assembly hub (someday we might support this).
==== 3) Pass feedback to the hub contributor ====
If, during the previous step, you encountered issues with the hub or noticed that it violates some of our required guidelines, pass these on to the hub contributor in an email. When contacting the hub contributor, lay out your feedback in a clear and concise manner, such as a numbered list. It’s often helpful to provide a solution as well as pointing out the issue, especially if the issue is the misuse of a trackDb tag.
For our “recommended” guidelines, you should only pass on feedback for those items that would greatly increase the usability of the hub, for example, organizing hundreds of loose tracks into a superTrack or composite. When passing these along, note that the contributor isn’t required to make the changes, but that they would greatly increase the usefulness of the hub.
=== On hgwbeta ===
==== 1) Add the hub to the hubPublic table on hgwbeta ====
Same as the step for adding this hub to hubPublic on hgwdev.
Note the quotation marks (' ' vs " "): wrap the statement in single quotes (' ')
because the output of hubPublicCheck contains double quotes (" ").
hgsql -h hgwbeta -e '<insert hubPublicCheck output>' hgcentralbeta
==== 2) Cursory QA on beta ====
Be sure that the hub and all its tracks load properly on beta. The best way to check this is by clicking the “hide all” button on hgTracks and then navigating to the “Configure” page. On the Configure page, click the “show all” button on the track group for your track hub and then click “submit”. Check that all of the tracks load and that you don’t see any yellow error messages indicating that there were issues loading certain tracks.
=== Release to the RR ===
Use the same insert statement that you used to add this hub to hubPublic on hgwbeta to add this hub to the hubPublic table on the RR.
hgsql -h genome-centdb -e '<insert hubPublicCheck output>' hgcentral
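The three insert commands differ only in host and database, which the sketch below makes explicit. <code>hubpublic_insert_cmd</code> is a hypothetical helper that only prints the command for review (it does not run hgsql); the hosts and database names are the ones used in the commands on this page.

```shell
#!/usr/bin/env bash
# Sketch: print (do not run) the hgsql insert command for a given site.
# hubpublic_insert_cmd is a hypothetical helper name.
hubpublic_insert_cmd() {
    local site="$1" stmt="$2"
    case "$site" in
        dev)  printf "hgsql -e '%s' hgcentraltest\n" "$stmt" ;;
        beta) printf "hgsql -h hgwbeta -e '%s' hgcentralbeta\n" "$stmt" ;;
        rr)   printf "hgsql -h genome-centdb -e '%s' hgcentral\n" "$stmt" ;;
        *)    echo "unknown site: $site (expected dev, beta, or rr)" >&2; return 1 ;;
    esac
}

# Usage: review the printed command, then paste it into your shell.
#   hubpublic_insert_cmd beta "$stmt"
```

Printing rather than executing keeps a human in the loop for each site, which matches the dev, then beta, then RR order of the steps on this page.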
'''Note: If your hub has restricted data''' (data only loading on the IPs of certain machines) be sure the Public Hub Provider is given all the IPs of our mirrors:
<pre>
<pre>
mysql> insert into hubPublic (hubUrl,shortLabel,longLabel,registrationTime,dbCount,dbList) values ("http://zlab.umassmed.edu/zlab/publications/UMassMedZHub/hub.txt","UMassMed ZHub", "UMassMed H3K4me3 ChIP-seq data for Autistic brains",now(),1, "hg19,");
128.114.119.* = genome.ucsc.edu
129.70.40.99 = european mirror, genome-euro.ucsc.edu
134.160.84.67 = asian mirror, genome-asia.ucsc.edu
128.114.198.32 = genome-test.gi.ucsc.edu, used by developers and for debugging
</pre>
</pre>


Once you have this text, go into hgcentraltest (hgsql hgcentraltest) and paste in the output of hubPublicCheck at the mysql prompt
=== Post-RR release  ===
==== 1) Rebuild and push hub search files and tables (Deprecated mid-2021) ====


You may wish to review engineer feedback providing [http://genomewiki.ucsc.edu/index.php/Public_Hub_Guidelines Public Hub Guidelines] to refresh areas to examine closely.
<h2><span style="color:red"> These steps are now done automatically over each weekend and can be ignored by QA. They serve as a reference for what once was a time-consuming, manual process.</span></h2>


Note, if the hub submitter is looking to limit IP address access to only the Genome Browser public site, don't forget to include genome-euro. You can find the IP address for a server by using the  <tt>host <server name></tt> command, e.g. <tt>host genome-euro.ucsc.edu</tt>.
'''With new Public Hubs (especially with descriptionUrls), once they are on the beta/RR, be sure to build and push an update of the index files'''. <br>
A. To build these files navigate to hive and run the doPublicCrawl script (you might want to use <code>nohup</code> to let this run in the background). The script took approximately 9 hours on 3/10/21.
<pre>cd /hive/groups/browser/hubCrawl
nohup ./doPublicCrawl 
</pre>


==Stage the hub on Beta==
B. To make life easier for the admins, send a separate email requesting the table push:
Begin by running hubPublicCheck to generate the insert statement needed to add the line to hgcentralbeta.hubPublic. You will need to get the hub url either from the redmine ticket or from the instance of it on hgwdev i.e.:
<pre>
SUBJECT:Push hub search table to hgwbeta
Please push the following table


hubSearchText


hubPublicCheck -addHub=http://johnlab.org/xpad/Hub/UCSC.txt hubPublic
from the hgcentraltest database on hgwdev to the hgcentralbeta database on hgwbeta


[With commandline utilities that use udcCache, you have to specify the option -udcDir= so that it
After the table has been pushed, please 'flush tables' on hgwbeta.  
uses some place other than the nearly useless default value /tmp/udcCache, since some other user
almost always beats  you to it and you can't write there anymore. So, when running hubCheck, specify either "-udcDir=."  or  "-udcDir=$HOME" or "-udcDir=$HOME/udcCache"]


This outputs the insert statements you will need to insert into hgcentralbeta to change the hubPublic table. The output should look something like:
Reason: Updating the public hub search table on hgwbeta, refs #
</pre>
C. On Beta you should now be able to search parts of the text on the new hub's descriptionUrl, hub's short or long label, assembly, or track labels. For example a search of '''methpipe''' matches a line on the DNA Methylation descriptionUrl: http://smithlabresearch.org/software/methbase/ or '''hg38''' pulls up all the hg38 hubs.


D. To make life easier for the admins, send a separate email requesting the table push:
<pre>
<pre>
mysql> insert into hubPublic (hubUrl,shortLabel,longLabel,registrationTime,dbCount,dbList) values ("http://zlab.umassmed.edu/zlab/publications/UMassMedZHub/hub.txt","UMassMed ZHub", "UMassMed H3K4me3 ChIP-seq data for Autistic brains",now(),1, "hg19,");
SUBJECT:  Push hubSearchText table to RR
Hello pushers,
 
Please push the following table:
 
hubSearchText
 
in the hgcentralbeta database on hgwbeta to hgcentral on genome-centdb/euro/asia.
After the table has been pushed, please 'flush tables' on genome-centdb/euro/asia.
 
Reason: Updating the hub search table on the RR, refs #
</pre>
</pre>


Once you have this text, go into hgcentralbeta (hgsql -h mysqlbeta hgcentralbeta) and paste in the output of hubPublicCheck at the mysql prompt. This should cause the hub you are testing to appear on beta immediately (i.e. you don't need to do a make).  (Before adding you can use the commands "mysql> select * from hubPublic \G" to show all and again afterward to confirm addition).
==== 2) Notify hub contributor ====


Note, make sure that the order of the assemblies in the dbList field matches the order of assemblies in the hub's genomes.txt file. The first assembly in the dbList field determines the default assembly that will show up when connecting the assembly hub.
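One way to compare the order is to derive the expected dbList directly from the hub's genomes.txt. The sketch below assumes the usual <code>genome &lt;db&gt;</code> stanza lines and assumes that the comma-terminated form shown in the example row (<code>hg19,</code>) extends to multiple assemblies; <code>dblist_from_genomes</code> is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print the expected dbList (line 1) and dbCount (line 2) for a
# local copy of genomes.txt. dblist_from_genomes is a hypothetical helper.
dblist_from_genomes() {
    # Collect the value of every "genome <db>" line, in file order.
    awk '$1 == "genome" { printf "%s,", $2; n++ } END { printf "\n%d\n", n }' "$1"
}

# Usage: download the hub's genomes.txt, then compare the output with the
# dbList and dbCount values in the hubPublic row.
#   dblist_from_genomes genomes.txt
```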
Contact the hub contributor and let them know that they can contact our internal mailing list (genome-www) with any questions or concerns.


==Cursory QA on beta==
==== 3) Create a Public Session and Image caption for announcements ====
Once the hub is staged on beta do a minimal round of QA - including:


*Making sure the tracks open.
Create a snapshot image that you'll use on a Twitter/Facebook post and also create a Public Session with a very short description.
*Make sure there aren't too many tracks on by default - the hub should load quickly, if not you might need to ask the contributor to reduce the number of tracks on by default
*Checking that tracks have description pages.
*Review the shortLabels to see if any need to be shortened by displaying all tracks in dense.  The shortLabel text should be under 17 characters, or meaningful information may be cut off from display.
*The length for a longLabel should be about 75 characters.
*Making sure that the authors' email address is prominently listed in the description pages (so our users can contact them with questions).
*Take a moment to review the [http://genomewiki.ucsc.edu/index.php/Public_Hub_Guidelines Public Hub Guidelines] to refresh areas to examine closely.
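The label length checks in the list above can be sketched in shell. This is only an illustration: <code>label_problems</code> is a hypothetical helper, and treating 75 characters as a soft upper bound for longLabel is an assumption (the guideline says "about 75").

```shell
#!/usr/bin/env bash
# Sketch: report labels that exceed the suggested lengths.
# label_problems is a hypothetical helper; the 75-character cap is an assumption.
label_problems() {
    local short="$1" long="$2"
    [ "${#short}" -lt 17 ] || echo "shortLabel is ${#short} characters (should be under 17)"
    [ "${#long}" -le 75 ]  || echo "longLabel is ${#long} characters (aim for about 75)"
    return 0
}

# Usage: no output means both labels are within the suggested lengths.
#   label_problems "UMassMed ZHub" "UMassMed H3K4me3 ChIP-seq data for Autistic brains"
```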


==Push to the RR==
Example post for a hub on Twitter:
Once you've verified the hub is functioning and looks reasonable you can "push" it to the RR by performing the analogous insert into the hubPublic table on the RR (i.e. in hgcentral).
::Thanks to @PeteHaitch, @LindsayRizzardi, @KasperDHansen and others at Johns Hopkins for the Brain Epigenome public hub, now available!
::http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=chmalee&hgS_otherUserSessionName=hg19_brainEpigenome …


To go into hgcentral on hgwdev type at the prompt:
Example post for a hub Public Session:
hgsql -h genome-centdb
::Description: This session highlights data from the Brain Epigenome Hub, created by Peter Hickey, Lindsay Rizzardi, and Kaspar Hansen and others at Johns Hopkins University. The hub shows methylation, ATAC-seq, and RNAseq across different brain regions.


Then at the mysql prompt:
==== 4) Send genome-announce email ====
use hgcentral


Paste in the same insert statement you input on beta (you don't need to rerun hubPublicCheck again on the RR - just paste the same text in). '''Updating note:''' If you are updating a hub in the future, because perhaps it has changed genomes.txt since you first added it, you may want to run the <code>hubPublicCheck</code> again to get the updated statement and be sure to update at dev,beta, and the RR (we have a [http://genomewiki.cse.ucsc.edu/genecats/index.php/Monitoring_Tasks hubPublicCheck cronjob] that also checks for when remote hubs make changes).  
Here are some previous examples of [https://groups.google.com/a/soe.ucsc.edu/forum/?hl=en&fromgroups#!searchin/genome-announce/public$20hub announcement emails for public hubs.]   It is an opportunity to share a sentence or two about the lab and data (and maybe thank them for creating the public hub). The news could even be tweeted and added to Facebook too.


Once this is done, update the redmine ticket and notify the hub contributor that their hub is live.
==== 5) Add your hubUrl and email to automatic uptime emailer ====
Log into qateam@hgwdev and add your hub to the bottom of the auto mailer cron file:
/cluster/home/qateam/cronScripts/hubPublicMailStatus.tab
This system is an automatic hubUrl checker that emails the contributor after a hub has been down for 2 days, and no more than once every 6 days. It is run out of qateam's hgwdev crontab.


'''With new Public Hubs (especially with descriptionUrls), once they are on the RR, be sure to build and push an update of the index files'''. <br>
== User Requests: Changing the URL in an Existing Public Hub ==
1. To build these files navigate to hive and run the doPublicCrawl script.
<pre>
cd /hive/groups/browser/hubCrawl
./doPublicCrawl   
</pre>
2. The result will be an updated udcCache directory in /hive/groups/browser/hubCrawl/udcCache and an updated hubSearchText table in hgcentraltest. Ask the admins to push these two things to beta:
<pre>Please push the following directory on hgwdev:


/data/apache/userdata/hubCrawl/udcCache/
If a hub provider asks to change their Public hub's URL to a new address, you can do so with the following script:
<pre>~/kent/src/utils/qa/hubPublicUrlChanger</pre>
The script takes no arguments. It prompts for the information it needs and asks for approval before making any changes.


to the following location on hgnfs1:
'''Note''' that any URL change breaks saved sessions and cart connections to that Public Hub. This is because the hubStatus ID is unique to each URL and needs to be updated in the SQL table. Pointing the old hubStatus ID to the new URL still needs work. Edits to hubStatus can have drastic consequences on the RR. Before making any edits, be sure to check with senior team members and create a backup of hubStatus by downloading the table with a SQL select command (if in doubt, do not edit hubStatus).


/export/userdata/hubCrawl/rr/
== QA for UCSC-hosted public hubs ==
/export/userdata/hubCrawl/beta/


and the following locations on asia/euro:
Public hubs created by browser engineers are an alternative to native tracks for specialty data. These types of hubs are QAed and released in a very similar manner to externally hosted hubs, with a few minor differences.  The main difference is that there is an additional step to push the hub files from the development system to the public download server. UCSC-hosted public hubs were previously hosted on hgwdev in /gbdb/hubs and then pushed to hgdownload for display on the public site. Now we host these hubs on hgwdev in /usr/local/apache/htdocs-hgdownload/hubs/, although they are still pushed to hgdownload for display on the public site. This shift was made to reduce confusion with other /gbdb pushes, which normally go to hgnfs1. In an email thread with cluster-admin, Hiram wrote:
<pre>
Perhaps the confusion arises because we are using the directory /gbdb/hubs/
on hgwdev to construct our symlinks to release these files.


These items have nothing to do with /gbdb/, the hubs are not
under /gbdb/ on hgdownload, they are /hubs/


On hgwdev they are under /gbdb/hubs/ only because /gbdb/ is the location we use to
create symlinks to deliver stuff to the outside world.


We could instead place the symlinks on hgwdev in the directory:
  /usr/local/apache/htdocs-hgdownload/hubs/


Please also push the following table
And thus eliminate any reference to gbdb</pre>


hubSearchText
You should still review the hub on hgwdev as described above. Since we’re the ones hosting and providing these hubs, it’s alright to be a little more strict in regards to our hub guidelines. Once the hub is looking good on hgwdev, you can release it to the RR using the steps described in the next section.


in the hgcentraltest database
=== Releasing UCSC-Hosted big data Public Hubs ===


from hgwdev ---> hgwbeta
This step is rare and only relevant for big data hubs that UCSC is hosting.
</pre>
3. You should now be able to search parts of the text on the new hub's descriptionUrl, hub's short or long label, assembly, or track labels.  For example a search of '''methpipe''' matches a line on the DNA Methylation descriptionUrl: http://smithlabresearch.org/software/methbase/ or '''hg38''' pulls up all the hg38 hubs.


==Releasing UCSC-Hosted big data Public Hubs==
This step is rare and only relevant for big data hubs that UCSC is hosting.
The main reasoning for implementing the change from /gbdb/hubs is to help ensure there isn't confusion with other /gbdb/ pushes (which the admins normally push to hgnfs1 as those files are often used by internal tracks in the RR). These public hubs don't go that route and instead go to hgdownload.
These hubs are made available with a push request from '''/usr/local/apache/htdocs-hgdownload/hubs/''' on hgwdev to '''/mirrordata/hubs''' on hgdownload.  


Line 146: Line 215:
</pre>
</pre>


It may be useful to note that the UCSC GTEx data has restrictions on access. Apache for hgdownload only allows RR to access the hub data files so the hub displays on the RR only (the hub won't load on other sites, and files can not be directly downloaded -other external Public Hubs have taken similar steps to control their data).
It may be useful to note that the UCSC GTEx data have restrictions on access. Apache for hgdownload only allows the RR to access the hub data files, so the hub displays on the RR only (the hub won't load on other sites, and files cannot be directly downloaded; other external Public Hubs have taken similar steps to control their data).


==Send genome-announce email==
== What to do if a Public Hub is down? ==


Here are some previous examples of [https://groups.google.com/a/soe.ucsc.edu/forum/?hl=en&fromgroups#!searchin/genome-announce/public$20hub announcement emails for public hubs.]   It is an opportunity to share a sentence or two about the lab and data (and maybe thank them for creating the public hub). The news could even be tweeted and added to Facebook too.
If you notice that a hub is consistently down for an extended period of time (3-5 days), then you should contact the hub contributor to let them know that their hub is having issues. We keep a page of the contact information for all of our public hubs here: http://genecats.soe.ucsc.edu/qa/test-results/publicHubContactInfo/publicHubContact.html. We also have a [[Monitoring_Tasks|cronjob]] that checks the status of all of our public hubs, so be sure to check in with the person receiving those emails before sending your message.
 
We now have a continuously updating log of down public hubs. It's on the qateam dev login. Check it out here:
[qateam@hgwdev /cluster/home/qateam] vi /hive/users/qateam/perf/hubPublicCheckCron.log
=== Removing public hub from the RR: ===
 
Use hgsql and hubUrl to delete the line from the hubPublic table on the RR:
<pre>
hgsql -h genome-centdb -Ee "delete from hubPublic where hubUrl='...hub.txt' limit 1" hgcentral
</pre>
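Because a delete on the RR is hard to undo, it can be worth previewing the row first. The sketch below only builds the SQL strings; <code>hubpublic_where</code>, <code>hubpublic_preview_sql</code>, and <code>hubpublic_delete_sql</code> are hypothetical helper names, and the URL in the usage note is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: build matching preview and delete statements for one hubUrl.
# All three function names are hypothetical helpers.
hubpublic_where() {
    case "$1" in
        *"'"*) echo "hubUrl must not contain a single quote" >&2; return 1 ;;
    esac
    printf "from hubPublic where hubUrl='%s'" "$1"
}

hubpublic_preview_sql() {
    local w
    w=$(hubpublic_where "$1") || return 1
    printf 'select hubUrl,shortLabel %s\n' "$w"
}

hubpublic_delete_sql() {
    local w
    w=$(hubpublic_where "$1") || return 1
    printf 'delete %s limit 1\n' "$w"
}

# Usage (run by hand; check that the preview shows exactly the row you expect):
#   url=http://example.org/hub.txt   # placeholder URL
#   hgsql -h genome-centdb -Ee "$(hubpublic_preview_sql "$url")" hgcentral
#   hgsql -h genome-centdb -Ee "$(hubpublic_delete_sql "$url")" hgcentral
```

Both statements share the same where clause, so the row you preview is the row the delete will target.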
 
= Hub Public Coordinator Role =
 
One QAer historically has been assigned the role of '''Hub Public Coordinator'''. This role is to maintain an '''updated''' and '''error-free''' listing of Public Hubs by informing hub providers when their hub hosting sites go down. This role has changed a lot from 2018 to 2022 during Dan's tenure. It is now almost completely automated and on the qateam cron at "~/genecats/qa/crontabs/hgwdev.crontab". I will describe the two-part automation below.


==Notes for Public Assembly Hub QA==
1. First is keeping hubs updated: labels/titles and the search index. These used to be manual tasks and are now fully automated.
* The hubPublic labels/titles are updated by the script "/cluster/bin/scripts/hubPublicAutoUpdate", which runs on each of our 3 sites every weekday morning from the QAteam crontab.
* The hubSearchText search index is updated on Dev by the script "/cluster/bin/scripts/hubSearchUpdate" every Tue/Sat and is auto-pushed by the Pushers every Sunday.


Assembly hubs are a new feature released in early 2013. Refer to the [http://genomewiki.ucsc.edu/index.php/Assembly_Hubs Assembly Hubs] page on the public wiki for more info on how to create your own.
2. Second is keeping the hubs error-free. If a hub.txt website link is no longer accessible, you'll see an error in the Public Hub listing in red. Historically, these errors were emailed to a certain person and they'd wait about a week before contacting the hub provider.  


The QA for a public Assembly Hub is very similar to that of a public Track Hub, although there are a few things unique to an Assembly Hub.
This is now automated: hubs are checked every 2 hours, and if a hub returns errors for 24 consecutive checks (48 hours), its contact gets an automatic email. The script that sends that email is "/cluster/bin/x86_64/hubPublicMail", which also runs out of the QA Team crontab. This program will keep emailing the hubContact address every few days until the hub is fixed. If a hub provider doesn't fix their error-prone hub after about 1 month, the hub should be removed from the hubPublic listing.
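The threshold arithmetic (24 failed checks at 2-hour intervals is 48 hours) can be written out as a small sketch. This illustrates the rule as described here and is not the actual logic of hubPublicMail; the variable and function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the notification threshold described above.
# Names are hypothetical; this is not taken from hubPublicMail itself.
CHECK_INTERVAL_HOURS=2    # hubs are checked every 2 hours
FAILS_BEFORE_MAIL=24      # consecutive failed checks before an email is sent

# Hours a hub has been down, given its count of consecutive failed checks.
hours_down() {
    echo $(( $1 * CHECK_INTERVAL_HOURS ))
}

# Succeeds (exit 0) once the failure count reaches the email threshold.
should_mail() {
    [ "$1" -ge "$FAILS_BEFORE_MAIL" ]
}

# Usage:
#   should_mail 24 && echo "down for $(hours_down 24) hours, email the hub contact"
```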


* Check that the 2bit files exist and that the genomes.txt file points to the correct 2bit
For reference, it is worth noting that we have not historically heckled hub providers over broken tracks (of which there are many), only over an overall hub being down. Also, the hub contact list is here:
* Make sure the correct labels show up in drop down menus on the gateway page
https://genecats.gi.ucsc.edu/qa/test-results/publicHubContactInfo/publicHubContact.html
* Make sure contact information is clearly displayed on all description pages (gateway description, and the track description pages)
* Strongly suggest the hub add these settings in each genome's entry in genomes.txt (You can explain to them that the last 3 settings will make it easier to find each assembly's hub species in hgGateway by UI search) :
** defaultPos, scientificName, organism, description


* '''Assembly Hub QA Warning:'''  Be sure to check ''all your assemblies on the RR before you release''.  We have a lot of unreleased assemblies on hgwdev. A hub developer once developed a hub with preview assemblies, which worked fine on hgwdev, but on the RR such assemblies failed because the data wasn't there.
As the person in charge, you should be a MAILTO on each of these 3 cronjobs. Best of luck!


Then run through the basic public Track Hub QA talked about in the above sections.


[[Category:Browser QA tracks]]
[[Category:Browser QA]]

Latest revision as of 20:04, 28 March 2023

Overview

Public hubs are track or assembly hubs contributed by the worldwide research community. They are wrangled and QA'd by the UCSC GB QA team. The QAer should work directly with the data submitter from the start to get the hub into the correct format. The QAer then QAs the hub (a light QA compared with native tracks) and releases it, communicating directly with the data contributor throughout the process. If the QAer needs technical help, they should contact the Project Manager or an engineer.

Public hubs are made visible by a line in the hubPublic table that the QA-er will add to the various hgcentral* databases. For example:

+----------------------------------------------+-------------------+---------------------------------------+---------------------+---------+--------+--------------------------------+
| hubUrl                                       | shortLabel        | longLabel                             | registrationTime    | dbCount | dbList | descriptionUrl                 |
+----------------------------------------------+-------------------+---------------------------------------+---------------------+---------+--------+--------------------------------+
| http://lisanwanglab.org/DASHR/tracks/hub.txt | DASHR small ncRNA | DASHR Human non-coding RNA annotation | 2015-12-20 11:30:47 |       1 | hg19,  | http://lisanwanglab.org/DASHR/ |
+----------------------------------------------+-------------------+---------------------------------------+---------------------+---------+--------+--------------------------------+
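
To confirm what is currently stored for a hub, it can help to look up its row by hubUrl. The following is a minimal sketch: the helper name hub_public_select_sql is hypothetical, the column names come from the table above, and it assumes the URL contains no single quotes.

```shell
# Hypothetical helper: print a query for one hub's row in hubPublic.
# Column names are taken from the example table above.
hub_public_select_sql() {
    printf "select hubUrl,shortLabel,dbCount,dbList from hubPublic where hubUrl='%s'" "$1"
}

# Example use on hgwdev (not run here):
#   hgsql -e "$(hub_public_select_sql http://lisanwanglab.org/DASHR/tracks/hub.txt)" hgcentraltest
```

The function only prints SQL, so the query can be reviewed before it is handed to hgsql.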

Automation Script

After you have manually QA'd the hub tracks and description pages and the hub passes the requirements, you can run all the table insertion steps below with the following script. Hit "N" for the first question and read the prompts carefully. Note that this script can insert a new hub OR update an old one.

~/genecats/qa/testTools/hubPublicScripts/updateHubPublic

Contact Daniel Schmelter or edit the script yourself for bugs/improvements.

Public Hub QA Process

The QA process for public hubs is simple: (1) Check that the hub meets our required guidelines, (2) if so, add it to the public list; if not, work with the contributor to ensure it meets those guidelines. Each section below contains more details as to what steps of the QA process you carry out on each of our sites/machines. If applicable, recommend that the hub creator review the metadata guide to add any extra information for their experiments.

On hgwdev

This is where most of the Public Hub QA process will happen.

1) Add Hub to the hubPublic table on hgwdev

First, run hubPublicCheck on your hub:

 hubPublicCheck hubPublic -udcDir=. -addHub=http://lisanwanglab.org/DASHR/tracks/hub.txt 

The -udcDir= option is needed to keep people from using the default udcCache directory; each of us needs a separate -udcDir to avoid stepping on each other's toes.

This command will generate the MySQL insert statement needed to add this hub to the hubPublic table. Here is an example command output by hubPublicCheck:

insert into hubPublic (hubUrl,shortLabel,longLabel,registrationTime,dbCount,dbList) values ("http://zlab.umassmed.edu/zlab/publications/UMassMedZHub/hub.txt","UMassMed ZHub", "UMassMed H3K4me3 ChIP-seq data for Autistic brains",now(),1, "hg19,");

Then, use hgsql to insert the line for this hub into the hubPublic table on dev (or use Daniel's above script to update dev/beta/RR):

hgsql -e '<your hubPublicCheck command here>' hgcentraltest
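
One way to keep udc caches separate is to give each run its own scratch directory. This is a sketch only: make_udc_dir is a hypothetical helper, and the commented usage assumes the insert statement is the only thing hubPublicCheck prints on stdout.

```shell
# Hypothetical helper: create a private scratch directory for -udcDir
# and print its path. Uses TMPDIR if set, otherwise /tmp.
make_udc_dir() {
    mktemp -d "${TMPDIR:-/tmp}/udcCache.XXXXXX"
}

# Example use on hgwdev (not run here):
#   udc=$(make_udc_dir)
#   sql=$(hubPublicCheck hubPublic -udcDir="$udc" -addHub=http://lisanwanglab.org/DASHR/tracks/hub.txt)
#   echo "$sql"                      # review the insert statement first
#   hgsql -e "$sql" hgcentraltest
#   rm -rf "$udc"
```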

2) Display the hub in the Genome Browser and do some minimal QA

After adding the new hub to the hubPublic table, connect the hub and view in the Genome Browser. The primary QA you will do is ensuring that the hub meets our required guidelines, such as checking for track descriptions with contact information.

Based on the data, explore whether a data set should overlap or avoid certain regions, such as coding exons, promoters, areas of open chromatin, common SNPs, highly conserved regions, 5’ UTRs, 3’ UTRs, mitochondria, or the sex chromosomes, depending on the type of data. If the expectations for the data aren't clear, ask the hub provider for input. Using your idea of what the data describes, try to identify at least one native track in the Genome Browser that you would expect to correlate or anti-correlate with the tracks in the hub dataset, and visually spot-check a few regions.

You should also look for recommended guidelines that the hub violates. For these recommended guidelines, note those that if fixed would greatly improve the usability of the hub for our users. For example, if a hub contains 300 tracks, but they aren’t organized into composites or superTracks, you should recommend that they group their tracks in a reasonable manner.

Notes for Assembly Hubs

The QA for a public Assembly Hub is very similar to that of a public Track Hub. In addition to the normal public track hub QA, there are a few things unique to an Assembly Hub that you should pay attention to:

  • Check that the 2bit files exist and that the genomes.txt file points to the correct 2bit
  • Make sure the correct labels show up in drop down menus on the gateway page
  • Make sure contact information is clearly displayed on all description pages (gateway description and track description pages)
  • Warning: Be sure to check all your assemblies on the RR before you release. We have a lot of unreleased assemblies on hgwdev. A hub developer once built a hub on preview assemblies, which worked fine on hgwdev but failed on the RR because the data wasn't there.
    • Also note that some hub developers might try to submit an assembly hub without fully understanding how assembly hubs work. See #20761, where genome hub_10649_araTha1 was used to try to reference another assembly hub (someday we might support this).
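
For the 2bit check in the list above, a quick listing of which 2bit file each genome points to can be a useful starting point. This is a sketch under assumptions: list_two_bit_paths is a hypothetical helper, it expects a local copy of genomes.txt, and it relies on the standard genome and twoBitPath settings in each stanza.

```shell
# Hypothetical helper: print "genome<TAB>twoBitPath" for each stanza
# of a local genomes.txt file.
list_two_bit_paths() {
    awk '
        $1 == "genome"     { genome = $2 }
        $1 == "twoBitPath" { print genome "\t" $2 }
    ' "$1"
}

# Example use (not run here):
#   list_two_bit_paths genomes.txt
```

A genome that is missing from the output has no twoBitPath line in its stanza, which is worth raising with the hub contributor.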

3) Pass feedback to the hub contributor

If during the previous step, you encountered issues with the hub or noticed that it violates some of our required guidelines, pass these on to the hub contributor in an email. When contacting the hub contributor, lay out your feedback in a clear and concise manner, such as through a numbered list. Often it’s helpful to not just point out the issues but to provide a solution as well, especially if it’s the misuse of a trackDb tag.

For our “recommended” guidelines, you should only pass on feedback for those items that would greatly increase the usability of the hub. For example, organizing hundreds of loose tracks into a superTrack or composite. When passing along these you should note that the contributor isn’t required to change these things, but that it would greatly increase the usefulness of their hub.

On hgwbeta

1) Add the hub to the hubPublic table on hgwbeta

Same as the step for adding this hub to hubPublic on hgwdev. Note the quotation marks: wrap the statement in single quotes (' ') because the output of hubPublicCheck contains double quotes (" ").

hgsql -h hgwbeta -e '<insert hubPublicCheck output>' hgcentralbeta

2) Cursory QA on beta

Be sure that the hub and all its tracks load properly on beta. The best way to check this is by clicking the “hide all” button on hgTracks and then navigating to the “Configure” page. On the Configure page, click the “show all” button on the track group for your track hub and then click “submit”. Check that all of the tracks load and that you don’t see any yellow error messages indicating that there were issues loading certain tracks.

Release to the RR

Use the same insert statement that you used to add this hub to hubPublic on hgwbeta to add this hub to the hubPublic table on the RR.

hgsql -h genome-centdb -e '<insert hubPublicCheck output>' hgcentral
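
The dev, beta, and RR steps differ only in host and database, so the three commands can be generated from one statement. A minimal dry-run sketch follows: print_insert_cmd and the tier names dev, beta, and rr are hypothetical, and it assumes the SQL contains no single quotes (the hubPublicCheck output uses double quotes).

```shell
# Hypothetical dry-run helper: print (do not run) the hgsql command for a tier.
# Hosts and database names are the ones used in the steps above.
print_insert_cmd() {
    tier=$1
    sql=$2
    case "$tier" in
        dev)  printf "hgsql -e '%s' hgcentraltest\n" "$sql" ;;
        beta) printf "hgsql -h hgwbeta -e '%s' hgcentralbeta\n" "$sql" ;;
        rr)   printf "hgsql -h genome-centdb -e '%s' hgcentral\n" "$sql" ;;
        *)    echo "unknown tier: $tier" >&2; return 1 ;;
    esac
}

# Example use (prints three commands for review):
#   for t in dev beta rr; do print_insert_cmd "$t" "$sql"; done
```

Printing rather than executing keeps the RR insert a deliberate, separate step.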

Note: If your hub has restricted data (data only loading on the IPs of certain machines) be sure the Public Hub Provider is given all the IPs of our mirrors:

128.114.119.* = genome.ucsc.edu
129.70.40.99 = European mirror, genome-euro.ucsc.edu
134.160.84.67 = Asian mirror, genome-asia.ucsc.edu
128.114.198.32 = genome-test.gi.ucsc.edu, used by developers and for debugging
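
When a provider reports blocked requests, it can help to check whether an address from their logs is one of ours. This is a rough sketch: is_ucsc_browser_ip is a hypothetical helper, the addresses are copied from the list above, and the wildcard is a simple prefix match rather than strict IP validation.

```shell
# Hypothetical helper: succeed if the address is in the list given above.
is_ucsc_browser_ip() {
    case "$1" in
        128.114.119.*)  return 0 ;;   # genome.ucsc.edu
        129.70.40.99)   return 0 ;;   # genome-euro.ucsc.edu
        134.160.84.67)  return 0 ;;   # genome-asia.ucsc.edu
        128.114.198.32) return 0 ;;   # genome-test.gi.ucsc.edu
        *)              return 1 ;;
    esac
}

# Example use:
#   is_ucsc_browser_ip 128.114.119.7 && echo "one of ours"
```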

Post-RR release

1) Rebuild and push hub search files and tables (Deprecated mid-2021)

These steps are now done automatically over each weekend and can be ignored by QA. They serve as a reference for what once was a time-consuming, manual process.

With new Public Hubs (especially with descriptionUrls), once they are on the beta/RR, be sure to build and push an update of the index files.
A. To build these files navigate to hive and run the doPublicCrawl script (you might want to use nohup to let this run in the background). The script took approximately 9 hours on 3/10/21.

cd /hive/groups/browser/hubCrawl
nohup ./doPublicCrawl  

B. To make life easier for the admins, send a separate push-request email for the table:

SUBJECT:Push hub search table to hgwbeta
Please push the following table 

hubSearchText

from the hgcentraltest database on hgwdev to the hgcentralbeta database on hgwbeta

After the table has been pushed, please 'flush tables' on hgwbeta. 

Reason: Updating the public hub search table on hgwbeta, refs #

C. On Beta you should now be able to search parts of the text on the new hub's descriptionUrl, hub's short or long label, assembly, or track labels. For example a search of methpipe matches a line on the DNA Methylation descriptionUrl: http://smithlabresearch.org/software/methbase/ or hg38 pulls up all the hg38 hubs.

D. To make life easier for the admins, send a separate push-request email for the table:

SUBJECT:  Push hubSearchText table to RR
Hello pushers,

Please push the following table:

hubSearchText

in the hgcentralbeta database on hgwbeta to hgcentral on genome-centdb/euro/asia.
After the table has been pushed, please 'flush tables' on genome-centdb/euro/asia.

Reason: Updating the hub search table on the RR, refs #

2) Notify hub contributor

Contact the hub contributor and let them know that they can contact our internal mailing list (genome-www) with any questions or concerns.

3) Create a Public Session and Image caption for announcements

Create a snapshot image that you'll use on a Twitter/Facebook post and also create a Public Session with a very short description.

Example post for a hub on Twitter:

Thanks to @PeteHaitch, @LindsayRizzardi, @KasperDHansen and others at Johns Hopkins for the Brain Epigenome public hub, now available!
http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=chmalee&hgS_otherUserSessionName=hg19_brainEpigenome

Example post for a hub Public Session:

Description: This session highlights data from the Brain Epigenome Hub, created by Peter Hickey, Lindsay Rizzardi, Kasper Hansen, and others at Johns Hopkins University. The hub shows methylation, ATAC-seq, and RNAseq across different brain regions.

4) Send genome-announce email

Here are some previous examples of announcement emails for public hubs. It is an opportunity to share a sentence or two about the lab and data (and maybe thank them for creating the public hub). The news could even be tweeted and added to Facebook too.

5) Add your hubUrl and email to automatic uptime emailer

Log into qateam@hgwdev and add your hub to the bottom of the auto mailer cron file:

/cluster/home/qateam/cronScripts/hubPublicMailStatus.tab

This system is an automatic hubUrl checker that emails the contributor once their hub has been down for 2 days, and then no more than once every 6 days. It is run out of qateam's hgwdev crontab.

User Requests: Changing the URL in an Existing Public Hub

If a hub provider asks to change their Public hub's URL to a new address, you can do so with the following script:

~/kent/src/utils/qa/hubPublicUrlChanger

The script takes no arguments; it prompts for the information it needs and asks for approval before making any changes.

Note that any URL change breaks saved sessions and cart connections to that Public Hub, because the hubStatus ID is unique to each URL and would need to be updated in the SQL table. Pointing the old hubStatus ID to the new URL still needs work. Edits to hubStatus can have drastic consequences for the RR: before making any edits, check with senior team members and create a backup of hubStatus by downloading the table with a SQL select command. If in doubt, do not edit hubStatus.
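
A backup of hubStatus might be taken along these lines. This is only a sketch: backup_file_name is a hypothetical helper, the file naming scheme is an assumption, and the commented command assumes the RR copy of the table lives in hgcentral on genome-centdb, as in the release step.

```shell
# Hypothetical helper: build a dated backup file name.
#   $1 = table name, $2 = database name, $3 = date stamp
backup_file_name() {
    printf '%s.%s.%s.tab' "$1" "$2" "$3"
}

# Example use (not run here):
#   out=$(backup_file_name hubStatus hgcentral "$(date +%Y-%m-%d)")
#   hgsql -h genome-centdb -N -e 'select * from hubStatus' hgcentral > "$out"
```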

QA for UCSC-hosted public hubs

Public hubs created by browser engineers are an alternative to native tracks for specialty data. These hubs are QAed and released in much the same way as externally hosted hubs. The main difference is an additional step to push the hub files from the development system to the public download server. UCSC-hosted public hubs were previously hosted on hgwdev in /gbdb/hubs and then pushed to hgdownload for display on the public site. Now we host these hubs on hgwdev in /usr/local/apache/htdocs-hgdownload/hubs/, although they are still pushed to hgdownload for display on the public site. This shift was made to reduce confusion with other /gbdb pushes, which normally go to hgnfs1. In an email thread with cluster-admin, Hiram wrote:

Perhaps the confusion arises because we are using the directory /gbdb/hubs/
on hgwdev to construct our symlinks to release these files.

These items have nothing to do with /gbdb/, the hubs are not
under /gbdb/ on hgdownload, they are /hubs/ 

On hgwdev they are under /gbdb/hubs/ only because /gbdb/ is the location we use to
create symlinks to deliver stuff to the outside world. 

We could instead place the symlinks on hgwdev in the directory:
  /usr/local/apache/htdocs-hgdownload/hubs/

And thus eliminate any reference to gbdb

You should still review the hub on hgwdev as described above. Since we’re the ones hosting and providing these hubs, it’s alright to be a little more strict in regards to our hub guidelines. Once the hub is looking good on hgwdev, you can release it to the RR using the steps described in the next section.

Releasing UCSC-Hosted big data Public Hubs

This step is rare and only relevant for big data hubs that UCSC is hosting.

These hubs are made available with a push request from /usr/local/apache/htdocs-hgdownload/hubs/ on hgwdev to /mirrordata/hubs on hgdownload.

Here's an example push request:

Please push the following file:

/usr/local/apache/htdocs-hgdownload/hubs/newHub/*

from hgwdev --> hgdownload/hgdownload-sd
    (in path, "/usr/local/apache/htdocs-hgdownload/" should become "/mirrordata/" on hgdownload)

Note that items that are symlinked on hgwdev should become real files on hgdownload. 

Reason:  Releasing new UCSC hosted hub newHub to hgdownload.

Thanks!
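
The path rewrite described in the push request can be checked with a small helper before sending the email. This is a sketch: to_hgdownload_path is a hypothetical name, and the two prefixes are the ones given in the request above.

```shell
# Hypothetical helper: map an hgwdev path to the path it will have on hgdownload.
to_hgdownload_path() {
    prefix=/usr/local/apache/htdocs-hgdownload/
    case "$1" in
        "$prefix"*) printf '/mirrordata/%s\n' "${1#"$prefix"}" ;;
        *)          echo "not under $prefix: $1" >&2; return 1 ;;
    esac
}

# Example use:
#   to_hgdownload_path /usr/local/apache/htdocs-hgdownload/hubs/newHub/hub.txt
```

Paths outside the hgdownload staging directory are rejected with a non-zero exit status instead of being passed through unchanged.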

It may be useful to note that the UCSC GTEx data have access restrictions. Apache on hgdownload only allows the RR to access the hub data files, so the hub displays on the RR only: it won't load on other sites, and the files cannot be downloaded directly. Other external Public Hubs have taken similar steps to control their data.

What to do if a Public Hub is down?

If you notice that a hub is consistently down for an extended period of time (3-5 days), then you should contact the hub contributor to let them know that their hub is having issues. We keep a page of the contact information for all of our public hubs here: http://genecats.soe.ucsc.edu/qa/test-results/publicHubContactInfo/publicHubContact.html. We also have a cronjob that checks the status of all of our public hubs, so be sure to check in with the person receiving those emails before sending your message.

We now have a continuously updating log of down public hubs. It's on the qateam dev login. Check it out here:

[qateam@hgwdev /cluster/home/qateam] vi /hive/users/qateam/perf/hubPublicCheckCron.log

Removing public hub from the RR:

Use hgsql and hubUrl to delete the line from the hubPublic table on the RR:

hgsql -h genome-centdb -Ee "delete from hubPublic where hubUrl='...hub.txt' limit 1" hgcentral
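
Before deleting, it is worth confirming that the hubUrl matches exactly one row. A minimal sketch, with hypothetical helper names and the assumption that the URL contains no single quotes:

```shell
# Hypothetical helpers: print the SQL for a pre-delete check and for the delete.
hub_count_sql() {
    printf "select count(*) from hubPublic where hubUrl='%s'" "$1"
}

hub_delete_sql() {
    printf "delete from hubPublic where hubUrl='%s' limit 1" "$1"
}

# Example use on the RR (not run here); "url" holds the full hub.txt URL:
#   hgsql -h genome-centdb -Ee "$(hub_count_sql "$url")" hgcentral    # expect 1
#   hgsql -h genome-centdb -Ee "$(hub_delete_sql "$url")" hgcentral
```

A count of 0 usually means the URL was mistyped, in which case the delete would silently do nothing.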

Hub Public Coordinator Role

One QAer historically has been assigned the role of Hub Public Coordinator. This role is to maintain an updated and error-free listing of Public Hubs by informing hub providers when their hub hosting sites go down. This role has changed a lot from 2018 to 2022 during Dan's tenure. It is now almost completely automated and on the qateam cron at "~/genecats/qa/crontabs/hgwdev.crontab". I will describe the two-part automation below.

1. First is keeping hubs updated: labels/titles and the search index. These used to be manual tasks and are now fully automated.

  • The hubPublic labels/titles are updated by the script "/cluster/bin/scripts/hubPublicAutoUpdate", which runs on each of our 3 sites every weekday morning from the QAteam crontab.
  • The hubSearchText search index is updated on Dev by the script "/cluster/bin/scripts/hubSearchUpdate" every Tue/Sat and is auto-pushed by the Pushers every Sunday.

2. Second is keeping the hubs error-free. If a hub.txt website link is no longer accessible, you'll see an error in the Public Hub listing in red. Historically, these errors were emailed to a certain person and they'd wait about a week before contacting the hub provider.

This is now automated: hubs are checked every 2 hours, and if a hub returns errors for 24 consecutive checks (48 hours), the hub contact gets an automatic email. The script that sends the email is "/cluster/bin/x86_64/hubPublicMail", which also runs out of the QA Team crontab. The program will keep emailing the hubContact address every few days until the hub is fixed. If a hub provider hasn't fixed their error-prone hub after about 1 month, the hub should be removed from the hubPublic listing.
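
The timing works out as 24 checks x 2 hours = 48 hours. The sketch below only illustrates that arithmetic; it is not how hubPublicMail is implemented, and the function and variable names are hypothetical.

```shell
# Illustration only: values taken from the description above.
CHECK_INTERVAL_HOURS=2      # hubs are checked every 2 hours
FAILURES_BEFORE_EMAIL=24    # consecutive failed checks before the first email

# Hours of downtime implied by a number of consecutive failed checks.
hours_down() {
    echo $(( $1 * CHECK_INTERVAL_HOURS ))
}

# Succeed when the number of consecutive failures reaches the threshold.
should_email() {
    [ "$1" -ge "$FAILURES_BEFORE_EMAIL" ]
}

# Example use:
#   hours_down 24
#   should_email 24 && echo "send email"
```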

For reference, we have not historically heckled hub providers over broken tracks (of which there are many), only over hubs that are down entirely. The hub contact list is here:

https://genecats.gi.ucsc.edu/qa/test-results/publicHubContactInfo/publicHubContact.html

As the person in charge, you should be a MAILTO on each of these 3 cronjobs. Best of luck!