GENCODEqa

QA process for both GENCODE versions (hg19, mm10, mm39) and GENCODE knownGene (hg38, mm39)

This track is 'semi-otto': the original data comes standardized from GENCODE and is then run through an almost entirely automatic script to create the Genome Browser tracks. For this reason, QA is expedited and limited to the following steps. All GENCODE tracks should be QA'd together.

  1. Make the necessary trackDb edits. For GENCODE versions, that means hiding the previous track. Add the new or updated pennantIcon pointing to what will be the new news archive post.
  2. Perform a sanity check on dev: open the track at a single locus alongside a similar gene model track to make sure nothing is obviously wrong. This should take no more than 5 minutes. For knownGene tracks only: click on an item and test all of the hgGene linkouts, reporting any that do not work; also update the description page statistics and any mentions of old assembly numbers, etc.
  3. Push tables/data to hgwbeta. See below for updating the BLAT servers and the targetDb table, then test that you can see and use the knownGene annotation as a target in is-pcr: https://hgwbeta.soe.ucsc.edu/cgi-bin/hgTracks?db=hg38
  4. Sanity check that all the correct tables/files were pushed and that the track displays. This should be as quick as the previous check. There is no need to re-check links on knownGene.
  5. Push tables/data to the RR.
  6. Final sanity check on the RR, as quick as the others.
    1. For knownGene, also update the BLAT servers and the targetDb table, then push the blastTab tables afterwards.
    2. For targetDb, see http://genomewiki.ucsc.edu/genecats/index.php?title=UCSC_Genes_Staging_Process. In essence, run the following for each database being updated (hg38 is the example below):
# Always sanity check first: run the select statement before any delete or load
hgsql -e "select * from targetDb where name like 'hg38Kg%'" hgcentraltest
hgsql -Ne "select * from targetDb where name like 'hg38Kg%'" hgcentraltest > targetDb.dev
hgsql -h hgwbeta -e "select * from targetDb where name like 'hg38Kg%'" hgcentralbeta
hgsql -h hgwbeta -e "delete from targetDb where name='hg38KgSeqV43'" hgcentralbeta
hgsql -h hgwbeta -e "load data local infile 'targetDb.dev' into table targetDb" hgcentralbeta
hgsql -h genome-centdb -e "select * from targetDb where name like 'hg38Kg%'" hgcentral
hgsql -h genome-centdb -e "delete from targetDb where name='hg38KgSeqV43'" hgcentral
hgsql -h genome-centdb -e "load data local infile 'targetDb.dev' into table targetDb" hgcentral
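
Because the dev select uses the pattern 'hg38Kg%', the dump file can pick up more than one row if an older entry is still on dev. A minimal sketch of a pre-load check follows. The function name check_target_dump is hypothetical (it is not part of the UCSC tooling), and it assumes that name is the first tab-separated column of the dump.

```shell
# Hypothetical helper: confirm that a targetDb dump holds exactly one row
# and that its first tab-separated field is the expected target name.
# Assumption: `name` is the first column in the `select *` output.
check_target_dump() {
    file=$1
    expected=$2
    # The dump must exist and be non-empty.
    if [ ! -s "$file" ]; then
        echo "$file is missing or empty" >&2
        return 1
    fi
    # Count the rows; a correct dump has exactly one.
    rows=$(wc -l < "$file" | tr -d ' ')
    if [ "$rows" -ne 1 ]; then
        echo "expected 1 row in $file, found $rows" >&2
        return 1
    fi
    # Compare the first field against the expected target name.
    name=$(cut -f1 "$file")
    if [ "$name" != "$expected" ]; then
        echo "expected $expected, found $name" >&2
        return 1
    fi
    return 0
}
```

It could be run as check_target_dump targetDb.dev hg38KgSeqV43 before either of the load data commands, and the load skipped if it fails.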

Depending on when the pushes happened, you may also need to update the time in targetDb, which must be newer than the date on the file it references. If so, run the following (using hg38KgSeqV43 as the example):

hgsql -h hgwbeta -e "update targetDb set time=now() where name='hg38KgSeqV43'" hgcentralbeta
hgsql -h genome-centdb -e "update targetDb set time=now() where name='hg38KgSeqV43'" hgcentral
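
To decide whether the update is needed, the targetDb time can be compared against the modification time of the referenced file. The sketch below is only an illustration: the function name target_time_is_stale is hypothetical, it assumes GNU coreutils (date -d and stat -c), a time string in the "YYYY-MM-DD HH:MM:SS" form, and that the database time and the local machine share a timezone.

```shell
# Hypothetical helper: exit 0 when the file's modification time is at or
# after the targetDb time, i.e. when the time in targetDb needs refreshing.
# Assumptions: GNU `date -d` and `stat -c`, and a shared timezone.
target_time_is_stale() {
    db_time=$1   # time value copied from the targetDb row
    file=$2      # path to the file that the row references
    # Convert both timestamps to seconds since the epoch.
    db_epoch=$(date -d "$db_time" +%s) || return 2
    file_epoch=$(stat -c %Y "$file") || return 2
    # Stale when the file is as new as, or newer than, the table entry.
    [ "$file_epoch" -ge "$db_epoch" ]
}
```

If the function succeeds for a given row and file, the update statements above are the ones to run.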

For the BLAT update, first check which host and port are currently in use, e.g.:

$ hgsql -e "select * from blatServers where db like '%hg38Kg%'" hgcentraltest
+--------------+---------------------+-------+---------+--------+---------+
| db           | host                | port  | isTrans | canPcr | dynamic |
+--------------+---------------------+-------+---------+--------+---------+
| hg38KgSeqV43 | blat1a.soe.ucsc.edu | 17915 |       0 |      1 |       0 |
+--------------+---------------------+-------+---------+--------+---------+
$ hgsql -h genome-centdb -e "select * from blatServers where db like '%hg38Kg%'" hgcentral
+--------------+--------+-------+---------+--------+---------+
| db           | host   | port  | isTrans | canPcr | dynamic |
+--------------+--------+-------+---------+--------+---------+
| hg38KgSeqV41 | blat1a | 17909 |       0 |      1 |       0 |
+--------------+--------+-------+---------+--------+---------+

Note: the '.soe.ucsc.edu' suffix is dropped on beta/RR, so when updating those tables only the short host name (blat1a here) is kept.

In this case blat1a is already the host on hgwbeta and the RR, so the host does not need updating; only the port number and the db name do. In other cases you will also need to update the host. You can either delete and reload the row as was done for targetDb above, or edit the two fields in place like so:

hgsql -h hgwbeta -e "update blatServers set port='17915' where db like '%hg38Kg%'" hgcentralbeta
hgsql -h hgwbeta -e "update blatServers set db='hg38KgSeqV43' where db like '%hg38KgSeqV41%'" hgcentralbeta
hgsql -h hgwbeta -e "select * from blatServers where db like '%hg38Kg%'" hgcentralbeta
hgsql -h genome-centdb -e "update blatServers set port='17915' where db like '%hg38Kg%'" hgcentral
hgsql -h genome-centdb -e "update blatServers set db='hg38KgSeqV43' where db like '%hg38KgSeqV41%'" hgcentral
hgsql -h genome-centdb -e "select * from blatServers where db like '%hg38Kg%'" hgcentral
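
When the host changes as well, the three fields can be set in one statement instead of two. The sketch below only prints the SQL so it can be reviewed before being passed to hgsql -e. The function name blat_update_sql is hypothetical, and it matches the old row by exact db name rather than with a like pattern.

```shell
# Hypothetical helper: print a single UPDATE statement for blatServers.
# The fully qualified dev host is shortened by dropping everything from
# the first dot onward (blat1a.soe.ucsc.edu becomes blat1a).
blat_update_sql() {
    old_db=$1
    new_db=$2
    dev_host=$3
    port=$4
    short_host=${dev_host%%.*}
    printf "update blatServers set db='%s', host='%s', port='%s' where db='%s'\n" \
        "$new_db" "$short_host" "$port" "$old_db"
}

# Example using the values shown in the tables above.
blat_update_sql hg38KgSeqV41 hg38KgSeqV43 blat1a.soe.ucsc.edu 17915
# prints: update blatServers set db='hg38KgSeqV43', host='blat1a', port='17915' where db='hg38KgSeqV41'
```

Running the select statement afterwards, as above, is still the way to confirm the result.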

Then test that you can see and use the knownGene annotation as a target in is-pcr: https://genome.ucsc.edu/cgi-bin/hgPcr?db=hg38

Remove the "include knownGene.alpha.ra" line from trackDb.ra, delete the knownGene.alpha.ra file, and update the release tag in knownGene.ra.

Update knownGene.ra:

shortLabel GENCODE V__
longLabel GENCODE V__
...
bigDataUrl /gbdb/hg38/gencode/gencodeV__.bb
...
#externalDb knownGeneV__
html knownGeneV__
...
searchTrix /gbdb/hg38/knownGeneFastV__.ix
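
Since the version appears in several settings, it is easy to miss one. A rough check is to list every distinct version tag in the file and confirm that only the new one is present. The function name ra_versions is hypothetical, and the pattern assumes tags of the form V43 (human) or VM32 (mouse); it may also match unrelated strings of that shape.

```shell
# Hypothetical helper: list the distinct GENCODE-style version tags
# (for example V43 or VM32) that appear anywhere in a .ra file.
ra_versions() {
    grep -oE 'VM?[0-9]+' "$1" | sort -u
}
```

Running ra_versions knownGene.ra should then print a single tag if every setting was updated consistently.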


Lastly, draft the news archive post for all of the tracks together and announce them. For knownGene, let the engineer know in RM that we are ready for the blastTab tables.

If any errors are encountered while performing these steps, raise them with the track developer and ask them to add checks for the error to future builds of the track.