QAing Gencode Genes

From Genecats
Jump to navigationJump to search


This page may be out of date and may need updating.




Background

We swapped our UCSC Genes for GENCODE Genes, this page was added by Luvina in 2015 to capture some of those changes.

Redmine tips

For such a large track, it is extremely helpful to create new Redmine tickets for any problems found and relate it to the main ticket. It is also useful (especially as the release of the track nears) to have a "next gene build" ticket to keep track of problems that aren't going to be dealt with until the next build.


Determining the tables involved

The complete list of tables (for the assembly, proteome and uniProt databases) *should* be in the push queue. However, you should look for tables that shouldn't be pushed (e.g., history, chromInfo, tableDescriptions), and tables that are missing. Compare the current list to the previous list in Gencode Genes tables. Keep that list up to date so that it can be used in future releases.

joinerCheck

Makedoc

The script used to generate Gencode Genes can be found in the source tree in kent/src/hg/makeDb/doc/ucscGenes/; there won't be an entry for it in the assembly.txt makedoc.

QAing the uniProt and proteome databases

See UCSC_Genes_Staging_Process#Details_About_UniProt_and_Proteome_Databases for background information on these databases.

  • Find out which versions of spYYMMDD and proteinsYYMMDD were used to build the gene set. Make sure the symlinks are going to the correct place on hgwdev.
  • If a new download of proteins from uniProt was done for this build, you will need to QA the tables in the spYYMMDD and proteinsYYMMDD databases. Compare the tables present and the row counts for each to the previous databases. Note that the history and tableDescriptions tables *should* be included in the push of these databases.
  • Check the README files in 4 places:
/usr/local/apache/htdocs-hgdownload/goldenPath/proteinDB/
/usr/local/apache/htdocs-hgdownload/goldenPath/proteinDB/proteinsYYMMDD/database/
/usr/local/apache/htdocs-hgdownload/goldenPath/uniProt/
/usr/local/apache/htdocs-hgdownload/goldenPath/uniProt/spYYMMDD/database/
  • Make sure that the UniProt license info is current. Include the new assembly in the goldenPath/proteinDB/proteinsYYMMDD/database/README.txt (and edit it out of the previous file, if applicable). After the push to the RR, you will need to update the "Protein database for <assembly>" link in downloads.html.

QAing the Gene Set

Look at the numbers of genes in the old-to-new table like so:

mysql> select count(*), status from kg8ToKg9 group by status;
+----------+------------+
| count(*) | status     |
+----------+------------+
|    43681 | compatible |
|     9459 | exact      |
|    22088 | none       |
|    28950 | overlap    |
+----------+------------+
4 rows in set (0.10 sec)

Compare the counts to the previous old-to-new table. If the numbers are very different, there may be a problem in the build. Get the developer's opinion if you're not sure what is expected.

Instructions from Jim: Look at about 5 genes for each of these major cases, and make sure the changes make sense:

  • genes that were dropped (status "none")
  • genes that have compatible extensions, meaning that the new gene is the same as the old gene, except at the ends (status "compatible")
  • genes that overlap with old genes, meaning that there is a chunk in the middle somewhere that has changed (status "overlap")
  • genes that are completely new in this build

New genes with status "compatible" or "overlap" will have a note about the way in which they have changed. Most genes have more than one note. You can see the different kinds of notes with mysql queries like this one:

mysql> select oldId, newId, note from kg8ToKg9 where status="overlap" order by rand() limit 5;
+------------+------------+-----------------------------------------------------------------------------------------------+
| oldId      | newId      | note                                                                                          |
+------------+------------+-----------------------------------------------------------------------------------------------+
| uc010qty.2 | uc001lgs.4 | differentNumberOfBlocks,basesAddedToCoding,intronsChanged,basesAddedToUTR                     |
| uc010lxg.1 | uc064mjq.1 | differentNumberOfBlocks,intronsChanged,basesRemovedFromUTR,basesAddedToUTR                    |
| uc010ogx.2 | uc001bug.4 | basesAddedToCoding,intronsChanged,basesAddedToUTR                                             |
| uc001oya.3 | uc001oxx.4 | differentNumberOfBlocks,basesAddedToCoding,intronsChanged,basesAddedToUTR                     |
| uc002rcb.1 | uc002rcc.3 | differentNumberOfBlocks,basesAddedToCoding,intronsChanged,basesRemovedFromUTR,basesAddedToUTR |
+------------+------------+-----------------------------------------------------------------------------------------------+
5 rows in set (0.14 sec)

Finding the genes that are completely new to this build will take a couple of steps. You will want to compare the newIds of the kg#ToKg# table with the names of the knownGene table. Use commands similar to the following:


QAing the many related tables

Compare row counts of each table in the old vs. new gene set and look for anomalously high or low counts that may indicate a problem. Here is an example loop to run on hgwdev to get counts for all tables on both machines (pushQlist is the list of tables to check):

$ for table in `cat pushQlist`
do
  echo hgwdev:
  hgsql -e "select count(*) as $table from $table" hg38 
  echo hgwbeta:
  hgsql -h hgwbeta -e "select count(*) as $table from $table" hg38 
  echo
done