QAing UCSC Genes

From Genecats
Revision as of 00:37, 24 August 2011 by Mary (talk | contribs) (→‎Actual steps: after troubleshooting the script with brooke, redid the section on HGGeneCheck. Do not use "nice" or else it will take 24+ hours to run. git commit ID: aac31ef75585)

wiki pages about known genes

http://genomewiki.cse.ucsc.edu/genecats/index.php/UCSC_Genes_Staging_Process
http://genomewiki.cse.ucsc.edu/genecats/index.php/QAing_UCSC_Genes
http://genomewiki.cse.ucsc.edu/genecats/index.php/Post-Release-Checklist


Determining the tables involved

Background

The first step is to determine the complete list of tables that should be associated with your release. If this is an update, check which tables went out with the previous release of UCSC Genes. UCSC Genes-related tables are found in three places. Compare the tables present in each of them, and list which tables are common to all three and which are missing from one or more:

  • the pushQ:
    • tables missing from the pushQ - if these tables exist they need to be added to the queue, or if not they need to be made
    • extra tables in the pushQ - sometimes tables that no longer go out with UCSC Genes get added automatically to the pushQ entry. Do not assume that a table is meant to go out just because it is in the pushQ entry.
  • kgTables, gsTables, and pbTables: these are master lists of all tables that have ever been associated with UCSC Genes, the Gene Sorter, and the Proteome Browser, respectively. The git-controlled copy of these lives here: kent/src/utils/qa/kgTables [or gsTables/pbTables] (and it gets copied to the analogous location here: /cluster/bin/scripts/*Tables).
    • tables missing from kgTables - if tables associated with your release are missing from this file, they should be added to the list
    • tables in kgTables but not in the current release - this list contains all tables ever released with UCSC Genes, so it will contain many tables that do not apply to your release
  • the tables actually present in the database:
    • tables missing - if you determine that there are missing tables, you will have to contact the developer and ask them to make these tables
    • tables in hgsql but not in the pushQ or kgTables - you may find tables that look like they should go out with UCSC Genes but are not listed in the pushQ or in kgTables. If this is an update, check whether these tables went out with the last release; in either case, check with the developer whether they should be pushed.
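
The three-way comparison described above can be sketched with plain grep and sort. This is only an illustration: it assumes each location has already been dumped to a text file with one table name per line, and the function names and the file names (pushq.tables, db.tables) are hypothetical. The hgsql command in the comment is an assumed way to list the database tables, not something this page prescribes.

```shell
# in_first_only A B: print table names listed in file A but not in file B.
# -x matches whole lines, -F treats names as fixed strings, -f reads them from B.
in_first_only() {
    grep -vxFf "$2" "$1" | sort -u
}

# in_all A B C: print table names that appear in all three files.
in_all() {
    grep -xFf "$2" "$1" | grep -xFf "$3" | sort -u
}

# Example use (file names are placeholders):
#   hgsql -N -e "show tables" hg19 > db.tables     # assumed way to list tables
#   in_first_only pushq.tables db.tables           # in the pushQ but not built
#   in_first_only db.tables pushq.tables           # built but not in the pushQ
#   in_all pushq.tables kgTables db.tables         # present everywhere
```

Tables reported by in_first_only still need a human decision, since kgTables deliberately contains tables that do not apply to every release.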

Actual steps

  1. Run HGGeneCheck

This script basically runs (statement from HGGeneCheck.java):

For all rows in knownGene, view details page.
*  Loops over all assemblies.
*  For all pages viewed, check for non-200 return code.
*  Doesn't click into any links.
*  Doesn't check for HGERROR.

Note: to run it just type HGGeneCheck (if you type HGGeneCheck it won't run). This program defaults to running on hg17 but if you supply a props file you can specify a database for it to run on.

Sample usage statement: nohup HGGeneCheck props > HGrobot.out

Sample props file (note that unlike TrackCheck if you include a zoomcount line HGGeneCheck will not run):

machine hgwdev.cse.ucsc.edu
quick false
dbSpec rn4

Note that this script just checks the hgGene details page. You can therefore change the machine name to a machine other than hgwdev (e.g. hgwdev-demo5) to test CGI changes that are still in development and have not yet been checked into the master branch.
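
If you switch machines or databases often, a small helper can write the props file for you. This is a minimal sketch, assuming the three-line format of the sample props file above is all HGGeneCheck needs; the function name write_hggenecheck_props is made up for illustration.

```shell
# write_hggenecheck_props FILE MACHINE DB
# Writes a props file in the same three-line format as the sample above.
# No zoomcount line is written, since HGGeneCheck will not run with one.
write_hggenecheck_props() {
    cat > "$1" <<EOF
machine $2
quick false
dbSpec $3
EOF
}

# Example use (then run HGGeneCheck as in the sample usage statement):
#   write_hggenecheck_props props hgwdev.cse.ucsc.edu rn4
#   nohup HGGeneCheck props > HGrobot.out
```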

  2. Run knownGene.csh in the same dir as you ran HGGeneCheck but with a different props file:

server hgwdev.cse.ucsc.edu
machine hgwdev.cse.ucsc.edu
quick false
dbSpec mm9
table all

  3. Run joinerCheck on the list of tables from 1.

Tip: there will probably be errors involving a table that is linked to many other tables in your list. joinerCheck then repeats the exact same error for each of the associated tables, which makes the output highly repetitive. An easy way to get a unique, relevant list of the errors is to grep the saved joinerCheck output for "error" and then sort and uniq the result:

grep -i error yourinputfile | sort | uniq > youroutputfile
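
A small variation on the tip above also counts how often each distinct error repeats, which makes the heavily linked table stand out. This is a sketch only: the function name count_errors is hypothetical, and it assumes the joinerCheck output was saved to a file.

```shell
# count_errors FILE: list each distinct line containing "error"
# (case-insensitive), most frequently repeated first, prefixed by its count.
count_errors() {
    grep -i error "$1" | sort | uniq -c | sort -rn
}

# Example use (file names are placeholders):
#   count_errors yourinputfile > youroutputfile
```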

Other random notes

  • Script used to generate UCSC genes can be found here:

kent/src/hg/makeDb/doc/ucscGenes

hgGene Page Source Information

Click on the following link to view a sample hgGene page annotated with the sources of the different components: File:Hg19uc002ypa.2.pdf

Gene Sorter Column Sources

Name | Description | Source
# | Item Number in Displayed List/Select Gene | n/a
Name | Gene Name/Select Gene | kgXref.geneSymbol
UCSC ID | UCSC Transcript ID | knownGene.name
UniProtKB | UniProtKB Protein Display ID | kgXref.spDisplayID or kgXref.spID_organism
UniProtKB Acc | UniProtKB Protein Accession | kgXref.spID
RefSeq | NCBI RefSeq Gene Accession | kgXref.refseq
Entrez Gene | NCBI Entrez Gene/LocusLink ID | knownToLocusLink
GenBank | GenBank mRNA Accession | kgXref.refseq or kgXref.mRNA
Ensembl | Ensembl Transcript ID | knownToEnsembl
GNF Atlas 2 ID | ID of Associated GNF Atlas 2 Expression Data | knownToGnfAtlas2
Gene Category | High Level Gene Category - Coding, Antisense, etc. | kgTxInfo.category
CDS Score | Coding potential score from txCdsPredict | kgTxInfo.cdsScore
VisiGene | UCSC VisiGene In Situ Image Browser | knownToVisiGene
Allen Brain | Allen Brain Atlas In Situ Images of Adult Mouse Brains | knownToAllenBrain & allenBrainUrl
U133 ID | ID of Associated Affymetrix U133 Expression Data | knownToU133
GNF Atlas 2 | GNF Expression Atlas 2 Data from U133A and GNF1H Chips | gnfAtlas2
Max GNF Atlas 2 | Maximum Expression Value of GNF Expression Atlas 2 | calculated?
GNF Atlas 2 Delta | Normalized Difference in GNF Expression Atlas 2 from Selected Gene | gnfAtlas2Distance
BLASTP Bits | NCBI BLASTP Bit Score | knownBlastTab.bitScore
BLASTP E-Value | NCBI BLASTP E-Value | knownBlastTab.evalue
%ID | NCBI BLASTP Percent Identity | knownBlastTab.identity
5' UTR Fold | 5' UTR Fold Energy (Estimated kcal/mol) | foldUtr5.energy
3' UTR Fold | 3' UTR Fold Energy (Estimated kcal/mol) | foldUtr3.energy
Exon Count | Number of Exons (Including Non-Coding) | knownGene.exonCount
Intron Size | Size of biggest (or optionally smallest) intron | knownGene exonStarts - exonEnds
Genome Position | Genome Position/Link to Genome Browser | (knownGene.txStart + txEnd)/2
Mouse | Mouse Ortholog (Best Blastp Hit to UCSC Known Genes) | mmBlastTab
Rat | Rat Ortholog (Best Blastp Hit to UCSC Known Genes) | rnBlastTab
Zebrafish | Danio rerio Ortholog (Best Blastp Hit to Ensembl) | drBlastTab
Drosophila | D. melanogaster Ortholog (Best Blastp Hit to FlyBase Proteins) | dmBlastTab
C. elegans | C. elegans Ortholog (Best Blastp Hit to WormPep) | ceBlastTab
Yeast | Saccharomyces cerevisiae Ortholog (Best Blastp Hit to RefSeq) | scBlastTab
Pfam Domains | Protein Family Domain Structure | knownToPfam → pfamDesc
Superfamily | Protein Superfamily Assignments | ucscScop & scopDesc
PDB | Protein Data Bank | kgProtMap2 & sp###### database
Gene Ontology | Gene Ontology (GO) Terms Associated with Gene | kgProtMap2 & sp###### database
M. Vidal P2P | Human Protein-Protein Interaction Network from Marc Vidal | humanVidalP2P
E. Wanker P2P | Human Protein-Protein Interaction Network from Erich Wanker | humanWankerP2P
HPRD P2P | Human Protein-Protein Interaction Network from the Human Reference Protein Database | humanHprdP2P
Description | Short Description Line/Link to Details Page | kgXref.description
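
The Intron Size and Genome Position sources in the column list above are given only as shorthand arithmetic. The sketch below spells out one reading of it, assuming knownGene's exonStarts and exonEnds are comma-separated lists with a trailing comma and that coordinates are 0-based, half-open, so an intron's size is the next exon's start minus the current exon's end. The function names are hypothetical, and this is not taken from the Gene Sorter source code.

```shell
# largest_intron EXONSTARTS EXONENDS
# Prints the size of the biggest intron, or 0 for a single-exon gene.
largest_intron() {
    awk -v s="$1" -v e="$2" 'BEGIN {
        n = split(s, starts, ",")
        split(e, ends, ",")
        max = 0
        for (i = 1; i < n; i++) {
            if (starts[i + 1] == "") continue   # empty field after the trailing comma
            size = starts[i + 1] - ends[i]      # gap between exon i and exon i+1
            if (size > max) max = size
        }
        print max
    }'
}

# genome_midpoint TXSTART TXEND
# Prints (txStart + txEnd) / 2 using integer division.
genome_midpoint() {
    echo $(( ($1 + $2) / 2 ))
}

# Example use with made-up coordinates:
#   largest_intron "100,300,500," "200,350,600,"
#   genome_midpoint 100 600
```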

Table Descriptions

File:Hg19uc002ypa.2.jpg
Top of the PDF image of the UCSC Genes details page, showing the table source of each item. Use the PDF link to the left to see the entire details page.

Annotated details page: File:Hg19uc002ypa.2.pdf


This section attempts to describe the uses of the tables used in or related to UCSC Genes.

UCSC Gene & GS Table Descriptions

  • allenBrainGene - "Human Cortex Gene Expression" link in "Sequence & Links to Tools & Databases" section of hgGene
  • allenBrainUrl - w/ knownToAllenBrain creates GS column, "Allen Brain"
  • bioCycMapDesc - BioCyc description name in "Biochem & Signaling Pathways" section of hgGene
  • bioCycPathway - BioCyc pathway name in "Biochem & Signaling Pathways" section of hgGene
  • ccdsKgMap - CCDS in the "Other names for this Gene" section of hgGene
  • ceBlastTab - C. elegans info in "Orthologous Genes in Other Species" section of hgGene
  • cgapAlias - links cgapID with kgXref.geneSymbol to pull info for gene
  • cgapBiocDesc - BioCarta description in "Biochem & Signaling Pathways" section of hgGene
  • cgapBiocPathway - BioCarta pathway name in "Biochem & Signaling Pathways" section of hgGene
  • dmBlastTab - D. melanogaster info in "Orthologous Genes in Other Species" section of hgGene
  • drBlastTab - zebrafish info in "Orthologous Genes in Other Species" section of hgGene
  • foldUtr3 - 3' info in "mRNA Secondary Structure of 3' and 5' UTRs" section of hgGene
  • foldUtr5 - 5' info in "mRNA Secondary Structure of 3' and 5' UTRs" section of hgGene
  • gnfAtlas2 - separate track, QA'd with that track but also determines the "Microarray expression Data" section of hgGene and the Gene Sorter column, "GNF Atlas 2"
  • gnfAtlas2Distance - Gene Sorter column "GNF Atlas 2 Delta" & "Expression (GNF Atlas2)" "sort by" option
  • humanHprdP2P - Gene Sorter column "HPRD P2P" & "sort by"
  • humanVidalP2P - Gene Sorter column "M. Vidal Protein-to-Protein" & "sort by"
  • humanWankerP2P - Gene Sorter column "E. Wanker Protein-to-Protein" & "sort by"
  • keggMapDesc - KEGG pathway description in "Biochem & Signaling Pathways" section of hgGene
  • keggPathway - KEGG pathway name in "Biochem & Signaling Pathways" section of hgGene
  • kg4ToKg5 - maps IDs from the previous gene set to the new gene set, so an old ID can be searched for in the new set; users can also check the kg4ToKg5 table directly to find corresponding gene IDs
  • kgAlias - "Alternate Gene Symbols" in "Other Names for This Gene" section of hgGene
  • kgColor - colors the gene in browser
  • kgProtAlias - intermediate table?
  • kgProtMap2 - Scop Domains in "Protein Domain & Structure Information" section of hgGene & Protein Data Bank column in GS need this table to work properly; also involved with proteome browser (not releasing with proteome browser with hg19; being phased out)
  • kgSpAlias - duplicate of kgAlias w/ extra field, spID, that is blank in all records
  • kgTxInfo - table info in the "Gene Model Information" section of hgGene
  • kgXref - "Alternate Gene Symbols" in the "Other Names for This Gene" section of hgGene
  • knownAlt - separate track, "Alt Events"; needs to be QA'd separately
  • knownBlastTab - Gene Sorter columns: "%ID" = knownBlastTab.identity, "BLASTP E-Value" = knownBlastTab.eValue, "BLASTP Bits" = knownBlastTab.bitScore
  • knownCanonical - best transcript from each clusterId (note, GS only works with genes in this table)
  • knownGene - primary table
  • knownGeneMrna - "mRNA" link in "Sequence & Links to Tools & Databases" section of hgGene
  • knownGenePep - "protein" link in "Sequence & Links to Tools & Databases" section of hgGene
  • knownIsoforms - transcript grouped into clusters named by clusterId
  • knownToAllenBrain - w/ allenBrainUrl creates Gene Sorter "Allen Brain" column/link
  • knownToEnsembl - used in link to Ensembl
  • knownToGnf1m - similar to knownToGnfAtlas2; not sure what it's for
  • knownToGnfAtlas2 - "Microarray Expression Data" section, Gene Sorter column "GNF Atlas 2 ID"
  • knownToHprd - creates the "HPRD" link in "Sequence & Links to Tools & Databases" section of hgGene
  • knownToLocusLink - used in link to Entrez Gene, see issues below
  • knownToPfam - Pfam Domains in "Protein Domain & Structure info" of hgGene & Gene Sorter column: Pfam Domains
  • knownToRefSeq - used in link to RefSeq in "Other Names for This Gene" section of hgGene
  • knownToSuper - contains scop domain info with gene name & start/end
  • knownToTreefam - used in link to Treefam website in "Sequence & Links to Tools & Databases" section of hgGene
  • knownToU133 - Gene Sorter column "U133 ID"
  • knownToVisiGene - used in link to VisiGene
  • mmBlastTab - mouse info in "Orthologous Genes in Other Species" section of hgGene
  • pfamDesc - Pfam description in "Protein Domain & Structure Info" section of hgGene and in "Pfam Domains" column of Gene Sorter
  • rnBlastTab - rat info in "Orthologous Genes in Other Species" section of hgGene
  • scBlastTab - S. cerevisiae info in "Orthologous Genes in Other Species" section of hgGene
  • scopDesc - acc and description in "SCOP Domains" of "Protein Domain & Structure Info" section of hgGene
  • spMrna - intermediate table? Doesn't seem to directly affect hgGene or GS
  • ucscScop - from ucscID gets scop domainName

Click for more information about blastTabs

UCSC Genes Tables in other Databases

Proteome DB (e.g. proteins090821)

  • spReactomeEvent - "Reactome" info in "Biochemical and Signaling Pathways" section of hgGene (linked through, and dependent on, spID in kgXref)
  • spReactomeId - "Reactome" link in "Sequence & Links to Tools & Databases" section of hgGene (unsure??)

Tables Related to UCSC Genes That are Separate tracks

  • affyU133
  • allenBrainAli
  • exoniphy - created by Adam Siepel of Cornell for each assembly (2nd choice is to lift from previous assembly)
  • gnfAtlas2
  • nibbImageProbes
  • omimGene
  • omimGeneMap
  • omimMorbidMap
  • omimToKnownCanonical
  • vgAllProbes

No longer UCSC Genes Tables

  • knownToCdsSnp - being dropped on all assemblies because too many issues were found; it populated the Cds Snp column in Gene Sorter
  • knownToGnf1h - part of GNF Atlas 1, which is not on hg19

Proteome Browser Tables (no longer releasing)

  • pbAnomLimit
  • pbResAvgStd
  • pepCCntDist
  • pepExonCntDist
  • pepHydroDist
  • pepIPCntDist
  • pepMolWtDist
  • pepPi
  • pepPiDist
  • pepResDist
  • pepMwAa

Links to Other UCSC Genes Genomewiki Pages