GenBank QA: Difference between revisions

(→‎Tracks: added location of genbank counts file)
(→‎Other checks: linked to minimal browser page)

Revision as of 23:36, 5 October 2012

The GenBank tracks are a suite of tracks that are created by an automated process that runs separately on hgwdev, hgwbeta, and the RR. Most questions about GenBank tracks should be directed to Mark Diekhans and Brian Raney.

Tracks

The possible "GenBank tracks" are:

  • RefSeq Genes
  • Other RefSeq
  • MGC Genes
  • ORFeome Clones
  • (organism) mRNAs
  • Spliced ESTs
  • (organism) ESTs
  • Other mRNAs
  • Other ESTs

Not every track is on every assembly. Most assemblies will have at least mRNAs, ESTs and Other RefSeq. The RefSeq Genes track is usually only available if there are enough RefSeq Genes for that organism. You can see counts of the mRNAs, ESTs and RefSeq genes available from GenBank in the file /cluster/data/genbank/data/organism.lst.

Check to see which tracks are present on this assembly. If a previous assembly for this organism is available, check the new assembly to be sure all of the tracks on the previous assembly are also present on this one.
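The comparison against a previous assembly can be roughed out at the table level. The following is only a sketch: it assumes you have already saved one table name per line for each assembly (for example from SHOW TABLES), and the function name tables_only_in_old is made up for illustration. A table missing from the new assembly is a hint, not proof, that a track is missing.

```shell
# Hypothetical helper: print table names that appear in the old
# assembly's list but not in the new assembly's list.
# $1 = file with one table name per line for the previous assembly
# $2 = file with one table name per line for the new assembly
tables_only_in_old() {
    old_sorted=$(mktemp)
    new_sorted=$(mktemp)
    # comm needs sorted input; -u also drops duplicate names
    sort -u "$1" > "$old_sorted"
    sort -u "$2" > "$new_sorted"
    # -23 suppresses lines unique to the new list and lines common to both
    comm -23 "$old_sorted" "$new_sorted"
    rm -f "$old_sorted" "$new_sorted"
}
```

Anything it prints is worth a closer look in the browser before deciding whether the track is genuinely absent.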

Tables

There are many related tables for these tracks. They must always be pushed as a group at the same time.

The current list of all possible GenBank tables (curated by Mark Diekhans and Brian Raney) is located at hgwdev:/cluster/data/genbank/etc/genbank.tbls (also located at hgwbeta:/genbank/etc/genbank.tbls). All tables in the list up to 'gbLoaded' must exist; those after 'gbLoaded' are optional. To get a list of the tables from that file that are present in a database (using hg19 as an example), do:

hgsql -N -e 'SHOW TABLES' hg19 | egrep -f /cluster/data/genbank/etc/genbank.tbls  # on hgwdev
hgsql -N -e 'SHOW TABLES' hg19 | egrep -f /genbank/etc/genbank.tbls  # on hgwbeta
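The "must exist up to 'gbLoaded'" rule can also be checked mechanically. This is a sketch under assumptions, not an official tool: it assumes genbank.tbls holds one egrep pattern per line (which is how the commands above use it), that the boundary line contains the text gbLoaded, and that any blank lines or lines starting with # can be ignored. The function name required_missing is hypothetical.

```shell
# Hypothetical helper: print each required pattern from the table list
# that matches nothing in the database's table listing.
# $1 = table list file (e.g. a copy of genbank.tbls)
# $2 = file with the output of: hgsql -N -e 'SHOW TABLES' <db>
required_missing() {
    tbls=$1
    existing=$2
    # Keep lines up to and including the gbLoaded line, then stop reading.
    sed '/gbLoaded/q' "$tbls" |
    # Assumption: blank lines and '#' comment lines are not patterns.
    grep -v -e '^#' -e '^$' |
    while read -r pat; do
        # Report the pattern if no existing table matches it.
        grep -Eq -- "$pat" "$existing" || echo "$pat"
    done
}
```

For example, save the table listing with hgsql -N -e 'SHOW TABLES' hg19 > hg19.tables and then run required_missing /cluster/data/genbank/etc/genbank.tbls hg19.tables on hgwdev; no output means every required pattern matched at least one table.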

The two tables 'gbCdnaInfo' and 'gbStatus' are the main tables; they should contain all entries for a database.

doGenbankTests

There is a script that runs several automated checks on all GenBank tables, called doGenbankTests. It determines which GenBank tracks are present for an assembly, then runs the following checks on the appropriate tables:

  • genePredCheck
  • pslCheck
  • joinerCheck
  • featureBits

It also runs a check called gbSanity. The output often includes some problems that are known but not dire, and others that should definitely keep us from releasing the tracks. Contact the GenBank gurus if it finds any errors.

Other checks

Remember to still do the normal QA stuff listed in the New_track_checklist. These checks can be pretty cursory, as the tracks are built by the GenBank process, which has been automated and running for some time without problems. It is arguably more important to think about whether the set of tracks present on your assembly makes sense (e.g., should there be a RefSeq track but there is not one?). Some tracks are required for a Minimal_browser.

The file /cluster/data/genbank/data/organism.lst has counts of how many mRNAs, ESTs and RefSeq genes are available for each organism in GenBank. If there are no RefSeq Genes for an organism, the RefSeq Genes track should not be enabled. If there is at least one RefSeq Gene, the track should exist. (The idea is that the track will grow as more genes are added to GenBank for that organism.)
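The RefSeq Genes rule above boils down to a two-way consistency check. The sketch below deliberately does not parse organism.lst, since its layout is not described here; you look up the RefSeq count yourself and pass it in. The function name check_refseq_track and its yes/no argument are made up for illustration.

```shell
# Hypothetical helper: compare the RefSeq gene count for an organism
# with whether the RefSeq Genes track is present on the assembly.
# $1 = number of RefSeq genes listed for the organism (read by hand
#      from organism.lst)
# $2 = "yes" if the RefSeq Genes track exists on the assembly, else "no"
check_refseq_track() {
    count=$1
    has_track=$2
    if [ "$count" -gt 0 ] && [ "$has_track" = "no" ]; then
        echo "missing: RefSeq genes exist but there is no track"
    elif [ "$count" -eq 0 ] && [ "$has_track" = "yes" ]; then
        echo "unexpected: track exists but there are no RefSeq genes"
    else
        echo "ok"
    fi
}
```

One way to fill in the second argument, assuming the track's main table is refGene, is to see whether hgsql -N -e 'SHOW TABLES LIKE "refGene"' on the database prints anything.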