GenBank QA

The GenBank tracks are a suite of tracks created by an automated process that runs separately on hgwdev, hgwbeta, and the RR. Most questions about GenBank tracks should be directed to Mark Diekhans and Brian Raney.

Tracks

The possible "GenBank tracks" are:

  • RefSeq Genes
  • Other RefSeq
  • MGC Genes
  • ORFeome Clones
  • (organism) mRNAs
  • Spliced ESTs
  • (organism) ESTs
  • Other mRNAs
  • Other ESTs

Not every track is on every assembly. Most assemblies will have at least mRNAs, ESTs and Other RefSeq. The RefSeq Genes track is usually only available if there are enough RefSeq Genes for that organism. You can see counts of the mRNAs, ESTs and RefSeq genes available from GenBank in the file /cluster/data/genbank/data/organism.lst.

Check to see which tracks are present on this assembly. If a previous assembly for this organism is available, check the new assembly to be sure all of the tracks on the previous assembly are also present on this one.
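
The comparison against a previous assembly can be sketched as a small shell helper. This is only a rough aid, not part of the GenBank process: the function name missing_in_new and the database names oldDb and newDb are hypothetical, and it compares GenBank table names as a stand-in for tracks, so the output still needs a manual look if table naming differs between the two assemblies.

```shell
# Print every table name listed in PREV_FILE that is absent from NEW_FILE.
# Both files hold one table name per line.
# Usage: missing_in_new PREV_FILE NEW_FILE
missing_in_new() {
    # -F fixed strings, -x whole-line match, -v invert the match:
    # keep the lines of PREV_FILE that do not appear in NEW_FILE.
    # grep exits non-zero when nothing is printed, which is not an error here.
    grep -Fxv -f "$2" "$1" || true
}

# Example on hgwdev (oldDb and newDb are placeholder database names):
#   hgsql -N -e 'SHOW TABLES' oldDb | egrep -f /cluster/data/genbank/etc/genbank.tbls > prev.txt
#   hgsql -N -e 'SHOW TABLES' newDb | egrep -f /cluster/data/genbank/etc/genbank.tbls > new.txt
#   missing_in_new prev.txt new.txt
```

An empty result means every GenBank table found on the previous assembly also exists on the new one.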

Tables

There are many related tables for these tracks. They must always be pushed as a group at the same time.

The current list of all possible GenBank tables (curated by Mark Diekhans and Brian Raney) is located at hgwdev:/cluster/data/genbank/etc/genbank.tbls (also located at hgwbeta:/genbank/etc/genbank.tbls). All tables in the list up to 'gbLoaded' must exist; those after 'gbLoaded' are optional. To get a list of the tables included in a database (using hg19 as an example), run:

hgsql -N -e 'SHOW TABLES' hg19 | egrep -f /cluster/data/genbank/etc/genbank.tbls  (hgwdev)
hgsql -N -e 'SHOW TABLES' hg19 | egrep -f /genbank/etc/genbank.tbls  (hgwbeta)

The two tables 'gbCdnaInfo' and 'gbStatus' are the main tables and should contain all entries for a database.
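
The rule that every entry up to 'gbLoaded' must exist can be checked with a short helper like the one below. It is a sketch under assumptions: the function name missing_required is hypothetical, and it assumes genbank.tbls holds one egrep pattern per line (as its use with egrep -f suggests), that blank lines and lines starting with '#' can be skipped, and that 'gbLoaded' itself counts as required.

```shell
# Print each required pattern from LIST_FILE that matches no table in TABLES_FILE.
# LIST_FILE:   one egrep pattern per line (assumed layout of genbank.tbls).
# TABLES_FILE: one table name per line, e.g. the output of SHOW TABLES.
# Usage: missing_required LIST_FILE TABLES_FILE
missing_required() {
    list_file=$1
    tables_file=$2
    while IFS= read -r pattern || [ -n "$pattern" ]; do
        # Skip blank lines and '#' comments (assumed file conventions).
        case $pattern in
            ''|'#'*) continue ;;
        esac
        # Report the pattern if no table name matches it.
        if ! grep -Eq -e "$pattern" "$tables_file"; then
            printf '%s\n' "$pattern"
        fi
        # Entries after gbLoaded are optional, so stop once it has been checked.
        case $pattern in
            *gbLoaded*) break ;;
        esac
    done < "$list_file"
}

# Example on hgwdev, using hg19 as in the commands above:
#   hgsql -N -e 'SHOW TABLES' hg19 > hg19.tables
#   missing_required /cluster/data/genbank/etc/genbank.tbls hg19.tables
```

Any pattern it prints is a required table that could not be found in the database; no output means all required tables are present.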

doGenbankTests

There is a script that runs several automated checks on all GenBank tables, called doGenbankTests. It determines which GenBank tracks are present for an assembly, then runs the following checks on the appropriate tables:

  • genePredCheck
  • pslCheck
  • joinerCheck
  • featureBits

It also runs a check called gbSanity. In the past, gbSanity helped identify some problems that were known but not dire, and some that should definitely have kept us from releasing the tracks. However, it was more a way for the original engineer, MarkD, to ensure that the process of building the GenBank tracks was working properly than a QA script.

NOTE: Braney has said there is no longer any reason to worry if gbSanity reports errors.

Other checks

Remember to still do the normal QA checks listed in the New_track_checklist. These checks can be fairly cursory, as the tracks are built by the GenBank process, which has been automated and running for some time without problems. It is arguably more important to think about whether the tracks that are present on your assembly make sense (e.g., should there be a RefSeq track but there is not one?). Some tracks are also required for a Minimal_browser.

The file /cluster/data/genbank/data/organism.lst has counts of how many mRNAs, ESTs and RefSeq genes are available for each organism in GenBank. If there are no RefSeq Genes for an organism, the RefSeq Genes track should not be enabled. If there is at least one RefSeq Gene, the track should exist. (The idea is that the track will grow as more genes are added to GenBank for that organism.)
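
The RefSeq Genes rule above can be turned into a quick lookup. The layout of organism.lst is not described here, so this sketch assumes a tab-separated file with the organism name in the first column, and it leaves the position of the RefSeq count to the caller. The function name refseq_track_expected is hypothetical.

```shell
# Print "yes" if ORGANISM has at least one RefSeq gene, "no" if it has none.
# Exits non-zero if ORGANISM is not found in FILE.
# ASSUMPTIONS: FILE is tab-separated, the organism name is in column 1, and
# COLUMN is the 1-based column holding the RefSeq count. Check both against
# the real organism.lst before relying on the result.
# Usage: refseq_track_expected FILE ORGANISM COLUMN
refseq_track_expected() {
    awk -F '\t' -v org="$2" -v col="$3" '
        $1 == org {
            found = 1
            # At least one RefSeq gene means the track should exist.
            if ($col + 0 > 0) print "yes"; else print "no"
            exit
        }
        END { if (!found) exit 1 }
    ' "$1"
}

# Example (COLUMN must be replaced with the real column number):
#   refseq_track_expected /cluster/data/genbank/data/organism.lst 'Homo sapiens' COLUMN
```

A "yes" for an assembly with no RefSeq Genes track, or a "no" for an assembly that has the track enabled, is worth raising with the GenBank engineers.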