GenBank QA: Difference between revisions

From Genecats
Revision as of 22:45, 3 October 2011

The GenBank tracks are a suite of tracks that are created by an automated process that runs separately on hgwdev, hgwbeta, and the RR. Most questions about GenBank tracks should be directed to Mark Diekhans and Brian Raney.

Tracks

The possible "GenBank tracks" are:

  • RefSeq Genes
  • Other RefSeq
  • MGC Genes
  • ORFeome Clones
  • (organism) mRNAs
  • Spliced ESTs
  • (organism) ESTs
  • Other mRNAs
  • Other ESTs

Not every track is on every assembly. Most assemblies will have at least mRNAs, ESTs and Other RefSeq. The RefSeq Genes track is usually only available if there are enough RefSeq Genes for that organism.

Tables

There are many related tables for these tracks. They must always be pushed as a group at the same time.

The current list of all possible GenBank tables (curated by Mark Diekhans and Brian Raney) is located at hgwdev:/cluster/data/genbank/etc/genbank.tbls (also located at hgwbeta:/genbank/etc/genbank.tbls). All tables in the list up to and including 'gbLoaded' must exist; those after 'gbLoaded' are optional. To list the GenBank tables present in a database (using hg19 as an example), run:

hgsql -N -e 'SHOW TABLES' hg19 | egrep -f /cluster/data/genbank/etc/genbank.tbls   # on hgwdev
hgsql -N -e 'SHOW TABLES' hg19 | egrep -f /genbank/etc/genbank.tbls                # on hgwbeta
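
The "must exist" rule can also be checked mechanically. What follows is only a minimal sketch, not part of the GenBank tooling: missing_required is a hypothetical helper, and it assumes that genbank.tbls holds one egrep pattern per line (blank lines and '#' comment lines are skipped) and that the line naming gbLoaded is the last required pattern. It prints each required pattern that matches no table name in a saved table list.

```shell
# Hypothetical helper, not part of the GenBank tooling.
# Usage: missing_required TBLS_FILE TABLE_LIST_FILE
# Assumes TBLS_FILE has one egrep pattern per line and that the
# line containing 'gbLoaded' is the last required pattern.
missing_required() {
    tbls=$1
    tables=$2
    # Print pattern lines up to and including the gbLoaded line,
    # then drop blank lines and '#' comment lines.
    sed -n -e '/gbLoaded/{p;q;}' -e 'p' "$tbls" |
        grep -v -e '^[[:space:]]*$' -e '^#' |
        while IFS= read -r pat; do
            # Report the pattern if no table name matches it.
            grep -E -q -e "$pat" "$tables" || printf '%s\n' "$pat"
        done
}
```

To try it on hgwdev, save the output of hgsql -N -e 'SHOW TABLES' hg19 to a file and pass that file as the second argument. No output means every required pattern matched at least one table. If the file never mentions gbLoaded, the sketch treats every pattern as required.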

The two tables 'gbCdnaInfo' and 'gbStatus' are the main tables and should contain all entries for a database.

doGenbankTests

There is a script that runs several automated checks on all GenBank tables, called doGenbankTests. It determines which GenBank tracks are present for an assembly, then runs the following checks on the appropriate tables:

  • genePredCheck
  • pslCheck
  • joinerCheck
  • featureBits

The script also runs gbSanity. The output often includes some problems that are known but not dire, and some that should definitely keep us from releasing the tracks. Contact the GenBank gurus if it reports any errors.

Other checks

Remember to still do the normal QA checks listed in the New_track_checklist. These checks can be fairly cursory, as the tracks are built by the GenBank process, which has been automated and running for some time without problems. It is arguably more important to consider whether the tracks that are present on your assembly make sense (e.g., should there be a RefSeq track but there is not one?).

The file /cluster/data/genbank/data/organism.lst has counts of how many mRNAs, ESTs and RefSeq genes are available for each organism in GenBank. If there are no RefSeq Genes for an organism, the RefSeq Genes track should not be enabled. If there is at least one RefSeq Gene, the track should exist. (The idea is that the track will grow as more genes are added to GenBank for that organism.)
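
The rule above reduces to a simple test on the RefSeq count. The sketch below is illustrative only: refseq_track_expected is a hypothetical helper, and the layout of organism.lst is an assumption here (tab-separated fields, organism name in column 1, RefSeq count in a column supplied by the caller). Check the real file before relying on it.

```shell
# Hypothetical helper; the column layout of organism.lst is an assumption.
# Usage: refseq_track_expected ORGANISM LIST_FILE REFSEQ_COLUMN
# Assumes tab-separated fields with the organism name in column 1.
refseq_track_expected() {
    awk -F '\t' -v org="$1" -v col="$3" '
        $1 == org {
            found = 1
            # At least one RefSeq gene means the track should exist.
            print (($col + 0 > 0) ? "enable" : "disable")
            exit
        }
        END { if (!found) print "unknown" }
    ' "$2"
}
```

The helper prints "enable" when the count is at least one, "disable" when it is zero, and "unknown" when the organism does not appear in the file.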