GenBank QA

From Genecats
Revision as of 19:13, 30 September 2011 by Rhead (talk | contribs) (used old documentation at http://genecats.cse.ucsc.edu/qa/test-protocols/tracks.html as a starting point)

The GenBank tracks are a suite of tracks created by an automated process that runs separately on hgwdev, hgwbeta, and the RR. Direct most questions about GenBank tracks to Mark Diekhans and Brian Raney.

Tracks

The possible "GenBank tracks" are:

  • RefSeq Genes
  • Other RefSeq
  • MGC Genes
  • ORFeome Clones
  • (organism) mRNAs
  • Spliced ESTs
  • (organism) ESTs
  • Other mRNAs
  • Other ESTs

Not every track is on every assembly. Most assemblies will have at least mRNAs, ESTs and Other RefSeq. The RefSeq Genes track is usually only available if there are enough RefSeq Genes for that organism.

Tables

There are many related tables for these tracks. They must always be pushed as a group at the same time.

The current list of all possible GenBank tables (curated by Mark Diekhans and Brian Raney) is located at hgwdev:/cluster/data/genbank/etc/genbank.tbls (also located at hgwbeta:/genbank/etc/genbank.tbls). All tables in the list up to 'gbLoaded' must exist; those after 'gbLoaded' are optional. To list the GenBank tables present in a database (using hg19 as an example), run:

         hgsql -N -e 'SHOW TABLES' hg19 | egrep -f /cluster/data/genbank/etc/genbank.tbls   # on hgwdev
         hgsql -N -e 'SHOW TABLES' hg19 | egrep -f /genbank/etc/genbank.tbls                # on hgwbeta
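
The commands above show which GenBank tables exist, but not which required ones are missing. The following is a minimal sketch of that reverse check, not part of the GenBank tooling. The function name missing_required is hypothetical, and the sketch assumes that genbank.tbls holds one egrep pattern per line (which is how the commands above use it) and that the 'gbLoaded' entry itself is the last required one.

```shell
# Print every required pattern from a genbank.tbls-style file that matches
# no table name in a saved table listing.
#   $1: pattern file, one egrep pattern per line (assumed format)
#   $2: file with one table name per line, e.g. saved SHOW TABLES output
missing_required() {
    tbls=$1
    tables=$2
    while IFS= read -r pattern; do
        # Skip blank lines; an empty pattern would match every table name.
        [ -n "$pattern" ] || continue
        # Report the pattern if no table name matches it.
        if ! grep -E -q -- "$pattern" "$tables"; then
            printf '%s\n' "$pattern"
        fi
        # Assumption: 'gbLoaded' is the last required entry, so stop
        # after checking it. Everything below it is optional.
        case $pattern in
            *gbLoaded*) break ;;
        esac
    done < "$tbls"
}
```

To use it on hgwdev, one could save the table list with hgsql -N -e 'SHOW TABLES' hg19 > hg19.tables and then run missing_required /cluster/data/genbank/etc/genbank.tbls hg19.tables. No output means every required pattern matched at least one table.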

The two tables 'gbCdnaInfo' and 'gbStatus' are the main tables; each should contain all entries for a database.

doGenbankTests

The doGenbankTests script runs several automated checks on all GenBank tables. It determines which GenBank tracks are present for an assembly, then runs the following checks on the appropriate tables:

  • genePredCheck
  • pslCheck
  • joinerCheck
  • featureBits

It also runs gbSanity, which often reports a mix of problems: some that are known but not dire, and some that should definitely keep us from releasing the tracks. Contact the GenBank gurus if it finds any errors.

Other checks

Remember to still do the normal QA stuff listed in the New_track_checklist. These checks can be pretty cursory, as the tracks are built by the GenBank process, which has been automated and running for some time without problems. It is arguably more important to think about things like whether the tracks that are present on your assembly make sense (e.g., should there be a RefSeq track but there is not one?) and whether the numbers of items in the track make sense (e.g., are there only 3 genes in the RefSeq track?).
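
The item-count check can be partly mechanized. The sketch below is only an illustration and not an existing QA script: the function name flag_small_tables, the "table count" input format, and the threshold are all hypothetical. What counts as too few depends on the organism, so any flagged table still needs a human decision.

```shell
# Print the name and row count of every table whose count is below a threshold.
#   $1: file with lines of the form "<table> <count>" (hypothetical format)
#   $2: minimum acceptable row count (hypothetical; choose per organism)
flag_small_tables() {
    # "+ 0" forces a numeric rather than string comparison in awk.
    awk -v min="$2" '$2 + 0 < min + 0 { print $1, $2 }' "$1"
}
```

One way to build the input file is to append a line per table, taking each count from a query such as hgsql -N -e 'SELECT COUNT(*) FROM refGene' hg19, and then run flag_small_tables counts.txt 100.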