PCR on cDNA

From genomewiki
Revision as of 19:48, 21 April 2008 by AngieHinrichs (talk | contribs) (Updated description of seqTable and extFileTable which are now optional (seqFile can be used instead).)

This is a sort of design document, after-the-fact and in-progress. I would love to get input on any part of it... see any holes?

Overview

We will enhance hgPcr to offer not only genomic assembly sequence, but also cDNA sequences such as UCSC Genes and quarterly snapshots of GenBank native mRNAs, as targets for search.

hgPcr

When the necessary tables, files and gfServers are in place, hgPcr's front page gets a new select box labelled Target. The first and default choice is "genome assembly" -- no change to the current behavior. The subsequent choices have a brief shortLabel-like description, e.g. "UCSC Genes" or "Human mRNAs".

If the user selects one of the new targets, then the primer pair is passed to a gfServer running on cDNA sequences. The gfPcr result page for cDNA looks much like the genome assembly result page, except cDNA coordinates and sequence are displayed and the hgTracks links have position=accession instead of a chrom:start-end range. Unlike the genomic result page, if a match is to the opposite strand of a cDNA (assumed to be mRNA at this point), a message is printed out.
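The difference between the two result pages can be sketched as below. This is only an illustration in Python, not the hgPcr C code: the function names, the example accession and the wording of the strand message are all made up for the example.

```python
from typing import Optional


def position_param(seq_name: str, start: int, end: int, is_cdna: bool) -> str:
    """Value for the position= parameter of an hgTracks link.

    Genomic hits link to a chrom:start-end range; cDNA hits link to the
    accession itself.  (Hypothetical helper, not an hgPcr function.)
    """
    if is_cdna:
        return seq_name
    return f"{seq_name}:{start}-{end}"


def strand_note(strand: str, is_cdna: bool) -> Optional[str]:
    """Message for a hit on the opposite strand of a cDNA, else None.

    The message text here is invented for illustration.
    """
    if is_cdna and strand == "-":
        return "Note: primers match the opposite strand of this cDNA."
    return None


print(position_param("chr1", 1000, 2000, is_cdna=False))
print(position_param("NM_000518", 10, 200, is_cdna=True))
```

The genomic page never calls for the strand note, so strand_note returns None whenever is_cdna is false.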

Infrastructure

A new central database table describes the targets. Each new target requires a new blat server, up to two new /gbdb files per genome db and up to three new tables per genome db (seqTable and extFileTable, and the fasta file that goes with them, are optional).

Synchronization across {central db, blat servers, /gbdb, genome db tables} x {hgwdev, hgwbeta, RR} will be interesting -- more on that below.


Implementation

Central database

A new table targetDb describes each pair of target and genome db to which it has been aligned. Fields (as in kent/src/hg/lib/targetDb.{as,sql}):

field | description | example
name | Unique identifier of target | hg18KgApr08
description | Brief description for select box, like shortLabel | UCSC Genes
db | Genome assembly database to which this target has been aligned | hg18
pslTable | PSL table in db that maps target coords to db coords | kgTargetAli
seqTable | (optional) Table in db that has extFileTable indices of target sequences | kgTargetSeq (or blank)
extFileTable | (optional) Table in db that has .id, .path, and .size of target sequence files | kgTargetExtFile (or blank)
seqFile | Target sequence .2bit file path (typically /gbdb/db/targetDb/name.2bit) | /gbdb/hg18/targetDb/kgTargetSeq.2bit
priority | Relative priority compared to other targets for same db (smaller numbers are higher priority) | 1.0
time | Time at which this record was updated -- should be newer than db tables and seqFile (so should blat server) | 2008-04-10 14:11:35


When the gfServer info for the target is added to blatServers, blatServers.db must be the same as targetDb.name (not targetDb.db!).
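The lookup rule is easy to get wrong, so here is a minimal sketch of it in Python. Rows are plain dicts; the host names and port numbers are invented, and find_blat_server is a hypothetical helper rather than anything in the kent source.

```python
from typing import Dict, List, Optional


def find_blat_server(blat_servers: List[Dict], target: Dict) -> Optional[Dict]:
    """Return the blatServers row for a target, or None if there is none.

    The match is on targetDb.name, NOT targetDb.db: targetDb.db is the
    genome assembly, whose blatServers rows point at the genomic gfServers.
    """
    for row in blat_servers:
        if row["db"] == target["name"]:
            return row
    return None


# Hypothetical rows: one genomic server and one cDNA target server.
blat_servers = [
    {"db": "hg18", "host": "blatHostA", "port": 17779},
    {"db": "hg18KgApr08", "host": "blatHostB", "port": 17800},
]
target = {"name": "hg18KgApr08", "db": "hg18"}

server = find_blat_server(blat_servers, target)
print(server["host"], server["port"])
```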

pslTable, seqTable, and extFileTable are not necessary for performing PCR on non-genomic targets and displaying the result with target sequence and coords -- they are for addition of a PCR results track. They could be used to enhance the hgPcr results page, too.

seqTable and extFileTable are optional -- seqFile will be used if those fields are left blank.
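As a sketch of that fallback (Python for illustration only; the real record lives in kent/src/hg/lib/targetDb.{as,sql}), the record and the choice of sequence source might look like this. The sketch assumes that leaving either of the two optional fields blank means seqFile is used; sequence_source is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TargetDb:
    """One targetDb row; field names follow the table above."""
    name: str
    description: str
    db: str
    pslTable: str
    seqTable: str       # optional, may be blank
    extFileTable: str   # optional, may be blank
    seqFile: str
    priority: float
    time: str


def sequence_source(t: TargetDb) -> Tuple[str, ...]:
    """Decide where target sequence comes from.

    Assumption: both optional tables must be named for them to be used;
    if either is blank, fall back to the .2bit file in seqFile.
    """
    if t.seqTable and t.extFileTable:
        return ("tables", t.seqTable, t.extFileTable)
    return ("seqFile", t.seqFile)


kg = TargetDb("hg18KgApr08", "UCSC Genes", "hg18", "kgTargetAli", "", "",
              "/gbdb/hg18/targetDb/kgTargetSeq.2bit", 1.0,
              "2008-04-10 14:11:35")
print(sequence_source(kg))
```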


Blat servers

The number of new blat servers will depend on how many genome dbs and targets we want to support. It could be a lot. Fortunately a gfServer running on transcript sequences doesn't require much memory -- hg18 UCSC Genes and native mRNAs require about 50M and 100M respectively. However, if we add a lot of new servers, there will be a lot to keep track of, and I wonder if cluster-admin might be annoyed at all of the new start and stop requests.


/gbdb/ files

The location of these is flexible. Currently I'm using /gbdb/db/targetDb/.

In that directory are two files per target: name.2bit (seqFile) and, if seqTable and extFileTable are specified, name.fa.

It is possible that one target server and sequence file could be shared across several genome databases; for example, a human mRNA gfServer could be shared by hg* dbs. In that case, putting the sequence somewhere other than the /gbdb/db directories would make sense.


Genome database tables

pslTable maps the target sequence coordinates to genomic; if specified, seqTable and extFileTable support flexible organization of sequences into fasta files.

Potential for automation

All of this could be automated except for a few critical pieces: running the blatServers, QA, and release.  :)

However, a script could certainly print out instructions (including template email to cluster-admin, template push request) and make sub-scripts for the automatable parts: creating the necessary tables/files and updating targetDb and blatServers in the central database.

For QA as well, some automation is possible. For example, a script could pick a sequence from the target .2bit, extract some primers and their coords, and show what the hgPcr result should be -- same thing would work for genomic PCR (with $db.2bit).
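A rough sketch of such a QA helper, in Python. It assumes the sequence has already been pulled out of the .2bit as a plain string (reading the file is left out), uses exact matching only, and reports 0-based half-open coordinates; all function names are hypothetical, and a real check would compare against what hgPcr actually returns.

```python
from typing import Optional, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def make_test_primers(seq: str, start: int, end: int,
                      primer_len: int = 20) -> Tuple[str, str]:
    """Cut a primer pair that should amplify seq[start:end].

    The forward primer is the first primer_len bases of the region; the
    reverse primer is the reverse complement of the last primer_len bases.
    """
    if not (0 <= start < end <= len(seq)):
        raise ValueError("region is outside the sequence")
    if end - start < 2 * primer_len:
        raise ValueError("region is too short for two primers")
    forward = seq[start:start + primer_len]
    reverse = reverse_complement(seq[end - primer_len:end])
    return forward, reverse


def expected_product(seq: str, forward: str,
                     reverse: str) -> Optional[Tuple[int, int]]:
    """Naive exact-match PCR: (start, end) of the product, or None."""
    f = seq.find(forward)
    if f < 0:
        return None
    r = seq.find(reverse_complement(reverse), f + len(forward))
    if r < 0:
        return None
    return f, r + len(reverse)


seq = "AAACCCGGGTTTACGTACGATCGGATTACAGGCTTAACTGACTGGCCATA"
fwd, rev = make_test_primers(seq, 3, 45, primer_len=8)
print(fwd, rev, expected_product(seq, fwd, rev))
```

The same helper would work for genomic PCR by feeding it sequence from $db.2bit instead of the target .2bit.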


Synchronization/Release

Synchronizing across the central db, blat servers, /gbdb and genome db tables will be a challenge, especially when rolling out changes from hgwdev/hgcentraltest to hgwbeta and the RR.

One safeguard in place is a timestamp check: targetDb.time must be newer than every one of the genome db tables and /gbdb files, or that target will be ignored. That will prevent a table/file update from causing incorrect results from hgPcr, but it doesn't cover the blat server. (Could gfServer status be enhanced to give an uptime / start date? maybe even name and timestamp of input file? :)
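The check amounts to a comparison like the one sketched here in Python. How the table and file times are obtained is left out, target_is_usable is a hypothetical name, and treating "newer" as strictly greater is an assumption.

```python
from datetime import datetime
from typing import Iterable


def target_is_usable(target_time: datetime,
                     component_times: Iterable[datetime]) -> bool:
    """True if targetDb.time is newer than every table/file time.

    component_times holds the update times of the genome db tables and
    the modification times of the /gbdb files for this target.
    """
    return all(target_time > t for t in component_times)


target_time = datetime(2008, 4, 10, 14, 11, 35)
components = [
    datetime(2008, 4, 9, 18, 0, 0),    # e.g. pslTable update time
    datetime(2008, 4, 10, 9, 30, 0),   # e.g. seqFile modification time
]
print(target_is_usable(target_time, components))
```

If a table or file is updated later without touching targetDb, the comparison fails and the target drops out of the menu instead of giving wrong results.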

Rolling out an update of anything with both db tables and files, from hgwdev/hgcentraltest -> hgwbeta/hgcentralbeta + shared-with-RR /gbdb -> RR/hgcentral, is always more complicated than pushing out a brand new set. Here, even more moving parts are involved. So I think the only way that we can support updates is to use target names that include some kind of date (MmmYY like Apr08 should be enough). That allows us to add all components of a new target while leaving the old target in place, then switch over in targetDb once everything else is ready.
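A small Python sketch of building such a name. The db + abbreviation + MmmYY pattern is inferred from the example name hg18KgApr08, so treat it as an assumption; target_name is a hypothetical helper.

```python
from datetime import date

# Spelled out rather than taken from strftime("%b"), which depends on locale.
_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def target_name(db: str, abbrev: str, when: date) -> str:
    """Build a dated target name such as hg18KgApr08."""
    return f"{db}{abbrev}{_MONTHS[when.month - 1]}{when.year % 100:02d}"


print(target_name("hg18", "Kg", date(2008, 4, 10)))
```

Because the old and new names differ, the new blat server, files and tables can all be put in place alongside the old ones before the targetDb row is changed.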


Prototype on genome-test

hg18 hgPcr now shows the Target menu with two non-genome choices, UCSC Genes and Human mRNAs. Currently the blat servers are running on kolossus instead of proper blat server machines.