DbSNP Track Notes

From genomewiki
Revision as of 05:35, 30 November 2007 by AngieHinrichs (talk | contribs) (First draft of notes on SNP track development.)

This page exists to document the construction of UCSC's SNP track, based on NCBI's dbSNP. The goal is to enable a Genome Browser developer to come up to speed quickly when first building or maintaining the track and associated code.

This page was first written during the construction of the hg18 snp128 track, based on dbSNP version 128, during Nov.-Dec. 2007. If you are working on snp136 in 2011, and this page has not been updated since 2007, practice skepticism.


NCBI dbSNP

NCBI produces numbered releases of dbSNP about twice a year. dbSNP includes an enormous relational database and large, specially formatted fasta files that contain each SNP's variant(?) and flanking sequences. We download the fasta files and a subset of the dbSNP tables, and then extract the pieces used by our SNP track (more below).

TODO: add links to NCBI docs, the dbsnp-announce email list, and the ftp dirs.

UCSC snpNNN track overview

UCSC's track corresponding to dbSNP release NNN is (shortLabel) SNP NNN, (tableName) snpNNN.

Track tables and files

db tables:

  • core track: snpNNN, snpSeq, snpNNNExceptions, snpNNNExceptionsDesc (is this used?)
  • associated tables: snpNNNOrtho{PanTro2,RheMac2}, snpNNNorthoPanTro2RheMac2
  • download files: masked sequences

Genome Browser track code

  • hgTracks
  • hgc
  • hgTrackUi

Subset of NCBI fields used to build snpNNN track

snpNNN field  NCBI dbSNP table(s)/file
chrom         ContigLoc / contigInfo / liftUp
chromStart    ContigLoc / liftUp; check against phys_pos_from
chromEnd      ContigLoc / liftUp
name          rs + numeric snp_id that joins all the other sources
score         0
strand        ContigLoc.orientation
refNCBI       ContigLoc.allele
refUCSC       ContigLoc.allele if insertion, otherwise from genomic sequence
observed      fasta headers
molType       fasta headers
class         fasta headers
valid         SNP
avHet         SNP
avHetSE       SNP
func          ContigLocusId
locType       ContigLoc
weight        MapInfo
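
To make the field list above concrete, here is a minimal Python sketch of one snpNNN row as a record. The field names and their order come from the table; the Python types, the 0-based start, the renaming of `class` to `snpClass` and the helper names (`SnpRow`, `make_name`, `to_tab_line`) are assumptions made for illustration, not the actual table definition.

```python
from dataclasses import dataclass, astuple


@dataclass
class SnpRow:
    """One snpNNN-style row; field order follows the table above."""
    chrom: str
    chromStart: int   # assumed 0-based start
    chromEnd: int
    name: str         # "rs" + numeric snp_id
    score: int        # always 0 according to the table
    strand: str
    refNCBI: str
    refUCSC: str
    observed: str
    molType: str
    snpClass: str     # the "class" field; renamed because class is a Python keyword
    valid: str
    avHet: float
    avHetSE: float
    func: str
    locType: str
    weight: int

    def to_tab_line(self) -> str:
        """Render the row as one tab-separated line."""
        return "\t".join(str(value) for value in astuple(self))


def make_name(snp_id: int) -> str:
    """Build the track's name field from the numeric snp_id."""
    return f"rs{snp_id}"
```

A row could then be built with, for example, `SnpRow("chr1", 100, 101, make_name(12345), 0, "+", "A", "A", "A/G", "genomic", "single", "unknown", 0.0, 0.0, "unknown", "exact", 1)`; all of those values are made up for the example.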


Overview of track build process

We are planning to automate this process.

The process of building the core SNP track follows these basic steps:

  1. Download subset of files from NCBI
  2. Create a temporary db on a workhorse machine and load (subset of) NCBI tables
  3. Extract the relevant fields of NCBI tables and fasta headers into files sorted and indexed by SNP ID.
  4. Use the SNP ID to join the separate files into a single file of NCBI's encoding of SNP data. Use liftUp to translate from contig coords to chrom coords.
  5. Translate NCBI's encoding of SNP data into UCSC's representation, and check for inconsistencies or other problems with the data.
    • If necessary, work with NCBI to resolve any major issues discovered above.
    • If necessary, update the Genome Browser CGIs to handle new values (e.g. new function annotations).
  6. Install sequence files in gbdb and load database tables.

That is just for the core track, so QA can get started -- the next steps are to generate masked SNP sequences and orthologous SNPs; more on those later.

The first several steps are straightforward and scripted using good old unix commands like awk, sort and join, as well as hgsql to pull named fields from the NCBI tables. The translation and encoding step is performed by kent/src/hg/snp/snpLoad/snpProcessRawData.c.
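
The join-and-lift part of the process (steps 3 and 4) can be sketched in Python as follows. This is only an illustration, not the actual build scripts, which use awk, sort, join and liftUp. It assumes that both inputs are already sorted by SNP ID as strings, that the second input has at most one row per SNP ID, that lift lines carry the five columns offset, oldName, oldSize, newName, newSize, and that every contig sits on the + strand of its chromosome. All function names and sample values are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def join_by_id(left: Iterable[List[str]],
               right: Iterable[List[str]]) -> Iterator[List[str]]:
    """Merge-join two row streams on their first column (the SNP ID).

    Both streams must be sorted by that column as strings. The left
    stream may repeat an ID (e.g. several mappings per SNP); the right
    stream is assumed to hold each ID at most once.
    """
    right_iter = iter(right)
    r = next(right_iter, None)
    for l in left:
        # Advance the right stream until it catches up with the left ID.
        while r is not None and r[0] < l[0]:
            r = next(right_iter, None)
        if r is not None and r[0] == l[0]:
            yield l + r[1:]


def load_lift(lines: Iterable[str]) -> Dict[str, Tuple[str, int]]:
    """Map contig name -> (chrom, offset) from lift-file lines."""
    lift = {}
    for line in lines:
        if not line.strip():
            continue
        offset, contig, _old_size, chrom, _new_size = line.split()[:5]
        lift[contig] = (chrom, int(offset))
    return lift


def lift_to_chrom(contig: str, start: int, end: int,
                  lift: Dict[str, Tuple[str, int]]) -> Tuple[str, int, int]:
    """Translate contig coords to chrom coords (assumes + strand contigs)."""
    chrom, offset = lift[contig]
    return chrom, start + offset, end + offset


# Made-up sample data: (snp_id, contig, start, end) and (snp_id, avHet).
loc_rows = [["1001", "ctgA", "10", "11"],
            ["1001", "ctgB", "3", "4"],
            ["1003", "ctgB", "5", "6"]]
snp_rows = [["1001", "0.5"], ["1002", "0.1"], ["1003", "0.2"]]
lift = load_lift(["1000\tctgA\t500\tchr1\t5000",
                  "2000\tctgB\t500\tchr1\t5000"])

joined = list(join_by_id(loc_rows, snp_rows))
lifted = [lift_to_chrom(row[1], int(row[2]), int(row[3]), lift)
          for row in joined]
```

SNP 1002 has no location row in the sample data, so it drops out of the join, much as an inner join on the command line would drop it.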

snpProcessRawData

The most complex part of the process, and the most likely to require development work, is the translation of NCBI encodings into UCSC's format and consistency checks performed by snpProcessRawData. NCBI has made some changes and extensions to dbSNP in the past several revisions, and that can be expected to continue, so our code (both snpProcessRawData and the CGIs that it feeds) must keep up.

error-checking

  • errAbort for problems that indicate the wrong data file or a need to update the software
  • write to an error file, and omit the row, for serious data inconsistencies
  • write to an exception file, and keep the row, for minor data inconsistencies
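
The three levels of severity listed above could look roughly like this in Python. This is a sketch of the pattern only: the column layout and the specific checks are hypothetical and are not taken from snpProcessRawData, and raising `SystemExit` stands in for errAbort.

```python
import io
from typing import List, Optional, TextIO

# Hypothetical input layout: snp_id, start, end, ref_allele, observed
EXPECTED_COLUMNS = 5


def check_row(fields: List[str], error_out: TextIO,
              exception_out: TextIO) -> Optional[List[str]]:
    """Return the row if it should be kept, or None if it must be omitted."""
    # Level 1: the input itself looks wrong (wrong file, or the software
    # needs updating), so stop the whole run.
    if len(fields) != EXPECTED_COLUMNS:
        raise SystemExit(
            f"expected {EXPECTED_COLUMNS} columns, got {len(fields)}")
    snp_id, start, end, ref_allele, observed = fields

    # Level 2: serious inconsistency; log it to the error file and drop the row.
    if int(end) < int(start):
        error_out.write(f"rs{snp_id}\tend < start\n")
        return None

    # Level 3: minor inconsistency; note it in the exception file, keep the row.
    if ref_allele not in observed.split("/"):
        exception_out.write(f"rs{snp_id}\tref allele not in observed\n")
    return fields


# In-memory files stand in for the real error and exception files.
errors, exceptions = io.StringIO(), io.StringIO()
kept = [row for row in (
    check_row(["1", "10", "11", "A", "A/G"], errors, exceptions),
    check_row(["2", "20", "15", "C", "C/T"], errors, exceptions),
    check_row(["3", "30", "31", "G", "A/T"], errors, exceptions),
) if row is not None]
```

Here the second row is omitted and logged as an error, while the third row is kept but noted as an exception.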

reformatting

exceptions

reporting problems to NCBI

other sanity checks

comparison to previous version

summary?

Maybe the summary could look like blood test results, where each measurement is shown next to its normal range and anything outside that range is flagged. Heather has left various hints in the code about how many of each type there should be. This could probably be done in post-processing.
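
A rough Python sketch of such a "measurement plus normal range" report follows. The category names, counts and ranges are invented for the example; real ranges would have to come from the hints in the code or from previous releases.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

ReportLine = Tuple[str, int, Optional[int], Optional[int], str]


def flag_out_of_range(values: Iterable[str],
                      normal_ranges: Dict[str, Tuple[int, int]]) -> List[ReportLine]:
    """Count each category and flag counts outside their normal range.

    Returns (category, count, low, high, flag) tuples; flag is "" when the
    count is within range.
    """
    counts = Counter(values)
    report: List[ReportLine] = []
    for category, (low, high) in sorted(normal_ranges.items()):
        n = counts.get(category, 0)
        flag = "" if low <= n <= high else "OUT OF RANGE"
        report.append((category, n, low, high, flag))
    # Categories with no known range may signal a new NCBI encoding.
    for category in sorted(set(counts) - set(normal_ranges)):
        report.append((category, counts[category], None, None, "UNEXPECTED"))
    return report


# Invented data: observed class values and hypothetical normal ranges.
observed_classes = ["single"] * 5 + ["insertion"] + ["mystery"]
ranges = {"single": (3, 10), "insertion": (2, 4), "deletion": (0, 2)}
report = flag_out_of_range(observed_classes, ranges)
flagged = [line for line in report if line[4]]
```

Only the flagged lines need a closer look; the full report could still be kept for comparison with the previous version.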

after loading the SNP track:

make masked sequences

update orthos

Both of those (especially orthos) are long and involved processes, probably worthy of separate automation and documentation.

stats?

In addition to all of the how-to material, this page could include actual snp128 stats, maybe on a separate page. They might be useful for reporting to NCBI.