DbSNP Track Notes: Difference between revisions

From genomewiki
(First draft of notes on SNP track development.)
 
(Updated after putting snp128 in the pushQ; could still say a lot more.)
Line 1: Line 1:
This page exists to document the construction of UCSC's SNP track, based on NCBI's [http://www.ncbi.nlm.nih.gov/projects/SNP/ dbSNP].  The goal is to enable a Genome Browser developer to come up to speed quickly when first building or maintaining the track and associated code.
This page is intended to be a start-up guide for developers about to build UCSC's SNP track, based on NCBI's [http://www.ncbi.nlm.nih.gov/projects/SNP/ dbSNP].  It provides some background information about dbSNP and our build process -- good stuff to know in case you need to update our code to keep up with the dbSNP developers' changes.


This page was first written during the construction of the hg18 snp128 track, based on dbSNP version 128, during Nov.-Dec. 2007.  If you are working on snp136 in 2011, and this page has not been updated since then, practice skepticism.
This page was first written during the construction of the hg18 snp128 track, based on dbSNP version 128, circa Jan. 2008.  If you are working on snp136 in 2011, and this page has not been updated since then, practice skepticism.




== NCBI dbSNP ==
== NCBI dbSNP ==
NCBI produces numbered releases of [http://www.ncbi.nlm.nih.gov/projects/SNP/ dbSNP] about twice a year.  dbSNP includes an enormous relational database, and large specially formatted fasta files that contain each SNP's variant(?) and flanking sequences.  We download the fasta files and a subset of the dbSNP, and then extract the pieces used by our SNP track (more below).
NCBI produces numbered releases of [http://www.ncbi.nlm.nih.gov/projects/SNP/ dbSNP] about twice a year.  dbSNP includes an enormous relational database, and large specially formatted fasta files: for each SNP, there is a detailed fasta header line, followed by the left flanking sequence, a single IUPAC ambiguous base representing the SNP on a line by itself, and the right flanking sequence.  We download the fasta files and a subset of the dbSNP database, and then extract the pieces used by our SNP track (more below).


links to NCBI docs, dbsnp-announce email list, ftp dirs
''a bit more about their build process in general''


== UCSC snpNNN track overview ==
''links to NCBI docs, dbsnp-announce email list, ftp dirs''
UCSC's track corresponding to dbSNP release NNN is (shortLabel) SNP NNN, (tableName) snpNNN.
 
=== Track tables and files ===
db tables:
* core track: snpNNN, snpSeq, snpNNNExceptions, snpNNNExceptionsDesc(is this used?)
* associated tables: snpNNNOrtho{PanTro2,RheMac2}, snpNNNorthoPanTro2RheMac2
* download files: masked sequences
 
=== Genome Browser track code ===
* hgTracks
* hgc
* hgTrackUi


== Subset of NCBI fields used to build snpNNN track ==
=== Subset of NCBI fields used to build snpNNN track ===


{| cellspacing="0" border="1"
{| cellspacing="0" border="1"
Line 30: Line 18:
|-
|-
|chrom
|chrom
|ContigLoc / contigInfo / liftUp
|ContigLoc / ContigInfo / liftUp
|-
|-
|chromStart
|chromStart
Line 81: Line 69:
|}
|}


== UCSC snpNNN track overview ==
UCSC's track/table corresponding to dbSNP release NNN is snpNNN; the shortLabel is "SNPs (NNN)".
=== Track tables and files ===
db tables:
* core track: snpNNN, snpNNNSeq, snpNNNExceptions, snpNNNExceptionDesc
* associated tables: snpNNNOrtho{PanTro2,RheMac2}, snpNNNorthoPanTro2RheMac2
gbdb files:
* /gbdb/DB/snp/snpNNN.fa
''hgdownload files: masked sequences (for human only?)''
=== Genome Browser track code ===
In all of these files, look for snp125*, not the corresponding snp* (older track) functions.
* inc/snp125Ui.h, lib/snp125Ui.c
* hgTrackUi/hgTrackUi.c
* hgTracks/variation.c
* hgc/hgc.c
''could say a lot more here about the UI filters, special names when orthos exist, trackDb settings, hgc details...''


== Overview of track build process ==
== Overview of track build process ==


planning to automate this...
''planning to automate this...'' to see how it was done for hg18, search for snp128 in makeDb/doc/hg18.txt.


The process of building the core SNP track follows these basic steps:
The process of building the core SNP track follows these basic steps:
# Download subset of files from NCBI
# Download fasta files and subset of database table dumps from dbSNP
# Create a temporary db on a workhorse machine and load (subset of) NCBI tables
# Create a temporary db on a workhorse machine and load (subset of) NCBI tables
# Extract the relevant fields of NCBI tables and fasta headers into files sorted and indexed by SNP ID.
# Extract the relevant fields of NCBI tables and fasta headers into files sorted and indexed by SNP ID.
Line 94: Line 104:
#* If necessary, work with NCBI to resolve any major issues discovered above.
#* If necessary, work with NCBI to resolve any major issues discovered above.
#* If necessary, update the Genome Browser CGIs to handle new values (e.g. new function annotations).
#* If necessary, update the Genome Browser CGIs to handle new values (e.g. new function annotations).
# Install sequence files in gbdb and load database tables.
# Install sequence file in /gbdb and load database tables.


That is just for the core track, so QA can get started -- the next steps are to generate masked SNP sequences and orthologous SNPs; more on those later.
For human, we also generate masked SNP sequences and orthologous SNP mappings, but QA can get started on those core tables in the meantime.


The first several steps are straightforward and scripted using good old unix commands like awk, sort and join, as well as hgsql to pull named fields from the NCBI tables.  The translation and encoding step is performed by kent/src/hg/snp/snpLoad/snpProcessRawData.c.   
The first several steps are straightforward and scripted using good old unix commands like awk, sort and join, as well as hgsql to pull named fields from the NCBI tables.  The translation and encoding step is performed by kent/src/hg/snp/snpLoad/snpNcbiToUcsc.c.   


== snpProcessRawData ==
== snpNcbiToUcsc ==


The most complex part of the process, and the most likely to require development work, is the translation of NCBI encodings into UCSC's format and consistency checks performed by snpProcessRawData.  NCBI has made some changes and extensions to dbSNP in the past several revisions, and that can be expected to continue, so our code (both snpProcessRawData and the CGIs that it feeds) must keep up.
The most complex part of the process, and the most likely to require development work, is the translation of NCBI encodings into UCSC's format and consistency checks performed by snpNcbiToUcsc.  NCBI has made some changes and extensions to dbSNP in the past several revisions, and that can be expected to continue, so our code (both snpNcbiToUcsc and the CGIs that it feeds) must keep up.


=== error-checking ===
Prior to snp128, about 20 programs in hg/snp/snpLoad/ were used to collect, translate and check the data (see snp126 construction in hg18.txt).  snpNcbiToUcsc was written to replace all of them (except the parts that were replaced by hgsql, awk, sort and join), in order to simplify maintenance of the code.  Side benefits include speedup (single pass over all 12M rows, takes 3.5min), improved checking of formats using the regex library, and auto-generation of snpNNN.sql and snpNNNExceptionDesc.tab.
 
/* ATTENTION DEVELOPERS
  *
  * snpNcbiToUcsc should fail if NCBI makes any significant changes to dbSNP.
  * If it fails, or if it skips any SNPs due to errors (other than missing
  * observed / deleted SNP), please investigate.  Will the change in dbSNP
  * require changes to our CGIs in addition to snpNcbiToUcsc?
  *
  * snpNcbiToUcsc.c has a lot of comments.  Please read them, and please
  * update them when making changes!
  */
 
=== Reformatting / adjustments to the data ===
NCBI uses a 0-based, fully closed coordinate system.  In most cases, this can be translated to our 0-based, half open system by adding 1 to the end coordinate.  However, they represent genomic insertion points as two bases long, with the insertion point between the bases.  To convert those to zero-base-long points in our coord system, we increment the start and leave the end alone. 
 
For several fields, we translate NCBI's numeric encodings into string values (represented as sets or enums in the snpNNN database table).  Many of these are recognizable as names that NCBI uses in dump files (*.bcp.gz) or used to use in ASN, but not always, especially for locType.  There is some history there and I have chosen to keep the same string values in snp128 and later that were used in snp125-127. 
 
=== Checks for errors or oddities ===
snpNcbiToUcsc handles unexpected conditions in several ways depending on severity:
*errAbort for problems that indicate wrong data file or need to update software
*errAbort for problems that indicate wrong data file or need to update software
*error file, and omit row, for serious data inconsistencies
*write line to snpNNNErrors.bed file, and omit row from snpNNN.bed, for serious data inconsistencies
*exception file for minor data inconsistencies
*write line to snpNNNExceptions.bed file for minor data inconsistencies or other conditions we want to mention in the Annotations section of the hgc details page
 
If there is an errAbort or error output, it probably means that dbSNP has changed something about how it encodes its data, not necessarily that there is a serious error in the data -- but always investigate to make sure.


=== reformatting ===
=== Exceptions ===
=== exceptions ===
snpNcbiToUcsc checks for ~18 unusual conditions, most (but not all) of which imply that the SNP might not be perfectly mapped to the genome.  These are referred to as exceptions in the code/database and "Annotations" in hgc.  When an exception is found, a line of bed4+ is written out to snpNNNExceptions.bed: chrom, start, end, rsId, and exception name.  snpNcbiToUcsc tallies of the counts of each type of exception, and upon completion, it writes out snpNNNExceptionDesc.tab; each row has exception name, count, and a description that appears in hgc.  The types of checks (each type of check might cover several different specific exceptions) are described in trackDb/snpNNN.html. 


''describe exceptions -- rationale, implications etc.''


== reporting problems to NCBI ==
== Reporting problems to NCBI ==




Line 119: Line 151:
=== comparison to previous version ===
=== comparison to previous version ===
=== summary? ===
=== summary? ===
maybe something like blood test results where you see measurement plus normal range, and flag things out of normal range?  Heather has various hints in the code for how many of each type there should be.  This could probably be done by the post-processing.
''maybe something like blood test results where you see measurement plus normal range, and flag things out of normal range?  Heather has various hints in the code for how many of each type there should be.  This could probably be done by the post-processing.''


== after loading the SNP track: ==
== after loading the SNP track: ==
=== make masked sequences ===
=== make masked sequences ===
=== update orthos ===
=== update orthos ===
both of those (especially orthos) are quite long & involved processes, probably worthy of separate automation and doc.
''both of those (especially orthos) are quite long & involved processes, probably worthy of separate automation and doc.''


== stats? ==
== stats? ==
in addition to all of the howto stuff... actual snp128 stats! :) maybe on a separate page. might be useful for reporting to NCBI.
''in addition to all of the howto stuff... actual snp128 stats! :) maybe on a separate page. might be useful for reporting to NCBI.''

Revision as of 05:34, 24 January 2008

This page is intended to be a start-up guide for developers about to build UCSC's SNP track, based on NCBI's dbSNP. It provides some background information about dbSNP and our build process -- good stuff to know in case you need to update our code to keep up with the dbSNP developers' changes.

This page was first written during the construction of the hg18 snp128 track, based on dbSNP version 128, circa Jan. 2008. If you are working on snp136 in 2011, and this page has not been updated since then, practice skepticism.


NCBI dbSNP

NCBI produces numbered releases of dbSNP about twice a year. dbSNP includes an enormous relational database, and large specially formatted fasta files: for each SNP, there is a detailed fasta header line, followed by the left flanking sequence, a single IUPAC ambiguous base representing the SNP on a line by itself, and the right flanking sequence. We download the fasta files and a subset of the dbSNP database, and then extract the pieces used by our SNP track (more below).
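As a rough illustration of the fasta record layout described above, here is a minimal Python sketch that splits one record into header, left flank, variant base, and right flank. The function name parse_snp_fasta_record and the header text in the example are made up for illustration; real dbSNP headers carry more detail and should be checked against the actual files.

```python
IUPAC_CODES = set("ACGTURYKMSWBDHVN")


def parse_snp_fasta_record(lines):
    """Split one fasta record into (header, left_flank, variant, right_flank).

    Assumes the layout described above: a header line starting with '>',
    then left-flank sequence lines, then a single IUPAC base alone on a
    line, then right-flank sequence lines.
    """
    lines = [ln.strip() for ln in lines if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("record does not start with a fasta header")
    header = lines[0][1:]
    body = lines[1:]
    # Simplification: the first one-character line is taken as the variant,
    # so a flank line that happens to be one base long would be misread.
    for i, ln in enumerate(body):
        if len(ln) == 1 and ln.upper() in IUPAC_CODES:
            left = "".join(body[:i])
            right = "".join(body[i + 1:])
            return header, left, ln.upper(), right
    raise ValueError("no single-base variant line found")


# Hypothetical record, not copied from a real dbSNP file.
record = [
    ">hypothetical header for rs0000",
    "ACGTACGT",
    "TTGA",
    "R",
    "GGCATT",
]
header, left, variant, right = parse_snp_fasta_record(record)
print(left, variant, right)
```

The one-character-line rule is only a convenient heuristic for this sketch; a production parser would need a stricter way to locate the variant line.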

a bit more about their build process in general

links to NCBI docs, dbsnp-announce email list, ftp dirs

Subset of NCBI fields used to build snpNNN track

snpNNN field | NCBI dbSNP table(s)/file
-------------+------------------------------------------------------
chrom        | ContigLoc / ContigInfo / liftUp
chromStart   | ContigLoc / liftUp; check vs phys_pos_from
chromEnd     | ContigLoc / liftUp
name         | rs + numeric snp_id that joins all the other sources
score        | 0
strand       | ContigLoc.orientation
refNCBI      | ContigLoc.allele
refUCSC      | ContigLoc.allele if insertion, otherwise from genomic
observed     | fasta headers
molType      | fasta headers
class        | fasta headers
valid        | SNP
avHet        | SNP
avHetSE      | SNP
func         | ContigLocusId
locType      | ContigLoc
weight       | MapInfo


UCSC snpNNN track overview

UCSC's track/table corresponding to dbSNP release NNN is snpNNN; the shortLabel is "SNPs (NNN)".

Track tables and files

db tables:

  • core track: snpNNN, snpNNNSeq, snpNNNExceptions, snpNNNExceptionDesc
  • associated tables: snpNNNOrtho{PanTro2,RheMac2}, snpNNNorthoPanTro2RheMac2

gbdb files:

  • /gbdb/DB/snp/snpNNN.fa

hgdownload files: masked sequences (for human only?)

Genome Browser track code

In all of these files, look for snp125*, not the corresponding snp* (older track) functions.

  • inc/snp125Ui.h, lib/snp125Ui.c
  • hgTrackUi/hgTrackUi.c
  • hgTracks/variation.c
  • hgc/hgc.c

could say a lot more here about the UI filters, special names when orthos exist, trackDb settings, hgc details...

Overview of track build process

We are planning to automate this. To see how it was done for hg18, search for snp128 in makeDb/doc/hg18.txt.

The process of building the core SNP track follows these basic steps:

  1. Download fasta files and subset of database table dumps from dbSNP
  2. Create a temporary db on a workhorse machine and load (subset of) NCBI tables
  3. Extract the relevant fields of NCBI tables and fasta headers into files sorted and indexed by SNP ID.
  4. Use the SNP ID to join the separate files into a single file of NCBI's encoding of SNP data. Use liftUp to translate from contig coords to chrom coords.
  5. Translate NCBI's encoding of SNP data into UCSC's representation, and check for inconsistencies or other problems with the data.
    • If necessary, work with NCBI to resolve any major issues discovered above.
    • If necessary, update the Genome Browser CGIs to handle new values (e.g. new function annotations).
  6. Install sequence file in /gbdb and load database tables.

For human, we also generate masked SNP sequences and orthologous SNP mappings, but QA can get started on the core tables in the meantime.

The first several steps are straightforward and scripted using good old unix commands like awk, sort and join, as well as hgsql to pull named fields from the NCBI tables. The translation and encoding step is performed by kent/src/hg/snp/snpLoad/snpNcbiToUcsc.c.
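The join-and-lift steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual build script (which uses awk, sort, join and liftUp): the function names, contig names, offsets and SNP IDs are all hypothetical, and contigs are assumed to lie on the + strand of the chromosome.

```python
def join_by_snp_id(left_rows, right_rows):
    """Merge-join two lists of (snp_id, fields) tuples sorted by snp_id.

    Mimics what unix join does on pre-sorted files: only IDs present in
    both inputs are emitted. Assumes each ID appears at most once per input.
    """
    joined = []
    i = j = 0
    while i < len(left_rows) and j < len(right_rows):
        left_id, left_fields = left_rows[i]
        right_id, right_fields = right_rows[j]
        if left_id == right_id:
            joined.append((left_id, left_fields + right_fields))
            i += 1
            j += 1
        elif left_id < right_id:
            i += 1
        else:
            j += 1
    return joined


def lift_to_chrom(contig, start, end, contig_offsets):
    """Translate contig coords to chrom coords by adding the contig's offset.

    contig_offsets maps contig name -> (chrom, offset of contig start).
    Simplification: every contig is assumed to be on the + strand.
    """
    chrom, offset = contig_offsets[contig]
    return chrom, start + offset, end + offset


# Hypothetical inputs, each already sorted by numeric SNP ID.
locations = [(101, ("ctgA", 10, 10)), (205, ("ctgB", 4, 5))]
fasta_info = [(101, ("A/G",)), (150, ("C/T",)), (205, ("-/AT",))]
offsets = {"ctgA": ("chr1", 1000), "ctgB": ("chr2", 500)}

rows = []
for snp_id, (contig, start, end, observed) in join_by_snp_id(locations, fasta_info):
    chrom, c_start, c_end = lift_to_chrom(contig, start, end, offsets)
    rows.append((chrom, c_start, c_end, "rs%d" % snp_id, observed))
print(rows)
```

SNP 150 appears in only one input, so it drops out of the join, which is the same behavior unix join has by default.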

snpNcbiToUcsc

The most complex part of the process, and the most likely to require development work, is the translation of NCBI encodings into UCSC's format and consistency checks performed by snpNcbiToUcsc. NCBI has made some changes and extensions to dbSNP in the past several revisions, and that can be expected to continue, so our code (both snpNcbiToUcsc and the CGIs that it feeds) must keep up.

Prior to snp128, about 20 programs in hg/snp/snpLoad/ were used to collect, translate and check the data (see snp126 construction in hg18.txt). snpNcbiToUcsc was written to replace all of them (except the parts that were replaced by hgsql, awk, sort and join), in order to simplify maintenance of the code. Side benefits include a speedup (a single pass over all 12M rows takes 3.5 min), improved checking of formats using the regex library, and auto-generation of snpNNN.sql and snpNNNExceptionDesc.tab.

/* ATTENTION DEVELOPERS
 *
 * snpNcbiToUcsc should fail if NCBI makes any significant changes to dbSNP.
 * If it fails, or if it skips any SNPs due to errors (other than missing
 * observed / deleted SNP), please investigate.  Will the change in dbSNP 
 * require changes to our CGIs in addition to snpNcbiToUcsc?
 *
 * snpNcbiToUcsc.c has a lot of comments.  Please read them, and please
 * update them when making changes!
 */

Reformatting / adjustments to the data

NCBI uses a 0-based, fully closed coordinate system. In most cases, this can be translated to our 0-based, half-open system by adding 1 to the end coordinate. However, NCBI represents genomic insertion points as two bases long, with the insertion point between the two bases. To convert those to zero-length points in our coordinate system, we increment the start and leave the end alone.
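The coordinate conversion described above can be written as a small function. This is a sketch of the arithmetic only, not code taken from snpNcbiToUcsc.c; the function name and the is_insertion flag are assumptions (how an insertion is recognized in the real data is not covered here).

```python
def ncbi_to_ucsc_coords(ncbi_start, ncbi_end, is_insertion):
    """Convert NCBI 0-based closed coords to UCSC 0-based half-open coords."""
    if is_insertion:
        # NCBI gives the two bases flanking the insertion point; UCSC wants
        # a zero-length item, so bump the start and keep the end.
        if ncbi_end != ncbi_start + 1:
            raise ValueError("insertion should span exactly two bases")
        return ncbi_start + 1, ncbi_end
    # Ordinary case: closed end becomes half-open end.
    return ncbi_start, ncbi_end + 1


single_base = ncbi_to_ucsc_coords(99, 99, is_insertion=False)
insertion = ncbi_to_ucsc_coords(99, 100, is_insertion=True)
three_bases = ncbi_to_ucsc_coords(99, 101, is_insertion=False)
print(single_base, insertion, three_bases)
```

In the example, a single base at NCBI position 99 becomes the UCSC interval 99-100, while an insertion between NCBI bases 99 and 100 becomes the zero-length UCSC interval 100-100.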

For several fields, we translate NCBI's numeric encodings into string values (represented as sets or enums in the snpNNN database table). Many of these are recognizable as names that NCBI uses in dump files (*.bcp.gz) or used to use in ASN, but not always, especially for locType. There is some history there and I have chosen to keep the same string values in snp128 and later that were used in snp125-127.

Checks for errors or oddities

snpNcbiToUcsc handles unexpected conditions in several ways depending on severity:

  • errAbort for problems that indicate wrong data file or need to update software
  • write line to snpNNNErrors.bed file, and omit row from snpNNN.bed, for serious data inconsistencies
  • write line to snpNNNExceptions.bed file for minor data inconsistencies or other conditions we want to mention in the Annotations section of the hgc details page

If there is an errAbort or error output, it probably means that dbSNP has changed something about how it encodes its data, not necessarily that there is a serious error in the data -- but always investigate to make sure.
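To make the three severity levels concrete, here is a minimal Python sketch of the routing logic. It is not a translation of snpNcbiToUcsc.c: the individual checks, the field layout, and the names exampleEndBeforeStart and exampleMinorOddity are placeholders invented for illustration, and lists stand in for the output files.

```python
EXPECTED_FIELD_COUNT = 5  # hypothetical row width for this sketch


class AbortError(Exception):
    """Stands in for errAbort: the whole run stops."""


def route_row(fields, good, errors, exceptions):
    """Send one row to the right output depending on problem severity."""
    # Severity 1: wrong-looking input means wrong file or stale software.
    if len(fields) != EXPECTED_FIELD_COUNT:
        raise AbortError("expected %d fields, got %d"
                         % (EXPECTED_FIELD_COUNT, len(fields)))
    chrom, start, end, name, weight = fields
    bed4 = (chrom, start, end, name)
    # Severity 2: serious inconsistency; log it and omit the row.
    if end < start:
        errors.append(bed4 + ("exampleEndBeforeStart",))
        return
    # Severity 3: minor oddity; keep the row and note an exception.
    good.append(fields)
    if weight > 1:  # placeholder condition, not a real check
        exceptions.append(bed4 + ("exampleMinorOddity",))


good, errors, exceptions = [], [], []
for fields in [
    ("chr1", 100, 101, "rs1", 1),
    ("chr1", 300, 200, "rs2", 1),
    ("chr2", 50, 51, "rs3", 3),
]:
    route_row(fields, good, errors, exceptions)
print(len(good), len(errors), len(exceptions))
```

The point of the structure is that only the first level halts the run; the other two let processing continue so that a single pass reports everything it finds.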

Exceptions

snpNcbiToUcsc checks for ~18 unusual conditions, most (but not all) of which imply that the SNP might not be perfectly mapped to the genome. These are referred to as exceptions in the code/database and "Annotations" in hgc. When an exception is found, a line of bed4+ is written out to snpNNNExceptions.bed: chrom, start, end, rsId, and exception name. snpNcbiToUcsc tallies the count of each type of exception, and upon completion, it writes out snpNNNExceptionDesc.tab; each row has exception name, count, and a description that appears in hgc. The types of checks (each type of check might cover several different specific exceptions) are described in trackDb/snpNNN.html.
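The two outputs described above (one bed4+ line per exception, plus a per-type summary) can be sketched as follows. The exception names and descriptions here are placeholders, not the real ones from trackDb/snpNNN.html, and writing a row for types with a zero count is an assumption of this sketch.

```python
import io
from collections import Counter

# Hypothetical exception names and descriptions, for illustration only.
EXCEPTION_DESCRIPTIONS = {
    "exampleOddityA": "Placeholder description A",
    "exampleOddityB": "Placeholder description B",
}


def write_exceptions(found, bed_out, desc_out):
    """Write bed4+ exception lines, then a name/count/description table."""
    counts = Counter()
    for chrom, start, end, rs_id, exc_name in found:
        bed_out.write("%s\t%d\t%d\t%s\t%s\n"
                      % (chrom, start, end, rs_id, exc_name))
        counts[exc_name] += 1
    # One summary row per known exception type, including zero counts.
    for name in sorted(EXCEPTION_DESCRIPTIONS):
        desc_out.write("%s\t%d\t%s\n"
                       % (name, counts[name], EXCEPTION_DESCRIPTIONS[name]))
    return counts


found = [
    ("chr1", 100, 101, "rs1", "exampleOddityA"),
    ("chr1", 250, 250, "rs2", "exampleOddityA"),
]
bed_out, desc_out = io.StringIO(), io.StringIO()
counts = write_exceptions(found, bed_out, desc_out)
print(bed_out.getvalue(), desc_out.getvalue(), sep="")
```

Keeping the tally in the same loop that writes the bed lines mirrors the single-pass design mentioned earlier on this page.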

describe exceptions -- rationale, implications etc.

Reporting problems to NCBI

other sanity checks

comparison to previous version

summary?

maybe something like blood test results where you see measurement plus normal range, and flag things out of normal range? Heather has various hints in the code for how many of each type there should be. This could probably be done by the post-processing.
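One possible shape for that kind of summary, sketched in Python: compare each measured count to an expected range and flag anything outside it. Nothing like this exists in the code yet as far as this page says; the function name, the category names and every number below are invented for illustration.

```python
def flag_out_of_range(counts, normal_ranges):
    """Return (name, count, low, high, flag) tuples, sorted by name."""
    report = []
    for name in sorted(counts):
        count = counts[name]
        if name not in normal_ranges:
            # No expectation recorded yet; surface it rather than hide it.
            report.append((name, count, None, None, "NO RANGE"))
            continue
        low, high = normal_ranges[name]
        flag = "ok" if low <= count <= high else "OUT OF RANGE"
        report.append((name, count, low, high, flag))
    return report


# Hypothetical counts and ranges.
counts = {"exampleTypeA": 120, "exampleTypeB": 9000, "exampleTypeC": 5}
normal_ranges = {"exampleTypeA": (100, 200), "exampleTypeB": (0, 1000)}

report = flag_out_of_range(counts, normal_ranges)
for line in report:
    print(line)
```

The ranges could be seeded from the counts of the previous release, which would also cover the "comparison to previous version" idea above.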

after loading the SNP track:

make masked sequences

update orthos

both of those (especially orthos) are quite long & involved processes, probably worthy of separate automation and doc.

stats?

in addition to all of the howto stuff... actual snp128 stats! :) maybe on a separate page. might be useful for reporting to NCBI.