Conservation Track: Difference between revisions

(added my notes from Kate's talk)

Revision as of 20:47, 1 August 2006

Conservation Track Implementation Notes

1) Track Components: Tables

   multizNway: scored ref, index into maf files (via extFile)
   multizNwaySummary: added to improve performance when the display is > 1 million bases
   multizNwayFrames: Mark D's codon frames, Brian R's gap annotation
   phastConsNway: wiggle, one score per base in the genome; provides the index into the wib file. Scores are probabilities in the range 0..1


2) Track Components: Files

 * Display:
       /gbdb/<db>/multizNway/*.maf  (multiz table uses this file)
       /gbdb/<db>/phastConsNway/*.wib  (phastCons table uses this file)

The actual data values are not stored in the database tables. They are kept in a compressed binary format: a single byte in the phastCons17.wib file is interpreted to determine the data value at each position in the chromosome. The table you are looking at is merely the index into the single-byte .wib file.

The phastCons17way/chr*.gz files contain the per-base scores generated by the phastCons program. These scores are then compressed and encoded for display as a wiggle (by wigEncode), producing two files: a .wig file that is loaded into the table (by hgLoadWiggle), and the .wib file, which is referenced by the values in the table. The original scores (or values very close to them) can be extracted from the wiggle with the "hgWiggle" utility.
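
To make the single-byte encoding concrete, here is a minimal Python sketch of the idea, not the wigEncode source. It assumes each index row carries a lower limit and a data range, that a byte in 0..127 selects one of 128 evenly spaced levels within that range, and that 128 marks "no data"; the function and constant names are made up for illustration. It also shows why extracted scores are only "very close to" the originals.

```python
# Illustrative sketch only: assumed constants, not taken from the kent source.
MAX_BIN = 127   # assumed highest byte value that encodes a score
NO_DATA = 128   # assumed byte value meaning "no score at this base"


def encode_score(value, lower_limit, data_range):
    """Quantize one per-base score into a single byte (0..MAX_BIN)."""
    if data_range <= 0:
        return 0
    fraction = (value - lower_limit) / data_range
    # Clamp so out-of-range scores still fit in the byte.
    return max(0, min(MAX_BIN, round(fraction * MAX_BIN)))


def decode_byte(byte, lower_limit, data_range):
    """Recover an approximate score from a byte; None when there is no data."""
    if byte == NO_DATA:
        return None
    return lower_limit + data_range * (byte / MAX_BIN)


# Round trip for a phastCons-style score in 0..1.
original = 0.42
stored = encode_score(original, lower_limit=0.0, data_range=1.0)
recovered = decode_byte(stored, lower_limit=0.0, data_range=1.0)
print(stored, recovered)
```

With these assumptions the round-trip error is at most half a quantization step, i.e. data_range / 254.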

 * Downloads:
       goldenPath/<db>/multizNway/chr*.maf
       goldenPath/<db>/multizNway/upstream*.maf
       goldenPath/<db>/phastConsNway/*  (compressed, per chrom)

3) Track Components: TrackDb

 * Required:
       type wigMaf  (track type)
       wiggle  (wiggle table)
 * Optional:
       speciesOrder (the order in which the species appear on the track control page and in the browser; should be in phylo order)
       speciesGroups (the groups into which the species are split, e.g. vertebrate, mammals)
       summary (points to multizNwaySummary table)
       frames (points to multizNwayFrames table)
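
As a rough illustration of how these settings fit together, the Python sketch below parses a trackDb-style stanza and reports any missing required settings. The stanza text is a made-up example: the table names use the N placeholder from these notes, the species names are hypothetical, and parse_stanza / missing_required are illustrative helpers, not part of any browser tooling.

```python
# Required and optional settings as listed in the notes above.
REQUIRED = ("type", "wiggle")
OPTIONAL = ("speciesOrder", "speciesGroups", "summary", "frames")


def parse_stanza(text):
    """Parse 'setting value' lines into a dict (assumed one setting per line)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        settings[key] = value.strip()
    return settings


def missing_required(settings):
    """Return the required settings that the stanza does not define."""
    return [key for key in REQUIRED if key not in settings]


# Hypothetical stanza; names are placeholders, not a real track definition.
EXAMPLE = """
track multizNway
type wigMaf
wiggle phastConsNway
speciesOrder speciesA speciesB speciesC
summary multizNwaySummary
frames multizNwayFrames
"""

settings = parse_stanza(EXAMPLE)
print(missing_required(settings))
```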

4) Most Conserved Track

 * Table:
       phastConsNwayElements (BED of scored elements)
 * Files:
       NONE

5) Track Construction: Overview

 1. Create single-coverage pairwise alignments (axtNet)
 2. Create multiple alignment
 3. Generate conservation scores and conserved elements (phastCons)
 4. Add gap annotation to multiple alignment (Brian R's gap annotation software)
 5. Create multiple alignment summary
 6. Create frame tables for multiple alignment


6) Pairwise Alignments: Procedure

 1. Blastz Alignment (blastz, lavToPsl). This generates a set of alignments in psl format; these are close enough that you can swap species1 <-> species2.
 2. Chaining (axtChain, chainMergeSort, chainAntiRepeat)
 3. Netting (chainNet, netFilter)
 4. Extraction of single-coverage alignments from the net (netToAxt). The net chooses the single best chain for Level 1, so you can't simply swap nets the way you can chains. The resulting netAxt is fed into MULTIZ.
 *  All automated by doBlastzChainNet.pl
  (Thanks, Angie!!)
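
To illustrate what "single-coverage" means in step 4, here is a toy Python sketch. It is a deliberate simplification and not the chainNet algorithm: chains are reduced to a reference interval plus a score, the highest-scoring chain claims its bases first, and lower-scoring chains only keep the bases nobody has claimed yet. Real nets are hierarchical and also fill gaps within chains. The chain names, coordinates, and scores are invented.

```python
def single_coverage(chains):
    """Assign each reference base to at most one chain, best score first.

    chains: list of (name, start, end, score) with half-open reference
    coordinates. Returns a sorted list of (start, end, name) pieces that
    never overlap.
    """
    covered = []
    for name, start, end, _score in sorted(chains, key=lambda c: -c[3]):
        free = [(start, end)]
        # Subtract every already-claimed piece from this chain's interval.
        for c_start, c_end, _ in covered:
            remaining = []
            for f_start, f_end in free:
                if c_end <= f_start or c_start >= f_end:
                    remaining.append((f_start, f_end))
                    continue
                if f_start < c_start:
                    remaining.append((f_start, c_start))
                if c_end < f_end:
                    remaining.append((c_end, f_end))
            free = remaining
        covered.extend((f_start, f_end, name) for f_start, f_end in free)
    return sorted(covered)


# Invented example: B overlaps A, C lies entirely inside A.
chains = [("A", 0, 100, 5000), ("B", 50, 150, 3000), ("C", 20, 40, 1000)]
print(single_coverage(chains))
```

Here chain A keeps all of 0..100, chain B keeps only 100..150, and chain C contributes nothing, so every reference base is covered by at most one alignment.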


7) Pairwise Alignments: Parameters

   Blastz scoring matrix (this is the $matrix that shows up on the chain description page)
   Blastz gap penalties, misc
   Lineage-specific repeat abridging (give BLASTZ masked sequence; BLASTZ avoids starting in a repeat, but will continue through one)
   Chaining min score, linear gap


8) Multiple Alignment

 * Inputs:
       1. Single-coverage pairwise alignments
       2. Species tree (phastCons "make tree")
 * Aligner:
       multiz (with autoMZ driver) (feed it the tree, and it does the multiple alignment)
       or
       TBA (Threaded Blockset Aligner) (ENCODE uses this)


9) Conservation Scoring with PhastCons (Adam S's phylogenetic HMM)

 * Inputs:
       Multiple alignment
       Species tree with branch lengths
        (optionally two trees)
 * Parameters:  rho, expected-len, target-coverage
 * Output:
       Per-base probability
       Conserved elements

(Our goal is to get 5% of the genome in conserved elements; the parameters are tweaked until we hit this.)
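
One way to check a parameter setting against the 5% goal is to measure what fraction of the genome the resulting conserved elements cover. The Python sketch below is illustrative only: it assumes the elements are available as (chrom, start, end) tuples with half-open BED-style coordinates, that the genome size is known, and the function name and example numbers are made up.

```python
def coverage_fraction(elements, genome_size):
    """Fraction of the genome covered by conserved elements.

    elements: iterable of (chrom, start, end), half-open coordinates.
    Overlapping elements on the same chromosome are counted once.
    """
    covered = 0
    last_end = {}  # furthest covered position seen so far, per chromosome
    for chrom, start, end in sorted(elements):
        start = max(start, last_end.get(chrom, 0))
        if end > start:
            covered += end - start
            last_end[chrom] = end
    return covered / genome_size


# Invented toy data: 60 covered bases in a 1000-base "genome".
elements = [("chr1", 0, 30), ("chr1", 20, 50), ("chr2", 10, 20)]
fraction = coverage_fraction(elements, genome_size=1000)
print(f"{fraction:.1%} of genome in conserved elements (target 5%)")
```

If the fraction is off target, the phastCons parameters are adjusted and the run is repeated.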

10) Multiple Alignment Summary and Annotations

   Gap Annotation (mafAddIRows)
   Summary table (hgLoadMafSummary)
   Coding frames (getFrames, etc.)