Conservation Track

From genomewiki
Revision as of 20:51, 14 November 2007 by AngieHinrichs (talk | contribs) (Added links to Blastz and Chains_Nets; wiki-formatted.)

Conservation Track Implementation Notes (Kate's slides)

Track Components: Tables

  • multizNway: scored ref, index into maf files (via extFile)
  • multizNwaySummary: added to improve performance when the display is > 1 million bases
  • multizNwayFrames: Mark D's codon frames, Brian R's gap annotation
  • phastConsNway: wiggle, one score per base in the genome; provides an index into the .wib file. Scores range from 0 to 1.


Track Components: Files

  • Display:
    • /gbdb/<db>/multizNway/*.maf (multiz table uses this file)
    • /gbdb/<db>/phastConsNway/*.wib (phastCons table uses this file)

The actual data values are not stored in the database tables. They are kept in a compressed binary format: each single byte in the phastCons17.wib file is interpreted to determine the data value at one position in the chromosome. The phastCons table is merely the indexing mechanism into the single-byte .wib file.

The phastCons17way/chr*.gz files contain the per-base scores generated by the phastCons program. wigEncode compresses and encodes these scores for display as a wiggle, producing two files: a .wig file, which is loaded into the database table (by hgLoadWiggle), and a .wib file, which is referenced by the values in the table. The original scores (or values very close to them) can be extracted from the wiggle with the hgWiggle utility.

  • Downloads:
    • goldenPath/<db>/multizNway/chr*.maf
    • goldenPath/<db>/multizNway/upstream*.maf
    • goldenPath/<db>/phastConsNway/* (compressed, per chrom)
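
The single-byte .wib encoding described under Display can be illustrated with a minimal sketch. It assumes that each byte is a bin index from 0 to 127, that byte 128 marks "no data", and that each index row in the wiggle table supplies a lower limit and a data range; none of these details are stated in these notes, and the function names are hypothetical.

```python
# Sketch: turn single-byte .wib bins back into approximate scores.
# ASSUMPTIONS (not stated in these notes): bins run 0..127, byte 128
# marks "no data", and each index row carries lower_limit and data_range.
MAX_BIN = 127
NO_DATA = 128


def decode_wib_bytes(raw, lower_limit, data_range):
    """Return one float (or None for no data) per byte in raw."""
    values = []
    for b in raw:
        if b == NO_DATA:
            values.append(None)
        else:
            values.append(lower_limit + data_range * b / MAX_BIN)
    return values


def encode_score(score, lower_limit, data_range):
    """Quantize a score to its nearest bin (this is the lossy step)."""
    if data_range == 0:
        return 0
    return round((score - lower_limit) / data_range * MAX_BIN)
```

Under these assumptions a decoded value differs from the original score by at most one bin width, which is consistent with hgWiggle returning scores that are very close to, but not always identical to, the originals.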


Track Components: TrackDb

  • Required:
    • type wigMaf (track type)
    • wiggle (wiggle table)
  • Optional:
    • speciesOrder (the order in which the species appear on the track control page and in the browser; should be phylogenetic order)
    • speciesGroups (the groups into which the species are split, e.g. vertebrate, mammals)
    • summary (points to the multizNwaySummary table)
    • frames (points to the multizNwayFrames table)
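
As a rough illustration of the required settings listed above, the sketch below parses a stanza of "key value" lines and reports missing required settings. The stanza text, the species names, and the helper names are hypothetical; only the setting names (type, wiggle, speciesOrder, summary, frames) come from these notes.

```python
# Sketch: check a trackDb-style stanza against the required settings above.
# The stanza text and helper names are hypothetical examples.
REQUIRED = ("type", "wiggle")


def parse_stanza(text):
    """Parse 'key value...' lines into a dict; blank and '#' lines are skipped."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        settings[key] = value.strip()
    return settings


def check_wigmaf(settings):
    """Return a list of problems; an empty list means the stanza passes."""
    problems = [f"missing required setting: {k}" for k in REQUIRED
                if k not in settings]
    track_type = settings.get("type", "")
    if track_type and track_type.split()[0] != "wigMaf":
        problems.append(f"type is {track_type!r}, expected wigMaf")
    return problems


EXAMPLE = """
track multizNway
type wigMaf
wiggle phastConsNway
speciesOrder speciesA speciesB speciesC
summary multizNwaySummary
frames multizNwayFrames
"""

if __name__ == "__main__":
    print(check_wigmaf(parse_stanza(EXAMPLE)))
```

The optional settings are parsed but not validated here, since the notes do not say what values they must take.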


Most Conserved Track

  • Table:
    • phastConsNwayElements (BED of scored elements)
  • Files:
    • NONE


Track Construction: Overview

  1. Create single-coverage pairwise alignments (axtNet)
  2. Create multiple alignment
  3. Generate conservation scores and conserved elements (phastCons)
  4. Add gap annotation to multiple alignment (Brian R's gap annotation software)
  5. Create multiple alignment summary
  6. Create frame tables for multiple alignment


Pairwise Alignments: Procedure

See Blastz and Chains_Nets

  1. Blastz alignment (blastz, lavToPsl). This generates a set of alignments in psl format; these are close enough to symmetric that you can swap species1 <-> species2.
  2. Chaining (axtChain, chainMergeSort, chainAntiRepeat)
  3. Netting (chainNet, netFilter)
  4. Extraction of single-coverage alignments from the net (netToAxt). The net chooses the single best chain for Level 1, so nets cannot simply be swapped the way chains can. The resulting netAxt is fed into multiz.
  • All automated by doBlastzChainNet.pl
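
The "single-coverage" property of the extracted alignments can be checked with a short sketch. It assumes that single coverage means no base of the target (reference) genome falls inside more than one alignment block; that reading, the tuple layout, and the function name are assumptions made for illustration and are not part of the pipeline tools.

```python
# Sketch: verify single coverage on the target (reference) genome.
# ASSUMPTION: "single coverage" means no target base lies in more than
# one alignment block. Blocks are (chrom, start, end), 0-based half-open.
from collections import defaultdict


def overlapping_pairs(blocks):
    """Return (chrom, earlier_span, later_span) for every detected overlap."""
    by_chrom = defaultdict(list)
    for chrom, start, end in blocks:
        by_chrom[chrom].append((start, end))

    clashes = []
    for chrom, spans in by_chrom.items():
        spans.sort()
        prev = spans[0]  # span with the furthest end seen so far
        for cur in spans[1:]:
            if cur[0] < prev[1]:
                clashes.append((chrom, prev, cur))
            if cur[1] > prev[1]:
                prev = cur
    return clashes
```

An empty result means the blocks tile the target without overlap; any entry identifies two blocks that cover the same target bases.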


Pairwise Alignments: Parameters

See Blastz and Chains_Nets

  • Blastz scoring matrix (this is the $matrix that shows up on the chain description page)
  • Blastz gap penalties, misc
  • Lineage-specific repeat abridging (run RepeatMasker/DateRepeats on the target and query .out files; wrapper scripts snip out the repeat sequences before blastz is run and adjust the alignment coordinates afterwards)
  • Chaining min score, linear gap


Multiple Alignment

  • Inputs:
    1. Single-coverage pairwise alignments
    2. Species tree (phastCons "make tree")
  • Aligner:
    • multiz (with autoMZ driver) (feed it the tree, and it does the multiple alignment)
    • or
    • TBA (Threaded Blockset Aligner) (ENCODE uses this)


Conservation Scoring with PhastCons (Adam S's phylogenetic HMM)

  • Inputs:
    • Multiple alignment
    • Species tree with branch lengths
    • (optionally two trees)
  • Parameters: rho, expected-len, target-coverage
  • Output:
    • Per-base probability
    • Conserved elements

(Our goal is to get 5% of the genome into conserved elements; the parameters are tweaked until we hit this target.)
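
The 5% goal can be checked after each phastCons run with a sketch like the one below, which computes the fraction of the genome covered by conserved elements given as BED-style intervals. The function names, the tolerance, and the example numbers are hypothetical; only the 5% figure comes from these notes.

```python
# Sketch: fraction of the genome covered by conserved elements.
# The 5% goal is from the notes; the tolerance and inputs are hypothetical.
def covered_fraction(elements, genome_size):
    """elements: iterable of (chrom, start, end), 0-based half-open."""
    by_chrom = {}
    for chrom, start, end in elements:
        by_chrom.setdefault(chrom, []).append((start, end))

    covered = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start > cur_end:
                # Gap before this span: close out the merged run.
                covered += cur_end - cur_start
                cur_start, cur_end = start, end
            else:
                # Overlapping or adjacent: extend the merged run.
                cur_end = max(cur_end, end)
        covered += cur_end - cur_start
    return covered / genome_size


def coverage_verdict(fraction, goal=0.05, tolerance=0.005):
    """Compare measured coverage with the goal."""
    if fraction < goal - tolerance:
        return "too low"
    if fraction > goal + tolerance:
        return "too high"
    return "on target"
```

Overlapping elements are merged before counting so that no base is counted twice. A "too low" or "too high" verdict would prompt another round of parameter adjustment and a re-run.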


Multiple Alignment Summary and Annotations

  • Gap Annotation (mafAddIRows)
  • Summary table (hgLoadMafSummary)
  • Coding frames (getFrames, etc.)