Conservation Track

From genomewiki

Latest revision as of 20:51, 14 November 2007

Conservation Track Implementation Notes (Kate's slides)

Track Components: Tables

  • multizNway: scored ref, index into maf files (via extFile)
  • multizNwaySummary: added to improve performance when the display is > 1 million bases
  • multizNwayFrames: Mark D's codon frames, Brian R's gap annotation
  • phastConsNway: wiggle table, one score per base in the genome. Provides an index into the .wib file. Scores are probabilities in the range 0..1


Track Components: Files

  • Display:
    • /gbdb/<db>/multizNway/*.maf (multiz table uses this file)
    • /gbdb/<db>/phastConsNway/*.wib (phastCons table uses this file)

The actual data values are not stored in the database tables. They are kept in a compressed binary format: each byte in the phastCons17.wib file is interpreted to determine the data value at one position in the chromosome. The table you are looking at is merely the indexing mechanism into the single-byte .wib file.

The phastCons17way/chr*.gz files contain the per-base scores generated by the phastCons program. These scores are then compressed and encoded for display as a wiggle (by wigEncode), which produces two files: a .wig file that is loaded into the database table (by hgLoadWiggle), and the .wib file, which is referenced by the values in the table. The original scores (or values very close to them) can be extracted from the wiggle with the utility "hgWiggle".
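
As a rough illustration of the single-byte encoding described above, the sketch below decodes one .wib byte into a score. The formula (a lower limit plus a data range scaled by byte/127, with 128 meaning "no data") and the names lower_limit and data_range are assumptions about how the indexing table describes each block; they are not stated on this page, so check the wiggle source before relying on them.

```python
from typing import List, Optional

MAX_WIG_VALUE = 127  # assumed largest byte that carries data
WIG_NO_DATA = 128    # assumed marker meaning "no score at this base"


def decode_wib_byte(byte: int, lower_limit: float, data_range: float) -> Optional[float]:
    """Turn one byte from a .wib file into an approximate score.

    lower_limit and data_range are assumed to come from the row of the
    indexing table that covers this byte.
    """
    if byte == WIG_NO_DATA:
        return None
    if not 0 <= byte <= MAX_WIG_VALUE:
        raise ValueError(f"unexpected wib byte: {byte}")
    return lower_limit + data_range * (byte / MAX_WIG_VALUE)


def decode_block(raw: bytes, lower_limit: float, data_range: float) -> List[Optional[float]]:
    """Decode a run of consecutive bases, one byte per base."""
    return [decode_wib_byte(b, lower_limit, data_range) for b in raw]
```

With only 128 levels per block, the decoded value can only approximate the score that phastCons wrote, which is why the extracted scores are described as "very close to" the originals.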

  • Downloads:
    • goldenPath/<db>/multizNway/chr*.maf
    • goldenPath/<db>/multizNway/upstream*.maf
    • goldenPath/<db>/phastConsNway/* (compressed, per chrom)


Track Components: TrackDb

  • Required:
    • type wigMaf (track type)
    • wiggle (wiggle table)
  • Optional:
    • speciesOrder (the order in which the species appear on the track control page and in the browser; should be in phylogenetic order)
    • speciesGroups (the groups into which the species are split, e.g. vertebrate, mammals)
    • summary (points to the multizNwaySummary table)
    • frames (points to the multizNwayFrames table)
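
To make the required and optional settings concrete, here is a minimal sketch that renders them as a trackDb-style stanza of "key value" lines. The track name multiz17way, the table names, and the species list are hypothetical examples, and render_trackdb_stanza is an illustrative helper, not part of any browser tool.

```python
from typing import Dict

REQUIRED = ("type", "wiggle")  # the two settings listed as required above


def render_trackdb_stanza(track: str, settings: Dict[str, str]) -> str:
    """Render one stanza as 'key value' lines, refusing to omit required keys."""
    missing = [key for key in REQUIRED if key not in settings]
    if missing:
        raise ValueError(f"missing required settings: {missing}")
    lines = [f"track {track}"]
    lines += [f"{key} {value}" for key, value in settings.items()]
    return "\n".join(lines)


# Hypothetical 17-way example; every name here is an assumption.
example = render_trackdb_stanza(
    "multiz17way",
    {
        "type": "wigMaf",
        "wiggle": "phastCons17way",
        "speciesOrder": "chimp mouse dog",
        "summary": "multiz17waySummary",
        "frames": "multiz17wayFrames",
    },
)
print(example)
```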


Most Conserved Track

  • Table:
    • phastConsNwayElements (BED of scored elements)
  • Files:
    • NONE


Track Construction: Overview

  1. Create single-coverage pairwise alignments (axtNet)
  2. Create multiple alignment
  3. Generate conservation scores and conserved elements (phastCons)
  4. Add gap annotation to multiple alignment (Brian R's gap annotation software)
  5. Create multiple alignment summary
  6. Create frame tables for multiple alignment


Pairwise Alignments: Procedure

See Blastz and Chains_Nets

  1. Blastz alignment (blastz, lavToPsl). This generates a set of alignments in PSL format; these are close enough to symmetric that you can swap species1 <-> species2.
  2. Chaining (axtChain, chainMergeSort, chainAntiRepeat)
  3. Netting (chainNet, netFilter)
  4. Extraction of single-coverage alignments from the net (netToAxt). The net chooses the single best chain for Level 1, so you can't simply swap nets the way you can chains. The resulting axtNet is fed into multiz.
  • All automated by doBlastzChainNet.pl
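
The idea of "single coverage" in step 4 can be sketched with a toy greedy filter: take chains from best score to worst and keep one only if it does not overlap anything already kept, so each target base is covered at most once. This is a deliberate simplification and an assumption about the spirit of netting, not the chainNet algorithm, which also places lower-scoring chains into the gaps of better ones.

```python
from typing import List, Tuple

Chain = Tuple[int, int, int]  # (score, target_start, target_end), half-open


def single_coverage(chains: List[Chain]) -> List[Chain]:
    """Keep the best-scoring chains so that no target base is covered twice."""
    kept: List[Chain] = []
    # Highest score first: the best chain always wins its region.
    for score, start, end in sorted(chains, reverse=True):
        overlaps = any(start < k_end and end > k_start for _, k_start, k_end in kept)
        if not overlaps:
            kept.append((score, start, end))
    # Report in target order.
    return sorted(kept, key=lambda chain: chain[1])


chains = [(1000, 0, 100), (500, 50, 150), (300, 100, 200)]
print(single_coverage(chains))
```

Because the choice depends on which genome is the target, running the same filter with the species swapped can keep a different set of chains, which is one way to see why nets cannot be swapped the way chains can.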


Pairwise Alignments: Parameters

See Blastz and Chains_Nets

  • Blastz scoring matrix (this is the $matrix that shows up on the chain description page)
  • Blastz gap penalties, misc
  • Lineage-specific repeat abridging (run RepeatMasker/DateRepeats on the target and query .out files; wrapper scripts snip out the lineage-specific repeat sequences before blastz is run and adjust the alignment coordinates afterwards)
  • Chaining min score, linear gap
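
The repeat abridging step above (snip sequences out, align, then adjust coordinates) can be sketched as follows. The functions abridge and to_original are hypothetical names for illustration; the real wrapper scripts are not shown on this page, and the intervals are assumed to be sorted-compatible, non-overlapping, half-open ranges.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open, in original coordinates


def abridge(seq: str, repeats: List[Interval]) -> str:
    """Remove the repeat intervals from seq before alignment."""
    pieces = []
    prev = 0
    for start, end in sorted(repeats):
        pieces.append(seq[prev:start])
        prev = end
    pieces.append(seq[prev:])
    return "".join(pieces)


def to_original(pos: int, repeats: List[Interval]) -> int:
    """Map a position in the abridged sequence back to the original sequence."""
    shift = 0
    for start, end in sorted(repeats):
        if pos + shift >= start:
            # Everything at or past this repeat moved left by its length.
            shift += end - start
        else:
            break
    return pos + shift


seq = "0123456789"
repeats = [(2, 4), (6, 8)]
short = abridge(seq, repeats)
print(short, [to_original(i, repeats) for i in range(len(short))])
```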


Multiple Alignment

  • Inputs:
    • Single-coverage pairwise alignments
    • Species tree (phastCons "make tree")
  • Aligner:
    • multiz (with autoMZ driver): feed it the tree, and it does the multiple alignment
    • or TBA (Threaded Blockset Aligner), which ENCODE uses
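
One way to picture how a species tree can guide a multiple alignment is a post-order walk that merges the closest groups first. The sketch below only lists the merge order for a tree written as nested pairs; it is an assumption about tree-guided alignment in general, not a description of what the autoMZ driver actually does, and the species names are just examples.

```python
from typing import List, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]  # a leaf name or a pair of subtrees


def merge_order(tree: Tree) -> List[Tuple[str, str]]:
    """Return the pairwise merges implied by the tree, deepest groups first."""
    steps: List[Tuple[str, str]] = []

    def visit(node: Tree) -> str:
        if isinstance(node, str):
            return node
        left, right = node
        left_name = visit(left)
        right_name = visit(right)
        steps.append((left_name, right_name))
        return f"({left_name},{right_name})"

    visit(tree)
    return steps


tree = (("human", "chimp"), ("mouse", "rat"))
for left, right in merge_order(tree):
    print("merge", left, "with", right)
```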


Conservation Scoring with PhastCons (Adam S's phylogenetic HMM)

  • Inputs:
    • Multiple alignment
    • Species tree with branch lengths
    • (optionally two trees)
  • Parameters: rho, expected-len, target-coverage
  • Output:
    • Per-base probability
    • Conserved elements

Our goal is to get 5% of the genome into conserved elements; the parameters are tweaked until we hit this target.
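
Checking a parameter setting against the 5% goal comes down to measuring what fraction of the genome the conserved elements cover. A minimal sketch, assuming the elements are available as (chrom, start, end) tuples with half-open coordinates as in BED; the function name and the toy genome size are illustrative only.

```python
from typing import Dict, List, Tuple

Element = Tuple[str, int, int]  # (chrom, start, end), half-open


def covered_fraction(elements: List[Element], genome_size: int) -> float:
    """Fraction of the genome covered by elements, counting overlaps once."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = {}
    for chrom, start, end in elements:
        by_chrom.setdefault(chrom, []).append((start, end))

    covered = 0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping or adjacent: extend the current run.
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start
                cur_start, cur_end = start, end
        covered += cur_end - cur_start
    return covered / genome_size


elements = [("chr1", 0, 30), ("chr1", 20, 50), ("chr2", 100, 150)]
print(covered_fraction(elements, 2000))
```

In a tuning loop, one would rerun phastCons with adjusted rho, expected-len, or target-coverage and recompute this fraction until it lands near 0.05.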


Multiple Alignment Summary and Annotations

  • Gap Annotation (mafAddIRows)
  • Summary table (hgLoadMafSummary)
  • Coding frames (getFrames, etc.)