Conservation Track

From genomewiki
Latest revision as of 20:51, 14 November 2007

Conservation Track Implementation Notes (Kate's slides)

Track Components: Tables

  • multizNway: scored ref, index into maf files (via extFile)
  • multizNwaySummary: summary table, added to improve performance when the displayed window is larger than 1 million bases
  • multizNwayFrames: Mark D's codon frames, Brian R's gap annotation
  • phastConsNway: wiggle table, one score per base in the genome. Provides an index into the .wib file. Scores are on a 0..1 scale.


Track Components: Files

  • Display:
    • /gbdb/<db>/multizNway/*.maf (multiz table uses this file)
    • /gbdb/<db>/phastConsNway/*.wib (phastCons table uses this file)

The actual data values are not in the database tables. They are stored in a compressed binary format: each position in the chromosome has a single byte in the phastCons17.wib file, and that byte is interpreted to determine the data value at that position. The table you are looking at is merely the indexing mechanism into the single-byte .wib file.

The phastCons17way/chr*.gz files contain the per-base scores generated by the phastCons program. These scores are then compressed and encoded for display as a wiggle (by wigEncode) to produce two files: a .wig file that is loaded into the database table (by hgLoadWiggle), and the .wib file, which is referenced by the values in the table. The original scores (or very close to the original scores) can be extracted from the wiggle by the utility "hgWiggle".
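To make the single-byte encoding concrete, here is a minimal Python sketch of how one row's bytes could be turned back into scores. It assumes the usual wiggle convention that byte values 0..127 are scaled linearly using a per-row lower limit and data range, and that 128 marks "no data"; the function and parameter names are hypothetical and this is not the browser's own code. The 128-level quantization is also why extracted scores are only "very close to" the originals.

```python
# Sketch only: assumes the common wiggle byte convention (0..127 = data, 128 = no data).
MAX_WIG_VALUE = 127   # largest byte value that encodes a real score
WIG_NO_DATA = 128     # byte value assumed to mean "no score at this base"


def decode_wib_bytes(raw, lower_limit, data_range):
    """Convert raw .wib bytes into floating-point scores.

    lower_limit and data_range stand in for the per-row values that the
    index table is assumed to supply (hypothetical parameter names).
    """
    values = []
    for b in raw:
        if b >= WIG_NO_DATA:
            values.append(None)  # position has no score
        else:
            values.append(lower_limit + data_range * b / MAX_WIG_VALUE)
    return values


if __name__ == "__main__":
    # Three bases: lowest score, highest score, missing data.
    print(decode_wib_bytes(bytes([0, 127, 128]), 0.0, 1.0))
```

In practice the index row would also say where in the .wib file to start reading and how many bytes to read; that lookup is left out here.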

  • Downloads:
    • goldenPath/<db>/multizNway/chr*.maf
    • goldenPath/<db>/multizNway/upstream*.maf
    • goldenPath/<db>/phastConsNway/* (compressed, per chrom)


Track Components: TrackDb

  • Required:
    • type wigMaf (track type)
    • wiggle (wiggle table)
  • Optional:
    • speciesOrder (the order in which the species appear on the track control page and in the browser; should be in phylogenetic order)
    • speciesGroups (the groups into which the species are split, e.g. vertebrate, mammals)
    • summary (points to the multizNwaySummary table)
    • frames (points to the multizNwayFrames table)
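As a rough illustration of the required and optional settings above, the following Python sketch parses a trackDb-style stanza and checks the two required entries. The stanza text, track name, table names, and species list are made-up examples for a hypothetical 17-way track, and the helper functions are not part of any UCSC tool.

```python
# Sketch only: the stanza below is a hypothetical example, not a real trackDb entry.
EXAMPLE_STANZA = """\
track conservation
type wigMaf 0.0 1.0
wiggle phastCons17way
speciesOrder panTro1 mm8 rn4
summary multiz17waySummary
frames multiz17wayFrames
"""


def parse_stanza(text):
    """Split 'key value...' lines into a dict, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        settings[key] = value.strip()
    return settings


def check_wigmaf(settings):
    """Return a list of problems with the required wigMaf settings."""
    problems = []
    # The first word of the type line must be wigMaf.
    if settings.get("type", "").split()[:1] != ["wigMaf"]:
        problems.append("type must be wigMaf")
    # The wiggle setting names the phastCons wiggle table.
    if not settings.get("wiggle"):
        problems.append("wiggle table is missing")
    return problems


if __name__ == "__main__":
    print(check_wigmaf(parse_stanza(EXAMPLE_STANZA)))
```

The optional settings are deliberately not checked, since a track can be displayed without them.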


Most Conserved Track

  • Table:
    • phastConsNwayElements (BED of scored elements)
  • Files:
    • NONE


Track Construction: Overview

  1. Create single-coverage pairwise alignments (axtNet)
  2. Create multiple alignment
  3. Generate conservation scores and conserved elements (phastCons)
  4. Add gap annotation to multiple alignment (Brian R's gap annotation software)
  5. Create multiple alignment summary
  6. Create frame tables for multiple alignment


Pairwise Alignments: Procedure

See Blastz and Chains_Nets

  1. Blastz alignment (blastz, lavToPsl). This generates a set of alignments in psl format; these are close enough that you can swap species1 <-> species2.
  2. Chaining (axtChain, chainMergeSort, chainAntiRepeat)
  3. Netting (chainNet, netFilter)
  4. Extraction of single-coverage alignments from the net (netToAxt). The net chooses the single best chain for Level 1. Nets can't simply be swapped the way chains can. The resulting axtNet is fed into multiz.
  • All of these steps are automated by doBlastzChainNet.pl


Pairwise Alignments: Parameters

See Blastz and Chains_Nets

  • Blastz scoring matrix (this is the $matrix that shows up on the chain description page)
  • Blastz gap penalties, misc
  • Lineage-specific repeat abridging (run RepeatMasker, then DateRepeats on the target and query .out files; wrapper scripts snip out the sequences before blastz is run and adjust alignment coordinates afterwards)
  • Chaining min score, linear gap


Multiple Alignment

  • Inputs:
    • Single-coverage pairwise alignments
    • Species tree (phastCons "make tree")
  • Aligner:
    • multiz (with autoMZ driver) (feed it the tree, and it does the multiple alignment)
    • or
    • TBA (Threaded Blockset Aligner) (ENCODE uses this)
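Because the species tree also implies the phylogenetic ordering wanted for the speciesOrder setting, a small helper can read the leaf names off the tree. This Python sketch assumes the tree is available as a Newick string with unquoted leaf names; the tree, its branch lengths, and the function name are hypothetical.

```python
import re


# Sketch only: assumes a Newick string with simple, unquoted leaf names.
def leaf_order(newick):
    """Return leaf names in the order they appear in a Newick tree string.

    A leaf name is a token that directly follows '(' or ','. Internal node
    labels and branch lengths follow ')' or ':' and are therefore skipped.
    """
    return re.findall(r"(?<=[(,])\s*([^\s(),:;]+)", newick)


if __name__ == "__main__":
    # Hypothetical three-species tree with made-up branch lengths.
    tree = "((hg18:0.01,panTro2:0.01):0.02,mm8:0.3);"
    print(" ".join(leaf_order(tree)))
```

The printed list could serve as a starting point for a speciesOrder value, after dropping the reference species if it should not be listed.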


Conservation Scoring with PhastCons (Adam S's phylogenetic HMM)

  • Inputs:
    • Multiple alignment
    • Species tree with branch lengths
    • (optionally two trees)
  • Parameters: rho, expected-len, target-coverage
  • Output:
    • Per-base probability
    • Conserved elements

(Our goal is to get 5% of the genome into conserved elements; the parameters are tweaked until we hit this.)
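One way to check a parameter setting against the 5% goal is to measure how much of the genome the resulting conserved elements cover. The Python sketch below assumes the elements are available as BED-style (chrom, start, end) tuples with half-open coordinates; the function name and the toy numbers are hypothetical.

```python
# Sketch only: elements are BED-style (chrom, start, end) tuples, half-open.
def covered_fraction(elements, genome_size):
    """Fraction of the genome covered by elements, counting overlaps once."""
    by_chrom = {}
    for chrom, start, end in elements:
        by_chrom.setdefault(chrom, []).append((start, end))

    covered = 0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping or adjacent element: extend the current run.
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start
                cur_start, cur_end = start, end
        covered += cur_end - cur_start
    return covered / genome_size


if __name__ == "__main__":
    toy_elements = [("chr1", 0, 30), ("chr1", 20, 40), ("chr2", 0, 10)]
    fraction = covered_fraction(toy_elements, genome_size=1000)
    print(f"{fraction:.1%} of the toy genome is in conserved elements")
```

If the fraction comes out well above or below the target, the phastCons parameters would be adjusted and the run repeated.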


Multiple Alignment Summary and Annotations

  • Gap Annotation (mafAddIRows)
  • Summary table (hgLoadMafSummary)
  • Coding frames (getFrames, etc.)