Chains Nets


Revision as of 23:02, 7 April 2006

Chains and nets are Jim Kent's brainchild, published here: [[1]]

They used to be generated by a long manual process documented in some of our older make*.doc files, but are now generated by the script kent/src/utils/doBlastzChainNet.pl.

Here are some musings on chains and nets -- these are from Angie's mental model of chains and nets and represent opinions which may be outdated or plain old incorrect. The source code, and the results that we get by running these programs on real data, are the ultimate source of truth about chains and nets.

chains in a nutshell:

  • a chain is a sequence of gapless aligned blocks, where no two blocks within the chain may overlap in target or query coords. Within a chain, target and query coords are monotonically non-decreasing (i.e. always increasing or flat).
  • double-sided gaps (a gap in both target and query at the same point) are a new capability (blastz can't do that) that allows extremely long chains to be constructed.
  • not just orthologs, but paralogs too, can result in good chains. But that's useful!
  • chains should be symmetrical: e.g. swap human-mouse chains into mouse-human chains, and you should get approx. the same chains as if you chain swapped mouse-human blastz alignments.
  • chained blastz alignments are not single-coverage in either target or query unless some subsequent filtering (like netting) is done.
  • chain tracks can contain massive pileups when a piece of the target aligns well to many places in the query. Common causes of this include insufficient masking of repeats and high-copy-number genes (or paralogs).
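
The block rules above can be made concrete with a small sketch. It assumes the usual chain-file layout, in which each data line gives a gapless block size followed by the gap to the next block in target (dt) and query (dq), and the last line gives only a size. The function names are made up for illustration and are not part of the kent source tree, and the swap ignores strand handling, which a real chain swap has to deal with.

```python
from typing import List, Sequence, Tuple

# One gapless aligned block: (target start, query start, size).
Block = Tuple[int, int, int]


def blocks_from_chain_lines(t_start: int, q_start: int,
                            lines: Sequence[Sequence[int]]) -> List[Block]:
    """Turn (size, dt, dq) lines into absolute blocks.

    The final line is expected to hold only a size (no trailing gap).
    """
    blocks = []
    t, q = t_start, q_start
    for line in lines:
        size = line[0]
        blocks.append((t, q, size))
        t += size
        q += size
        if len(line) == 3:
            # dt and dq may both be non-zero: a double-sided gap.
            t += line[1]
            q += line[2]
    return blocks


def is_valid_chain(blocks: Sequence[Block]) -> bool:
    """Check the nutshell rules: non-empty blocks, no overlap in target
    or query, and coordinates that never go backwards."""
    if any(size <= 0 for _, _, size in blocks):
        return False
    for (t0, q0, s0), (t1, q1, _) in zip(blocks, blocks[1:]):
        if t1 < t0 + s0 or q1 < q0 + s0:
            return False
    return True


def swap_blocks(blocks: Sequence[Block]) -> List[Block]:
    """Exchange target and query (toy version: both strands assumed '+')."""
    return [(q, t, size) for t, q, size in blocks]


# A chain with a single-sided gap (5, 0) and a double-sided gap (3, 7).
example = blocks_from_chain_lines(100, 200, [(10, 5, 0), (20, 3, 7), (15,)])
swapped = swap_blocks(example)
```

Swapping the blocks twice returns the original list, which is the toy version of the symmetry noted above; a swapped chain passes the same validity check because the rules treat target and query alike.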

And nets:

  • a net is a hierarchical collection of chains, with the highest-scoring non-overlapping chains on top, and their gaps filled in where possible by lower-scoring chains, for several levels. I think a chain's qName also helps to determine which level it lands in, i.e. it makes a difference whether a chain's qName is the same as the top-level chain's qName or not, because the levels have meanings associated with them (see details page).
  • a net is single-coverage for target but not for query.
  • because it's single-coverage in the target, it's no longer symmetrical.
  • the netter has two outputs, one of which we usually ignore: the target-centric net in query coordinates. The reciprocal-best process uses that output: the query-referenced (but target-centric / target single-cov) net is turned back into component chains, and then those are netted to get single coverage in the query too; the two outputs of that netting are reciprocal-best in query and target coords. Reciprocal-best nets are symmetrical again.
  • nets do a good job of filtering out massive pileups by collapsing them down to (usually) a single level.
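
To illustrate why a net is single-coverage in the target, here is a toy greedy sketch. It is not the real netter's algorithm: it ignores levels, qName and query coordinates entirely, and the names `Chain`, `subtract`, `merge` and `toy_net` are invented for this example. Chains are taken in descending score order, and each one keeps only the target bases that no higher-scoring chain has already claimed.

```python
from typing import List, NamedTuple, Tuple

Interval = Tuple[int, int]  # half-open target interval [start, end)


class Chain(NamedTuple):
    chain_id: str
    score: int
    t_blocks: List[Interval]  # aligned blocks in target coordinates


def subtract(interval: Interval, covered: List[Interval]) -> List[Interval]:
    """Parts of `interval` not in `covered` (which is sorted and disjoint)."""
    start, end = interval
    pieces = []
    for c_start, c_end in covered:
        if c_end <= start or c_start >= end:
            continue  # no overlap with this covered interval
        if c_start > start:
            pieces.append((start, c_start))
        start = max(start, c_end)
        if start >= end:
            break
    if start < end:
        pieces.append((start, end))
    return pieces


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and join any that touch or overlap."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))
        else:
            merged.append((start, end))
    return merged


def toy_net(chains: List[Chain]) -> List[Tuple[int, int, str]]:
    """Assign each target base to at most one chain, best score first."""
    covered: List[Interval] = []
    net = []
    for chain in sorted(chains, key=lambda c: c.score, reverse=True):
        for block in chain.t_blocks:
            pieces = subtract(block, covered)
            net.extend((s, e, chain.chain_id) for s, e in pieces)
            covered = merge(covered + pieces)
    return sorted(net)


chains = [
    Chain("A", 1000, [(0, 40), (60, 100)]),   # top chain with a gap at 40-60
    Chain("B", 500, [(45, 55), (90, 120)]),   # partly fills the gap
    Chain("C", 100, [(10, 30)]),              # pileup on top of A, dropped
]
net = toy_net(chains)
```

In this example chain B lands in A's gap and is trimmed where it runs into A, while chain C disappears completely because A already covers its target range. That is the pileup-collapsing effect in miniature.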

Navigation: back to Implementation_Notes