Chains Nets: Difference between revisions

m (Added Category:Comparative Genomics)
(Split into sections, wrote an intro and basic definitions. The Nets section needs more work, but at least this is an improvement.)

Revision as of 19:10, 16 April 2015

Chains and nets are higher-level collections of basic pairwise sequence alignments. Cross-species nets are used to make a single-coverage (on the reference genome) collection of pairwise alignments that is the basis of our Multiz multi-species alignments in the Conservation track. The chain and net algorithms, as well as results from human-mouse alignments, were published in 2003 (http://www.pnas.org/cgi/content/full/100/20/11484). Chains and nets are generated from genomic local alignments computed by Blastz (2002-2008) or Lastz (2008-), post-processed by a series of UCSC programs, most notably axtChain, chainNet and netFilter.

The contents of this page are from Angie's mental model of chains and nets and represent opinions which may be outdated or plain old incorrect. The source code, and the results that we get by running these programs on real data, are the ultimate source of truth about chains and nets.

Please keep in mind that the outputs of any alignment algorithm are not the final Truth about homology between sequences. The scoring system and other parameters of any alignment algorithm are designed to produce high scores for similarities that would likely result from some model of nucleotide-level evolution; tweaking a parameter can change the results significantly. The quality and completeness of the reference assemblies also affect alignment results. That said, chains and nets are powerful constructs for identifying similarities over very large regions of the genome, and inferring chromosomal rearrangements that may have occurred as the two sequences diverged from a common ancestral sequence.

Basic definitions

In chain and net lingo, the target is the reference genome sequence and the query is some other genome sequence. For example, if you are viewing Human-Mouse alignments in the Human genome browser, human is the target and mouse is the query.

A gapless block is a base-for-base alignment between part of the target and part of the query, possibly including mismatching bases. It has the same length in bases on the target and the query. This is the output of the most primitive alignment algorithms.

A gap is a link between two gapless blocks, indicating that the target, the query, or both have sequence that should be skipped over in order to make the best-scoring alignment. In other words, the scoring penalty for skipping over one or more bases is less than the penalty for continuing to align the sequences without skipping.

A single-sided gap is a gap in which sequence is skipped over in only one of the target or the query. A plausible explanation for needing to skip over a base in the target while not skipping a base in the query is that either the target has an inserted base or the query has a deleted base. Many alignment tools produce alignments with single-sided gaps between gapless blocks.

A double-sided gap skips over sequence in both target and query because the sum of penalties for mismatching bases exceeds the penalty for extending a gap across them. This is possible only when the penalty for extending a gap is less than the penalty for creating a new gap and less than the penalty for a mismatch, and when the alignment algorithm is capable of considering double-sided gaps.
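
As a rough illustration of these definitions, the sketch below classifies the gaps that follow each gapless block. It assumes the blocks are given the way the data lines of a UCSC chain file give them: a gapless block size, then dt (target bases skipped before the next block) and dq (query bases skipped), with the last block listed as a size only. The function name and the sample numbers are made up for this example.

```python
def classify_gaps(block_lines):
    """Classify the gap that follows each gapless block.

    block_lines: list of tuples. Every block except the last is
    (size, dt, dq); the last block has no following gap and is (size,).
    """
    kinds = []
    for line in block_lines:
        if len(line) == 1:
            break  # last block of the chain: nothing follows it
        _size, dt, dq = line
        if dt > 0 and dq > 0:
            kinds.append("double-sided")   # sequence skipped on both sides
        elif dt > 0 or dq > 0:
            kinds.append("single-sided")   # sequence skipped on one side only
        else:
            kinds.append("none")           # blocks are adjacent on both sides
    return kinds


# Hypothetical chain with four gapless blocks and three gaps.
example = [(9, 1, 0), (10, 0, 5), (61, 4, 113), (16,)]
print(classify_gaps(example))
```

The first two gaps skip sequence on one side only, while the third skips 4 target bases and 113 query bases at once, which is the double-sided case.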

Chains in a nutshell

A chain is a sequence of non-overlapping gapless blocks, with single- or double-sided gaps between blocks. Within a chain, target and query coords are monotonically non-decreasing (i.e. always increasing or flat). Chains are constructed by the axtChain program, which finds pairwise alignments with the same target and query sequence, on the same strand, that can be merged if overlapping and joined into one longer alignment with a higher score under an affine gap-scoring system (progressively decreasing per-base penalties for longer gaps).
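
The ordering rule can be sketched as a small check. This is not how axtChain works internally; it only tests whether a list of gapless blocks satisfies the constraints stated above. The Block type and its field names are assumptions made for this example, and all coordinates are taken to be on the chain's strand.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    t_start: int  # start in target
    q_start: int  # start in query
    size: int     # gapless, so the same length on both sides


def is_valid_chain(blocks: List[Block]) -> bool:
    """Return True if blocks are non-overlapping and ordered in both sequences."""
    for prev, cur in zip(blocks, blocks[1:]):
        # The next block may start exactly where the previous one ends
        # (no gap on that side), but never before that point.
        if cur.t_start < prev.t_start + prev.size:
            return False
        if cur.q_start < prev.q_start + prev.size:
            return False
    return all(b.size > 0 for b in blocks)


good = [Block(0, 100, 50), Block(50, 170, 30), Block(95, 200, 10)]
bad = [Block(0, 100, 50), Block(40, 170, 30)]  # overlaps in the target
print(is_valid_chain(good), is_valid_chain(bad))
```

In the first list, each pair of neighbouring blocks is separated by a single-sided gap: one side stays "flat" while the other advances.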

  • double-sided gaps are a new capability (blastz can't do that) that allow extremely long chains to be constructed.
  • not just orthologs, but paralogs too, can result in good chains, and that's useful!
  • chains should be symmetrical -- e.g. swap human-mouse chains to get mouse-human chains, and you should get approx. the same chains as if you had swapped the blastz alignments to mouse-human first and then chained them. However, Blastz's dynamic masking is asymmetrical, so in practice those results are not exactly symmetrical. Also, dynamic masking in conjunction with changed chunk sizes can cause differences in results from one run to the next.
  • chained blastz alignments are not single-coverage in either target or query unless some subsequent filtering (like netting) is done.
  • chain tracks can contain massive pileups when a piece of the target aligns well to many places in the query. Common causes of this include insufficient masking of repeats and high-copy-number genes (or paralogs).
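
To make "not single-coverage" and "pileup" concrete, here is a minimal sketch that reports how many chains stack up over the most heavily covered part of the target. It looks only at each chain's overall target span (gaps included), and the function name and sample coordinates are invented for the example.

```python
def max_target_depth(chain_spans):
    """Return the largest number of chains covering any one target base.

    chain_spans: list of (t_start, t_end) half-open target intervals,
    one per chain.
    """
    events = []
    for start, end in chain_spans:
        events.append((start, 1))   # a chain starts covering the target
        events.append((end, -1))    # a chain stops covering the target
    # Tuples sort by position first; at equal positions the -1 (end) event
    # sorts before the +1 (start) event, so chains that merely touch are
    # not counted as overlapping.
    events.sort()
    depth = best = 0
    for _pos, delta in events:
        depth += delta
        best = max(best, depth)
    return best


spans = [(0, 1000), (200, 400), (250, 300), (260, 290), (1000, 1500)]
print(max_target_depth(spans))
```

A result greater than 1 means the chains are not single-coverage in the target; a very large value over a small region is the kind of pileup described above.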

Nets in a nutshell

A net is a hierarchical collection of chains, with the highest-scoring non-overlapping chains on top, and their gaps filled in where possible by lower-scoring chains, which in turn may have gaps filled in by lower-level chains and so on.

  • I think a chain's qName also helps to determine which level it lands in, i.e. it makes a difference whether a chain's qName is the same as the top-level chain's qName or not, because the levels have meanings associated with them -- see details page.
  • a net is single-coverage for target but not for query, unless it has been filtered to be single-coverage on both target and query. By convention we add "rbest" to the net filename in that case.
  • because it's single-coverage in the target, it's no longer symmetrical.
  • the netter has two outputs, one of which we usually ignore: the target-centric net in query coordinates. The reciprocal-best process uses that output: the query-referenced (but still target-centric, i.e. single-coverage in the target) net is turned back into its component chains, and those are netted again to get single coverage in the query too. The two outputs of that second netting are the reciprocal-best nets in query and target coords. Reciprocal-best nets are symmetrical again.
  • nets do a good job of filtering out massive pileups by collapsing them down to (usually) a single level.
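
The hierarchy can be illustrated with a deliberately simplified greedy sketch. It is not the chainNet algorithm: it places whole chains only (as far as I know the real program can use just the part of a chain that fits), it ignores qName and the query side entirely, and the Chain and Fill classes, the level numbering and the sample data are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Chain:
    name: str
    score: int
    t_start: int
    t_end: int
    gaps: List[Tuple[int, int]] = field(default_factory=list)  # target gaps


@dataclass
class Fill:
    chain: Chain
    level: int
    children: List["Fill"] = field(default_factory=list)


def _try_place(chain, siblings, level):
    """Place chain beside non-overlapping siblings, or inside a sibling's gap."""
    for fill in siblings:
        c = fill.chain
        if chain.t_end <= c.t_start or chain.t_start >= c.t_end:
            continue  # no target overlap with this higher-scoring chain
        # Overlap: acceptable only if the chain fits inside one of its gaps.
        for g_lo, g_hi in c.gaps:
            if g_lo <= chain.t_start and chain.t_end <= g_hi:
                return _try_place(chain, fill.children, level + 1)
        return False  # simplified: the chain is dropped rather than trimmed
    siblings.append(Fill(chain, level))
    return True


def build_net(chains):
    """Greedily build a net, considering the highest-scoring chains first."""
    top = []
    for chain in sorted(chains, key=lambda c: c.score, reverse=True):
        _try_place(chain, top, 1)
    return top


def flatten(fills):
    """List (level, name) pairs in target order, parents before children."""
    out = []
    for f in sorted(fills, key=lambda f: f.chain.t_start):
        out.append((f.level, f.chain.name))
        out.extend(flatten(f.children))
    return out


chains = [
    Chain("A", 1000, 0, 1000, gaps=[(300, 500)]),
    Chain("B", 500, 320, 480, gaps=[(400, 420)]),
    Chain("C", 200, 402, 418),
    Chain("D", 300, 900, 1100),   # overlaps A outside any gap: dropped here
    Chain("E", 100, 1200, 1500),
]
print(flatten(build_net(chains)))
```

In this toy input, A and E end up on the top level, B fills a gap in A, and C fills a gap in B. Every target base is covered by at most one gapless block, which is the sense in which the result is single-coverage in the target.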

"LiftOver chains" are actually chains extracted from nets, or chains filtered by the netting process.

History

Chains and nets are Jim Kent's brainchild, building on joint work with blastz author Scott Schwartz.

Cross-species chains and nets used to be generated by a long manual process documented in some of our older makeDb/doc/*.txt files, but since ~2006 they have been generated by the script kent/src/hg/utils/automation/doBlastzChainNet.pl.

Same-species liftOver chains use blat -fastMap as the alignment method, and are generated by kent/src/hg/utils/automation/doSameSpeciesLiftOver.pl, based on a series of scripts that Kate wrote in kent/src/hg/makeDb/makeLoChain/.

