Chains Nets: Difference between revisions

From genomewiki
Jump to navigationJump to search
(Split into sections, wrote an intro and basic definitions. The Nets section needs more work, but at least this is an improvement.)
No edit summary
 
(2 intermediate revisions by the same user not shown)
Line 26: Line 26:
* chained blastz alignments are not single-coverage in either target or query unless some subsequent filtering (like netting) is done.   
* chained blastz alignments are not single-coverage in either target or query unless some subsequent filtering (like netting) is done.   
* chain tracks can contain massive pileups when a piece of the target aligns well to many places in the query.  Common causes of this include insufficient masking of repeats and high-copy-number genes (or paralogs).
* chain tracks can contain massive pileups when a piece of the target aligns well to many places in the query.  Common causes of this include insufficient masking of repeats and high-copy-number genes (or paralogs).
== chains and the strand ==
Chains are unusual in that the position is specified from the end of the chromosome, rather than the start, when they're on the negative strand. This has led to a lot of grief over the years and confuses most programmers when they start coding with chain files.
Also, the "chain" database tables are not exactly like the chain file. Brian in #19186: In our databases, chains are assumed to be on the + strand on the target/reference, that is, we don't save the target strand in the database schemas for chains. The chain file format, which is something different than the schemas we use for the TWO tables that make up each chain track, explicitly stores the target strand so it could theoretically be -.
In both formats the blocks that make up a single chain are defined to have the same target and query strand.


== Nets in a nutshell ==
== Nets in a nutshell ==
Line 47: Line 55:
Same-species liftOver chains use blat -fastMap as the alignment method, and are generated by kent/src/hg/utils/automation/doSameSpeciesLiftOver.pl, based on a series of scripts that [[User:Kate|Kate]] wrote in kent/src/hg/makeDb/makeLoChain/.
Same-species liftOver chains use blat -fastMap as the alignment method, and are generated by kent/src/hg/utils/automation/doSameSpeciesLiftOver.pl, based on a series of scripts that [[User:Kate|Kate]] wrote in kent/src/hg/makeDb/makeLoChain/.


== FAQs? ==
The original publication describing chains and net, Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes (Kent et al., PNAS September 30, 2003 100 (20) 11484-11489), might be helpful for understanding the rationale behind the process.
> It’s somehow opposite with the alignment file, right? For example, the psl file records info of sequences from query assembly mapping agains sequences
> from target assembly.
The PSL format lists query coords before target coords, but the UCSC convention is that the target is the reference genome (on which the Genome Browser displays) and the query is the 'other' (the query could be mRNA from the same species as the target/reference genome, or could be another species). The PSL format is the native output format of BLAT, which differs from BLAST in that BLAT indexes the target and scans the query while BLAST indexes the query and scans the target. Those are pretty esoteric differences, but I think they help explain why the UCSC view of target and query might be the opposite of some others.
> And then collapse the chains to one.
The chains are not collapsed to one; that would violate the constraint on a chain that it has monotonically increasing coordinates on the same target and query sequences in the same orientation. In the net, the two chains are retained at different levels (the primary alignment is at the top level and the secondary alignment is at a lower level). When the netChainSubset program extracts the liftOver chains from the net and full set of chains, it outputs the complete primary chain, but it outputs only a portion of the secondary chain.
> I don’t understand why use the secondary alignment to fill the gap in the primary alignment.
Chains and nets were designed to capture medium-to-large-scale rearrangements during evolution of species from a common ancestor: duplications, inversions, translocations. For example, in the case of an inversion, we would expect a gap in the top-level alignment to coincide with an alignment to the opposite strand of the same sequence (and similar breakpoints). With a translocation, a gap in the top-level alignment might be "filled" by an alignment to some other chromosome. When there is a duplication in the target, the target might have two (diverged) copies of the ancestral sequence that align to the same un-duplicated location in the query. In that case, the top-level chain would have a gap that is filled by an alignment to the same location on the query (single-coverage on the target, but multiple coverage on the query).
> I suppose chain should reflect the true difference between two assembly. Say the contig a is actually corresponding to the primary hit region in
> hg19. Here if the gap is filled as described above to generate a chain, wouldn’t that cause the gapped bases from hg19 being lifted to
> a false corresponding region in contig?
All bioinformatics algorithms are attempts to approximate the truth, and they fall short, especially when their assumptions and parameters are not tuned exactly right for the question at hand. If alignment parameters are overly sensitive for the actual divergence of target and query, then yes, spurious alignments will probably appear in the results. But there may be other explanations for unexpected alignments. (Assembly errors, sequencing errors, unexpected variation, some new discovery for you to make....) There is no simple answer that applies to all situations, and no substitute for trying different parameters and methods and carefully examining the results to see what works best for your particular application. It may help to make custom tracks using our bigChain format so you can examine and compare results in the Genome Browser.
> I’ve recently discovered that we can use minimap2 to generate alignment file and convert to psl and use that psl to generate chain file. If I can
> solve the multiple-mapping issue in the alignment file level, I don’t need to perform netChainSubset right?
Perhaps -- that is up to you to determine. Best wishes for your research! If you know results of an evaluation using minimap2 versus lastz, please contact us at genome@soe.ucsc.edu.


Navigation: back to [[Implementation_Notes]]
Navigation: back to [[Implementation_Notes]]

Latest revision as of 13:32, 18 September 2023

Chains and nets are higher-level collections of basic pairwise sequence alignments. Cross-species nets are used to make a single-coverage (on the reference genome) collection of pairwise alignments that are the bases of our Multiz multi-species alignments in the Conservation track. The chain and net algorithms, as well as results from human-mouse alignments, were [published] in 2002. They are generated from genomic local alignments computed by Blastz (2002-2008) or Lastz (2008-) post-processed by a series of UCSC programs, most notably axtChain, chainNet and netFilter.

The contents of this page are from Angie's mental model of chains and nets and represent opinions which may be outdated or plain old incorrect. The source code, and the results that we get by running these programs on real data, are the ultimate source of truth about chains and nets.

Please keep in mind that the outputs of any alignment algorithm are not the final Truth about homology between sequences. The scoring system and other parameters of any alignment algorithm are designed to produce high scores for similarities that would likely result from some model of nucleotide-level evolution; tweaking a parameter can change the results significantly. The quality and completeness of the reference assemblies also affect alignment results. That said, chains and nets are powerful constructs for identifying similarities over very large regions of the genome, and inferring chromosomal rearrangements that may have occurred as the two sequences diverged from a common ancestral sequence.

Basic definitions

In chain and net lingo, the target is the reference genome sequence and the query is some other genome sequence. For example, if you are viewing Human-Mouse alignments in the Human genome browser, human is the target and mouse is the query.

A gapless block is a base-for-base alignment between part of the target and part of the query, possibly including mismatching bases. It has the same length in bases on the target and the query. This is the output of the most primitive alignment algorithms.

A gap is a link between two gapless blocks, indicating that the target or the query has sequence that should be skipped over in order to make the best-scoring alignment. In other words, the scoring penalty for skipping over one or more bases is less than the penalty for continuing to align the sequences without skipping.

A single-sided gap is a gap in which sequence in either target or query must be skipped over. A plausible explanation for needing to skip over a base in the target while not skipping a base in the query is that either the target has an inserted base or the query has a deleted base. Many alignment tools produce alignments with single-sided gaps between gapless blocks.

A double-sided gap skips over sequence in both target and query because the sum of penalties for mismatching bases exceeds the penalty for extending a gap across them. This is possible only when the penalty for extending a gap is less than the penalty for creating a new gap and less than the penalty for a mismatch, and when the alignment algorithm is capable of considering double-sided gaps.

Chains in a nutshell

A chain is a sequence of non-overlapping gapless blocks, with single- or double-sided gaps between blocks. Within a chain, target and query coords are monotonically non-decreasing (i.e. always increasing or flat). Chains are constructed by the axtChain program which finds pairwise alignments with the same target and query sequence, on the same strand, that can be merged if overlapping and joined into one longer alignment with a higher score under an affine gap-scoring system (progressively decreasing penalties for longer gaps).

  • double-sided gaps are a new capability (blastz can't do that) that allow extremely long chains to be constructed.
  • not just orthologs, but paralogs too, can result in good chains. but that's useful!
  • chains should be symmetrical -- e.g. swap human-mouse -> mouse-human chains, and you should get approx. the same chains as if you chain swapped mouse-human blastz alignments. However, Blastz's dynamic masking is asymmetrical, so in practice those results are not exactly symmetrical. Also, dynamic masking in conjunction with changed chunk sizes can cause differences in results from one run to the next.
  • chained blastz alignments are not single-coverage in either target or query unless some subsequent filtering (like netting) is done.
  • chain tracks can contain massive pileups when a piece of the target aligns well to many places in the query. Common causes of this include insufficient masking of repeats and high-copy-number genes (or paralogs).

chains and the strand

Chains are unusual in that the position is specified from the end of the chromosome, rather than the start, when they're on the negative strand. This has led to a lot of grief over the years and confuses most programmers when they start coding with chain files.

Also, the "chain" database tables are not exactly like the chain file. Brian in #19186: In our databases, chains are assumed to be on the + strand on the target/reference, that is, we don't save the target strand in the database schemas for chains. The chain file format, which is something different than the schemas we use for the TWO tables that make up each chain track, explicitly stores the target strand so it could theoretically be -.

In both formats the blocks that make up a single chain are defined to have the same target and query strand.

Nets in a nutshell

A net is a hierarchical collection of chains, with the highest-scoring non-overlapping chains on top, and their gaps filled in where possible by lower-scoring chains, which in turn may have gaps filled in by lower-level chains and so on.

  • I think a chain's qName also helps to determine which level it lands in, i.e. it makes a difference whether a chain's qName is the same as the top-level chain's qName or not, because the levels have meanings associated with them -- see details page.
  • a net is single-coverage for target but not for query, unless it has been filtered to be single-coverage on both target and query. By convention we add "rbest" to the net filename in that case.
  • because it's single-coverage in the target, it's no longer symmetrical.
  • the netter has two outputs, one of which we usually ignore: the target-centric net in query coordinates. The reciprocal best process uses that output: the query-referenced (but target-centric / target single-cov) net is turned back into component chains, and then those are netted to get single coverage in the query too; the two outputs of that netting are reciprocal-best in query and target coords. Reciprocal-best nets are symmetrical again.
  • nets do a good job of filtering out massive pileups by collapsing them down to (usually) a single level.

"LiftOver chains" are actually chains extracted from nets, or chains filtered by the netting process.

History

Chains and nets are Jim Kent's brainchild, building on joint work with blastz author Scott Schwartz.

Cross-species chains and nets used to be generated by a long manual process documented in some of our older makeDb/doc/*.txt files, but since ~2006 they have been generated by the script kent/src/hg/utils/automation/doBlastzChainNet.pl .

Same-species liftOver chains use blat -fastMap as the alignment method, and are generated by kent/src/hg/utils/automation/doSameSpeciesLiftOver.pl, based on a series of scripts that Kate wrote in kent/src/hg/makeDb/makeLoChain/.

FAQs?

The original publication describing chains and net, Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes (Kent et al., PNAS September 30, 2003 100 (20) 11484-11489), might be helpful for understanding the rationale behind the process.

> It’s somehow opposite with the alignment file, right? For example, the psl file records info of sequences from query assembly mapping agains sequences > from target assembly.

The PSL format lists query coords before target coords, but the UCSC convention is that the target is the reference genome (on which the Genome Browser displays) and the query is the 'other' (the query could be mRNA from the same species as the target/reference genome, or could be another species). The PSL format is the native output format of BLAT, which differs from BLAST in that BLAT indexes the target and scans the query while BLAST indexes the query and scans the target. Those are pretty esoteric differences, but I think they help explain why the UCSC view of target and query might be the opposite of some others.

> And then collapse the chains to one.

The chains are not collapsed to one; that would violate the constraint on a chain that it has monotonically increasing coordinates on the same target and query sequences in the same orientation. In the net, the two chains are retained at different levels (the primary alignment is at the top level and the secondary alignment is at a lower level). When the netChainSubset program extracts the liftOver chains from the net and full set of chains, it outputs the complete primary chain, but it outputs only a portion of the secondary chain.

> I don’t understand why use the secondary alignment to fill the gap in the primary alignment.

Chains and nets were designed to capture medium-to-large-scale rearrangements during evolution of species from a common ancestor: duplications, inversions, translocations. For example, in the case of an inversion, we would expect a gap in the top-level alignment to coincide with an alignment to the opposite strand of the same sequence (and similar breakpoints). With a translocation, a gap in the top-level alignment might be "filled" by an alignment to some other chromosome. When there is a duplication in the target, the target might have two (diverged) copies of the ancestral sequence that align to the same un-duplicated location in the query. In that case, the top-level chain would have a gap that is filled by an alignment to the same location on the query (single-coverage on the target, but multiple coverage on the query).

> I suppose chain should reflect the true difference between two assembly. Say the contig a is actually corresponding to the primary hit region in > hg19. Here if the gap is filled as described above to generate a chain, wouldn’t that cause the gapped bases from hg19 being lifted to > a false corresponding region in contig?

All bioinformatics algorithms are attempts to approximate the truth, and they fall short, especially when their assumptions and parameters are not tuned exactly right for the question at hand. If alignment parameters are overly sensitive for the actual divergence of target and query, then yes, spurious alignments will probably appear in the results. But there may be other explanations for unexpected alignments. (Assembly errors, sequencing errors, unexpected variation, some new discovery for you to make....) There is no simple answer that applies to all situations, and no substitute for trying different parameters and methods and carefully examining the results to see what works best for your particular application. It may help to make custom tracks using our bigChain format so you can examine and compare results in the Genome Browser.

> I’ve recently discovered that we can use minimap2 to generate alignment file and convert to psl and use that psl to generate chain file. If I can > solve the multiple-mapping issue in the alignment file level, I don’t need to perform netChainSubset right?

Perhaps -- that is up to you to determine. Best wishes for your research! If you know results of an evaluation using minimap2 versus lastz, please contact us at genome@soe.ucsc.edu.

Navigation: back to Implementation_Notes