Chains Nets

From genomewiki
Jump to navigationJump to search

Chains and nets are higher-level collections of basic pairwise sequence alignments. Cross-species nets are used to make a single-coverage (on the reference genome) collection of pairwise alignments that are the bases of our Multiz multi-species alignments in the Conservation track. The chain and net algorithms, as well as results from human-mouse alignments, were [published] in 2002. They are generated from genomic local alignments computed by Blastz (2002-2008) or Lastz (2008-) post-processed by a series of UCSC programs, most notably axtChain, chainNet and netFilter.

The contents of this page are from Angie's mental model of chains and nets and represent opinions which may be outdated or plain old incorrect. The source code, and the results that we get by running these programs on real data, are the ultimate source of truth about chains and nets.

Please keep in mind that the outputs of any alignment algorithm are not the final Truth about homology between sequences. The scoring system and other parameters of any alignment algorithm are designed to produce high scores for similarities that would likely result from some model of nucleotide-level evolution; tweaking a parameter can change the results significantly. The quality and completeness of the reference assemblies also affect alignment results. That said, chains and nets are powerful constructs for identifying similarities over very large regions of the genome, and inferring chromosomal rearrangements that may have occurred as the two sequences diverged from a common ancestral sequence.

Basic definitions

In chain and net lingo, the target is the reference genome sequence and the query is some other genome sequence. For example, if you are viewing Human-Mouse alignments in the Human genome browser, human is the target and mouse is the query.

A gapless block is a base-for-base alignment between part of the target and part of the query, possibly including mismatching bases. It has the same length in bases on the target and the query. This is the output of the most primitive alignment algorithms.

A gap is a link between two gapless blocks, indicating that the target or the query has sequence that should be skipped over in order to make the best-scoring alignment. In other words, the scoring penalty for skipping over one or more bases is less than the penalty for continuing to align the sequences without skipping.

A single-sided gap is a gap in which sequence in either target or query must be skipped over. A plausible explanation for needing to skip over a base in the target while not skipping a base in the query is that either the target has an inserted base or the query has a deleted base. Many alignment tools produce alignments with single-sided gaps between gapless blocks.

A double-sided gap skips over sequence in both target and query because the sum of penalties for mismatching bases exceeds the penalty for extending a gap across them. This is possible only when the penalty for extending a gap is less than the penalty for creating a new gap and less than the penalty for a mismatch, and when the alignment algorithm is capable of considering double-sided gaps.

Chains in a nutshell

A chain is a sequence of non-overlapping gapless blocks, with single- or double-sided gaps between blocks. Within a chain, target and query coords are monotonically non-decreasing (i.e. always increasing or flat). Chains are constructed by the axtChain program which finds pairwise alignments with the same target and query sequence, on the same strand, that can be merged if overlapping and joined into one longer alignment with a higher score under an affine gap-scoring system (progressively decreasing penalties for longer gaps).

  • double-sided gaps are a new capability (blastz can't do that) that allow extremely long chains to be constructed.
  • not just orthologs, but paralogs too, can result in good chains. but that's useful!
  • chains should be symmetrical -- e.g. swap human-mouse -> mouse-human chains, and you should get approx. the same chains as if you chain swapped mouse-human blastz alignments. However, Blastz's dynamic masking is asymmetrical, so in practice those results are not exactly symmetrical. Also, dynamic masking in conjunction with changed chunk sizes can cause differences in results from one run to the next.
  • chained blastz alignments are not single-coverage in either target or query unless some subsequent filtering (like netting) is done.
  • chain tracks can contain massive pileups when a piece of the target aligns well to many places in the query. Common causes of this include insufficient masking of repeats and high-copy-number genes (or paralogs).

chains and the strand

Chains are unusual in that the position is specified from the end of the chromosome, rather than the start, when they're on the negative strand. This has led to a lot of grief over the years and confuses most programmers when they start coding with chain files.

Also, the "chain" database tables are not exactly like the chain file. Brian in #19186: In our databases, chains are assumed to be on the + strand on the target/reference, that is, we don't save the target strand in the database schemas for chains. The chain file format, which is something different than the schemas we use for the TWO tables that make up each chain track, explicitly stores the target strand so it could theoretically be -.

In both formats the blocks that make up a single chain are defined to have the same target and query strand.

Nets in a nutshell

A net is a hierarchical collection of chains, with the highest-scoring non-overlapping chains on top, and their gaps filled in where possible by lower-scoring chains, which in turn may have gaps filled in by lower-level chains and so on.

  • I think a chain's qName also helps to determine which level it lands in, i.e. it makes a difference whether a chain's qName is the same as the top-level chain's qName or not, because the levels have meanings associated with them -- see details page.
  • a net is single-coverage for target but not for query, unless it has been filtered to be single-coverage on both target and query. By convention we add "rbest" to the net filename in that case.
  • because it's single-coverage in the target, it's no longer symmetrical.
  • the netter has two outputs, one of which we usually ignore: the target-centric net in query coordinates. The reciprocal best process uses that output: the query-referenced (but target-centric / target single-cov) net is turned back into component chains, and then those are netted to get single coverage in the query too; the two outputs of that netting are reciprocal-best in query and target coords. Reciprocal-best nets are symmetrical again.
  • nets do a good job of filtering out massive pileups by collapsing them down to (usually) a single level.

"LiftOver chains" are actually chains extracted from nets, or chains filtered by the netting process.

Net construction example

> If customAssembly has two contigs a b and both have two alignments (1 primary and 1 secondary) against hg19, will the net be a collection of two chains? Like the first chain is two primary alignments of contig a and contig b, respectively. And the second chain is two secondary alignments of contig a and contig b?

The net is constructed to be single-coverage on the target assembly. First, the highest-scoring chain for a target sequence is added to the net; whatever parts of the target sequence are covered by that top-scoring chain cannot be covered by any other chain. Then, the next highest-scoring chain for that target sequence is selected. If it falls before or after the highest-scoring chain on the target sequence, or if it falls entirely within a gap in the highest-scoring chain, then it is added to the net without any modifications. However, if it overlaps the highest-scoring chain on the target sequence, then the overlapping part is removed from it, and if the remaining part of the chain is high-scoring enough then the remaining part is added. Then that process repeats with the next highest-scoring chain, and so on. When a chain is added, any part of the chain that overlaps any part of any chain already in the net along the target sequence is first removed.

If contig a and contig b are separate sequences in the custom assembly, then they will not form a single chain, even if they align to adjacent regions of hg19. There will be one chain on contig a and another chain on contig b, and separate nets will be constructed on a and b. So, in this example there will be four chains:

contig a <--> hg19 primary contig a <--> hg19 secondary contig b <--> hg19 primary contig b <--> hg19 secondary

Assuming the primary hg19 chains have higher scores than the secondary hg19 chains, the primary chains will form the top level of the nets for contig a and contig b. If there are gaps in the primary chains along contig a or b, but the secondary chains align to those regions in a or b, then parts of the secondary chains might be included in a lower level of the net. For example, if the chains look like this:

contig a : hg19 primary     XXXXXXXXXXX-------XXXXXXXXXXXXXXXXXXXXXXX
contig a : hg19 secondary          XXXXXXXX------XXXXXXXXXXXXXXXXXXXXX

Then the contig a net will look like this if the part of the secondary chain that fills the gap in the primary chain has a score >= the -minScore option of chainNet (default 2000):

contig a net top level      XXXXXXXXXXX-------XXXXXXXXXXXXXXXXXXXXXXX
contig a net second level              XXXX

Then the liftOver chains extracted from the net would include the hg19 primary chain, and only the little gap-filling piece of the secondary chain.

> What evidence is used to collapse the sequence alignment in one direction? Simply by comparing the alignment score?

Yes, chainNet compares the scores computed by axtChain according to its -scoreScheme and -linearGap parameters. When chainNet removes parts of a chain that overlaps parts of the target sequence that are already covered by higher-scoring chains in the net, it does not fully recompute the score on the remaining part of the chain (that would be too slow); it scales down the score based on the ratio of remaining aligned bases to original aligned bases.

There are pros and cons to liftOver being single-coverage on the "from"/source assembly vs. single-coverage on the "to"/destination assembly. It helps to draw diagrams of different scenarios (duplication on "from" assembly, duplication on "to" assembly), and consider how the single-coverage restriction on one side or the other would affect your results.

If single-coverage on the "from" assembly is better for your purposes, then you will want to run chainNet on chains that have the "from" assembly as target and "to" assembly as query, and then run netChainSubset on the net and chains to get liftOver chains.

However, if you prefer single-coverage on the "to" assembly, it is a little more complicated. You can run chainNet on chains that have the "to" assembly as target and "from" assembly as query (you can use the chainSwap program to make those if you already have chains with "from" as target). Then run netChainSubset (with "to" as target) to get reverse liftOver chains. Finally, run chainSwap on those chains to get liftOver chains with "from" as target, and run liftOver with the -multiple option to tell it that the chains are not single-coverage on target.

History

Chains and nets are Jim Kent's brainchild, building on joint work with blastz author Scott Schwartz.

Cross-species chains and nets used to be generated by a long manual process documented in some of our older makeDb/doc/*.txt files, but since ~2006 they have been generated by the script kent/src/hg/utils/automation/doBlastzChainNet.pl .

Same-species liftOver chains use blat -fastMap as the alignment method, and are generated by kent/src/hg/utils/automation/doSameSpeciesLiftOver.pl, based on a series of scripts that Kate wrote in kent/src/hg/makeDb/makeLoChain/.

FAQs?

The original publication describing chains and net, Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes (Kent et al., PNAS September 30, 2003 100 (20) 11484-11489), might be helpful for understanding the rationale behind the process.

> It’s somehow opposite with the alignment file, right? For example, the psl file records info of sequences from query assembly mapping agains sequences > from target assembly.

The PSL format lists query coords before target coords, but the UCSC convention is that the target is the reference genome (on which the Genome Browser displays) and the query is the 'other' (the query could be mRNA from the same species as the target/reference genome, or could be another species). The PSL format is the native output format of BLAT, which differs from BLAST in that BLAT indexes the target and scans the query while BLAST indexes the query and scans the target. Those are pretty esoteric differences, but I think they help explain why the UCSC view of target and query might be the opposite of some others.

> And then collapse the chains to one.

The chains are not collapsed to one; that would violate the constraint on a chain that it has monotonically increasing coordinates on the same target and query sequences in the same orientation. In the net, the two chains are retained at different levels (the primary alignment is at the top level and the secondary alignment is at a lower level). When the netChainSubset program extracts the liftOver chains from the net and full set of chains, it outputs the complete primary chain, but it outputs only a portion of the secondary chain.

> I don’t understand why use the secondary alignment to fill the gap in the primary alignment.

Chains and nets were designed to capture medium-to-large-scale rearrangements during evolution of species from a common ancestor: duplications, inversions, translocations. For example, in the case of an inversion, we would expect a gap in the top-level alignment to coincide with an alignment to the opposite strand of the same sequence (and similar breakpoints). With a translocation, a gap in the top-level alignment might be "filled" by an alignment to some other chromosome. When there is a duplication in the target, the target might have two (diverged) copies of the ancestral sequence that align to the same un-duplicated location in the query. In that case, the top-level chain would have a gap that is filled by an alignment to the same location on the query (single-coverage on the target, but multiple coverage on the query).

> I suppose chain should reflect the true difference between two assembly. Say the contig a is actually corresponding to the primary hit region in > hg19. Here if the gap is filled as described above to generate a chain, wouldn’t that cause the gapped bases from hg19 being lifted to > a false corresponding region in contig?

All bioinformatics algorithms are attempts to approximate the truth, and they fall short, especially when their assumptions and parameters are not tuned exactly right for the question at hand. If alignment parameters are overly sensitive for the actual divergence of target and query, then yes, spurious alignments will probably appear in the results. But there may be other explanations for unexpected alignments. (Assembly errors, sequencing errors, unexpected variation, some new discovery for you to make....) There is no simple answer that applies to all situations, and no substitute for trying different parameters and methods and carefully examining the results to see what works best for your particular application. It may help to make custom tracks using our bigChain format so you can examine and compare results in the Genome Browser.

> I’ve recently discovered that we can use minimap2 to generate alignment file and convert to psl and use that psl to generate chain file. If I can > solve the multiple-mapping issue in the alignment file level, I don’t need to perform netChainSubset right?

Perhaps -- that is up to you to determine. Best wishes for your research! If you know results of an evaluation using minimap2 versus lastz, please contact us at genome@soe.ucsc.edu.

Navigation: back to Implementation_Notes