Lastz/chain/net/multiz considerations/caveats/restrictions/limitations

From genomewiki

Introduction

The lastz/chain/net/multiz processing pipeline is the primary alignment procedure used at the U.C. Santa Cruz Genomics Institute to produce multiple alignments. Consumers of the resulting data should take a number of considerations into account, because they can affect conclusions drawn from analysis of these alignments.

lastz

As with any alignment algorithm, the choice of parameters for lastz is critical to the results produced by this alignment program. Typically the parameters chosen fall into three categories based on the phylogenetic distance and/or clade relationship of target and query sequences.

  1. primate to primate alignments (e.g. human<->chimp)
  2. closer phylogenetic relationship (e.g. human<->mouse)
  3. more distant phylogenetic relationship (e.g. human<->fish)

Parameters used for lastz are recorded in the UCSC Multiple Alignments page. See also: lastz documentation.

One specific parameter of interest is --masking=<count> (aka M=<count>), usually set to a count of 254. From the lastz documentation:

 Dynamically mask the target sequence by excluding any positions that appear in too many alignments from further consideration for seeds.

 Specifically, a cumulative count is maintained of the number of times each target location is aligned. After each query sequence and strand is
 processed, any locations that have been output in at least <count> alignment blocks are masked, so they will be excluded from the seeding stage
 for subsequent query sequences. Since repetition discovered while processing one sequence strand is only masked for subsequent sequence
 strands, this option has no effect on the first strand of the first sequence in the query file.

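Therefore, once a target location has been output in 254 or more alignment blocks, it is excluded from seeding for all subsequent query sequences, and later query matches to that location will generally not be included in the output. Because the mask is updated only after each query sequence and strand has been processed, the number of alignments reported for a location can exceed 254 before the mask takes effect.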
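The effect of dynamic masking can be illustrated with a small toy model. This is only a sketch based on the documentation quoted above, not lastz's implementation: the function name, the (seed_pos, start, end) hit tuples, and the rule that a hit is dropped when its seed position is masked are all assumptions made for illustration.

```python
from collections import Counter

def simulate_dynamic_masking(queries, max_count):
    """Toy model of lastz-style dynamic masking (hypothetical helper).

    `queries` is an ordered list; each entry is the list of candidate hits
    for one query sequence/strand, given as (seed_pos, start, end) in
    target coordinates, with `end` exclusive.
    """
    counts = Counter()   # times each target position was output in a block
    masked = set()       # target positions excluded from seeding
    reported = []
    for hits in queries:
        # Seeds that fall on masked target positions are not considered.
        kept = [h for h in hits if h[0] not in masked]
        for _seed_pos, start, end in kept:
            for pos in range(start, end):
                counts[pos] += 1
        reported.append(kept)
        # The mask is updated only after the whole query has been processed.
        masked.update(pos for pos, n in counts.items() if n >= max_count)
    return reported, masked

# A repetitive target region (positions 0-9) hit three times by the first
# query, once by the second, plus an unrelated hit from a third query.
hit = (5, 0, 10)
queries = [[hit, hit, hit], [hit], [(50, 45, 55)]]
reported, masked = simulate_dynamic_masking(queries, max_count=2)
```

In this toy run the first query reports all three hits even though the threshold is 2, the second query reports nothing for the now-masked region, and the unrelated hit from the third query is unaffected.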

chain

Please see the previous Chains Nets discussion.

See also: Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784

net

multiz

The first important point to consider for multiz results is that these alignments are reference-based. Only alignments between each query sequence and the target (reference) sequence are included in the final multiple alignment, because the only inputs to multiz are the pairwise query-to-target alignments. The result carries no information about relationships among the query sequences themselves, so potentially important query-to-query alignments will not be seen in the multiz result.
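To make the reference-based restriction concrete, the following sketch enumerates which pairwise comparisons feed the multiple alignment and which are never examined. The function name and the species list are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def pairwise_inputs(reference, queries):
    """Split species pairs into those used as input and those never seen."""
    # One pairwise alignment per query, always against the reference.
    used = [(reference, q) for q in queries]
    # Query-to-query pairs are not part of the input at all.
    unseen = list(combinations(queries, 2))
    return used, unseen

used, unseen = pairwise_inputs("human", ["chimp", "mouse", "dog"])
for pair in used:
    print("input:    ", pair)
for pair in unseen:
    print("not seen: ", pair)
```

With N query species there are N pairwise inputs but N(N-1)/2 query-to-query pairs that are never aligned directly, so the unexamined share grows quickly as species are added.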

The second important point about the multiz alignments is the choice of chain/net used as input to multiz. The choice is one of three types of chain/net results:

  1. syntenic net - used for higher quality genome assemblies, not too phylogenetically distant
  2. reciprocal best net - used when the query genome is poor quality, high contig count, lower N50 size
  3. net - the fundamental net alignment, used for phylogenetically distant species

The type of chain/net used in the alignment is captured in the UCSC Multiple Alignments page.

  • The syntenic net and reciprocal best net are subsets of the fundamental net alignment.
  • The syntenic net eliminates alignments that would be interruptions to synteny in the query to target alignment.
  • The reciprocal best net selects only those alignments between target and query that are reciprocal. Only the single best alignment between query and target is included in the result. This eliminates many-to-many alignments between query and target.
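The reciprocal best idea can be sketched on a simplified data model. This is an illustration of the selection principle only and not the UCSC procedure, which operates on chain and net files; the function name, the region identifiers, and the scores are hypothetical.

```python
def reciprocal_best(alignments):
    """Keep (target, query, score) triples that are best in both directions.

    `alignments` is an iterable of (target_id, query_id, score) triples.
    """
    best_for_target = {}  # target_id -> (query_id, score)
    best_for_query = {}   # query_id -> (target_id, score)
    for t, q, score in alignments:
        if t not in best_for_target or score > best_for_target[t][1]:
            best_for_target[t] = (q, score)
        if q not in best_for_query or score > best_for_query[q][1]:
            best_for_query[q] = (t, score)
    # A pair survives only if each side is the other's best partner.
    return [
        (t, q, score)
        for t, (q, score) in best_for_target.items()
        if best_for_query[q][0] == t
    ]

alignments = [
    ("t1", "q1", 900),
    ("t1", "q2", 400),
    ("t2", "q1", 700),
    ("t2", "q3", 300),
]
kept = reciprocal_best(alignments)
```

Here only the t1/q1 pair is kept. Region t2 ends up with no alignment at all, because its best partner q1 aligns better to t1, which shows how reciprocal filtering can remove real but non-best relationships such as paralogs.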